RNAErnie

Official implementation of the paper "Multi-purpose RNA Language Modeling with Motif-aware Pre-training and Type-guided Fine-tuning" with PaddlePaddle.

This repository contains code and pre-trained models for RNAErnie, which leverages RNA motifs as biological priors and proposes a motif-level random masking strategy to enhance pre-training tasks. Furthermore, RNAErnie improves sequence classification, RNA-RNA interaction prediction, and RNA secondary structure prediction by fine-tuning or adapting on downstream tasks with two-stage type-guided learning. The paper has been published in Nature Machine Intelligence (see the update log below).
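The idea of motif-level masking can be illustrated with a minimal sketch. It is not the repository's implementation: the motif list, the `[MASK]` token string, the `mask_prob` parameter, and both function names are hypothetical and chosen only for illustration. The sketch masks every nucleotide of a selected motif occurrence together, instead of masking single nucleotides independently.

```python
import random

MASK = "[MASK]"  # hypothetical mask token string


def find_motif_spans(seq, motifs):
    """Return sorted, non-overlapping (start, end) spans where a motif occurs."""
    spans = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            spans.append((start, start + len(motif)))
            start = seq.find(motif, start + 1)
    spans.sort()
    kept = []
    for start, end in spans:
        # Drop any span that overlaps one that was already kept.
        if not kept or start >= kept[-1][1]:
            kept.append((start, end))
    return kept


def motif_level_mask(seq, motifs, mask_prob=0.15, rng=None):
    """Mask whole motif occurrences; each occurrence is masked with mask_prob."""
    rng = rng or random.Random()
    tokens = list(seq)
    for start, end in find_motif_spans(seq, motifs):
        if rng.random() < mask_prob:
            # All positions of the motif are replaced together.
            tokens[start:end] = [MASK] * (end - start)
    return tokens


# "GGU" is a made-up motif used only for this example.
print(motif_level_mask("AGGUCAGGUA", ["GGU"], mask_prob=1.0))
```

With `mask_prob=1.0` every motif occurrence is masked, which makes the all-or-nothing behaviour easy to inspect; a realistic setting would use a much lower probability.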

Overview

[Figure: RNAErnie overview]

Update Log

2024.06.26: Since most researchers prefer transformers and PyTorch as the backend, I ported this work to transformers and trained a PyTorch model from scratch. The new model is trained with stronger settings: the maximum model length is now 2048, and the pre-training dataset is the newest version of RNAcentral, which contains about 31 million RNA sequences after length filtering (<2048). The PyTorch model has been uploaded to Hugging Face at https://huggingface.co/WANGNingroci/RNAErnie, and the training framework and tokenization code are located at https://github.com/CatIIIIIIII/RNAErnie2. (NOTE: the tokenization differs slightly from the original Paddle implementation.) Moreover, MultiMolecule is implementing the most powerful current RNA language models with transformers and PyTorch, and our model can also be accessed at https://huggingface.co/multimolecule/rnaernie.
2024.05.13: 🎉🎉 Our paper has been published at https://www.nature.com/articles/s42256-024-00836-4.
2024.04.20: 🎉🎉 RNAErnie has been accepted by Nature Machine Intelligence! The paper will be released soon.
2024.03.21: Add DOI and citation.
2024.01.26: Add ad-hoc pre-training with additional classification task.
2024.01.23: Integrate AUC metric in base_classes.py for simpler usage; Add content and update log section in README.md.

If you have any questions, feel free to contact us by email: wangning.roci@gmail.com.

Installation

Create Environment with Conda

First, download the repository and create the environment.

git clone https://github.com/CatIIIIIIII/RNAErnie.git
cd ./RNAErnie
conda env create -f environment.yml

Then, activate the "RNAErnie" environment.

conda activate RNAErnie

Alternatively, you can run RNAErnie in Docker, as described in the next section.

Run in Docker

Step 1: Prepare code

First clone the repository:

git clone https://github.com/CatIIIIIIII/RNAErnie.git

Step 2: Prepare running environment

Here we provide two ways to load the docker image.

[Option1] You can directly access the docker image using this link:

https://hub.docker.com/r/nwang227/rnaernie

After signing in to Docker, pull the image with the following command:

sudo docker pull nwang227/rnaernie:1.1

NOTE:

If you encounter the error "unauthorized: authentication required", you have not logged in to your Docker account to access Docker Hub.
- Sign up for a Docker account.
- Log in with sudo docker login -u username --password-stdin
- Then try to pull the image again.

If you encounter the error "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?", the Docker service has not been started.
- Start the docker service with systemctl start docker
- Then try to run the container again.

[Option2] Alternatively, download the image tar from Google Drive using the following URL:

https://drive.google.com/file/d/1Lkgw7w9xGZQ02PnU3yk0cn1V9om2yfd3

Then load it with:

sudo docker load --input rnaernie-1.1.tar

Step 3: Run

Run the container with the data volume mounted:

sudo docker run --gpus all --name rnaernie_docker -it -v $PWD/RNAErnie:/home/ nwang227/rnaernie:1.1 /bin/bash

TODO: Due to a Python version conflict, the RNA secondary structure prediction task is not available in the Docker image. We will fix this in the future.

Pre-training

1. Data Preparation

You can download my selected (nts<512) pretraining dataset from Google Drive or from RNAcentral and place the .fasta files in the ./data/pre_random folder.
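If you start from raw RNAcentral data instead of the pre-selected set, a length filter along the following lines could reproduce the selection criterion. This is only a sketch under assumptions: `read_fasta` and `filter_by_length` are hypothetical helpers, not part of this repository, and the strict `<` comparison follows the "nts<512" wording above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, max_len=512):
    """Keep records strictly shorter than max_len nucleotides."""
    return [(h, s) for h, s in records if len(s) < max_len]


# Tiny in-memory example; with a real file, pass an open file handle instead.
example = [">seq1", "ACGU", "ACGU", ">seq2", "AC"]
print(filter_by_length(read_fasta(example), max_len=5))
```

In the example, the two sequence lines of `seq1` are joined into one 8-nucleotide record, so only `seq2` passes the `max_len=5` filter.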

Then, you can use the following command to generate the pre-training data:

2. Pre-training

Pretrain RNAErnie on selected RNAcentral datasets (nts<=512) with the following command:

python run_pretrain.py \
    --output_dir=./output \
    --per_device_train_batch_size=50 \
    --learning_rate=0.0001 \
    --save_steps=1000

To train on multiple GPUs, launch the script through paddle.distributed.launch:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch run_pretrain.py

where CUDA_VISIBLE_DEVICES specifies the GPU ids you want to use.

3. Download Pre-trained Models

Our pre-trained models with the BERT, ERNIE and MOTIF masking strategies can be downloaded from Google Drive. Place the .pdparams and .json files in the ./output/BERT,ERNIE,MOTIF,PROMPT folder.
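Before loading a downloaded checkpoint, it can help to confirm that both file types are in place. The helper below is a hypothetical convenience, not part of this repository; it only checks file extensions and makes no assumption about the actual file names.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir, required_suffixes=(".pdparams", ".json")):
    """Return the required suffixes for which no file exists in checkpoint_dir."""
    folder = Path(checkpoint_dir)
    if folder.is_dir():
        present = {p.suffix for p in folder.iterdir() if p.is_file()}
    else:
        # A missing folder means every required file type is missing.
        present = set()
    return [s for s in required_suffixes if s not in present]


missing = missing_checkpoint_files("./output/BERT,ERNIE,MOTIF,PROMPT")
if missing:
    print("Missing file types:", missing)
```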

4. Visualization

You can visualize the pre-training process with the following command:

visualdl --logdir ./output/BERT,ERNIE,MOTIF,PROMPT/runs/you_date/

5. Extract RNA Sequence Embeddings

Then you can extract embeddings for given RNA sequences, or for sequences in a .fasta file, with the following code:

import paddle
from rna_ernie import BatchConverter
from paddlenlp.transformers import ErnieModel

# ========== Set device
paddle.set_device("gpu")

# ========== Prepare Data
data = [
    ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
    ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
# data = "./data/ft/seq_cls/nRC/test.fa"

# ========== Batch Converter
batch_converter = BatchConverter(k_mer=1,
                                  vocab_path="./data/vocab/vocab_1MER.txt",
                                  batch_size=256,
                                  max_seq_len=512)

# ========== RNAErnie Model
rna_ernie = ErnieModel.from_pretrained("output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final/")
rna_ernie.eval()

# call batch_converter to convert sequences to batch inputs
for names, _, inputs_ids in batch_converter(data):
    with paddle.no_grad():
        # extract whole sequence embeddings
        embeddings = rna_ernie(inputs_ids)[0].detach()
        # extract [CLS] token embedding
        embeddings_cls = embeddings[:, 0, :]
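To make the `k_mer` and `max_seq_len` arguments above more concrete, here is a rough sketch of k-mer tokenization. It does not reproduce `BatchConverter`: the function name is hypothetical, the overlapping stride-1 windows for `k > 1` are an assumption, and special tokens such as [CLS] are not added. With `k=1`, as in the example above, each nucleotide simply becomes one token.

```python
def kmer_tokenize(seq, k=1, max_seq_len=512):
    """Split a sequence into overlapping k-mers (stride 1), truncated to max_seq_len tokens."""
    seq = seq.upper()
    # A sequence shorter than k yields no tokens.
    tokens = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return tokens[:max_seq_len]


print(kmer_tokenize("ggguGCGA", k=1))
print(kmer_tokenize("GGGUGCGA", k=3))
```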

Downstream Tasks

RNA sequence classification

1. Data Preparation

You can download training data from Google Drive and place them in the ./data/ft/seq_cls folder. Three datasets (nRC, lncRNA_H, lncRNA_M) are available for this task.

2. Fine-tuning

Fine-tune RNAErnie on the RNA sequence classification task with the following command:

python run_seq_cls.py \
    --dataset=nRC \
    --dataset_dir=./data/ft/seq_cls \
    --model_name_or_path=./output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final \
    --train=True \
    --batch_size=50 \
    --num_train_epochs=100 \
    --learning_rate=0.0001 \
    --output=./output_ft/seq_cls

Moreover, to train on the long ncRNA classification tasks, change the --dataset argument to lncRNA_M or lncRNA_H. You can also add the --use_chunk=True argument to chunk the whole sequence and ensemble the chunk-level results.
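Conceptually, chunking and ensembling a long sequence can be sketched as follows. This is an illustration under assumptions, not the behaviour of `--use_chunk` itself: the non-overlapping split, the mean over per-chunk class probabilities, and both function names are hypothetical.

```python
def chunk_sequence(seq, chunk_len=512):
    """Split a sequence into consecutive, non-overlapping chunks of at most chunk_len."""
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]


def ensemble_mean(chunk_probs):
    """Average per-chunk class probability vectors into one prediction."""
    if not chunk_probs:
        raise ValueError("need at least one chunk prediction")
    n_chunks = len(chunk_probs)
    n_classes = len(chunk_probs[0])
    return [sum(p[c] for p in chunk_probs) / n_chunks for c in range(n_classes)]


chunks = chunk_sequence("ACGUACGUAC", chunk_len=4)
print(chunks)

# Made-up per-chunk probabilities for a two-class problem.
print(ensemble_mean([[0.25, 0.75], [0.75, 0.25]]))
```

Each chunk fits within the model's maximum input length, and the averaged vector serves as the prediction for the whole sequence.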

To use two-stage fine-tuning, you can add the --two_stage=True argument.

3. Evaluation

Alternatively, you can download our RNAErnie weights for the sequence classification tasks from Google Drive and place them in the ./output_ft/seq_cls folder.

Then you coul
