LAMAR

A Foundation Language Model For Multilayer Regulation of RNA

This repository contains the code and links to the pre-trained weights of the RNA foundation language model LAMAR. LAMAR outperforms benchmark models on a variety of RNA regulation tasks, helping to decipher the regulatory rules of RNA.

LAMAR was developed by the Rnasys Lab and the Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health (SINH), Chinese Academy of Sciences (CAS).

Citation

https://www.biorxiv.org/content/10.1101/2024.10.12.617732v2

Create environment

The environment can be created from LAMAR_requirements.txt:

git clone https://github.com/zhw-e8/LAMAR.git
cd ./LAMAR

conda create -n lamar python=3.11
conda activate lamar
pip install -r LAMAR_requirements.txt

Pretraining was conducted on A800 80GB GPUs; fine-tuning was conducted on Sugon Z-100 16GB and Tesla V100 32GB GPU clusters. The environments differ slightly between devices, so device-specific environment files are provided in addition to the unified requirements above.

Pretraining environment:

  • A800: environment_A800_pretrain.yml

Fine-tuning environments:

  • Sugon Z-100: environment_Z100_finetune.yml
  • V100 (ppc64le): environment_V100_finetune.yml

Required packages

accelerate >= 0.26.1
torch >= 1.13
transformers >= 4.32.1
datasets >= 2.12.0
pandas >= 2.0.3
safetensors >= 0.4.1

Usage

Install package

After creating the environment, the LAMAR package can be installed:

pip install .

Download pretrained weights

The pretrained weights can be downloaded from https://huggingface.co/zhw-e8/LAMAR/tree/main.
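One way to fetch them from the command line (a sketch assuming git-lfs is installed; the destination directory is our choice — adjust it so the checkpoint path used in the example below resolves):

```shell
# Clone the weight repository from the Hugging Face Hub (requires git-lfs)
git lfs install
git clone https://huggingface.co/zhw-e8/LAMAR lamar-weights
```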

Compute embeddings

Notice: In our model, the tokenizer, config, and pretrained weights are loaded from local paths. We therefore encourage users to specify absolute paths, or to make sure the correct relative paths are used.

from LAMAR.modeling_nucESM2 import EsmModel
from transformers import AutoConfig, AutoTokenizer
from safetensors.torch import load_file, load_model
import torch


seq = "ATACGATGCTAGCTAGTGACTAGCTGATCGTAGCTG"
model_max_length = 1026
device = torch.device("cuda:0")
# instantiate the tokenizer and config
tokenizer = AutoTokenizer.from_pretrained("tokenizer/single_nucleotide/", model_max_length=model_max_length)
config = AutoConfig.from_pretrained(
    "config/config_150M.json", vocab_size=len(tokenizer), pad_token_id=tokenizer.pad_token_id,
    mask_token_id=tokenizer.mask_token_id, token_dropout=False, positional_embedding_type='rotary', 
    hidden_size=768, intermediate_size=3072, num_attention_heads=12, num_hidden_layers=12
)
# instantiate the model and load the pretrained weights
model = EsmModel(config)
weights = load_file('pretrain/saving_model/mammalian80D_4096len1mer1sw_80M/checkpoint-250000/model.safetensors')
weights_dict = {}
for k, v in weights.items():
    new_k = k.replace('esm.', '') if 'esm' in k else k
    weights_dict[new_k] = v
model.load_state_dict(weights_dict, strict=False)
model = model.to(device)
# Compute embeddings
model.eval()
with torch.no_grad():
    inputs = tokenizer(seq, return_tensors="pt")
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    outputs = model(
        input_ids=input_ids, 
        attention_mask=attention_mask
    )
    embedding = outputs.last_hidden_state[0, 1 : -1, :]
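The checkpoint-key renaming loop above can be factored into a small standalone helper (a sketch; `strip_esm_prefix` is our name, not part of the LAMAR package):

```python
def strip_esm_prefix(state_dict):
    """Remove a leading 'esm.' from checkpoint keys so they match
    EsmModel's parameter names (the pretraining checkpoint stores the
    encoder under an 'esm.' namespace)."""
    return {
        (k[len('esm.'):] if k.startswith('esm.') else k): v
        for k, v in state_dict.items()
    }
```

Testing the prefix with `startswith`, rather than a substring check, avoids accidentally rewriting keys that merely contain "esm" elsewhere in their name.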

In our paper, we compared the embeddings of nucleotides, functional elements, and transcripts from pretrained and untrained LAMAR. The paths of the scripts are as follows:

  • Compute embeddings of nucleotides: src/embedding/NucleotideEmbeddingMultipleTimes.ipynb
  • Compute embeddings of functional elements: src/embedding/FunctionalElementEmbedding.ipynb
  • Compute embeddings of transcripts: src/embedding/RNAEmbedding.ipynb
  • Compute embeddings of splice sites: src/embedding/SpliceSiteEmbedding.ipynb

Predict splice sites from pre-mRNA sequences

The paths of scripts:

  • Tokenization: src/SpliceSitePred/tokenize_data.ipynb
  • Fine-tune: src/SpliceSitePred/finetune.ipynb
  • Evaluate: src/SpliceSitePred/evaluation.ipynb
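For intuition about the tokenization step, single-nucleotide (1-mer) tokenization of a pre-mRNA sequence can be sketched as below. The vocabulary and special-token ids here are placeholders for illustration; the real ones live in tokenizer/single_nucleotide/:

```python
# Placeholder vocabulary -- not LAMAR's actual token ids.
VOCAB = {'<cls>': 0, '<pad>': 1, '<eos>': 2, '<mask>': 3,
         'A': 4, 'C': 5, 'G': 6, 'T': 7, 'N': 8}

def tokenize_1mer(seq):
    """Map each nucleotide to one token and wrap the sequence in
    <cls>/<eos>, as a single-nucleotide tokenizer would."""
    return ([VOCAB['<cls>']]
            + [VOCAB.get(base, VOCAB['N']) for base in seq.upper()]
            + [VOCAB['<eos>']])

tokenize_1mer("ACGT")  # [0, 4, 5, 6, 7, 2]
```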

Predict the translation efficiencies of mRNAs based on 5' UTRs (HEK293 cell line)

The paths of scripts:

  • Tokenization: src/UTR5TEPred/tokenize_data.ipynb
  • Fine-tune: src/UTR5TEPred/finetune.ipynb
  • Evaluate: src/UTR5TEPred/evaluate.ipynb
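Translation-efficiency prediction is a regression task, so evaluation typically reports a correlation between predicted and measured efficiencies. A self-contained Pearson correlation for illustration (the evaluation notebook may use scipy or a different metric):

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

pearson_r([0.1, 0.4, 0.35, 0.8], [0.0, 0.5, 0.3, 0.9])  # close to 1.0
```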

Annotate the IRES

The paths of scripts:

  • Tokenization: src/IRESPred/tokenize_data.ipynb
  • Fine-tune: src/IRESPred/finetune.ipynb
  • Evaluate: src/IRESPred/evaluate.ipynb

Predict the half-lives of mRNAs based on 3' UTRs (BEAS-2B cell line)

The paths of scripts:

  • Tokenization: src/UTR3DegPred/tokenize_data.ipynb
  • Fine-tune: src/UTR3DegPred/finetune.ipynb
  • Evaluate: src/UTR3DegPred/evaluate.ipynb

Baseline methods

The performance of LAMAR was compared against baseline methods. The scripts are available at https://github.com/zhw-e8/LAMAR_baselines.
