# RESM

This is the official codebase for **RESM: Capturing sequence and structure encoding of RNAs by mapped transfer learning from ESM (evolutionary scale modeling) protein language model**.
## 📣 News

- [2025/07] 🎉 We release the dataset and checkpoints on Zenodo!
- [2025/07] 📊 Initial release of the RESM-150M and RESM-650M model code with comprehensive documentation.
## ⚡ Overview
RESM (RNA Evolution-Scale Modeling) is a state-of-the-art RNA language model that leverages protein language model knowledge to overcome RNA's inherent challenges. By mapping RNA sequences to pseudo-protein representations and adapting the ESM2 protein language model, RESM provides a robust foundation for deciphering RNA sequence-structure-function relationships.
<div align="center"> <img src="figures/RESM.png" width="800px"> </div>

**Key Features:**
- Pseudo-protein Mapping: Novel approach to convert RNA's 4-letter alphabet into protein-like representations
- Knowledge Transfer: Leverages the powerful representations learned by ESM protein language models
- Dual-task Excellence: First RNA model to achieve state-of-the-art performance on both structural and functional prediction tasks
- Zero-shot Capability: Outperforms 12 RNA language models in zero-shot evaluation without task-specific training
- Benchmark Performance: Demonstrates superior results across 8 downstream tasks, surpassing 60+ models
- Long RNA Breakthrough: 81% accuracy gain and 1000× speedup on sequences up to 4,000 nucleotides
- Flexible Architecture: Available in 150M and 650M parameter versions
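The pseudo-protein mapping above boils down to a per-nucleotide translation from RNA's 4-letter alphabet into amino-acid letters before the sequence is fed to the ESM2 tokenizer. The sketch below is purely illustrative: the mapping table shown is **hypothetical**, not RESM's actual one, which is defined in the paper and codebase.

```python
# Illustrative sketch of the RNA -> pseudo-protein conversion step.
# NOTE: this mapping table is HYPOTHETICAL; RESM's actual table is
# defined in the paper/codebase.
RNA_TO_PSEUDO_AA = {"A": "K", "U": "L", "G": "E", "C": "S"}

def rna_to_pseudo_protein(seq: str) -> str:
    """Translate an RNA sequence into a pseudo-protein string,
    one amino-acid letter per nucleotide (DNA-style T treated as U)."""
    seq = seq.upper().replace("T", "U")
    return "".join(RNA_TO_PSEUDO_AA[nt] for nt in seq)
```

The key property is that the translation is one-to-one per position, so the pseudo-protein has the same length as the RNA and per-residue model outputs align with nucleotide positions.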
## 📥 Download URL

| Resource | Description | Size | Link |
|-----------|-------------|------|------|
| Datasets | Pre-training and downstream datasets | ~6.4GB | Download |
| RESM-150M | Model checkpoint | ~1.8GB | Download |
| RESM-650M | Model checkpoint | ~2.6GB | Download |
## 🚀 Quick Start

### Prerequisites
- Python 3.8+
- PyTorch 1.10+
- CUDA 11.0+ (for GPU support)
### Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/RESM.git
cd RESM
```

2. Create and activate the conda environment:

```bash
# Create conda environment from yml file
conda env create -f environment.yml

# Activate the environment
conda activate resm
```
## 📊 Usage

### Feature Extraction

Extract RNA embeddings and attention maps from your RNA sequences:

```bash
# For RESM-150M model (default paths)
python resm_inference.py \
    --base_model RESM_150M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# For RESM-650M model (default paths)
python resm_inference.py \
    --base_model RESM_650M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# Use a custom checkpoint path
python resm_inference.py \
    --base_model RESM_150M \
    --model_path /path/to/custom/checkpoint.ckpt \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda
```
### Input Data Format

The model expects RNA sequences in FASTA format or as a text file of RNA IDs. Place your data in the following structure:

```
data/
├── dsdata/
│   ├── msa/                          # MSA files (optional, can use single sequences)
│   └── extract_ss_data_alphaid.txt   # List of RNA IDs
```
### Output Format

The model outputs two types of features for each RNA sequence:

- **Embeddings** (`*_emb.npy`):
  - RESM-150M: shape `(L, 640)`, where L is the sequence length
  - RESM-650M: shape `(L, 1280)`, where L is the sequence length
- **Attention Maps** (`*_atp.npy`):
  - RESM-150M: shape `(600, L, L)` (30 layers × 20 heads)
  - RESM-650M: shape `(660, L, L)` (33 layers × 20 heads)
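The `.npy` files can be loaded directly with NumPy. A minimal sketch, assuming the file-naming convention above with the RNA ID as the filename stem (the `rna_id` argument is a placeholder):

```python
import numpy as np

def load_resm_features(output_dir: str, rna_id: str, model: str = "RESM_150M"):
    """Load the embedding and attention-map arrays written by
    resm_inference.py and sanity-check their documented shapes."""
    emb = np.load(f"{output_dir}/{rna_id}_emb.npy")   # (L, D)
    atp = np.load(f"{output_dir}/{rna_id}_atp.npy")   # (layers*heads, L, L)

    dim = 640 if model == "RESM_150M" else 1280       # embedding width
    maps = 600 if model == "RESM_150M" else 660       # 30x20 or 33x20 heads
    L = emb.shape[0]
    assert emb.shape == (L, dim), f"unexpected embedding shape {emb.shape}"
    assert atp.shape == (maps, L, L), f"unexpected attention shape {atp.shape}"
    return emb, atp
```

The shape checks catch the most common mistake of loading features extracted with one model size while expecting the other.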
## 🏗️ Model Architecture

RESM builds upon the ESM2 architecture with RNA-specific adaptations:

### RESM-150M (based on ESM2-150M)

- Base Model: `esm2_t30_150M_UR50D`
- Layers: 30 transformer layers
- Embedding Dimension: 640
- Attention Heads: 20
- Parameters: ~150M

### RESM-650M (based on ESM2-650M)

- Base Model: `esm2_t33_650M_UR50D`
- Layers: 33 transformer layers
- Embedding Dimension: 1280
- Attention Heads: 20
- Parameters: ~650M
## 🔍 Example Use Cases
- RNA Secondary Structure Prediction: Use extracted attention maps for predicting RNA base pairs with state-of-the-art accuracy
- RNA Function Classification: Leverage embeddings for functional annotation of novel RNA sequences
- Gene Expression Prediction: Apply RESM features for mRNA expression level prediction
- Ribosome Loading Efficiency: Predict translation efficiency from mRNA sequences
- RNA Similarity Search: Compare RNA sequences using embedding similarity
- Transfer Learning: Fine-tune on your specific RNA task for enhanced performance
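As an example, the similarity-search use case can be sketched by mean-pooling each per-residue `(L, D)` embedding into a fixed-size vector and ranking candidates by cosine similarity. This is a minimal sketch on top of the extracted `*_emb.npy` arrays, not RESM's own retrieval code:

```python
import numpy as np

def pool(emb: np.ndarray) -> np.ndarray:
    """Mean-pool a per-residue (L, D) embedding to a single (D,) vector."""
    return emb.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_emb: np.ndarray, db_embs: list) -> list:
    """Return database indices sorted from most to least similar to the query."""
    q = pool(query_emb)
    scores = [cosine(q, pool(e)) for e in db_embs]
    return sorted(range(len(db_embs)), key=lambda i: -scores[i])
```

Mean pooling is the simplest length-invariant aggregation; for tasks sensitive to local motifs, a learned pooling or fine-tuned head (see Transfer Learning above) will generally work better.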
## 📝 Citation

If you use RESM in your research, please cite our paper:

```bibtex
@article{Zhang2025.08.09.669469,
  author = {Zhang, Yikun and Zhang, Hao and Li, Guo-Wei and Wang, He and Zhang, Xing and Hong, Xu and Zhang, Tingting and Wen, Liangsheng and Zhao, Yu and Jiang, Jiuhong and Chen, Jie and Chen, Yanjun and Liu, Liwei and Zhan, Jian and Zhou, Yaoqi},
  title = {RESM: Capturing sequence and structure encoding of RNAs by mapped transfer learning from ESM (evolutionary scale modeling) protein language model},
  elocation-id = {2025.08.09.669469},
  year = {2025},
  doi = {10.1101/2025.08.09.669469},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469},
  eprint = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469.full.pdf},
  journal = {bioRxiv}
}
```
## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 👍 Acknowledgments

- ESM models: the codebase we built upon.

## 🤝 Contributing

We welcome contributions! Please feel free to submit issues or pull requests.

## 📧 Contact

For questions or collaborations, please contact: yikun.zhang@stu.pku.edu.cn
