# RESM

This is the official codebase for **RESM: Capturing sequence and structure encoding of RNAs by mapped transfer learning from ESM (evolutionary scale modeling) protein language model**.
## 📣 News

- [2025/07] 🎉 We release the dataset and checkpoints on Zenodo!
- [2025/07] 📊 Initial release of the RESM-150M and RESM-650M model code with comprehensive documentation.
## ⚡ Overview
RESM (RNA Evolution-Scale Modeling) is a state-of-the-art RNA language model that leverages protein language model knowledge to overcome RNA's inherent challenges. By mapping RNA sequences to pseudo-protein representations and adapting the ESM2 protein language model, RESM provides a robust foundation for deciphering RNA sequence-structure-function relationships.
<div align="center"> <img src="figures/RESM.png" width="800px"> </div>

**Key Features:**
- Pseudo-protein Mapping: Novel approach to convert RNA's 4-letter alphabet into protein-like representations
- Knowledge Transfer: Leverages the powerful representations learned by ESM protein language models
- Dual-task Excellence: First RNA model to achieve state-of-the-art performance on both structural and functional prediction tasks
- Zero-shot Capability: Outperforms 12 RNA language models in zero-shot evaluation without task-specific training
- Benchmark Performance: Demonstrates superior results across 8 downstream tasks, surpassing 60+ models
- Long RNA Breakthrough: 81% accuracy gain and 1000× speedup on sequences up to 4,000 nucleotides
- Flexible Architecture: Available in 150M and 650M parameter versions
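The pseudo-protein mapping above boils down to a per-nucleotide translation from RNA's 4-letter alphabet into amino-acid letters before the sequence is fed to the ESM2 tokenizer. The sketch below is purely illustrative: the mapping table shown is **hypothetical**, not RESM's actual one, which is defined in the paper and codebase.

```python
# Illustrative sketch of the RNA -> pseudo-protein conversion step.
# NOTE: this mapping table is HYPOTHETICAL; RESM's actual table is
# defined in the paper/codebase.
RNA_TO_PSEUDO_AA = {"A": "K", "U": "L", "G": "E", "C": "S"}

def rna_to_pseudo_protein(seq: str) -> str:
    """Translate an RNA sequence into a pseudo-protein string,
    one amino-acid letter per nucleotide (DNA-style T treated as U)."""
    seq = seq.upper().replace("T", "U")
    return "".join(RNA_TO_PSEUDO_AA[nt] for nt in seq)
```

The key property is that the translation is one-to-one per position, so the pseudo-protein has the same length as the RNA and per-residue model outputs align with nucleotide positions.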
## 📥 Download URL

| Resource | Description | Size | Link |
|-----------|-------------|------|------|
| Datasets | Pre-training and downstream datasets | ~6.4GB | Download |
| RESM-150M | Model checkpoint | ~1.8GB | Download |
| RESM-650M | Model checkpoint | ~2.6GB | Download |
## 🚀 Quick Start

### Prerequisites
- Python 3.8+
- PyTorch 1.10+
- CUDA 11.0+ (for GPU support)
### Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/RESM.git
cd RESM
```

2. Create and activate the conda environment:

```bash
# Create conda environment from yml file
conda env create -f environment.yml

# Activate the environment
conda activate resm
```
## 📊 Usage

### Feature Extraction

Extract RNA embeddings and attention maps from your RNA sequences:

```bash
# For RESM-150M model (default paths)
python resm_inference.py \
    --base_model RESM_150M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# For RESM-650M model (default paths)
python resm_inference.py \
    --base_model RESM_650M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# Use a custom checkpoint path
python resm_inference.py \
    --base_model RESM_150M \
    --model_path /path/to/custom/checkpoint.ckpt \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda
```
### Input Data Format

The model expects RNA sequences in FASTA format or as a text file of RNA IDs. Place your data in the following structure:

```
data/
├── dsdata/
│   ├── msa/                          # MSA files (optional, can use single sequences)
│   └── extract_ss_data_alphaid.txt   # List of RNA IDs
```
### Output Format

The model outputs two types of features for each RNA sequence:

- **Embeddings** (`*_emb.npy`):
  - RESM-150M: shape `(L, 640)`, where L is the sequence length
  - RESM-650M: shape `(L, 1280)`, where L is the sequence length
- **Attention Maps** (`*_atp.npy`):
  - RESM-150M: shape `(600, L, L)` (30 layers × 20 heads)
  - RESM-650M: shape `(660, L, L)` (33 layers × 20 heads)
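The `.npy` files can be loaded directly with NumPy. A minimal sketch, assuming the file-naming convention above with the RNA ID as the filename stem (the `rna_id` argument is a placeholder):

```python
import numpy as np

def load_resm_features(output_dir: str, rna_id: str, model: str = "RESM_150M"):
    """Load the embedding and attention-map arrays written by
    resm_inference.py and sanity-check their documented shapes."""
    emb = np.load(f"{output_dir}/{rna_id}_emb.npy")   # (L, D)
    atp = np.load(f"{output_dir}/{rna_id}_atp.npy")   # (layers*heads, L, L)

    dim = 640 if model == "RESM_150M" else 1280       # embedding width
    maps = 600 if model == "RESM_150M" else 660       # 30x20 or 33x20 heads
    L = emb.shape[0]
    assert emb.shape == (L, dim), f"unexpected embedding shape {emb.shape}"
    assert atp.shape == (maps, L, L), f"unexpected attention shape {atp.shape}"
    return emb, atp
```

The shape checks catch the most common mistake of loading features extracted with one model size while expecting the other.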
## 🏗️ Model Architecture

RESM builds upon the ESM2 architecture with RNA-specific adaptations:

### RESM-150M (based on ESM2-150M)

- Base Model: `esm2_t30_150M_UR50D`
- Layers: 30 transformer layers
- Embedding Dimension: 640
- Attention Heads: 20
- Parameters: ~150M

### RESM-650M (based on ESM2-650M)

- Base Model: `esm2_t33_650M_UR50D`
- Layers: 33 transformer layers
- Embedding Dimension: 1280
- Attention Heads: 20
- Parameters: ~650M
## 🔍 Example Use Cases
- RNA Secondary Structure Prediction: Use extracted attention maps for predicting RNA base pairs with state-of-the-art accuracy
- RNA Function Classification: Leverage embeddings for functional annotation of novel RNA sequences
- Gene Expression Prediction: Apply RESM features for mRNA expression level prediction
- Ribosome Loading Efficiency: Predict translation efficiency from mRNA sequences
- RNA Similarity Search: Compare RNA sequences using embedding similarity
- Transfer Learning: Fine-tune on your specific RNA task for enhanced performance
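As an example, the similarity-search use case can be sketched by mean-pooling each per-residue `(L, D)` embedding into a fixed-size vector and ranking candidates by cosine similarity. This is a minimal sketch on top of the extracted `*_emb.npy` arrays, not RESM's own retrieval code:

```python
import numpy as np

def pool(emb: np.ndarray) -> np.ndarray:
    """Mean-pool a per-residue (L, D) embedding to a single (D,) vector."""
    return emb.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_emb: np.ndarray, db_embs: list) -> list:
    """Return database indices sorted from most to least similar to the query."""
    q = pool(query_emb)
    scores = [cosine(q, pool(e)) for e in db_embs]
    return sorted(range(len(db_embs)), key=lambda i: -scores[i])
```

Mean pooling is the simplest length-invariant aggregation; for tasks sensitive to local motifs, a learned pooling or fine-tuned head (see Transfer Learning above) will generally work better.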
## 📝 Citation

If you use RESM in your research, please cite our paper:

```bibtex
@article{Zhang2025.08.09.669469,
  author = {Zhang, Yikun and Zhang, Hao and Li, Guo-Wei and Wang, He and Zhang, Xing and Hong, Xu and Zhang, Tingting and Wen, Liangsheng and Zhao, Yu and Jiang, Jiuhong and Chen, Jie and Chen, Yanjun and Liu, Liwei and Zhan, Jian and Zhou, Yaoqi},
  title = {RESM: Capturing sequence and structure encoding of RNAs by mapped transfer learning from ESM (evolutionary scale modeling) protein language model},
  elocation-id = {2025.08.09.669469},
  year = {2025},
  doi = {10.1101/2025.08.09.669469},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469},
  eprint = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469.full.pdf},
  journal = {bioRxiv}
}
```
## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 👍 Acknowledgments

- ESM models: the codebase we built upon.

## 🤝 Contributing

We welcome contributions! Please feel free to submit issues or pull requests.

## 📧 Contact

For questions or collaborations, please contact: yikun.zhang@stu.pku.edu.cn
