RNAPro: An accurate RNA structure prediction model by Kaggle synthesis
Model Description
RNAPro is a state-of-the-art RNA 3D folding model developed in collaboration with the hosts and winners of the Stanford RNA 3D Folding Kaggle competition. RNAPro combines RNA-specific modules, including template modeling, multiple sequence alignments (MSAs), and a pretrained RNA foundation model (RibonanzaNet2), with Protenix to enhance RNA structure prediction performance. Read more about the Kaggle competition and the model in the preprint.
Installation
Environment setup (conda example)
- Clone this repository and cd into it
git clone https://github.com/NVIDIA-Digital-Bio/RNAPro
cd ./RNAPro
- Create a conda environment
conda create -n rnapro python=3.12 -y
conda activate rnapro
pip install -r requirements.txt
- Install RNAPro
pip install -e .
Docker (Recommended for Training and Inference)
The code was developed using the nvcr.io/nvidia/pytorch:25.09-py3 docker image.
Run step 1, and then, inside the container, run step 3.
For more detailed installation steps, see the Docker Installation guide.
Train Models
1. Data preparation
Data is expected to be in the release_data directory.
mkdir release_data
cd release_data
Download training & MSA data:
This will take some time and requires ~100GB of disk space.
cd release_data
mkdir kaggle; cd kaggle
curl -L -o stanford-rna-3d-folding-all-atom-train-data.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/stanford-rna-3d-folding-all-atom-train-data
unzip stanford-rna-3d-folding-all-atom-train-data.zip
rm stanford-rna-3d-folding-all-atom-train-data.zip
# Get MSAs from https://www.kaggle.com/datasets/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
curl -L -o stanford-rna-3d-folding.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
mkdir MSA_v2
unzip stanford-rna-3d-folding.zip "MSA_v2/*" -d .
rm stanford-rna-3d-folding.zip
cd ..
Download CCD cache:
To train the model, you will need the CCD cache, which is generated by processing the file from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz. Run the command below, which creates the following files:
python3 preprocess/gen_ccd_cache.py
release_data/
└── ccd_cache/
├── components.cif
├── components.cif.rdkit_mol.pkl
└── clusters-by-entity-40.txt
Download Protenix pretrained checkpoints:
cd release_data
mkdir protenix_models; cd protenix_models
wget https://af3-dev.tos-cn-beijing.volces.com/release_model/protenix_base_default_v0.5.0.pt
cd ..
Download RibonanzaNet2 pretrained checkpoint:
cd release_data
mkdir ribonanzanet2_checkpoint; cd ribonanzanet2_checkpoint
curl -L -o ribonanzanet2.tar.gz https://www.kaggle.com/api/v1/models/shujun717/ribonanzanet2/pyTorch/alpha/1/download
tar -xzvf ribonanzanet2.tar.gz
rm ribonanzanet2.tar.gz
cd ..
Templates can be obtained in two ways:
- MMseqs2-based template identification: MMseqs2 3D RNA Template Identification
- Kaggle 1st place template-based approaches: RNA 3D Folds — TBM-only approach
Each of these notebooks generates a submission.csv file containing the templates. The CSV file can be downloaded from the Output section of the notebook.
You do not need to rerun the notebooks, unless you want to generate templates for new sequences.
To convert the csv into the training-ready or inference-ready binary format, use:
python preprocess/convert_templates_to_pt_files.py --input_csv <path/to/submission.csv> --output_name template_features.pt
The resulting .pt file(s) can be referenced via template_data in configs and used with use_template='ca_precomputed'.
Overview
After running the above steps, the repository structure should look like this:
release_data/
├── ccd_cache/
│ ├── clusters-by-entity-40.txt
│ ├── components.v20240608.cif
│ └── components.v20240608.cif.rdkit_mol.pkl
├── kaggle/
│ ├── MSA_v2/
│ ├── <training data files>
│ └── template_features.pt
├── protenix_models/
│ └── protenix_base_default_v0.5.0.pt
└── ribonanzanet2_checkpoint/
├── dropout.py
├── Network.py
├── pairwise.yaml
└── pytorch_model_fsdp.bin
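After finishing the steps above, a quick sanity check can confirm the layout before launching a long training run. The following helper is not part of the repository; it is a minimal sketch whose expected paths are copied from the tree above (the individual training data files under kaggle/ vary, so they are not checked):

```python
from pathlib import Path

# Paths expected under release_data/ after data preparation (from the tree above).
EXPECTED = [
    "ccd_cache/clusters-by-entity-40.txt",
    "ccd_cache/components.v20240608.cif",
    "ccd_cache/components.v20240608.cif.rdkit_mol.pkl",
    "kaggle/MSA_v2",
    "kaggle/template_features.pt",
    "protenix_models/protenix_base_default_v0.5.0.pt",
    "ribonanzanet2_checkpoint/pytorch_model_fsdp.bin",
]

def missing_release_files(release_dir="release_data"):
    """Return the expected paths that are missing under release_dir."""
    root = Path(release_dir)
    return [p for p in EXPECTED if not (root / p).exists()]
```

If `missing_release_files()` returns a non-empty list, revisit the corresponding download step before training.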
2. Training
We provide the trained model checkpoint via NGC and HuggingFace (Public best and Private best).
We provide a convenience script for training. Please modify it according to your purpose:
sh rnapro_train_example.sh
Inference
Expected Input & Output Format
For details on the input and output formats, refer to the overview.
Prepare inputs
- Input CSV files
  - Prepare a CSV file with the columns target_id and sequence.
- RNA MSA
  - MSAs are user-provided.
- Templates (same as training)
  - Obtain templates via either:
    - MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
    - Kaggle 1st place TBM-only approach (generally stronger): https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach
  - Each approach produces a submission.csv. Convert it to .pt:
    python preprocess/convert_templates_to_pt_files.py --input_csv path/to/submission.csv --output_name path/to/template_features.pt --max_n 40
  - Use with --use_template ca_precomputed --template_data path/to/template_features.pt.
- CCD cache (same as training)
  - python preprocess/gen_ccd_cache.py
- Model weights are available via NGC and HuggingFace (Public best and Private best).
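For batched inference, the input CSV can be generated programmatically. A minimal sketch using only the standard library; the target IDs and sequences below are made up, and only the column names (target_id, sequence) come from the steps above:

```python
import csv

# Hypothetical targets; replace with your own IDs and RNA sequences.
targets = [
    ("R0001", "GGGAUCGAUCGAUCGAUCCC"),
    ("R0002", "AUGCUAGCUAGCUAGCAU"),
]

with open("sequences.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["target_id", "sequence"])  # column names expected by RNAPro
    writer.writerows(targets)
```

Pass the resulting file to the inference script via --sequences_csv.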
Inference via Bash Script
You can run inference via script:
bash rnapro_inference_example.sh
The script configures and forwards the following parameters to the CLI:
- --model_name: Model config to use (e.g., rnapro_base).
- --dump_dir: Directory where inference results are saved.
- --load_checkpoint_path: Path to the model checkpoint .pt.
- --seeds: Comma-separated seeds (default in the example: 42).
- --dtype: Precision (bf16 or fp32).
- --use_msa: Enable MSAs (recommended for RNA).
- --rna_msa_dir: Directory containing precomputed MSAs.
- --use_template: Template mode (use ca_precomputed for prepared templates).
- --template_data: Path to the .pt template file converted from submission.csv.
- --template_idx: Top-k template selection index: 0 -> top1, 1 -> top2, 2 -> top3, 3 -> top4, 4 -> top5.
- --num_templates: Number of templates to use (e.g., 10).
- --model.N_cycle: Diffusion cycles (e.g., 10).
- --sample_diffusion.N_sample: Number of samples per seed (e.g., 1).
- --sample_diffusion.N_step: Diffusion steps (e.g., 200).
- --load_strict: Strict weight loading.
- --num_workers: Data loader workers.
- --triangle_attention / --triangle_multiplicative: Kernel backends (torch, cuequivariance, etc.).
- --sequences_csv: Optional CSV with headers sequence, target_id for batched inference.
- --max_len: Maximum sequence length; longer sequences are skipped during inference (default: 10000).
- --logger: Logger used by the inference runner (default: logging). Supports logging and print.
- --n_templates_inf: Number of inference runs with different template combinations (default: 5).
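When scripting sweeps (e.g., over seeds or templates), the flags above can be assembled programmatically rather than edited by hand. This is a sketch, not the repository's method: the entry point `rnapro.inference` and the boolean-flag style are assumptions; rnapro_inference_example.sh remains the authoritative reference for how the CLI is actually invoked.

```python
def build_inference_cmd(checkpoint, template_pt, seeds=(42,), dump_dir="./output"):
    """Assemble an RNAPro inference command as an argument list.

    Flag names come from the list above; the entry point is illustrative.
    """
    return [
        "python", "-m", "rnapro.inference",  # assumed entry point
        "--model_name", "rnapro_base",
        "--dump_dir", dump_dir,
        "--load_checkpoint_path", checkpoint,
        "--seeds", ",".join(str(s) for s in seeds),
        "--dtype", "bf16",
        "--use_template", "ca_precomputed",
        "--template_data", template_pt,
    ]

cmd = build_inference_cmd("ckpt.pt", "template_features.pt", seeds=(42, 43))
print(" ".join(cmd))
```

The list form is suitable for subprocess.run(cmd), which avoids shell-quoting issues.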
Acceleration
Training and inference can be accelerated using various optimized kernels (e.g., cuEquivariance, Triton, and specialized LayerNorm/attention backends). Refer to the Kernels Setup Guide for installation steps, supported options, and recommended configurations.
Configuration Notes
RibonanzaNet2
RNAPro uses a frozen, pretrained RNA foundation model (RibonanzaNet2) as an encoder to extract RNA sequence and pairwise features. These are projected and injected into an RNA post-trained Protenix model with learned gating and RNA templates. The RibonanzaNet2 module can be enabled or disabled:
--model.use_RibonanzaNet2 true
--model.ribonanza_net_path ./release_data/ribonanzanet2_checkpoint
Template Modes
| Mode | Use Case |
|------|----------|
| ca_precomputed | Inference with precomputed C1' templates |
| masked_templates | Training by masking the ground truth when you do not have a template dataset |
--use_template ca_precomputed
--model.use_template ca_precomputed
Template Embedder Options
# Number of pairformer blocks in template embedder
--model.template_embedder.n_blocks 2
--num_templates 4
Acknowledgements
We thank the Stanford Das Lab, HHMI, and the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research.
License
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.<br> The source code is made available under Apache-2.0.<br> The model weights are made available under the NVIDIA Open Model License.