RNAPro: An accurate RNA structure prediction model by Kaggle synthesis
Model Description
RNAPro is a state-of-the-art RNA 3D folding model developed in collaboration with the hosts and winners of the Stanford RNA 3D Folding Kaggle competition. RNAPro combines RNA-specific modules, including template modeling, multiple sequence alignments (MSAs), and a pretrained RNA foundation model (RibonanzaNet2), with Protenix to enhance RNA structure prediction performance. Read more about the Kaggle competition and the model in the preprint.
Installation
Environment setup (conda example)
- Clone this repository and cd into it
git clone https://github.com/NVIDIA-Digital-Bio/RNAPro
cd ./RNAPro
- Create a conda environment
conda create -n rnapro python=3.12 -y
conda activate rnapro
pip install -r requirements.txt
- Install RNAPro
pip install -e .
Docker (Recommended for Training and Inference)
The code was developed using the nvcr.io/nvidia/pytorch:25.09-py3 docker image.
Run step 1, and then, inside the container, run step 3.
For more detailed installation steps, see the Docker Installation guide.
Train Models
1. Data preparation
Data is expected to be in the release_data directory.
mkdir release_data
cd release_data
Download training & MSA data:
This will take some time and requires ~100GB of disk space.
cd release_data
mkdir kaggle; cd kaggle
curl -L -o stanford-rna-3d-folding-all-atom-train-data.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/stanford-rna-3d-folding-all-atom-train-data
unzip stanford-rna-3d-folding-all-atom-train-data.zip
rm stanford-rna-3d-folding-all-atom-train-data.zip
# Get MSAs from https://www.kaggle.com/datasets/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
curl -L -o stanford-rna-3d-folding.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
mkdir MSA_v2
unzip stanford-rna-3d-folding.zip "MSA_v2/*" -d .
rm stanford-rna-3d-folding.zip
cd ..
Download CCD cache:
To train the model, you will need the CCD cache, which is generated by processing the file from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz. Run the command below, which creates the following files:
python3 preprocess/gen_ccd_cache.py
release_data/
└── ccd_cache/
├── components.cif
├── components.cif.rdkit_mol.pkl
└── clusters-by-entity-40.txt
Download Protenix pretrained checkpoints:
cd release_data
mkdir protenix_models; cd protenix_models
wget https://af3-dev.tos-cn-beijing.volces.com/release_model/protenix_base_default_v0.5.0.pt
cd ..
Download RibonanzaNet2 pretrained checkpoint:
cd release_data
mkdir ribonanzanet2_checkpoint; cd ribonanzanet2_checkpoint
curl -L -o ribonanzanet2.tar.gz https://www.kaggle.com/api/v1/models/shujun717/ribonanzanet2/pyTorch/alpha/1/download
tar -xzvf ribonanzanet2.tar.gz
rm ribonanzanet2.tar.gz
cd ..
Templates can be obtained in two ways:
- MMseqs2-based template identification: MMseqs2 3D RNA Template Identification
- Kaggle 1st place template-based approaches: RNA 3D Folds — TBM-only approach
Each of these notebooks generates a submission.csv file containing the templates. The CSV file can be downloaded from the Output section of the notebook.
You do not need to rerun the notebooks, unless you want to generate templates for new sequences.
To convert the csv into the training-ready or inference-ready binary format, use:
python preprocess/convert_templates_to_pt_files.py --input_csv <path/to/submission.csv> --output_name template_features.pt
The resulting .pt file(s) can be referenced via template_data in configs and used with use_template='ca_precomputed'.
Overview
After running the above steps, the repository structure should look like this:
release_data/
├── ccd_cache/
│ ├── clusters-by-entity-40.txt
│ ├── components.v20240608.cif
│ └── components.v20240608.cif.rdkit_mol.pkl
├── kaggle/
│ ├── MSA_v2/
│ ├── <training data files>
│ └── template_features.pt
├── protenix_models/
│ └── protenix_base_default_v0.5.0.pt
└── ribonanzanet2_checkpoint/
├── dropout.py
├── Network.py
├── pairwise.yaml
└── pytorch_model_fsdp.bin
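After finishing the steps above, a quick sanity check can confirm the layout before launching a long training run. The following helper is not part of the repository; it is a minimal sketch whose expected paths are copied from the tree above (the individual training data files under kaggle/ vary, so they are not checked):

```python
from pathlib import Path

# Paths expected under release_data/ after data preparation (from the tree above).
EXPECTED = [
    "ccd_cache/clusters-by-entity-40.txt",
    "ccd_cache/components.v20240608.cif",
    "ccd_cache/components.v20240608.cif.rdkit_mol.pkl",
    "kaggle/MSA_v2",
    "kaggle/template_features.pt",
    "protenix_models/protenix_base_default_v0.5.0.pt",
    "ribonanzanet2_checkpoint/pytorch_model_fsdp.bin",
]

def missing_release_files(release_dir="release_data"):
    """Return the expected paths that are missing under release_dir."""
    root = Path(release_dir)
    return [p for p in EXPECTED if not (root / p).exists()]
```

If `missing_release_files()` returns a non-empty list, revisit the corresponding download step before training.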
2. Training
We provide the trained model checkpoint via NGC and HuggingFace (Public best and Private best).
We provide a convenience script for training. Please modify it according to your purpose:
sh rnapro_train_example.sh
Inference
Expected Input & Output Format
For details on the input and output formats, refer to the overview.
Prepare inputs
- Input CSV files
  - Prepare a CSV file with the columns target_id and sequence.
- RNA MSA
  - MSAs are user-provided.
- Templates (same as training)
  - Obtain templates via either:
    - MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
    - Kaggle 1st place TBM-only approach (generally stronger): https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach
  - Each approach produces a submission.csv. Convert it to .pt:
    python preprocess/convert_templates_to_pt_files.py --input_csv path/to/submission.csv --output_name path/to/template_features.pt --max_n 40
  - Use with --use_template ca_precomputed --template_data path/to/template_features.pt.
- CCD cache (same as training)
  - python preprocess/gen_ccd_cache.py
- Model weights are available via NGC and HuggingFace (Public best and Private best).
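For batched inference, the input CSV can be generated programmatically. A minimal sketch using only the standard library; the target IDs and sequences below are made up, and only the column names (target_id, sequence) come from the steps above:

```python
import csv

# Hypothetical targets; replace with your own IDs and RNA sequences.
targets = [
    ("R0001", "GGGAUCGAUCGAUCGAUCCC"),
    ("R0002", "AUGCUAGCUAGCUAGCAU"),
]

with open("sequences.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["target_id", "sequence"])  # column names expected by RNAPro
    writer.writerows(targets)
```

Pass the resulting file to the inference script via --sequences_csv.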
Inference via Bash Script
You can run inference via script:
bash rnapro_inference_example.sh
The script configures and forwards the following parameters to the CLI:
- --model_name: Model config to use (e.g., rnapro_base).
- --dump_dir: Directory where inference results are saved.
- --load_checkpoint_path: Path to the model checkpoint .pt.
- --seeds: Comma-separated seeds (default in the example: 42).
- --dtype: Precision (bf16 or fp32).
- --use_msa: Enable MSAs (recommended for RNA).
- --rna_msa_dir: Directory containing precomputed MSAs.
- --use_template: Template mode (use ca_precomputed for prepared templates).
- --template_data: Path to the .pt template file converted from submission.csv.
- --template_idx: Top-k template selection index: 0 -> top1, 1 -> top2, 2 -> top3, 3 -> top4, 4 -> top5.
- --num_templates: Number of templates to use (e.g., 10).
- --model.N_cycle: Diffusion cycles (e.g., 10).
- --sample_diffusion.N_sample: Number of samples per seed (e.g., 1).
- --sample_diffusion.N_step: Diffusion steps (e.g., 200).
- --load_strict: Strict weight loading.
- --num_workers: Data loader workers.
- --triangle_attention / --triangle_multiplicative: Kernel backends (torch, cuequivariance, etc.).
- --sequences_csv: Optional CSV with headers sequence, target_id for batched inference.
- --max_len: Maximum sequence length; longer sequences are skipped during inference (default: 10000).
- --logger: Logger used by the inference runner (default: logging). Supports logging and print.
- --n_templates_inf: Number of inference runs with different template combinations (default: 5).
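When scripting sweeps (e.g., over seeds or templates), the flags above can be assembled programmatically rather than edited by hand. This is a sketch, not the repository's method: the entry point `rnapro.inference` and the boolean-flag style are assumptions; rnapro_inference_example.sh remains the authoritative reference for how the CLI is actually invoked.

```python
def build_inference_cmd(checkpoint, template_pt, seeds=(42,), dump_dir="./output"):
    """Assemble an RNAPro inference command as an argument list.

    Flag names come from the list above; the entry point is illustrative.
    """
    return [
        "python", "-m", "rnapro.inference",  # assumed entry point
        "--model_name", "rnapro_base",
        "--dump_dir", dump_dir,
        "--load_checkpoint_path", checkpoint,
        "--seeds", ",".join(str(s) for s in seeds),
        "--dtype", "bf16",
        "--use_template", "ca_precomputed",
        "--template_data", template_pt,
    ]

cmd = build_inference_cmd("ckpt.pt", "template_features.pt", seeds=(42, 43))
print(" ".join(cmd))
```

The list form is suitable for subprocess.run(cmd), which avoids shell-quoting issues.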
Acceleration
Training and inference can be accelerated using various optimized kernels (e.g., cuEquivariance, Triton, and specialized LayerNorm/attention backends). Refer to the Kernels Setup Guide for installation steps, supported options, and recommended configurations.
Configuration Notes
RibonanzaNet2
RNAPro uses a frozen, pretrained RNA foundation model (RibonanzaNet2) as an encoder to extract RNA sequence and pairwise features. These are projected and injected into an RNA post-trained Protenix model with learned gating and RNA templates. The RibonanzaNet2 module can be enabled or disabled:
--model.use_RibonanzaNet2 true
--model.ribonanza_net_path ./release_data/ribonanzanet2_checkpoint
Template Modes
| Mode | Use Case |
|------|----------|
| ca_precomputed | Inference with precomputed C1' templates |
| masked_templates | Training by masking the ground truth when you do not have a template dataset |
--use_template ca_precomputed
--model.use_template ca_precomputed
Template Embedder Options
# Number of pairformer blocks in template embedder
--model.template_embedder.n_blocks 2
--num_templates 4
Acknowledgements
We thank the Stanford Das Lab, HHMI, and the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research.
License
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.<br> The source code is made available under Apache-2.0.<br> The model weights are made available under the NVIDIA Open Model License.