A family of codon-resolution language models trained on 130 million protein-coding sequences from over 20,000 species.
CodonFM: Foundation Models for Codon Sequences
CodonFM is a fully open-source suite of foundation models trained directly on codon sequences to learn contextual codon representations and enable downstream codon-aware tasks. We release the entire stack: code, training/finetuning/evaluation scripts, dockerized environments, experiment templates, and pre-trained model weights under an open license for transparent and reproducible use.
Our primary model family, Encodon, uses masked language modeling over codons with scalable architectures (80M to 1B) and efficient memmapped data pipelines. Public links to the pre-trained checkpoints are here: 80M, 600M, 1B, 1B-Cdwt.
The checkpoints can also be found on NGC here.
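Codon-resolution modeling means the tokenizer operates on 3-nucleotide units rather than single bases. Below is a minimal sketch of that idea; the repository's actual tokenizer lives in src/tokenizer/, and its vocabulary, special tokens, and ids will differ.

```python
from itertools import product

# Hypothetical codon vocabulary: a few special tokens plus all 64 codons.
# Ids are illustrative only; the real mapping is defined in src/tokenizer/.
BASES = "ACGT"
SPECIALS = ["<pad>", "<mask>", "<cls>", "<eos>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(
    SPECIALS + ["".join(c) for c in product(BASES, repeat=3)]
)}

def tokenize_cds(seq: str) -> list[int]:
    """Split a coding sequence into codons and map each codon to a token id."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return [VOCAB.get(c, VOCAB["<unk>"]) for c in codons]

ids = tokenize_cds("ATGGCTTGA")  # Met-Ala-Stop -> 3 codon tokens
```

Masked language modeling then masks a fraction of these codon tokens (e.g. p=0.15) and trains the model to recover them from context.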
Methodology and Results
The pre-print of this work, with detailed methodology and results, can be found here.
If you find this work useful, please cite it as follows:
@article{codonfm_2025,
author = {Darabi+, Sajad and Cao+, Fan and Naghipourfar+, Mohsen and Rabi, Sara and Sethia, Ankit and Gion, Kyle and Grewal, Jasleen and Cohen, Jonathan and Greenleaf, William and Goodarzi*, Hani and Sundaram*, Laksshman},
title = {{Learning the language of codon translation with CodonFM}},
url = {https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf},
year = {2025}
}
Note: Sajad Darabi, Fan Cao and Mohsen Naghipourfar are equal contributing first authors.
Corresponding Authors: Hani Goodarzi and Laksshman Sundaram
Accelerated CodonFM
This repository contains the exact code used in the pre-print.
An accelerated version of the codebase is available in BioNeMo Framework Recipes, which uses TransformerEngine to accelerate training and inference. Accelerated checkpoints are available for all Encodon model variants: 80M, 600M, 1B, 1B-Cdwt.
Table of Contents
- Pre-trained Models
- Repository Structure
- Quickstart
- Data
- Running Training/Finetuning/Evaluation
- Using Wandb with CodonFM
- Testing
- License
- Contact
Pre-trained Models
The table below summarizes the open-source pre-trained weights currently available. All of the training scripts are contained in the directory experiment_scripts/pretraining/encodon_filtered/.
| Model | Variant | Hidden size | Layers | Heads | Intermediate | Script | Checkpoint |
|---|---|---|---|---|---|---|---|
| Encodon 80M | MLM (random p=0.15) | 1024 | 6 | 8 | 4096 | mlm/encodon_80m.sh | link |
| Encodon 600M | MLM (random p=0.15) | 2048 | 12 | 16 | 8192 | mlm/encodon_600m.sh | link |
| Encodon 1B | MLM (random p=0.15) | 2048 | 18 | 16 | 8192 | mlm/encodon_1b.sh | link |
| Encodon 1B (CDSWT) | MLM (codon frequency-weighted) | 2048 | 18 | 16 | 8192 | cdswt/encodon_1b.sh | link |
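The hidden size, layer count, and intermediate size above roughly determine the parameter counts in the model names. As a back-of-envelope check (this estimate ignores embeddings, biases, and layer norms, so the numbers are approximate):

```python
def approx_params(hidden: int, layers: int, intermediate: int) -> int:
    """Rough transformer parameter count per layer: 4*H^2 for the attention
    projections (Q, K, V, output) plus 2*H*I for the feed-forward block.
    Embeddings, biases, and layer norms are ignored."""
    return layers * (4 * hidden**2 + 2 * hidden * intermediate)

approx_params(1024, 6, 4096)   # Encodon 80M  -> ~75M
approx_params(2048, 12, 8192)  # Encodon 600M -> ~604M
approx_params(2048, 18, 8192)  # Encodon 1B   -> ~906M
```

The remaining gap to the nominal sizes is consistent with the omitted embedding and normalization parameters.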
Repository Structure
High-level overview (NerdTree-style):
codon-fm/
├── src/ — core library and CLI entrypoints
│ ├── runner.py — entry for pretrain/finetune/eval
│ ├── config.py — model/data/trainer configs
│ ├── tasks.py — pretraining/finetuning/eval tasks
│ ├── models/ — model definitions and components
│ ├── data/ — datamodules, datasets, preprocessing
│ │ └── preprocess/ — item-level preprocessing steps
│ ├── inference/ — inference wrappers and prediction definitions
│ ├── tokenizer/ — codon tokenizer and mappings
│ └── utils/ — logging, schedulers, writers, helpers
├── experiment_scripts/ — launch scripts for pre-training
│ └── pretraining/ — Encodon pretraining
├── data_scripts/ — data download and curation tools
├── notebooks/ — analysis and evaluation notebooks
├── env.example — sample env vars
└── README.md — repo guide
Quickstart
To run the scripts in this repository, we recommend using the provided Docker setup.
1. Clone the repository
git clone https://github.com/NVIDIA-Digital-Bio/CodonFM
cd codon-fm
2. Docker Setup
The fastest way to get up and running with CodonFM is through the Docker setup below. This is an interactive development environment: you build and launch a container that mounts your local repository, so you can edit code locally and run it inside the container.
To build and launch the development container, simply run the following from the root folder:
bash run_dev.sh
This script will:
- Build the development Docker image using the development target in the Dockerfile.
- Pass your user and group IDs to the container to avoid permission issues with mounted files.
- Stop and remove any existing container with the same name.
- Launch a new container with your local code mounted at /workspace, with GPU access, host networking, and common directories for data and SSH keys.
You can also customize the data and checkpoint directory paths by passing arguments:
bash run_dev.sh --data-dir /path/to/your/data --checkpoints-dir /path/to/your/checkpoints
You will be dropped into a bash shell inside the container as a non-root user.
Evaluation Notebooks 📓
A series of notebooks is provided in the notebooks directory, showcasing multiple use cases such as zero-shot variant prediction and finetuning on downstream tasks. See a brief overview below:
| Notebook | Description |
|---|---|
| 00-Mutation-Datasets-Preprocessing.ipynb | Prepare and harmonize mutation datasets used across evaluations. |
| 0-Zero-Shot-Mutation-Variant-CancerHotspot.ipynb | Zero-shot variant effect scoring on Cancer Hotspots. |
| 1-Zero-Shot-Mutation-Variant-DDD-ASD.ipynb | Zero-shot scoring on the Deciphering Developmental Disorders (DDD) and autism spectrum disorder (ASD) cohort studies, which catalog genetic mutations linked to rare pediatric and developmental diseases, evaluating separation of healthy versus disease cohorts based on coding-sequence context. |
| 2-Zero-Shot-Mutation-Variant-Clinvar-Alphamissense.ipynb | Zero-shot evaluation on ClinVar missense variants, classifying benign vs. pathogenic. |
| 3-Zero-Shot-Mutation-Variant-Clinvar-Synonymous.ipynb | Zero-shot evaluation on ClinVar synonymous variants, evaluating how the models separate benign versus pathogenic synonymous mutations. |
| 4-EnCodon-Downstream-Task-riboNN.ipynb | Predicts ribosome profiling signal intensity along coding sequences, evaluating how well models capture translation efficiency and codon-level regulation from sequence context. |
| 5-EnCodon-Downstream-Task-mRFP-expression.ipynb | Predicts fluorescent protein expression levels (mRFP) from coding sequences, testing how accurately models capture codon-dependent effects on translation efficiency and protein abundance. |
| 6-EnCodon-Downstream-Task-mRNA-stability.ipynb | Predicts mRNA stability from coding sequences, evaluating how the models associate codon composition with mRNA stability. |
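The zero-shot notebooks score variants by comparing model likelihoods of the reference and alternate codons at the variant position. A minimal sketch of one common scoring scheme (the notebooks' exact procedure may differ; the probabilities below are made up for illustration):

```python
import math

def llr_score(p_ref: float, p_alt: float) -> float:
    """Log-likelihood ratio of the alternate vs. reference codon under a
    masked language model queried with the variant position masked.
    More negative values mean the model finds the substitution less likely,
    a common proxy for deleteriousness."""
    return math.log(p_alt) - math.log(p_ref)

# Toy probabilities at a masked position (illustrative numbers only).
probs = {"CTG": 0.42, "CTA": 0.03}
score = llr_score(probs["CTG"], probs["CTA"])  # reference CTG -> alternate CTA
```

Scores computed this way can be thresholded or ranked (e.g. via AUROC against ClinVar labels) to evaluate benign-vs-pathogenic separation.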
Data 📊
Pre-training Dataset
The data curation tools live under data_scripts/data_curation/.
- Main entrypoint: open and run data_scripts/data_curation/download_cds_clean.ipynb. It documents how to obtain coding sequences (CDS), process metadata, and produce curated outputs.
- Filtering resources: data_scripts/data_curation/taxids_to_remove_bac.json lists bacterial taxids to exclude during curation.
- Recommended environment: use the provided dev container (bash run_dev.sh), then open the notebook in Jupyter/VS Code and execute the cells.
Outputs from the notebook (cleaned CDS files and metadata tables) can be transformed into training-ready formats by running the memmap creation script src/data/data_scripts/ncbi_memmap_dataset_batched.py on the output of the src/data/data_curation/ notebook. The resulting memmaps can then be consumed by CodonMemmapDataset.
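The memmap layout lets training readers random-access tokenized sequences without loading the full dataset into memory. A toy sketch with numpy.memmap (the actual on-disk format produced by ncbi_memmap_dataset_batched.py and read by CodonMemmapDataset may differ):

```python
import os
import tempfile

import numpy as np

# Write a small array of toy codon-token ids to a memory-mapped file.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "tokens.mmap")
tokens = np.array([7, 12, 63, 5, 5, 41], dtype=np.int32)

mm = np.memmap(path, dtype=np.int32, mode="w+", shape=tokens.shape)
mm[:] = tokens
mm.flush()

# Training-time readers open the same file read-only; slicing touches only
# the pages needed for that example rather than the whole file.
reader = np.memmap(path, dtype=np.int32, mode="r", shape=tokens.shape)
window = reader[1:4]  # random-access one training example
```

This is what makes the pipeline efficient at the 130M-sequence scale: the OS page cache handles locality while workers index arbitrary offsets.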
Evaluation Datasets
- mRFP expression and mRNA stability: open and run the notebooks notebooks/5-EnCodon-Downstream-Task-mRFP-expression.ipynb and notebooks/6-EnCodon-Downstream-Task-mRNA-stability.ipynb. These notebooks contain cells that download/prepare the datasets and guide you through executing the evaluations end-to-end.
- Mean translation efficiency prediction task: open and run the notebook notebooks/4-EnCodon-Downstream-Task-riboNN.ipynb. It will download/prepare the downstream dataset and guide you through finetuning on this downstream task.
- Synonymous, DDD/ASD, and Cancer Hotspot variant datasets: follow `notebooks/00-Mutation-Datasets-Preprocessing.i
