A family of codon-resolution language models trained on 130 million protein-coding sequences from over 20,000 species.
CodonFM: Foundation Models for Codon Sequences
CodonFM is a fully open-source suite of foundation models trained directly on codon sequences to learn contextual codon representations and enable downstream codon-aware tasks. We release the entire stack: code, training/finetuning/evaluation scripts, dockerized environments, experiment templates, and pre-trained model weights under an open license for transparent and reproducible use.
Our primary model family, Encodon, uses masked language modeling over codons with scalable architectures (80M to 1B) and efficient memmapped data pipelines. Public links to the pre-trained checkpoints are here: 80M, 600M, 1B, 1B-Cdwt.
The checkpoints can also be found on NGC here.
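Codon-resolution modeling means the tokenizer operates on 3-nucleotide units rather than single bases. Below is a minimal sketch of that idea; the repository's actual tokenizer lives in src/tokenizer/, and its vocabulary, special tokens, and ids will differ.

```python
from itertools import product

# Hypothetical codon vocabulary: a few special tokens plus all 64 codons.
# Ids are illustrative only; the real mapping is defined in src/tokenizer/.
BASES = "ACGT"
SPECIALS = ["<pad>", "<mask>", "<cls>", "<eos>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(
    SPECIALS + ["".join(c) for c in product(BASES, repeat=3)]
)}

def tokenize_cds(seq: str) -> list[int]:
    """Split a coding sequence into codons and map each codon to a token id."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return [VOCAB.get(c, VOCAB["<unk>"]) for c in codons]

ids = tokenize_cds("ATGGCTTGA")  # Met-Ala-Stop -> 3 codon tokens
```

Masked language modeling then masks a fraction of these codon tokens (e.g. p=0.15) and trains the model to recover them from context.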
Methodology and Results
The pre-print of this work, with detailed methodology and results, can be found here.
If you find this work useful, please cite it as follows:
@article{codonfm_2025,
author = {Darabi+, Sajad and Cao+, Fan and Naghipourfar+, Mohsen and Rabi, Sara and Sethia, Ankit and Gion, Kyle and Grewal, Jasleen and Cohen, Jonathan and Greenleaf, William and Goodarzi*, Hani and Sundaram*, Laksshman},
title = {{Learning the language of codon translation with CodonFM}},
url = {https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf},
year = {2025}
}
Note: Sajad Darabi, Fan Cao and Mohsen Naghipourfar are equal contributing first authors.
Corresponding Authors: Hani Goodarzi and Laksshman Sundaram
Accelerated CodonFM
This repository contains the exact code used in the pre-print.
An accelerated version of the codebase is available in BioNeMo Framework Recipes, which uses TransformerEngine to accelerate training and inference. Accelerated checkpoints are available for all Encodon model variants: 80M, 600M, 1B, 1B-Cdwt.
Table of Contents
- Pre-trained Models
- Repository Structure
- Quickstart
- Data
- Running Training/Finetuning/Evaluation
- Using Wandb with CodonFM
- Testing
- License
- Contact
Pre-trained Models
The table below summarizes the open-source pre-trained weights currently available. All of the training scripts are contained in the directory experiment_scripts/pretraining/encodon_filtered/.
| Model | Variant | Hidden size | Layers | Heads | Intermediate | Script | Checkpoint |
|---|---|---|---|---|---|---|---|
| Encodon 80M | MLM (random p=0.15) | 1024 | 6 | 8 | 4096 | mlm/encodon_80m.sh | link |
| Encodon 600M | MLM (random p=0.15) | 2048 | 12 | 16 | 8192 | mlm/encodon_600m.sh | link |
| Encodon 1B | MLM (random p=0.15) | 2048 | 18 | 16 | 8192 | mlm/encodon_1b.sh | link |
| Encodon 1B (CDSWT) | MLM (codon frequency-weighted) | 2048 | 18 | 16 | 8192 | cdswt/encodon_1b.sh | link |
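The hidden size, layer count, and intermediate size above roughly determine the parameter counts in the model names. As a back-of-envelope check (this estimate ignores embeddings, biases, and layer norms, so the numbers are approximate):

```python
def approx_params(hidden: int, layers: int, intermediate: int) -> int:
    """Rough transformer parameter count per layer: 4*H^2 for the attention
    projections (Q, K, V, output) plus 2*H*I for the feed-forward block.
    Embeddings, biases, and layer norms are ignored."""
    return layers * (4 * hidden**2 + 2 * hidden * intermediate)

approx_params(1024, 6, 4096)   # Encodon 80M  -> ~75M
approx_params(2048, 12, 8192)  # Encodon 600M -> ~604M
approx_params(2048, 18, 8192)  # Encodon 1B   -> ~906M
```

The remaining gap to the nominal sizes is consistent with the omitted embedding and normalization parameters.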
Repository Structure
High-level overview (NerdTree-style):
codon-fm/
├── src/ — core library and CLI entrypoints
│ ├── runner.py — entry for pretrain/finetune/eval
│ ├── config.py — model/data/trainer configs
│ ├── tasks.py — pretraining/finetuning/eval tasks
│ ├── models/ — model definitions and components
│ ├── data/ — datamodules, datasets, preprocessing
│ │ └── preprocess/ — item-level preprocessing steps
│ ├── inference/ — inference wrappers and prediction definitions
│ ├── tokenizer/ — codon tokenizer and mappings
│ └── utils/ — logging, schedulers, writers, helpers
├── experiment_scripts/ — launch scripts for pre-training
│ └── pretraining/ — Encodon pretraining
├── data_scripts/ — data download and curation tools
├── notebooks/ — analysis and evaluation notebooks
├── env.example — sample env vars
└── README.md — repo guide
Quickstart
To run the scripts in this repository, we recommend using the provided Docker setup.
1. Clone the repository
git clone https://github.com/NVIDIA-Digital-Bio/CodonFM
cd codon-fm
2. Docker Setup
The fastest way to get up and running with CodonFM is through the Docker setup below. This is an interactive development environment: you build and launch a container that mounts your local repository, so you can edit code locally and run it inside the container.
To build and launch the development container, simply run the following from the root folder:
bash run_dev.sh
This script will:
- Build the development Docker image using the development target in the Dockerfile.
- Pass your user and group IDs to the container to avoid permission issues with mounted files.
- Stop and remove any existing container with the same name.
- Launch a new container with your local code mounted at /workspace, with GPU access, host networking, and common directories for data and SSH keys.
You can also customize the data and checkpoint directory paths by passing arguments:
bash run_dev.sh --data-dir /path/to/your/data --checkpoints-dir /path/to/your/checkpoints
You will be dropped into a bash shell inside the container as a non-root user.
Evaluation Notebooks 📓
A series of notebooks is provided in the notebooks directory, showcasing multiple use cases such as zero-shot variant prediction and finetuning on downstream tasks. See a brief overview below:
| Notebook | Description |
|---|---|
| 00-Mutation-Datasets-Preprocessing.ipynb | Prepare and harmonize mutation datasets used across evaluations. |
| 0-Zero-Shot-Mutation-Variant-CancerHotspot.ipynb | Zero-shot variant effect scoring on Cancer Hotspots. |
| 1-Zero-Shot-Mutation-Variant-DDD-ASD.ipynb | Zero-shot scoring on the Deciphering Developmental Disorders (DDD) and autism spectrum disorder (ASD) cohort studies, which catalog genetic mutations linked to rare pediatric and developmental diseases, evaluating separation of healthy versus disease cohorts based on coding-sequence context. |
| 2-Zero-Shot-Mutation-Variant-Clinvar-Alphamissense.ipynb | Zero-shot evaluation on ClinVar missense variants, classifying benign vs. pathogenic. |
| 3-Zero-Shot-Mutation-Variant-Clinvar-Synonymous.ipynb | Zero-shot evaluation on ClinVar synonymous variants, evaluating how the models separate benign versus pathogenic synonymous mutations. |
| 4-EnCodon-Downstream-Task-riboNN.ipynb | Predicts ribosome profiling signal intensity along coding sequences, evaluating how well models capture translation efficiency and codon-level regulation from sequence context. |
| 5-EnCodon-Downstream-Task-mRFP-expression.ipynb | Predicts fluorescent protein expression levels (mRFP) from coding sequences, testing how accurately models capture codon-dependent effects on translation efficiency and protein abundance. |
| 6-EnCodon-Downstream-Task-mRNA-stability.ipynb | Predicts mRNA stability from coding sequences, evaluating how the models associate codon composition with mRNA stability. |
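The zero-shot notebooks score variants by comparing model likelihoods of the reference and alternate codons at the variant position. A minimal sketch of one common scoring scheme (the notebooks' exact procedure may differ; the probabilities below are made up for illustration):

```python
import math

def llr_score(p_ref: float, p_alt: float) -> float:
    """Log-likelihood ratio of the alternate vs. reference codon under a
    masked language model queried with the variant position masked.
    More negative values mean the model finds the substitution less likely,
    a common proxy for deleteriousness."""
    return math.log(p_alt) - math.log(p_ref)

# Toy probabilities at a masked position (illustrative numbers only).
probs = {"CTG": 0.42, "CTA": 0.03}
score = llr_score(probs["CTG"], probs["CTA"])  # reference CTG -> alternate CTA
```

Scores computed this way can be thresholded or ranked (e.g. via AUROC against ClinVar labels) to evaluate benign-vs-pathogenic separation.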
Data 📊
Pre-training Dataset
The data curation tools live under data_scripts/data_curation/.
- Main entrypoint: open and run data_scripts/data_curation/download_cds_clean.ipynb. It documents how to obtain coding sequences (CDS), process metadata, and produce curated outputs.
- Filtering resources: data_scripts/data_curation/taxids_to_remove_bac.json lists bacterial taxids to exclude during curation.
- Recommended environment: use the provided dev container (bash run_dev.sh), then open the notebook in Jupyter/VS Code and execute the cells.
Outputs from the notebook (cleaned CDS files and metadata tables) can be transformed into training-ready formats by running the memmap creation script src/data/data_scripts/ncbi_memmap_dataset_batched.py on the output of the src/data/data_curation/ notebook. The resulting memmaps can then be consumed by CodonMemmapDataset.
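The memmap layout lets training readers random-access tokenized sequences without loading the full dataset into memory. A toy sketch with numpy.memmap (the actual on-disk format produced by ncbi_memmap_dataset_batched.py and read by CodonMemmapDataset may differ):

```python
import os
import tempfile

import numpy as np

# Write a small array of toy codon-token ids to a memory-mapped file.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "tokens.mmap")
tokens = np.array([7, 12, 63, 5, 5, 41], dtype=np.int32)

mm = np.memmap(path, dtype=np.int32, mode="w+", shape=tokens.shape)
mm[:] = tokens
mm.flush()

# Training-time readers open the same file read-only; slicing touches only
# the pages needed for that example rather than the whole file.
reader = np.memmap(path, dtype=np.int32, mode="r", shape=tokens.shape)
window = reader[1:4]  # random-access one training example
```

This is what makes the pipeline efficient at the 130M-sequence scale: the OS page cache handles locality while workers index arbitrary offsets.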
Evaluation Datasets
- mRFP expression and mRNA stability: open and run the notebooks notebooks/5-EnCodon-Downstream-Task-mRFP-expression.ipynb and notebooks/6-EnCodon-Downstream-Task-mRNA-stability.ipynb. These notebooks contain cells that download/prepare the datasets and guide you through executing the evaluations end-to-end.
- Mean translation efficiency prediction task: open and run the notebook notebooks/4-EnCodon-Downstream-Task-riboNN.ipynb. It will download/prepare the downstream dataset and guide you through finetuning on this downstream task.
- Synonymous, DDD/ASD, and Cancer Hotspot variant datasets: follow `notebooks/00-Mutation-Datasets-Preprocessing.i
