CodonTransformer
CodonTransformer (1M+ Downloads); The tool for codon optimization, optimizing DNA for protein expression
Abstract
The genetic code is degenerate, allowing a multitude of possible DNA sequences to encode the same protein. This degeneracy impacts the efficiency of heterologous protein production due to the codon usage preferences of each organism. The process of tailoring organism-specific synonymous codons, known as codon optimization, must respect local sequence patterns that go beyond global codon preferences. As a result, the search space faces a combinatorial explosion that makes exhaustive exploration impossible. Nevertheless, throughout the diverse life on Earth, natural selection has already optimized the sequences, thereby providing a rich source of data allowing machine learning algorithms to explore the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life. The model demonstrates context-awareness thanks to the attention mechanism and bidirectionality of the Transformers we used, and to a novel sequence representation that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimum negative cis-regulatory elements. This work introduces a novel strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a state-of-the-art codon optimization framework with a customizable open-access model and a user-friendly interface.
Use Case
For a user-friendly interface, check out our Google Colab Notebook.

After installing CodonTransformer, you can use:
import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(device)
# Set your input data
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
organism = "Escherichia coli general"
# Predict with CodonTransformer
output = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=device,
    tokenizer=tokenizer,
    model=model,
    attention_type="original_full",
    deterministic=True
)
print(format_model_output(output))
The output is:
-----------------------------
| Organism |
-----------------------------
Escherichia coli general
-----------------------------
| Input Protein |
-----------------------------
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG
-----------------------------
| Processed Input |
-----------------------------
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK
-----------------------------
| Predicted DNA |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA
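As a quick sanity check on the output above, the following sketch rebuilds the Processed Input string from the protein and translates the predicted DNA back to amino acids with the standard genetic code. The helper names (`to_processed_input`, `translate`) are illustrative and not part of the CodonTransformer package. The `_UNK` suffix and the trailing `_` stop symbol are read off the displayed output, and the library's own preprocessing may do more than this.

```python
import itertools

# Illustrative helpers; these names are not part of the CodonTransformer package.
BASES = "TCAG"
# Standard genetic code with codons enumerated in T, C, A, G order; "*" marks a stop codon.
STANDARD_CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(c) for c in itertools.product(BASES, repeat=3)), STANDARD_CODE)
)


def to_processed_input(protein: str) -> str:
    """Mimic the displayed format: one '<aa>_UNK' token per residue, then '_' for the stop."""
    return " ".join(f"{aa}_UNK" for aa in protein + "_")


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon using the standard genetic code."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
predicted_dna = (
    "ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCG"
    "TTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTC"
    "TTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA"
)

# The prediction should encode the input protein followed by a stop codon.
print(translate(predicted_dna) == protein + "*")
print(to_processed_input(protein).split()[:3])
```

A check like this is most useful when sampling at higher temperatures without match_protein=True, where a prediction is not guaranteed to translate back to the input.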
Generating Multiple Variable Sequences
Set deterministic=False to generate variable sequences. Control the variability using temperature:
- temperature (recommended between 0.2 and 0.8):
  - Lower values (e.g., 0.2): More conservative predictions
  - Higher values (e.g., 0.8): More diverse predictions

Using high temperatures (e.g., above 1) might result in predicted DNA sequences that do not translate to the input protein.

You can set match_protein=True to ensure predicted DNA sequences translate to the input protein.

Generate multiple sequences by setting num_sequences to a value greater than 1.
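A minimal sketch of sampling several candidate sequences, using only the arguments documented for predict_dna_sequence in this README. The wrapper name `sample_variants`, its default values, and its argument checks are illustrative assumptions rather than part of the package. In particular, accepting top_p equal to 1 is a guess at how "between 0 and 1" should be read. Imports are deferred into the function so the checks run before any model download.

```python
def sample_variants(protein, organism, num_sequences=5, temperature=0.4, top_p=0.95):
    """Sample several DNA candidates for one protein (illustrative wrapper)."""
    # Argument checks mirroring the constraints stated in this README (assumed bounds).
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if num_sequences < 1:
        raise ValueError("num_sequences must be positive")

    # Deferred imports: these require CodonTransformer and its dependencies.
    import torch
    from transformers import AutoTokenizer, BigBirdForMaskedLM
    from CodonTransformer.CodonPrediction import predict_dna_sequence

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
    model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(device)

    return predict_dna_sequence(
        protein=protein,
        organism=organism,
        device=device,
        tokenizer=tokenizer,
        model=model,
        attention_type="original_full",
        deterministic=False,          # sample instead of picking the most likely token
        temperature=temperature,      # lower is more conservative, higher more diverse
        top_p=top_p,                  # nucleus sampling threshold
        num_sequences=num_sequences,  # only takes effect when deterministic=False
        match_protein=True,           # restrict to codons that encode the input protein
    )
```

Calling `sample_variants(protein, "Escherichia coli general")` loads the model on every call, so for repeated use it is probably better to load the tokenizer and model once and pass them in, as in the Use Case example.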
Batch Inference
You can use the inference template to set up your dataset for batch inference in Google Colab. A sample dataset is provided under \demo. A typical inference takes 1-3 seconds, depending on available compute.
Arguments of predict_dna_sequence
| Argument | Type | Description | Default |
|----------|------|-------------|---------|
| protein | str | Input protein sequence | Required |
| organism | Union[int, str] | Organism ID (integer) or name (string) (e.g., "Escherichia coli general") | Required |
| device | torch.device | PyTorch device object specifying whether to run on CPU or GPU | Required |
| tokenizer | Union[str, PreTrainedTokenizerFast, None] | Either a file path to load tokenizer from, a pre-loaded tokenizer object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | None |
| model | Union[str, torch.nn.Module, None] | Either a file path to load model from, a pre-loaded model object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | None |
| attention_type | str | Type of attention mechanism to use in model - 'block_sparse' for memory efficient or 'original_full' for standard attention | "original_full" |
| deterministic | bool | If True, uses deterministic decoding (picks most likely tokens). If False, samples tokens based on probabilities adjusted by temperature | True |
| temperature | float | Controls randomness in non-deterministic mode. Lower values (0.2) are conservative and pick high probability tokens, while higher values (0.8) allow more diversity. Must be positive | 0.2 |
| top_p | float | Nucleus sampling threshold - only tokens with cumulative probability up to this value are considered. Balances diversity and quality of predictions. Must be between 0 and 1 | 0.95 |
| num_sequences | int | Number of different DNA sequences to generate. Only works when deterministic=False. Each sequence will be sampled based on the temperature and top_p parameters. Must be positive | 1 |
| match_protein | bool | Constrains predictions to only use codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error-prone input proteins (e.g., not starting with methionine or having numerous repetitions) | False |
Returns: Union[DNASequencePrediction, List[DNASequencePrediction]] containing predicted DNA sequence(s) and metadata.
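To make the temperature and top_p arguments more concrete, here is a toy sketch of temperature scaling followed by nucleus (top-p) filtering over a handful of codon scores. It is not the library's implementation: the function names and the made-up scores for four leucine codons are assumptions for illustration only, and the real model operates on its own token vocabulary.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize the survivors


def sample_codon(codons, logits, temperature=0.2, top_p=0.95, rng=random):
    """Sample one codon after temperature scaling and top-p filtering."""
    probs = dict(zip(codons, softmax_with_temperature(logits, temperature)))
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


# Made-up scores for four leucine codons.
codons = ["CTG", "TTA", "CTC", "CTA"]
logits = [2.0, 1.0, 0.5, 0.0]
for t in (0.2, 0.8):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample_codon(codons, logits, temperature=0.8, rng=random.Random(0)))
```

At a temperature of 0.2 nearly all probability lands on the top-scoring codon, while 0.8 spreads it across alternatives, which matches the "conservative" versus "diverse" description in the table.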
Installation
Install CodonTransformer via pip:
pip install CodonTransformer
Or clone the repository:
git clone https://github.com/adibvafa/CodonTransformer.git
cd CodonTransformer
pip install -r requirements.txt
The package requires python>=3.9 and supports all major operating systems. Installation takes about 10-30 seconds, depending on which requirements are already installed. The requirements are available here.
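A small pre-flight check for the requirements above, using only the standard library. The function name `check_environment` is illustrative. It only reports whether the interpreter meets the stated minimum version and whether a package named CodonTransformer can be found, and it does not verify the package's dependencies.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 9), package="CodonTransformer"):
    """Return (python_ok, package_installed) for a quick pre-flight check."""
    python_ok = sys.version_info >= min_version
    package_installed = importlib.util.find_spec(package) is not None
    return python_ok, package_installed


python_ok, package_installed = check_environment()
print(f"Python >= 3.9: {python_ok}; CodonTransformer importable: {package_installed}")
```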
Finetuning CodonTransformer
To finetune CodonTransformer on your own data, follow these steps:
- Prepare your dataset

  Create a CSV file with the following columns:
  - dna: DNA sequences (string, preferably uppercase ATCG)
  - protein: Protein sequences (string, preferably uppercase amino acid letters)
  - organism: Target organism (string or int, must be from ORGANISM2ID in CodonUtils)

  Note:
  - Use organisms from the FINE_TUNE_ORGANISMS list for best results.
  - For E. coli, use Escherichia coli general.
  - DNA sequences should ideally contain only A, T, C, and G. Ambiguous codons are replaced with 'UNK' for tokenization.
  - Protein sequences should contain standard amino acid letters from AMINO_ACIDS in CodonUtils. Ambiguous amino acids are replaced according to the AMBIGUOUS_AMINOACID_MAP in CodonUtils.
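The dataset layout described above can be sketched with the standard csv module. The helper names and the toy example row are assumptions for illustration. The multiple-of-three length check is also an assumption, not a documented requirement. Organism names are not validated here, since that would require ORGANISM2ID from CodonUtils.

```python
import csv
import io

DNA_LETTERS = set("ATCG")
COLUMNS = ["dna", "protein", "organism"]


def normalize_example(dna, protein, organism):
    """Uppercase the sequences and run a basic length check (assumed rule)."""
    dna = dna.strip().upper()
    protein = protein.strip().upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of 3")
    return {"dna": dna, "protein": protein, "organism": organism}


def ambiguous_codons(dna):
    """List codons containing letters other than A, T, C, G."""
    dna = dna.upper()
    return [
        dna[i:i + 3]
        for i in range(0, len(dna), 3)
        if not set(dna[i:i + 3]) <= DNA_LETTERS
    ]


def write_dataset(examples, handle):
    """Write examples to an open text handle with the three expected columns."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for example in examples:
        writer.writerow(normalize_example(**example))


# Toy row for illustration only.
examples = [
    {"dna": "atggctttatgg", "protein": "malw", "organism": "Escherichia coli general"},
]
buffer = io.StringIO()
write_dataset(examples, buffer)
print(buffer.getvalue())
print(ambiguous_codons("ATGNNNGCT"))
```

When writing to disk instead of an in-memory buffer, open the file with `newline=""` so the csv module controls line endings.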
