# The Family of Diffusion Protein Language Models (DPLM)
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a> <a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a><br>
## Overview 🌟
This repository contains the official implementation of training and inference as well as the pre-trained weights for the Family of Diffusion Protein Language Models (DPLM), including:
- **DPLM**, from the ICML'24 paper "Diffusion Language Models Are Versatile Protein Learners", which introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
- **DPLM-2**, from the ICLR'25 paper "DPLM-2: A Multimodal Diffusion Protein Language Model", a multimodal protein foundation model that extends the discrete diffusion protein language model to accommodate both sequences and structures.
- **DPLM-2 Bit**, from the ICML'25 spotlight paper "Elucidating the Design Space of Multimodal Protein Language Models", where we elucidate the challenges of structure modeling in multimodal protein language models (e.g., DPLM-2 and ESM3) and propose advanced designs for better structure modeling. We have released the finer-grained bit-based generative modeling (DPLM-2 Bit); the full implementation of the paper will be released soon.
## Key Features 🔑
Specifically, the DPLM family exhibits impressive performance in protein (structure and sequence) co-generation, any-to-any conditional generation (e.g., folding, inverse folding, and motif scaffolding), and representation learning. DPLM is developed on top of ByProt. This repository contains pretraining scripts for DPLM and running scripts for various protein generation and understanding tasks, as detailed below:
- Unconditional protein generation: DPLM is capable of unconditionally generating protein sequences with reasonable predicted structures. DPLM-2 can generate diverse and highly plausible proteins through simultaneous structure-sequence co-generation.
- Sequence-conditioned generation (forward folding): DPLM-2 can generate reasonable protein structures given an input protein sequence, achieving performance close to that of strong folding models (e.g., ESMFold).
- Structure-conditioned generation (inverse folding): DPLM and DPLM-2 can produce sequences that can confidently fold into the given backbone structure.
- Motif scaffolding: DPLM can generate reasonable scaffold sequences given specific functional motifs. DPLM-2 achieves more successful motif scaffolding through multimodal motif conditioning.
- Representation learning: DPLM is a superior protein sequence representation learner, while DPLM-2 offers structure-aware protein representations, demonstrating impressive performance across a variety of protein predictive tasks.
- Controllable generation: DPLM enjoys plug-and-play programmability, generating samples satisfying provided secondary structure annotations.
## TODOs
- [ ] Controllable/guided generation with discrete diffusion classifier guidance.
- [ ] Representation learning of DPLM-2
## DPLM
"Diffusion Language Models Are Versatile Protein Learners." Wang et al., In ICML 2024

## DPLM-2
"DPLM-2: A Multimodal Diffusion Protein Language Model." Wang et al., In ICLR 2025

## Updates 📢
- [2025-07] We updated the default sampling strategy of DPLM-2 to `annealing@2.0:0.1`.
- [2025-04] Our latest work DPLM-2.1, which focuses on analysis and better protein structure modeling of multimodal protein language models, was accepted to ICML'25 as a Spotlight! Check out "Elucidating the Design Space of Multimodal Protein Language Models". We have released the implementation of finer-grained and better structure modeling (DPLM-2 Bit); the full implementation will be released soon.
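For readers curious about the `annealing@2.0:0.1` string: the README does not spell out its format, but it plausibly reads as `<strategy>@<start>:<end>` (e.g., a temperature annealed from 2.0 down to 0.1 during sampling). A minimal, illustrative parser under that assumption, not the project's actual implementation:

```python
# Illustrative sketch only: we ASSUME the sampling-strategy string has the
# form "<name>@<start>:<end>", e.g. "annealing@2.0:0.1" meaning a schedule
# annealed from 2.0 down to 0.1. This is not confirmed by the repository.
def parse_sampling_strategy(spec: str):
    if "@" not in spec:
        return spec, None, None  # plain strategy name with no schedule
    name, _, schedule = spec.partition("@")
    start, _, end = schedule.partition(":")
    return name, float(start), float(end)

print(parse_sampling_strategy("annealing@2.0:0.1"))  # ('annealing', 2.0, 0.1)
```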
- [2024-10] Check out our new work DPLM-2, a multimodal protein foundation model that extends DPLM to simultaneously model, understand, and generate both sequences and structures!
- [2024-03] We release DPLM, a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences!
## Table of Contents 📚
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Load Pretrained Models](#load-pretrained-models)
  - [Generation Examples](#generation-examples)
- [Model Checkpoints](#model-checkpoints)
- [Advanced Usage](#advanced-usage)
  - [Training](#training)
## Quick Start
### Installation
```bash
# clone project
git clone --recursive https://url/to/this/repo/dplm.git
cd dplm

# create conda virtual environment
env_name=dplm
conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}

# automatically install everything else
bash scripts/install.sh
```
### Load Pretrained Models
Users can load DPLM/DPLM-2 checkpoints as follows:
```python
from byprot.models.dplm import DiffusionProteinLanguageModel as DPLM
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
from byprot.models.dplm2 import DPLM2Bit

dplm = DPLM.from_pretrained("airkingbd/dplm_650m").cuda()
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda()
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
```
### Generation Examples
#### Protein sequence generation
```python
from generate_dplm import initialize_generation

input_tokens = initialize_generation(
    length=200,
    num_seqs=5,
    tokenizer=dplm.tokenizer,
    device=next(dplm.parameters()).device,
)
samples = dplm.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
print([''.join(seq.split(' ')) for seq in dplm.tokenizer.batch_decode(samples, skip_special_tokens=True)])
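The `print` line above joins the space-separated residues returned by `batch_decode` into plain amino-acid strings. A self-contained illustration of just that post-processing step, using made-up stand-in strings rather than real model output:

```python
# Toy illustration of the post-processing step: batch_decode returns
# residues separated by spaces, so we join them into plain amino-acid
# strings. The "decoded" values below are made up for demonstration.
decoded = ["M K V L A", "G G S T P"]  # stand-ins for batch_decode output
sequences = [''.join(seq.split(' ')) for seq in decoded]
print(sequences)  # ['MKVLA', 'GGSTP']
```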
#### Protein sequence-structure co-generation
Users can check the generated sequences and structures in the `./generation-results` folder.
```python
from generate_dplm2 import initialize_generation, save_results

input_tokens = initialize_generation(
    task="co_generation",
    length=200,
    num_seqs=5,
    tokenizer=dplm2.tokenizer,
    device=next(dplm2.parameters()).device,
)[0]

# Co-generation with DPLM-2
samples = dplm2.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_generation",
    tokenizer=dplm2.tokenizer,
    struct_tokenizer=dplm2.struct_tokenizer,
    save_pdb=True,
)

# Co-generation with DPLM-2 Bit
samples = dplm2_bit.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_bit_generation",
    tokenizer=dplm2_bit.tokenizer,
    struct_tokenizer=dplm2_bit.struct_tokenizer,
)
```
## Model Checkpoints
Access pretrained models in varying sizes:
| Model name     | Model size      |
| -------------- | --------------- |
| dplm-150m      | 150M parameters |
| dplm-650m      | 650M parameters |
| dplm-3b        | 3B parameters   |
| dplm2-150m     | 150M parameters |
| dplm2-650m     | 650M parameters |
| dplm2-3b       | 3B parameters   |
| dplm2-bit-650m | 650M parameters |
## Advanced Usage
### Training
#### DPLM <!-- omit in toc -->
#### Dataset <!-- omit in toc -->
We pretrain DPLM on the UniRef50 dataset, which contains about 42 million protein sequences. We use the preprocessed UniRef50 dataset provided by EvoDiff (Alamdari et al., 2023), which can be downloaded from this link. After downloading, please place the dataset in the `./data-bin/uniref50` folder.
We also provide the preprocessed dataset in HuggingFace datasets format, which we recommend using. Users can download the HF dataset locally in advance for faster loading:
```bash
bash scripts/download_uniref50_hf.sh
```
#### Example of training <!-- omit in toc -->

We train DPLM with approximately 1 million tokens per batch for 100,000 training steps.
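As a rough sanity check on the "1 million tokens per batch" figure: in token-budget training, the effective batch size is the product of the per-GPU token budget, the number of GPUs, and the gradient-accumulation steps. The numbers below are illustrative assumptions, not the actual DPLM training configuration:

```python
# Illustrative arithmetic only: these values are ASSUMPTIONS, not the
# actual DPLM configuration. Effective tokens per batch equals
# max_tokens_per_gpu * num_gpus * gradient_accumulation_steps.
max_tokens_per_gpu = 8192  # token budget per GPU per step (assumed)
num_gpus = 16              # assumed
grad_accum = 8             # assumed
effective_tokens = max_tokens_per_gpu * num_gpus * grad_accum
print(effective_tokens)  # 1048576, i.e. roughly 1M tokens per batch
```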