# The Family of Diffusion Protein Language Models (DPLM)
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a> <a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a><br>
## Overview 🌟
This repository contains the official implementation of training and inference as well as the pre-trained weights for the Family of Diffusion Protein Language Models (DPLM), including:
- **DPLM**, from the ICML'24 paper "Diffusion Language Models Are Versatile Protein Learners", which introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
- **DPLM-2**, from the ICLR'25 paper "DPLM-2: A Multimodal Diffusion Protein Language Model", a multimodal protein foundation model that extends the discrete diffusion protein language model to accommodate both sequences and structures.
- **DPLM-2 Bit**, from the ICML'25 spotlight paper "Elucidating the Design Space of Multimodal Protein Language Models", where we elucidate the challenges of structure modeling in multimodal protein language models (e.g., DPLM-2 and ESM3) and propose advanced designs for better structure modeling. We have released the finer-grained bit-based generative modeling (DPLM-2 Bit); the full implementation of the paper will be released soon.
## Key Features 🔑
Specifically, the DPLM family exhibits impressive performance in protein (structure and sequence) co-generation, any-to-any conditional generation (e.g., folding, inverse folding, and motif scaffolding), and representation learning. DPLM is developed on top of ByProt. This repository contains pretraining scripts for DPLM and running scripts for various protein generation and understanding tasks, as detailed below:
- Unconditional protein generation: DPLM is capable of unconditionally generating protein sequences with reasonable predicted structures. DPLM-2 can generate diverse and highly plausible proteins through simultaneous structure-sequence co-generation.
- Sequence-conditioned generation (forward folding): DPLM-2 can generate reasonable protein structures given an input protein sequence, achieving performance close to that of strong folding models (e.g., ESMFold).
- Structure-conditioned generation (inverse folding): DPLM and DPLM-2 can produce sequences that can confidently fold into the given backbone structure.
- Motif scaffolding: DPLM can generate reasonable scaffold sequences given specific functional motifs. DPLM-2 achieves more successful motif scaffolding through multimodal motif conditioning.
- Representation learning: DPLM is a superior protein sequence representation learner, while DPLM-2 offers structure-aware protein representations, demonstrating impressive performance across a variety of protein predictive tasks.
- Controllable generation: DPLM enjoys plug-and-play programmability, generating samples satisfying provided secondary structure annotations.
## TODOs
- [ ] Controllable/guided generation with discrete diffusion classifier guidance.
- [ ] Representation learning of DPLM-2
## DPLM
"Diffusion Language Models Are Versatile Protein Learners." Wang et al., In ICML 2024

## DPLM-2
"DPLM-2: A Multimodal Diffusion Protein Language Model." Wang et al., In ICLR 2025

## Updates 📢
- [2025-07] We updated the default sampling strategy of DPLM-2 to `annealing@2.0:0.1`.
- [2025-04] Our latest work DPLM-2.1, which focuses on analysis and better protein structure modeling of multimodal protein language models, was accepted to ICML'25 as a Spotlight! Check out "Elucidating the Design Space of Multimodal Protein Language Models". We have released the implementation of finer-grained and better structure modeling (DPLM-2 Bit); the full implementation will be released soon.
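For readers curious about the `annealing@2.0:0.1` string: the README does not spell out its format, but it plausibly reads as `<strategy>@<start>:<end>` (e.g., a temperature annealed from 2.0 down to 0.1 during sampling). A minimal, illustrative parser under that assumption, not the project's actual implementation:

```python
# Illustrative sketch only: we ASSUME the sampling-strategy string has the
# form "<name>@<start>:<end>", e.g. "annealing@2.0:0.1" meaning a schedule
# annealed from 2.0 down to 0.1. This is not confirmed by the repository.
def parse_sampling_strategy(spec: str):
    if "@" not in spec:
        return spec, None, None  # plain strategy name with no schedule
    name, _, schedule = spec.partition("@")
    start, _, end = schedule.partition(":")
    return name, float(start), float(end)

print(parse_sampling_strategy("annealing@2.0:0.1"))  # ('annealing', 2.0, 0.1)
```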
- [2024-10] Check out our new work DPLM-2, a multimodal protein foundation model that extends DPLM to simultaneously model, understand, and generate both sequences and structures!
- [2024-03] We release DPLM, a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences!
## Table of Contents 📚
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Load Pretrained Models](#load-pretrained-models)
  - [Generation Examples](#generation-examples)
- [Model Checkpoints](#model-checkpoints)
- [Advanced Usage](#advanced-usage)
  - [Training](#training)
## Quick Start
### Installation
```bash
# clone project
git clone --recursive https://url/to/this/repo/dplm.git
cd dplm

# create conda virtual environment
env_name=dplm
conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}

# automatically install everything else
bash scripts/install.sh
```
### Load Pretrained Models
Users can load DPLM/DPLM-2 checkpoints as follows:
```python
from byprot.models.dplm import DiffusionProteinLanguageModel as DPLM
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
from byprot.models.dplm2 import DPLM2Bit

dplm = DPLM.from_pretrained("airkingbd/dplm_650m").cuda()
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda()
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
```
### Generation Examples
#### Protein sequence generation
```python
from generate_dplm import initialize_generation

input_tokens = initialize_generation(
    length=200,
    num_seqs=5,
    tokenizer=dplm.tokenizer,
    device=next(dplm.parameters()).device,
)
samples = dplm.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
print([''.join(seq.split(' ')) for seq in dplm.tokenizer.batch_decode(samples, skip_special_tokens=True)])
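The `print` line above joins the space-separated residues returned by `batch_decode` into plain amino-acid strings. A self-contained illustration of just that post-processing step, using made-up stand-in strings rather than real model output:

```python
# Toy illustration of the post-processing step: batch_decode returns
# residues separated by spaces, so we join them into plain amino-acid
# strings. The "decoded" values below are made up for demonstration.
decoded = ["M K V L A", "G G S T P"]  # stand-ins for batch_decode output
sequences = [''.join(seq.split(' ')) for seq in decoded]
print(sequences)  # ['MKVLA', 'GGSTP']
```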
#### Protein sequence-structure co-generation
Users can check the generated sequences and structures in the `./generation-results` folder.
```python
from generate_dplm2 import initialize_generation, save_results

input_tokens = initialize_generation(
    task="co_generation",
    length=200,
    num_seqs=5,
    tokenizer=dplm2.tokenizer,
    device=next(dplm2.parameters()).device,
)[0]

# Co-generation with DPLM-2
samples = dplm2.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_generation",
    tokenizer=dplm2.tokenizer,
    struct_tokenizer=dplm2.struct_tokenizer,
    save_pdb=True,
)

# Co-generation with DPLM-2 Bit
samples = dplm2_bit.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_bit_generation",
    tokenizer=dplm2_bit.tokenizer,
    struct_tokenizer=dplm2_bit.struct_tokenizer,
)
```
## Model Checkpoints
Access pretrained models in varying sizes:
| Model name     | Model size      |
| -------------- | --------------- |
| dplm-150m      | 150M parameters |
| dplm-650m      | 650M parameters |
| dplm-3b        | 3B parameters   |
| dplm2-150m     | 150M parameters |
| dplm2-650m     | 650M parameters |
| dplm2-3b       | 3B parameters   |
| dplm2-bit-650m | 650M parameters |
## Advanced Usage
### Training
#### DPLM <!-- omit in toc -->
#### Dataset <!-- omit in toc -->
We pretrain DPLM on the UniRef50 dataset, which contains about 42 million protein sequences. We use the preprocessed UniRef50 dataset provided by EvoDiff (Alamdari et al., 2023), which can be downloaded from this link. After downloading, please place the dataset in the `./data-bin/uniref50` folder.
We also provide the preprocessed dataset in HuggingFace datasets format, which we recommend using. Users can download the HF dataset locally in advance for faster loading:
```bash
bash scripts/download_uniref50_hf.sh
```
#### Example of training <!-- omit in toc -->

We train DPLM with approximately 1 million tokens per batch for 100,000 training steps.
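As a rough sanity check on the "1 million tokens per batch" figure: in token-budget training, the effective batch size is the product of the per-GPU token budget, the number of GPUs, and the gradient-accumulation steps. The numbers below are illustrative assumptions, not the actual DPLM training configuration:

```python
# Illustrative arithmetic only: these values are ASSUMPTIONS, not the
# actual DPLM configuration. Effective tokens per batch equals
# max_tokens_per_gpu * num_gpus * gradient_accumulation_steps.
max_tokens_per_gpu = 8192  # token budget per GPU per step (assumed)
num_gpus = 16              # assumed
grad_accum = 8             # assumed
effective_tokens = max_tokens_per_gpu * num_gpus * grad_accum
print(effective_tokens)  # 1048576, i.e. roughly 1M tokens per batch
```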