
The Family of Diffusion Protein Language Models (DPLM)

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a> <a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a><br>

Overview 🌟

This repository contains the official implementation of training and inference, as well as the pre-trained weights, for the Family of Diffusion Protein Language Models (DPLM), including DPLM and DPLM-2 (introduced below).

Key Features 🔑

Specifically, the DPLM family exhibits impressive performance in protein (structure and sequence) co-generation, any-to-any conditional generation (e.g., folding, inverse folding, and motif scaffolding), and representation learning. We develop DPLM based on ByProt. This repository contains pretraining scripts for DPLM and running scripts for various protein generation and understanding tasks, as detailed below:

  • Unconditional protein generation: DPLM is capable of unconditionally generating protein sequences with reasonable predicted structures. DPLM-2 can generate diverse and highly plausible proteins through simultaneous structure-sequence co-generation.
  • Sequence-conditioned generation (forward folding): DPLM-2 can generate a reasonable protein structure given an input protein sequence, achieving performance close to strong folding models (e.g., ESMFold).
  • Structure-conditioned generation (inverse folding): DPLM and DPLM-2 can produce sequences that confidently fold into a given backbone structure.
  • Motif scaffolding: DPLM can generate reasonable scaffold sequences given specific functional motifs. DPLM-2 achieves more successful motif scaffolding through multimodal motif conditioning.
  • Representation learning: DPLM is a superior protein sequence representation learner, while DPLM-2 offers structure-aware protein representations, demonstrating impressive performance across a variety of protein predictive tasks.
  • Controllable generation: DPLM enjoys plug-and-play programmability, generating samples that satisfy provided secondary structure annotations.

TODOs

  • [ ] Controllable/guided generation with discrete diffusion classifier guidance.
  • [ ] Representation learning of DPLM-2

DPLM

"Diffusion Language Models Are Versatile Protein Learners." Wang et al., In ICML 2024

DPLM

DPLM-2

"DPLM-2: A Multimodal Diffusion Protein Language Model." Wang et al., In ICLR 2025

DPLM-2

Updates 📢

  • [2025-07] We updated the default sampling strategy of DPLM-2 to annealing@2.0:0.1.
  • [2025-04] Our latest work, DPLM-2.1, which focuses on analysis and better protein structure modeling for multimodal protein language models, was accepted as an ICML'25 Spotlight! Check out Elucidating the Design Space of Multimodal Protein Language Models. We have released the implementation of finer-grained and better structure modeling (DPLM-2 Bit); the full implementation will be released soon.
  • [2024-10] Check out our new work DPLM-2, a multimodal protein foundation model that extends DPLM to simultaneously model, understand, and generate both sequences and structures!
  • [2024-03] We released DPLM, a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences!

Table of Contents 📚

Quick Start

Installation

# clone project
git clone --recursive https://url/to/this/repo/dplm.git
cd dplm

# create conda virtual environment
env_name=dplm

conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}

# automatically install everything else
bash scripts/install.sh

Load Pretrained Models

Users can load DPLM/DPLM-2 checkpoints as follows:

from byprot.models.dplm import DiffusionProteinLanguageModel as DPLM
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
from byprot.models.dplm2 import DPLM2Bit

dplm = DPLM.from_pretrained("airkingbd/dplm_650m").cuda()
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda()
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()

Generation Examples

Protein sequence generation

from generate_dplm import initialize_generation

input_tokens = initialize_generation(
  length=200,
  num_seqs=5,
  tokenizer=dplm.tokenizer,
  device=next(dplm.parameters()).device
)
samples = dplm.generate(
  input_tokens=input_tokens,
  max_iter=500,
)
print([''.join(seq.split(' ')) for seq in dplm.tokenizer.batch_decode(samples, skip_special_tokens=True)])

Protein sequence-structure co-generation

Users can check the generated sequences and structures in the ./generation-results folder.

from generate_dplm2 import initialize_generation, save_results

input_tokens = initialize_generation(
  task="co_generation",
  length=200,
  num_seqs=5,
  tokenizer=dplm2.tokenizer,
  device=next(dplm2.parameters()).device
)[0]

samples = dplm2.generate(
  input_tokens=input_tokens,
  max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_generation",
    tokenizer=dplm2.tokenizer,
    struct_tokenizer=dplm2.struct_tokenizer,
    save_pdb=True,
)

samples = dplm2_bit.generate(
  input_tokens=input_tokens,
  max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_bit_generation",
    tokenizer=dplm2_bit.tokenizer,
    struct_tokenizer=dplm2_bit.struct_tokenizer
)

Model Checkpoints

Access pretrained models in varying sizes:

| Model name     | Model size      |
| -------------- | --------------- |
| dplm-150m      | 150M parameters |
| dplm-650m      | 650M parameters |
| dplm-3b        | 3B parameters   |
| dplm2-150m     | 150M parameters |
| dplm2-650m     | 650M parameters |
| dplm2-3b       | 3B parameters   |
| dplm2-bit-650m | 650M parameters |
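The Quick Start above loads the three 650M checkpoints from the `airkingbd` Hugging Face namespace. Assuming the other sizes follow the same naming pattern (an assumption — verify the exact ids on the Hugging Face Hub), a small lookup helper might look like:

```python
# Hypothetical mapping from the table's model names to Hugging Face ids.
# Only the three 650M ids appear in the Quick Start above; the rest
# extrapolate that naming pattern and should be verified on the Hub.
CHECKPOINTS = {
    "dplm-150m": "airkingbd/dplm_150m",
    "dplm-650m": "airkingbd/dplm_650m",
    "dplm-3b": "airkingbd/dplm_3b",
    "dplm2-150m": "airkingbd/dplm2_150m",
    "dplm2-650m": "airkingbd/dplm2_650m",
    "dplm2-3b": "airkingbd/dplm2_3b",
    "dplm2-bit-650m": "airkingbd/dplm2_bit_650m",
}

def hf_id(model_name: str) -> str:
    """Return the (assumed) Hugging Face id for a model name from the table."""
    return CHECKPOINTS[model_name]
```

For example, `DPLM.from_pretrained(hf_id("dplm-650m"))` would load the sequence-only 650M model, matching the Quick Start snippet.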

Advanced Usage

Training

<!-- omit in toc -->

DPLM

<!-- omit in toc -->

Dataset

We pretrain DPLM on the UniRef50 dataset, which contains about 42 million protein sequences. We use the preprocessed UniRef50 dataset provided by EvoDiff (Alamdari et al., 2023), which can be downloaded from this link. After downloading, please place the dataset in the ./data-bin/uniref50 folder.

We also provide the preprocessed dataset in Hugging Face datasets format, which we recommend using. Users can download the HF dataset locally in advance for faster loading:

bash scripts/download_uniref50_hf.sh
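Once downloaded, the dataset can be opened with the `datasets` library. A minimal sketch, assuming the script saves to a local directory under `./data-bin` (an assumption — check `scripts/download_uniref50_hf.sh` for the actual path):

```python
from pathlib import Path

# Assumed local directory; the download script defines the real location.
DATA_DIR = Path("./data-bin/uniref50_hf")

def load_uniref50(path: Path = DATA_DIR):
    """Load the preprocessed UniRef50 dataset saved in HF `datasets` format."""
    from datasets import load_from_disk  # pip install datasets
    return load_from_disk(str(path))
```

Loading from a local copy avoids re-streaming the ~42M sequences from the Hub on every run.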
<!-- omit in toc -->

Example of training

We train DPLM with approximately 1 million tokens per batch for 100,000 training steps.
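At that schedule, the total pretraining budget works out to roughly 100 billion tokens:

```python
# Rough pretraining scale implied by the figures above:
# ~1M tokens per batch for 100K steps.
tokens_per_batch = 1_000_000
training_steps = 100_000
total_tokens = tokens_per_batch * training_steps
print(total_tokens)  # 100000000000, i.e. ~100B tokens
```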
