EvoDiff
Generation of protein sequences and evolutionary alignments via discrete diffusion models
Description
In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.
We evaluate our sequence and MSA models – EvoDiff-Seq and EvoDiff-MSA, respectively – across a range of generation tasks to demonstrate their power for controllable protein design. Below, we provide documentation for running our models.
EvoDiff is described in this preprint; if you use the code from this repository or the results, please cite the preprint.
<p align="center"> <img src="img/combined-0.gif" /> </p>

Table of contents
- EvoDiff
- Table of contents
- Installation
- Available models
- Unconditional generation
- Conditional sequence generation
- Analysis
- Downloading generated sequences
- Docker
Installation
To use our code, we recommend creating a clean conda environment with Python v3.9 and then installing PyTorch (we have tested up to v2.7.0):
conda create --name evodiff python=3.9
pip3 install torch
In that new environment, install EvoDiff (torch-scatter may take a while):
pip install evodiff
For the bleeding-edge version, use:
pip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch
Examples
We provide a notebook with installation guidance in examples/evodiff.ipynb. It also includes examples of how to generate small numbers of sequences and MSAs with our models. We recommend following this notebook if you would like to use our models to generate proteins.
EvoDiff can be deployed on Azure AI Foundry. We provide a notebook with instructions here: examples/evodiff_Azure_AI_Foundry.ipynb.
Thanks to Colby Ford, EvoDiff is also available as a Space on Hugging Face.
Datasets
We obtain sequences from the Uniref50 dataset, which contains approximately 42 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the OpenFold dataset, which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. The intrinsically disordered regions (IDR) data was obtained from the Reverse Homology GitHub.
For the scaffolding structural motifs task, we use the baselines compiled in RFDiffusion. We provide pdb and fasta files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide pdb files used for conditionally generating MSAs in the examples/scaffolding-msas folder.
To access the UniRef50 test sequences, use the following code:
from evodiff.data import UniRefDataset

test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
The filenames for the train and validation OpenFold splits are saved in data/valid_msas.csv and data/train_msas.csv.
Loading pretrained models
To load a model:
from evodiff.pretrained import OA_DM_38M
model, collater, tokenizer, scheme = OA_DM_38M()
Available evodiff models are:
- D3PM_BLOSUM_640M()
- D3PM_BLOSUM_38M()
- D3PM_UNIFORM_640M()
- D3PM_UNIFORM_38M()
- OA_DM_640M()
- OA_DM_38M()
- MSA_D3PM_BLOSUM_RANDSUB()
- MSA_D3PM_BLOSUM_MAXSUB()
- MSA_D3PM_UNIFORM_RANDSUB()
- MSA_D3PM_UNIFORM_MAXSUB()
- MSA_OA_DM_RANDSUB()
- MSA_OA_DM_MAXSUB()
It is also possible to load our LRAR baseline models:
- LR_AR_640M()
- LR_AR_38M()
Note: if you want to download a BLOSUM model, you will first need to download data/blosum62-special-MSA.mat.
Available models
We investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective. In order-agnostic autoregressive diffusion (OADM), one amino acid is converted to a special mask token at each step in the forward process. After $T=L$ steps, where $L$ is the length of the sequence, the entire sequence is masked. We additionally designed discrete denoising diffusion probabilistic models (D3PM) for protein sequences. In EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after $T$ steps the sequence is indistinguishable from a uniform sample over the amino acids. In the reverse process for both, a neural network model is trained to undo the previous corruption. The trained model can then generate new sequences, starting from sequences of masked tokens (EvoDiff-OADM) or of uniformly sampled amino acids (EvoDiff-D3PM). We trained all EvoDiff sequence models on 42M sequences from UniRef50 using the dilated convolutional neural network architecture introduced in the CARP protein masked language model. We trained 38M-parameter and 640M-parameter versions for each forward corruption scheme and for left-to-right autoregressive (LRAR) decoding.
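The OADM forward process described above (mask one randomly chosen residue per step until the whole sequence is masked after $T = L$ steps) can be sketched in a few lines. This is a minimal illustration, not EvoDiff's implementation; the mask symbol `#` and the function name are placeholders:

```python
import random

MASK = "#"  # stand-in for the special mask token

def oadm_forward(sequence, seed=0):
    """Order-agnostic forward process: mask one residue per step,
    visiting positions in a random order, until the entire
    sequence is masked (T = L steps)."""
    rng = random.Random(seed)
    order = list(range(len(sequence)))
    rng.shuffle(order)  # a random masking order per trajectory
    states = [sequence]
    current = list(sequence)
    for i in order:
        current[i] = MASK
        states.append("".join(current))
    return states

states = oadm_forward("MKTAY")  # 5 residues -> 5 masking steps
```

The reverse (generative) process runs this trajectory backwards: starting from the fully masked state, the trained network fills in one position per step.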
To explicitly leverage evolutionary information, we designed and trained EvoDiff MSA models using the MSA Transformer architecture on the OpenFold dataset. To do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max"). Within each subsampling strategy, we then trained EvoDiff MSA models with the OADM and D3PM corruption schemes.
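The greedy "Max" diversity strategy can be illustrated with a toy sketch: keep the query row, then repeatedly add the sequence whose minimum Hamming distance to the already-selected set is largest. This assumes aligned, equal-length MSA rows; the actual subsampling code in the evodiff repository may differ in details:

```python
def hamming(a, b):
    """Hamming distance between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def max_diversity_subsample(msa, depth):
    """Greedily select `depth` rows from an aligned MSA, always keeping
    the query (first row), then adding the remaining sequence that is
    farthest (by minimum Hamming distance) from those already chosen."""
    selected = [msa[0]]
    remaining = list(msa[1:])
    while len(selected) < depth and remaining:
        best = max(remaining,
                   key=lambda s: min(hamming(s, t) for t in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The "Random" strategy would instead keep the query and sample the remaining `depth - 1` rows uniformly at random.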
Unconditional sequence generation
Unconditional generation with EvoDiff-Seq
EvoDiff can generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids. All available models can be used to unconditionally generate new sequences, without needing to download the training datasets.
To unconditionally generate 100 sequences from EvoDiff-Seq, run the following script:
python evodiff/generate.py --model-type oa_dm_38M --num-seqs 100
The default model type is oa_dm_640M; the other available evodiff models are:
- oa_dm_38M
- d3pm_blosum_38M
- d3pm_blosum_640M
- d3pm_uniform_38M
- d3pm_uniform_640M
Our LRAR baseline models are also available:
- lr_ar_38M
- lr_ar_640M
An example of unconditionally generating a sequence of a specified length can be found in this notebook.
To evaluate the generated sequences, we implement our self-consistency OmegaFold ESM-IF pipeline, as shown in analysis/self_consistency_analysis.py. To use this evaluation script, you must have the dependencies listed under the Installation section installed.
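Conceptually, the self-consistency pipeline folds a generated sequence (OmegaFold), redesigns a sequence for the predicted structure (ESM-IF), and compares the redesigned sequence to the original. The snippet below sketches only the final comparison step; `fold` and `inverse_fold` in the comment are hypothetical wrapper names, not real APIs:

```python
def sequence_identity(original, redesigned):
    """Fraction of positions where the inverse-folded (redesigned)
    sequence matches the originally generated one."""
    if len(original) != len(redesigned):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(original, redesigned))
    return matches / len(original)

# Hypothetical usage, assuming fold() wraps OmegaFold and
# inverse_fold() wraps ESM-IF (illustrative names only):
#   structure = fold(generated_seq)
#   redesigned = inverse_fold(structure)
#   score = sequence_identity(generated_seq, redesigned)
```

Higher identity indicates that the predicted structure encodes the generated sequence well, i.e., the sequence is self-consistent.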
Unconditional generation with EvoDiff-MSA
It is possible to unconditionally generate an entire MSA, using the following script:
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --seq-length 256 --subsampling MaxHamming
The default model type is msa_oa_dm_maxsub, which is EvoDiff-MSA-OADM trained on Max subsampled sequences. The other available evodiff models are:
- EvoDiff-MSA OADM trained on random subsampled sequences: msa_oa_dm_randsub
- EvoDiff-MSA D3PM-BLOSUM trained on Max subsampled sequences: msa_d3pm_blosum_maxsub
- EvoDiff-MSA D3PM-BLOSUM trained on random subsampled sequences: msa_d3pm_blosum_randsub
- EvoDiff-MSA D3PM-Uniform trained on Max subsampled sequences: msa_d3pm_uniform_maxsub
- EvoDiff-MSA D3PM-Uniform trained on random subsampled sequences: msa_d3pm_uniform_randsub