Proteina: Scaling Flow-based Protein Structure Generative Models (ICLR 2025 Oral Paper)
<div align="center"> <a href="https://tomasgeffner.github.io/" target="_blank">Tomas Geffner<sup>*</sup></a>   <b>·</b>   <a href="https://kdidi.netlify.app/" target="_blank">Kieran Didi<sup>*</sup></a>   <b>·</b>   <a href="https://oxer11.github.io/" target="_blank">Zuobai Zhang<sup>*</sup></a>   <b>·</b>   <a href="https://scholar.google.com/citations?user=KBn52kYAAAAJ&hl=en" target="_blank">Danny Reidenbach</a> <br> <a href="https://scholar.google.com/citations?hl=en&user=wGjVFHIAAAAJ&view_op=list_works&sortby=pubdate" target="_blank">Zhonglin Cao</a>   <b>·</b>   <a href="https://people.csail.mit.edu/jyim/" target="_blank">Jason Yim</a>   <b>·</b>   <a href="https://mariogeiger.ch/" target="_blank">Mario Geiger</a>   <b>·</b>   <a href="https://christian.dallago.us/" target="_blank">Christian Dallago</a> <br> <a href="https://scholar.google.ch/citations?hl=en&user=LUXL9FoAAAAJ" target="_blank">Emine Kucukbenli</a>   <b>·</b>   <a href="http://latentspace.cc/" target="_blank">Arash Vahdat</a>   <b>·</b>   <a href="https://karstenkreis.github.io/" target="_blank">Karsten Kreis<sup>*</sup></a> <br> <br> <a href="https://openreview.net/forum?id=TVQLu34bdw" target="_blank">Link to paper</a>   <b>·</b>   <a href="https://research.nvidia.com/labs/genair/proteina/" target="_blank">Project Page</a> <br> <br> <span><sup>*</sup>core contributor</span> </div> <br> <br> <div align="center"> <img width="600" alt="teaser" src="assets/overview.png"/> </div> <br> <br>

**Abstract.** Diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to $5\times$ as many parameters as previous models.
To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteina achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.
Setup
For environment setup, mamba or micromamba is recommended; alternatively, conda can be used as a drop-in replacement (substitute `mamba` with `conda`).

```shell
mamba env create -f environment.yaml
conda activate proteina_env
pip install -e .
```
Create a file `.env` in the root directory of the repo with the single line

```shell
DATA_PATH=/directory/where/you/store/files
```

which will be loaded automatically when running our code.
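To make the `KEY=VALUE` convention concrete, here is a minimal illustrative parser for such a file; this is a hypothetical stand-in for demonstration only, not the loader the repo actually uses:

```python
import os

def load_dotenv(path=".env"):
    """Minimal illustrative .env parser: one KEY=VALUE pair per line,
    blank lines and '#' comments are skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

if os.path.exists(".env"):
    load_dotenv()  # afterwards, os.environ["DATA_PATH"] holds the configured directory
```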
Training and sampling models require the files in `proteina_additional_files.zip`, which include:

- `D_FS_afdb_cath_codes.pth`: CATH code distribution based on protein length, used to sample the model conditioned on fold classes for a given length.
- `fold_class_mappings_C_selected_A_T_cath_codes.pth`: (Artificial) uniform CATH code distribution, effectively independent of protein length, used to sample long proteins for specific CATH codes. This can be considered an experimental feature, since not all fold types naturally occur at all possible protein lengths (see "Generation for specific CATH codes" section below).
- `D_FS_eval_ca_features.pth`: Features extracted from the Gearnet classifier for all samples in the $\mathcal{D}_\textrm{FS}$ dataset, used to compute our proposed metrics.
- `pdb_eval_ca_features.pth`: Features extracted from the Gearnet classifier for samples in the PDB dataset, used to compute our proposed metrics.
- `gearnet_ca.pth`: Weights for the Gearnet classifier for proteins represented by their alpha carbons.
- `cath_label_mapping.pt`: Mapping from each CATH code to an integer index used by our networks.
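To illustrate what a length-conditioned fold-class distribution such as the one stored in `D_FS_afdb_cath_codes.pth` represents, here is a stdlib sketch; the data structure and CATH codes below are invented for illustration and do not reflect the actual file format:

```python
import random

# Hypothetical stand-in for a length-conditioned CATH code distribution:
# for each protein length, a mapping from CATH code to probability.
cath_dist_by_length = {
    100: {"1.10.10": 0.6, "2.40.50": 0.3, "3.30.70": 0.1},
    200: {"1.10.10": 0.2, "2.40.50": 0.5, "3.30.70": 0.3},
}

def sample_cath_code(length, rng=random):
    """Sample a CATH code for the given protein length from the stored distribution."""
    dist = cath_dist_by_length[length]
    codes, probs = zip(*dist.items())
    return rng.choices(codes, weights=probs, k=1)[0]
```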
Additionally, the dataloaders (see section "Dataloaders" below) require the files in `proteina_training_data_indices.zip`, which include:

- `d_fs_index.txt`: File containing the indices of the AlphaFold Database that correspond to our $\mathcal{D}_\textrm{FS}$ dataset.
- `d_21M_index.txt`: File containing the indices of the AlphaFold Database that correspond to our $\mathcal{D}_{\textrm{21M}}$ dataset.
- `seq_d_21M.fasta`: File containing all sequences of our $\mathcal{D}_{\textrm{21M}}$ dataset.
- `cluster_seqid_0.5_d_21M.fasta`: File containing the cluster representatives of our $\mathcal{D}_{\textrm{21M}}$ dataset.
- `cluster_seqid_0.5_d_21M.tsv`: File containing information about the clustering of our $\mathcal{D}_{\textrm{21M}}$ dataset.
Once the two zip files are uncompressed, the resulting files should be stored as follows (the code relies on the files being in these locations under `DATA_PATH`):
```
$DATA_PATH
- metric_factory
  - features
    - D_FS_eval_ca_features.pth
    - D_FS_afdb_cath_codes.pth
    - pdb_eval_ca_features.pth
    - fold_class_mappings_C_selected_A_T_cath_codes.pth
  - model_weights
    - gearnet_ca.pth
- pdb_raw
  - cath_label_mapping.pt
- d_FS
  - d_FS_index.txt
- d_21M
  - d_21M_index.txt
  - seq_d_21M.fasta
  - cluster_seqid_0.5_d_21M.fasta
  - cluster_seqid_0.5_d_21M.tsv
```
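A quick way to sanity-check that the extracted files ended up where the code expects them is a small path checker; the relative paths below are taken from the layout described above, and the helper itself is illustrative rather than part of the repo:

```python
import os

# Relative paths under DATA_PATH, per the layout described in the README.
REQUIRED_FILES = [
    "metric_factory/features/D_FS_eval_ca_features.pth",
    "metric_factory/features/D_FS_afdb_cath_codes.pth",
    "metric_factory/features/pdb_eval_ca_features.pth",
    "metric_factory/features/fold_class_mappings_C_selected_A_T_cath_codes.pth",
    "metric_factory/model_weights/gearnet_ca.pth",
    "pdb_raw/cath_label_mapping.pt",
    "d_FS/d_FS_index.txt",
    "d_21M/d_21M_index.txt",
    "d_21M/seq_d_21M.fasta",
    "d_21M/cluster_seqid_0.5_d_21M.fasta",
    "d_21M/cluster_seqid_0.5_d_21M.tsv",
]

def missing_files(data_path):
    """Return the required files that are not present under data_path."""
    return [f for f in REQUIRED_FILES if not os.path.isfile(os.path.join(data_path, f))]
```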
Dataloaders
We provide minimal dataloader implementations that allow training on different subsets of the PDB as well as on custom datasets from sources like the AFDB, such as $\mathcal{D}_{\textrm{FS}}$ and $\mathcal{D}_{\textrm{21M}}$ from our paper. Here we describe how to use these minimal dataloaders; however, if you are interested in space- and time-efficient data processing and loading, you can have a look at libraries such as webdataset and FoldComp. This becomes especially important for larger datasets like $\mathcal{D}_{\textrm{21M}}$.
PDB dataloader
To use the PDB dataloader, you can, for example, use the `pdb_train.yaml` file, which we provide as part of our configs directory, in the following way:
```python
import hydra
import lightning as L

L.seed_everything(43)

version_base = hydra.__version__
config_path = "/path/to/datasets_configs"  # adjust to where the configs directory lives
hydra.initialize_config_dir(config_dir=f"{config_path}/pdb", version_base=version_base)
cfg = hydra.compose(
    config_name="pdb_train",
    return_hydra_config=True,
)

pdb_datamodule = hydra.utils.instantiate(cfg.datamodule)
pdb_datamodule.prepare_data()
pdb_datamodule.setup("fit")
pdb_train_dataloader = pdb_datamodule.train_dataloader()
```
With this, the dataloader selects all PDB chains according to the selection criteria specified in the yaml file; downloads, processes, and splits the data; and generates ready-to-use dataloaders and datamodules. For a simple demonstration we subsample the dataset in `pdb_train.yaml` via the `fraction` attribute; for the full dataset, change this value to 1.
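Conceptually, a `fraction` attribute like this keeps only a random subset of the selected chains; the helper below is a hypothetical sketch of that idea, not the repo's actual implementation:

```python
import random

def subsample(chain_ids, fraction, seed=43):
    """Keep a reproducible random fraction of the dataset; fraction=1.0 keeps everything."""
    if fraction >= 1.0:
        return list(chain_ids)
    rng = random.Random(seed)
    k = max(1, int(len(chain_ids) * fraction))
    return rng.sample(list(chain_ids), k)
```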
Custom data dataloader
If you do not want to download data directly from the PDB, but instead want to load a custom dataset of PDB files stored at a certain location, you can also use our dataloaders. For example, you can use the `script_utils/download_afdb_data.sh` script to download data from the AFDB based on the specified IDs (we provide the ID files for our two datasets $\mathcal{D}_{\textrm{FS}}$ and $\mathcal{D}_{\textrm{21M}}$). For example, for $\mathcal{D}_{\textrm{FS}}$ the command could look like
```shell
bash script_utils/download_afdb_data.sh $DATA_PATH/d_FS/d_FS_index.txt $DATA_PATH/d_FS
```
This will download the specified PDB files from the AFDB to the specified folder, where they are stored in a subfolder called `raw`. The dataloaders will process your PDB files into fast-to-load `pkl` files, split the data according to the config file, and allow sampling from them. You can follow similar instructions as described above for the PDB dataloader to create the dataloaders themselves, but use, for example, the config file `d_FS.yaml`, and inside it adjust `data_dir` to point to where you saved the data, along with the dataloading and splitting options you prefer.
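The caching step can be pictured as parsing each raw file once and writing the result to a pickle next to it; the sketch below is illustrative only (the repo's real processing extracts backbone coordinates, which we stand in for here with the raw file contents):

```python
import os
import pickle

def cache_as_pkl(raw_dir, processed_dir):
    """Parse each .pdb file in raw_dir once and cache the result
    as a fast-to-load .pkl file in processed_dir."""
    os.makedirs(processed_dir, exist_ok=True)
    for name in sorted(os.listdir(raw_dir)):
        if not name.endswith(".pdb"):
            continue
        with open(os.path.join(raw_dir, name)) as f:
            # Stand-in for real parsing, which would extract backbone coordinates.
            record = {"name": name, "contents": f.read()}
        out_path = os.path.join(processed_dir, name.replace(".pdb", ".pkl"))
        with open(out_path, "wb") as f:
            pickle.dump(record, f)
```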
Model training
Unconditional training
To run model training, run `python proteinfoundation/train.py`. This will start training according to the configuration specified in `configs/experiment_config/training_ca.yaml`. You can set things like the dataset, the number of GPUs, the logger, and other options there.
If you want to quickly test things with a local single-GPU run, you can run `python proteinfoundation/train.py --single --nolog --show_prog_bar`. This will ignore the GPU options in the config and use a single GPU, disable loggers like WandB, and show a progress bar in the terminal.
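These command-line switches behave like standard boolean flags layered on top of the config; a minimal argparse sketch of that pattern (hypothetical, not the actual `train.py`) could look like:

```python
import argparse

def parse_overrides(argv=None):
    """Boolean flags that override the training config at launch time."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--single", action="store_true",
                        help="Ignore GPU options in the config and use a single GPU")
    parser.add_argument("--nolog", action="store_true",
                        help="Disable loggers such as WandB")
    parser.add_argument("--show_prog_bar", action="store_true",
                        help="Show a progress bar in the terminal")
    return parser.parse_args(argv)
```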
