<div align="center">

# FlowDock

A geometric flow matching model for generative protein-ligand docking and affinity prediction. (ISMB 2025)

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a>

<img src="./img/FlowDock.png" width="600">

</div>

## Description

This is the official codebase of the paper

**FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction**

[arXiv] [ISMB] [Neurosnap] [Tamarind Bio]
## Contents

- FlowDock
  - Description
  - Contents
  - Installation
  - How to prepare data for FlowDock
  - How to train FlowDock
  - How to evaluate FlowDock
  - How to create comparative plots of benchmarking results
  - How to predict new protein-ligand complex structures and their affinities using FlowDock
  - For developers
  - Docker
  - Acknowledgements
  - License
  - Citing this work

## Installation
<details>

### Install Mamba

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Miniforge3-$(uname)-$(uname -m).sh  # (optionally) remove the installer after using it
source ~/.bashrc  # alternatively, restart your shell session to achieve the same result
```
### Install dependencies

```bash
# clone project
git clone https://github.com/BioinfoMachineLearning/FlowDock
cd FlowDock

# create conda environment
mamba env create -f environments/flowdock_environment.yaml
conda activate FlowDock  # NOTE: one still needs to use `conda` to (de)activate environments
pip3 install -e .  # install the local project as a package
pip3 install prody==2.4.1 --no-dependencies  # install ProDy without pulling in its NumPy dependency
```
### Download checkpoints

```bash
# pretrained NeuralPLexer weights
cd checkpoints/
wget https://zenodo.org/records/10373581/files/neuralplexermodels_downstream_datasets_predictions.zip
unzip neuralplexermodels_downstream_datasets_predictions.zip
rm neuralplexermodels_downstream_datasets_predictions.zip
cd ../

# pretrained FlowDock weights
wget https://zenodo.org/records/15066450/files/flowdock_checkpoints.tar.gz
tar -xzf flowdock_checkpoints.tar.gz
rm flowdock_checkpoints.tar.gz
```
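Interrupted downloads can leave behind corrupt archives that fail halfway through extraction. A minimal sketch, using only the Python standard library, for validating a `.tar.gz` before unpacking (the demo builds a throwaway archive; in practice you would point `verify_tar_gz()` at the downloaded checkpoint archive):

```python
import os
import tarfile
import tempfile


def verify_tar_gz(path: str) -> list:
    """List an archive's members, raising tarfile.ReadError if it is corrupt."""
    with tarfile.open(path, mode="r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Self-contained demo on a throwaway archive; in practice, run
    # verify_tar_gz("flowdock_checkpoints.tar.gz") before `tar -xzf`.
    with tempfile.TemporaryDirectory() as tmp:
        member = os.path.join(tmp, "weights.ckpt")
        with open(member, "w") as handle:
            handle.write("dummy")
        archive_path = os.path.join(tmp, "demo.tar.gz")
        with tarfile.open(archive_path, "w:gz") as archive:
            archive.add(member, arcname="weights.ckpt")
        print(verify_tar_gz(archive_path))  # prints ['weights.ckpt']
```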
### Download preprocessed datasets

```bash
# cached input data for training/validation/testing
wget "https://mailmissouri-my.sharepoint.com/:u:/g/personal/acmwhb_umsystem_edu/ER1hctIBhDVFjM7YepOI6WcBXNBm4_e6EBjFEHAM1A3y5g?download=1" -O flowdock_data_cache.tar.gz
tar -xzf flowdock_data_cache.tar.gz
rm flowdock_data_cache.tar.gz

# cached data for PDBBind, Binding MOAD, DockGen, and the PDB-based van der Mers (vdM) dataset
wget https://zenodo.org/records/15066450/files/flowdock_pdbbind_data.tar.gz
tar -xzf flowdock_pdbbind_data.tar.gz
rm flowdock_pdbbind_data.tar.gz

wget https://zenodo.org/records/15066450/files/flowdock_moad_data.tar.gz
tar -xzf flowdock_moad_data.tar.gz
rm flowdock_moad_data.tar.gz

wget https://zenodo.org/records/15066450/files/flowdock_dockgen_data.tar.gz
tar -xzf flowdock_dockgen_data.tar.gz
rm flowdock_dockgen_data.tar.gz

wget https://zenodo.org/records/15066450/files/flowdock_pdbsidechain_data.tar.gz
tar -xzf flowdock_pdbsidechain_data.tar.gz
rm flowdock_pdbsidechain_data.tar.gz
```

</details>
## How to prepare data for FlowDock

<details>

NOTE: The following steps (besides downloading PDBBind and Binding MOAD's PDB files) are only necessary if one wants to fully process each of the following datasets manually. Otherwise, preprocessed versions of each dataset can be found on Zenodo.
### Download data

```bash
# fetch preprocessed PDBBind and Binding MOAD (as well as the optional DockGen and vdM datasets)
cd data/
wget "https://mailmissouri-my.sharepoint.com/:u:/g/personal/acmwhb_umsystem_edu/EXesf4oh6ztOusGqFcDyqP0Bvk-LdJ1DagEl8GNK-HxDtg?download=1"
wget https://zenodo.org/records/10656052/files/BindingMOAD_2020_processed.tar
wget https://zenodo.org/records/10656052/files/DockGen.tar
wget https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz
mv "EXesf4oh6ztOusGqFcDyqP0Bvk-LdJ1DagEl8GNK-HxDtg?download=1" PDBBind.tar.gz
tar -xzf PDBBind.tar.gz
tar -xf BindingMOAD_2020_processed.tar
tar -xf DockGen.tar
tar -xzf pdb_2021aug02.tar.gz
rm PDBBind.tar.gz BindingMOAD_2020_processed.tar DockGen.tar pdb_2021aug02.tar.gz
mkdir pdbbind/ moad/ pdbsidechain/
mv PDBBind_processed/ pdbbind/
mv BindingMOAD_2020_processed/ moad/
mv pdb_2021aug02/ pdbsidechain/
cd ../
```
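After extraction, the `data/` tree should contain the dataset roots created by the `mv` commands above. A small sketch (directory names taken from those commands; adjust the list if your layout differs) that reports any that are missing:

```python
from pathlib import Path

# Dataset roots expected under data/ after the extraction and `mv` commands above
EXPECTED_DATASET_DIRS = [
    "pdbbind/PDBBind_processed",
    "moad/BindingMOAD_2020_processed",
    "pdbsidechain/pdb_2021aug02",
    "DockGen",
]


def missing_dataset_dirs(data_root: str) -> list:
    """Return the expected dataset subdirectories absent under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DATASET_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    print(missing_dataset_dirs("data"))
```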
Lastly, to finetune FlowDock using the PLINDER dataset, one must first prepare this data for training:

```bash
# fetch PLINDER data (NOTE: requires ~1 hour to download and ~750 GB of storage)
export PLINDER_MOUNT="$(pwd)/data/PLINDER"
mkdir -p "$PLINDER_MOUNT"  # create the directory if it doesn't exist
plinder_download -y
```
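Given the ~750 GB footprint noted above, it can be worth checking free space on the target filesystem before kicking off the download. A minimal stdlib sketch (the 750 GB figure comes from the note above, not from PLINDER itself):

```python
import shutil


def has_free_space(path: str, required_gib: float = 750) -> bool:
    """Check whether the filesystem holding `path` has at least required_gib GiB free."""
    return shutil.disk_usage(path).free >= required_gib * 2**30


if __name__ == "__main__":
    # e.g. check the PLINDER mount point before running plinder_download
    print(has_free_space(".", 750))
```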
### Generating ESM2 embeddings for each protein (optional, cached input data available on SharePoint)

To generate the ESM2 embeddings for the protein inputs, first create all the corresponding FASTA files for each protein sequence:

```bash
python flowdock/data/components/esm_embedding_preparation.py --dataset pdbbind --data_dir data/pdbbind/PDBBind_processed/ --out_file data/pdbbind/pdbbind_sequences.fasta
python flowdock/data/components/esm_embedding_preparation.py --dataset moad --data_dir data/moad/BindingMOAD_2020_processed/pdb_protein/ --out_file data/moad/moad_sequences.fasta
python flowdock/data/components/esm_embedding_preparation.py --dataset dockgen --data_dir data/DockGen/processed_files/ --out_file data/DockGen/dockgen_sequences.fasta
python flowdock/data/components/esm_embedding_preparation.py --dataset pdbsidechain --data_dir data/pdbsidechain/pdb_2021aug02/pdb/ --out_file data/pdbsidechain/pdbsidechain_sequences.fasta
```
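One thing worth checking in the combined FASTA files: ESM-style extractors typically write one output file per record label, so duplicate headers can overwrite one another. A small stdlib sketch for spotting duplicates (plain `>`-header FASTA parsing; not tied to the prep scripts' internals):

```python
from collections import Counter


def duplicate_fasta_headers(path: str) -> list:
    """Return FASTA headers that appear more than once in the file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                counts[line[1:].strip()] += 1
    return [header for header, n in counts.items() if n > 1]
```

For example, `duplicate_fasta_headers("data/pdbbind/pdbbind_sequences.fasta")` should return an empty list if every chain received a unique label.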
Then, generate all ESM2 embeddings in batch using the ESM repository's helper script:

```bash
python flowdock/data/components/esm_embedding_extraction.py esm2_t33_650M_UR50D data/pdbbind/pdbbind_sequences.fasta data/pdbbind/embeddings_output --repr_layers 33 --include per_tok --truncation_seq_length 4096 --cuda_device_index 0
python flowdock/data/components/esm_embedding_extraction.py esm2_t33_650M_UR50D data/moad/moad_sequences.fasta data/moad/embeddings_output --repr_layers 33 --include per_tok --truncation_seq_length 4096 --cuda_device_index 0
python flowdock/data/components/esm_embedding_extraction.py esm2_t33_650M_UR50D data/DockGen/dockgen_sequences.fasta data/DockGen/embeddings_output --repr_layers 33 --include per_tok --truncation_seq_length 4096 --cuda_device_index 0
python flowdock/data/components/esm_embedding_extraction.py esm2_t33_650M_UR50D data/pdbsidechain/pdbsidechain_sequences.fasta data/pdbsidechain/embeddings_output --repr_layers 33 --include per_tok --truncation_seq_length 4096 --cuda_device_index 0
```
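Because extraction runs with `--truncation_seq_length 4096`, embeddings for chains longer than 4096 residues will be truncated. A hedged stdlib sketch for flagging such sequences ahead of time (assumes plain `>`-header FASTA; independent of the extraction script):

```python
def over_length_headers(path: str, max_len: int = 4096) -> list:
    """Return headers of FASTA records whose sequences exceed max_len residues."""
    too_long = []
    header, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None and length > max_len:
                    too_long.append(header)
                header, length = line[1:], 0
            else:
                length += len(line)
    if header is not None and length > max_len:
        too_long.append(header)
    return too_long
```

Running it over, say, `data/moad/moad_sequences.fasta` lists the chains whose per-token embeddings will not cover the full sequence.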
### Predicting apo protein structures using ESMFold (optional, cached data available on Zenodo)

To generate the apo version of each protein structure, first create ESMFold-ready versions of the combined FASTA files prepared above by the script `esm_embedding_preparation.py` for the PDBBind, Binding MOAD, DockGen, and PDBSidechain datasets, respectively:

```bash
python flowdock/data/components/esmfold_sequence_preparation.py dataset=pdbbind
python flowdock/data/components/esmfold_sequence_preparation.py dataset=moad
python flowdock/data/components/esmfold_sequence_preparation.py dataset=dockgen
python flowdock/data/components/esmfold_sequence_preparation.py dataset=pdbsidechain
```
Then, predict each apo protein structure using ESMFold's batch inference script:

```bash
# NOTE: having a CUDA-enabled device available when running this script is highly recommended
python flowdock/data/components/esmfold_batch_structure_prediction.py -i data/pdbbind/pdbbind_esmfold_sequences.fasta -o data/pdbbind/pdbbind_esmfold_structures --cuda-device-index 0 --skip-existing
```
python flowdock/dat
