ETFlow
Source code for Equivariant Flow Matching for Molecular Conformer Generation
ET-Flow: Equivariant Flow Matching for Molecule Conformer Generation
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a>
<a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a>
Implementation of Equivariant Flow Matching for Molecule Conformer Generation by M Hassan, N Shenoy, J Lee, H Stark, S Thaler and D Beaini. The paper was accepted at NeurIPS 2024.
ET-Flow is a state-of-the-art generative model for generating small molecule conformations using equivariant transformers and flow matching.
Install ET-Flow
We are now available on PyPI. Easily install the package using the following command:
pip install etflow
Note: If there are issues with pytorch_cluster/pytorch_geometric and pytorch, it might be easier to install pytorch first and then the etflow package via pip.
Generating Conformations for Custom Smiles
Option 1: Load the model config and checkpoint with automatic download and caching. See (tutorial.ipynb) or use the following snippet to load the model and generate conformations for custom smiles input.
from etflow import BaseFlow
model = BaseFlow.from_default(model="drugs-o3")
# predict 3 conformations for one molecule given by its SMILES string
smiles = 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'
output = model.predict([smiles], num_samples=3, as_mol=True)
mol = output[smiles] # rdkit mol object
# if we want just positions as numpy array
output = model.predict([smiles], num_samples=3)
output[smiles] # np.ndarray with shape (num_samples, num_atoms, 3)
# for prediction on more than 1 smiles
smiles_1 = ...
smiles_2 = ...
output = model.predict([smiles_1, smiles_2], num_samples=3, as_mol=True)
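When only the positions are requested, the result is a plain coordinate array, so exporting it for a viewer is left to the caller. The following is a minimal sketch of one way to do that, writing a multi-frame XYZ string. The helper `conformers_to_xyz` and its `symbols` argument are hypothetical and not part of the etflow package; the caller is assumed to supply one element symbol per atom, in the same atom order as the coordinates.

```python
from typing import Iterable, Sequence


def conformers_to_xyz(symbols: Sequence[str], positions: Iterable, title: str = "") -> str:
    """Serialize conformers to a multi-frame XYZ string.

    positions: array-like of shape (num_samples, num_atoms, 3), for example
    the array returned for one SMILES string; nested lists work as well.
    symbols: one element symbol per atom (hypothetical input, assumed to
    follow the same atom order as the coordinates).
    """
    frames = []
    for idx, conformer in enumerate(positions):
        # Convert every coordinate to a plain float so NumPy scalars format cleanly.
        coords = [tuple(float(c) for c in atom) for atom in conformer]
        if len(coords) != len(symbols):
            raise ValueError(
                f"conformer {idx} has {len(coords)} atoms, expected {len(symbols)}"
            )
        # XYZ frame: atom count, comment line, then one line per atom.
        lines = [str(len(symbols)), f"{title} conformer {idx}".strip()]
        for sym, (x, y, z) in zip(symbols, coords):
            lines.append(f"{sym} {x:.6f} {y:.6f} {z:.6f}")
        frames.append("\n".join(lines))
    return "\n".join(frames) + "\n"
```

With the snippet above, this could be called as `conformers_to_xyz(symbols, output[smiles])`, provided `symbols` matches the atoms (including any hydrogens) that the model places.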
We currently support the following configurations and checkpoints:
- drugs-o3
- qm9-o3
- drugs-so3
Option 2: Load the model config, download checkpoints from the following zenodo link and load it manually into the model config. We have a sample notebook (generate_confs.ipynb) to generate conformations for custom smiles input. One needs to pass the config and corresponding checkpoint path in order as additional inputs.
Note: Scaffold Splits and Checkpoints are stored at the following zenodo link.
Setup Dev Environment
Run the following commands to setup the environment:
conda env create -n etflow -f env.yml
conda activate etflow
# to install the etflow package
python3 -m pip install -e .
Preprocessing Data
[!IMPORTANT] I have changed some parts of the data preprocessing scripts to make it more efficient. However, these changes might mean that the configs might not lead to the same results as the one reported in the paper. I am working on reproducing the results with the new preprocessed data format. Thanks for your patience.
To pre-process the data, perform the following steps,
- Download the raw GEOM and unzip the raw data using the following commands,
DATA_DIR=</path_to_data>
wget https://dataverse.harvard.edu/api/access/datafile/4327252 -O $DATA_DIR/rdkit_folder.tar
tar -xvf $DATA_DIR/rdkit_folder.tar -C $DATA_DIR
For the splits and test mols, download the files from the torsional diffusion and extract them to the respective folders inside $DATA_DIR. Ideally it should look like the following (after extracting the zip files),
$DATA_DIR/
├── QM9/
└── DRUGS/
└── XL/
Make sure to set the environment variable DATA_DIR to the path of the data directory with export DATA_DIR=</path_to_data>.
- Process the data for ET-Flow training. All preprocessed data will be created inside a processed folder inside this directory.
python scripts/prepare_data.py -p $DATA_DIR/rdkit_folder
This should create a processed folder inside $DATA_DIR with the preprocessed data.
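A quick sanity check of the data directory can catch a misplaced archive before a long preprocessing or training run. The sketch below is an illustration only: the helper `missing_data_dirs` is hypothetical, and it assumes QM9, DRUGS and XL are direct children of DATA_DIR. If the extracted layout nests them differently, pass the relative paths that apply.

```python
from pathlib import Path

# Folder names taken from the layout described above; treating them as
# direct children of the data directory is an assumption.
EXPECTED_SUBDIRS = ("QM9", "DRUGS", "XL")


def missing_data_dirs(data_dir, expected=EXPECTED_SUBDIRS):
    """Return the entries of `expected` that are not directories under data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

For example, `missing_data_dirs(os.environ["DATA_DIR"])` returns an empty list when all three folders exist, and `missing_data_dirs(os.environ["DATA_DIR"], expected=("processed",))` can be used after running the preprocessing script.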
Training
We provide our configs for training on the GEOM-DRUGS and the GEOM-QM9 datasets in various configurations. Run the following commands once datasets are preprocessed and the environment is set up:
python scripts/train.py -c configs/drugs-base.yaml
The following two configs from the configs/ directory can be used for replicating paper results:
- drugs-base.yaml: ET-Flow trained on GEOM-DRUGS dataset
- qm9-base.yaml: ET-Flow trained on GEOM-QM9 dataset
Evaluation
Evaluation happens in 2 steps as follows,
- Generating Conformations
To run the evaluation on either GEOM or QM9 given a config and a checkpoint, run the following command,
# here n: number of inference steps for flow matching
python scripts/eval.py --config=<config-path> --checkpoint=<checkpoint-path>
To run the evaluation on GEOM-XL (a test-set containing much larger molecules), run the following command,
python scripts/eval_xl.py --config=<config-path> --checkpoint=<checkpoint-path>
- Evaluating Conformations with RMSD Metrics
The above sample generation script should create a generated_files.pkl at the following path: logs/samples/<config-path>/<data-time>/flow_nsteps_{value-passed-above}/generated_files.pkl. With the given path, we can get the various RMSD metrics using,
python scripts/eval_cov_mat.py --path=<path-to-generated-files.pkl> --num_workers=10
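To make the RMSD metrics concrete, here is a minimal sketch of the coverage (COV) and matching (MAT) quantities as they are commonly defined in conformer-generation benchmarks. It is not taken from scripts/eval_cov_mat.py: the function name, the strict `<` comparison and the threshold value are assumptions, and the RMSD matrix is assumed to be computed elsewhere after alignment.

```python
from typing import Sequence, Tuple


def coverage_and_matching(
    rmsd: Sequence[Sequence[float]], threshold: float
) -> Tuple[float, float]:
    """Recall-style coverage and average minimum RMSD.

    rmsd[i][j] is the RMSD between reference conformer i and generated
    conformer j. Coverage is the fraction of references with at least one
    generated conformer closer than `threshold`; matching is the mean of
    the per-reference minimum RMSD.
    """
    if not rmsd or not rmsd[0]:
        raise ValueError("rmsd matrix must be non-empty")
    # Best generated match for each reference conformer.
    best = [min(row) for row in rmsd]
    coverage = sum(d < threshold for d in best) / len(best)
    matching = sum(best) / len(best)
    return coverage, matching


def transpose(rmsd: Sequence[Sequence[float]]):
    """Swap the roles of reference and generated conformers."""
    return [list(col) for col in zip(*rmsd)]
```

Calling `coverage_and_matching(rmsd, threshold)` gives the recall variant; calling it on `transpose(rmsd)` gives the precision variant, where each generated conformer is matched to its closest reference.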
Acknowledgements
Our codebase is built using the following open-source contributions,
Contact
For further questions, feel free to raise an issue.
Citation
@misc{hassan2024etflow,
title={ET-Flow: Equivariant Flow-Matching for Molecular Conformer Generation},
author={Majdi Hassan and Nikhil Shenoy and Jungyoon Lee and Hannes Stark and Stephan Thaler and Dominique Beaini},
year={2024},
eprint={2410.22388},
archivePrefix={arXiv},
primaryClass={q-bio.QM},
url={https://arxiv.org/abs/2410.22388},
}
