DrugCLIP
[NeurIPS 2023] DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
Install / Use
/learn @bowen-gao/DrugCLIPREADME
DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
<!-- [[Code](xxxx - Overview)] -->
Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at Neural Information Processing Systems, 2023. Currently the code is a raw version, will be updated ASAP. If you have any inquiries, feel free to contact billgao0111@gmail.com
Requirements
same as Uni-Mol
rdkit version should be 2022.9.5
Data and checkpoints
https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing
It currently includes the train data, the trained checkpoint and the test data for DUD-E
Training data
The dataset for training is included in google drive: train_no_test_af.zip. It contains several files:
dick_pkt.txt: dictionary for pocket atom types
dict_mol.txt: dictionary for molecule atom types
train.lmdb: train dataset
valid.lmdb: validation dataset
Use py_scripts/lmdb_utils.py to read the lmdb file. The keys in the lmdb files and corresponding descriptions are shown below:
"atoms": "atom types for each atom in the ligand"
"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"
"pocket_atoms": "atom types for each atom in the pocket"
"pocket_coordinates": "3D coordinates for each atom in the pocket"
"mol": "RDKit molecule object for the ligand"
"smi": "SMILES string for the ligand"
"pocket": "pdbid of the pocket",
The dataset is compiled from the PBDBind dataset, containing a combination of authentic protein-ligand complexes and those generated through HomoAug, a technique for augmenting data with homology-based transformations.
Test data
DUD-E
DUD-E
├── gene id
│ ├── receptor.pdb
│ ├── crystal_ligand.mol2
│ ├── actives_final.ism
│ ├── decoys_final.ism
│ ├── mols.lmdb (containing all actives and decoys)
│ ├── pocket.lmdb
PCBA
lit_pcba
├── target name
│ ├── PDBID_protein.mol2
│ ├── PDBID_ligand.mol2
│ ├── actives.smi
│ ├── inactives.smi
│ ├── mols.lmdb (containing all actives and inactives)
│ ├── pocket.lmdb
Data preprocessing
see py_scripts/write_dude_multi.py
HomoAug
Please refer to HomoAug directory for details
Train
bash drugclip.sh
Test
bash test.sh
Retrieval
bash retrieval.sh
In the google drive folder, you can find example file for pocket.lmdb and mols.lmdb under retrieval dir.
Citation
If you find our work useful, please cite our paper:
@inproceedings{gao2023drugclip,
author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
title = {DrugCLIP: Contrasive Protein-Molecule Representation Learning for Virtual Screening},
booktitle = {NeurIPS 2023},
year = {2023},
url = {https://openreview.net/forum?id=lAbCgNcxm7},
}
