PsiProtein
Deep Learning Library and Benchmark for Protein Sequence Learning (Bioinformatics 2025)
<h3 align="center"> <a href="https://arxiv.org/abs/2410.02023">DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning</a> (Bioinformatics) </h3>
<h4 align="center"> Applications in Protein Property Prediction, Localization Prediction, Protein-Protein Interaction, Antigen Epitope Prediction, Antibody Paratope Prediction, Antibody Developability Prediction, and more. </h4>
Introduction
Understanding proteomics is critical for advancing biology, genomics, and medicine. Proteins perform essential roles, such as catalyzing biochemical reactions and mediating immune responses. With the rise of 3D structure resources such as AlphaFold 2.0, machine learning has become a powerful tool for studying protein mechanisms.
Why DeepProtein?
Deep learning has revolutionized tasks such as:
- Protein-protein interaction
- Protein folding
- Protein-ligand interaction
- Protein function and property prediction
However, current benchmarks often focus on sequence-based models such as CNNs and transformers. They overlook graph-based models and lack user-friendly interfaces.
What is DeepProtein 2.0?
DeepProtein 2.0 is the torch-first runtime of DeepProtein. It keeps the benchmark and task coverage of the original project while moving the maintained graph path onto torch-geometric instead of DGL.
In practice, the current 2.0 line focuses on:
- Torch-only core runtime: the maintained graph workflow now runs through PyG_GCN, PyG_GAT, PyG_GraphSAGE, PyG_GIN, PyG_ChebNet, and PyG_TAGConv.
- Unified task coverage: single-protein, pair/PPI, and residue-level tasks stay under one library surface.
- Practical training entry points: CLI scripts, dataset loaders, and smoke tests are aligned to the v2 runtime.
More broadly, DeepProtein remains a comprehensive deep learning library and benchmark designed to fill these gaps:
- Comprehensive Benchmarking: Evaluating CNNs, RNNs, transformers, and GNNs on 7 essential protein learning tasks, such as function prediction and antibody developability.
- User-friendly Interface: Simplifying execution with one command for all tasks.
- Enhanced Accessibility: Extensive documentation and tutorials for reproducible research.
News
- [04/25] DeepProtein has been accepted at Bioinformatics (publication pending).
- [03/25] DeepProtein has published three notebooks covering dataset loading, training, and inference with DeepProtT5 (Colab).
- [03/25] DeepProtein is now in the second round of review at Bioinformatics.
- [03/25] DeepProtein now supports the Fold and Secondary Structure datasets.
- [03/25] The files under the train folders and the code in the Readme.md file have been simplified.
- [03/25] DeepProtein has released the DeepProtT5 series of models, available at https://huggingface.co/collections/jiaxie/protlm-67bba5b973db936ce90e7d54
- [02/25] DeepProtein now supports BioMistral, BioT5+, ChemLLM_7B, ChemDFM, and LlaSMol on some tasks.
- [12/24] The DeepProtein documentation is still under construction at https://deepprotein.readthedocs.io/
- [12/24] DeepProtein will add support for pretrained shallow DL models.
- [12/24] DeepProtein now supports the BioMistral-7B model; support for BioT5+, BioT5, ChemLLM, and LlaSMol is in progress.
- [12/24] DeepProtein now supports four new protein language models for protein function prediction: ESM-1-650M, ESM-2-650M, Prot-Bert, and Prot-T5.
- [11/24] DeepProtein was accepted at NeurIPS AI4DrugX as a Spotlight. It is under revision at Bioinformatics.
Installation
The commands below reflect the recommended DeepProtein 2.0 environment. The project still shares several dependencies with DeepPurpose, but the maintained graph backend in v2 is torch-geometric.
conda create -n DeepProtein python=3.9
conda activate DeepProtein
pip install git+https://github.com/bp-kelley/descriptastorus
pip install lmdb seaborn wandb pydantic DeepPurpose
pip install transformers bitsandbytes
pip install "accelerate>=0.26.0"
pip install SentencePiece einops rdchiral peft
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2
pip install datasets
conda install -c conda-forge pytdc
torch 2.1 or later must be installed, because Jul requires torch >=2.1.0.
- If you want to use a GPU, first find a torch build that matches your CUDA version. For example, torch 2.3.0 with CUDA 11.8:
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install torch-geometric
- If you are not using a GPU, install the CPU build instead:
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric
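After installing, it can be useful to confirm that the environment satisfies the torch >=2.1.0 requirement stated above. The following is a minimal sketch, not part of DeepProtein: the helper names `parse_major_minor` and `meets_minimum` are hypothetical, and the only assumption about torch is that it exposes its version as `torch.__version__`.

```python
import re

# Minimum torch version stated in this README.
MIN_TORCH = (2, 1)


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '2.3.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """Return True if the version is at least the required (major, minor)."""
    return parse_major_minor(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old, need >= 2.1"
        print(f"torch {torch.__version__}: {status}")
```

Comparing integer tuples rather than raw strings avoids the pitfall where "2.10" sorts before "2.9" lexicographically.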
DeepProtein 2.0 removes the DGL requirement from the core package. Phases II and III add torch-geometric support for both single-protein and pair/PPI graph encoders, including PyG_GCN, PyG_GAT, PyG_GraphSAGE, PyG_GIN, PyG_ChebNet, and PyG_TAGConv. Phase V adds optional Laplacian positional encoding for the PyG_* graph path through compute_pos_enc=True. Legacy DGL_* graph encoders are kept as unsupported compatibility stubs.
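The split between maintained and legacy graph encoders can be checked before launching a run. The sketch below is illustrative only and does not call any DeepProtein API: the encoder names are copied from the paragraph above, while `graph_encoder_status` and `supports_pos_enc` are hypothetical helpers introduced here.

```python
# Graph encoders this README lists as maintained in the v2 runtime.
PYG_ENCODERS = frozenset({
    "PyG_GCN", "PyG_GAT", "PyG_GraphSAGE",
    "PyG_GIN", "PyG_ChebNet", "PyG_TAGConv",
})


def graph_encoder_status(name: str) -> str:
    """Hypothetical helper: classify a graph encoder name.

    Returns 'maintained' for the PyG_* encoders listed above,
    'legacy' for DGL_* names (unsupported compatibility stubs),
    and 'unknown' for anything else (for example, non-graph encoders).
    """
    if name in PYG_ENCODERS:
        return "maintained"
    if name.startswith("DGL_"):
        return "legacy"
    return "unknown"


def supports_pos_enc(name: str) -> bool:
    """Per this README, compute_pos_enc=True applies to the PyG_* graph path."""
    return name in PYG_ENCODERS


if __name__ == "__main__":
    for encoder in ("PyG_GIN", "DGL_GCN", "CNN"):
        print(encoder, graph_encoder_status(encoder), supports_pos_enc(encoder))
```

A check like this fails fast on a legacy DGL_* name instead of surfacing the problem partway through data processing.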
Demos
Check out the demos and tutorials below to get started. They are available in Google Colab:
| Name | Description |
|------|-------------|
| Dataset Tutorial | Tutorial on how to use the dataset loader and read customized data |
| Single Protein Regression | Example of CNN on Beta-lactamase property prediction |
| Single Protein Classification | Example of ProtT5 on SubCellular property prediction |
| Protein Pair Regression | Example of Transformer on PPI Affinity prediction |
| Protein Pair Classification | Example of ProtT5 on Human_PPI Affinity prediction |
| Residue-Level Classification | Example of Token_CNN on PDB prediction |
| Inference of DeepProtT5 models on all above tasks | Example of DeepProtT5 on Fold Structure prediction |
| Personalized data | Example of personalized data load and train |
Example
We give two examples for each case study. One is trained with fixed parameters (a) and the other is trained with arguments. The argument list is given below.
| Argument | Description |
|----------|-------------|
| target_encoding | DeepProtein 2.0 supports torch-only protein encoders such as 'CNN', 'Transformer', 'CNN_RNN', 'prot_t5', 'esm_1b', 'esm_2', and 'prot_bert', plus torch-geometric graph encoders 'PyG_GCN', 'PyG_GAT', 'PyG_GraphSAGE', 'PyG_GIN', 'PyG_ChebNet', and 'PyG_TAGConv' for both single-protein and pair/PPI tasks. Legacy graph encoders such as 'DGL_GCN', 'DGL_GAT', 'DGL_AttentiveFP', 'DGL_NeuralFP', 'DGL_MPNN', 'PAGTN', and 'Graphormer' are not available in the v2 runtime. For residue-level tasks, the protein encoding list is ['Token_CNN', 'Token_CNN_RNN', 'Token_Transformer']. |
| seed | The paper uses 7, 42, and 100. You can also try your own seed. |
| wandb_proj | The name of the wandb project in which you wish to save the results.
