# DeepProtein
Deep Learning Library and Benchmark for Protein Sequence Learning (Bioinformatics 2025)
<h3 align="center"> <a href="https://arxiv.org/abs/2410.02023">DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning</a> (Bioinformatics) </h3>
<h4 align="center"> Applications in Protein Property Prediction, Localization Prediction, Protein-Protein Interaction, Antigen Epitope Prediction, Antibody Paratope Prediction, Antibody Developability Prediction, and more. </h4>
## Introduction
Understanding proteomics is critical for advancing biology, genomics, and medicine. Proteins perform essential roles, such as catalyzing biochemical reactions and mediating immune responses. With the rise of 3D structure databases such as those produced by AlphaFold 2.0, machine learning has become a powerful tool for studying protein mechanisms.
## Why DeepProtein?
Deep learning has revolutionized tasks such as:
- Protein-protein interaction
- Protein folding
- Protein-ligand interaction
- Protein function and property prediction
However, current benchmarks often focus on sequential methods like CNNs and transformers, overlooking graph-based models and lacking user-friendly interfaces.
## What is DeepProtein?
DeepProtein is a comprehensive deep learning library and benchmark designed to fill these gaps:
- Comprehensive Benchmarking: Evaluating CNNs, RNNs, transformers, and GNNs on 7 essential protein learning tasks, such as function prediction and antibody developability.
- User-friendly Interface: Simplifying execution with one command for all tasks.
- Enhanced Accessibility: Extensive documentation and tutorials for reproducible research.
## News
- [04/25] DeepProtein has been accepted at Bioinformatics (in press).
- [03/25] DeepProtein now provides three Colab notebooks covering dataset loading, training, and inference with DeepProtT5.
- [03/25] DeepProtein is now under second-round review at Bioinformatics.
- [03/25] DeepProtein now supports the Fold and Secondary Structure datasets.
- [03/25] The files under the train folders have been simplified, as has the code in the README.
- [03/25] DeepProtein has released the DeepProtT5 series of models, available at https://huggingface.co/collections/jiaxie/protlm-67bba5b973db936ce90e7d54
- [02/25] DeepProtein now supports BioMistral, BioT5+, ChemLLM_7B, ChemDFM, and LlaSMol on some tasks.
- [12/24] The documentation of DeepProtein is under construction at https://deepprotein.readthedocs.io/
- [12/24] Support for pretrained shallow DL models is planned.
- [12/24] DeepProtein now supports the BioMistral-7B model; support for BioT5+, BioT5, ChemLLM, and LlaSMol is in progress.
- [12/24] DeepProtein now supports four new protein language models for protein function prediction: ESM-1b (650M), ESM-2 (650M), ProtBert, and ProtT5.
- [11/24] DeepProtein was accepted at the NeurIPS AI4DrugX workshop as a Spotlight. It is under revision at Bioinformatics.
## Installation
We recommend following DeepPurpose's instructions for installing its dependencies.

```shell
conda create -n DeepProtein python=3.9
conda activate DeepProtein
pip install git+https://github.com/bp-kelley/descriptastorus
pip install lmdb seaborn wandb pydantic DeepPurpose
pip install transformers bitsandbytes
pip install "accelerate>=0.26.0"
pip install SentencePiece einops rdchiral peft
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2
pip install datasets
conda install -c conda-forge pytdc
```
A version of torch 2.1+ is required, since DGL requires torch >= 2.1.0.
- If you want to use a GPU, first pick a matching torch version, then install DGL built against the same CUDA version. For example, torch 2.3.0 with CUDA 11.8:

```shell
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html
```

- If you are not using a GPU, install the CPU builds instead:

```shell
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cpu
pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html
```
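After installation, a quick stdlib-only sanity check can confirm that the core dependencies import cleanly. This snippet is not part of DeepProtein itself; it simply probes the packages installed above:

```python
import importlib

# Report which core dependencies resolved, and their versions where exposed.
for pkg in ["torch", "dgl", "transformers", "DeepPurpose"]:
    try:
        mod = importlib.import_module(pkg)
        print(pkg, getattr(mod, "__version__", "installed"))
    except ImportError:
        print(pkg, "NOT INSTALLED")
```

If any line prints `NOT INSTALLED`, rerun the corresponding `pip install` step before training.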
## Demos
Check out the demos & tutorials below to get started; all are available in Google Colab:
| Name | Description |
|------|-------------|
| Dataset Tutorial | Tutorial on how to use the dataset loader and read customized data |
| Single Protein Regression | Example of CNN on Beta-lactamase property prediction |
| Single Protein Classification | Example of ProtT5 on SubCellular property prediction |
| Protein Pair Regression | Example of Transformer on PPI Affinity prediction |
| Protein Pair Classification | Example of ProtT5 on Human_PPI Affinity prediction |
| Residue-Level Classification | Example of Token_CNN on PDB prediction |
| Inference of DeepProtT5 models on all above tasks | Example of DeepProtT5 on Fold Structure prediction |
| Personalized data | Example of personalized data load and train |
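All of the sequence-level demos above consume amino acid sequences encoded as numeric arrays. As a minimal stdlib sketch of the idea (the 20-letter alphabet, padding scheme, and function name below are illustrative assumptions, not DeepProtein's exact preprocessing pipeline):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq: str, max_len: int = 8) -> list[list[int]]:
    """One-hot encode a protein sequence, truncating/padding to max_len."""
    seq = seq[:max_len].upper()
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_TO_IDX:          # unknown residues stay all-zero
            row[AA_TO_IDX[aa]] = 1
        rows.append(row)
    rows += [[0] * len(AMINO_ACIDS)] * (max_len - len(rows))  # pad
    return rows

matrix = encode_sequence("MKV")
print(len(matrix), len(matrix[0]))  # → 8 20 (positions x residue channels)
```

A CNN or transformer encoder then operates on this positions-by-channels matrix; graph encoders such as DGL_GCN instead build a residue graph from the structure.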
## Example
We give two examples for each case study: one trained with fixed parameters (a) and one driven by command-line arguments. The supported arguments are listed below.
| Argument | Description |
|----------|-------------|
| target_encoding | 'CNN' / 'Transformer' for sequential learning, or 'DGL_GCN' / 'DGL_AttentiveFP' for structure learning. The full list of available protein encodings is ['CNN', 'Transformer', 'CNN_RNN', 'DGL_GCN', 'DGL_GAT', 'DGL_AttentiveFP', 'DGL_NeuralFP', 'DGL_MPNN', 'PAGTN', 'Graphormer', 'prot_t5', 'esm_1b', 'esm_2', 'prot_bert']. For residue-level tasks, the encoding list is ['Token_CNN', 'Token_CNN_RNN', 'Token_Transformer']. |
| seed | For the paper: 7 / 42 / 100. You can also try your own seed. |
| wandb_proj | The name of the wandb project that results are saved into. |
| lr | Learning rate. We recommend 1e-4 for non-GNN encoders and 1e-5 for GNNs. |
| epochs | Number of training epochs. Setting 60-100 epochs generally leads to convergence. |
| compute_pos_enc | Compute positional encoding |
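The arguments above map naturally onto an `argparse` interface. A minimal sketch is shown below; the parser, its defaults, and the entry point are illustrative assumptions, not DeepProtein's actual training script:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table above."""
    p = argparse.ArgumentParser(description="DeepProtein-style training arguments")
    p.add_argument("--target_encoding", default="CNN",
                   choices=["CNN", "Transformer", "CNN_RNN", "DGL_GCN", "DGL_GAT",
                            "DGL_AttentiveFP", "DGL_NeuralFP", "DGL_MPNN", "PAGTN",
                            "Graphormer", "prot_t5", "esm_1b", "esm_2", "prot_bert"])
    p.add_argument("--seed", type=int, default=42)        # paper uses 7 / 42 / 100
    p.add_argument("--wandb_proj", default="deepprotein")  # wandb project name
    p.add_argument("--lr", type=float, default=1e-4)       # 1e-5 recommended for GNNs
    p.add_argument("--epochs", type=int, default=100)      # 60-100 usually converges
    return p

# Example invocation: a GNN encoder with the GNN-recommended learning rate.
args = build_parser().parse_args(["--target_encoding", "DGL_GCN", "--lr", "1e-5"])
print(args.target_encoding, args.lr)  # → DGL_GCN 1e-05
```

Unlisted encodings are rejected by `choices`, which catches typos like `DGL_GNC` before any training starts.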
