DeepProtein

Deep Learning Library and Benchmark for Protein Sequence Learning (Bioinformatics 2025)

<p align="center"><img src="figs/deeppurpose_pp_logo.png" alt="DeepProtein Logo" width="400px" /></p>
<h3 align="center"> <a href="https://arxiv.org/abs/2410.02023">DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning</a> (Bioinformatics) </h3>
<h4 align="center"> Applications in Protein Property Prediction, Localization Prediction, Protein-Protein Interaction, Antigen Epitope Prediction, Antibody Paratope Prediction, Antibody Developability Prediction, and more. </h4>

Introduction

Understanding proteomics is critical for advancing biology, genomics, and medicine. Proteins perform essential roles, such as catalyzing biochemical reactions and mediating immune responses. With the rise of 3D structure databases such as those built on AlphaFold 2.0 predictions, machine learning has become a powerful tool for studying protein mechanisms.

Why DeepProtein?

Deep learning has revolutionized tasks such as:

  1. Protein-protein interaction
  2. Protein folding
  3. Protein-ligand interaction
  4. Protein function and property prediction

However, current benchmarks often focus on sequence-based methods such as CNNs and transformers, overlooking graph-based models and lacking user-friendly interfaces.


What is DeepProtein?

DeepProtein is a comprehensive deep learning library and benchmark designed to fill these gaps:

  1. Comprehensive Benchmarking: Evaluating CNNs, RNNs, transformers, and GNNs on 7 essential protein learning tasks, such as function prediction and antibody developability.
  2. User-friendly Interface: Simplifying execution with one command for all tasks.
  3. Enhanced Accessibility: Extensive documentation and tutorials for reproducible research.
<p align="center"><img src="figs/DeepProtein.jpg" alt="DeepProtein Approach" /></p>

News

  • [04/25] DeepProtein has been accepted at Bioinformatics (publication pending).
  • [03/25] DeepProtein has published three notebooks covering dataset loading, training, and inference with DeepProtT5 (Colab).
  • [03/25] DeepProtein is now in the second round of review at Bioinformatics.
  • [03/25] DeepProtein now supports the Fold and Secondary Structure datasets.
  • [03/25] DeepProtein: the files under the train folders have been simplified, as has the code in the Readme.md file.
  • [03/25] DeepProtein has released the DeepProtT5 series of models, available at https://huggingface.co/collections/jiaxie/protlm-67bba5b973db936ce90e7d54
  • [02/25] DeepProtein now supports BioMistral, BioT5+, ChemLLM_7B, ChemDFM, and LlaSMol on some tasks.
  • [12/24] The documentation of DeepProtein is still under construction. It is available at https://deepprotein.readthedocs.io/
  • [12/24] DeepProtein will support pretrained shallow DL models.
  • [12/24] DeepProtein now supports the BioMistral-7B model; support for [BioT5+, BioT5, ChemLLM, and LlaSMol] is in progress.
  • [12/24] DeepProtein now supports four new Protein Language Models for Protein Function Prediction: ESM-1-650M, ESM-2-650M, Prot-Bert, and Prot-T5.
  • [11/24] DeepProtein was accepted at NeurIPS AI4DrugX as a Spotlight. It is under revision at Bioinformatics.

Installation

We recommend following DeepPurpose's instructions for installing its dependencies.

    conda create -n DeepProtein python=3.9
    conda activate DeepProtein
    pip install git+https://github.com/bp-kelley/descriptastorus
    pip install lmdb seaborn wandb pydantic DeepPurpose
    pip install transformers bitsandbytes
    pip install "accelerate>=0.26.0"
    pip install SentencePiece einops rdchiral peft
    pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2
    pip install datasets
    conda install -c conda-forge pytdc

torch 2.1 or later must be installed, since Jul requires torch >=2.1.0.

  1. If you want to use a GPU, first find a matching torch version, then install dgl built for the same CUDA version. For example, with torch 2.3.0 and CUDA 11.8:
    pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
    pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html

  2. If you are not using a GPU, install the CPU builds instead:
    pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cpu
    pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html
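
After installation, a quick sanity check can confirm that the environment meets the torch >=2.1.0 requirement stated above. The following is a minimal sketch and not part of DeepProtein: the helper names `parse_release`, `meets_minimum`, and `installed_version` are hypothetical, and the check only reads installed package metadata through the Python standard library.

```python
import re
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Turn a version string such as '2.3.0+cu118' into (2, 3, 0)."""
    # Keep only the leading dotted-number part; local tags like '+cu118' are ignored.
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    numbers = [int(part) for part in match.group(0).split(".")]
    numbers += [0] * (3 - len(numbers))  # pad '2.1' to (2, 1, 0)
    return tuple(numbers[:3])


def meets_minimum(version: str, minimum: tuple = (2, 1, 0)) -> bool:
    """Return True if `version` is at least `minimum` (default: torch 2.1.0)."""
    return parse_release(version) >= minimum


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report what is installed, then flag a torch build that is too old.
    for package in ("torch", "dgl"):
        found = installed_version(package)
        print(f"{package}: {found if found else 'not installed'}")
    torch_version = installed_version("torch")
    if torch_version is not None and not meets_minimum(torch_version):
        print("torch is older than 2.1.0; upgrade it before training.")
```

Running the script inside the activated `DeepProtein` environment prints the detected torch and dgl versions, which makes a mismatched or missing install easy to spot before any training run.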
    

Demos

Check out the demos and tutorials below to get started; all are available in Google Colab:

| Name | Description |
|------|-------------|
| Dataset Tutorial | Tutorial on how to use the dataset loader and read customized data |
| Single Protein Regression | Example of CNN on Beta-lactamase property prediction |
| Single Protein Classification | Example of ProtT5 on SubCellular property prediction |
| Protein Pair Regression | Example of Transformer on PPI Affinity prediction |
| Protein Pair Classification | Example of ProtT5 on Human_PPI Affinity prediction |
| Residue-Level Classification | Example of Token_CNN on PDB prediction |
| Inference of DeepProtT5 models on all above tasks | Example of DeepProtT5 on Fold Structure prediction |
| Personalized data | Example of personalized data load and train |

Example

We give two examples for each case study: one trained with fixed parameters (a) and one trained with user-supplied arguments (b). The argument list is given below.

| Argument | Description |
|----------|-------------|
| target_encoding | 'CNN' / 'Transformer' for sequential learning, or 'DGL_GCN' / 'DGL_AttentiveFP' for structure learning. The full list of currently available protein encodings is: ['CNN', 'Transformer', 'CNN_RNN', 'DGL_GCN', 'DGL_GAT', 'DGL_AttentiveFP', 'DGL_NeuralFP', 'DGL_MPNN', 'PAGTN', 'Graphormer', 'prot_t5', 'esm_1b', 'esm_2', 'prot_bert']. For residue-level tasks, the protein encoding list is ['Token_CNN', 'Token_CNN_RNN', 'Token_Transformer']. |
| seed | For the paper: 7 / 42 / 100. You can also try your own seed. |
| wandb_proj | The name of the wandb project in which to save the results. |
| lr | Learning rate. We recommend 1e-4 for non-GNN learning and 1e-5 for GNN learning. |
| epochs | Number of training epochs. Setting 60 - 100 epochs generally leads to convergence. |
| compute_pos_enc * | Compute positional enc
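
To make the argument list concrete, here is a minimal sketch of how these options could be exposed with Python's `argparse`. It is an illustration, not DeepProtein's actual training script. The function names `build_parser`, `recommended_lr`, and `parse_args`, the default values, and the choice of which encodings count as GNNs (the `DGL_*` entries, `PAGTN`, and `Graphormer`) are all assumptions. Only the argument names, the encoding lists, the paper seeds, and the recommended learning rates come from the table, and only the five fully described arguments are covered.

```python
import argparse

# Encoding names copied from the argument table.
PROTEIN_ENCODINGS = [
    "CNN", "Transformer", "CNN_RNN", "DGL_GCN", "DGL_GAT", "DGL_AttentiveFP",
    "DGL_NeuralFP", "DGL_MPNN", "PAGTN", "Graphormer",
    "prot_t5", "esm_1b", "esm_2", "prot_bert",
]
RESIDUE_ENCODINGS = ["Token_CNN", "Token_CNN_RNN", "Token_Transformer"]

# Assumption: these graph-based encodings are the ones that get the GNN learning rate.
GNN_ENCODINGS = {n for n in PROTEIN_ENCODINGS if n.startswith("DGL_")} | {"PAGTN", "Graphormer"}


def recommended_lr(target_encoding: str) -> float:
    """Recommended learning rate: 1e-5 for GNN encodings, 1e-4 otherwise."""
    return 1e-5 if target_encoding in GNN_ENCODINGS else 1e-4


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser; defaults are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Hypothetical training options")
    parser.add_argument("--target_encoding", default="CNN",
                        choices=PROTEIN_ENCODINGS + RESIDUE_ENCODINGS)
    parser.add_argument("--seed", type=int, default=7,
                        help="seeds used in the paper: 7, 42, 100")
    parser.add_argument("--wandb_proj", default="DeepProtein",
                        help="wandb project name (default is an assumption)")
    parser.add_argument("--lr", type=float, default=None,
                        help="learning rate; falls back to the recommended value")
    parser.add_argument("--epochs", type=int, default=100,
                        help="60 - 100 epochs generally leads to convergence")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and fill in the recommended learning rate if none is given."""
    args = build_parser().parse_args(argv)
    if args.lr is None:
        args.lr = recommended_lr(args.target_encoding)
    return args


if __name__ == "__main__":
    # Explicit argument list so the example does not depend on sys.argv.
    print(parse_args(["--target_encoding", "DGL_GCN", "--seed", "42"]))
```

Leaving `--lr` unset lets the encoding choice select between the two recommended rates, while an explicit `--lr` always takes precedence.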
