DeepProtein

Deep Learning Library and Benchmark for Protein Sequence Learning (Bioinformatics 2025)

<h3 align="center"> <a href="https://arxiv.org/abs/2410.02023">DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning</a> (Bioinformatics) </h3>
<h4 align="center"> Applications in Protein Property Prediction, Localization Prediction, Protein-Protein Interaction, Antigen Epitope Prediction, Antibody Paratope Prediction, Antibody Developability Prediction, and more. </h4>

Introduction

Understanding proteomics is critical for advancing biology, genomics, and medicine. Proteins perform essential roles, such as catalyzing biochemical reactions and providing immune responses. With the rise of 3D databases like AlphaFold 2.0, machine learning has become a powerful tool for studying protein mechanisms.

Why DeepProtein?

Deep learning has revolutionized tasks such as:

Protein-protein interaction
Protein folding
Protein-ligand interaction
Protein function and property prediction

However, current benchmarks often focus on sequential methods like CNNs and transformers, overlooking graph-based models and lacking user-friendly interfaces.

What is DeepProtein?

DeepProtein is a comprehensive deep learning library and benchmark designed to fill these gaps:

Comprehensive Benchmarking: Evaluating CNNs, RNNs, transformers, and GNNs on 7 essential protein learning tasks, such as function prediction and antibody developability.
User-friendly Interface: Simplifying execution with one command for all tasks.
Enhanced Accessibility: Extensive documentation and tutorials for reproducible research.

News

[04/25] DeepProtein is accepted at Bioinformatics (under publication).
[03/25] DeepProtein has published three notebooks covering dataset loading, training, and inference with DeepProtT5 (Colab).
[03/25] DeepProtein is now under second-round review at Bioinformatics.
[03/25] DeepProtein now supports the Fold and Secondary Structure datasets.
[03/25] DeepProtein: files under the train folders have been simplified, as has the code in the Readme.md file.
[03/25] DeepProtein has released the DeepProtT5 series of models, available at https://huggingface.co/collections/jiaxie/protlm-67bba5b973db936ce90e7d54
[02/25] DeepProtein now supports BioMistral, BioT5+, ChemLLM_7B, ChemDFM, and LlaSMol on some tasks.
[12/24] The documentation of DeepProtein is still under construction. It's at https://deepprotein.readthedocs.io/
[12/24] DeepProtein is going to be supported with pretrained shallow DL models.
[12/24] DeepProtein now supports the BioMistral-7B model; support for BioT5+, BioT5, ChemLLM, and LlaSMol is in progress.
[12/24] DeepProtein now supports four new Protein Language Models: ESM-1-650M, ESM-2-650M, Prot-Bert and Prot-T5 Models for Protein Function Prediction.
[11/24] DeepProtein is accepted at NeurIPS AI4DrugX as Spotlight. It's under revision at Bioinformatics.

Installation

We recommend following the DeepPurpose instructions for installing its dependencies.

conda create -n DeepProtein python=3.9
conda activate DeepProtein
pip install git+https://github.com/bp-kelley/descriptastorus
pip install lmdb seaborn wandb pydantic DeepPurpose
pip install transformers bitsandbytes
pip install "accelerate>=0.26.0"
pip install SentencePiece einops rdchiral peft
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2
pip install datasets
conda install -c conda-forge pytdc
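
Because the commands above pin exact versions of numpy, pandas, and scikit-learn, it can help to confirm that the active environment still matches them after all other packages are installed. The following is a minimal sketch using only the Python standard library; the helper name `find_mismatches` and the `PINNED` dictionary are illustrative and not part of DeepProtein.

```python
from importlib import metadata

# Version pins copied from the install commands above.
PINNED = {"numpy": "1.23.5", "pandas": "1.5.3", "scikit-learn": "1.2.2"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    `get_version` is injectable so the check can be exercised without
    touching the real environment.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means all three pins are satisfied; any entry points at a package that a later install step may have upgraded or removed.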

torch 2.1 or later must be installed, since Jul requires torch >=2.1.0.
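
The torch version requirement can be checked programmatically before running anything heavier. The sketch below is an illustration only: `parse_release`, `meets_minimum`, and `installed_torch_ok` are hypothetical helper names, and the comparison deliberately ignores pre-release suffixes and local build tags such as `+cu118`.

```python
import re
from importlib import metadata


def parse_release(version):
    """Return the leading numeric release of a version string as a tuple of ints."""
    # Drop local build tags such as "+cu118", then keep the leading dotted digits.
    public = version.split("+", 1)[0]
    match = re.match(r"\d+(?:\.\d+)*", public)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum=(2, 1, 0)):
    """True if `version` is at least `minimum` (default: torch >= 2.1.0)."""
    release = parse_release(version)
    # Pad both tuples to the same length so "2.1" compares equal to (2, 1, 0).
    width = max(len(release), len(minimum))
    padded = release + (0,) * (width - len(release))
    floor = tuple(minimum) + (0,) * (width - len(minimum))
    return padded >= floor


def installed_torch_ok():
    """Check the torch distribution in the current environment, if any."""
    try:
        return meets_minimum(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("torch >= 2.1.0 installed:", installed_torch_ok())
```

Reading the version through `importlib.metadata` avoids importing torch itself, so the check stays fast and works even when the installed build cannot be loaded.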

If you want to use a GPU, first find a torch version that matches your CUDA version, then install the DGL build for the same CUDA version. Here is an example with torch 2.3.0 and CUDA 11.8:

pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html

If you are not using a GPU, install the CPU builds instead:

pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cpu
pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html
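
The GPU and CPU commands differ only in the backend tag that appears in the two wheel index URLs. The sketch below makes that pattern explicit by generating both command pairs from one hypothetical helper, `install_commands`. It reproduces the commands shown above for torch 2.3.0. Whether wheels exist for other torch or CUDA combinations is an assumption you would need to verify against the PyTorch and DGL wheel indexes.

```python
def install_commands(torch_version="2.3.0", vision_version="0.18.0",
                     audio_version="2.3.0", cuda_tag=None):
    """Build the torch and DGL pip commands for a CUDA tag such as "cu118".

    Pass cuda_tag=None for the CPU-only builds. The URL layout is inferred
    from the two examples in this README, not from an official specification.
    """
    backend = cuda_tag if cuda_tag else "cpu"
    # DGL wheel folders are keyed by the torch major.minor version, e.g. "2.3".
    major_minor = ".".join(torch_version.split(".")[:2])
    torch_cmd = (
        f"pip install torch=={torch_version} torchvision=={vision_version} "
        f"torchaudio=={audio_version} "
        f"--index-url https://download.pytorch.org/whl/{backend}"
    )
    if cuda_tag:
        dgl_path = f"torch-{major_minor}/{cuda_tag}"
    else:
        dgl_path = f"torch-{major_minor}"
    dgl_cmd = f"pip install dgl -f https://data.dgl.ai/wheels/{dgl_path}/repo.html"
    return [torch_cmd, dgl_cmd]


if __name__ == "__main__":
    for command in install_commands(cuda_tag="cu118"):
        print(command)
```

Calling `install_commands()` with no arguments yields the CPU pair, and `install_commands(cuda_tag="cu118")` yields the GPU pair.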

Demos

Check out some demos and tutorials to get started; they are available in Google Colab:

| Name | Description |
|------|-------------|
| Dataset Tutorial | Tutorial on how to use the dataset loader and read customized data |
| Single Protein Regression | Example of CNN on Beta-lactamase property prediction |
| Single Protein Classification | Example of ProtT5 on SubCellular property prediction |
| Protein Pair Regression | Example of Transformer on PPI Affinity prediction |
| Protein Pair Classification | Example of ProtT5 on Human_PPI Affinity prediction |
| Residual-Level Classification | Example of Token_CNN on PDB prediction |
| Inference of DeepProtT5 models on all above tasks | Example of DeepProtT5 on Fold Structure prediction |
| Personalized data | Example of personalized data load and train |

Example

We give two examples for each case study: (a) one trained with fixed parameters and (b) one trained with arguments. The argument list is given below.

| Argument | Description |
|----------|-------------|
| target_encoding | 'CNN' / 'Transformer' for sequential learning, or 'DGL_GCN' or 'DGL_AttentiveFP' for structure learning. The currently available protein encodings are: ['CNN', 'Transformer', 'CNN_RNN', 'DGL_GCN', 'DGL_GAT', 'DGL_AttentiveFP', 'DGL_NeuralFP', 'DGL_MPNN', 'PAGTN', 'Graphormer', 'prot_t5', 'esm_1b', 'esm_2', 'prot_bert']. For residue-level tasks, the protein encoding list is ['Token_CNN', 'Token_CNN_RNN', 'Token_Transformer']. |
| seed | For the paper: 7 / 42 / 100. You can try your own seed. |
| wandb_proj | The name of the wandb project in which to save the results. |
| lr | Learning rate. We recommend 1e-4 for non-GNN learning and 1e-5 for GNN learning. |
| epochs | Number of training epochs. Setting 60 - 100 epochs generally leads to convergence. |
| compute_pos_enc * | Compute positional enc
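
To make the argument list concrete, here is a minimal sketch of how such arguments could be declared with Python's standard `argparse` module. It is not DeepProtein's actual training script. The defaults, the `build_parser` and `resolve_lr` helpers, and the `GNN_ENCODINGS` grouping are assumptions chosen to mirror the recommendations in the table, and `compute_pos_enc` is left out because its description is cut off.

```python
import argparse

# Encoder names copied from the argument table above.
PROTEIN_ENCODINGS = [
    "CNN", "Transformer", "CNN_RNN", "DGL_GCN", "DGL_GAT", "DGL_AttentiveFP",
    "DGL_NeuralFP", "DGL_MPNN", "PAGTN", "Graphormer",
    "prot_t5", "esm_1b", "esm_2", "prot_bert",
]
RESIDUE_ENCODINGS = ["Token_CNN", "Token_CNN_RNN", "Token_Transformer"]

# Assumption: these are the encoders that count as "GNN learning" for the
# learning-rate recommendation; the table does not spell this out.
GNN_ENCODINGS = {
    "DGL_GCN", "DGL_GAT", "DGL_AttentiveFP", "DGL_NeuralFP", "DGL_MPNN",
    "PAGTN", "Graphormer",
}


def build_parser():
    """Declare the documented arguments; all defaults here are illustrative."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--target_encoding", default="CNN",
                        choices=PROTEIN_ENCODINGS + RESIDUE_ENCODINGS)
    parser.add_argument("--seed", type=int, default=7)  # paper seeds: 7 / 42 / 100
    parser.add_argument("--wandb_proj", default="DeepProtein")
    parser.add_argument("--lr", type=float, default=None)
    parser.add_argument("--epochs", type=int, default=100)  # 60 - 100 suggested
    return parser


def resolve_lr(args):
    """Use the explicit --lr if given, else the recommended rate for the encoder."""
    if args.lr is not None:
        return args.lr
    return 1e-5 if args.target_encoding in GNN_ENCODINGS else 1e-4


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.target_encoding, args.seed, resolve_lr(args), args.epochs)
```

Leaving `--lr` unset lets the recommended rate follow the chosen encoder, while an explicit value always takes precedence.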

jiaqingxie/DeepProtein
