SaProt: Protein Language Modeling with Structure-aware Vocabulary (AA+3Di)

<a href="https://www.biorxiv.org/content/10.1101/2023.10.01.560349v3"><img src="https://img.shields.io/badge/Paper-bioRxiv-green" style="max-width: 100%;"></a> <a href="https://huggingface.co/westlake-repl/SaProt_650M_AF2"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-red?label=Model" style="max-width: 100%;"></a> <a href="https://portal.valencelabs.com/blogs/post/saprot-protein-language-modeling-with-structure-aware-vocabulary-uyLPrUZqyDF60Yr" alt="blog"><img src="https://img.shields.io/badge/Blog-Portal-violet" /></a> <a href="https://zhuanlan.zhihu.com/p/664754366" alt="zhihu"><img src="https://img.shields.io/badge/Zhihu-知乎-blue" /></a>

🔴 Note: SaProt (35M and 650M) requires structural (SA token) input for optimal performance. Its AA-only sequence mode works but should be fine-tuned: the frozen embeddings are only useful for SA input, not for AA-only sequences! In contrast, SaProt-1.3B (SaProt_1.3B_AF2 and SaProt_1.3B_AFDB_OMG_NCBI) performs well with both SA tokens and AA-only sequences. Even so, given high-quality structural input, SaProt is expected to surpass both ESM2 and its own AA-only mode on most tasks.
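To make the difference between SA input and AA-only input concrete, here is a minimal sketch of how a structure-aware sequence could be assembled from an amino-acid string and a 3Di string of the same length. The helper name `to_sa_sequence` is hypothetical, and the conventions used (uppercase residue followed by lowercase 3Di state, `#` as the placeholder for a masked structure token, and a pLDDT cutoff of 70) are assumptions for illustration rather than the repository's own code.

```python
def to_sa_sequence(aa_seq, di_seq, plddt=None, threshold=70.0):
    """Interleave residues and 3Di states into a structure-aware (SA) string.

    Hypothetical helper. Assumed conventions: each SA token is an uppercase
    amino acid followed by a lowercase 3Di state; '#' replaces the 3Di state
    when the per-residue pLDDT falls below `threshold`.
    """
    if len(aa_seq) != len(di_seq):
        raise ValueError("amino-acid and 3Di strings must have equal length")
    if plddt is not None and len(plddt) != len(aa_seq):
        raise ValueError("plddt must have one value per residue")

    tokens = []
    for i, (aa, di) in enumerate(zip(aa_seq, di_seq)):
        struct = di.lower()
        # Mask the structure token for low-confidence positions.
        if plddt is not None and plddt[i] < threshold:
            struct = "#"
        tokens.append(aa.upper() + struct)
    return "".join(tokens)


print(to_sa_sequence("MEVQ", "DVPP"))                        # MdEvVpQp
print(to_sa_sequence("MEVQ", "DVPP", plddt=[90, 50, 80, 95]))  # MdE#VpQp
```

Under these assumptions, an AA-only input corresponds to a string in which every structure token is `#`.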

This repository is the official implementation of SaProt: Protein Language Modeling with Structure-aware Vocabulary.

We are pleased to announce that ColabSaprot v2 and SaprotHub are now ready for use. Go.

Successful wet-lab results obtained with ColabSaprot by the community.

If you have any questions about the paper or the code, feel free to raise an issue or email us directly!

We offer two PhD positions in China to international student applicants each year! See here or contact Prof. Fajie Yuan directly.


News

  • 2025/10/24: SaProt, ColabSaprot and SaprotHub are now published in Nature Biotechnology, see here.
  • 2025/01/01: SaProt has been extensively validated by multiple wet-lab experiments; see our work SaprotHub.
  • 2024/12/09: We released the SaProt 1.3B version! Download it from HuggingFace 1.3B-AF2 and HuggingFace 1.3B-AF2+OMG+NCBI. SaProt 1.3B outperforms the original SaProt 650M on AA-sequence-only tasks.
  • 2024/05/13: We developed SaprotHub to make protein language model training accessible to all biologists. Go.
  • 2024/05/13: SaProt ranked #1 on the public ProteinGym benchmark in April 2024, whereas the other top-ranked models are hybrid or mutation-specialized models 🎉🎉🎉! See here.
  • 2024/04/18: We found a slight discrepancy in the EC and GO evaluation and updated the results after re-evaluation (see issue #23 for details).
  • 2024/03/08: We uploaded a simple function for zero-shot prediction of mutational effects (see example below).
  • 2024/01/17: Our paper has been accepted as an ICLR 2024 spotlight 🎉🎉🎉!
  • 2023/10/30: We released a pre-trained SaProt 35M model and a 35M residue-sequence-only version of SaProt (for comparison)! The residue-sequence-only SaProt (without 3Di tokens) performs very similarly to the official ESM-2 35M model (see Results below).
  • 2023/10/30: We released the results obtained with ESMFold structures. See Table below.

Overview

Environment installation

Create a virtual environment

conda create -n SaProt python=3.10
conda activate SaProt

Install packages

bash environment.sh  

Prepare the SaProt model

We provide two ways to use SaProt: through the Hugging Face model classes, or in the same way as the ESM GitHub repository. Users can choose either one.
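As an illustration of the Hugging Face route, the sketch below loads a checkpoint with the ESM classes from `transformers` and runs one SA sequence through the masked language model. It assumes that the checkpoints are compatible with `EsmTokenizer` and `EsmForMaskedLM` and that the model ID matches the Hugging Face link in the badges above; the function name `masked_lm_logits` and the example sequence are hypothetical.

```python
MODEL_ID = "westlake-repl/SaProt_650M_AF2"  # taken from the Hugging Face badge link


def masked_lm_logits(sa_seq, model_id=MODEL_ID, device="cpu"):
    """Return masked-LM logits for one structure-aware (SA) sequence.

    Hypothetical helper: assumes the checkpoint loads with the ESM classes
    in `transformers`. Imports are local so that defining the function does
    not require torch or transformers to be installed.
    """
    import torch
    from transformers import EsmForMaskedLM, EsmTokenizer

    tokenizer = EsmTokenizer.from_pretrained(model_id)
    model = EsmForMaskedLM.from_pretrained(model_id).to(device).eval()

    inputs = tokenizer(sa_seq, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Logits over the vocabulary for every token position in the batch.
    return outputs.logits


if __name__ == "__main__":
    # Illustrative SA sequence; '#' marks a masked structure token.
    logits = masked_lm_logits("M#EvVpQpL#VyQdYaKv")
    print(logits.shape)
```

Calling the function downloads the checkpoint on first use, so it needs network access or a local copy passed as `model_id`.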

Model checkpoints

| Name | Size | Dataset |
| --- | --- | --- |
| SaProt_35M_AF2 | 35M parameters | 40M AF2 structures |
| SaProt_650M_PDB | 650M parameters | 40M AF2 structures (phase1) + 60K PDB structures (phase2) |
| SaProt_650M_AF2 | 650M parameters | 40M AF2 structures |
| SaProt_1.3B_AFDB_OMG_NCBI | 1.3B parameters | 40M AF2 structures + 200M OMG_prot50 + 150M NCBI (70% identity filtering) |

New experimental results

Some experimental results are listed below. For more details, please refer to our paper. For supervised fine-tuning tasks, the datasets were split based on 30% sequence identity.

35M Model

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| ESM-2 (35M) | 0.722 | 0.339 | 0.669 | 80.79 | 73.08 | 0.825 | 0.616 | 0.416 | 0.404 | 76.58 | 91.60 |
| SaProt-Seq (35M) | 0.738 | 0.337 | 0.672 | 80.56 | 73.23 | 0.821 | 0.608 | 0.413 | 0.403 | 76.67 | 91.16 |
| SaProt (35M) | 0.794 | 0.392 | 0.692 | 81.11 | 74.29 | 0.847 | 0.642 | 0.431 | 0.418 | 78.09 | 91.97 |

650M Model

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| ESM-2 (650M) | 0.862 | 0.475 | 0.680 | 76.67 | 71.56 | 0.868 | 0.670 | 0.473 | 0.470 | 82.09 | 91.96 |
| SaProt (650M) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |
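The ProteinGym and Thermostability columns report Spearman's ρ, the rank correlation between predicted scores and measured values. For readers unfamiliar with the metric, here is a small dependency-free sketch (Pearson correlation computed on average ranks). The function names and the toy numbers are illustrative only and are not taken from the benchmarks.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy example: predicted mutation scores vs. measured fitness.
predicted = [-2.1, -0.3, 0.5, -1.2]
measured = [0.1, 0.9, 0.8, 0.7]
print(round(spearman_rho(predicted, measured), 3))  # 0.8
```

In practice `scipy.stats.spearmanr` computes the same quantity; the explicit version is shown only to make the definition visible.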

AlphaFold2 vs. ESMFold

We compare SaProt's performance when using structures predicted by AF2 versus ESMFold. The results are shown below:

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| SaProt (ESMFold) | 0.896 | 0.455 | 0.717 | 85.78 | 74.10 | 0.871 | 0.678 | 0.480 | 0.474 | 82.82 | 93.19 |
| SaProt (AF2) | 0.909 | 0.478
