# SaProt: Protein Language Modeling with Structure-aware Vocabulary (AA+3Di)
<a href="https://www.biorxiv.org/content/10.1101/2023.10.01.560349v3"><img src="https://img.shields.io/badge/Paper-bioRxiv-green" style="max-width: 100%;"></a> <a href="https://huggingface.co/westlake-repl/SaProt_650M_AF2"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-red?label=Model" style="max-width: 100%;"></a> <a href="https://portal.valencelabs.com/blogs/post/saprot-protein-language-modeling-with-structure-aware-vocabulary-uyLPrUZqyDF60Yr" alt="blog"><img src="https://img.shields.io/badge/Blog-Portal-violet" /></a> <a href="https://zhuanlan.zhihu.com/p/664754366" alt="zhihu"><img src="https://img.shields.io/badge/Zhihu-知乎-blue" /></a>
🔴 Note: SaProt (35M and 650M) requires structural input (SA tokens) for optimal performance. Its AA-only sequence mode works but should be fine-tuned: its frozen embeddings are useful only for SA input, not for plain AA sequences. In contrast, SaProt-1.3B (SaProt_1.3B_AF2 & SaProt_1.3B_AFDB_OMG_NCBI) performs well with both SA tokens and AA-only sequences. With high-quality structural input, however, SaProt is expected to surpass ESM2 and its own AA-only mode on most tasks.
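As context for the note above: SaProt's structure-aware vocabulary fuses each amino-acid letter with a lowercase Foldseek 3Di letter, so one residue becomes one SA token such as `Md`, and AA-only mode substitutes `#` for the unknown structure letter. A minimal sketch of this fusion (the helper `to_sa_sequence` is hypothetical, not a function from the repo):

```python
# Sketch of SaProt's structure-aware (SA) vocabulary (hypothetical helper,
# not part of the repo): each residue is encoded as its amino-acid letter
# followed by its Foldseek 3Di letter in lowercase, e.g. "M" + "d" -> "Md".
def to_sa_sequence(aa_seq: str, foldseek_seq: str) -> str:
    if len(aa_seq) != len(foldseek_seq):
        raise ValueError("AA and 3Di sequences must have equal length")
    return "".join(aa + di.lower() for aa, di in zip(aa_seq, foldseek_seq))

# With structure: fuse AA and 3Di letters.
sa = to_sa_sequence("MEV", "DVQ")        # "MdEvVq"
# AA-only mode: "#" stands in for every unknown structure letter.
sa_masked = to_sa_sequence("MEV", "###")  # "M#E#V#"
```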
This repository is the official implementation of SaProt: Protein Language Modeling with Structure-aware Vocabulary.

We are pleased to announce that ColabSaprot v2 and SaprotHub are now ready for use.

The community has reported successful wet-lab results obtained with ColabSaprot.

If you have any questions about the paper or the code, feel free to open an issue or email us directly!
We offer two PhD positions to international applicants each year in China! See here or contact Prof. Fajie Yuan directly.

<details open><summary><b>Table of contents</b></summary>
- News
- Overview
- Environment installation
- Prepare the SaProt model
- Load SaProt
- Convert protein structure into structure-aware sequence
- Predict mutational effect
- Get protein embeddings
- Perform protein inverse folding
- Prepare dataset
- Pre-train SaProt
- Fine-tune SaProt
- Evaluate zero-shot performance
- Citation

</details>
## News
- 2025/10/24: SaProt, ColabSaprot, and SaprotHub have been published in Nature Biotechnology; see here.
- 2025/01/01: SaProt has been extensively validated in multiple wet-lab experiments; see our work SaprotHub.
- 2024/12/09: We released the SaProt 1.3B models! Download them from HuggingFace 1.3B-AF2 and HuggingFace 1.3B-AF2+OMG+NCBI. SaProt 1.3B outperforms the original SaProt 650M on AA-sequence-only tasks.
- 2024/05/13: We developed SaprotHub to make protein language model training accessible to all biologists.
- 2024/05/13: SaProt ranked #1 on the public ProteinGym benchmark in April 2024, while the other top-ranked models are hybrid, mutation-specialized models 🎉🎉🎉! See here.
- 2024/04/18: We found a slight discrepancy in the EC and GO evaluation and updated the re-evaluated results (see issue #23 for details).
- 2024/03/08: We added a simple function for zero-shot prediction of mutational effects (see the example below).
- 2024/01/17: Our paper was accepted as an ICLR 2024 spotlight 🎉🎉🎉!
- 2023/10/30: We released a pre-trained SaProt 35M model and a 35M residue-sequence-only version of SaProt (for comparison)! The residue-sequence-only SaProt (without 3Di tokens) performs very similarly to the official ESM-2 35M model (see the results below).
- 2023/10/30: We released results obtained with ESMFold-predicted structures. See the table below.
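Several news items above mention zero-shot prediction of mutational effects. The standard masked-language-model recipe, which SaProt follows in spirit (the repo provides its own prediction function; the helper below is an illustrative stand-in), masks the mutated position and scores a mutation as the log-probability difference between the mutant and wild-type tokens:

```python
import math

# Toy sketch of masked-LM mutational-effect scoring (illustrative only).
# A real model would emit logits over the SA vocabulary at the masked
# (mutated) position; here the logits are fabricated.
def mutational_effect(logits: dict, wt: str, mut: str) -> float:
    # Log-softmax over the vocabulary, then log p(mut) - log p(wt).
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    log_p = {tok: v - log_z for tok, v in logits.items()}
    return log_p[mut] - log_p[wt]

logits = {"Md": 2.0, "Vd": 1.0, "Ld": 0.5}  # fabricated logits over SA tokens
score = mutational_effect(logits, wt="Md", mut="Vd")  # -1.0: mutant less likely than wild type
```

A negative score means the model considers the mutant token less likely than the wild type at that position, which is interpreted as a (relatively) deleterious mutation.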
## Overview

## Environment installation

### Create a virtual environment

```shell
conda create -n SaProt python=3.10
conda activate SaProt
```

### Install packages

```shell
bash environment.sh
```
## Prepare the SaProt model
We provide two ways to use SaProt: through the Hugging Face `transformers` classes, or in the same way as the official ESM GitHub repository. Users can choose either one.
### Model checkpoints
| Name | Size | Dataset |
| --- | --- | --- |
| SaProt_35M_AF2 | 35M parameters | 40M AF2 structures |
| SaProt_650M_PDB | 650M parameters | 40M AF2 structures (phase 1) + 60K PDB structures (phase 2) |
| SaProt_650M_AF2 | 650M parameters | 40M AF2 structures |
| SaProt_1.3B_AFDB_OMG_NCBI | 1.3B parameters | 40M AF2 structures + 200M OMG_prot50 + 150M NCBI (70% identity filtering) |
### New experimental results

Some experimental results are listed below; for more details, please refer to our paper. For the supervised fine-tuning tasks, the datasets were split at 30% sequence identity.
#### 35M Model

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| ESM-2 (35M) | 0.722 | 0.339 | 0.669 | 80.79 | 73.08 | 0.825 | 0.616 | 0.416 | 0.404 | 76.58 | 91.60 |
| SaProt-Seq (35M) | 0.738 | 0.337 | 0.672 | 80.56 | 73.23 | 0.821 | 0.608 | 0.413 | 0.403 | 76.67 | 91.16 |
| SaProt (35M) | 0.794 | 0.392 | 0.692 | 81.11 | 74.29 | 0.847 | 0.642 | 0.431 | 0.418 | 78.09 | 91.97 |
#### 650M Model

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| ESM-2 (650M) | 0.862 | 0.475 | 0.680 | 76.67 | 71.56 | 0.868 | 0.670 | 0.473 | 0.470 | 82.09 | 91.96 |
| SaProt (650M) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |
### AlphaFold2 vs. ESMFold

We compare SaProt fed with structures predicted by AF2 versus ESMFold, as shown below:

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| SaProt (ESMFold) | 0.896 | 0.455 | 0.717 | 85.78 | 74.10 | 0.871 | 0.678 | 0.480 | 0.474 | 82.82 | 93.19 |
| SaProt (AF2) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |
