# SaProt: Protein Language Modeling with Structure-aware Vocabulary (AA+3Di)
<a href="https://www.biorxiv.org/content/10.1101/2023.10.01.560349v3"><img src="https://img.shields.io/badge/Paper-bioRxiv-green" style="max-width: 100%;"></a> <a href="https://huggingface.co/westlake-repl/SaProt_650M_AF2"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-red?label=Model" style="max-width: 100%;"></a> <a href="https://portal.valencelabs.com/blogs/post/saprot-protein-language-modeling-with-structure-aware-vocabulary-uyLPrUZqyDF60Yr" alt="blog"><img src="https://img.shields.io/badge/Blog-Portal-violet" /></a> <a href="https://zhuanlan.zhihu.com/p/664754366" alt="zhihu"><img src="https://img.shields.io/badge/Zhihu-知乎-blue" /></a>
🔴 Note: SaProt (35M and 650M) requires structural input (SA tokens) for optimal performance. Its AA-only sequence mode works but should be fine-tuned: its frozen embeddings are useful only for SA input, not for plain AA sequences. In contrast, SaProt-1.3B (SaProt_1.3B_AF2 & SaProt_1.3B_AFDB_OMG_NCBI) performs well with both SA tokens and AA-only sequences. With high-quality structural input, however, SaProt is expected to surpass ESM2 and its own AA-only mode on most tasks.
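As context for the note above: SaProt's structure-aware vocabulary fuses each amino-acid letter with a lowercase Foldseek 3Di letter, so one residue becomes one SA token such as `Md`, and AA-only mode substitutes `#` for the unknown structure letter. A minimal sketch of this fusion (the helper `to_sa_sequence` is hypothetical, not a function from the repo):

```python
# Sketch of SaProt's structure-aware (SA) vocabulary (hypothetical helper,
# not part of the repo): each residue is encoded as its amino-acid letter
# followed by its Foldseek 3Di letter in lowercase, e.g. "M" + "d" -> "Md".
def to_sa_sequence(aa_seq: str, foldseek_seq: str) -> str:
    if len(aa_seq) != len(foldseek_seq):
        raise ValueError("AA and 3Di sequences must have equal length")
    return "".join(aa + di.lower() for aa, di in zip(aa_seq, foldseek_seq))

# With structure: fuse AA and 3Di letters.
sa = to_sa_sequence("MEV", "DVQ")        # "MdEvVq"
# AA-only mode: "#" stands in for every unknown structure letter.
sa_masked = to_sa_sequence("MEV", "###")  # "M#E#V#"
```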
This repository is the official implementation of SaProt: Protein Language Modeling with Structure-aware Vocabulary.

We are pleased to announce that ColabSaprot v2 and SaprotHub are now ready for use.

The community has reported successful wet-lab results obtained with ColabSaprot.

If you have any questions about the paper or the code, feel free to open an issue or email us directly!
We offer two PhD positions to international applicants each year in China! See here or contact Prof. Fajie Yuan directly.

<details open><summary><b>Table of contents</b></summary>
- News
- Overview
- Environment installation
- Prepare the SaProt model
- Load SaProt
- Convert protein structure into structure-aware sequence
- Predict mutational effect
- Get protein embeddings
- Perform protein inverse folding
- Prepare dataset
- Pre-train SaProt
- Fine-tune SaProt
- Evaluate zero-shot performance
- Citation

</details>
## News
- 2025/10/24: SaProt, ColabSaprot, and SaprotHub have been published in Nature Biotechnology; see here.
- 2025/01/01: SaProt has been extensively validated in multiple wet-lab experiments; see our work SaprotHub.
- 2024/12/09: We released the SaProt 1.3B models! Download them from HuggingFace 1.3B-AF2 and HuggingFace 1.3B-AF2+OMG+NCBI. SaProt 1.3B outperforms the original SaProt 650M on AA-sequence-only tasks.
- 2024/05/13: We developed SaprotHub to make protein language model training accessible to all biologists.
- 2024/05/13: SaProt ranked #1 on the public ProteinGym benchmark in April 2024, while the other top-ranked models are hybrid, mutation-specialized models 🎉🎉🎉! See here.
- 2024/04/18: We found a slight discrepancy in the EC and GO evaluation and updated the re-evaluated results (see issue #23 for details).
- 2024/03/08: We added a simple function for zero-shot prediction of mutational effects (see the example below).
- 2024/01/17: Our paper was accepted as an ICLR 2024 spotlight 🎉🎉🎉!
- 2023/10/30: We released a pre-trained SaProt 35M model and a 35M residue-sequence-only version of SaProt (for comparison)! The residue-sequence-only SaProt (without 3Di tokens) performs very similarly to the official ESM-2 35M model (see the results below).
- 2023/10/30: We released results obtained with ESMFold-predicted structures. See the table below.
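Several news items above mention zero-shot prediction of mutational effects. The standard masked-language-model recipe, which SaProt follows in spirit (the repo provides its own prediction function; the helper below is an illustrative stand-in), masks the mutated position and scores a mutation as the log-probability difference between the mutant and wild-type tokens:

```python
import math

# Toy sketch of masked-LM mutational-effect scoring (illustrative only).
# A real model would emit logits over the SA vocabulary at the masked
# (mutated) position; here the logits are fabricated.
def mutational_effect(logits: dict, wt: str, mut: str) -> float:
    # Log-softmax over the vocabulary, then log p(mut) - log p(wt).
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    log_p = {tok: v - log_z for tok, v in logits.items()}
    return log_p[mut] - log_p[wt]

logits = {"Md": 2.0, "Vd": 1.0, "Ld": 0.5}  # fabricated logits over SA tokens
score = mutational_effect(logits, wt="Md", mut="Vd")  # -1.0: mutant less likely than wild type
```

A negative score means the model considers the mutant token less likely than the wild type at that position, which is interpreted as a (relatively) deleterious mutation.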
## Overview

## Environment installation

### Create a virtual environment

```shell
conda create -n SaProt python=3.10
conda activate SaProt
```

### Install packages

```shell
bash environment.sh
```
## Prepare the SaProt model
We provide two ways to use SaProt: through the Hugging Face `transformers` classes, or in the same way as the official ESM GitHub repository. Users can choose either one.
### Model checkpoints
| Name | Size | Dataset |
| --- | --- | --- |
| SaProt_35M_AF2 | 35M parameters | 40M AF2 structures |
| SaProt_650M_PDB | 650M parameters | 40M AF2 structures (phase 1) + 60K PDB structures (phase 2) |
| SaProt_650M_AF2 | 650M parameters | 40M AF2 structures |
| SaProt_1.3B_AFDB_OMG_NCBI | 1.3B parameters | 40M AF2 structures + 200M OMG_prot50 + 150M NCBI (70% identity filtering) |
### New experimental results

Some experimental results are listed below; for more details, please refer to our paper. For the supervised fine-tuning tasks, the datasets were split at 30% sequence identity.
#### 35M Model

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| ESM-2 (35M) | 0.722 | 0.339 | 0.669 | 80.79 | 73.08 | 0.825 | 0.616 | 0.416 | 0.404 | 76.58 | 91.60 |
| SaProt-Seq (35M) | 0.738 | 0.337 | 0.672 | 80.56 | 73.23 | 0.821 | 0.608 | 0.413 | 0.403 | 76.67 | 91.16 |
| SaProt (35M) | 0.794 | 0.392 | 0.692 | 81.11 | 74.29 | 0.847 | 0.642 | 0.431 | 0.418 | 78.09 | 91.97 |
#### 650M Model

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| ESM-2 (650M) | 0.862 | 0.475 | 0.680 | 76.67 | 71.56 | 0.868 | 0.670 | 0.473 | 0.470 | 82.09 | 91.96 |
| SaProt (650M) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |
### AlphaFold2 vs. ESMFold

We compare SaProt fed with structures predicted by AF2 versus ESMFold, as shown below:

| Model | ClinVar | ProteinGym | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | AUC | Spearman's ρ | Spearman's ρ | Acc% | Acc% | Fmax | Fmax | Fmax | Fmax | Acc% | Acc% |
| SaProt (ESMFold) | 0.896 | 0.455 | 0.717 | 85.78 | 74.10 | 0.871 | 0.678 | 0.480 | 0.474 | 82.82 | 93.19 |
| SaProt (AF2) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |
