SaprotHub

Making Protein Language Modeling Accessible to All Biologists

Democratizing Protein Language Model Training, Sharing and Collaboration (Nature Biotechnology)

<a href="https://doi.org/10.1101/2024.05.24.595648"><img src="https://img.shields.io/badge/Paper-bioRxiv-green" style="max-width: 100%;"></a> <a href="https://huggingface.co/SaProtHub"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-red?label=SaprotHub" style="max-width: 100%;"></a> <a href="https://colab.research.google.com/github/westlake-repl/SaprotHub/blob/main/colab/SaprotHub_v2.ipynb"><img src="Figure/colab-badge.svg" style="max-width: 100%;"></a> <a href="https://cbirt.net/no-coding-required-saprothub-brings-protein-modeling-to-every-biologist/" alt="blog"><img src="https://img.shields.io/badge/Blog-Medium-purple" /></a> <a href="https://x.com/sokrypton/status/1795525127653986415"><img src="https://img.shields.io/badge/Twitter-blue?logo=twitter" style="max-width: 100%;"></a>

This repository is the official implementation of Democratizing Protein Language Model Training, Sharing and Collaboration. It makes fine-tuning and sharing protein language models accessible to all: no coding or advanced machine learning expertise required!

Access via ColabSaprot (35M, 650M), ColabSeprot (including ColabESM1b (650M), ColabESM2 (35M, 150M, 650M), ColabProTrek (35M, 650M), and ColabProtBERT (420M)), ColabESMC (300M, 600M), and ColabESM3 (1.4B)! Note that ColabSaprot and ColabESM3 also support training and prediction using only amino acid sequences.

Watch the ColabSaprot video tutorial.

Go to the OPMC website.

Successful wet-lab results obtained with ColabSaprot by the community.

Find models through SaprotHub Search, and share your models via ColabSaprot to boost visibility and citations.

News

  • 2026/01/28: We are delighted to announce that ColabESM3 and ColabESMC are now ready for use.

  • 2025/10/24: ColabSaprot and SaprotHub are published in Nature Biotechnology. See here.

  • 2024/05/06: We are delighted to announce that ColabSeprot and SeprotHub are now ready for use. ColabSeprot offers access to several state-of-the-art sequence-only Protein Language Models (PLMs), including ESM1b, ESM2, ProTrek, and ProtBert, enabling streamlined fine-tuning and prediction for protein sequence analysis.

  • 2025/01/01: We are delighted to announce that ColabSaprot v2 (New!!) and SaprotHub are now ready for use. [Note: ColabSaprot v1 will no longer be updated.] Check out the video tutorial and documentation.

Discussion Groups

Slack

2025/03/21: We've just set up a Slack group for discussing ColabSaprot and asking questions. You can join here!

WeChat

2024/12/01: You can scan the QR code to join our WeChat groups (we currently host 5 active WeChat groups with a total membership of over 1,000 researchers and practitioners):

<img src="https://github.com/user-attachments/assets/2ae718ba-3087-491b-869b-b4ba2c137386" width="50%">

We have 2 PhD positions for international students at Westlake University! See here.

Deploy ColabSaprot on a local server (Linux and Windows)

For users who want to deploy ColabSaprot on their own local server, please see here.

PLM members of OPMC

Open Protein Modeling Consortium (OPMC)

The Open Protein Modeling Consortium (OPMC) is a collaborative initiative aimed at bringing together the efforts of the protein research community. Its mission is to foster the sharing and co-development of resources, with an emphasis on individually trained decentralized models, helping to advance protein modeling through collective contributions. OPMC provides platforms/tools that support diverse protein prediction tasks, striving to make advanced protein modeling more accessible to researchers, regardless of their level of expertise in machine learning.

Visit our OPMC website here.

FAQs

Q1: It seems like OPMC and SaprotHub are intertwined but not exactly the same.

Yes, OPMC is a grand goal, and in this paper, it is primarily presented as a concept and vision. The paper introduces OPMC and implements SaprotHub as a pioneering example to drive the initial realization of OPMC. Achieving a broader implementation of OPMC requires continuous efforts from the entire community.

Our OPMC members are now building other ColabPLMs, including ColabESM1b, ColabESM2, ColabProtBert, ColabT5, ColabProTrek, etc.

Q2: I'm very interested in the OPMC side of this project. Would I be able to support OPMC independently?

Yes, you can. OPMC is not tied exclusively to SaprotHub. SaprotHub serves as an initial implementation case within the broader OPMC concept. We also welcome the inclusion of new protein models in OPMC. There are generally two ways to contribute: either independently of SaprotHub, such as by building ESMHub or ProtTransHub, or by joining SaprotHub. While SaprotHub is named after its first model, Saprot, it is not limited to Saprot alone and welcomes the inclusion of other language models. The concept of OPMC originated from the SaprotHub paper, so if you would like your protein model to be part of OPMC, or if you adopt a construction approach similar to SaprotHub's, we encourage you to cite the source paper. Also see Q9.

Q3: What's the relation between OPMC and the OpenFold Consortium?

The goal of the OpenFold Consortium is to develop free and open-source software tools. This differs from the goals of OPMC. OPMC aims to make it easy for all biologists (especially those without machine learning backgrounds and coding skills) to train their own protein models, and to share these models with the community members, allowing for integration and collaborative development on top of the existing community models.

Additionally, so far, the OpenFold Consortium seems to be focusing more on protein structure prediction, while OPMC is more focused on protein function prediction. Furthermore, the number of protein function task categories is far greater than the number of structure tasks. As a result, biologists often have to fine-tune large pre-trained protein models based on their own training data, which is a key feature of OPMC.

Q4: Is the idea to create a company that provides the resources for biologists to do model training? I'm unsure of the vision here, since a lot of model training is resource and data constrained. It would be hard to create something that allows "every biologist to train their own AI models with just a few clicks." Who would provide the resources in this case?

No. The primary motivation behind OPMC is to enable biologists to participate in protein model training and collaborative development. It does not involve creating a company or commercial operation.

Currently, we do not provide free training resources. Users have the option to purchase GPUs, such as the A100, on platforms like Colab. OPMC primarily supports fine-tuning tasks or direct prediction tasks for protein language models, rather than pre-training. These tasks typically do not require excessively expensive computational power. With a budget of around $1
