ProtTrans
ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer models.
Have a look at our paper ProtTrans: Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing for more information about our work.
<br/> <p align="center"> <img width="70%" src="https://github.com/agemagician/ProtTrans/raw/master/images/transformers_attention.png" alt="ProtTrans Attention Visualization"> </p> <br/>

This repository will be updated regularly with new pre-trained models for proteins as part of supporting the bioinformatics community in general, and Covid-19 research specifically, through our Accelerate SARS-CoV-2 research with transfer learning using pre-trained language modeling models project.
Table of Contents
- ⌛️ News
- 🚀 Installation
- 🚀 Quick Start
- ⌛️ Models Availability
- ⌛️ Dataset Availability
- 🚀 Usage
- 📊 Original downstream Predictions
- 📊 Followup use-cases
- 📊 Comparisons to other tools
- ❤️ Community and Contributions
- 📫 Have a question?
- 🤝 Found a bug?
- ✅ Requirements
- 🤵 Team
- 💰 Sponsors
- 📘 License
- ✏️ Citation
<a name="news"></a>
⌛️ News
- 2025/01/22: Continue pre-training & evo-tuning shows how to continue pre-training ProtT5 on new protein sequences using ProtT5's original pre-training task. This includes continued pre-training on a set of homologous sequences (aka evo-tuning).
- 2023/07/14: FineTuning with LoRA provides notebooks on how to fine-tune ProtT5 on both per-residue and per-protein tasks, using Low-Rank Adaptation (LoRA) for efficient fine-tuning (thanks @0syrys!).
- 2022/11/18: Availability: LambdaPP offers a simple web-service to access ProtT5-based predictions, and UniProt now offers pre-computed ProtT5 embeddings for download for a subset of selected organisms.
<a name="install"></a>
🚀 Installation
All our models are available via huggingface/transformers:
```
pip install torch
pip install transformers
pip install sentencepiece
```
For more details, please follow the instructions for transformers installations.
A recently introduced change in the T5 tokenizer results in `UnboundLocalError: cannot access local variable 'sentencepiece_model_pb2'`, which can be fixed either by installing transformers from this PR or by manually installing:

```
pip install protobuf
```
If you are using a transformers version after this PR, you will see this warning. Explicitly setting `legacy=True` will result in the expected behavior and avoid the warning. You can also safely ignore the warning, as `legacy=True` is the default.
<a name="quick"></a>
🚀 Quick Start
Example of how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as colab:
```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
if device == torch.device("cpu"):
    model.to(torch.float32)

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for the first ([0,:]) sequence in the batch
# and remove padded & special tokens ([0,:7]) -> "PRTEINO" has 7 residues
emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1024)

# same for the second ([1,:]) sequence, taking into account its different length ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)
```
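For batches of sequences with different lengths, the per-protein averaging above can be generalized by masking out padded positions with the attention mask instead of slicing each sequence by hand. A minimal sketch, illustrated with NumPy for clarity (in practice you would apply the same operations to the torch tensors above; note that, unlike the manual slicing, this simple version still includes the trailing special token in the average):

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average per-residue embeddings into one per-protein vector,
    ignoring padded positions indicated by the attention mask."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (B, L, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (B, D)
    counts = mask.sum(axis=1)                                     # (B, 1)
    return summed / counts                                        # (B, D)

# toy batch: 2 sequences, max length 4, embedding dim 3
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]], dtype=np.int64)

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # (2, 3)
```

The same two lines (multiply by the expanded mask, divide by the per-sequence token count) carry over directly to `embedding_repr.last_hidden_state` and `attention_mask` from the example above.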
We also have a script which simplifies deriving per-residue and per-protein embeddings from ProtT5 for a given FASTA file:
```
python prott5_embedder.py --input sequences/some.fasta --output embeddings/residue_embeddings.h5
python prott5_embedder.py --input sequences/some.fasta --output embeddings/protein_embeddings.h5 --per_protein 1
```
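The script writes HDF5 files, which can be read back with `h5py`. A hedged sketch of reading per-protein embeddings, assuming the file stores one dataset per sequence identifier (a small stand-in file is created first so the sketch is self-contained; the sequence IDs here are made up):

```python
import h5py
import numpy as np

# create a small stand-in file mimicking the assumed output layout:
# one (1024,) dataset per sequence identifier
with h5py.File("protein_embeddings.h5", "w") as f:
    f.create_dataset("seq_0", data=np.zeros((1024,), dtype=np.float32))
    f.create_dataset("seq_1", data=np.ones((1024,), dtype=np.float32))

# read the per-protein embeddings back into a dict keyed by sequence ID
embeddings = {}
with h5py.File("protein_embeddings.h5", "r") as f:
    for seq_id, dset in f.items():
        embeddings[seq_id] = np.asarray(dset)

print(sorted(embeddings), embeddings["seq_0"].shape)
```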
<a name="models"></a>
⌛️ Models Availability
| Model | Hugging Face | Zenodo | Colab |
| ----------------------------- | :---: | :---: | :---: |
| ProtT5-XL-UniRef50 (also ProtT5-XL-U50) | Download | Download | Colab |
| ProtT5-XL-BFD | Download | Download | |
| ProtT5-XXL-UniRef50 | Download | Download | |
| ProtT5-XXL-BFD | Download | Download | |
| ProtBert-BFD | Download | Download | |
| ProtBert | Download | Download | |
| ProtAlbert | Download | Download | |
| ProtXLNet | Download | Download | |
| ProtElectra-Generator-BFD | Download | Download | |
| ProtElectra-Discriminator-BFD | Download | Download | |
<a name="datasets"></a>
⌛️ Datasets Availability
| Dataset | Dropbox |
| ----------------------------- | :---------------------------------------------------------------------------: |
| NEW364 | Download |
| Netsurfp2 | Download |
| CASP12 | Download |
| CB513 | Download |
| TS115 | Download |
| DeepLoc Train | Download |
| DeepLoc Test | Download |
<a name="usage"></a>
🚀 Usage
How to use ProtT