Tape
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
Tasks Assessing Protein Embeddings (TAPE)
Data, weights, and code for running the TAPE benchmark on a trained protein embedding. We provide a pretraining corpus, five supervised downstream tasks, pretrained language model weights, and benchmarking code. This code has been updated to use PyTorch; as a result, previous pretrained model weights and code will not work with it. The previous TensorFlow TAPE repository is still available at https://github.com/songlab-cal/tape-neurips2019.
This repository is not an effort to maintain maximum compatibility and reproducibility with the original paper, but is instead meant to facilitate ease of use and future development (both for us and for the community). Although we provide much of the same functionality, we have not tested every aspect of training on all models/downstream tasks, and we have also made some deliberate changes. Therefore, if your goal is to reproduce the results from our paper, please use the original code.
Our paper is available at https://arxiv.org/abs/1906.08230.
Some documentation is incomplete. We will try to fill it in over time, but if there is something you would like an explanation for, please open an issue so we know where to focus our effort!
Update 09/26/2020: We no longer recommend trying to train directly with TAPE's training code. It will likely still work for some time, but will not be updated for future PyTorch versions. Internally, we have been working with different frameworks for training (specifically PyTorch Lightning and Fairseq). We strongly recommend using a framework like these, as it offloads the requirement of maintaining compatibility with PyTorch versions. TAPE models will continue to be available, and if the code is working for you, feel free to use it. However, we will not be fixing issues regarding multi-GPU errors, OOM errors, etc. during training.
Installation
We recommend that you install tape into a python virtual environment using
$ pip install tape_proteins
Examples
Huggingface API for Loading Pretrained Models
We build on the excellent huggingface repository and use this as an API to define models, as well as to provide pretrained models. By using this API, pretrained models will be automatically downloaded when necessary and cached for future use.
import torch
from tape import ProteinBertModel, TAPETokenizer
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac') # iupac is the vocab for TAPE models, use unirep for the UniRep model
# Pfam Family: Hexapep, Clan: CL0536
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
output = model(token_ids)
sequence_output = output[0]
pooled_output = output[1]
# NOTE: pooled_output is *not* trained for the transformer, do not use
# w/o fine-tuning. A better option for now is to simply take a mean of
# the sequence output
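The mean suggested in the note above can be sketched as follows. This is a minimal illustration using NumPy on a random stand-in array rather than a real model output; the function name mean_pool, the array shapes, and the optional padding mask are assumptions made for the example and are not part of the TAPE API.

```python
import numpy as np


def mean_pool(sequence_output, mask=None):
    """Average per-position embeddings over the length axis.

    sequence_output: array of shape (batch, length, hidden).
    mask: optional array of shape (batch, length), 1 for real tokens and
          0 for padding, so that padded positions do not dilute the mean.
    """
    if mask is None:
        return sequence_output.mean(axis=1)
    weights = mask[..., None].astype(sequence_output.dtype)
    return (sequence_output * weights).sum(axis=1) / weights.sum(axis=1)


# Random stand-in for output[0]: 2 sequences, 5 positions, hidden size 8
# (hypothetical sizes chosen only to keep the example small).
rng = np.random.default_rng(0)
fake_sequence_output = rng.normal(size=(2, 5, 8))

# The second sequence is padded after 3 positions.
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])

pooled = mean_pool(fake_sequence_output, mask)
print(pooled.shape)
```

With a torch tensor and a single unpadded sequence, as in the snippet above, sequence_output.mean(dim=1) gives the same kind of per-sequence vector.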
Currently available pretrained models are:
If there is a particular pretrained model that you would like to use, please open an issue and we will try to add it!
Embedding Proteins with a Pretrained Model
Given an input fasta file, you can generate a .npz file containing the protein embeddings via the tape-embed command.
Suppose this is our input fasta file:
>seq1
GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ
>seq2
RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR
Then we could embed it with the UniRep babbler-1900 model like so:
tape-embed unirep my_input.fasta output_filename.npz babbler-1900 --tokenizer unirep
There is no need to download the pretrained model manually - it will be automatically downloaded if needed. In addition, note the change of tokenizer to the unirep tokenizer. UniRep uses a different vocabulary, and so requires this tokenizer. If you get a cublas runtime error, please double check that you changed the tokenizer correctly.
The embed function is fully batched and will automatically distribute across as many GPUs as the machine has available. On a Titan Xp, it can process around 200 sequences / second.
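If your sequences live in Python rather than in a file, the input can be prepared programmatically. The sketch below writes the two example sequences to a fasta file in a temporary directory and assembles the tape-embed command shown above as an argument list. The helper name write_fasta and the temporary directory are assumptions for illustration, and the command is only printed, not executed.

```python
import shlex
import tempfile
from pathlib import Path


def write_fasta(records, path):
    """Write (identifier, sequence) pairs to a fasta file."""
    with open(path, "w") as handle:
        for identifier, sequence in records:
            handle.write(f">{identifier}\n{sequence}\n")


records = [
    ("seq1", "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"),
    ("seq2", "RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR"),
]

workdir = Path(tempfile.mkdtemp())
fasta_path = workdir / "my_input.fasta"
write_fasta(records, fasta_path)

# Same arguments as the command above, as a list suitable for subprocess.
command = [
    "tape-embed", "unirep",
    str(fasta_path), str(workdir / "output_filename.npz"),
    "babbler-1900", "--tokenizer", "unirep",
]
print(shlex.join(command))

# To actually run it (requires tape_proteins to be installed):
# import subprocess
# subprocess.run(command, check=True)
```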
Once we have the output file, we can load it into numpy like so:
import numpy as np
arrays = np.load('output_filename.npz', allow_pickle=True)
list(arrays.keys())  # The names of the sequences in your fasta file (or '0', '1', ... if unnamed)
arrays[<protein_id>]  # Returns a dictionary with keys 'pooled' and 'avg' (or 'seq' if using the --full_sequence_embed flag)
By default, to save memory, TAPE returns the average of the sequence embedding along with the pooled embedding generated through the pooling function. For some models (like UniRep), the pooled embedding is trained, and so can be used out of the box. For other models (like the transformer), the pooled embedding is not trained, and so the average embedding should be used. We will be looking into methods for training the pooled embedding in a self-supervised way for all models in the future.
If you would like the full embedding rather than the average embedding, this can be specified to tape-embed by passing the --full_sequence_embed flag.
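To turn such a file into a feature matrix, something like the following may help. Because running tape-embed is outside the scope of a short example, the sketch first builds a stand-in .npz file with the layout described above (one entry per protein, each a dictionary with 'pooled' and 'avg'). That layout, the hidden size of 8, and the helper name load_embeddings are assumptions for illustration; check them against your real output file.

```python
import os
import tempfile

import numpy as np

# Build a stand-in file that mimics the described layout (an assumption).
rng = np.random.default_rng(0)
hidden = 8  # hypothetical embedding size, kept small for the example
fake_embeddings = {
    name: {"pooled": rng.normal(size=hidden), "avg": rng.normal(size=hidden)}
    for name in ("seq1", "seq2")
}
path = os.path.join(tempfile.mkdtemp(), "output_filename.npz")
np.savez(path, **fake_embeddings)


def load_embeddings(npz_path, key="avg"):
    """Return {protein_id: 1-D embedding} for the chosen key."""
    embeddings = {}
    with np.load(npz_path, allow_pickle=True) as arrays:
        for protein_id in arrays.files:
            entry = arrays[protein_id]
            # Dictionaries stored in an .npz come back wrapped in a
            # zero-dimensional object array; unwrap them with .item().
            if entry.dtype == object and entry.shape == ():
                entry = entry.item()
            embeddings[protein_id] = np.asarray(entry[key])
    return embeddings


embeddings = load_embeddings(path, key="avg")
names = sorted(embeddings)
features = np.stack([embeddings[name] for name in names])
print(names, features.shape)
```

The resulting features array has one row per protein and can be passed to any downstream classifier or regressor.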
Training a Language Model
TAPE provides two commands for training, tape-train and tape-train-distributed. The first command uses standard PyTorch data distribution to distribute across all available GPUs. The second one uses torch.distributed.launch-style multiprocessing to distribute across the specified number of GPUs (and could also be used for distributing across multiple nodes). We generally recommend using the second command, as it can provide a 10-15% speedup, but both will work.
To train the transformer on masked language modeling, for example, you could run:
tape-train-distributed transformer masked_language_modeling \
--batch_size BS \
--learning_rate LR \
--fp16 \
--warmup_steps WS \
--nproc_per_node NGPU \
--gradient_accumulation_steps NSTEPS
There are a number of features used in training:
* Distributed training via multiprocessing
* Half-precision training
* Gradient accumulation
* Gradient-allreduce post accumulation
* Automatic batch by sequence length
The first feature you are likely to need is gradient_accumulation_steps. TAPE specifies a relatively high batch size (1024) by default. This is the batch size used per backward pass, and it is divided by the number of GPUs as well as by the number of gradient accumulation steps. So with a batch size of 1024, 2 GPUs, and 1 gradient accumulation step, each GPU processes 512 examples at a time. If you run out of memory (and you likely will), TAPE provides a clear error message telling you to increase the gradient accumulation steps.
There are additional features as well that are not covered here. See tape-train-distributed --help for a list of all options.
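The arithmetic in the paragraph above can be written down as a small helper, which is handy when deciding how far to raise the accumulation steps. The function name and the use of integer division are assumptions made for this sketch; how TAPE handles batch sizes that do not divide evenly is not covered here.

```python
def examples_per_gpu(batch_size, n_gpus, gradient_accumulation_steps):
    """Number of examples each GPU holds at once, per the rule described above."""
    # Integer division is an assumption of this sketch.
    return batch_size // (n_gpus * gradient_accumulation_steps)


# The example from the text: batch size 1024, 2 GPUs, 1 accumulation step.
print(examples_per_gpu(1024, 2, 1))

# Raising the accumulation steps shrinks the per-GPU load.
for steps in (1, 2, 4, 8):
    print(steps, examples_per_gpu(1024, 2, steps))
```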
Evaluating a Language Model
Once you've trained a language model, you'll have a pretrained weight file located in the results folder. There are two ways to evaluate this model. One option is to directly evaluate the language modeling accuracy / perplexity: tape-train will report the perplexity over the training and validation sets at the end of each epoch. However, we find empirically that language modeling accuracy and perplexity are poor measures of performance on downstream tasks. The other option, which we strongly recommend, is to train your model on one or all of our provided tasks.
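For reference, the two language modeling metrics mentioned above are conventionally computed as sketched below: perplexity is the exponential of the mean negative log-likelihood of the target tokens, and accuracy is the fraction of positions where the top prediction matches the target. This follows the standard definitions and is not taken from TAPE's code; the function names and toy inputs are assumptions for illustration.

```python
import math


def perplexity(target_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    mean_nll = -sum(target_log_probs) / len(target_log_probs)
    return math.exp(mean_nll)


def accuracy(predicted_tokens, target_tokens):
    """Fraction of positions where the prediction equals the target."""
    matches = sum(p == t for p, t in zip(predicted_tokens, target_tokens))
    return matches / len(target_tokens)


# A model that spreads probability uniformly over the 20 standard amino
# acids assigns log(1/20) to every target, giving a perplexity of about 20.
uniform_log_probs = [math.log(1 / 20)] * 36
print(perplexity(uniform_log_probs))

print(accuracy("GCTVED", "GCTAED"))
```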
Training a Downstream Model
Training a model on a downstream task can also be done with the tape-train command (or tape-train-distributed). Simply use the same syntax as with training a language model, adding the flag --from_pretrained <path_to_your_saved_results>. To train a pretrained transformer on secondary structure prediction, for example, you would run:
tape-train-distributed transformer secondary_structure \
--from_pretrained results/<path_to_folder> \
--batch_size BS \
--learning_rate LR \
--fp16 \
--warmup_steps WS \
--nproc_per_node NGPU \
--gradient_accumulation_steps NSTEPS \
--num_train_epochs NEPOCH \
--eval_freq EF \
--save_freq SF
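When trying several settings, it can be convenient to generate such commands from Python. The sketch below builds one command line per combination of batch size and learning rate using only flags that appear in the command above. The helper name build_command and every numeric value are hypothetical placeholders, not recommended settings, and the pretrained path is left as the same placeholder used above. The commands are only printed, not run.

```python
import itertools
import shlex


def build_command(pretrained_dir, batch_size, learning_rate,
                  warmup_steps, num_train_epochs,
                  nproc_per_node=1, gradient_accumulation_steps=1):
    """Assemble a tape-train-distributed call for secondary structure."""
    return [
        "tape-train-distributed", "transformer", "secondary_structure",
        "--from_pretrained", pretrained_dir,
        "--batch_size", str(batch_size),
        "--learning_rate", str(learning_rate),
        "--fp16",
        "--warmup_steps", str(warmup_steps),
        "--nproc_per_node", str(nproc_per_node),
        "--gradient_accumulation_steps", str(gradient_accumulation_steps),
        "--num_train_epochs", str(num_train_epochs),
    ]


# Hypothetical values, chosen only to illustrate the sweep.
batch_sizes = [32, 64]
learning_rates = [1e-4, 1e-5]

commands = [
    build_command("results/<path_to_folder>", bs, lr,
                  warmup_steps=100, num_train_epochs=5)
    for bs, lr in itertools.product(batch_sizes, learning_rates)
]

for command in commands:
    print(shlex.join(command))
```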
When training a downstream model, you will likely need to experiment with hyperparameters to achieve the best results (optimal hyperparameters vary per task and per model). The parameters to consider are:
* Batch size
* Learning rate
* Warmup steps
* Num train epochs
These can all have significant effects on performance, and by default are set to maximize
