Tape

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.

Tasks Assessing Protein Embeddings (TAPE)

Data, weights, and code for running the TAPE benchmark on a trained protein embedding. We provide a pretraining corpus, five supervised downstream tasks, pretrained language model weights, and benchmarking code. This code has been updated to use PyTorch, so previous pretrained model weights and code will not work with it. The previous TensorFlow TAPE repository is still available at https://github.com/songlab-cal/tape-neurips2019.

This repository is not an effort to maintain maximum compatibility and reproducibility with the original paper. It is instead meant to facilitate ease of use and future development (both for us and for the community). Although we provide much of the same functionality, we have not tested every aspect of training on all models and downstream tasks, and we have also made some deliberate changes. Therefore, if your goal is to reproduce the results from our paper, please use the original code.

Our paper is available at https://arxiv.org/abs/1906.08230.

Some documentation is incomplete. We will try to fill it in over time, but if there is something you would like an explanation for, please open an issue so we know where to focus our effort!

Update 09/26/2020: We no longer recommend trying to train directly with TAPE's training code. It will likely still work for some time, but will not be updated for future PyTorch versions. Internally, we have been working with different frameworks for training (specifically PyTorch Lightning and Fairseq). We strongly recommend using a framework like these, as it offloads the requirement of maintaining compatibility with PyTorch versions. TAPE models will continue to be available, and if the code is working for you, feel free to use it. However, we will not be fixing issues regarding multi-GPU errors, OOM errors, etc. during training.

Installation

We recommend that you install tape into a python virtual environment using

$ pip install tape_proteins

Examples

Huggingface API for Loading Pretrained Models

We build on the excellent huggingface repository and use this as an API to define models, as well as to provide pretrained models. By using this API, pretrained models will be automatically downloaded when necessary and cached for future use.

import torch
from tape import ProteinBertModel, TAPETokenizer
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # iupac is the vocab for TAPE models, use unirep for the UniRep model

# Pfam Family: Hexapep, Clan: CL0536
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
output = model(token_ids)
sequence_output = output[0]
pooled_output = output[1]

# NOTE: pooled_output is *not* trained for the transformer, do not use
# w/o fine-tuning. A better option for now is to simply take a mean of
# the sequence output
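
The note above suggests averaging the sequence output instead of using the untrained pooled output. A minimal sketch of that reduction follows. It uses a random NumPy array as a stand-in for `sequence_output`, and the `masked_mean` helper, the `attention_mask` array, and the hidden size of 8 are all illustrative assumptions rather than part of TAPE. For a single unpadded sequence held as a torch tensor, `sequence_output.mean(dim=1)` performs the same averaging.

```python
import numpy as np


def masked_mean(sequence_output, attention_mask):
    """Average per-residue embeddings, ignoring padded positions.

    sequence_output: array of shape (batch, length, hidden)
    attention_mask:  array of shape (batch, length), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(sequence_output.dtype)  # (batch, length, 1)
    summed = (sequence_output * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                   # guard against empty rows
    return summed / counts


# Stand-in data (hypothetical): 2 sequences, 5 positions, hidden size 8.
rng = np.random.default_rng(0)
sequence_output = rng.normal(size=(2, 5, 8)).astype(np.float32)
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])  # second sequence has 2 padded positions

embedding = masked_mean(sequence_output, attention_mask)
print(embedding.shape)  # (2, 8)
```

The mask only matters when sequences of different lengths are padded into one batch; whether to also exclude start and stop tokens from the average is a separate choice that this sketch does not make.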

Currently available pretrained models are:

  • bert-base (Transformer model)
  • babbler-1900 (UniRep model)
  • xaa, xab, xac, xad, xae (trRosetta model)

If there is a particular pretrained model that you would like to use, please open an issue and we will try to add it!

Embedding Proteins with a Pretrained Model

Given an input FASTA file, you can generate a .npz file containing the protein embeddings via the tape-embed command.

Suppose this is our input fasta file:

>seq1
GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ
>seq2
RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR

Then we could embed it with the UniRep babbler-1900 model like so:

tape-embed unirep my_input.fasta output_filename.npz babbler-1900 --tokenizer unirep

There is no need to download the pretrained model manually; it will be downloaded automatically if needed. Also note the change of tokenizer to the unirep tokenizer. UniRep uses a different vocabulary, and so requires this tokenizer. If you get a cublas runtime error, please double check that you changed the tokenizer correctly.

The embed function is fully batched and will automatically distribute across as many GPUs as the machine has available. On a Titan Xp, it can process around 200 sequences / second.

Once we have the output file, we can load it into numpy like so:

import numpy as np

arrays = np.load('output_filename.npz', allow_pickle=True)

list(arrays.keys())  # Will output the names of the sequences in your fasta file (or if unnamed then '0', '1', ...)

arrays[<protein_id>]  # Returns a dictionary with keys 'pooled' and 'avg' (or 'seq' if using the --full_sequence_embed flag)

By default, to save memory, TAPE returns the average of the sequence embedding along with the pooled embedding generated by the pooling function. For some models (like UniRep), the pooled embedding is trained, and so can be used out of the box. For other models (like the transformer), the pooled embedding is not trained, and so the average embedding should be used instead. We will be looking into methods for self-supervised training of the pooled embedding for all models in the future.

If you would like the full embedding rather than the average embedding, this can be specified to tape-embed by passing the --full_sequence_embed flag.
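
As a sketch of how the loaded file might be turned into a feature matrix, the snippet below collects the 'avg' vector for every protein. Because no real tape-embed output is available here, it first writes a mock .npz with the layout described above; the mock file, the 4-dimensional vectors, and the `load_embeddings` helper are assumptions made for illustration. Depending on how the dictionaries were pickled, NumPy may return each entry wrapped in a 0-d object array, so the helper unwraps that case.

```python
import os
import tempfile

import numpy as np


def load_embeddings(path, key="avg"):
    """Return {protein_id: vector} from an .npz file of per-protein dictionaries."""
    result = {}
    with np.load(path, allow_pickle=True) as arrays:
        for protein_id in arrays.files:
            entry = arrays[protein_id]
            # A pickled dict may come back as a 0-d object array; unwrap it.
            if isinstance(entry, np.ndarray) and entry.dtype == object:
                entry = entry.item()
            result[protein_id] = np.asarray(entry[key])
    return result


# Mock file standing in for real tape-embed output (hypothetical values).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "mock_output.npz")
np.savez(path,
         seq1={"pooled": np.zeros(4), "avg": np.ones(4)},
         seq2={"pooled": np.zeros(4), "avg": np.full(4, 2.0)})

embeddings = load_embeddings(path)
matrix = np.stack([embeddings[name] for name in sorted(embeddings)])
print(matrix.shape)  # (2, 4)
```

Passing `key="pooled"` would select the pooled embedding instead, which per the text above is only appropriate for models whose pooled output is trained.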

Training a Language Model

TAPE provides two commands for training, tape-train and tape-train-distributed. The first command uses standard PyTorch data distribution to distribute across all available GPUs. The second uses torch.distributed.launch-style multiprocessing to distribute across the specified number of GPUs (and could also be used for distributing across multiple nodes). We generally recommend the second command, as it can provide a 10-15% speedup, but both will work.

To train the transformer on masked language modeling, for example, you could run this

tape-train-distributed transformer masked_language_modeling --batch_size BS --learning_rate LR --fp16 --warmup_steps WS --nproc_per_node NGPU --gradient_accumulation_steps NSTEPS

There are a number of features used in training:

* Distributed training via multiprocessing
* Half-precision training
* Gradient accumulation
* Gradient-allreduce post accumulation
* Automatic batch by sequence length

The first feature you are likely to need is gradient_accumulation_steps. TAPE specifies a relatively high batch size (1024) by default. This is the total batch size; it is divided by the number of GPUs and by the number of gradient accumulation steps to give the number of examples each GPU processes per backward pass. So with a batch size of 1024, 2 GPUs, and 1 gradient accumulation step, each GPU processes 512 examples at a time. If you run out of memory (and you likely will), TAPE provides a clear error message telling you to increase the gradient accumulation steps.
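
The arithmetic above can be sketched as follows. The helper name `per_gpu_batch_size` is hypothetical, and the use of floor division when the numbers do not divide evenly is an assumption; the source does not say how TAPE handles that case.

```python
def per_gpu_batch_size(batch_size, n_gpus, gradient_accumulation_steps):
    """Examples each GPU processes per backward pass.

    Assumes the total batch size is split evenly across GPUs and
    accumulation steps (floor division is an assumption of this sketch).
    """
    return batch_size // (n_gpus * gradient_accumulation_steps)


# The example from the text: 1024 examples, 2 GPUs, 1 accumulation step.
print(per_gpu_batch_size(1024, 2, 1))  # 512

# Raising the accumulation steps shrinks the per-GPU memory footprint.
print(per_gpu_batch_size(1024, 2, 8))  # 64
```

Increasing the accumulation steps trades memory for time: the total batch size stays at 1024, but it takes more backward passes to accumulate it.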

There are additional features that are not covered here. See tape-train-distributed --help for a list of all options.

Evaluating a Language Model

Once you've trained a language model, you'll have a pretrained weight file located in the results folder. To evaluate this model, you can do one of two things. One option is to directly evaluate the language modeling accuracy / perplexity. tape-train will report the perplexity over the training and validation set at the end of each epoch. However, we find empirically that language modeling accuracy and perplexity are poor measures of performance on downstream tasks. Therefore, to evaluate the language model we strongly recommend training your model on one or all of our provided tasks.
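
For readers unfamiliar with the two metrics, here is a minimal sketch of how language modeling accuracy and perplexity are conventionally defined: perplexity is the exponential of the mean negative log-likelihood of the target tokens. This is the standard definition, not a copy of TAPE's implementation, and the helper names and toy inputs are illustrative assumptions.

```python
import math


def perplexity(target_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    mean_nll = -sum(target_log_probs) / len(target_log_probs)
    return math.exp(mean_nll)


def accuracy(predicted_tokens, target_tokens):
    """Fraction of positions where the predicted token matches the target."""
    correct = sum(p == t for p, t in zip(predicted_tokens, target_tokens))
    return correct / len(target_tokens)


# A model that spreads probability uniformly over the 20 standard amino acids
# assigns log(1/20) to every target, giving a perplexity of about 20.
uniform_log_probs = [math.log(1 / 20)] * 10
print(round(perplexity(uniform_log_probs), 6))

print(accuracy(list("GCTV"), list("GCAV")))  # 0.75
```

A perplexity near 20 therefore corresponds to guessing uniformly among the standard amino acids, which gives a rough baseline for reading the reported numbers.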

Training a Downstream Model

Training a model on a downstream task can also be done with the tape-train (or tape-train-distributed) command. Simply use the same syntax as for training a language model, adding the flag --from_pretrained <path_to_your_saved_results>. To train a pretrained transformer on secondary structure prediction, for example, you would run

tape-train-distributed transformer secondary_structure \
    --from_pretrained results/<path_to_folder> \
    --batch_size BS \
    --learning_rate LR \
    --fp16 \
    --warmup_steps WS \
    --nproc_per_node NGPU \
    --gradient_accumulation_steps NSTEPS \
    --num_train_epochs NEPOCH \
    --eval_freq EF \
    --save_freq SF

When training a downstream model, you will likely need to experiment with hyperparameters to achieve the best results (optimal hyperparameters vary per task and per model). The parameters to consider are

* Batch size
* Learning rate
* Warmup steps
* Num train epochs

These can all have significant effects on performance, and by default are set to maximize
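
One way to explore the four hyperparameters listed above is a small grid search that generates one training command per combination. The sketch below only builds the command strings and does not run anything. The `build_commands` helper and the grid values are hypothetical, the `results/<path_to_folder>` placeholder is kept from the example above and must be replaced with a real path, and only flags that appear in this README are used.

```python
import itertools
import shlex


def build_commands(grid, pretrained="results/<path_to_folder>"):
    """Return one tape-train-distributed command string per grid combination."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["tape-train-distributed", "transformer", "secondary_structure",
               "--from_pretrained", pretrained]
        for name, value in zip(names, values):
            cmd += [f"--{name}", str(value)]
        commands.append(shlex.join(cmd))  # shell-quotes the placeholder path
    return commands


# Hypothetical search space; sensible values depend on the task and model.
grid = {
    "batch_size": [32, 64],
    "learning_rate": [1e-4, 1e-5],
    "warmup_steps": [1000],
    "num_train_epochs": [10],
}

commands = build_commands(grid)
print(len(commands))  # 4
for command in commands:
    print(command)
```

Flags such as --nproc_per_node and --gradient_accumulation_steps are left out because they affect throughput and memory rather than the quantities being searched; they would be appended unchanged to every command.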
