
ProtNote

Description

Understanding protein sequence-function relationships is essential for advancing protein biology and engineering. However, fewer than 1% of known protein sequences have human-verified functions, and scientists continually update the set of possible functions. While deep learning methods have demonstrated promise for protein function prediction, current models are limited to predicting only those functions on which they were trained. Here, we introduce ProtNote, a multimodal deep learning model that leverages free-form text to enable both supervised and zero-shot protein function prediction. ProtNote not only maintains near state-of-the-art performance for annotations in its train set, but also generalizes to unseen and novel functions in zero-shot test settings. We envision that ProtNote will enhance protein function discovery by enabling scientists to use free text inputs, without restriction to predefined labels – a necessary capability for navigating the dynamic landscape of protein biology.

<p align="center"> <img src="img/main_fig.jpg" /> </p>


Installation

git clone https://github.com/microsoft/protnote.git
cd protnote
conda env create -f environment.yml
conda activate protnote
pip install -e ./  # make sure ./ is the dir including setup.py

Config

Most hyperparameters and paths are managed through the base_config.yaml. Whenever reasonable, we enforce certain files to be in specific directories to increase consistency and reproducibility. In general, we adhere to the following data argument naming conventions in scripts:

  • Argument ending in "dir" corresponds to the full path of the folder where a file is located. E.g., data/swissprot/
  • Argument ending in "path" corresponds to the full file path. E.g., data/swissprot/myfile.fasta
  • Argument ending in "file" corresponds to the full file name alone (including the extension). E.g., myfile.fasta. This is used for files with enforced location within the data folder structure
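The three conventions are related in the obvious way; a minimal sketch (the paths are illustrative, matching the examples above):

```python
from pathlib import Path

# Hypothetical argument values following the README's naming conventions.
swissprot_dir = Path("data/swissprot")                # *dir: folder containing the file
swissprot_path = Path("data/swissprot/myfile.fasta")  # *path: full file path
swissprot_file = "myfile.fasta"                       # *file: file name only, with extension

# The three forms compose: dir / file == path
assert swissprot_dir / swissprot_file == swissprot_path
```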

Notation

The following notes will make it easier to navigate these instructions:

  • We denote user-defined inputs using all-caps text surrounded by curly braces. For example, {DATA_SET_PATH}, should be replaced by the user with a dataset path like path/to/dataset, without the curly braces.
  • We refer to specific keys in the base_config.yaml file using this format: KEY_NAME. For example, TRAIN_DATA_PATH and TEST_DATA_PATH refer to the paths for the datasets used to train and test ProtNote in the supervised setting.

Data

We train and test ProtNote with protein sequences from the SwissProt section of UniProt, corresponding to sequences with human-verified functions. Further, we evaluate ProtNote on different zero-shot scenarios, including prediction of unseen/novel GO terms and of EC numbers -- a type of annotation the model was not trained on.

All the data needed to train and run inference with ProtNote is available in the data.zip file (17.6 GB), which can be downloaded from Zenodo by running the following commands from the protnote root folder:

sudo apt-get install unzip
curl -o data.zip "https://zenodo.org/records/13897920/files/data.zip?download=1"
unzip data.zip

The data folder has the following structure:

  • data/
    • annotations/: contains the text descriptions of all the GO and EC annotations for the 2019 and 2024 releases used for ProtNote.
    • embeddings/: stores the text description embeddings that are cached during training.
    • models/: holds ProtNote and ProteInfer weights for multiple seeds.
    • swissprot/: contains all SwissProt fasta files.
    • vocabularies/: holds the 2019 and 2024 GO graphs in a simple json format, which relates each annotation with its parents.
    • zero_shot/: contains the datasets used in the zero-shot evaluation setting.
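The vocabularies JSON relates each annotation to its parents, so transitive ancestors can be collected with a simple traversal. A sketch under the assumption that the file deserializes to a dict mapping each term ID to a list of parent IDs (the miniature graph below is hypothetical):

```python
# Hypothetical miniature of a vocabularies JSON after json.load():
# each key maps a GO term to the list of its direct parents.
go_graph = {
    "GO:0008150": [],                # root term, no parents
    "GO:0009987": ["GO:0008150"],
    "GO:0007049": ["GO:0009987"],
}

def ancestors(term, graph):
    """Collect all transitive parents of a term via depth-first traversal."""
    seen = set()
    stack = list(graph.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen

# ancestors("GO:0007049", go_graph) -> {"GO:0009987", "GO:0008150"}
```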

The names of the main datasets used in the paper are listed below. These names correspond (in most cases) to the keys in paths/data_paths in the base_config.yaml.

  • {TRAIN,VAL,TEST}_DATA_PATH: correspond to the train, validation, and test sets used for training ProtNote. These are consistent with ProteInfer datasets.
  • TEST_DATA_PATH_ZERO_SHOT: zero-shot dataset for unseen, novel GO terms.
  • TEST_DATA_PATH_ZERO_SHOT_LEAF_NODES: zero-shot dataset for unseen, novel GO terms, but only for the leaf nodes of the GO graph.
  • TEST_EC_DATA_PATH_ZERO_SHOT: zero-shot dataset of EC numbers, a dataset and type of annotation which ProtNote was not trained on.
  • TEST_2024_PINF_VOCAB_DATA_PATH: TEST_DATA_PATH updated with the July 2024 GO annotations, but only including GO terms in the ProteInfer vocabulary. This dataset was used to isolate and quantify the impact of the changes in GO.
  • test_*_GO.fasta: smaller test sets used for runtime calculations.
  • TEST_TOP_LABELS_DATA_PATH: a subset of TEST_DATA_PATH, based on a sample of sequences and only the most frequent GO terms. This dataset was used for the embedding space analysis.

Train and run inference with ProtNote

To train and test with ProtNote you will need: ProtNote weights, an annotations file, generated function description text embeddings, and train/validation/test datasets.

You can use the main.py script for both training and inference. Refer to Inference and Training for details.

ProtNote weights

There are five sets of weights (one for each seed) available in data/models/ProtNote, with the pattern: data/models/ProtNote/seed_replicates_v9_{SEED}_sum_last_epoch.pt, where {SEED} can be any of 12,22,32,42,52. The model weights are passed through the argument --model-file.
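For reference, the five weight paths can be enumerated programmatically (the pattern is taken verbatim from above; only the enumeration itself is new):

```python
# Build the five ProtNote weight-file paths, one per training seed.
seeds = [12, 22, 32, 42, 52]
weight_files = [
    f"data/models/ProtNote/seed_replicates_v9_{seed}_sum_last_epoch.pt"
    for seed in seeds
]
# Each entry can be passed to main.py via --model-file.
```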

Annotations file

This is a pickle file storing a pandas dataframe with the annotations and their text descriptions. The dataframe's index should be the function IDs, and the dataframe should have at least three columns: "label", "name", "synonym_exact". In the Gene Ontology, each term has a short description called "name", a long description called "label", and a list of equivalent descriptions called "synonym_exact". If using ProtNote for zero-shot inference on annotations other than GO annotations, the values of the "label" and "name" columns can be identical, while the values for the "synonym_exact" column can be empty lists.
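For custom zero-shot annotations, such a file can be built directly with pandas. A minimal sketch (the function IDs, descriptions, and output file name are hypothetical; per the note above, for non-GO annotations "label" may simply duplicate "name" and "synonym_exact" may hold empty lists):

```python
import pandas as pd

# Hypothetical custom annotations for zero-shot inference on non-GO labels.
# Index = function IDs; required columns: "label", "name", "synonym_exact".
annotations = pd.DataFrame(
    {
        "label": ["ATP binding", "DNA repair"],          # long description
        "name": ["ATP binding", "DNA repair"],           # short description (same here)
        "synonym_exact": [[], []],                       # no synonyms for custom labels
    },
    index=["FUNC:0001", "FUNC:0002"],
)
annotations.to_pickle("my_annotations.pkl")  # illustrative file name
```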

To seamlessly create the annotations file for GO annotations or EC numbers, we provide the download_GO_annotations.py and download_EC_annotations.py scripts. To get the GO annotations run:

python bin/download_GO_annotations.py --url {GO_ANNOTATIONS_RELEASE_URL} --output-file {OUTPUT_FILE_NAME}

Where {GO_ANNOTATIONS_RELEASE_URL} is a specific GO release (e.g., https://release.geneontology.org/2024-06-17/ontology/go.obo) and {OUTPUT_FILE_NAME} is the name of the annotations file that will be stored in data/annotations/ (e.g., go_annotations_jul_2024.pkl).

To download the latest EC annotations, run:

python bin/download_EC_annotations.py

Function description text embeddings

For each sequence, ProtNote computes the likelihood that it is annotated with any of the available functional annotations in the dataset. To avoid repeatedly embedding the same functional text descriptions for every sequence, we calculate the text embeddings once and cache them for use during inference and training. This allows us to perform only num_labels forward passes through the text encoder, instead of num_sequences × num_labels.
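The caching idea can be illustrated with a toy sketch (the `embed` function below is a stand-in for the real text encoder, and the labels/sequences are made up; only the counting logic matters):

```python
def embed(text):
    """Placeholder for a text-encoder forward pass."""
    return [float(len(text))]

calls = 0
cache = {}

def cached_embed(text):
    global calls
    if text not in cache:
        calls += 1          # one forward pass per unique label description...
        cache[text] = embed(text)
    return cache[text]      # ...then reused for every sequence

labels = ["ATP binding", "DNA repair"]
sequences = ["seq_a", "seq_b", "seq_c"]

for _ in sequences:         # score every (sequence, label) pair
    for label in labels:
        cached_embed(label)

# calls == len(labels), not len(sequences) * len(labels)
assert calls == 2
```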

To generate the embeddings that we used to train ProtNote, execute the following code:

python bin/generate_label_embeddings.py --base-label-embedding-path {EMBEDDING_PATH_CONFIG_KEY} --annotations-path-name {ANNOTATIONS_PATH_CONFIG_KEY} --add-instruction --account-for-sos

  • {EMBEDDING_PATH_CONFIG_KEY}: should be a key from the config that specifies the "base" path name where the embeddings will be stored. It's called "base" because {EMBEDDING_PATH_CONFIG_KEY} will be modified based on some of the arguments passed to the script, such as the pooling method.
  • {ANNOTATIONS_PATH_CONFIG_KEY}: the pkl file in data/annotations/ containing the text descriptions and created in the previous Annotations file step.

Other arguments default to the values we used during training and inference:

  • --label-encoder-checkpoint: defaults to intfloat/multilingual-e5-large-instruct.
