# ProtNote
## Description
Understanding protein sequence-function relationships is essential for advancing protein biology and engineering. However, fewer than 1% of known protein sequences have human-verified functions, and scientists continually update the set of possible functions. While deep learning methods have demonstrated promise for protein function prediction, current models are limited to predicting only those functions on which they were trained. Here, we introduce ProtNote, a multimodal deep learning model that leverages free-form text to enable both supervised and zero-shot protein function prediction. ProtNote not only maintains near state-of-the-art performance for annotations in its train set, but also generalizes to unseen and novel functions in zero-shot test settings. We envision that ProtNote will enhance protein function discovery by enabling scientists to use free text inputs, without restriction to predefined labels – a necessary capability for navigating the dynamic landscape of protein biology.
<p align="center"> <img src="img/main_fig.jpg" /> </p>

## Table of Contents
<!-- markdown-toc -i README.md -->
<!-- toc -->

- Installation
- Config
- Notation
- Data
- Train and run inference with ProtNote
- Reproducing paper results
- Other useful scripts
- Contributing
- Trademarks
## Installation

```bash
git clone https://github.com/microsoft/protnote.git
cd protnote
conda env create -f environment.yml
conda activate protnote
pip install -e ./  # make sure ./ is the dir containing setup.py
```
## Config
Most hyperparameters and paths are managed through `base_config.yaml`. Whenever reasonable, we enforce certain files to be in specific directories to increase consistency and reproducibility. In general, we adhere to the following data argument naming conventions in scripts:
- Arguments ending in "dir" correspond to the full path of the folder where a file is located, e.g., `data/swissprot/`
- Arguments ending in "path" correspond to the full file path, e.g., `data/swissprot/myfile.fasta`
- Arguments ending in "file" correspond to the full file name alone, including the extension, e.g., `myfile.fasta`. This is used for files with an enforced location within the data folder structure.
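The three conventions compose naturally: a `*_path` is just a `*_dir` joined with a `*_file`. As a minimal sketch (the helper name is hypothetical, not part of the codebase):

```python
import os

def resolve_path(data_dir: str, data_file: str) -> str:
    """Join a *_dir argument and a *_file argument into a full *_path."""
    return os.path.join(data_dir, data_file)

full_path = resolve_path("data/swissprot", "myfile.fasta")
print(full_path)  # data/swissprot/myfile.fasta
```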
## Notation
The following notes will make it easier to navigate these instructions:
- We denote user-defined inputs using all-caps text surrounded by curly braces. For example, `{DATA_SET_PATH}` should be replaced by the user with a dataset path like `path/to/dataset`, without the curly braces.
- We refer to specific keys in the `base_config.yaml` file using this format: `KEY_NAME`. For example, `TRAIN_DATA_PATH` and `TEST_DATA_PATH` refer to the paths for the datasets used to train and test ProtNote in the supervised setting.
## Data
We train and test ProtNote with protein sequences from the SwissProt section of UniProt, corresponding to sequences with human-verified functions. Further, we evaluate ProtNote on different zero-shot scenarios, including prediction of unseen/novel GO terms and of EC numbers -- a type of annotation the model was not trained on.
All the data to train and run inference with ProtNote is available in the `data.zip` file (17.6 GB), which can be downloaded from Zenodo using the following commands from the protnote root folder:

```bash
sudo apt-get install unzip
curl -o data.zip "https://zenodo.org/records/13897920/files/data.zip?download=1"
unzip data.zip
```
The data folder has the following structure:
- data/
- annotations/: contains the text descriptions of all the GO and EC annotations for the 2019 and 2024 releases used for ProtNote.
- embeddings/: stores the text description embeddings that are cached during training.
- models/: holds ProtNote and ProteInfer weights for multiple seeds.
- swissprot/: contains all SwissProt fasta files.
- vocabularies/: holds the 2019 and 2024 GO graphs in a simple json format, which relates each annotation with its parents.
- zero_shot/: contains the datasets used in the zero-shot evaluation setting.
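The parent-mapping format in `data/vocabularies/` makes it straightforward to walk the GO hierarchy. A minimal sketch, using a tiny hypothetical excerpt of the JSON rather than the real file:

```python
import json

# Hypothetical excerpt of the parent-mapping JSON in data/vocabularies/
go_graph = json.loads("""{
  "GO:0006397": ["GO:0016071"],
  "GO:0016071": ["GO:0090304"],
  "GO:0090304": ["GO:0008150"],
  "GO:0008150": []
}""")

def ancestors(term: str, graph: dict) -> set:
    """Collect all transitive parents of a GO term."""
    seen = set()
    stack = list(graph.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen

print(ancestors("GO:0006397", go_graph))
```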
The names of the main datasets used in the paper are listed below. These names correspond (in most cases) to the keys in `paths/data_paths` in the `base_config.yaml`.
- `{TRAIN,VAL,TEST}_DATA_PATH`: the train, validation, and test sets used for training ProtNote. These are consistent with the ProteInfer datasets.
- `TEST_DATA_PATH_ZERO_SHOT`: zero-shot dataset for unseen, novel GO terms.
- `TEST_DATA_PATH_ZERO_SHOT_LEAF_NODES`: zero-shot dataset for unseen, novel GO terms, restricted to the leaf nodes of the GO graph.
- `TEST_EC_DATA_PATH_ZERO_SHOT`: zero-shot dataset of EC numbers, a dataset and type of annotation which ProtNote was not trained on.
- `TEST_2024_PINF_VOCAB_DATA_PATH`: `TEST_DATA_PATH` updated with the July 2024 GO annotations, but only including GO terms in the ProteInfer vocabulary. This dataset was used to isolate and quantify the impact of the changes in GO.
- `test_*_GO.fasta`: smaller test sets used for runtime calculations.
- `TEST_TOP_LABELS_DATA_PATH`: a subset of `TEST_DATA_PATH`, based on a sample of sequences and only the most frequent GO terms. This dataset was used for the embedding space analysis.
## Train and run inference with ProtNote
To train and test with ProtNote you will need: ProtNote weights, an annotations file, generated function description text embeddings, and train/validation/test datasets.
You can use the `main.py` script for both training and inference. Refer to the Inference and Training sections for details.
### ProtNote weights
There are five sets of weights (one for each seed) available in `data/models/ProtNote`, following the pattern `data/models/ProtNote/seed_replicates_v9_{SEED}_sum_last_epoch.pt`, where `{SEED}` can be any of `12`, `22`, `32`, `42`, `52`. The model weights are passed through the argument `--model-file`.
### Annotations file
This is a pickle file storing a pandas dataframe with the annotations and their text descriptions. The dataframe's index should be the function IDs, and the dataframe should have at least three columns: "label", "name", "synonym_exact". In the Gene Ontology, each term has a short description called "name", a long description called "label", and a list of equivalent descriptions called "synonym_exact". If using ProtNote for zero-shot inference on annotations other than GO annotations, the values of the "label" and "name" columns can be identical, while the values for the "synonym_exact" column can be empty lists.
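The expected shape of that file can be sketched as follows. The rows below are hypothetical (a GO-style row with distinct short/long descriptions, and a non-GO row illustrating the zero-shot case where "label" and "name" coincide); the real files are produced by the download scripts described next:

```python
import pandas as pd

# Hypothetical annotations dataframe: index = function IDs,
# columns = "label" (long description), "name" (short description),
# "synonym_exact" (list of equivalent descriptions).
annotations = pd.DataFrame(
    {
        "label": [
            "A molecule that contributes to the structural integrity of the ribosome.",
            "carboxylesterase activity",  # non-GO: label == name
        ],
        "name": [
            "structural constituent of ribosome",
            "carboxylesterase activity",
        ],
        "synonym_exact": [[], []],
    },
    index=["GO:0003735", "EC:3.1.1.1"],
)
annotations.to_pickle("my_annotations.pkl")  # same pickle format the scripts emit
```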
To seamlessly create the annotations file for GO annotations or EC numbers, we provide the `download_GO_annotations.py` and `download_EC_annotations.py` scripts. To get the GO annotations run:
```bash
python bin/download_GO_annotations.py --url {GO_ANNOTATIONS_RELEASE_URL} --output-file {OUTPUT_FILE_NAME}
```
Where `{GO_ANNOTATIONS_RELEASE_URL}` is a specific GO release (e.g., https://release.geneontology.org/2024-06-17/ontology/go.obo) and `{OUTPUT_FILE_NAME}` is the name of the annotations file that will be stored in `data/annotations/` (e.g., `go_annotations_jul_2024.pkl`).
To download the latest EC annotations, run:
```bash
python bin/download_EC_annotations.py
```
### Function description text embeddings
For each sequence, ProtNote computes the likelihood that it is annotated with any of the available functional annotations in the dataset. To avoid repeatedly embedding the same functional text descriptions for every sequence, we calculate the text embeddings once and cache them for use during inference and training. This allows us to perform only `num_labels` forward passes through the text encoder, instead of `num_sequences × num_labels`.
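The caching idea can be illustrated with a toy sketch. This is not ProtNote's actual implementation: the encoder below is a stand-in function and the labels/sequences are made up, but the counting argument is the same:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def embed_description(text: str) -> tuple:
    """Hypothetical stand-in for one forward pass through the text encoder."""
    global calls
    calls += 1
    return (sum(map(ord, text)) / len(text),)  # toy 1-d "embedding"

labels = ["DNA binding", "kinase activity"]
sequences = ["MKTAY", "MVLSP", "MAAQT"]

# The cache means each description is encoded once, even though we
# touch every (sequence, label) pair below.
embeddings = [[embed_description(lab) for lab in labels] for _ in sequences]
assert calls == len(labels)  # 2 encoder passes, not len(sequences) * len(labels)
```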
To generate the embeddings that we used to train ProtNote, execute the following code:
```bash
python bin/generate_label_embeddings.py --base-label-embedding-path {EMBEDDING_PATH_CONFIG_KEY} --annotations-path-name {ANNOTATIONS_PATH_CONFIG_KEY} --add-instruction --account-for-sos
```
- `{EMBEDDING_PATH_CONFIG_KEY}`: a key from the config that specifies the "base" path name where the embeddings will be stored. It's called "base" because `{EMBEDDING_PATH_CONFIG_KEY}` will be modified based on some of the arguments passed to the script, such as the pooling method.
- `{ANNOTATIONS_PATH_CONFIG_KEY}`: the pkl file in `data/annotations/` containing the text descriptions, created in the previous Annotations file step.
There are other arguments set to the following defaults, which we used during training and inference:

- `--label-encoder-checkpoint`: defaults to `intfloat/multilingual-e5-large-instruct`, which is the multilingual E5 large instruct text embedding model.
