PROVAL
Comparison of Protein Sequence Embeddings to Classify Molecular Functions
PROVAL: Evaluation Framework for Protein Sequence Embeddings
Code submission for the paper 'PROVAL: A Framework for Comparison of Protein Sequence Embeddings'
PROVAL Setup
- We recommend using a new Conda environment.
- Install the PROVAL framework:

      pip install -e .[all]

- (Optional) Install the Smith-Waterman alignment library:

      git clone https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.git
      cd Complete-Striped-Smith-Waterman-Library/src
      make
Extension to Other Embedding Algorithms
<details> <summary>Integration into <b>embeddings.py</b></summary>

- Load the pretrained model.
- Add a function to `embedding_utils.py` that takes the train and test sequences as lists of Bio sequences (see `read_fasta()` in `utils.py`) and returns the vectors as a dictionary of the form id (string): vector (NumPy array).
- Add the approach to the embedding list (`embeddings.py`, line 17).
- Add the embedding function call to the if/elif statements, following the form of the existing calls.
- Run `embeddings.py` and the respective comparison scripts.

</details>
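The integration steps above could look roughly like the following sketch. The names `my_embedding`, `EMBEDDINGS`, and `compute_embedding` are hypothetical and only illustrate the pattern; the actual function names, the embedding list, and the if/elif block should be taken from `embedding_utils.py` and `embeddings.py` in the repository. Sequences are modeled as simple records with `.id` and `.seq` attributes, which is an assumption standing in for the Bio sequences returned by `read_fasta()`.

```python
from collections import namedtuple

import numpy as np

# Minimal stand-in for Bio sequence records (assumption: only .id and .seq are used).
Record = namedtuple("Record", ["id", "seq"])

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def my_embedding(train_seqs, test_seqs):
    """Hypothetical embedding: relative amino-acid frequencies per sequence.

    Returns a dict mapping sequence id (str) to a NumPy vector.
    """
    vectors = {}
    for rec in list(train_seqs) + list(test_seqs):
        seq = str(rec.seq)
        counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
        vectors[str(rec.id)] = counts / max(len(seq), 1)
    return vectors


# Hypothetical list of available approaches and the matching if/elif dispatch.
EMBEDDINGS = ["my_embedding"]


def compute_embedding(name, train_seqs, test_seqs):
    if name == "my_embedding":
        return my_embedding(train_seqs, test_seqs)
    else:
        raise ValueError(f"Unknown embedding approach: {name}")
```

A real integration would replace the frequency vector with the output of the pretrained model; only the input and output conventions matter for the rest of the framework.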
- Load the train and test sequences as lists of Bio sequences (see `read_fasta()` in `utils.py`).
- Use the custom embedding to predict the embedding vector for each sequence, in the dictionary format id (string): vector (NumPy array).
- Truncate the vectors to d=100 if necessary (see `embeddings.py`).
- Save the result as a pickle `.p` file (see `embeddings.py`).
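The truncation and saving steps can be sketched as follows. This is a minimal illustration, assuming that "truncate" means keeping the first d components of each vector; the helper names `truncate_vectors`, `save_vectors`, and `load_vectors` are hypothetical, and `embeddings.py` remains the reference for how the framework actually handles both steps.

```python
import pickle

import numpy as np


def truncate_vectors(vectors, d=100):
    """Keep at most the first d components of each vector (assumed truncation rule)."""
    return {str(key): np.asarray(vec)[:d] for key, vec in vectors.items()}


def save_vectors(vectors, path):
    """Store the id -> vector dictionary as a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(vectors, handle)


def load_vectors(path):
    """Load a dictionary previously written by save_vectors."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Vectors shorter than d are left unchanged by this sketch, so a custom embedding with fewer than 100 dimensions would need no truncation.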
Full Reproducibility of the Paper Results
Note that the extraction of the vectors and the results may not be fully deterministic, so small deviations are possible.
<details> <summary>Data set (optional)</summary>

Steps to reproduce the `test.fasta` and `train.fasta` files in the `data/` folder:
- Download the full SwissProt data set (release 02/2021): https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2021_02/
- Select the sequence IDs, the sequence strings and the molecular function information ('GO:xxxxxx' terms).
- Discard all sequences with more than one molecular function (to reduce the complexity of the experiments).
- Select 1000 random sequences for each of the 15 most frequent molecular functions (= 15,000 sequences).
- Randomly split the sequences into training and test sets (70:30).
- Save the sequences in the .fasta format (see the `test.fasta` and `train.fasta` files in the `data/` folder):

      <Sequence ID> [<GO-ID>]
      <Sequence>
      <Sequence ID> [<GO-ID>]
      <Sequence>
      ...

</details>
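The data set steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used for the paper: the input is assumed to be already parsed into `(sequence_id, sequence, go_terms)` tuples, the function names `build_split` and `to_fasta` are hypothetical, and the leading `>` in the header is the standard FASTA marker, which the template above does not show. Comparing the output against the files in the `data/` folder is advisable.

```python
import random
from collections import Counter, defaultdict


def build_split(entries, n_classes=15, per_class=1000, train_frac=0.7, seed=0):
    """Filter, subsample, and split (seq_id, sequence, go_terms) entries."""
    rng = random.Random(seed)

    # Keep only sequences annotated with exactly one molecular function.
    single = [(sid, seq, gos[0]) for sid, seq, gos in entries if len(gos) == 1]

    # Determine the most frequent molecular functions.
    counts = Counter(go for _, _, go in single)
    top = [go for go, _ in counts.most_common(n_classes)]

    # Group the remaining sequences by function.
    by_go = defaultdict(list)
    for record in single:
        if record[2] in top:
            by_go[record[2]].append(record)

    # Draw up to per_class random sequences for each selected function.
    selected = []
    for go in top:
        pool = by_go[go]
        selected.extend(rng.sample(pool, min(per_class, len(pool))))

    # Random 70:30 split into training and test sets.
    rng.shuffle(selected)
    cut = round(train_frac * len(selected))
    return selected[:cut], selected[cut:]


def to_fasta(records):
    """Render records as '>ID [GO-ID]' header lines followed by the sequence."""
    return "".join(f">{sid} [{go}]\n{seq}\n" for sid, seq, go in records)
```

The fixed `seed` makes the sketch repeatable; the sampling used for the published data set is not specified here, so the resulting files will not match `train.fasta` and `test.fasta` exactly.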
- Install the Smith-Waterman alignment library.
- Run <b>embeddings.py</b> to obtain the vectors.
- Run `dataset_metrics.py` for optional data set plots.
- Run `semantics.py` for the classification results (Table 3).
- Run `visualization.py` for the visualization results (Figure 7).
- Run `eigenspectrum_plot.py` for the information theory results (Figure 8).
Citation
@article{VATH2022100044,
title = {PROVAL: A framework for comparison of protein sequence embeddings},
journal = {Journal of Computational Mathematics and Data Science},
pages = {100044},
year = {2022},
issn = {2772-4158},
doi = {https://doi.org/10.1016/j.jcmds.2022.100044},
url = {https://www.sciencedirect.com/science/article/pii/S2772415822000128},
author = {Philipp Väth and Maximilian Münch and Christoph Raab and F.-M. Schleif},
}
