PROVAL

Comparison of Protein Sequence Embeddings to Classify Molecular Functions

PROVAL: Evaluation Framework for Protein Sequence Embeddings

Code submission for the paper 'PROVAL: A Framework for Comparison of Protein Sequence Embeddings'

PROVAL Setup

We recommend using a new Conda environment!
Install the PROVAL framework: pip install -e .[all]
(Optional) Install the Smith-Waterman alignment library:

git clone https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.git
cd Complete-Striped-Smith-Waterman-Library/src
make

Extension to Other Embedding Algorithms

<details> <summary>Integration into embeddings.py</summary>

Load the pretrained model
Add a function to embedding_utils.py that takes the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py) and returns the vectors as a dictionary of the form id (String): vector (NumPy array)
Add the approach to the embedding list (embeddings.py, line 17)
Add the embedding function call to the if/elif statements, following the form of the existing calls
Run embeddings.py and the respective comparison scripts
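
As a rough illustration of the function described in the second step, the sketch below shows the shape such a function might take. The name `composition_embedding`, the toy amino-acid-composition vectors and the `Record` stand-in class are assumptions made for illustration and are not part of PROVAL. A real integration would call the pretrained model instead and would receive Bio sequence records from read_fasta(). Returning a single dictionary that covers both train and test sequences is also an assumption.

```python
from dataclasses import dataclass

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


@dataclass
class Record:
    """Hypothetical stand-in for a Bio sequence record (only .id and .seq are used)."""
    id: str
    seq: str


def composition_embedding(train_seqs, test_seqs):
    """Toy embedding: relative frequency of each standard amino acid.

    Assumption: one dictionary covering train and test sequences is returned,
    mapping id (str) to vector (NumPy array).
    """
    vectors = {}
    for record in list(train_seqs) + list(test_seqs):
        sequence = str(record.seq).upper()
        counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
        total = counts.sum()
        # Normalise to frequencies; keep the zero vector if no standard residue occurs.
        vectors[str(record.id)] = counts / total if total > 0 else counts
    return vectors


train = [Record("P1", "ACDA"), Record("P2", "GGGG")]
test = [Record("P3", "MKV")]
vectors = composition_embedding(train, test)
```

Each value in `vectors` is a 20-dimensional NumPy array keyed by the sequence ID, which matches the dictionary format the steps above ask for.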

</details> or <details> <summary>Custom integration through vector file</summary>

Load the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py)
Use the custom embedding to compute the embedding vector for each sequence and store the results in the dictionary format id (String): vector (NumPy array)
Truncate the vectors to d=100 if necessary (see embeddings.py)
Save the dictionary as a pickle '.p' file (see embeddings.py)
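
The last two steps could look roughly like the following sketch. It assumes that truncation means keeping the first d components of each vector; how embeddings.py truncates is not stated here, so check that file before relying on this. The helper names and the output file name `custom_embedding.p` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

D = 100  # target dimensionality named in the steps above


def truncate_vectors(vectors, d=D):
    """Keep the first d components of every vector; shorter vectors stay unchanged."""
    return {seq_id: np.asarray(vec)[:d] for seq_id, vec in vectors.items()}


def save_vectors(vectors, path):
    """Write the id -> vector dictionary to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(vectors, handle)


# Toy vectors standing in for the output of a custom embedding.
rng = np.random.default_rng(0)
vectors = {"P1": rng.normal(size=128), "P2": rng.normal(size=64)}
truncated = truncate_vectors(vectors)

# Hypothetical output location; a temporary directory keeps the example self-contained.
out_path = Path(tempfile.mkdtemp()) / "custom_embedding.p"
save_vectors(truncated, out_path)

# Reload to confirm the file round-trips.
with open(out_path, "rb") as handle:
    reloaded = pickle.load(handle)
```

Vectors shorter than d are left as they are in this sketch; whether the comparison scripts accept mixed lengths is not covered by the steps above.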

</details>

Full Reproducibility of the Paper Results

Note: the extraction of the vectors and the results may not be fully deterministic, so small deviations are possible.

<details> <summary>Data set (optional)</summary>

Steps to reproduce the test.fasta and train.fasta files in the data/ folder:

Download the full SwissProt data set (release 02/2021):
https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2021_02/
Select the sequence IDs, the sequence strings and the molecular function information ('GO:xxxxxx' terms)
Discard all sequences with more than one molecular function (to reduce the complexity of the experiments)
Select 1000 random sequences for each of the 15 most frequent molecular functions (15,000 sequences in total)
Randomly split the sequences into training and test sets (70:30)
Save the sequences in the .fasta format (see the test.fasta and train.fasta files in the data folder):

<Sequence ID> [<GO-ID>]
<Sequence>
<Sequence ID> [<GO-ID>]
<Sequence>
...
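
The filtering, sampling and splitting steps above can be sketched as follows. This is a minimal illustration and not the script used for the paper: the function name `select_and_split`, the `(sequence_id, sequence, go_terms)` tuple layout and the fixed seed are assumptions, and parsing the SwissProt release itself is left out.

```python
import random
from collections import Counter


def select_and_split(entries, n_functions=15, per_function=1000,
                     train_fraction=0.7, seed=0):
    """entries: iterable of (sequence_id, sequence, list_of_go_terms)."""
    rng = random.Random(seed)

    # Discard sequences annotated with more than one (or no) molecular function.
    single = [(sid, seq, gos[0]) for sid, seq, gos in entries if len(gos) == 1]

    # Determine the most frequent molecular functions.
    frequency = Counter(go for _, _, go in single)
    top = [go for go, _ in frequency.most_common(n_functions)]

    # Draw a random sample of fixed size for each selected function.
    selected = []
    for go in top:
        pool = [entry for entry in single if entry[2] == go]
        selected.extend(rng.sample(pool, min(per_function, len(pool))))

    # Shuffle, then split into training and test sets.
    rng.shuffle(selected)
    cut = int(round(train_fraction * len(selected)))
    return selected[:cut], selected[cut:]


# Toy data: three functions, plus two sequences with two functions each.
entries = [(f"A{i}", "MKV", ["GO:A"]) for i in range(10)]
entries += [(f"B{i}", "GGA", ["GO:B"]) for i in range(8)]
entries += [(f"C{i}", "ACD", ["GO:C"]) for i in range(3)]
entries += [(f"X{i}", "MMM", ["GO:A", "GO:B"]) for i in range(2)]

train, test = select_and_split(entries, n_functions=2, per_function=5)
```

With the values given in the steps (15 functions, 1000 sequences each, 70:30), the same logic yields 10,500 training and 4,500 test sequences, provided every selected function has at least 1000 single-function sequences.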

</details> <details> <summary>Embedding methods</summary>

Install the Smith-Waterman alignment library
Run embeddings.py to obtain the vectors

</details> <details> <summary>Figures</summary>

Run dataset_metrics.py for optional data set plots
Run semantics.py for the classification results (Table 3)
Run visualization.py for the visualization results (Figure 7)
Run eigenspectrum_plot.py for the information theory results (Figure 8)

</details>

Citation

 @article{VATH2022100044,
title = {PROVAL: A framework for comparison of protein sequence embeddings},
journal = {Journal of Computational Mathematics and Data Science},
pages = {100044},
year = {2022},
issn = {2772-4158},
doi = {https://doi.org/10.1016/j.jcmds.2022.100044},
url = {https://www.sciencedirect.com/science/article/pii/S2772415822000128},
author = {Philipp Väth and Maximilian Münch and Christoph Raab and F.-M. Schleif},
}
