Highly accurate discovery of terpene synthases powered by machine learning


🚀 Quick Start: Get Predictions with Colab Notebooks

| Required input | Colab Notebook |
|----------------|----------------|
| UniProt ID | |
| Structure | |
| Sequence (structure will be predicted in Colab) | |

Introduction
Installation
Quick start locally
Workflow
Reference

Introduction

Did you know that terpene synthases (TPSs) are responsible for most of the natural scents humans have ever experienced [1]? Among other invaluable molecules, TPSs are also responsible for the Nobel-prize-winning antimalarial treatment artemisinin [2], with a market size projected to reach USD 697.9 million by 2025 [3], and for the first-line anticancer medicine taxol, with billion-dollar peak annual sales [4].

Welcome to the GitHub repository showcasing state-of-the-art computational methods for terpene synthase (TPS) discovery and characterization. The pipeline can be easily repurposed for other enzyme families by simply changing the config files.

TPSs generate the scaffolds of the largest class of natural products (more than 96,000 compounds), including several first-line medicines [5]. Our research, outlined in the accompanying paper "Highly accurate discovery of terpene synthases powered by machine learning reveals functional terpene cyclization in Archaea", addresses the challenge of accurately detecting TPS activity in sequence databases.

Our approach significantly outperforms existing methods for TPS detection and substrate prediction. Using it, we identified and experimentally confirmed the activity of seven previously unknown TPS enzymes undetected by all state-of-the-art protein signatures integrated into InterProScan.

Notably, our method is the first to reveal functional terpene cyclization in the Archaea, one of the major domains of life [6]. Before our work, it was believed that Archaea could form prenyl monomers but could not perform terpene cyclization [7]. Thanks to cyclization, terpenoids are the largest and most diverse class of natural products. Our predictive pipeline sheds light on the ancient history of TPS biosynthesis, which "is deeply intertwined with the establishment of biochemistry in its present form" [7].

Furthermore, the presented research unveiled a new TPS structural domain and identified distinct subtypes of known domains, enhancing our understanding of TPS diversity and function.

This repository provides access to the source code of our approach. We invite researchers to explore, contribute, and apply our approach to other enzyme families, accelerating biological discoveries.

Installation

git clone https://github.com/pluskal-lab/EnzymeExplorer.git

cd EnzymeExplorer
. scripts/setup_env.sh
conda activate enzyme_explorer
pip install -e .
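
After the installation steps, it can be useful to confirm that the package is visible to the interpreter of the activated environment. The snippet below is a minimal sketch that uses only the Python standard library. The package name enzymeexplorer is an assumption taken from the module path used in the commands of this README, and the helper is_installed is a hypothetical name introduced for illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Assumption: the editable install exposes a top-level package named "enzymeexplorer".
    name = "enzymeexplorer"
    print(f"{name} importable: {is_installed(name)}")
```

If the check reports False, the most likely causes are that the conda environment is not activated or that pip install -e . was run from a different directory.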

Quick start locally

Running full TPS detection and classification

To predict using the full EnzymeExplorer model, put the proteins of interest into a .csv file with an ID column, place their structures in a directory, and run

cd EnzymeExplorer
conda activate enzyme_explorer
python scripts/easy_predict.py --needed-proteins-csv-path path/to/input_sequences.csv --csv-id-column sequence_id_column_name --input-directory-with-structures path/to/dir/with/structures --output-csv-path predictions.csv
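
The command above takes a CSV file with an ID column and a directory with structures. As a rough sketch of how such a CSV could be assembled, the helper below lists the structure files in a directory and writes their file stems into a one-column CSV. Several things here are assumptions and not documented behavior of easy_predict.py: that IDs correspond to structure file stems, that structures use the .pdb extension, and that no other columns (for example, the sequence) are required. The function name write_id_csv and the column name protein_id are hypothetical.

```python
import csv
from pathlib import Path


def write_id_csv(structures_dir, csv_path, id_column="protein_id", suffixes=(".pdb",)):
    """Write the stems of structure files in `structures_dir` as IDs into a one-column CSV.

    Assumption: each structure file is named <ID><suffix>; adapt if your naming differs.
    """
    ids = sorted(
        path.stem
        for path in Path(structures_dir).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([id_column])          # header row: the ID column name
        writer.writerows([seq_id] for seq_id in ids)
    return ids
```

With this sketch, the value passed to --csv-id-column would be the same string used for id_column.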

Running sequence-based TPS detection and classification

To predict using the model based only on the TPS language model, put the sequences of interest into a .fasta file and run

cd EnzymeExplorer
conda activate enzyme_explorer
python scripts/easy_predict_sequence_only.py --input-fasta-path path/to/input_sequences.fasta --output-csv-path predictions.csv
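
If the sequences of interest live in a Python dictionary or a table and not yet in a file, a small helper can produce the .fasta input. The sketch below writes standard FASTA records (a header line starting with ">" followed by wrapped sequence lines). The function name write_fasta and the 60-character line width are illustrative choices, not part of EnzymeExplorer.

```python
from pathlib import Path


def write_fasta(id_to_seq, fasta_path, line_width=60):
    """Write {sequence_id: sequence} pairs as FASTA records with wrapped sequence lines."""
    lines = []
    for seq_id, seq in id_to_seq.items():
        seq = "".join(seq.split()).upper()  # remove whitespace and normalise to upper case
        if not seq:
            raise ValueError(f"empty sequence for {seq_id!r}")
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + line_width] for i in range(0, len(seq), line_width))
    Path(fasta_path).write_text("\n".join(lines) + "\n")
```

The resulting file path can then be passed to --input-fasta-path.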

Workflow

Data Preparation

1 - Sampling negative examples from Swiss-Prot

We sample negative (non-TPS) sequences from Swiss-Prot, the expertly curated UniProtKB component produced by the UniProt consortium. For reproducibility, we share the sampled sequences in data/sampled_id_2_seq.pkl.

If you want to sample Swiss-Prot entries on your own, download the Swiss-Prot .fasta file from the UniProt.org Downloads page into the data folder and then run

cd EnzymeExplorer
conda activate enzyme_explorer
mkdir -p outputs/logs
if [ ! -f data/sampled_id_2_seq.pkl ]; then
    get_uniprot_sample \
        --uniprot-fasta-path data/uniprot_sprot.fasta \
        --output-path "data/sampled_id_2_seq.pkl" \
        --sample-size 10000 > outputs/logs/swissprot_sampling.log 2>&1
else
    echo "data/sampled_id_2_seq.pkl exists already. You might want to stash it before re-writing the file by the sampling script."
fi
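
To make the sampling step more concrete, here is a minimal sketch of reproducible sampling from a FASTA file. It is not the implementation behind get_uniprot_sample. It assumes that the output is a pickled dictionary mapping sequence IDs to sequences (suggested only by the file name sampled_id_2_seq.pkl) and that UniProt accessions serve as IDs. All function names are hypothetical.

```python
import pickle
import random


def read_fasta(path):
    """Parse a FASTA file into {id: sequence}."""
    id_to_chunks, current = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # UniProt headers look like ">sp|P12345|NAME_SPECIES ..."; keep the accession.
                parts = line[1:].split("|")
                current = parts[1] if len(parts) >= 3 else line[1:].split()[0]
                id_to_chunks[current] = []
            elif current is not None:
                id_to_chunks[current].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in id_to_chunks.items()}


def sample_sequences(id_to_seq, sample_size, seed=0):
    """Draw a reproducible random sample of entries (fixed seed, sorted IDs)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(id_to_seq), min(sample_size, len(id_to_seq)))
    return {seq_id: id_to_seq[seq_id] for seq_id in chosen}


def save_sample(id_to_seq, path):
    """Pickle the sampled mapping (assumed on-disk format)."""
    with open(path, "wb") as handle:
        pickle.dump(id_to_seq, handle)


def load_sample(path):
    """Load a pickled mapping; only unpickle files from sources you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Sorting the IDs before sampling and fixing the seed keeps the sample stable across runs, which matters when the sampled negatives are shared for reproducibility.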

Also, for experimental (wet-lab) validation, we sample Swiss-Prot for negative examples with the same script, while ensuring that the sampled sequences are not present in the training set.

cd EnzymeExplorer
conda activate enzyme_explorer
if [ ! -f data/sampled_id_2_seq_experimental.pkl ]; then
    get_uniprot_sample \
        --uniprot-fasta-path data/uniprot_sprot.fasta \
        --output-path "data/sampled_id_2_seq_experimental.pkl" \
        --blacklist-path "data/sampled_id_2_seq.pkl" \
        --sample-size 1000 > outputs/logs/swissprot_sampling_experimental.log 2>&1
else
    echo "data/sampled_id_2_seq_experimental.pkl exists already. You might want to stash it before re-writing the file by the sampling script."
fi
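
The exclusion of training-set entries can be illustrated with a short filter. How --blacklist-path is applied internally is not described here, so the sketch below is only one plausible reading: it assumes both files hold dictionaries mapping IDs to sequences and drops any candidate whose ID or exact sequence already occurs in the blacklist. The function name exclude_blacklisted is hypothetical.

```python
def exclude_blacklisted(candidates, blacklist):
    """Drop candidates whose ID or exact sequence already appears in the blacklist.

    Both arguments are assumed to be {sequence_id: sequence} dictionaries.
    """
    seen_ids = set(blacklist)
    seen_seqs = set(blacklist.values())
    return {
        seq_id: seq
        for seq_id, seq in candidates.items()
        if seq_id not in seen_ids and seq not in seen_seqs
    }
```

Filtering on the sequence as well as the ID guards against the same protein appearing under two different accessions. It does not catch near-identical sequences, which would need a similarity search.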

2 - Raw Data Preprocessing

cd EnzymeExplorer
conda activate enzyme_explorer
python -m enzymeexplorer.src.data_preparation.cleaning_data_from_raw_tps_table

This data preprocessing script is application-specific. It would require a separate implementation for other enzyme families. For that reason, the script is not configurable via command line arguments.

3 - Computing a phylogenetic tree and clade-based sequence groups

To check how well our models generalize to novel TPS sequences, we need to ensure that groups of similar sequences always stay together in either the training or the test fold. We construct a phylogenetic
