DecRiPPter

Genome mining tool for novel RiPP BGCs

decRiPPter (Data-driven Exploratory Class-independent RiPP TrackER)

Alexander M. Kloosterman, Peter Cimermancic, Somayah S. Elsayed, Chao Du, Michalis Hadjithomas, Mohamed S. Donia, Michael A. Fischbach, Gilles P. van Wezel, and Marnix H. Medema.

Reference: Integration of machine learning and pan-genomics expands the biosynthetic landscape of RiPP natural products

Preprint: https://biorxiv.org/cgi/content/short/2020.05.19.104752v1

Sample output: Large scale Streptomyces analysis (1,295 genomes) https://decrippter.bioinformatics.nl/

Description:

decRiPPter is a genome mining tool for detection of novel biosynthetic gene clusters (BGCs) of ribosomally synthesized and post-translationally modified peptides (RiPPs).

decRiPPter functions as a platform for the exploration and prioritization of candidate RiPP BGCs. It prioritizes novelty at the cost of accuracy. As such, many of the BGCs in the results may not actually be RiPP BGCs. However, the resulting BGCs are not restricted to specific genetic markers, and may be novel RiPP BGCs as a result. To help interpret the results, decRiPPter allows for extensive user-defined filtering of the candidate RiPP BGCs, to detect RiPP BGCs that fall outside the scope of known RiPP classes.

If you're more interested in highly accurate detection of RiPP BGCs, staying within the bounds of known RiPP subclasses, there are some excellent tools written for that purpose. For example, see BAGEL4, RODEO, antiSMASH or RiPP-PRISM.

decRiPPter detects putative RiPP precursors using a Support Vector Machine (SVM) trained to detect RiPP precursors irrespective of RiPP subclass. The genomic context of each candidate precursor is then used to filter the results, narrowing them down to gene clusters that contain the features of interest. The remaining gene clusters are grouped together to form candidate families. Overlap with antiSMASH results can also be analyzed, to remove known RiPP families.

decRiPPter is meant for the analysis of multiple closely related genomes simultaneously. From these genomes, it infers how frequently each gene occurs (called the COG score), which is used to filter out housekeeping genes. Analyzing more genomes simultaneously also gives a better picture of the spread of a gene cluster family, which can help in the filtering process.

Pipeline description

1) SVM precursor detection

All encoded proteins with a maximum length of 100 amino acids are analyzed with a pretrained SVM. In addition, all intergenic small open reading frames are analyzed, even if these were not annotated.
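The length-based preselection described above can be sketched as follows. This is a minimal illustration and not decRiPPter's own code: the function name `select_candidates` and the dictionary input are assumptions made for the example, and neither the SVM scoring nor the search for intergenic small open reading frames is shown.

```python
MAX_PRECURSOR_LENGTH = 100  # amino acids, the cutoff stated above


def select_candidates(proteins, max_length=MAX_PRECURSOR_LENGTH):
    """Return the proteins short enough to be passed on to the SVM.

    `proteins` maps a (hypothetical) protein identifier to its
    amino acid sequence.
    """
    return {
        name: seq
        for name, seq in proteins.items()
        if len(seq.rstrip("*")) <= max_length  # ignore a trailing stop symbol
    }


# Made-up sequences, for illustration only.
proteins = {
    "short_peptide": "M" + "A" * 40,
    "long_enzyme": "M" + "L" * 300,
}
candidates = select_candidates(proteins)
print(sorted(candidates))
```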

2) COG analysis

The relative frequency of occurrence of each gene across all of the query genomes is determined next. Genes are grouped together if they have orthologue-like similarity to one another. The similarity cutoff is based on the similarity of highly conserved genes in all genomes. See the publication for more details.
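As a rough sketch of what a frequency-of-occurrence score looks like, the example below computes, for each group of orthologue-like genes, the fraction of genomes in which that group occurs. This assumes the grouping has already been done and that the score is a simple fraction of genomes; the exact definition used by decRiPPter is given in the publication. The function name `cog_scores` and the input layout are hypothetical.

```python
def cog_scores(group_members, n_genomes):
    """Fraction of genomes in which each gene group occurs at least once.

    `group_members` maps a group identifier to a list of
    (genome, gene) pairs. Both the layout and the scoring rule are
    assumptions made for this illustration.
    """
    scores = {}
    for group, members in group_members.items():
        genomes_with_group = {genome for genome, _gene in members}
        scores[group] = len(genomes_with_group) / n_genomes
    return scores


# Made-up example with four genomes.
groups = {
    "core_group": [("g1", "a"), ("g2", "b"), ("g3", "c"), ("g4", "d")],
    "rare_group": [("g1", "x")],
}
scores = cog_scores(groups, n_genomes=4)
print(scores)
```

Under this assumption, a group present in every genome scores 1.0 and behaves like a conserved housekeeping gene, while a group found in a single genome scores low.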

3) Gene cluster formation

Operon-like gene clusters are formed around each candidate precursor. Only genes on the same strand as the precursor are used.

Two methods are built in:

In the simple method, only intergenic distance is used as a cutoff.

In the island method, genes are first fused within a small intergenic distance. The island may be further fused based on the difference of average COG scores.
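The simple method can be illustrated with a short sketch. This is not decRiPPter's implementation: the `Gene` type, the function name `simple_cluster`, and the `max_gap` value are hypothetical, and the sketch assumes that extension stops at the first gene on the opposite strand rather than skipping over it.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def simple_cluster(genes, precursor_index, max_gap):
    """Extend a cluster from the precursor in both directions.

    `genes` must be sorted by start coordinate. Extension stops at the
    first gene on the other strand (an assumption of this sketch) or at
    the first intergenic gap larger than `max_gap`.
    """
    strand = genes[precursor_index].strand
    left = right = precursor_index
    while left > 0:
        prev = genes[left - 1]
        if prev.strand != strand or genes[left].start - prev.end > max_gap:
            break
        left -= 1
    while right < len(genes) - 1:
        nxt = genes[right + 1]
        if nxt.strand != strand or nxt.start - genes[right].end > max_gap:
            break
        right += 1
    return genes[left:right + 1]


# Made-up coordinates; the cutoff of 100 is arbitrary.
genes = [
    Gene("a", 0, 300, "-"),
    Gene("b", 350, 650, "+"),
    Gene("pre", 700, 850, "+"),
    Gene("c", 880, 1800, "+"),
    Gene("d", 2500, 3400, "+"),
]
cluster = simple_cluster(genes, precursor_index=2, max_gap=100)
print([gene.name for gene in cluster])
```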

4) Gene cluster annotation

Gene clusters are annotated with Pfam and TIGRFAM. A number of flanking genes are included to show additional context. The detected domains are grouped into biosynthetic, transporter, regulator, peptidase and known RiPP categories. The lists with these domains are found in the data/domains folder, and can be changed and expanded there.

5) Gene cluster filtering

Gene clusters are filtered based on the filters passed. By default, outputs are generated for two different filter settings, called the mild filter and the strict filter. See below and the publication for more details.

6) Gene cluster grouping

Gene clusters are grouped in two ways:

Precursor similarity (determined by blastp)

Jaccard index of protein domains found

The resulting groups are further refined with the Markov Clustering Algorithm (MCL). An additional grouping method is carried out when both precursors and protein domains are overlapping between two gene clusters. This last method is considered the most reliable by the authors (see preprint).
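For readers unfamiliar with the Jaccard index, the sketch below computes it for the domain sets of two gene clusters. The function name `jaccard_index` and the example accessions are illustrative only, and the cutoff decRiPPter applies to the resulting value is not shown here.

```python
def jaccard_index(domains_a, domains_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty sets share nothing
    return len(a & b) / len(a | b)


# Illustrative Pfam accessions for two hypothetical gene clusters.
cluster_1 = {"PF00005", "PF04055", "PF00082"}
cluster_2 = {"PF00005", "PF04055", "PF13186"}
print(jaccard_index(cluster_1, cluster_2))
```

Here the two clusters share two of four distinct domains, giving an index of 0.5.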

7) Output generation

Protein fasta and genbank files will be generated as output for each operon. The formed gene cluster families are also written out as a network file that can be parsed by Cytoscape. In addition, for each of the filter settings, an index page will be generated which contains links to the families and graphical output of each formed gene cluster. On this index page, you can further filter the resulting gene clusters based on gene cluster features.

Installation guide:

decRiPPter is available as a command-line tool for Linux. As of now, it still runs in Python 2 (an update to Python 3 is high on the todo list).

1) Clone the repository locally

git clone https://github.com/Alexamk/decRiPPter.git

2) Setup the environment

The easiest way to install it is to create a virtual environment and activate it:

virtualenv -p "$(which python2)" decrippter

source decrippter/bin/activate

Then use pip to install these python packages. The following versions have been tested:

scikit-learn==0.11

biopython==1.76

scipy==1.2.3

matplotlib==2.2.5

networkx==2.2

numpy==1.16.6

Some issues with installing BioPython 1.76 on Python2 have been reported. If you encounter these, please try installing 1.75 or 1.74.

The packages listed above can be installed using the requirements.txt file:

pip install -r requirements.txt

In addition, make sure the following executables are in your $PATH variable.

blastp (from NCBI BLAST+, tested V2.6)

diamond (from DIAMOND, tested v0.9.31.132)

hmmsearch (from hmmer, tested V3.1b2)

mcl (from the Markov Clustering Algorithm)

muscle (from MUSCLE)
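A quick way to confirm that the executables listed above are reachable is to look them up on the $PATH. The helper below is a convenience sketch and not part of decRiPPter. It uses `shutil.which`, which requires Python 3.3 or later, so it should be run with a Python 3 interpreter rather than inside the Python 2 environment.

```python
import shutil

# Executable names as listed above.
REQUIRED = ["blastp", "diamond", "hmmsearch", "mcl", "muscle"]


def missing_executables(names=REQUIRED):
    """Return the names that cannot be found on the current $PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_executables()
if missing:
    print("Not found in $PATH: " + ", ".join(missing))
else:
    print("All required executables were found.")
```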

In addition, please download the latest versions of the Pfam and TIGRFAM databases.

Later versions of these packages have not been tested, although they should work fine barring any changes in output files. The only package with a version requirement is scikit-learn (v0.11).

Optional

prodigal (from prodigal, tested v2.6.3)

Genome (re)annotation is built in with decRiPPter via prodigal, although it is not a strict requirement.

antismash (from antiSMASH V5)

Download and install antiSMASH V5 in its own environment, following its installation instructions.

3) Setup the config file:

In the config file, let the variables pfam_db_path and tigrfam_db_path point to the Pfam and TIGRFAM databases, respectively. When downloading genomes, taxonomy information will be downloaded. Specify a taxonomy folder for this in the config file under taxonomy_folder.

Usage:

Quick start:

Step 1

python genome_prep.py -o path/to/output -t taxid_to_download -i genomes_to_analyze PROJECT_NAME

Step 1.5 (Optional):

Deactivate the decRiPPter environment

deactivate

Switch to antiSMASH environment, e.g. if antiSMASH is installed via conda:

conda activate antismash

Run the antiSMASH wrapper; use the same arguments for -o and PROJECT_NAME as in Step 1

python antismash_wrapper.py -o path/to/output PROJECT_NAME

Switch back to decRiPPter environment

source /path/to/decrippter_env/bin/activate

Step 2

python genecluster_formation.py -o path/to/output PROJECT_NAME

Visual output can be found by opening the Index.html file in the output folder.

Detailed usage:

The pipeline runs in two (optionally three) steps.

Step 1.

In the first step the genbank files are downloaded and/or parsed. Candidate precursors are detected and the COG scores are calculated.

usage: genome_prep.py [-h] [-o OUTPUTFOLDER] [-c CORES] [-i IN] [-t TAXID] [-rg REUSE_GENOMES] [--run_prodigal {auto,never,always}] [-p] [--load_pickles] [--store_cog] [--load_cog] PROJECT NAME

-c (CORES): Number of processor cores to use.

-o (OUTPUTFOLDER): The folder in which to create the project folder. The created project folder will be named after the project.

1a) Input selection and genome downloading

-i (IN): Point to a file or folder of files to be analyzed. Each file should correspond to one genome.

-t (TAXID): Indicate the taxonomic identifier to download all genomes corresponding to that identifier. E.g., give "1883" to download all Streptomyces genomes from NCBI. By default all genomes under the given taxonomic identifier are downloaded. Additional requirements for downloading genomes can be set in the config file.

--run_prodigal (PRODIGAL_MODE): Reannotates downloaded files with prodigal. When set to never, the program will only download genbank files and parse these. When set to always, the program will only download DNA fasta files, and annotate these to create a very basic genbank file. When set to auto (default), the pipeline will download/use genbank files when available. If these are not found, it will download DNA fasta files instead and annotate them.

-rg (REUSE_GENOMES): Used to reuse genome files already in the Genomes fol
