DecRiPPter
Genome mining tool for novel RiPP BGCs
decRiPPter (Data-driven Exploratory Class-independent RiPP TrackER)
Alexander M. Kloosterman, Peter Cimermancic, Somayah S. Elsayed, Chao Du, Michalis Hadjithomas, Mohamed S. Donia, Michael A. Fischbach, Gilles P. van Wezel, and Marnix H. Medema.
Reference: Integration of machine learning and pan-genomics expands the biosynthetic landscape of RiPP natural products
Preprint: https://biorxiv.org/cgi/content/short/2020.05.19.104752v1
Sample output: Large-scale Streptomyces analysis (1,295 genomes): https://decrippter.bioinformatics.nl/
Description:
decRiPPter is a genome mining tool for detection of novel biosynthetic gene clusters (BGCs) of ribosomally synthesized and post-translationally modified peptides (RiPPs).
decRiPPter functions as a platform for the exploration and prioritization of candidate RiPP BGCs. It prioritizes novelty at the cost of accuracy, so many of the reported BGCs may not actually be RiPP BGCs. However, because the resulting BGCs are not restricted to specific genetic markers, some of them may be novel RiPP BGCs. To help interpret the results, decRiPPter allows extensive user-defined filtering of the candidate RiPP BGCs, to detect RiPP BGCs that fall outside the scope of known RiPP classes.
If you're more interested in highly accurate detection of RiPP BGCs, staying within the bounds of known RiPP subclasses, there are some excellent tools written for that purpose. For example, see BAGEL4, RODEO, antiSMASH or RiPP-PRISM.
decRiPPter detects putative RiPP precursors using a Support Vector Machine (SVM) trained for the detection of RiPP precursors irrespective of RiPP subclass. The genomic context of all candidate precursors is then used to filter the results and narrow them down to gene clusters that contain the features of interest. The remaining gene clusters are grouped together to form candidate families. Overlap with antiSMASH can also be analyzed, to remove known RiPP families.
decRiPPter is meant for the analysis of multiple closely related genomes simultaneously. From these genomes, it infers the frequency of occurrence of genes (called the COG score), which is used to filter out housekeeping genes. Analyzing more genomes simultaneously also gives a better representation of the spread of a gene cluster family, which can help in the filtering process.
Pipeline description
1) SVM precursor detection
All encoded proteins with a maximum length of 100 amino acids are analyzed with a pretrained SVM. In addition, all small open reading frames in intergenic regions are analyzed, even if they were not annotated.
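The length cutoff described above can be illustrated with a minimal sketch. This is not decRiPPter's own code: the function name, the dictionary input format, and the example sequences are assumptions made for illustration, and only the 100-amino-acid limit comes from the description.

```python
# Hypothetical helper, not part of decRiPPter: select proteins that are
# short enough to be passed on to the precursor SVM.
MAX_PRECURSOR_LENGTH = 100  # amino acids, as stated in the pipeline description


def select_candidate_precursors(proteins, max_length=MAX_PRECURSOR_LENGTH):
    """Return {name: sequence} for proteins of at most max_length residues."""
    candidates = {}
    for name, sequence in proteins.items():
        residues = sequence.rstrip("*")  # ignore a trailing stop symbol, if present
        if 0 < len(residues) <= max_length:
            candidates[name] = residues
    return candidates


# Made-up example sequences
example = {
    "small_orf": "MSKELTDLSEEELAGV",
    "large_enzyme": "M" + "A" * 400,
}
selected = select_candidate_precursors(example)
print(sorted(selected))
```

Only the short sequence is kept. In the real pipeline, the SVM then decides which of these candidates resemble RiPP precursors.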
2) COG analysis
The relative frequency of occurrence of each gene across all of the query genomes is determined next. Genes are grouped together if they have orthologue-like similarity to one another; the cutoff is based on the similarity of highly conserved genes in all genomes. See the publication for more details.
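As a rough illustration of a frequency-of-occurrence score, the sketch below assigns each gene the fraction of query genomes that contain a member of its orthologue group. The exact definition used by decRiPPter is given in the publication; the function name, the input format, and the gene names here are assumptions.

```python
# Hypothetical sketch of a frequency-of-occurrence score; the exact COG score
# definition used by decRiPPter is described in the publication.
def cog_scores(ortholog_groups, n_genomes):
    """Map each gene to the fraction of genomes containing its orthologue group.

    ortholog_groups: {group_id: [(genome_id, gene_id), ...]}
    """
    scores = {}
    for members in ortholog_groups.values():
        genomes_with_group = set(genome for genome, _gene in members)
        score = len(genomes_with_group) / float(n_genomes)
        for _genome, gene in members:
            scores[gene] = score
    return scores


# Made-up example: one gene conserved in all four genomes, one found only once
groups = {
    "grp1": [("g1", "g1_rpoB"), ("g2", "g2_rpoB"), ("g3", "g3_rpoB"), ("g4", "g4_rpoB")],
    "grp2": [("g1", "g1_orf7")],
}
scores = cog_scores(groups, n_genomes=4)
print(scores["g1_rpoB"], scores["g1_orf7"])
```

Under this assumed definition, a gene conserved in every genome scores 1.0 and a rare gene scores close to 0.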
3) Gene cluster formation
Operon-like gene clusters are formed around each candidate precursor. Only genes on the same strand as the precursor are used.
Two methods are built in:
In the simple method, only intergenic distance is used as a cutoff.
In the island method, genes are first fused within a small intergenic distance. The island may be further fused based on the difference of average COG scores.
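The simple method can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the `Gene` tuple, the function name, and the 50 bp cutoff are hypothetical, and the sketch assumes that extension stops at the first gene on the opposite strand.

```python
from collections import namedtuple

# Hypothetical gene record; coordinates are in base pairs.
Gene = namedtuple("Gene", "name start end strand")


def simple_cluster(genes, precursor_index, max_distance):
    """Extend a cluster around the precursor using only intergenic distance.

    genes must be sorted by start position. Extension in either direction
    stops at the first gene on the other strand or beyond max_distance.
    """
    strand = genes[precursor_index].strand
    left = right = precursor_index
    # Extend towards lower coordinates
    while left > 0:
        neighbour = genes[left - 1]
        gap = genes[left].start - neighbour.end
        if neighbour.strand != strand or gap > max_distance:
            break
        left -= 1
    # Extend towards higher coordinates
    while right < len(genes) - 1:
        neighbour = genes[right + 1]
        gap = neighbour.start - genes[right].end
        if neighbour.strand != strand or gap > max_distance:
            break
        right += 1
    return genes[left:right + 1]


# Made-up example locus
genes = [
    Gene("geneA", 100, 900, "+"),
    Gene("geneB", 950, 1500, "+"),
    Gene("precursor", 1530, 1680, "+"),
    Gene("geneC", 1700, 2600, "+"),
    Gene("geneD", 3400, 4200, "+"),  # 800 bp gap, exceeds the cutoff
    Gene("geneE", 4250, 5000, "-"),
]
cluster = simple_cluster(genes, precursor_index=2, max_distance=50)
names = [gene.name for gene in cluster]
print(names)
```

The island method would add a second pass that compares average COG scores between neighbouring islands; that step is not shown here.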
4) Gene cluster annotation
Gene clusters are annotated with Pfam and TIGRFAM. A number of flanking genes is included to show additional context. These domains are grouped into biosynthetic, transporter, regulator, peptidases and known RiPP categories. The lists with these domains are found in the data/domains folder, and can be changed and expanded there.
5) Gene cluster filtering
Gene clusters are filtered using the specified filters. By default, outputs are generated for two different filter settings, called the mild filter and the strict filter. See below and the publication for more details.
6) Gene cluster grouping
Gene clusters are grouped in two ways:
Precursor similarity (determined by blastp)
Jaccard index of protein domains found
The resulting groups are further refined with the Markov Clustering Algorithm (MCL). An additional grouping method is carried out when both precursors and protein domains are overlapping between two gene clusters. This last method is considered the most reliable by the authors (see preprint).
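For reference, the Jaccard index of two domain sets is the size of their intersection divided by the size of their union. The sketch below shows the calculation; the function name is hypothetical and the Pfam accessions are chosen purely as example input. How decRiPPter thresholds this value before MCL is not shown.

```python
# Minimal sketch of the Jaccard index between the domain sets of two clusters.
def jaccard_index(domains_a, domains_b):
    """Return |A & B| / |A | B| for two collections of domain identifiers."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two clusters without any domains share nothing
    return len(set_a & set_b) / float(len(union))


# Illustrative domain lists for two gene clusters
cluster_1 = ["PF00005", "PF04055", "PF00082"]
cluster_2 = ["PF00005", "PF04055", "PF13186"]
similarity = jaccard_index(cluster_1, cluster_2)
print(similarity)
```

Two clusters sharing two of four distinct domains give an index of 0.5.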
7) Output generation
Protein fasta and genbank files are generated as output for each operon. The gene cluster families are also written out as a network file that can be parsed by Cytoscape. In addition, for each of the filter settings, an index page is generated that contains links to the families and graphical output for each gene cluster. On this index page, you can further filter the resulting gene clusters based on gene cluster features.
Installation guide:
decRiPPter is available as a command-line tool for Linux. It currently runs on Python 2 (an update to Python 3 is high on the to-do list).
1) Clone the repository locally
git clone https://github.com/Alexamk/decRiPPter.git
2) Setup the environment
The easiest way to install it is to create a virtual environment and activate it
virtualenv -p $(which python2) decrippter
source decrippter/bin/activate
Then use pip to install these python packages. The following versions have been tested:
scikit-learn==0.11
biopython==1.76
scipy==1.2.3
matplotlib==2.2.5
networkx==2.2
numpy==1.16.6
Some issues with installing BioPython 1.76 on Python2 have been reported. If you encounter these, please try installing 1.75 or 1.74.
These can be installed using the requirements.txt file
pip install -r requirements.txt
In addition, make sure the following executables are in your $PATH variable.
blastp (from NCBI BLAST+, tested V2.6)
diamond (from DIAMOND, tested v0.9.31.132)
hmmsearch (from hmmer, tested V3.1b2)
mcl (from the Markov Clustering Algorithm)
muscle (from MUSCLE)
In addition, please download the latest versions of the Pfam and TIGRFAM databases.
Later versions of these programs and packages have not been tested, although they should work fine barring any changes in output files. The only package with a strict version requirement is scikit-learn (v0.11).
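Before running the pipeline, it can be useful to confirm that the required executables are actually on your $PATH. The following standalone helper is a sketch and not part of decRiPPter; it uses `shutil.which`, which requires Python 3.3 or later, so run it outside the Python 2 environment. The function name is hypothetical.

```python
import shutil

# Executables listed as required in the installation guide
REQUIRED = ["blastp", "diamond", "hmmsearch", "mcl", "muscle"]


def missing_executables(names, which=shutil.which):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


missing = missing_executables(REQUIRED)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required executables were found.")
```

The optional tools (prodigal, antismash) can be checked the same way by adding them to the list.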
Optional
prodigal (from prodigal, tested v2.6.3)
Genome (re)annotation via prodigal is built into decRiPPter, although prodigal is not a strict requirement.
antismash (from antiSMASH V5)
Download and install antiSMASH V5 as specified, in its own environment.
3) Setup the config file:
In the config file, let the variables pfam_db_path and tigrfam_db_path point to the Pfam and TIGRFAM databases, respectively.
When downloading genomes, taxonomy information will be downloaded. Specify a taxonomy folder for this in the config file under taxonomy_folder.
Usage:
Quick start:
Step 1
python genome_prep.py -o path/to/output -t taxid_to_download -i genomes_to_analyze PROJECT NAME
Step 1.5 (Optional):
Deactivate the decRiPPter environment
deactivate
Switch to antiSMASH environment, e.g. if antiSMASH is installed via conda:
conda activate antismash
Run the antiSMASH wrapper; use the same arguments for -o and PROJECT NAME
python antismash_wrapper.py -o path/to/output PROJECT NAME
Switch back to decRiPPter environment
source /path/to/decrippter_env/bin/activate
Step 2
python genecluster_formation.py -o path/to/output PROJECT_NAME
Visual output can be found by opening the Index.html file in the output folder.
Detailed usage:
Running the pipeline takes two (optionally three) steps.
Step 1.
In the first step the genbank files are downloaded and/or parsed. Candidate precursors are detected and the COG scores are calculated.
usage: genome_prep.py [-h] [-o OUTPUTFOLDER] [-c CORES] [-i IN] [-t TAXID]
                      [-rg REUSE_GENOMES] [--run_prodigal {auto,never,always}]
                      [-p] [--load_pickles] [--store_cog] [--load_cog]
                      PROJECT NAME
-c (CORES): Number of processor cores to use.
-o (OUTPUTFOLDER): The folder in which to create the project folder. The created project folder is named after the project.
1a) Input selection and genome downloading
-i (IN): Point to a file or folder of files to be analyzed. Each file should correspond to one genome.
-t (TAXID): Indicate the taxonomic identifier to download all genomes corresponding to that identifier. E.g., give "1883" to download all Streptomyces genomes from NCBI. By default all genomes under the given taxonomic identifier are downloaded. Additional requirements for downloading genomes can be set in the config file.
--run_prodigal (PRODIGAL_MODE): Reannotates downloaded files with prodigal. When set to never, the program will only download genbank files and parse these. When set to always, the program will only download DNA fasta files, and annotate these to create a very basic genbank file. When set to auto (default), the pipeline will download/use genbank files when available. If these are not found, it will download DNA fasta files instead and annotate them.
-rg (REUSE_GENOMES): Used to reuse genome files already in the Genomes fol
