MetaShot
MetaShot (Metagenomics Shotgun) is a complete pipeline designed for the taxonomic classification of the human microbiota members. In MetaShot, third party tools and new developed Python and Bash scripts are integrated to analyze paired-end (PE) Illumina sequences, offering an automated procedure covering all the analysis steps from raw data management to taxonomic profiling. It is designed to analyze both DNA-Seq and RNA-Seq data.
Install / Use
/learn @bfosso/MetaShotREADME
MetaShot (Metagemomics Shotgun)
MetaShot (Metagenomics Shotgun) PMID: 28130230 is a pipeline designed for the complete taxonomic assessment of the human microbiota.
In MetaShot, third party tools and new developed Python and Bash scripts are integrated to analyze paired-end (PE) Illumina reads, offering an automated procedure covering all the analysis steps from raw data management to taxonomic profiling.
MetaShot is designed to analyze both DNA-Seq and RNA-Seq data.
- Pipeline description
- Division data creation
- Install
3.a Requirements
3.b MetaShot setting up - Usage
4.a Taxonomic assignment
4.b Paired End (PE) reads extraction
4.c Result files interpretation - References
Pipeline description
The MetaShot analysis procedure can be divided in four main processes:
- Pre-processing procedures: input sequences containing low-quality/complexity regions and reads shorter than 50 nucleotides are removed. This is performed by applying FaQCs [1]. Moreover, it removes the phage PhiX sequences. Only reads passing all the quality check filters are directed to the following steps.
- Comparison with the human genome: cleaned data are mapped against the human genome (hg19, GRCh37, 2009) [2] by applying STAR [3].
- Comparison with reference databases and taxonomic annotation: in MetaShot are implemented four reference collections for Prokaryotes, Viruses, Fungi and Protista, obtained by processing data from GenBank and RefSeq databases (for more information see the “Division data creation” section). The taxonomic annotation is a three step procedure:
- Identification of candidate microbial sequences: in order to reduce the processing time and the computational requirements, and to improve the assignment accuracy, the PE reads passing the denoising procedure are compared to the four reference collections, by applying Bowtie2 [4] with the default options (a compromise between speed and mapping accuracy). The mapping data are filtered and PE reads obtaining at least one match with a hamming distance [5] below 10% of the sequence length are retained. In this way MetaShot identifies a subset of reads to be addressed to the microbial taxonomic classification.
- Comparison with reference databases and taxonomic annotation: microbial candidate sequences are compared to the four reference collections by using Bowtie2. In this step, the mapping accuracy is increased by modifying the length of seed substrings (-L option set to 20, the default value is 22) and the number of alignments per read (-k option set to 100, default 1). The resulting alignments are filtered according to identity percentage (threshold ≥ 97%) and query coverage (threshold ≥ 70%). The obtained results for the four different taxonomic divisions and for human genome and transcriptome are then intersected to remove the sequences that map ambiguously across them. Finally, the sequences are taxonomically annotated by using the TANGO (Taxonomic assignment in metagenomics) [^fn6] tool.
- Human Endogenous Retrovirus (HERV) identification: in order to identify HERV sequences, the PE reads labeled as ambiguous are parsed to identify those mapping only to human genome and to the viral division. These PE reads are subsequently taxonomically annotated by applying TANGO on the mapping information obtained against the viral division. If the obtained classification is under the HERV group it is accepted, otherwise the PE will be considered ambiguous.
- Report generation: a CSV file, an HTML interactive table summarizing the taxonomic assignment and a Krona graph [7] of the obtained taxonomy are provided for each division.
Figure1: MetaShot workflow
Division data creation
The reference collections for Prokaryotes, Viruses, Fungi and Protista have been built by following a common procedure, with some specific add-ons, implemented in a Bash and Python pipeline:
- For each collection the GenBank and RefSeq flat-file and FASTA files were downloaded from the NCBI ftp site ftp://ftp.ncbi.nlm.nih.gov. For Prokaryotes and Viruses two specific GenBank divisions (BCT and VRL) are available. For Fungi, the “Plantae” GenBank division (PLN) was downloaded and the fungal entries were identified by parsing the taxonomic information of each entry. For Protista, the “Invertebrate” GenBank division (INV) was downloaded and the Protista entries were collected by suitably parsing the taxonomic information of each entry. Regarding the RefSeq data, specific divisions are available for Prokaryotes, Viruses and Fungi. As for GenBank data, the Prostista collection was created by downloading and parsing the “invertebrate” RefSeq division.
- The entire NCBI taxonomy dump data were downloaded from the NCBI ftp site ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.
- The Genbank flat-file was parsed in order to associate the accession number of each entry to a taxonomy identifier (taxid) in the NCBI taxonomy, by extracting the information annotated in the feature table “source” field. An entry was discarded if resulting associated to more than one microbial division or the associated taxid was not present in the reference taxonomy due to inconsistency between Genbank and Taxonomy databases. For the entries passing the taxonomy-based filter, the sequence data were annotated and the accession number – taxid association was stored in a dump.
- FASTA sequences were indexed by applying the bowtie2-build program.
- By using the dump containing the accession number – taxid association the TANGO reference taxonomy was built. The current MetaShot reference collections were built starting from the releases 209 and 72 of the GenBank and RefSeq databases, respectively.
Install
Requirements
MetaShot scripts use freely available Python packages and third party tools both requiring an installation performed by the users. Actually, it requires a working Python 2.7 environment and the following modules:
- NumPy: release 1.7.1 or superior http://www.numpy.org
- BioPython: release 1.61 http://biopython.org
- Pysam: 0.7.4 http://pysam.readthedocs.io/en/stable/
- Psutil: release 3.3.0 https://github.com/giampaolo/psutil
- ETE2: release 2.3.10 http://etetoolkit.org/download/
The following tools need to be installed on your machine:
- Bowtie2: release2.2.3 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
- STAR: release STAR_2.4.2a https://github.com/alexdobin/STAR
- Krona Graph: release 2.6 https://github.com/marbl/Krona/wiki
- FaQCs: release 2.08 https://github.com/chienchi/FaQCs: please install it by using conda:
conda install faqcs -c bioconda - Samtools: release 0.1.18 http://samtools.sourceforge.net
MetaShot requires at least 30 Gb of RAM to perform the entire analysis. Its reference collections require about 420Gb of free space on your storage system.
To complete the analysis of 500 million PE reads it requires about 1.3 Tb.
MetaShot setting up
Download the reference collections
All the reference collections, taxonomies and the other files needed for the MetaShot computation are stored in the compressed folder MetaShot_reference_data.tar.gz, freely available at https://recascloud.ba.infn.it/index.php/s/9dFFTJUr0bOU2mN (the file is about 49 GB. Md5sum: 2def0475c8c3b49c15181de0c226b11a).
To decompress the folder, type the following command in your terminal: tar xvfz MetaShot_reference_data.tar.gz
The MetaShot_reference_data folder contains 9 subfolders:
- Bacteria (contains the data for all the Prokaryotes), Fungi, Protist and Virus folders contain the FASTA files to produce the bowtie indexes for each reference collection and the specific input data for taxonomic assignment.
In the “bowtie_index” subfolder a bash script is supplied to correctly format and name the bowtie-index.
For Prokaryotes, the reference collection is divided in 15 section, called “split”, each containing the bash script for bowtie2 index building;
To build the bowtie2 index type the following commands:chmod 755 build_index.sh ./build_index.sh - Homo_sapiens: contains the hg19 release of human genome. Please follow the instructions available in the STAR guide;
- Phix: contains the Phix genome. To build the bowtie2 index type the following commands:
cd PhiX chmod 755 build_index.sh ./build_index.sh - krona_tax: contains the NCBI taxonomy pre-formatted needed for Krona graph production;
- New_TANGO_perl_version. Contains the executable TANGO Perl scripts,
- Metashot_reference_taxonomy: contains the taxonomy data.
Configure the Parameters File
In the MetaShot package, a textual file called “parameters_file.txt” is available that contains all the required information for the pipeline execution.
The user must complete this file by adding the following information to the “General data” section:
- Adding the complete path to the MetaShot package:
cd Metashot && pwd
Copy the result in correspondence of “script_path”, as in the following example: scri
