MIST
MIST: a metagenomic intra-species typing tool.
MIST (Metagenomic Intra-Species Typing)
Introduction
MIST is a metagenomic intra-species typing technique developed primarily for clinical specimens with low pathogen loads. It is intended to aid strain-level diagnosis of bacterial infections as well as public health epidemiology and surveillance. Its algorithm has three main characteristics:
- Based on average nucleotide identity (ANI), reference genomes are clustered into hierarchical levels to resolve the ambiguous definition of “strain”.
- Maximum likelihood estimation is conducted upon the reads’ mismatch values to infer the compositional abundance.
- Read ambiguity is used to infer the abundance uncertainty, and the similarity to reference genomes is used to predict the presence of novel strains.
MIST contains five modules: Index, Cluster, Species, Strain, and Bootstrap. Its workflow is depicted in the figure below. For species-level typing, MIST relies on the counts of reads that match the pan-genome of each pathogen species (panel A). For strain-level typing, MIST prepares a hierarchical database of reference genomes based on ANI grouping (panel B). Reads are aligned to all reference genomes of the species, and the alignment scores are transformed into posterior probabilities that indicate how likely each read is to belong to each cluster. Maximum likelihood estimation (MLE) is then applied to this probability matrix to infer the abundance of each cluster.
Note: Red box, the module Species; yellow box, the modules Index and Cluster; purple box, the module Strain.
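The MLE step described above can be illustrated with a generic expectation-maximization (EM) loop for mixture proportions. This is a minimal sketch, not MIST's own implementation: the function name `estimate_abundance`, the input matrix, and the convergence settings are assumptions made for illustration, and the way MIST derives per-read probabilities from mismatch values is not reproduced here.

```python
import numpy as np


def estimate_abundance(likelihood, n_iter=200, tol=1e-9):
    """Generic EM for mixture proportions (illustrative, not MIST's code).

    likelihood[i, k] is an assumed probability of observing read i given
    that it originates from cluster k. Every row must have a non-zero sum.
    """
    likelihood = np.asarray(likelihood, dtype=float)
    n_clusters = likelihood.shape[1]
    abundance = np.full(n_clusters, 1.0 / n_clusters)  # uniform start
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each read
        weighted = likelihood * abundance
        posterior = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: abundance is the mean posterior over all reads
        updated = posterior.mean(axis=0)
        converged = np.abs(updated - abundance).max() < tol
        abundance = updated
        if converged:
            break
    return abundance


# Hypothetical 4-read, 2-cluster example: three reads match cluster 0 only
unambiguous = [[1, 0], [1, 0], [1, 0], [0, 1]]
print(estimate_abundance(unambiguous))
```

Reads that fit several clusters equally well contribute fractional weight to each of them, which is why ambiguous reads widen the uncertainty of the abundance estimate.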
Requirements
Software
- Linux system
- Python = 3.6
- GCC >= 4.8
- Bowtie 2
- FastANI
- Python modules: networkx, pandas, matplotlib, numpy, scipy, scikit-learn, joblib, click
Pre-built database
- Pre-built pangenome: This folder is used in the species module and includes Bowtie-indexed pan-genomes of 14 bacterial species (listed below).
- Pre-built Bowtie-indexed reference genomes: For each species, the complete genomes deposited in the NCBI GenBank database were downloaded and Bowtie-indexed with the index module. These pre-built Bowtie index files are used in combination with the pre-built clustering files in the strain module.
| Species | Species |
| ---- | ---- |
| Acinetobacter baumannii | Campylobacter jejuni |
| Clostridioides difficile | Enterococcus faecalis |
| Enterococcus faecium | Escherichia coli |
| Haemophilus influenzae | Klebsiella pneumoniae |
| Legionella pneumophila | Listeria monocytogenes |
| Mycobacterium tuberculosis | Salmonella enterica |
| Staphylococcus aureus | Streptococcus pneumoniae |
- Pre-built clustering files: This folder contains the clustering files (obtained from the cluster module) for the 14 pathogens and is used in the strain module. As above, these clustering files can only be used in combination with the Bowtie-indexed reference genomes.
Note: In addition to the pre-built database above, you can build your own database with the following steps:
- Download reference genomes of the species of interest in FASTA format.
- Cluster these reference genomes with the cluster module.
- Bowtie-index these reference genomes with the index module.
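The last two steps can be scripted. The sketch below only assembles the `cluster` and `index` command lines using the flags shown in the Usage section of this README; the helper name `build_database_commands` and its defaults are hypothetical, and the check for `.fa` files reflects the naming requirement of the index module.

```python
from pathlib import Path


def build_database_commands(ref_dir, out_dir,
                            cutoffs=(0.98, 0.99, 0.999), threads=8):
    """Return the MIST cluster and index commands for a custom database.

    Hypothetical helper: it builds argument lists but does not run them.
    """
    ref_dir = Path(ref_dir)
    # The index module expects one FASTA file per genome with a .fa extension
    if not sorted(ref_dir.glob("*.fa")):
        raise FileNotFoundError(f"no .fa files found in {ref_dir}")
    cutoff_arg = ",".join(str(c) for c in cutoffs)
    cluster_cmd = ["python", "MIST.py", "cluster",
                   "--threads", str(threads),
                   "--refdir", str(ref_dir),
                   "--cutoff", cutoff_arg,
                   "--output", str(out_dir)]
    index_cmd = ["python", "MIST.py", "index",
                 "--refdir", str(ref_dir),
                 "--output", str(out_dir)]
    return [cluster_cmd, index_cmd]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from inside the cloned MIST directory.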
Installation
$ conda create -n MIST -c conda-forge -c bioconda python=3.6 fastANI bowtie2
$ conda activate MIST
$ git clone https://github.com/pandafengye/MIST.git
$ cd MIST
$ pip install -r requirements.txt --default-timeout=1000 # Install related python dependencies
$ python MIST.py # Test install
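If the test run fails, a quick check of the environment may help. The following is a small sketch that looks for the external programs and Python modules listed under Requirements; the function name `missing_dependencies` is an assumption, and the executable names (`bowtie2`, `bowtie2-build`, `fastANI`) are the ones these tools normally install.

```python
import importlib.util
import shutil

EXECUTABLES = ["bowtie2", "bowtie2-build", "fastANI"]
# Import names; scikit-learn is imported as "sklearn"
MODULES = ["networkx", "pandas", "matplotlib", "numpy",
           "scipy", "sklearn", "joblib", "click"]


def missing_dependencies():
    """Return (missing executables, missing Python modules)."""
    missing_exe = [e for e in EXECUTABLES if shutil.which(e) is None]
    missing_mod = [m for m in MODULES
                   if importlib.util.find_spec(m) is None]
    return missing_exe, missing_mod


if __name__ == "__main__":
    exe, mod = missing_dependencies()
    print("Missing executables:", exe or "none")
    print("Missing Python modules:", mod or "none")
```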
Usage
MIST contains five modules: index, cluster, species, strain, and bootstrap.
MIST-index
This module indexes the reference genomes with the Bowtie 2 indexer (bowtie2-build). Once the reference genomes are indexed, users do not need to re-index them before each analysis of a metagenomic dataset.
Command
$ python MIST.py index --refdir Example_Dir/input/ref_dir/ --output Example_Dir/output/
Options:
-i, --refdir PATH
Path to the reference genome folder. All reference genomes must be in FASTA format and placed in the same folder; each file represents one reference genome and has the .fa extension.
-o, --output PATH
Output folder saving the index files for reference genomes. The base name of the index is the same as the reference genome.
MIST-cluster
This module assigns reference genomes to clusters at user-defined levels. MIST calls the FastANI program to calculate ANI as an estimate of pairwise genetic distance, and the reference genomes are divided into clusters on that basis. As with the index module, once the clusters are established, users do not need to rerun this module before each independent job.
Command
$ python MIST.py cluster --threads 8 --refdir Example_Dir/input/ref_dir/ --cutoff 0.98,0.99,0.999 --output Example_Dir/output/
Options:
-t, --threads INTEGER
Number of threads for ANI
-i, --refdir PATH
input folder of reference genomes
-o, --output PATH
matrix file of the clustered reference genomes
-s, --cutoff TEXT
list of similarity thresholds (between 0 and 1); separated by comma (e.g. 0.98,0.99,0.999)
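To illustrate how several cutoffs yield hierarchical cluster levels, the sketch below groups genomes by single linkage: two genomes share a cluster when a chain of pairwise ANI values at or above the cutoff connects them. This grouping rule, the function name `cluster_by_ani`, and the ANI values are assumptions for illustration; MIST's actual clustering rule may differ. FastANI reports ANI as a percentage, so its values would need to be divided by 100 to match cutoffs between 0 and 1.

```python
from itertools import combinations


def cluster_by_ani(genomes, ani, cutoff):
    """Single-linkage grouping of genomes whose pairwise ANI >= cutoff."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, compressing the path
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        value = ani.get((a, b), ani.get((b, a), 0.0))
        if value >= cutoff:
            parent[find(a)] = find(b)

    # Number clusters 1, 2, ... in order of first appearance
    labels, result = {}, {}
    for g in genomes:
        root = find(g)
        labels.setdefault(root, len(labels) + 1)
        result[g] = labels[root]
    return result


# Hypothetical pairwise ANI values, expressed as fractions
genomes = ["g1", "g2", "g3", "g4"]
ani = {("g1", "g2"): 0.9995, ("g1", "g3"): 0.992, ("g2", "g3"): 0.991,
       ("g1", "g4"): 0.970, ("g2", "g4"): 0.970, ("g3", "g4"): 0.975}
hierarchy = {c: cluster_by_ani(genomes, ani, c) for c in (0.98, 0.99, 0.999)}
for cutoff, assignment in hierarchy.items():
    print(cutoff, assignment)
```

Raising the cutoff splits clusters into finer groups, so each genome receives one cluster label per level.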
MIST-species
This module performs species-level typing. MIST calls Bowtie2 to map the user's mNGS reads (in .fastq format) against the pan-genome of each bacterial species and estimates abundance by counting the reads mapped to each species. The species-specific reads are extracted from the resulting SAM file for downstream strain-level typing.
Command
$ python MIST.py species --threads 8 --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --database Pre-built-pangenome/ --output Example_Dir/output/
Options:
-p, --threads INTEGER
Number of threads for Bowtie2 (default: 8)
-1, --pair_1 PATH
input fq file with #1 mate, paired with pair_2 file
-2, --pair_2 PATH
input fq file with #2 mate, paired with pair_1 file
-d, --database PATH
input bowtie2 index file for the pan-genome sequences
-o, --output PATH
output folder which contains: 1) read counts for each pathogen species (_MIST_species_count.txt); 2) reads specific to each pathogen species (_MIST.*.fq).
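The read-counting idea behind this module can be sketched by tallying mapped reads per reference sequence in a SAM file. This is an illustration only, not the code MIST runs: the function name and the reference names in the example are hypothetical, and how reference sequences map to species in the pre-built pan-genome is not covered. The SAM layout itself is standard: the flag is in column 2, the reference name in column 3, and flag bits 0x4, 0x100, and 0x800 mark unmapped, secondary, and supplementary records.

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count primary mapped alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        # Skip unmapped, secondary, and supplementary records
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[rname] += 1
    return counts


# Hypothetical SAM records; the reference names are made up
sam_lines = [
    "@SQ\tSN:speciesA_gene1\tLN:1000",
    "r1\t0\tspeciesA_gene1\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\tspeciesB_gene7\t55\t40\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r4\t256\tspeciesA_gene1\t80\t1\t5M\t*\t0\t0\tACGTA\tIIIII",
]
counts = count_mapped_reads(sam_lines)
print(dict(counts))
```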
Tips:
The pre-built pan-genome index file is available here. For the reads specific to each pathogen species (_MIST.*.fq), 0.1x sequencing coverage of the bacterial genome (e.g. 5,000 100-bp reads for a 5-Mb genome) is usually sufficient for MIST to perform strain-level typing. Using many more reads (e.g. > 50,000) substantially increases the running time of the subsequent mapping and maximum likelihood estimation. Users can extract a subset of 5,000 reads (20,000 lines, at 4 lines per FASTQ record) with a command such as "head -n 20000 _MIST.*.fq > input.fq".
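The coverage arithmetic and the subsetting step can also be done in Python. The sketch below assumes standard 4-line FASTQ records (no line-wrapped sequences); both function names are hypothetical.

```python
from itertools import islice


def reads_for_coverage(genome_size, read_length, coverage=0.1):
    """Number of reads needed to reach the given coverage."""
    return round(coverage * genome_size / read_length)


def take_first_reads(src_path, dst_path, n_reads=5000):
    """Copy the first n_reads FASTQ records (4 lines each) to dst_path.

    Returns the number of complete records written.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        lines = list(islice(src, 4 * n_reads))
        usable = len(lines) - len(lines) % 4  # drop a trailing partial record
        dst.writelines(lines[:usable])
    return usable // 4


# 0.1x coverage of a 5-Mb genome with 100-bp reads
print(reads_for_coverage(5_000_000, 100))
```

For paired-end data, the same number of records would have to be taken from both mate files so that the pairs stay in sync.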
MIST-strain
This module maps metagenomic reads against the reference genomes using Bowtie2 and estimates the relative abundance of each cluster in the metagenomic dataset.
Command
$ python MIST.py strain --threads 8 --indexpath Example_Dir/output/_MIST_index/ --cluster_output Example_Dir/output/_MIST_ref_cluster.csv --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --read_length 200 --genome_size 5000000 --output Example_Dir/output/
Options:
-p, --threads INTEGER
Number of threads for Bowtie2 (default: 8)
-c, --cluster_output PATH
input file; the matrix of the clustered reference genomes
-i, --indexpath PATH
input folder of index files for reference genomes; produced by MIST-index module.
-U, --single_end PATH
input single-end fq file; can be the output produced by the MIST-species module.
-1, --pair_1 PATH
input fq file with #1 mate, paired with pair_2 file
-2, --pair_2 PATH
input fq file with #2 mate, paired with pair_1 file; provide either the paired-end input or the single-end input, not both.
-l, --read_length FLOAT
read length
-g, --genome_size INTEGER
genome size (optional)
-o, --output PATH
output folder for mismatch matrix file and alignment output files. A folder _MIST_map_alignment, which contains the mapped .sam files corresponding to each reference genome; a file _MIST_map_Mismatch_matrix.csv, which co
