MIST (Metagenomic Intra-Species Typing)

Introduction

MIST is a metagenomic intra-species typing technique developed primarily for clinical specimens with low pathogen loads. It is intended to aid strain-level diagnosis of bacterial infections as well as public health epidemiology and surveillance. Its algorithm has three key features:

  • Based on average nucleotide identity (ANI), reference genomes are clustered into hierarchical levels to resolve the ambiguous definition of “strain”.
  • Maximum likelihood estimation is performed on the reads’ mismatch values to infer the compositional abundance.
  • Read ambiguity is used to infer the abundance uncertainty, and the similarity to reference genomes is used to predict the presence of novel strains.

MIST contains five modules: Index, Cluster, Species, Strain, and Bootstrap. Its workflow is depicted in the figure below. For species-level typing, MIST relies on the counts of reads matched to the pan-genome of each pathogen species (panel A). For strain-level typing, MIST prepares a hierarchical database of reference genomes based on ANI grouping (panel B). Reads are aligned to all of the species' reference genomes, and the alignment scores are transformed into posterior probabilities that indicate how likely each read is to belong to each cluster. Maximum likelihood estimation (MLE) is then applied to the resulting probability matrix to infer the abundance of each cluster.
<p align="center"><img src="https://github.com/pandafengye/MIST/blob/master/Pipeline.png" alt="workflow_small" width="800"></p>

Note: Red box, the module Species; yellow box, the modules Index and Cluster; purple box, the module Strain.
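
The maximum likelihood step described above can be illustrated with a short expectation-maximization (EM) sketch. This is not MIST's own code: the function name `estimate_abundance`, the toy matrix, and the use of EM as the optimizer are assumptions made for illustration. Each row holds one read's likelihood under each cluster, and the result is the mixture proportion of each cluster.

```python
from typing import List


def estimate_abundance(likelihoods: List[List[float]],
                       n_iter: int = 200,
                       tol: float = 1e-9) -> List[float]:
    """EM estimate of cluster proportions (hypothetical helper, not MIST's API).

    likelihoods[r][k] is the likelihood of read r given cluster k.
    """
    n_clusters = len(likelihoods[0])
    abundance = [1.0 / n_clusters] * n_clusters  # start from a uniform guess
    for _ in range(n_iter):
        totals = [0.0] * n_clusters
        used = 0
        for row in likelihoods:
            # E-step: posterior probability that this read came from each cluster.
            weighted = [a * p for a, p in zip(abundance, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every cluster; skip it
            used += 1
            for k, w in enumerate(weighted):
                totals[k] += w / norm
        if used == 0:
            raise ValueError("no read is compatible with any cluster")
        # M-step: new abundance is the mean posterior over the reads used.
        updated = [t / used for t in totals]
        converged = max(abs(u - a) for u, a in zip(updated, abundance)) < tol
        abundance = updated
        if converged:
            break
    return abundance


# Toy data: 3 reads unique to cluster 0, 1 unique to cluster 1, 4 ambiguous.
TOY_LIKELIHOODS = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] + [[0.5, 0.5]] * 4

if __name__ == "__main__":
    print(estimate_abundance(TOY_LIKELIHOODS))
```

In the toy data the ambiguous reads carry no information of their own, so the estimate is driven by the 3:1 ratio of uniquely assigned reads.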

Requirements

Software

Pre-built database

  • Pre-built pangenome: This folder is used in the species module and includes Bowtie-indexed pan-genomes of 14 bacterial species (listed below).
  • Pre-built Bowtie indexed reference genomes: For each species, the complete genomes deposited in the NCBI GenBank database were downloaded and Bowtie-indexed with the index module. These pre-built Bowtie index files are used in combination with the pre-built clustering files in the strain module.

| Species | Species |
| ---- | ---- |
| Acinetobacter baumannii | Campylobacter jejuni |
| Clostridioides difficile | Enterococcus faecalis |
| Enterococcus faecium | Escherichia coli |
| Haemophilus influenzae | Klebsiella pneumoniae |
| Legionella pneumophila | Listeria monocytogenes |
| Mycobacterium tuberculosis | Salmonella enterica |
| Staphylococcus aureus | Streptococcus pneumoniae |

  • Pre-built clustering files: This folder contains the clustering files (obtained from the cluster module) for the 14 pathogens and is used in the strain module. These clustering files can only be used in combination with the pre-built Bowtie-indexed reference genomes.

Note: In addition to the pre-built database above, you can build your own database with the following steps:

  • Download reference genomes of the species of interest in FASTA format.
  • Cluster these reference genomes with the cluster module.
  • Bowtie-index these reference genomes with the index module.
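
The last two steps can be scripted. The sketch below only assembles the command lines, using the flags documented in the Usage section; the helper names `cluster_command` and `index_command` and the `my_species/` paths are hypothetical, and the commands are printed rather than executed.

```python
import shlex
from typing import List, Sequence


def cluster_command(refdir: str, output: str,
                    cutoffs: Sequence[float] = (0.98, 0.99, 0.999),
                    threads: int = 8) -> List[str]:
    """Build the argv for `MIST.py cluster` (hypothetical helper)."""
    for cutoff in cutoffs:
        if not 0 < cutoff <= 1:
            raise ValueError("cutoffs must lie between 0 and 1")
    return ["python", "MIST.py", "cluster",
            "--threads", str(threads),
            "--refdir", refdir,
            "--cutoff", ",".join(str(c) for c in cutoffs),
            "--output", output]


def index_command(refdir: str, output: str) -> List[str]:
    """Build the argv for `MIST.py index` (hypothetical helper)."""
    return ["python", "MIST.py", "index",
            "--refdir", refdir, "--output", output]


if __name__ == "__main__":
    # Paths are placeholders; each argv could be passed to subprocess.run.
    for argv in (cluster_command("my_species/ref_dir/", "my_species/output/"),
                 index_command("my_species/ref_dir/", "my_species/output/")):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the argument list first makes it easy to check the cutoffs before launching a long-running job.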

Installation

$ conda create -n MIST -c conda-forge -c bioconda python=3.6 fastANI bowtie2
$ conda activate MIST
$ git clone https://github.com/pandafengye/MIST.git
$ cd MIST
$ pip install -r requirements.txt --default-timeout=1000 # Install related python dependencies
$ python MIST.py # Test install

Usage

MIST contains five modules: index, cluster, species, strain, and bootstrap.

MIST-index

This module indexes the reference genomes with the Bowtie2 indexer (bowtie2-build). Once the reference genomes are indexed, they do not need to be re-indexed before each analysis of a metagenomic dataset.

Command

$ python MIST.py index --refdir Example_Dir/input/ref_dir/ --output Example_Dir/output/

Options:

-i, --refdir PATH
     Path to the reference genome folder. All reference genomes should be in FASTA format and placed in the same folder; each file represents one reference genome and has a .fa extension.
-o, --output PATH
     Output folder for the index files of the reference genomes. The base name of each index is the same as that of its reference genome.

MIST-cluster

This module assigns reference genomes to clusters at user-defined levels. MIST calls the FastANI program to calculate ANI as an estimate of pairwise genetic distance, and the reference genomes are divided into clusters on that basis. As with the index module, once the clusters are established, this module does not need to be rerun before each independent job.

Command

$ python MIST.py cluster --threads 8 --refdir Example_Dir/input/ref_dir/ --cutoff 0.98,0.99,0.999 --output Example_Dir/output/

Options:

-t, --threads INTEGER
Number of threads for the ANI calculation
-i, --refdir PATH
Input folder of reference genomes
-o, --output PATH
Matrix file of the clustered reference genomes
-s, --cutoff TEXT
List of similarity thresholds (between 0 and 1), separated by commas (e.g. 0.98,0.99,0.999)
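
To make the effect of the `--cutoff` list concrete, the sketch below groups genomes at several ANI thresholds. The grouping rule MIST applies is not described here, so single-linkage grouping with a union-find structure is an assumption, as are the function name `cluster_by_ani` and the toy ANI values (given as fractions to match the cutoffs).

```python
from typing import Dict, List, Tuple


def cluster_by_ani(genomes: List[str],
                   ani: Dict[Tuple[str, str], float],
                   cutoff: float) -> Dict[str, int]:
    """Single-linkage grouping at one ANI cutoff (assumed rule, not MIST's code)."""
    parent = {g: g for g in genomes}

    def find(g: str) -> str:
        # Follow parent links to the set representative, halving paths on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if value >= cutoff:
            parent[find(a)] = find(b)  # merge the two groups

    labels: Dict[str, int] = {}
    result: Dict[str, int] = {}
    for g in genomes:
        root = find(g)
        labels.setdefault(root, len(labels) + 1)  # number groups in input order
        result[g] = labels[root]
    return result


# Toy pairwise ANI values (hypothetical).
TOY_GENOMES = ["A", "B", "C", "D"]
TOY_ANI = {("A", "B"): 0.995, ("A", "C"): 0.985, ("B", "C"): 0.984,
           ("A", "D"): 0.970, ("B", "D"): 0.970, ("C", "D"): 0.970}

if __name__ == "__main__":
    for cutoff in (0.98, 0.99, 0.999):
        print(cutoff, cluster_by_ani(TOY_GENOMES, TOY_ANI, cutoff))
```

Raising the cutoff splits groups into finer ones, which is what gives the reference database its hierarchical levels.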

MIST-species

This module performs species-level typing. MIST calls Bowtie2 to map the user’s mNGS reads (in .fastq format) against the pan-genomes of each bacterial species and estimates the abundance by counting the reads mapped to each species. The species-specific reads are extracted from the resulting SAM file for the downstream strain-level typing.

Command

$ python MIST.py species --threads 8 --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --database Pre-built-pangenome/ --output Example_Dir/output/

Options:

-p, --threads INTEGER      
Number of threads for Bowtie2 (default: 8)
-1, --pair_1 PATH         
input fq file with #1 mate, paired with pair_2 file
-2, --pair_2 PATH
input fq file with #2 mate, paired with pair_1 file
-d, --database PATH
input bowtie2 index file for the pan-genome sequences
-o, --output PATH        
output folder which contains: 1) read counts for each pathogen species (_MIST_species_count.txt); 2) reads specific to each pathogen species (_MIST.*.fq).

Tips:

The pre-built pan-genome index file is available here. For the reads specific to each pathogen species (_MIST.*.fq), 0.1x sequencing coverage of the bacterial genome (e.g. 5000 100-bp reads for a 5-Mb bacterial genome) is usually sufficient for MIST to perform strain-level typing. Using too many reads (e.g. > 50000) in the subsequent mapping and maximum likelihood estimation leads to long running times. Users can extract a subset of 5000 reads (20000 lines, since each FASTQ record spans four lines) with a command such as "head -n 20000 _MIST.*.fq > input.fq".
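
The arithmetic behind this tip can be written out as a short sketch. The helper names `reads_for_coverage` and `first_reads` are hypothetical; the second one mirrors `head` by keeping the first records of a FASTQ stream and does not sample randomly.

```python
import itertools
import math
from fractions import Fraction
from typing import Iterable, List


def reads_for_coverage(coverage: float, genome_size: int, read_length: int) -> int:
    """Number of reads needed to reach a target coverage (hypothetical helper)."""
    # Fraction(str(...)) avoids binary floating-point noise before rounding up.
    exact = Fraction(str(coverage)) * genome_size / read_length
    return math.ceil(exact)


def first_reads(fastq_lines: Iterable[str], n_reads: int) -> List[str]:
    """Keep the first n_reads FASTQ records (four lines each), like `head`."""
    return list(itertools.islice(fastq_lines, 4 * n_reads))


# Toy FASTQ content with three records (hypothetical data).
TOY_FASTQ = ["@r1", "ACGT", "+", "IIII",
             "@r2", "TTGA", "+", "IIII",
             "@r3", "GGCC", "+", "IIII"]

if __name__ == "__main__":
    n = reads_for_coverage(0.1, 5_000_000, 100)
    print(n, "reads,", 4 * n, "FASTQ lines")
    print(first_reads(TOY_FASTQ, 2))
```

With the values from the tip (0.1x coverage, a 5-Mb genome, 100-bp reads) this gives 5000 reads, which is the 20000 lines passed to `head`.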

MIST-strain

This module functions to map metagenomic sequences against reference genomes using Bowtie2, and to measure the relative abundance of each cluster in the metagenomics dataset.

Command

$ python MIST.py strain --threads 8 --indexpath Example_Dir/output/_MIST_index/ --cluster_output Example_Dir/output/_MIST_ref_cluster.csv --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --read_length 200 --genome_size 5000000 --output Example_Dir/output/

Options:

  -p, --threads INTEGER      
Number of threads for Bowtie2 (default: 8)
  -c, --cluster_output PATH  
 input file; the matrix of the clustered reference genomes
  -i, --indexpath PATH       
input folder of index files for reference genomes; produced by MIST-index module.
  -U, --single_end PATH     
input single-end fq file; this can be the output produced by the MIST-species module.
  -1, --pair_1 PATH         
input fq file with #1 mate, paired with pair_2 file
  -2, --pair_2 PATH         
input fq file with #2 mate, paired with pair_1 file; choose either the paired input or the single-end input.
  -l, --read_length FLOAT   
read length
  -g, --genome_size INTEGER  
genome size (optional)
  -o, --output PATH        
output folder for mismatch matrix file and alignment output files. A folder _MIST_map_alignment, which contains the mapped .sam files corresponding to each reference genome; a file _MIST_map_Mismatch_matrix.csv, which co