StringMLST
Fast k-mer based tool for multi locus sequence typing (MLST)
stringMLST is a tool for detecting the MLST of an isolate directly from the genome sequencing reads. stringMLST predicts the ST of an isolate in a completely assembly- and alignment-free manner. The tool is designed in a lightweight, platform-independent fashion with minimal dependencies.
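To make the idea of assembly- and alignment-free typing concrete, here is a minimal sketch of generic k-mer voting for a single locus. It is not stringMLST's actual allele selection algorithm, which this README does not describe; the function names and the toy allele sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def build_kmer_index(alleles, k):
    """Map every k-mer to the set of allele names that contain it."""
    index = defaultdict(set)
    for name, seq in alleles.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def call_allele(reads, index, k):
    """Count k-mer hits per allele across all reads and return the top allele."""
    votes = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            # .get() avoids inserting empty entries into the defaultdict
            for name in index.get(read[i:i + k], ()):
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical toy alleles of one locus (not real MLST sequences).
alleles = {
    "adk_1": "ACGTACGTTAGC",
    "adk_2": "ACGTACGATAGC",
}
index = build_kmer_index(alleles, k=5)
reads = ["GTACGATAG", "ACGATAGC"]
print(call_allele(reads, index, k=5))  # the reads only match adk_2 beyond the shared prefix
```

Because reads are compared to alleles through exact k-mer lookups, no read alignment or genome assembly is needed; a real implementation would also have to handle ties, sequencing errors, and multiple loci.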
Some portions of the allele selection algorithm in stringMLST are patent pending. Please refer to the PATENTS file for additional information regarding licensing and use.
Reference: http://jordan.biology.gatech.edu/page/software/stringmlst/
Abstract: http://bioinformatics.oxfordjournals.org/content/early/2016/09/06/bioinformatics.btw586.short?rss=1
Application Note: http://bioinformatics.oxfordjournals.org/content/early/2016/09/06/bioinformatics.btw586.full.pdf+html
stringMLST is a tool, not a database. Always use the most up-to-date database files available. To help keep your databases current, stringMLST can download and build databases from pubMLST using the most recent allele and profile definitions. Please see the "Included databases and automated retrieval of databases from pubMLST" section below for instructions. The databases bundled here are for convenience only; do not rely on them being up to date.
stringMLST is licensed and distributed under CC Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). It is free for academic users; commercial use of any version of this code or algorithm requires prior permission. If you are a commercial user, please contact king.jordan@biology.gatech.edu for permission.
Recommended installation method
pip install stringMLST
Installation via git (Not recommended for most users)
git clone https://github.com/jordanlab/stringMLST
# Optional, download prebuilt databases
# We don't recommend this method, instead build the databases locally
cd stringMLST
git submodule init
git submodule update
Quickstart guide
pip install stringMLST
mkdir -p stringMLST_analysis; cd stringMLST_analysis
stringMLST.py --getMLST -P neisseria/nmb --species neisseria
# Download all available databases with:
# stringMLST.py --getMLST -P mlst_dbs --species all
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_2.fastq.gz
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
Sample abcZ adk aroE fumC gdh pdhC pgm ST
ERR026529 231 180 306 612 269 277 260 10174
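If you want to post-process the prediction table in a script, a small parser is enough. The sketch below assumes the output is whitespace-delimited with a header row, exactly as in the quickstart example above; `parse_st_table` is a hypothetical helper, not part of stringMLST.

```python
def parse_st_table(text):
    """Turn a whitespace-delimited result table into a list of dicts keyed by column."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]

# The table printed by the quickstart run above.
output = """Sample abcZ adk aroE fumC gdh pdhC pgm ST
ERR026529 231 180 306 612 269 277 260 10174"""

for record in parse_st_table(output):
    print(record["Sample"], "ST", record["ST"])
```

All values are kept as strings, so any non-numeric entry in an allele or ST column is preserved as-is.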
Python dependencies and external programs
stringMLST does not require any Python dependencies for basic usage (building databases and predicting STs).
For advanced use (genome coverage), stringMLST depends on the pyfaidx Python module and on bedtools, bwa, and samtools.
See the coverage section for more information.
stringMLST has been tested with:
pyfaidx: 0.4.8.1
samtools: 1.3 (Using htslib 1.3.1) [Requires the 1.x branch of samtools]
bedtools: v2.24.0
bwa: 0.7.13-r1126
To install the dependencies
# pyfaidx
pip install --user pyfaidx
# samtools
wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2 -O samtools-1.3.1.tar.bz2
tar xf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
make
make prefix=$HOME install
# bedtools
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2; make
cp ./bin/* ~/bin
# bwa
git clone https://github.com/lh3/bwa.git
cd bwa; make
cp bwa ~/bin/bwa
export PATH=$PATH:$HOME/bin
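After installing, it can be useful to confirm that the coverage dependencies are actually visible from Python. The following is a minimal sketch using only the standard library; `missing_dependencies` is a hypothetical helper name, and the check only tests presence on `PATH` or importability, not versions.

```python
import importlib.util
import shutil

def missing_dependencies(tools=("bwa", "samtools", "bedtools"), modules=("pyfaidx",)):
    """Return the external programs and Python modules that cannot be found."""
    missing = [tool for tool in tools if shutil.which(tool) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing

missing = missing_dependencies()
if missing:
    print("Missing for coverage analysis:", ", ".join(missing))
else:
    print("All coverage dependencies found.")
```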
Usage for Example Read Files (Neisseria meningitidis)
- Download stringMLST.py, example read files (ERR026529, ERR027250, ERR036104) and the dataset for Neisseria meningitidis (Neisseria_spp.zip).
Build database:
# Add dir to path
export PATH=$PATH:$PWD
# Will connect to EBI's SRA servers
download_example_reads.sh
- Extract the MLST loci dataset.
unzip datasets/Neisseria_spp.zip -d datasets
- Create or use a config file specifying the location of all the locus and profile files. Example config file (Neisseria_spp/config.txt):
[loci]
abcZ datasets/Neisseria_spp/abcZ.fa
adk datasets/Neisseria_spp/adk.fa
aroE datasets/Neisseria_spp/aroE.fa
fumC datasets/Neisseria_spp/fumC.fa
gdh datasets/Neisseria_spp/gdh.fa
pdhC datasets/Neisseria_spp/pdhC.fa
pgm datasets/Neisseria_spp/pgm.fa
[profile]
profile datasets/Neisseria_spp/neisseria.txt
- Run stringMLST.py --buildDB to create DB. Choose a k value and prefix (optional).
stringMLST.py --buildDB -c datasets/Neisseria_spp/config.txt -k 35 -P NM
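The config layout shown above (a `[loci]` section followed by a `[profile]` section) is simple enough to validate before building a database. The sketch below is one way to read it, assuming each entry is a name followed by a path separated by a tab or spaces; `parse_config` is a hypothetical helper and not the parser stringMLST uses internally.

```python
def parse_config(text):
    """Parse a stringMLST-style config into (loci dict, profile path)."""
    loci, profile, section = {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].lower()  # "loci" or "profile"
            continue
        parts = line.split(None, 1)       # name, then the rest of the line as path
        if len(parts) != 2:
            raise ValueError(f"expected 'name<TAB>path', got: {line!r}")
        name, path = parts
        if section == "loci":
            loci[name] = path
        elif section == "profile":
            profile = path
    return loci, profile

config_text = """[loci]
abcZ\tdatasets/Neisseria_spp/abcZ.fa
adk\tdatasets/Neisseria_spp/adk.fa
[profile]
profile\tdatasets/Neisseria_spp/neisseria.txt
"""
loci, profile = parse_config(config_text)
print(sorted(loci), profile)
```

A natural follow-up check is to confirm that every listed path exists before running `--buildDB`.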
Predict:
Single sample :
stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fastq -2 tests/fastqs/ERR026529_2.fastq -k 35 -P NM
Batch mode (all the samples together):
stringMLST.py --predict -d ./tests/fastqs/ -k 35 -P NM
List mode:
Create a list file (list_paired.txt) as :
tests/fastqs/ERR026529_1.fastq tests/fastqs/ERR026529_2.fastq
tests/fastqs/ERR027250_1.fastq tests/fastqs/ERR027250_2.fastq
tests/fastqs/ERR036104_1.fastq tests/fastqs/ERR036104_2.fastq
Run the tool as:
stringMLST.py --predict -l list_paired.txt -k 35 -P NM
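For many samples, the list file can be generated rather than written by hand. This sketch pairs files named `*_1.fastq` and `*_2.fastq` in a directory, as in the example above; it assumes the two paths on each line are tab-separated, and `paired_list_lines` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def paired_list_lines(directory, suffix=".fastq"):
    """Return 'read1<TAB>read2' lines for every complete *_1/*_2 pair in directory."""
    tail = f"_1{suffix}"
    lines = []
    for r1 in sorted(Path(directory).glob(f"*{tail}")):
        # Swap only the trailing "_1<suffix>" for "_2<suffix>".
        r2 = r1.with_name(r1.name[:-len(tail)] + f"_2{suffix}")
        if r2.exists():
            lines.append(f"{r1}\t{r2}")
    return lines

# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ERR026529_1.fastq", "ERR026529_2.fastq", "ERR027250_1.fastq"):
        (Path(tmp) / name).touch()
    demo_lines = paired_list_lines(tmp)
    print("\n".join(demo_lines))  # ERR027250 is skipped because its mate is missing
```

Writing the result with `Path("list_paired.txt").write_text("\n".join(lines) + "\n")` produces a file that can be passed to `-l`.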
Working with gzipped files
stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fq.gz -2 tests/fastqs/ERR026529_2.fq.gz -p -P NM -k 35 -o ST_NM.txt
Usage Documentation
stringMLST's workflow is divided into two routines:
- Database building and
- ST discovery
Database building: Builds the stringMLST database that is used to assign STs to input sample files. This step is required once for each organism. Please note that stringMLST can work with a custom, user-defined typing scheme, but its efficiency has not been tested on other typing schemes.
ST discovery: This routine takes the database created in the last step and predicts the ST of the input sample(s). Please note that the database building is required prior to this routine. stringMLST is capable of processing single-end and paired-end files. It can run in three modes:
- Single sample mode - for running stringMLST on a single sample
- Batch mode - for running stringMLST on all the FASTQ files present in a directory
- List mode - for running stringMLST on all the FASTQ files provided in a list file
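When driving stringMLST from a pipeline, the three modes differ only in how the input is specified. The sketch below builds the argument list for each mode using only flags that appear in this README (`--predict`, `-1`, `-2`, `-d`, `-l`, `-k`, `-P`, `-o`); `predict_command` is a hypothetical wrapper, and it does not cover the single-end or coverage options.

```python
import shlex

def predict_command(prefix, k=35, fastq1=None, fastq2=None,
                    directory=None, list_file=None, output=None):
    """Build the argument list for exactly one of the three --predict modes."""
    modes = [fastq1 is not None, directory is not None, list_file is not None]
    if sum(modes) != 1:
        raise ValueError("choose exactly one of fastq1, directory, or list_file")
    cmd = ["stringMLST.py", "--predict"]
    if fastq1 is not None:        # single sample mode
        cmd += ["-1", fastq1]
        if fastq2 is not None:
            cmd += ["-2", fastq2]
    elif directory is not None:   # batch mode
        cmd += ["-d", directory]
    else:                         # list mode
        cmd += ["-l", list_file]
    cmd += ["-k", str(k), "-P", prefix]
    if output is not None:
        cmd += ["-o", output]
    return cmd

print(shlex.join(predict_command("NM", fastq1="tests/fastqs/ERR026529_1.fastq",
                                 fastq2="tests/fastqs/ERR026529_2.fastq")))
print(shlex.join(predict_command("NM", directory="./tests/fastqs/")))
print(shlex.join(predict_command("NM", list_file="list_paired.txt")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once stringMLST.py is on the `PATH`.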
Readme for stringMLST
=============================================================================================
Usage
./stringMLST.py
[--buildDB]
[--predict]
[-1 filename_fastq1][--fastq1 filename_fastq1]
[-2 filename_fastq2][--fastq2 filename_fastq2]
[-d directory][--dir directory][--directory directory]
[-l list_file][--list list_file]
[-p][--paired]
[-s][--single]
[-c][--config]
[-P][--prefix]
[-z][--fuzzy]
[-a]
[-C][--coverage]
[-k]
[-o output_filename][--output output_filename]
[-x][--overwrite]
[-t]
[-r]
[-v]
[-h][--help]
==============================================================================================
There are two steps to predicting ST using stringMLST.
1. Create DB : stringMLST.py --buildDB
2. Predict : stringMLST.py --predict
1. stringMLST.py --buildDB
Synopsis:
stringMLST.py --buildDB -c <config file> -k <kmer length(optional)> -P <DB prefix(optional)>
config file : a tab-delimited file that describes the typing scheme, i.e. each locus with its multi-FASTA file, and the profile definition file.
Format :
[loci]
locus1 locusFile1
locus2 locusFile2
[profile]
profile profileFile
kmer length : the k-mer length for the DB. Note that it must be smaller than the length of the reads to be processed.
We suggest k-mer lengths of 35 to 66, depending on the read length.
DB prefix(optional) : sets the name and location of the DB files to be created. This module creates 3 files with this prefix.
You can include a folder path in the prefix to store your DB at a particular location.
Required arguments
--buildDB
Identifier for build db module
-c,--config = <configuration file>
Config file in the format described above.
All the files follow the structure used by pubMLST. Refer to the extended documentation for details.
Optional arguments
-k = <kmer length>
K-mer size with which the DB is built (default k = 35). Note that the tool works best with k-mer lengths between 35 and 66
for read lengths of 55 to 150 bp. The k-mer size can be increased for longer reads. It is advisable to use smaller k-mer sizes
if the read quality is poor.
-P,--prefix = <prefix>
Prefix for the DB and log files to be created (default = kmer). You can also specify the folder where you want the DB to be created.
-a
File location to write build log
-h,--help
Prints the help manual for this application
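Since the k-mer length has to be smaller than the read length, it can help to check the reads before choosing `-k`. The sketch below assumes standard four-line FASTQ records and samples only the first records; both function names are hypothetical helpers, not part of stringMLST.

```python
def shortest_read(fastq_lines, max_reads=1000):
    """Return the shortest sequence length among the first max_reads records."""
    shortest = None
    for i, line in enumerate(fastq_lines):
        if i >= 4 * max_reads:
            break
        if i % 4 == 1:  # the sequence is the second line of each 4-line record
            n = len(line.strip())
            shortest = n if shortest is None else min(shortest, n)
    return shortest

def k_fits_reads(k, read_length):
    """True when k is strictly smaller than the read length, as advised above."""
    return read_length is not None and k < read_length

# Two toy records; real input would be an open FASTQ file handle.
fastq = ["@r1", "ACGTACGTAC", "+", "IIIIIIIIII",
         "@r2", "ACGTACG", "+", "IIIIIII"]
n = shortest_read(fastq)
print(n, k_fits_reads(5, n), k_fits_reads(35, n))
```

For real data, pass an open file handle instead of a list, for example `open(path)` for plain FASTQ or `gzip.open(path, "rt")` for gzipped files.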
--------------------------------------------------------------------------------------------
2. stringMLST.py --predict
stringMLST --predict : can run in three modes
1) single sample (default mode)
2) batch mode : run stringMLST for all the samples in a folder (for a particular species)
3) list mode : run stringMLST on samples specified in a file
stringMLST can process both single-end and paired-end files. By default, the program expects paired-end files.
Synopsis
stringMLST.py --predict -1 <fastq file> -2 <fastq file> -d <directory location> -l <list file> -p -s -P <DB prefix(optional)> -k <kmer l
