stringMLST

stringMLST is a fast k-mer based tool for multilocus sequence typing (MLST). It detects the MLST of an isolate directly from genome sequencing reads and predicts the sequence type (ST) in a completely assembly-free and alignment-free manner. The tool is lightweight and platform-independent, with minimal dependencies.

Some portions of the allele selection algorithm in stringMLST are patent pending. Please refer to the PATENTS file for additional information regarding licensing and use.

Reference http://jordan.biology.gatech.edu/page/software/stringmlst/

Abstract http://bioinformatics.oxfordjournals.org/content/early/2016/09/06/bioinformatics.btw586.short?rss=1

Application Note http://bioinformatics.oxfordjournals.org/content/early/2016/09/06/bioinformatics.btw586.full.pdf+html

stringMLST is a tool, not a database. Always use the most up-to-date database files available. To help keep your databases current, stringMLST can download and build databases from pubMLST using the most recent allele and profile definitions. Please see the "Included databases and automated retrieval of databases from pubMLST" section below for instructions. The databases bundled here are for convenience only; do not rely on them being up to date.

stringMLST is licensed and distributed under CC Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). It is free for academic users; commercial use of any version of this code or algorithm requires prior permission. If you are a commercial user, please contact king.jordan@biology.gatech.edu for permissions.

Recommended installation method

pip install stringMLST

Installation via git (Not recommended for most users)

git clone https://github.com/jordanlab/stringMLST
# Optional, download prebuilt databases
# We don't recommend this method, instead build the databases locally
cd stringMLST
git submodule init
git submodule update

Quickstart guide

pip install stringMLST
mkdir -p stringMLST_analysis; cd stringMLST_analysis
stringMLST.py --getMLST -P neisseria/nmb --species neisseria
# Download all available databases with:
# stringMLST.py --getMLST -P mlst_dbs --species all
wget  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_2.fastq.gz
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
Sample  abcZ    adk     aroE    fumC    gdh     pdhC    pgm     ST
ERR026529       231     180     306     612     269     277     260     10174
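The result table above is easy to post-process. The following is a minimal sketch, assuming the output is tab-delimited with a `Sample` column followed by one column per locus and a final `ST` column, as in the quickstart run. The helper name `parse_predictions` is hypothetical and not part of stringMLST.

```python
import csv
import io

# Output copied from the quickstart run above.
# Assumption: columns are separated by tabs.
EXAMPLE_OUTPUT = (
    "Sample\tabcZ\tadk\taroE\tfumC\tgdh\tpdhC\tpgm\tST\n"
    "ERR026529\t231\t180\t306\t612\t269\t277\t260\t10174\n"
)


def parse_predictions(text):
    """Return {sample: {column: value}} from a stringMLST-style result table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    results = {}
    for row in reader:
        sample = row.pop("Sample")
        results[sample] = row  # values are kept as strings
    return results


predictions = parse_predictions(EXAMPLE_OUTPUT)
print(predictions["ERR026529"]["ST"])  # prints 10174
```

Values are left as strings so that any call that is not a plain integer survives parsing unchanged.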

Python dependencies and external programs

stringMLST does not require any Python dependencies for basic usage (building databases and predicting STs).

For advanced use (genome coverage), stringMLST depends on the pyfaidx Python module and on bedtools, bwa, and samtools. See the coverage section for more information.

stringMLST has been tested with:

pyfaidx: 0.4.8.1
samtools: 1.3 (Using htslib 1.3.1)  [Requires the 1.x branch of samtools]
bedtools: v2.24.0
bwa: 0.7.13-r1126

To install the dependencies

# pyfaidx
pip install --user pyfaidx
# samtools
wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2 -O samtools-1.3.1.tar.bz2
tar xf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
make
make prefix=$HOME install
# bedtools
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2; make
cp ./bin/* ~/bin
# bwa
git clone https://github.com/lh3/bwa.git
cd bwa; make
cp bwa ~/bin/bwa
export PATH=$PATH:$HOME/bin

Usage for Example Read Files (Neisseria meningitidis)

Download stringMLST.py, example read files (ERR026529, ERR027250, ERR036104) and the dataset for Neisseria meningitidis (Neisseria_spp.zip).

Build database:

# Add dir to path
export PATH=$PATH:$PWD
# Will connect to EBI's SRA servers
download_example_reads.sh

Extract the MLST loci dataset.

unzip datasets/Neisseria_spp.zip -d datasets

Create or use a config file specifying the location of all the locus and profile files. Example config file (datasets/Neisseria_spp/config.txt):

[loci]
abcZ  datasets/Neisseria_spp/abcZ.fa
adk datasets/Neisseria_spp/adk.fa
aroE  datasets/Neisseria_spp/aroE.fa
fumC  datasets/Neisseria_spp/fumC.fa
gdh datasets/Neisseria_spp/gdh.fa
pdhC  datasets/Neisseria_spp/pdhC.fa
pgm datasets/Neisseria_spp/pgm.fa
[profile]
profile datasets/Neisseria_spp/neisseria.txt
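A config file in this layout can also be generated rather than typed by hand. The sketch below assumes each locus FASTA is named `<locus>.fa` inside one directory, as in the example above, and that fields are tab-separated. The function `make_config` is a hypothetical helper, not something shipped with stringMLST.

```python
from pathlib import Path


def make_config(locus_dir, profile_file, loci):
    """Build config text: a [loci] section followed by a [profile] section."""
    lines = ["[loci]"]
    for locus in loci:
        # Assumption: each locus has a multi-FASTA file named <locus>.fa
        fasta = Path(locus_dir) / f"{locus}.fa"
        lines.append(f"{locus}\t{fasta.as_posix()}")
    lines.append("[profile]")
    lines.append(f"profile\t{Path(profile_file).as_posix()}")
    return "\n".join(lines) + "\n"


NEISSERIA_LOCI = ["abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm"]
config_text = make_config(
    "datasets/Neisseria_spp",
    "datasets/Neisseria_spp/neisseria.txt",
    NEISSERIA_LOCI,
)
print(config_text)
```

Writing `config_text` to a file of your choice gives a config that can be passed to `--buildDB` with `-c`.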

Run stringMLST.py --buildDB to create DB. Choose a k value and prefix (optional).

stringMLST.py --buildDB -c datasets/Neisseria_spp/config.txt -k 35 -P NM

Predict:

Single sample :

stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fastq -2 tests/fastqs/ERR026529_2.fastq -k 35 -P NM

Batch mode (all the samples together):

stringMLST.py --predict -d ./tests/fastqs/ -k 35 -P NM

List mode:

Create a list file (list_paired.txt) as :

tests/fastqs/ERR026529_1.fastq  tests/fastqs/ERR026529_2.fastq
tests/fastqs/ERR027250_1.fastq  tests/fastqs/ERR027250_2.fastq
tests/fastqs/ERR036104_1.fastq  tests/fastqs/ERR036104_2.fastq

Run the tool as:

stringMLST.py --predict -l list_paired.txt -k 35 -P NM
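For many samples, the list file can be built from the contents of a directory. This is a rough sketch, assuming mates are named `<sample>_1.fastq` and `<sample>_2.fastq` and that the two paths on each line are separated by a tab. The function name `paired_list_lines` is hypothetical.

```python
import tempfile
from pathlib import Path


def paired_list_lines(fastq_dir, suffix1="_1.fastq", suffix2="_2.fastq"):
    """Return one 'read1<TAB>read2' line per sample that has both mates."""
    lines = []
    for read1 in sorted(Path(fastq_dir).glob(f"*{suffix1}")):
        read2 = read1.with_name(read1.name[: -len(suffix1)] + suffix2)
        if read2.exists():  # samples without a second mate are skipped
            lines.append(f"{read1.as_posix()}\t{read2.as_posix()}")
    return lines


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ERR026529_1.fastq", "ERR026529_2.fastq", "ERR027250_1.fastq"):
        (Path(tmp) / name).touch()
    demo_lines = paired_list_lines(tmp)
    demo_names = [[Path(p).name for p in line.split("\t")] for line in demo_lines]

print(demo_names)  # prints [['ERR026529_1.fastq', 'ERR026529_2.fastq']]
```

In real use, something like `Path("list_paired.txt").write_text("\n".join(paired_list_lines("tests/fastqs")) + "\n")` would produce the file passed to `-l`.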

Working with gzipped files

stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fq.gz -2 tests/fastqs/ERR026529_2.fq.gz -p -P NM -k 35 -o ST_NM.txt

Usage Documentation

stringMLST's workflow is divided into two routines:

Database building and
ST discovery

Database building: Builds the stringMLST database that is used for assigning STs to input sample files. This step is required once for each organism. Please note that stringMLST can work with a custom, user-defined typing scheme, but its efficiency has not been tested on other typing schemes.

ST discovery: This routine takes the database created in the last step and predicts the ST of the input sample(s). Please note that the database building is required prior to this routine. stringMLST is capable of processing single-end and paired-end files. It can run in three modes:

Single sample mode - for running stringMLST on a single sample
Batch mode - for running stringMLST on all the FASTQ files present in a directory
List mode - for running stringMLST on all the FASTQ files provided in a list file
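When driving stringMLST from a script, the three modes differ only in which input option is passed. The sketch below assembles the argument list for each mode using only flags that appear in this README (`-1`, `-2`, `-d`, `-l`, `-s`, `-k`, `-P`). The wrapper `build_predict_command` is a hypothetical convenience, not part of the tool.

```python
def build_predict_command(prefix, k=35, fastq1=None, fastq2=None,
                          directory=None, list_file=None, single_end=False):
    """Return the argument list for one stringMLST.py --predict invocation."""
    chosen = [fastq1 is not None, directory is not None, list_file is not None]
    if sum(chosen) != 1:
        raise ValueError("choose exactly one of fastq1, directory, or list_file")

    command = ["stringMLST.py", "--predict"]
    if fastq1 is not None:          # single sample mode
        command += ["-1", fastq1]
        if fastq2 is not None:
            command += ["-2", fastq2]
    elif directory is not None:     # batch mode
        command += ["-d", directory]
    else:                           # list mode
        command += ["-l", list_file]
    if single_end:
        command.append("-s")        # paired-end input is the default
    command += ["-k", str(k), "-P", prefix]
    return command


print(build_predict_command("NM", fastq1="tests/fastqs/ERR026529_1.fastq",
                            fastq2="tests/fastqs/ERR026529_2.fastq"))
print(build_predict_command("NM", directory="./tests/fastqs/"))
print(build_predict_command("NM", list_file="list_paired.txt"))
```

The returned list can be handed to `subprocess.run(command, check=True)` once stringMLST.py is on the PATH.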

Readme for stringMLST
=============================================================================================
Usage
./stringMLST.py
[--buildDB]
[--predict]
[-1 filename_fastq1][--fastq1 filename_fastq1]
[-2 filename_fastq2][--fastq2 filename_fastq2]
[-d directory][--dir directory][--directory directory]
[-l list_file][--list list_file]
[-p][--paired]
[-s][--single]
[-c][--config]
[-P][--prefix]
[-z][--fuzzy]
[-a]
[-C][--coverage]
[-k]
[-o output_filename][--output output_filename]
[-x][--overwrite]
[-t]
[-r]
[-v]
[-h][--help]
==============================================================================================

There are two steps to predicting ST using stringMLST.
1. Create DB : stringMLST.py --buildDB
2. Predict : stringMLST.py --predict

1. stringMLST.py --buildDB

Synopsis:
stringMLST.py --buildDB -c <config file> -k <kmer length(optional)> -P <DB prefix(optional)>
  config file : a tab-delimited file describing the typing scheme, i.e. each locus with its multi-FASTA file, plus the profile definition file.
    Format :
      [loci]
      locus1    locusFile1
      locus2    locusFile2
      [profile]
      profile   profileFile
  kmer length : the k-mer length for the DB. Note that it must be smaller than the length of the reads to be processed.
    We suggest k-mer lengths between 35 and 66, depending on the read length.
  DB prefix(optional) : sets the name and location of the DB files to be created. This module creates 3 files with this prefix.
    The prefix may include a folder path to store your DB at a particular location.
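Because the k-mer length must be smaller than the read length, it can help to check the reads before building a database. The sketch below scans the start of a FASTQ stream and reports the shortest read, assuming standard four-line FASTQ records (no wrapped sequence lines). Both function names are hypothetical and not part of stringMLST.

```python
import io


def shortest_read_length(handle, max_reads=1000):
    """Return the shortest sequence length among the first max_reads FASTQ records."""
    shortest = None
    for index, line in enumerate(handle):
        if index // 4 >= max_reads:
            break
        if index % 4 == 1:  # second line of each 4-line record holds the sequence
            length = len(line.strip())
            shortest = length if shortest is None else min(shortest, length)
    return shortest


def kmer_fits(k, read_length):
    """True when k is smaller than the read length, as the note above requires."""
    return read_length is not None and k < read_length


# Two synthetic reads of 50 bp and 40 bp.
demo_fastq = io.StringIO(
    "@r1\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
    "@r2\n" + "C" * 40 + "\n+\n" + "I" * 40 + "\n"
)
shortest = shortest_read_length(demo_fastq)
print(shortest, kmer_fits(35, shortest), kmer_fits(66, shortest))  # prints 40 True False
```

For real data, pass an open text handle, for example `open(path)` for plain FASTQ or `gzip.open(path, "rt")` for gzipped files.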

Required arguments
--buildDB
  Identifier for build db module
-c,--config = <configuration file>
  Config file in the format described above.
  All files follow the structure used by pubMLST. Refer to the extended documentation for details.

Optional arguments
-k = <kmer length>
  K-mer size with which the DB is built (Default k = 35). Note that the tool works best with k-mer lengths between 35 and 66
  for read lengths of 55 to 150 bp. The k-mer size can be increased for longer reads. Lower k-mer sizes are advised
  if the read quality is poor.
-P,--prefix = <prefix>
  Prefix for the DB and log files to be created (Default = kmer). You can also specify the folder where the DB should be created.
-a
  File location to write build log
-h,--help
  Prints the help manual for this application

 --------------------------------------------------------------------------------------------

2. stringMLST.py --predict

stringMLST.py --predict : can run in three modes
  1) single sample (default mode)
  2) batch mode : run stringMLST for all the samples in a folder (for a particular species)
  3) list mode : run stringMLST on samples specified in a file
stringMLST can process both single-end and paired-end files. By default the program expects paired-end files.

Synopsis
stringMLST.py --predict -1 <fastq file> -2 <fastq file> -d <directory location> -l <list file> -p -s -P <DB prefix(optional)> -k <kmer l
