SRST2

Short Read Sequence Typing for Bacterial Pathogens

This program is designed to take Illumina sequence data, an MLST database and/or a database of gene sequences (e.g. resistance genes, virulence genes, etc.) and report the presence of STs and/or reference genes.

Authors - Michael Inouye, Harriet Dashnow, Bernie Pope, Ryan Wick, Kathryn Holt (University of Melbourne)

How to cite - The peer-reviewed open-access paper is available in Genome Medicine: http://genomemedicine.com/content/6/11/90

Story-behind-the-paper is here

Problems? Please post an issue on GitHub: https://github.com/katholt/srst2/issues.

To be notified of updates, join the SRST2 Google group at https://groups.google.com/forum/#!forum/srst2.

Current release

Installation

Basic usage - MLST

Basic usage - Resistance genes

All usage options

Input read formats and options

MLST Database format

Gene databases

Output files

Printing consensus sequences

More basic usage examples

Compile results from completed runs

Running lots of jobs and compiling results

Known issues

Generating SRST2-compatible clustered database from raw sequences

Using the VFDB Virulence Factor Database with SRST2

Preformatted databases for specialist typing with SRST2

Plotting output in R

Example - Shigella sonnei public data

Current release - v0.2.0 - July 28, 2016

Dependencies:

python (v2.7.5 or later)
scipy, numpy http://www.scipy.org/install.html
bowtie2 (v2.1.0 or later) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SAMtools v0.1.18 https://sourceforge.net/projects/samtools/files/samtools/0.1.18/ (NOTE: later versions can be used, but better results are obtained with v0.1.18, especially at low read depths (<20x))

Updates in current master branch (not yet in a release)

Added new AMR gene database CARD_v3.0.8_SRST2.fasta, a curated version of CARD v3.0.8 with some fixes and additions based on comparisons with the ARG-Annot, ResFinder and NCBI AMR databases. This was done by Margaret Lam for the Kleborate paper; see details of modifications vs CARD v3.0.8 here.
Added to readme some instructions for using SRST2 to serotype Group B Streptococcus (S. agalactiae) using the GBS-SBG database, available here - thanks to Swaine Chen and colleagues for this.

Updates in v0.2.0

Some improvements to allele calling, particularly for Klebsiella MLST locus mdh, kindly contributed by andreyto. Includes rejection of read alignments that are clipped on both ends (likely to be spurious) and minor bug fixes associated with depth calculations.
Updated E. coli serotype database to remove duplicate sequences.
Added mcr-2 colistin resistance gene to ARGannot.r1.fasta resistance gene database.
A --threads option was added, which makes SRST2 call Bowtie2 and SAMtools with their threading options. The resulting speed-up is mostly due to the Bowtie2 mapping step, which parallelises very well.
The VFDB_cdhit_to_csv.py script was updated to work with the new VFDB FASTA format.
Versions of Bowtie2 up to 2.2.9 are now supported. Samtools v1.3 can now be used as well, however v0.1.18 is still the recommended version (for reasons discussed below).
Added scripts/qsub_srst2.py to generate SRST2 jobs for the Grid Engine (qsub) scheduling system (http://gridscheduler.sourceforge.net/). Thanks to Ramon Fallon from the University of St Andrews for putting this together. Some of the specifics are set up for his cluster, so modifications may be necessary to make it run properly on a different cluster using Grid Engine.
Various other small bug fixes!
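
To illustrate the idea of rejecting read alignments that are clipped on both ends, here is a minimal standalone sketch in Python. It is not the SRST2 implementation: the function name is hypothetical, and treating both soft clips (S) and hard clips (H) as clipping is an assumption made for the example. It only inspects the CIGAR string of a SAM record.

```python
import re

# Matches one CIGAR operation, e.g. "5S" or "90M".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_clipped_both_ends(cigar):
    """Return True if the CIGAR string starts and ends with a clip.

    Hypothetical helper for illustration only. Both soft clips (S)
    and hard clips (H) are counted as clipping here (an assumption).
    """
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    if len(ops) < 2:
        # Unaligned reads ("*") or single-operation CIGARs cannot be
        # clipped on both ends.
        return False
    return ops[0] in "SH" and ops[-1] in "SH"
```

A filter built on this would drop any alignment for which the function returns True before depth is calculated, and keep alignments that are clipped on one end only, which is expected for reads overhanging the end of an allele.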

Updates in v0.1.8

/data directory includes files for subtyping of the LEE pathogenicity island of E. coli, as per Ingle et al, 2016, Nature Microbiology. Instructions are given below.
Resistance gene database updates:

Fixed ARGannot.r1.fasta to include proper mcr1 DNA sequence.
Added columns to the ARGannot_clustered80.csv table, to indicate classes of beta-lactamases included in the ARGannot.r1.fasta database according to the NCBI beta-lactamase resource (new location for the Lahey list).

Fixed some issues with handling of missing data (i.e. where there were no hits to MLST and/or no hits to genes) when compiling results into a table via --prev_output. This could result in misalignment of gene columns in previous versions.

Updates in v0.1.7

Use the following environment variables to specify your preferred samtools and bowtie2 executables (thanks to Ben Taylor for this):

SRST2_SAMTOOLS
SRST2_BOWTIE2
SRST2_BOWTIE2_BUILD

Added mcr1, the plasmid-borne colistin resistance gene, to the included ARG-Annot-based resistance gene DB (ARGannot.r1.fasta).
Fixed a problem with writing consensus files that occurred when a directory structure was specified using --output (bug introduced in v0.1.6)
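
As a sketch of how the SRST2_SAMTOOLS, SRST2_BOWTIE2 and SRST2_BOWTIE2_BUILD variables listed above might be set from a Python wrapper script, the example below builds an environment dictionary to hand to a child process. The function name and the install paths are hypothetical; only the three variable names come from this README.

```python
import os


def srst2_environment(samtools, bowtie2, bowtie2_build):
    """Return a copy of the current environment with the three
    SRST2 executable variables set. Hypothetical helper."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["SRST2_SAMTOOLS"] = samtools
    env["SRST2_BOWTIE2"] = bowtie2
    env["SRST2_BOWTIE2_BUILD"] = bowtie2_build
    return env


# Hypothetical install locations; substitute your own paths.
env = srst2_environment(
    "/opt/samtools-0.1.18/samtools",
    "/opt/bowtie2-2.1.0/bowtie2",
    "/opt/bowtie2-2.1.0/bowtie2-build",
)
```

The resulting dictionary can be passed as the env argument of subprocess.call or subprocess.Popen when launching your usual SRST2 command line. Setting the same variables with export in the shell before running SRST2 has the same effect.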

Updates in v0.1.6

The original validation of SRST2 (see paper) was performed with bowtie2 version 2.1.0 and samtools v0.1.18.

bowtie2: SRST2 has now been tested on the tutorial example and other test data sets using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4.
samtools: SRST2 has now been tested on the Staph & Salmonella test data sets used in the paper, and will work with newer samtools versions (tested up to v1.1). Note, however, that SRST2 still works best with samtools v0.1.18, due to small changes in the mapping algorithms in later versions that result in some loss of reads at the ends of alleles. This has most impact at low read depths, so we recommend using v0.1.18 for optimum results.

Minor fixes to the ARG-Annot database of resistance genes, including removal of duplicate sequences and fixes to gene names (thanks to Wan Yu for this). Old version remains unchanged for backwards compatibility, but we recommend using the revised version (located in data/ARGannot.r1.fasta).
Added EcOH database for serotyping E. coli (thanks to Danielle Ingle for this). See Using the EcOH database for serotyping E. coli with SRST2 and this MGen paper.
Fixed a problem where, when analysing multiple read sets in one SRST2 call against a gene database in which cluster ids don't match gene symbols, individual gene clusters appear multiple times in the output. The compile function was unaffected and remains unchanged.
Fixed behaviour so that including directory paths in --output parameter works (thanks to nyunyun for contributing most of this fix). E.g. --output test_dir/test will create output files prefixed with test, located in test_dir/, and all SRST2 functions should work correctly including consensus allele calling. If test_dir/ doesn't exist, we attempt to create it; if this is not possible the user is alerted and SRST2 stops.
Fixed a problem when using a gene database with a simple fasta header (i.e. not clustered for SRST2; note that best results are achieved by clustering your sequence database beforehand) (thanks to cglambert for this one).
Fixes contributed
