SRST2
Short Read Sequence Typing for Bacterial Pathogens
This program is designed to take Illumina sequence data, an MLST database and/or a database of gene sequences (e.g. resistance genes, virulence genes, etc.) and report the presence of STs and/or reference genes.
Authors - Michael Inouye, Harriet Dashnow, Bernie Pope, Ryan Wick, Kathryn Holt (University of Melbourne)
How to cite - The peer-reviewed open-access paper is available in Genome Medicine: http://genomemedicine.com/content/6/11/90
Story-behind-the-paper is here
Problems? Please post an issue on GitHub: https://github.com/katholt/srst2/issues.
To be notified of updates, join the SRST2 Google group at https://groups.google.com/forum/#!forum/srst2.
Contents
Basic usage - Resistance genes
Input read formats and options
Compile results from completed runs
Running lots of jobs and compiling results
Generating SRST2-compatible clustered database from raw sequences
Preformatted databases for specialist typing with SRST2
Example - Shigella sonnei public data
Current release - v0.2.0 - July 28, 2016
Dependencies:
- python (v2.7.5 or later)
- scipy, numpy http://www.scipy.org/install.html
- bowtie2 (v2.1.0 or later) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
- SAMtools v0.1.18 https://sourceforge.net/projects/samtools/files/samtools/0.1.18/ (NOTE: later versions can be used, but better results are obtained with v0.1.18, especially at low read depths (<20x))
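The minimum versions listed above can be checked programmatically. The following is a minimal sketch, not part of SRST2: the helper names `parse_version` and `meets_minimum` are hypothetical, and the sample version strings are illustrative stand-ins for whatever text your installed tools actually print.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, minimum):
    """True if the version in `found` is at least the version in `minimum`."""
    return parse_version(found) >= parse_version(minimum)


# Illustrative strings only; substitute the text reported by your own tools.
print(meets_minimum("bowtie2-align-s version 2.2.9", "2.1.0"))
print(meets_minimum("Version: 0.1.18 (r982:295)", "0.1.18"))
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where "0.1.9" would sort after "0.1.18".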
Updates in current master branch (not yet in a release)
- Added new AMR gene database CARD_v3.0.8_SRST2.fasta, a curated version of CARD v3.0.8 with some fixes and additions based on comparisons with the ARG-Annot, ResFinder and NCBI AMR databases. This was done by Margaret Lam for the Kleborate paper (see details of modifications vs CARD v3.0.8 here).
- Added instructions to the README for using SRST2 to serotype Group B Streptococcus (S. agalactiae) using the GBS-SBG database, available here. Thanks to Swaine Chen and colleagues for this.
Updates in v0.2.0
- Some improvements to allele calling, particularly for Klebsiella MLST locus mdh, kindly contributed by andreyto. Includes rejection of read alignments that are clipped on both ends (likely to be spurious) and minor bug fixes associated with depth calculations.
- Updated E. coli serotype database to remove duplicate sequences.
- Added mcr-2 colistin resistance gene to `ARGannot.r1.fasta` resistance gene database.
- A `--threads` option was added, which makes SRST2 call Bowtie and Samtools with their threading options. The resulting speed up is mostly due to the Bowtie mapping step which parallelises very well.
- The `VFDB_cdhit_to_csv.py` script was updated to work with the new VFDB FASTA format.
- Versions of Bowtie2 up to 2.2.9 are now supported. Samtools v1.3 can now be used as well, however v0.1.18 is still the recommended version (for reasons discussed below).
- Added `scripts/qsub_srst2.py` to generate SRST2 jobs for the Grid Engine (qsub) scheduling system (http://gridscheduler.sourceforge.net/). Thanks to Ramon Fallon from the University of St Andrews for putting this together. Some of the specifics are set up for his cluster, so modifications may be necessary to make it run properly on a different cluster using Grid Engine.
- Various other small bug fixes!
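As a rough illustration of the `--threads` option, the sketch below assembles an SRST2 command line from Python. It is an assumption-laden example rather than the project's own wrapper: the function `build_srst2_command` and all file names are hypothetical, the executable is assumed to be available as `srst2`, and the `--input_pe` and `--gene_db` options are assumed to match those described in the usage sections, so check them against the tool's own usage text before relying on this.

```python
def build_srst2_command(reads_1, reads_2, output_prefix, gene_db, threads=1):
    """Assemble an argument list for a paired-end gene-detection run.

    All arguments are placeholders supplied by the caller; nothing is executed here.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "srst2",                           # assumed executable name
        "--input_pe", reads_1, reads_2,    # assumed paired-end input option
        "--output", output_prefix,
        "--gene_db", gene_db,              # assumed gene database option
        "--threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
command = build_srst2_command(
    "strainA_1.fastq.gz", "strainA_2.fastq.gz",
    "strainA_test", "data/ARGannot.r1.fasta", threads=4,
)
print(" ".join(command))
```

The list can be passed to `subprocess.call` once the options have been confirmed for your installed version. Building it as a list rather than a single string avoids shell-quoting problems with unusual file names.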
Updates in v0.1.8
- /data directory includes files for subtyping of the LEE pathogenicity island of E. coli, as per Ingle et al, 2016, Nature Microbiology. Instructions below.
- Resistance gene database updates:
  - Fixed `ARGannot.r1.fasta` to include proper mcr1 DNA sequence.
  - Added columns to the `ARGannot_clustered80.csv` table, to indicate classes of beta-lactamases included in the `ARGannot.r1.fasta` database according to the NCBI beta-lactamase resource (new location for the Lahey list).
- Fixed some issues with handling of missing data (i.e. where there were no hits to MLST and/or no hits to genes) when compiling results into a table via `--prev_output`. This could result in misalignment of gene columns in previous versions.
Updates in v0.1.7
- Use the following environment variables to specify your preferred samtools and bowtie2 executables (thanks to Ben Taylor for this):
  - SRST2_SAMTOOLS
  - SRST2_BOWTIE2
  - SRST2_BOWTIE2_BUILD
- Added mcr1, the plasmid-borne colistin resistance gene, to the included ARG-Annot-based resistance gene DB (`ARGannot.r1.fasta`).
- Fixed a problem with writing consensus files that occurred when a directory structure was specified using `--output` (bug introduced in v0.1.6).
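The environment-variable lookup described in this release can be pictured with the following minimal sketch. It is not SRST2's actual code: the function `resolve_tool`, the `DEFAULTS` table, and the fallback to bare executable names on the `PATH` are assumptions made for illustration, and the example path is hypothetical.

```python
import os

# Assumed fallbacks: the bare executable names, resolved via PATH.
DEFAULTS = {
    "SRST2_SAMTOOLS": "samtools",
    "SRST2_BOWTIE2": "bowtie2",
    "SRST2_BOWTIE2_BUILD": "bowtie2-build",
}


def resolve_tool(env_var, environ=None):
    """Return the executable named by env_var, or the assumed default if unset or empty."""
    environ = os.environ if environ is None else environ
    return environ.get(env_var) or DEFAULTS[env_var]


# Hypothetical override pointing at a specific samtools build.
example_env = {"SRST2_SAMTOOLS": "/opt/samtools-0.1.18/samtools"}
print(resolve_tool("SRST2_SAMTOOLS", example_env))
print(resolve_tool("SRST2_BOWTIE2", example_env))
```

Setting these variables in the shell before launching SRST2 is one way to keep the recommended samtools v0.1.18 alongside a newer system-wide installation.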
Updates in v0.1.6
- The original validation of SRST2 (see paper) was performed with bowtie2 version 2.1.0 and samtools v0.1.18.
- bowtie2: SRST2 has now been tested on the tutorial example and other test data sets using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4.
- samtools: SRST2 has now been tested on the Staph & Salmonella test data sets used in the paper, and will work with newer samtools versions (tested up to v1.1). Note however that SRST2 still works best with samtools v0.1.18, due to small changes in the mapping algorithms in later versions that result in some loss of reads at the ends of alleles. This has the most impact at low read depths, so we recommend using v0.1.18 for optimum results.
- Minor fixes to the ARG-Annot database of resistance genes, including removal of duplicate sequences and fixes to gene names (thanks to Wan Yu for this). Old version remains unchanged for backwards compatibility, but we recommend using the revised version (located in `data/ARGannot.r1.fasta`).
- Added EcOH database for serotyping E. coli (thanks to Danielle Ingle for this). See Using the EcOH database for serotyping E. coli with SRST2 and this MGen paper.
- Fixed a problem where, when analysing multiple read sets in one SRST2 call against a gene database in which cluster ids don't match gene symbols, individual gene clusters appear multiple times in the output. The compile function was unaffected and remains unchanged.
- Fixed behaviour so that including directory paths in `--output` parameter works (thanks to nyunyun for contributing most of this fix). E.g. `--output test_dir/test` will create output files prefixed with `test`, located in `test_dir/`, and all SRST2 functions should work correctly including consensus allele calling. If `test_dir/` doesn't exist, we attempt to create it; if this is not possible the user is alerted and SRST2 stops.
- Fixed problem when using a gene database with a simple fasta header (i.e. not clustered for SRST2; note best results are achieved by pre-clustering your sequence database beforehand) (thanks to cglambert for this one).
- Fixes contributed
