Sequence to Medical Phenotypes (STMP)

A pipeline featuring variant annotation, prioritization, pharmacogenomics, and tools for analyzing genomic trios (mother, father, child).

Release versions can be downloaded from https://github.com/AshleyLab/stmp/releases, or you can clone this repository to download the latest version of the code.

The toolkit currently uses an SQLite database for added portability.


Dependencies

To date, STMP has been tested on Python 2.7.6. Other versions of Python may also work, but are not officially supported.

External Dependencies (to be placed in the `third_party` folder -- see instructions below)

ANNOVAR version 2015-03-22 15:29:59 (Sun, 22 Mar 2015)
snpEff version 4.1e (build 2015-05-02)

Other versions of the above tools may also work but are not currently supported.

Other dependencies (these must be in the user or system PATH before running STMP)

bcftools version 1.2
bedtools version 2.25.0

Python dependencies

Pyyaml version 3.11
xlwt version 1.0.0 (for exporting results to an Excel file)
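
As a rough pre-flight check for the PATH requirement above, the following sketch looks for the two command-line tools on the PATH. The function names `find_on_path` and `missing_tools` are illustrative and not part of STMP, and the check does not verify tool versions.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable named `program`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools=("bcftools", "bedtools"), path=None):
    """List the tools that could not be found on the PATH."""
    return [tool for tool in tools if find_on_path(tool, path) is None]
```

Calling `missing_tools()` before running STMP should return an empty list when both tools are reachable.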

Installation Instructions

Downloading software and dependencies

1. Download the latest STMP release from https://github.com/AshleyLab/stmp/releases, or clone this repository.
2. Download ANNOVAR and snpEff, and copy or symlink them into folders named annovar and snpeff inside the third_party folder.
   - E.g. ANNOVAR would be linked/copied to third_party/annovar (this folder should contain all files from the ANNOVAR download, including annotate_variation.pl and table_annovar.pl).
   - E.g. snpEff would be linked/copied as third_party/snpeff/snpEff (this folder should include files such as snpEff.jar).
3. Ensure Pyyaml (and xlwt, for Excel export) is installed (via pip install, etc.).
4. Ensure the appropriate versions of bedtools and bcftools (above) are installed and in the user/system PATH. These can be either downloaded directly from the corresponding websites or installed via a program such as bcbio.
5. Run the appropriate ANNOVAR command to download the datasets specified in Appendix 1 (e.g. run `annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/` from within third_party/annovar to download the refGene dataset).
   - NOTE: hg19_wgEncodeGencodeBasicV19Mrna.fa is no longer provided by ANNOVAR/UCSC and must instead be downloaded manually from https://www.dropbox.com/s/icw1loscvpm6v84/hg19_wgEncodeGencodeBasicV19Mrna.fa?dl=0 and copied to the annovar/humandb folder. Without this file, Annovar_ExonicFunc_wgEncodeGencodeBasicV19 and Annovar_AAChange_wgEncodeGencodeBasicV19 will not show up correctly in the annotated output.
6. If you would like to run trio tools:
   - Copy code/annovar/summarize_annovarRDv2.pl to third_party/annovar.
   - Run the appropriate ANNOVAR command to download the datasets specified in Appendix 2 (e.g. run `annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/` from within third_party/annovar to download the refGene dataset).
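
The folder layout described in the steps above can be sanity-checked with a short script. This is only a sketch: the file list is taken from the paths mentioned in these instructions, the helper name `missing_third_party_files` is hypothetical, and a complete ANNOVAR or snpEff install contains many more files than are checked here.

```python
import os

# Paths named in the installation steps, relative to the STMP root folder.
EXPECTED_FILES = [
    os.path.join("third_party", "annovar", "annotate_variation.pl"),
    os.path.join("third_party", "annovar", "table_annovar.pl"),
    os.path.join("third_party", "annovar", "humandb",
                 "hg19_wgEncodeGencodeBasicV19Mrna.fa"),
    os.path.join("third_party", "snpeff", "snpEff", "snpEff.jar"),
]


def missing_third_party_files(stmp_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under stmp_root."""
    return [
        relative
        for relative in expected
        if not os.path.exists(os.path.join(stmp_root, relative))
    ]
```

An empty result from `missing_third_party_files(".")`, run from the STMP folder, suggests the copies or symlinks are in the expected places.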

Setting up STMP

Run python stmp.py --update_db. This will create a SQLite database file in the db folder and download and import the core datasets required for annotation and tiering.
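
To confirm that the import produced something, the resulting database can be inspected with Python's built-in sqlite3 module. The sketch below lists table names only; the database file name inside the db folder is not stated in this README, so the path passed in is whatever file you find there, and `list_tables` is an illustrative helper rather than an STMP function.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the sorted names of all tables in an existing SQLite file."""
    if not os.path.isfile(db_path):
        # sqlite3.connect would silently create an empty file, so fail early.
        raise IOError("no such database file: " + db_path)
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [row[0] for row in rows]
```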

Testing STMP

For a basic test of whether STMP has been installed and configured correctly, run python stmp.py --test. This will run a small VCF file (a subset of Genome in a Bottle sequence variants) through STMP annotation, tiering, and pgx. Output will be stored in (stmp_dir)/data/test/output_unverified by default, or can be stored in a different directory with the --output_dir parameter. Compare the output against our verified output (stmp_dir)/data/test/output_verified to see if it is the same, e.g. diff -rq (stmp_dir)/data/test/output_unverified/ (stmp_dir)/data/test/output_verified/.
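
Where `diff` is not available, the same comparison can be approximated in Python. This sketch compares the two folders file by file, byte for byte; the function names are illustrative, and the two directory arguments stand in for the output_unverified and output_verified folders mentioned above.

```python
import filecmp
import os


def relative_files(root):
    """Collect the paths of all files under root, relative to root."""
    found = set()
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(directory, name), root))
    return found


def compare_output_dirs(unverified, verified):
    """Report files that exist on one side only or whose contents differ."""
    left = relative_files(unverified)
    right = relative_files(verified)
    different = sorted(
        relative
        for relative in left & right
        if not filecmp.cmp(
            os.path.join(unverified, relative),
            os.path.join(verified, relative),
            shallow=False,  # compare contents, not just size and mtime
        )
    )
    return {
        "only_in_unverified": sorted(left - right),
        "only_in_verified": sorted(right - left),
        "different": different,
    }
```

If all three lists in the returned dictionary are empty, the test output matches the verified output.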

Running STMP

To run STMP on an input VCF:
python stmp.py --vcf=(path to input VCF) --output_dir=(output directory)

Example (cd to the unzipped STMP release folder you downloaded):
python code/stmp.py --vcf=data/test/input_data/genome_in_a_bottle/subset.rs.vcf --output_dir=outputs/genome_in_a_bottle_output

This will run three different modules: annotation, tiering, and pharmacogenomics (pgx).

1) Annotation

This module annotates the input VCF with information from each of the above datasets. It outputs a TSV (tab-separated values) file with each annotation as a separate column (after the standard VCF columns). Annotation includes point annotation, functional annotation (using ANNOVAR and snpEff), and region (range) annotation using bedtools. Intermediate outputs of specific annotations (e.g. point annotations) are available in the scratch folder within the output directory. The final output (each of these three annotation types joined into a single file) is written as a .tsv file in the specified output directory.
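
For downstream scripting, the annotated TSV can be read row by row. The sketch below assumes that the first non-empty line is a header row naming the columns and that fields are not quoted; neither detail is confirmed by this README, so check an actual output file first. `read_annotated_tsv` is a hypothetical helper name.

```python
import io


def read_annotated_tsv(path):
    """Yield one dict per data row, keyed by the header column names."""
    header = None
    with io.open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            fields = line.split("\t")
            if header is None:
                # Assumption: a leading '#' on the header (VCF style) is noise.
                fields[0] = fields[0].lstrip("#")
                header = fields
                continue
            yield dict(zip(header, fields))
```

Each yielded dictionary maps a column name to that row's string value, so annotation columns can be looked up by name rather than by position.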

2) Tiering

This module takes the annotated TSV from the previous step and prioritizes the variants into different tiers (below). It outputs the list of variants in each tier as an Excel worksheet (tiered_output.xls). It also outputs the variants and tiering metrics as text files (tiering_allvars.metrics, tiering_allvars.tier0.txt, tiering_allvars.tier1.txt, etc.).

Tier 0: Variants classified as pathogenic or likely pathogenic according to ClinVar (with ClinVar star rating > 0 according to the new mid-2015 guidelines).
Tier 1: Loss of function variants (splice dinucleotide disrupting, nonsense, nonstop, and frameshift indels).
Tier 2: All rare variants cataloged in HGMD, regardless of functional annotation. Rarity is defined as a minor allele frequency (MAF) no greater than 1% by default, or according to user-defined criteria, in any of the following population genetic surveys: the ethnically matched population in HapMap 2 and 3; the 1000 Genomes phase 1 data from an ethnically matched super population and global allele frequency; the 1000 Genomes pilot 1 project global allele frequency; the 69 publicly available genomes released by Complete Genomics; and the NHLBI Grand Opportunity exome sequencing project global allele frequency.
Tier 3: All non-rare missense and non-frameshift indels.
Tier 4: All other rare exonic/splicing variants with ExAC tolerance z-score (syn_z or mis_z or lof_z) > 2.
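
The two numeric criteria in the tier definitions can be sketched as small predicates. This is not STMP's tiering code. It assumes one reading of the rarity rule (a variant is rare when no available survey reports a frequency above the cutoff, and a variant absent from every survey counts as rare); the other reading would replace `all` with `any`. The function and parameter names are illustrative.

```python
def is_rare(allele_frequencies, max_maf=0.01):
    """True if no available population frequency exceeds max_maf.

    None entries stand for surveys with no data and are ignored, so a
    variant missing from every survey is treated as rare (assumption).
    """
    observed = [freq for freq in allele_frequencies if freq is not None]
    return all(freq <= max_maf for freq in observed)


def passes_exac_tolerance(syn_z=None, mis_z=None, lof_z=None, min_z=2.0):
    """True if any available ExAC z-score is strictly greater than min_z."""
    scores = [z for z in (syn_z, mis_z, lof_z) if z is not None]
    return any(z > min_z for z in scores)
```

For example, `is_rare([0.005, None, 0.01])` holds under the default 1% cutoff, while `is_rare([0.02, 0.001])` does not.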

3) Pharmacogenomics (pgx)

This module takes in a VCF file and outputs several text files summarizing variants with known pharmacogenomic effects. These include effects on drug dosage, efficacy, toxicity, and other interactions, as well as whether any variants in the input file match known "star" alleles associated with clinical drug response for 6 genes (CYP2C19, CYP2C9, CYP2D6, SLCO1B1, TPMT, VKORC1). Each of these files is output in the specified output directory.

For additional options, run python stmp.py -h. For example, one can use the --annotate_only flag to run only the annotation module, the --tiering_only flag to run just the tiering module, or the --pgx_only flag to run just the pgx module. Note that tiering depends on the annotated output file, so annotation must be run before tiering.
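
When scripting several runs, the command lines above can be assembled programmatically. The sketch below only builds argument lists from the flags named in this README. Whether `--tiering_only` still expects `--vcf` is an assumption, so treat `python stmp.py -h` as authoritative. `build_stmp_command` and `annotate_then_tier` are hypothetical helper names.

```python
MODE_FLAGS = {
    "annotate": "--annotate_only",
    "tiering": "--tiering_only",
    "pgx": "--pgx_only",
}


def build_stmp_command(vcf, output_dir, mode=None,
                       script="stmp.py", python="python"):
    """Return the argument list for one STMP run (all modules if mode is None)."""
    command = [python, script, "--vcf=" + vcf, "--output_dir=" + output_dir]
    if mode is not None:
        command.append(MODE_FLAGS[mode])
    return command


def annotate_then_tier(vcf, output_dir, script="stmp.py"):
    """Return the two commands in the order tiering requires."""
    return [
        build_stmp_command(vcf, output_dir, "annotate", script),
        build_stmp_command(vcf, output_dir, "tiering", script),
    ]
```

Each returned list can be passed to `subprocess.check_call`, which stops at the first failing step so that tiering never runs on a missing annotation file.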

4) Trio (separate script)

This module analyzes genome sequence data from a father, mother, and child. It takes as input a single VCF with different sample IDs for mother, father, and child.

Usage: python trio/trioPipeline.py input output path_to_annovar path_to_matrix offspringID fatherID motherID

Example: (Note: as the combined file is large, you must download the HG002, HG003, and HG004 VCFs from this site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_CallsIn2Technologies_05182015/) and manually combine them into a single VCF file using bcftools or similar. The sample command below assumes you have placed the combined file in the data/test/input_data/trio directory and called it trio.combined.vcf.)

python code/trio/trioPipeline.py data/test/input_data/trio/trio.comb
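
The manual combination step mentioned in the note above could be scripted roughly as follows. The sketch builds a `bcftools merge` command that writes an uncompressed VCF; `bcftools merge` expects bgzip-compressed, indexed inputs, so the downloaded files may need `bgzip` and `bcftools index` first. The input file names are placeholders, and `build_merge_command` is an illustrative helper.

```python
def build_merge_command(sample_vcfs, combined_vcf):
    """Return a bcftools merge command combining per-sample VCFs into one."""
    sample_vcfs = list(sample_vcfs)
    if len(sample_vcfs) < 2:
        raise ValueError("need at least two VCFs to merge")
    # -O v selects uncompressed VCF output; -o names the output file.
    return ["bcftools", "merge", "-O", "v", "-o", combined_vcf] + sample_vcfs


# Placeholder names for the three downloaded, bgzipped and indexed VCFs.
command = build_merge_command(
    ["HG002.vcf.gz", "HG003.vcf.gz", "HG004.vcf.gz"],
    "data/test/input_data/trio/trio.combined.vcf",
)
```

The resulting list can be run with `subprocess.check_call(command)` once the inputs are in place.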
