NB - this code / pipeline structure remains mostly untouched since it was first written by me in 2013 / 14 while working in the National Health Service (NHS) England - it would be designed differently were it written today, and likely using a proper workflow manager. The main areas where it could be developed are:

<ol><li>improved command line parameter parsing</li> <li>improved error handling, for example, checking output of each code section and making a decision to proceed or not</li></ol>

Although it runs AOK in its current state, anyone re-using this code should make changes as you see fit.

I update this pipeline as I re-use it myself in order to keep it maintained on a low level in line with new program versions that are released.

Last update: Sunday, 31st March, 2019 @ 02:20 BST (GMT+1)

ClinicalGradeDNAseq - random read sampling to improve sensitivity

Automated next generation DNA sequencing analysis pipeline 'suited' for clinical tests, with >99.9% sensitivity to Sanger sequencing for Single Nucleotide Variants (SNVs) at read-depth>18 over target regions over interest.

This clinical-grade analysis pipeline, ClinicalGradeDNAseq, is a watered-down and modified version of the work by Blighe, Beauchamp, and colleagues at Sheffield Diagnostic Genetics Service, Sheffield Children's NHS Foundation Trust, Sheffield, UK, and their efforts to introduce a clinical-grade next generation sequencing (NGS) analysis pipeline fully validated against Sanger di-deoxy sequencing.

The pipeline is built using open source programs mixed with customised scripts. A wrapper script manages command line parameters and then executes the master analysis script. Control is then returned to the wrapper, where results files are transferred to a remote server via SSH/sFTP. A master and concise log is kept, with date- and time-stamps. Results directory structure is formed based on the run number and patient ID.

[Key point] The unique feature of the analysis pipeline that increases sensitivity to Sanger sequencing is in the variant calling step, where a final aligned BAM is split into 3 'sub-BAMs' of 75%, 50%, and 25% random reads. Variants are then called on all 4 BAMs, after which a consensus VCF is produced.

This pipeline proceeds in an 8-step process:

<ol type="1"> <li>Adaptor and read quality trimming - TrimGalore! (Krueger F), FastQC (Andrews S), cutadapt (Martin M, 2011)</li> <li>Alignment - bwa mem (Li & Durbin, 2009)</li> <li>Marking and removing PCR duplicates - Picard (Broad Institute of MIT and Harvard), SAMtools (Li et al., 2009)</li> <li>Remove low mapping quality reads - SAMtools (Li et al., 2009)</li> <li>QC - SAMtools (Li et al., 2009), BEDTools (Quinlan & Hall, 2010), custom scripts</li> <li>Downsampling / random read sampling - Picard (Broad Institute of MIT and Harvard)</li> <li>Variant calling - SAMtools/BCFtools (Li et al., 2009)</li> <li>Annotation - Variant Effect Predictor (McLaren et al., 2016)</li> </ol> <h1>Requirements</h1> <ul> <li>Paired-end FASTQ files</li> <li>Chromosome-ordered and position-sorted BED file in hg19 / hg38 (BED files can be sorted with sort -k1,1V -k2,2n)</li> <li>Global installations of cutadapt and unix2dos (included in dos2unix)</li> <li>Valid credentials for returning results files to remote server (username / password) via SSH/sFTP</li> </ul> <h1>Execution</h1> <ol type="1"> Run the ‘PipelineWrapper’ wrapper script, which will check command-line parameters, execute the master script, and then return results files via SSH/sFTP to remote server. Use the following parameters: <ol type="i""> <li>FASTQ mate-pair 1 (absolute file path)</li> <li>FASTQ mate-pair 2 (absolute file path)</li> <li>Reference genome FASTA (absolute file path)</li> <li>Run number (e.g. Plate6, Plate7, etc.) (alphanumeric)</li> <li>Patient ID (alphanumeric)</li> <li>BED file (absolute file path)</li> <li>Minimum quality for bases at read ends, below which bases will be cut (integer)</li> <li>Minimum allowed read length (integer)</li> <li>Adaptor for trimming off read ends ('illumina' / 'nextera' / 'small_rna')</li> <li>Minimum read depth for calling a variant (integer)</li> <li>Minimum allowed mapping quality (integer)</li> <li>Stringency for calling variants ('relaxed' / 'normal') (relaxed uses <code>--pval-threshold 1.0</code> with BCFtools call)</li> <li>Directory where results will be output (absolute file path)</li> <li>User initials (alphanumeric)</li> </ol> <h1>Output</h1> Results files are output locally to [results root]/[run number]/[sample ID]/, with a copy being also sent via SSH/sFTP to a remote server. <ul> <li>*_AnalysisLog.txt - analysis log (short)</li> <li>Master.log - analysis log (comprehensive)</li> <li>*_R1_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 1</li> <li>*_R1_001_val_1_fastqc.html - FastQC report for mate-pair 1 (after trimming)</li> <li>*_R2_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 2</li> <li>*_R2_001_val_2_fastqc.html - FastQC report for mate-pair 2 (after trimming)</li> <li>*_Alignment.txt - alignment metrics</li> <li>*_ReadsOffTarget.txt - number of reads falling outside regions specified in BED file</li> <li>*_PCRDuplicates.txt - details on identified PCR duplicates</li> <li>*_CoverageTotal.bedgraph - coverage for all mapped locations (contiguous bases at same read depth are merged into regions)</li> <li>*_MeanCoverageBED.bedgraph - mean read depth for each region specified in supplied BED file</li> <li>*_PerBaseDepthBED.bedgraph - per base read depth for each base in each region specified in supplied BED file</li> <li>*_PercentGenomeCovered.txt - percentage of reference genome covered by reads.</li> <li>*_Aligned_Sorted_PCRDuped_FiltMAPQ.bam - aligned BAM file with sorted reads, PCR duplicates removed, and reads below mapping quality threshold removed</li> <li>*_Aligned_Sorted_PCRDuped_FiltMAPQ.bam.bai - index for above BAM file</li> <li>*_Final.vcf - final VCF file</li> <li>*_AnnotationVEP.html - HTML report of variant annotation, with consequences for all known transcript isoforms</li> <li>*_AnnotationVEP.tsv - as above but in tab-separated values (TSV) format</li> </ul> <h1>Hard-coded sections of code</h1> <ul> <li>PipelineWrapper.sh, line 120: <code>/home/ubuntu/pipeline/AnalysisMasterVersion1.sh "${Read1}" "${Read2}" ...</code> - absolute path filename for AnalysisMasterVersion1.sh</li> <li>PipelineWrapper.sh, line 138: <code>remoteDir="/remote/SAMBA/share/"</code> - Remote server directory to which results files will be transferred via SSH/sFTP</li> <li>PipelineWrapper.sh, line 150, 161: <code>sshpass -e sftp $username@XXX.XXX.XXX.XXX << !</code> - Remote server IP address or host name to which results files will be transferred via SSH/sFTP</li> <li>AnalysisMasterVersion1.sh, lines 25-35 - root directories (absolute paths) of required programs</li> <li>AnalysisMasterVersion1.sh, lines 286/304 - minimum base quality (<code>--min-BQ</code>) set to 30 for BCFtools mpileup</li> <li>AnalysisMasterVersion1.sh, lines 374 - extra filters for filtering variants via vcfutils.pl (bundled with BCFtools in 'misc' directory)</li> <li>AnalysisMasterVersion1.sh, line 396/7 - species and assembly for VEP set to Homo sapiens and GRCh38</li> </ul> <h1>References</h1> <ul> <li>Andrews S, FastQC, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, last accessed 28th August 2017.</li> <li>Broad Institute of MIT and Harvard, Picard, http://broadinstitute.github.io/picard/, last accessed 28th August 2017</li> <li>Krueger F, Trim Galore!, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/, last accessed 28th August 2017.</li> <li>Li H and Durbin R (2009), Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25(14): 1754–1760.</li> <li>Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup (2009), The Sequence Alignment/Map format and SAMtools, Bioinformatics 25(16): 2078-9.</li> <li>Martin M (2011), Cutadapt removes adapter sequences from high-throughput sequencing reads, EMBnet.journal 17(1): 10-12.</li> <li>McLaren W, Gil L, Hunt S, Riat H, Ritchie G, Thormann A, Flicek P, Cunningham F (2016), The Ensembl Variant Effect Predictor, Genome Biology 17: 122.</li> <li>Quinlan AR & Hall IM (2010), BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26(6): 841-2.</li> </ul> <h1>Credits</h1> <ul> <li>Kevin Blighe (Sheffield Children's NHS Foundation Trust)</li> <li>Nick Beauchamp (Sheffield Children's NHS Foundation Trust)</li> <li>Darren Grafham (Sheffield Children's NHS Foundation Trust)</li> <li>Lucy Crookes (Sheffield Children's NHS Foundation Trust)</li> <li>Sasirekha Palaniswamy ('Sashi') (Sheffield Children's NHS Foundation Trust)</li> <li>Sheffield Diagnostics Genetics Service</li> </ul>

ClinicalGradeDNAseq

Install / Use

README

ClinicalGradeDNAseq - random read sampling to improve sensitivity