CaVEMan
SNV expectation maximisation based mutation calling algorithm aimed at detecting somatic mutations in paired (tumour/normal) cancer samples. Supports both bam and cram format via htslib
Install / Use
/learn @cancerit/CaVEManREADME
CaVEMan
A C implementation of the CaVEMan program. Uses an expectation maximisation approach to calling single base substitutions in paired data. Designed for use with a compute farm/cluster; most steps in the program make use of an index parameter. The split step is designed to divide the genome into chunks of adjustable size to optimise for runtime/memory usage requirements. For simple execution of CaVEMan please see cgpCaVEManWrapper
| Master | Dev |
| ------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| |
|
Installation
See INSTALL.TXT
Prerequisites
- BWA Mapped, indexed, duplicate marked/removed alignment files, for both a normal and tumour sample
- Only alignments with standard illumina quality scores have been tested, other inputs may cause unexpected behaviour
- Supports BAM and, as of version 1.6.0, CRAM (with index).
- Reference.fasta and index
- A one based bed style format file of regions to ignore during analysis (see specified format).
- zlib >= 1.2.3.5
- linasm >= 1.13
Optional inputs (will result in more accurate calls)
- Normal and tumour copy number files (see specified format).
- A normal contamination of tumour value
Processing flow
CaVEMan is executed in several distinct steps in the order listed below.
Setup
Generates a config file for use with the remaining CaVEMan steps (it'll save you a lot of time typing commandline args). Also generates a file named 'alg_bean' in the run directory.
bin/caveman setup || ./bin/setupCaveman
Usage: caveman setup -t tum.bam -n norm.bam -r reference.fa.fai -g ignore_regions.tab -e tum_cn.bed -j norm_cn.bed [-f path] [-l path] [-a path] [-wzu]
-t --tumour-bam [file] Location of tumour bam
-n --normal-bam [file] Location of normal bam
-r --reference-index [file] Location of reference fasta index
-g --ignore-regions-file [file] Location of tsv ignore regions file
Optional
-c --config-file [file] File to write caveman run config file [default:'./caveman.cfg.ini']
-f --results-folder [file] Folder to write results [default:'./results']
-l --split-file [file] File to write list of split sections [default:'./splitList']
-a --alg-bean-file [file] Location to write alg-bean [default:'./alg_bean']
-w --include-smith-waterman Include SW mapped reads in the analysis
-z --include-single-end Use single end reads for this analysis
-u --include-duplicates Include reads marked as duplicates in the analysis
-e --tumour-copy-no-file [file] Location of tumour copy number bed file (if the extension is not .bed the file will be treated as 1 based start). If no copy number file is supplied then the default cn of 2 will be used
-j --normal-copy-no-file [file] Location of normal copy number bed file (if the extension is not .bed the file will be treated as 1 based start). If no copy number file is supplied then the default cn of 2 will be used
-h --help Display this usage information.
Split
Requires one job per entry in your reference.fasta.fai file. Each job creates a list of segments to be analysed, these are determined by total read count and do not include reads from ignored regions. The size of sections can be tuned using the -m, -c and -e parameters. Once all jobs complete successfully you will need to concatenate all split files into a single file with the name passed to the setup step in -f parameter (or splitList if you used the default).
bin/caveman split || ./bin/splitCaveman
Usage: caveman split -i jobindex -f path [-c int] [-m int] [-e int]
-f --config-file file Path to the config file produced by setup [default: 'caveman.cfg.ini'].
-i --index int Job index (e.g. from $LSB_JOBINDEX)
Optional
-c --increment int Increment to use when deciding split sizes
-m --max-read-count double Proportion of read-count to allow as a max in a split section
-e --read-count int Guide for maximum read count in a section
-h help Display this usage information.
Mstep
Requires one job per entry in the merged split file. The parameter -i referes to a line in the merged split file. -a can be used to tune the size of section downloaded from bam at a time, this allows tuning of the memory footprint. The mstep will create a file for each job under the results folder (specified as a parameter in the setup step).
bin/caveman mstep || ./bin/mstepCaveman
Usage: caveman mstep -f config.file -i jobindex [-m int] [-a int]
-f --config-file file Path to the config file produced by setup [default: 'caveman.cfg.ini'].
-i --index int Job index.
Optional
-a --split_size int Size of section to retrieve at a time from bam file. Allows memory footprint tuning [default:50000].
-m --min-base-qual int Minimum base quality to include in analysis [default:11]
-h help Display this usage information.
Merge
Runs as a single job, merging all the cov_array files generated by the mstep into a file representing the profile of the whole genome. The resulting file is named cov_array and stored in the root run folder. Another file named prob_array is also created.
bin/caveman merge || ./bin/mergeCaveman
Usage: caveman merge -f config_file [-c path] [-p path]
-f --config-file file Path to the config file produced by setup. [default: 'caveman.cfg.ini']
Optional
-c --covariate-file filename Location to write merged covariate array [default: covs_arr]
-p --probabilities-file filename Location to write probability array [default: probs_arr]
-h help Display this usage information.
Estep
The final step in calling variants using CaVEMan. As was the case with the mstep, a job per entry in the merged split list is required. Copy number (-e, -j), and normal contamination (-k) are required (See default settings for advice on obtaining results if you don't have these). Each job will create several files named according to the corresponding line in the merged split file. *_muts.vcf and *_snps.vcf are the calls above the given cutoffs for somatic and SNP probabilities alike. The *.no_analysis.bed file lists sections not analysed by caveman, due to being ignored or lacking coverage. Using the debugoption will output another file named *.dbg.vcf containing a line for every position in the genome that was analysed with read counts (as seen by CaVEMan) and top two probabilities calculated.
bin/caveman estep || ./bin/estepCaveman
Usage: caveman estep -i jobindex [-f file] [-m int] [-k float] [-b float] [-p float] [-q float] [-x int] [-y int] [-c float] [-d float] [-a int]
-i --index [int] Job index (e.g. from $LSB_JOBINDEX)
Optional
-f --config-file [file] Path to the config file produced by setup. [default:'caveman.cfg.ini']
-m --min-base-qual [int] Minimum base quality for inclusion of a read position [default:11]
-c --prior-mut-probability [float] Prior somatic probability [default:0.000006]
-d --prior-snp-probability [float] Prior germline mutant probability [default:0.000100]
-k --normal-contamination [float] Normal contamination of tumour [default:0.100000]
-b --reference-bias [float] Reference bias [default:0.950000]
-p --mut-probability-cutoff [float] Minimum probability call for a somatic mutant position to be output [default:0.800000]
-q --snp-probability-cutoff [float] Minimum probability call for a germline mutant position to be output [default:0.950000]
-x --min-tum-coverage [int] Minimum tumour coverage for analysis of a position [default:1]
-y --min-norm-coverage [int] Minimum normal coverage for analysis of a position [default:1]
-a --split-size [int] Size of section to retrieve at a time from bam file. Allows memory footprint tuning [default:50000].
-s --debug Adds an extra output to a debug file. Every base analysed has an output
-g --cov-file [file] File location of the covariate array. [default:'covs_arr']
-o --prob-file [file] File location of the prob array. [default:'probs_arr']
-v --species-assembly [string] Species assembly (eg 37/GRCh37), required if bam header SQ lines do not contain AS and SP information.
-w --species [string] Species name (eg Human), required if bam header SQ lines do not contain AS and SP information.
-n --normal-copy-number [int] Copy number to use when filling gaps in the normal copy number file [default:2].
-t --tumour-copy-number [int] Copy number to use when filling gaps in the tumour copy number file [default:2].
-l --normal-protocol [string] Normal
