CaVEMan

A C implementation of the CaVEMan program. Uses an expectation maximisation approach to calling single base substitutions in paired data. Designed for use with a compute farm/cluster; most steps in the program make use of an index parameter. The split step is designed to divide the genome into chunks of adjustable size to optimise for runtime/memory usage requirements. For simple execution of CaVEMan please see cgpCaVEManWrapper

| Master | Dev | | ------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | | | |

Installation

See INSTALL.TXT

Prerequisites

BWA Mapped, indexed, duplicate marked/removed alignment files, for both a normal and tumour sample
- Only alignments with standard illumina quality scores have been tested, other inputs may cause unexpected behaviour
- Supports BAM and, as of version 1.6.0, CRAM (with index).
Reference.fasta and index
A one based bed style format file of regions to ignore during analysis (see specified format).
zlib >= 1.2.3.5
linasm >= 1.13

Optional inputs (will result in more accurate calls)

Normal and tumour copy number files (see specified format).
A normal contamination of tumour value

Processing flow

CaVEMan is executed in several distinct steps in the order listed below.

Setup

Generates a config file for use with the remaining CaVEMan steps (it'll save you a lot of time typing commandline args). Also generates a file named 'alg_bean' in the run directory.

bin/caveman setup || ./bin/setupCaveman
Usage: caveman setup -t tum.bam -n norm.bam -r reference.fa.fai -g ignore_regions.tab -e tum_cn.bed -j norm_cn.bed [-f path] [-l path] [-a path] [-wzu]

-t  --tumour-bam [file]             Location of tumour bam
-n  --normal-bam [file]             Location of normal bam
-r  --reference-index [file]        Location of reference fasta index
-g  --ignore-regions-file [file]    Location of tsv ignore regions file

Optional
-c  --config-file [file]            File to write caveman run config file [default:'./caveman.cfg.ini']
-f  --results-folder [file]         Folder to write results [default:'./results']
-l  --split-file [file]             File to write list of split sections [default:'./splitList']
-a  --alg-bean-file [file]          Location to write alg-bean [default:'./alg_bean']
-w  --include-smith-waterman        Include SW mapped reads in the analysis
-z  --include-single-end            Use single end reads for this analysis
-u  --include-duplicates            Include reads marked as duplicates in the analysis
-e  --tumour-copy-no-file [file]    Location of tumour copy number bed file (if the extension is not .bed the file will 		be treated as 1 based start). If no copy number file is supplied then the default cn of 2 will be used
-j  --normal-copy-no-file [file]    Location of normal copy number bed file (if the extension is not .bed the file will 		be treated as 1 based start). If no copy number file is supplied then the default cn of 2 will be used
-h	--help                          Display this usage information.

Split

Requires one job per entry in your reference.fasta.fai file. Each job creates a list of segments to be analysed, these are determined by total read count and do not include reads from ignored regions. The size of sections can be tuned using the -m, -c and -e parameters. Once all jobs complete successfully you will need to concatenate all split files into a single file with the name passed to the setup step in -f parameter (or splitList if you used the default).

bin/caveman split || ./bin/splitCaveman
Usage: caveman split -i jobindex -f path [-c int] [-m int] [-e int]

-f --config-file file           Path to the config file produced by setup [default: 'caveman.cfg.ini'].
-i  --index int                 Job index (e.g. from $LSB_JOBINDEX)

Optional
-c  --increment int             Increment to use when deciding split sizes
-m  --max-read-count double     Proportion of read-count to allow as a max in a split section
-e  --read-count int            Guide for maximum read count in a section
-h	help                        Display this usage information.

Mstep

Requires one job per entry in the merged split file. The parameter -i referes to a line in the merged split file. -a can be used to tune the size of section downloaded from bam at a time, this allows tuning of the memory footprint. The mstep will create a file for each job under the results folder (specified as a parameter in the setup step).

bin/caveman mstep || ./bin/mstepCaveman
Usage: caveman mstep -f config.file -i jobindex [-m int] [-a int]

-f --config-file file                  Path to the config file produced by setup [default: 'caveman.cfg.ini'].
-i --index int                         Job index.

Optional
-a --split_size int                    Size of section to retrieve at a time from bam file. Allows memory footprint tuning [default:50000].
-m  --min-base-qual int                Minimum base quality to include in analysis [default:11]
-h	help                                Display this usage information.

Merge

Runs as a single job, merging all the cov_array files generated by the mstep into a file representing the profile of the whole genome. The resulting file is named cov_array and stored in the root run folder. Another file named prob_array is also created.

bin/caveman merge || ./bin/mergeCaveman
Usage: caveman merge -f config_file [-c path] [-p path]

-f --config-file file                Path to the config file produced by setup. [default: 'caveman.cfg.ini']

Optional
-c  --covariate-file filename        Location to write merged covariate array [default: covs_arr]
-p  --probabilities-file filename    Location to write probability array [default: probs_arr]
-h	help                              Display this usage information.

Estep

The final step in calling variants using CaVEMan. As was the case with the mstep, a job per entry in the merged split list is required. Copy number (-e, -j), and normal contamination (-k) are required (See default settings for advice on obtaining results if you don't have these). Each job will create several files named according to the corresponding line in the merged split file. *_muts.vcf and *_snps.vcf are the calls above the given cutoffs for somatic and SNP probabilities alike. The *.no_analysis.bed file lists sections not analysed by caveman, due to being ignored or lacking coverage. Using the debugoption will output another file named *.dbg.vcf containing a line for every position in the genome that was analysed with read counts (as seen by CaVEMan) and top two probabilities calculated.

bin/caveman estep || ./bin/estepCaveman
Usage: caveman estep -i jobindex [-f file] [-m int] [-k float] [-b float] [-p float] [-q float] [-x int] [-y int] [-c 	float] [-d float] [-a int]

-i  --index [int]                                Job index (e.g. from $LSB_JOBINDEX)

Optional
-f  --config-file [file]                         Path to the config file produced by setup. [default:'caveman.cfg.ini']
-m  --min-base-qual [int]                        Minimum base quality for inclusion of a read position [default:11]
-c  --prior-mut-probability [float]              Prior somatic probability [default:0.000006]
-d  --prior-snp-probability [float]              Prior germline mutant probability [default:0.000100]
-k  --normal-contamination [float]               Normal contamination of tumour [default:0.100000]
-b  --reference-bias [float]                     Reference bias [default:0.950000]
-p  --mut-probability-cutoff [float]             Minimum probability call for a somatic mutant position to be output 		[default:0.800000]
-q  --snp-probability-cutoff [float]             Minimum probability call for a germline mutant position to be output 		[default:0.950000]
-x  --min-tum-coverage [int]                     Minimum tumour coverage for analysis of a position [default:1]
-y  --min-norm-coverage [int]                    Minimum normal coverage for analysis of a position [default:1]
-a  --split-size [int]                           Size of section to retrieve at a time from bam file. Allows memory 								 footprint tuning [default:50000].
-s  --debug                                      Adds an extra output to a debug file. Every base analysed has an 			output
-g  --cov-file [file]                            File location of the covariate array. [default:'covs_arr']
-o  --prob-file [file]                           File location of the prob array. [default:'probs_arr']
-v  --species-assembly [string]                  Species assembly (eg 37/GRCh37), required if bam header SQ lines do 								 not contain AS and SP information.
-w  --species [string]                           Species name (eg Human), required if bam header SQ lines do not 								 contain AS and SP information.
-n  --normal-copy-number [int]                   Copy number to use when filling gaps in the normal copy number file 								 [default:2].
-t  --tumour-copy-number [int]                   Copy number to use when filling gaps in the tumour copy number file 								 [default:2].
-l  --normal-protocol [string]                   Normal

CaVEMan

Install / Use

README

CaVEMan

Installation

Prerequisites

Optional inputs (will result in more accurate calls)

Processing flow

Setup

Split

Mstep

Merge

Estep