VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules

License: [Apache2][license]

[Bioinformatics Paper][biop] Wala, J., C. Zhang, M. Meyerson, R. Beroukhim. VariantBam: filtering and profiling of nextgenerational sequencing data using region-specific rules. 2016. Bioinformatics, doi: 10.1093/bioinformatics/btw111

NOTE: VariantBam recently was updated to use the more universal JSON syntax, and to remove all dependencies on Boost to make installation easier.

Installation
Description
Examples
Rules script syntax
- Region
- Global region
- Rules
Command line usage
- Full list of options
Full list of available JSON rules
Attributions

Installation

I have succesfully built on Unix with GCC-4.8, 4.9 and 5.1

git clone --recursive https://github.com/jwalabroad/VariantBam.git
cd VariantBam
./configure
make

To add support for reading BAMs, etc with HTTPS, FTP, S3, Google cloud, etc, you must compile and link with libcurl.

## set hts to build with libcurl links and hfile_libcurl.c
cd VariantBam/SeqLib/htslib
./configure --enable-libcurl 
## compile seqlib with libcurl support
cd ../../ # back to VariantBam main directory
./configure LDFLAGS="-lcurl -lcrypto"
make 
make install ## if no root or didn't configure with --prefix, binary is found at src/variant

## test
variant gs://isb-cgc-open/ccle/LUSC/DNA-Seq/C836.NCI-H1339.2.bam | head

Quick Start

## using the included test BAM (HCC1143)
BAMOUTPUT='-b'
VariantBam/src/variant test/small.bam -g 'X:1,000,000-1,100,000' --min-mapq 10 $BAMOUTPUT -o mini.bam -v

## can stream in and out (out is binary stream with -b flag)
samtools view -h test/small.bam | variant - --min-mapq 10 -b | bamToBed > out.bed

## get help
VariantBam/src/variant --help

## TL;DR examples

## extract all reads and their pair-mates that overlap SNP sites within 100 bp
rfile=<BED file, samtools-style string (e.g. "1:1,000,000-1,000,010"), or VCF>
variant <bam> -l $rfile -P 100 -b -o mini.bam -v

## mask regions (exclude reads and their pair-mates that overlap)
variant <bam> -L $rfile -b -o mini.bam -v 

## extract high-quality clipped reads (where clip length account for low quality bases)
variant <bam> --min-phred 4 --min-clip 5 -o mini.sam -v

## extract reads with high mapq that also contain a large insertion or deletion
## pipe output
variant <bam> --min-mapq 20 --min-ins 10 --min-del 10 | wc -l

## subsample to max-coverage. BAM must be sorted
variant <bam> -m 100 -o mini.bam -v -b

Description

VariantBam is a tool to extract/count specific sets of sequencing reads from next-generational sequencing files. To save money, disk space and I/O, one may not want to store an entire BAM on disk. In many cases, it would be more efficient to store only those read-pairs or reads who intersect some region around the variant locations. Alternatively, if your scientific question is focused on only one aspect of the data (e.g. breakpoints), many reads can be removed without losing the information relevant to the problem, and enriching for the signal you are interested in.

Tool comparison

VariantBam packages into a single executable a number of filtering features not easily found using samtools + awk:

Filter specifically on read clipping, orientation and insert size (all important for structural variation)

Support for considering only high-quality bases when determining read length or clip count

[Interval tree][ekg] to efficiently determine if a read overlaps a region

Ability to link reads to a genomic region if their mate intersects that region.

Provide different rules for different arbitrarily-sized regions, and to provide these regions as common variant files (VCF, MAF, BED)

Select reads by matching motifs against a large dictionary using [Aho-Corasick implementation][aho]

Count reads that satisfy any number of user-defined properties

Selectively strip alignment tags

Support for sub-sampling to obtain a BAM file with a coverage limit

VariantBam is implemented in C++ and uses [HTSlib][hlib], a highly optimized C library used as the core of [Samtools][samtools] and [BCFtools][bcf].

To get a full list of options, run variant --help.

Examples

Example Use 1

Whole-genome analysis has been conducted on a BAM, generating VCF and MAF files. Ideally, these regions could be manually inspected or reanalyzed without having to keep the entire BAM. Running VariantBam to extract only read-pairs that overlap these events will allow these regions to be rapidly queried, without having to keep the full BAM record.

### Extract all read PAIRS that interset with a variant from a VCF
variant $bam -l myvcf.vcf -o mini.bam -b

Example Use 2

In situations where the sequencing or library preparation quality is low, it may be advantageous to remove poor quality reads before starting the analysis train. VariantBam handles this by optionally taking into account base-qualities when making a decision whether to keep a sequencing read. For instance, one might only be interested in high quality MAPQ 0 or clipped reads. VariantBam can be setup to apply unique base-quality filters to different regions or across the entire genome, all with one-pass.

### Extract only high quality reads with >= 50 bases  and MAPQ >= 1 and not duplicated/hardclip/qcfail
### json
{
  "region1" : {
  "region"  : "WG",
  "rules" : [{
      "length" : [50,1000],
      "mapq" : [1, 60],
      "duplicate" : false,
      "hardclip" : false,
      "qcfail" : false
   }
  ]
   }
}
###
variant $bam -r example2.json -o mini.bam -b

Example Use 3

An NGS tool operates only on a subset of the reads (eg. structural variant caller using only clipped/discordant reads). Running VariantBam to keep only these reads allows the tool to run much faster. This is particurlaly useful for facilitating a more rapid "build/test" cycle.

### Extract clipped, discordant, unmapped and indel reads
### json
{
  "global" : { "nbases" : [0,0], "hardclip" : false, "supplementary" : false, "qcfail" : false},
  "region_wg" : {"region" : "WG", "rules" : [ 
         { "mapq" : [0, 1000], "clip" : [5,1000] }, {"ic" : true}, {"ff" : true}, {"rf" : true}, {"rr" : true}, { "ins" : [1,1000], "mapq" : [1,100] }, { "del" : [1,1000], "mapq" : [1,1000] } 
  ]}
}
###
variant $bam -r example3.json

Example Use 5

A team is only interested in variants in known cancer genes, and would like to analyze thousands of exomes and genomes. Running VariantBam to extract reads from only these genes, and sending the BAM files to compressed CRAM provides sufficient data reduction to allow all of the relevant data to be stored on disk.

### Grab only reads from predefined regions. Strip unneccessary tags and convert to CRAM for maximum compression
variant $bam -l mygenes.bed -C -o mini.cram -s BI,OQ

Example Use 6

A research team would like to extract only reads matching a certain motifs, but only if they have high optical quality. VariantBam with the motif rule will accomplish this with rapid O(n) efficiency for an arbitrarily large motif dictionary (where n is the length of a read)

## example6.json
{
"example6": {
  "rules": [{"motif": "mymotifs.txt",
             "length": 20 }]
}
}									  
##

### 
variant $bam -r example6.json ## input as a JSON
variant $bam --min-length 20 --motif mymotifs.txt ## using command line shortcuts

Example Use 7

To reduce the size of the BAM, reads can be removed from centromeric and satellite repeat regions. These reads are rarely helpful for variant calling. To remove reads that intersect a region, set the region as an inverse-region. In a VariantBam script, use "exclude" : true```. For quick use on the command line, use -Lor-G(opposites of-land-g``).

### json 
{
  "" : {
         "region" : "bad.bed",
         "exclude" : true,
	 "matelink" : true
       }
}
###
variant $bam -L bad.bed -o mini.bam -b

Example Use 8

Massive read-pileups can occur at repetitive regions. These can reduced with VariantBam by subsampling to a max-coverage.

### BAM must be sorted
variant $bam -m 100 -o mini.bam -b

Example Use 9

Obtain basic QC stats from a BAM file, or profile how many reads were accepted by each rule

### get QC stats on whole bam AND find how many reads are clipped with high-quality clipped bases
### use the -x flag to produce no output (profiling only)
variant <bam> --min-clip 10 --min-phred 5 -q qcreport.txt -x
Rscript VariantBam/R/BamQCPlot.R -i qcreport.txt -o qcreport.pdf

Example Use 10

A user would like to extract only those reads supporting a particular allele at a variant site. This can be done by combining a small point-region at the variant site with a motif dictionary. Consider two alleles G and A at a site (e.g. 1:143250877), along with their adjacent sequences: GCAGAAT and GCAAAAT. To extract variant reads supporting the A allele:

## make the motifs file (include reverse complements) 
printf "GCAAAAT\nATTTTGC" > motifs.txt
## just look near the variant
k="1:143,250,677-143,251,077" 
r='{"":{"rules":[{"motif":"mo

VariantBam

Install / Use

README