freebayes, a haplotype-based variant detector

user manual and guide

Note that CI tests may fail until vcflib is updated to 1.0.13 on the main distros. This is because the vcflib include files have moved to /usr/include/vcflib.

Overview

freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

freebayes is haplotype-based, in the sense that it calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment. This model is a straightforward generalization of previous ones (e.g. PolyBayes, samtools, GATK) which detect or report variants based on alignments. This method avoids one of the core problems with alignment-based variant detection: identical sequences may have multiple possible alignments.

freebayes uses short-read alignments (BAM files with Phred+33 encoded quality scores, now standard) for any number of individuals from a population and a reference genome (in FASTA format) to determine the most-likely combination of genotypes for the population at each position in the reference. It reports positions which it finds putatively polymorphic in Variant Call Format (VCF). It can also use an input set of variants (VCF) as a source of prior information, and a copy number variant map (BED) to define non-uniform ploidy variation across the samples under analysis.

freebayes is maintained by Erik Garrison and Pjotr Prins. See also RELEASE-NOTES.

Citing freebayes

The preprint "Haplotype-based variant detection from short-read sequencing" provides an overview of the statistical models used in freebayes. We ask that you cite this paper if you use freebayes in work that leads to publication. This preprint is used for documentation and citation. freebayes was never submitted for review, but has been used in over 1000 publications.

Please use this citation format:

Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012

If possible, please also refer to the version number provided by freebayes when it is run without arguments or with the --help option.

Install

freebayes is provided as a pre-built 64-bit static Linux binary as part of releases.

Debian and Conda packages should work too; see the badges at the top of this page.

To build freebayes from source, check the development section below. It is important to get the full recursive git checkout and dependencies.

Support

Please report any issues or questions to the freebayes mailing list. Report bugs on the freebayes issue tracker.

Usage

In its simplest operation, freebayes requires only two inputs: a FASTA reference sequence, and a BAM-format alignment file sorted by reference position. For instance:

freebayes -f ref.fa aln.bam >var.vcf

... will produce a VCF file describing all SNPs, INDELs, and haplotype variants between the reference and aln.bam. The equivalent command for a CRAM file is:

freebayes -f ref.fa aln.cram >var.vcf

Multiple BAM files may be given for joint calling.

Typically, we might consider two additional parameters. GVCF output provides coverage information about non-called sites, and we can enable it with --gvcf. For performance reasons, we may want to skip regions of extremely high coverage using the --skip-coverage parameter, or -g. Such regions can greatly increase runtime but do not produce meaningful results. For instance, to skip regions where coverage exceeds 1000X, we would run:

freebayes -f ref.fa aln.bam --gvcf -g 1000 >var.vcf

For a description of available command-line options and their defaults, run:

freebayes --help

Examples

Call variants assuming a diploid sample:

freebayes -f ref.fa aln.bam >var.vcf

Call variants on only chrQ:

freebayes -f ref.fa -r chrQ aln.bam >var.vcf

Call variants on only chrQ, from position 1000 to 2000:

freebayes -f ref.fa -r chrQ:1000-2000 aln.bam >var.vcf
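
If the regions of interest already live in a BED file, the same chrom:start-end strings can be produced mechanically. The sketch below is an illustration only: bed_to_regions is a hypothetical helper name, and it assumes that -r uses the same 0-based, end-exclusive coordinates as BED (this is how the --help text describes it, but it is worth confirming against your installed version).

```shell
# Hypothetical helper: turn BED intervals (chrom, start, end) into
# region strings of the form chrom:start-end, one per line.
# Reads the files given as arguments, or stdin if there are none.
bed_to_regions() {
    awk '
        # skip comments, track/browser lines and malformed records
        /^(#|track|browser)/ || NF < 3 { next }
        { print $1 ":" $2 "-" $3 }
    ' "$@"
}
```

For example, `printf 'chrQ\t1000\t2000\n' | bed_to_regions` prints `chrQ:1000-2000`, which can be passed to -r as in the command above.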

Require at least 5 supporting observations to consider a variant:

freebayes -f ref.fa -C 5 aln.bam >var.vcf

Skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than 200:

freebayes -f ref.fa -g 200 aln.bam >var.vcf

Use a different ploidy:

freebayes -f ref.fa -p 4 aln.bam >var.vcf

Assume a pooled sample with a known number of genome copies. Note that this means that each sample identified in the BAM file is assumed to have 32 genome copies. When running with high --ploidy settings, it may be required to set --use-best-n-alleles to a low number to limit memory usage.

freebayes -f ref.fa -p 32 --use-best-n-alleles 4 --pooled-discrete aln.bam >var.vcf
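
To get a feel for why limiting the number of alleles matters at high ploidy, it helps to count the possible unordered genotypes for a single sample: with ploidy p and k candidate alleles there are C(p + k - 1, k - 1) of them. The sketch below only does that arithmetic. genotype_count is a hypothetical helper, and the count is a rough proxy for the size of the search space, not a statement about how freebayes allocates memory.

```shell
# Hypothetical helper: number of unordered genotypes for a given
# ploidy and number of alleles, i.e. the binomial C(p + k - 1, k - 1).
genotype_count() {
    p=$1    # ploidy (genome copies per sample)
    k=$2    # number of candidate alleles at the site
    r=$((k - 1))
    result=1
    i=1
    while [ "$i" -le "$r" ]; do
        # after step i, result equals C(p + i, i), so the division is exact
        result=$((result * (p + i) / i))
        i=$((i + 1))
    done
    echo "$result"
}

genotype_count 2 2     # diploid, 2 alleles: 3
genotype_count 32 4    # pooled, 4 alleles: 6545
genotype_count 32 8    # pooled, 8 alleles: 15380937
```

Going from 4 to 8 alleles at ploidy 32 multiplies the number of genotypes per sample by more than two thousand, which is the motivation for keeping --use-best-n-alleles low.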

Generate frequency-based calls for all variants passing input thresholds. You'd do this in the case that you didn't know the number of samples in the pool.

freebayes -f ref.fa -F 0.01 -C 1 --pooled-continuous aln.bam >var.vcf
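
The two input thresholds interact with read depth. As a rough illustration, the sketch below computes how many alternate observations a site needs at a given depth before it is considered. It assumes that both -F (minimum alternate fraction) and -C (minimum alternate count) must be satisfied and that both comparisons are inclusive; min_alt_obs is a hypothetical helper, not part of freebayes.

```shell
# Hypothetical helper: smallest number of alternate observations that
# satisfies both a minimum fraction (-F) and a minimum count (-C)
# at a given depth. Assumes inclusive (>=) comparisons.
min_alt_obs() {
    # usage: min_alt_obs DEPTH MIN_FRACTION MIN_COUNT
    awk -v d="$1" -v f="$2" -v c="$3" 'BEGIN {
        x = d * f
        need = int(x)
        if (need < x - 1e-9) need++   # round up, tolerating float error
        if (need < c) need = c        # the count threshold is a floor
        print need
    }'
}

min_alt_obs 1000 0.01 1   # 10
min_alt_obs 50 0.01 1     # 1
```

Under these assumptions, with -F 0.01 -C 1 a single supporting read is enough at depths up to 100, and the fraction threshold only starts to dominate beyond that.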

Use an input VCF (bgzipped + tabix indexed) to force calls at particular alleles:

freebayes -f ref.fa -@ in.vcf.gz aln.bam >var.vcf

Generate long haplotype calls over known variants:

freebayes -f ref.fa --haplotype-basis-alleles in.vcf.gz \
                    --haplotype-length 50 aln.bam

Naive variant calling: simply annotate observation counts of SNPs and indels:

freebayes -f ref.fa --haplotype-length 0 --min-alternate-count 1 \
    --min-alternate-fraction 0 --pooled-continuous --report-monomorphic aln.bam >var.vcf

Parallelisation

In general, freebayes can be parallelised by running multiple instances of freebayes on separate regions of the genome, and then concatenating the resulting output. The wrapper freebayes-parallel will perform this, using GNU parallel.

Example freebayes-parallel operation (use 36 cores in this case):

freebayes-parallel <(fasta_generate_regions.py ref.fa.fai 100000) 36 \
    -f ref.fa aln.bam > var.vcf
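
The first argument in that command is a list of regions, one per line. To show what such a list looks like, here is a rough awk sketch that cuts each sequence in a FASTA index into fixed-size windows. It assumes the standard .fai layout (sequence name in column 1, length in column 2) and 0-based, end-exclusive region strings; fai_to_regions is a hypothetical stand-in, and the bundled fasta_generate_regions.py remains the supported tool.

```shell
# Hypothetical helper: split every sequence listed in a .fai index
# into windows of at most REGION_SIZE bases, printed as chrom:start-end.
fai_to_regions() {
    # usage: fai_to_regions REF.fa.fai REGION_SIZE
    awk -v size="$2" '{
        for (start = 0; start < $2; start += size) {
            end = start + size
            if (end > $2) end = $2    # last window is clipped to the sequence length
            print $1 ":" start "-" end
        }
    }' "$1"
}
```

For a 250 bp sequence named chrQ and a window size of 100, this prints chrQ:0-100, chrQ:100-200 and chrQ:200-250.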

Note that any of the above examples can be made parallel by using the scripts/freebayes-parallel script. If you find freebayes to be slow, you should probably be running it in parallel, either using this script to run on a single host, or generating a series of scripts, one per region, and running them on a cluster. Be aware that the freebayes-parallel script contains calls to other programs using relative paths from the scripts subdirectory; the easiest way to ensure a successful run is to invoke the freebayes-parallel script from within the scripts subdirectory.
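
For the cluster route, the one-command-per-region idea can be sketched as follows. This is a dry run only: it prints the command lines without executing anything, and region_commands, the output directory layout and the file naming are all assumptions made for illustration.

```shell
# Hypothetical helper: read region strings (chrom:start-end) on stdin
# and print one freebayes command line per region. Nothing is executed.
region_commands() {
    # usage: region_commands REF.fa ALN.bam OUTDIR < regions.txt
    ref=$1
    bam=$2
    outdir=$3
    while IFS= read -r region; do
        [ -n "$region" ] || continue                  # ignore blank lines
        safe=$(printf '%s' "$region" | tr ':' '_')    # file-name friendly
        printf 'freebayes -f %s -r %s %s > %s/%s.vcf\n' \
            "$ref" "$region" "$bam" "$outdir" "$safe"
    done
}
```

Each printed line can be wrapped in a job script for your scheduler. The per-region VCFs then still have to be concatenated in order with a single header, which is the step the freebayes-parallel wrapper handles for you on a single host.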

A current limitation of the freebayes-parallel wrapper is that, due to variance in job memory and runtimes, some cores can go unused for long periods, as they will not move on to the next job until all cores in use have completed their respective genome chunk. This can be partly avoided by calculating coverage of the input BAM file, and splitting the genome into regions of equal coverage using the coverage_to_regions.py script. An alternative script, split_ref_by_bai_datasize.py, will determine target regions based on the data within multiple BAM files, with the option of choosing a target data size. This is useful when submitting to Slurm and other cluster job managers, where use of resources needs to be controlled.

Alternatively, users may wish to parallelise freebayes within the workflow manager snakemake. As snakemake automatically dispatches jobs when a core becomes available, this avoids the above issue. An example .smk file, and associated conda environment recipe, can be found in the /examples directory.

Calling variants: from fastq to VCF

You've sequenced some samples. You have a reference genome or assembled set of contigs, and you'd like to determine reference-relative variants in your samples. You can use freebayes to detect the variants, following these steps:

Align your reads to a
