Hap.py
Haplotype VCF comparison tools
Install / Use
/learn @Illumina/Hap.pyREADME
Haplotype Comparison Tools
Peter Krusche pkrusche@illumina.com
This is a set of programs based on htslib to benchmark variant calls against gold standard truth datasets.
To compare a VCF against a gold standard dataset, use the following commmand line to perform genotype-level haplotype comparison.
hap.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa
We also have a script to perform comparisons only based on chromosome, position, and allele identity. This comparison will not resolve haplotypes and only verify that the same alleles were observed at the same positions (e.g. for comparison of somatic callsets).
som.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa
More information can be found below in the usage section.
Contents
Motivation
Complex variant comparison
A major challenge when comparing VCF files for diploid samples is the handling of complex variant representations. In a VCF file, we describe two haplotype sequences by means of REF-ALT pairs and genotypes. These variant calls do not uniquely represent the haplotype sequences: since alignments are always not unique even when using a fixed set of gap and substitution scores, different variant calling methods may produce different variant representations. While some of these representational differences can be handled using pre-processing of VCF files (e.g. variant trimming and left-shifting), others cannot be fixed easily.
In addition to comparing VCF records individually, we produce a graph-based representation of the VCF alleles, create all possible haplotype sequences, and compare these by alignment / exact matching. Here is an example where this is needed:
Variant representation 1 (shown in purple in the image below):
CHROM POS REF ALT GT
chrQ 10 G GTGTGTGCATGCT 0/1
Variant representation 2 (shown in green in the image below):
CHROM POS REF ALT GT
chrQ 16 G GCATGCT 0/1
chrQ 19 T TGTGTG 0/1
Both representations in this example are able to produce the same alt sequences, but we are not able to match them up with standard VCF tools. In particular, we can see from this example that the second representation actually may allow us to create two different sets of alt sequences if they are part of unphased heterozygous variant calls. When we don't know the phasing of our variants, the insertions could have occurred on different haplotypes when using representation 2.
With this tool, we can produce all haplotypes sequences by enumerating paths through a reference graph. By finding the paths / alt alleles that are consistent between two VCFs files we can produce accurate benchmarking numbers for comparing a VCF to a gold standard truth set. See doc/spec.md for more information.
An alternative method to compare complex variant calls is implemented in RTG vcfeval. It is possible to use vcfeval with hap.py, and to use hap.py only for pre-processing, stratification and counting.
The comparison method in vcfeval is more sophisticated than ours and can resolve some corner cases more accurately. For whole-genome comparisons, the difference between the two benchmarking methods is small, but when focusing on difficult subsets of the genome or when using variant calling methods that produce many complex variant calls, these corner cases can become relevant. Moreover, when benchmarking against gold-standard datasets that cover difficult regions of the genome (e.g. Platinum Genomes), the more complicated subsets of the genome will be responsible for most of the difference between methods.
Variant preprocessing
Another component of hap.py is a variant pre-processing method which deals with complex variant representations and MNPs. When different callers may represent variants using a different number of VCF records, we should attempt to count these in a consistent fashion between methods. One example is the representation of MNVs as individual SNPs vs. as complex variants.
Consider the following case:
Complex variant representation:
chrQ 16 GGG TTT 0/1
vs.
Atomized representation:
chrQ 16 G T 0/1
chrQ 17 G T 0/1
chrQ 18 G T 0/1
If this variant is a false-positive, the first representation would naively contribute a single FP record. A variant caller that outputs the second representation would instead receive a penalty of three FPs for making the same variant call. Overall, the difference between the two representations might show significantly when looking at precision levels or false-positive rates (since these are relative to the total number of query counts, which use the same representations), but become important when we need to compare absolute numbers of false-positives. For this case, hap.py can perform a re-alignment of REF and ALT alleles on the query VCF, and splits the records into atomic variant alleles to produce more granular counts using pre.py. Left-shifting and trimming are also supported.
Variant counting
Hap.py includes a module to produce stratified variant counts. Variant types are determined using a re-alignment of REF and ALT alleles. This is more reliable than only using allele lengths. Consider the following complex deletion.
chr1 201586350 . CTCTCTCTCT CA
This complex variant call is equivalent to a deletion, followed by a SNP. Our quantification code will recognize this variant as a deletion and a SNP, and will count it in both categories (so a TP call for this variant will contribute a SNP and an INDEL). This effectively deals with variant calling methods that prefer to combine local haplotypes in the same variant records (e.g. Freebayes / Platypus), which would otherwise fall into a hard-to-assess "COMPLEX" variant call category that varies substantially between different variant calling methods.
chr1 201586350 . CTCTCTCTC C
chr1 201586359 . T A
Another feature of the quantification module in hap.py is stratification into variant sub-types and into genomic regions. For example, precision and recall can be computed at the same time for all GA4GH stratification regions, and for different INDEL lengths (<5, 7-15, 16+). Hap.py also calculates het-hom and Ti/Tv ratios for all subsets of benchmarked variants. Note that all region matching in hap.py is based on reference coordinates only. One case where this can lead to counterintuitive results is when considering hompolymer insertions:
Reference:
>chrQ
CAAAAA
VCF:
chrQ 1 C CA 0/1
BED for homopolymers:
1 6
In this example, the variant call given above would not be captured by the bed region for the homopolymers because it is associated with the reference base just before. To account for this, the bed intervals need to be expanded to include the padding base just before the regions.
Finally, we produce input data for ROC and precision/recall curves. An example is included.
Usage
The main two tools are hap.py (diploid precision/recall evaluation) and som.py (somatic precision/recall evaluation -- this ignores the GT and just checks for presence of alleles). Other tools are qfy.py (which just executes the quantification step of the analysis pipeline, this requires a GA4GH-intermediate VCF file), and pre.py, which is hap.py's input cleaning and variant normalisation step.
Here are some small example command lines. Advanced features like confident call / ambiguity / FP regions are also available, see the documentation for each tool for these.
Below, we assume that the code has been installed to the directory ${HAPPY}.
hap.py
See also doc/happy.md.
$ ${HAPPY}/bin/hap.py \
example/happy/PG_NA12878_chr21.vcf.gz \
example/happy/NA12878_chr21.vcf.gz \
-f example/happy/PG_Conf_chr21.bed.gz \
-o test
$ ls test.*
test.metrics.json test.summary.csv
This example compares an example run of GATK 1.6 on NA12878 agains the Platinum Genomes reference dataset (Note: this is a fairly old version of GATK, so don't rely on these particular numbers for competitive comparisons!).
The summary CSV file contains all high-level metrics:
| Type | TRUTH.TOTAL| QUERY.TOTAL | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio| |---------------|-------------|--------------|---------------|------------------|-----------------|-------------------------|-------------------------|-----------------------------|----------------------------| |INDEL | 9124| 9905 | 0.869406 | 0.978441 | 0.194548 | NaN | NaN | 1.463852 | 1.209105|
Related Skills
node-connect
341.6kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.6kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
341.6kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.6kCommit, push, and open a PR
