Variantkey
Numerical Encoding for Human Genetic Variants and Regions
VariantKey
This software library provides:
- VariantKey: a reversible numerical encoding schema for human genetic variants.
- RegionKey: a reversible numerical encoding schema for human genomic regions.
- ESID: a reversible numerical encoding schema for genetic string identifiers.
- normalize_variant: a function to normalize human genetic variants for a given genome reference.
Please consider supporting this project by making a donation via PayPal
- category Libraries
- author Nicola Asuni
- license MIT
- link https://github.com/tecnickcom/variantkey
How to cite
Nicola Asuni, Steven Wilder. VariantKey - A Reversible Numerical Representation of Human Genetic Variants. bioRxiv 473744; doi: https://doi.org/10.1101/473744
TOC
- Description
- Quick Start
- Human Genetic Variant Definition
- Variant Decomposition and Normalization
- VariantKey Format
- VariantKey Input values
- RegionKey
- Encoding String IDs
- Binary file formats for lookup tables
- C Library
- GO Library
- Python Module
- Python Class
- R Module
- Javascript library
<a name="description"></a>
Description
This software library provides:
- VariantKey: a reversible numerical encoding schema for human genetic variants.
- RegionKey: a reversible numerical encoding schema for human genomic regions.
- ESID: a reversible numerical encoding schema for genetic string identifiers.
Human genetic variants are usually represented by four variable-length values: chromosome, position, reference allele and alternate allele. There is no guarantee that these components are represented consistently across different data sources, and processing variant-based data can be inefficient because four different comparison operations are needed for each variant, three of which are string comparisons. Working with strings rather than numbers also complicates memory allocation and data representation. Existing variant identifiers typically do not represent every possible variant of interest, nor are they directly reversible.
VariantKey, a novel reversible numerical encoding schema for human genetic variants, overcomes these limitations by allowing each variant to be processed as a single 64-bit numeric entity while preserving the ability to search and sort by chromosome and position.
The individual components of short variants (up to 11 bases between REF and ALT alleles) can be read back directly from the VariantKey, while long variants require a lookup table to retrieve the reference and alternate allele strings.
The VariantKey format does not represent universal codes. It only encodes normalized CHROM, POS, REF and ALT, so each code is unique only for a given reference genome. Directly comparing two VariantKeys makes sense only if both refer to the same genome reference.
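As a rough illustration of why a single integer can still be searched and sorted by chromosome and position, the sketch below packs a numeric chromosome code and a position into the most significant bits of a 64-bit value. The bit widths used here (5 bits for CHROM, 28 bits for POS, with the 31 low bits left at zero in place of the alleles) are assumptions made for this example, and `pack_chrom_pos` is a hypothetical helper, not a function of this library.

```python
CHROM_BITS = 5   # assumption: enough for 26 chromosome codes (0..25)
POS_BITS = 28    # assumption: enough for positions below 250 million
ALLELE_BITS = 64 - CHROM_BITS - POS_BITS  # low bits, left at zero here


def pack_chrom_pos(chrom: int, pos: int) -> int:
    """Place CHROM in the highest bits and POS right below it."""
    if not 0 <= chrom < (1 << CHROM_BITS):
        raise ValueError("chromosome code out of range")
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("position out of range")
    return (chrom << (POS_BITS + ALLELE_BITS)) | (pos << ALLELE_BITS)


# Sorting the packed integers gives the same order as sorting (chrom, pos).
variants = [(2, 5), (1, 100), (23, 0), (1, 99)]
by_key = sorted(variants, key=lambda v: pack_chrom_pos(*v))
```

Because the chromosome occupies the highest bits, a plain integer comparison orders the keys by chromosome first and by position second, which is what makes range searches on sorted keys possible.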
This software library also provides other genetic variant-related tools.
<a name="quickstart"></a>
Quick Start
This project includes a Makefile that allows you to test and build the project on a Linux-compatible system with simple commands.
To see all available options, from the project root type:
make help
To build all the VariantKey versions inside a Docker container (requires Docker):
make dbuild
An arbitrary make target can be executed inside a Docker container by specifying the MAKETARGET parameter:
MAKETARGET='build' make dbuild
The list of make targets can be obtained by typing make
The base Docker building environment is defined in the following Dockerfile:
resources/Docker/Dockerfile.dev
To build and test only a specific language version, cd into the language directory and use the make command.
For example:
cd c
make test
<a name="hgvdefinition"></a>
Human Genetic Variant Definition
In this context, the human genetic variant for a given genome assembly is defined as the set of four components compatible with the VCF format:
- CHROM - chromosome: An identifier from the reference genome. It has only 26 valid values: the autosomes from 1 to 22, the sex chromosomes X=23 and Y=24, mitochondria MT=25 and the symbol NA=0 to indicate missing data.
- POS - position: The reference position in the chromosome, with the first nucleotide having position 0. The largest expected value is less than 250 million, enough to represent the last base pair of chromosome 1.
- REF - reference allele: String containing a sequence of reference nucleotide letters. The value in the POS field refers to the position of the first nucleotide in the string.
- ALT - alternate allele: Single alternate non-reference allele. String containing a sequence of nucleotide letters. Multiallelic variants must be decomposed into individual biallelic variants.
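The chromosome numbering above can be sketched as a small lookup. This is only an illustration of the mapping described in the list: `encode_chrom_sketch` is a hypothetical name, and accepting a leading "chr" prefix is an assumption of this example rather than documented behavior.

```python
# Codes taken from the definition above: 1..22, X=23, Y=24, MT=25, NA=0.
CHROM_CODES = {str(n): n for n in range(1, 23)}
CHROM_CODES.update({"X": 23, "Y": 24, "MT": 25})


def encode_chrom_sketch(chrom: str) -> int:
    """Map a chromosome label to its numeric code, 0 (NA) if unknown."""
    label = chrom.strip().upper()
    if label.startswith("CHR"):  # assumption: tolerate a "chr" prefix
        label = label[3:]
    return CHROM_CODES.get(label, 0)
```

Any label that is not one of the 25 known chromosomes falls back to NA=0, so the function always returns one of the 26 valid values.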
<a name="decompandnorm"></a>
Variant Decomposition and Normalization
The VariantKey model assumes that the variants have been decomposed and normalized.
<a name="decomposition"></a>
Decomposition
In the common Variant Call Format (VCF) the alternate field can contain comma-separated strings for multiallelic variants, while in this context we only consider biallelic variants to allow for allelic comparisons between different data sets.
For example, the multiallelic variant:
{CHROM=1, POS=3759889, REF=TA, ALT=TAA,TAAA,T}
can be decomposed as three biallelic variants:
{CHROM=1, POS=3759889, REF=TA, ALT=TAA}
{CHROM=1, POS=3759889, REF=TA, ALT=TAAA}
{CHROM=1, POS=3759889, REF=TA, ALT=T}
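The same decomposition can be sketched in a few lines of Python. This is a minimal example that only splits the ALT field; it does not redistribute INFO or genotype fields, and `decompose` is a hypothetical helper name used for illustration.

```python
from typing import List, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def decompose(chrom: str, pos: int, ref: str, alt: str) -> List[Variant]:
    """Split a comma-separated ALT field into one biallelic variant each."""
    return [(chrom, pos, ref, allele) for allele in alt.split(",")]


biallelic = decompose("1", 3759889, "TA", "TAA,TAAA,T")
```

Each resulting tuple keeps the original CHROM, POS and REF and carries exactly one alternate allele.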
In VCF files the decomposition from multiallelic to biallelic variants can be performed using the 'vt' software tool with the command:
vt decompose -s source.vcf -o decomposed.vcf
The -s option (smart decomposition) splits up INFO and GENOTYPE fields that have number counts of R and A appropriately.
Example:
- input
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2
1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567
- output
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2
1 3759889 . TA TAA . PASS AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323
1 3759889 . TA TAAA . . AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483
1 3759889 . TA T . . AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567
<a name="normalization"></a>
Normalization
A normalization step is required to ensure a consistent and unambiguous representation of variants. As shown in the following example, there are multiple ways to represent the same variant, but only one can be considered "normalized" as defined by Tan et al., 2015:
- A variant representation is normalized if and only if it is left aligned and parsimonious.
- A variant representation is left aligned if and only if its base position is smallest among all potential representations having the same allele length and representing the same variant.
- A variant representation is parsimonious if and only if the entry has the shortest allele length among all VCF entries representing the same variant.
Example of entries representing the same variant:
REPRESENTATION                        POS   REF              ALT
VARIANT (two-base deletion)           0     GGGCACACACAGGG   GGGCACACAGGG
NOT-LEFT-ALIGNED                      5     CAC              C
NOT-LEFT-ALIGNED, NOT-PARSIMONIOUS    2     GCACA            GCA
NOT-PARSIMONIOUS                      1     GGCA             GG
NORMALIZED                            2     GCA              G
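The entries above can be reduced to the normalized form with the trimming procedure described by Tan et al., 2015. The sketch below is a simplified illustration, not the `normalize_variant` function of this library: it assumes 0-based positions, takes the reference sequence as a plain string, and uses the hypothetical name `normalize_sketch`.

```python
from typing import Tuple


def normalize_sketch(ref_seq: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left-align and trim a variant against ref_seq (0-based pos)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both to the left by one base.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot extend before the sequence start")
            pos -= 1
            ref, alt = ref_seq[pos] + ref, ref_seq[pos] + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


GENOME = "GGGCACACACAGGG"
normalized = normalize_sketch(GENOME, 5, "CAC", "C")
```

Applied to any of the non-normalized entries in the example, the function returns position 2 with REF GCA and ALT G.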
In VCF files the variant normalization can be performed using the vt software tool with the command:
vt normalize decomposed.vcf -m -r genome.fa -o no
