VariantKey

This software library provides:

VariantKey: a reversible numerical encoding schema for human genetic variants.
RegionKey: a reversible numerical encoding schema for human genomic regions.
ESID: a reversible numerical encoding schema for genetic string identifiers.
normalize_variant: a function to normalize human genetic variants for a given genome reference.

Please consider supporting this project by making a donation via PayPal

category Libraries
author Nicola Asuni
license MIT
link https://github.com/tecnickcom/variantkey

How to cite

Nicola Asuni, Steven Wilder. VariantKey - A Reversible Numerical Representation of Human Genetic Variants. bioRxiv 473744; doi: https://doi.org/10.1101/473744

Description
Quick Start
Human Genetic Variant Definition
Variant Decomposition and Normalization
- Decomposition
- Normalization
  - Normalization Function
VariantKey Format
- VariantKey Properties
VariantKey Input values
RegionKey
- RegionKey Properties
Encoding String IDs
Binary file formats for lookup tables
C Library
GO Library
Python Module
Python Class
R Module
Javascript library

Description

Human genetic variants are usually represented by four values of variable length: chromosome, position, reference allele and alternate allele. There is no guarantee that these components are represented consistently across data sources, and processing variant-based data can be inefficient because four comparison operations are needed for each variant, three of which are string comparisons. Working with strings rather than numbers also complicates memory allocation and data representation. Existing variant identifiers typically do not cover every variant of interest, nor are they directly reversible.

VariantKey, a novel reversible numerical encoding schema for human genetic variants, overcomes these limitations by allowing variants to be processed as single 64-bit numeric entities while preserving the ability to search and sort them by chromosome and position.

The individual components of short variants (up to 11 bases between REF and ALT alleles) can be read back directly from the VariantKey, while long variants require a lookup table to retrieve the reference and alternate allele strings.

The VariantKey format does not represent universal codes: it only encodes the normalized CHROM, POS, REF and ALT, so each code is unique only for a given reference genome. Comparing two VariantKeys directly makes sense only if both refer to the same genome reference.
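
The sortable 64-bit layout described above can be illustrated with a minimal Python sketch. It assumes the split of 5 bits for CHROM, 28 bits for POS and 31 bits for REF+ALT covered in the VariantKey Format section. The names pack_key and unpack_key are hypothetical helpers, not the library's API, and the REF+ALT field is treated here as an opaque integer rather than being derived from allele strings.

```python
# Assumed bit widths (CHROM / POS / REF+ALT); they add up to 64 bits.
CHROM_BITS, POS_BITS, REFALT_BITS = 5, 28, 31


def pack_key(chrom: int, pos: int, refalt: int) -> int:
    """Pack three numeric fields into one unsigned 64-bit integer."""
    if not 0 <= chrom < (1 << CHROM_BITS):
        raise ValueError("chrom out of range")
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("pos out of range")
    if not 0 <= refalt < (1 << REFALT_BITS):
        raise ValueError("refalt out of range")
    # CHROM occupies the most significant bits, so integer order
    # matches (chrom, pos, refalt) order.
    return (chrom << (POS_BITS + REFALT_BITS)) | (pos << REFALT_BITS) | refalt


def unpack_key(key: int) -> tuple:
    """Recover (chrom, pos, refalt) from a packed key."""
    chrom = key >> (POS_BITS + REFALT_BITS)
    pos = (key >> REFALT_BITS) & ((1 << POS_BITS) - 1)
    refalt = key & ((1 << REFALT_BITS) - 1)
    return chrom, pos, refalt
```

Because the chromosome and position sit in the high-order bits, sorting the packed integers is equivalent to sorting by chromosome first and position second, which is what makes range searches on a sorted list of keys possible.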

This software library also provides other genetic variant-related tools.

Quick Start

This project includes a Makefile that allows you to test and build the project on a Linux-compatible system with simple commands.

To see all available options, from the project root type:

make help

To build all the VariantKey versions inside a Docker container (requires Docker):

make dbuild

An arbitrary make target can be executed inside a Docker container by specifying the MAKETARGET parameter:

MAKETARGET='build' make dbuild

The list of make targets can be obtained by typing make

The base Docker building environment is defined in the following Dockerfile:

resources/Docker/Dockerfile.dev

To build and test only a specific language version, cd into the language directory and use the make command. For example:

cd c
make test

Human Genetic Variant Definition

In this context, the human genetic variant for a given genome assembly is defined as the set of four components compatible with the VCF format:

CHROM - chromosome: An identifier from the reference genome. It only has 26 valid values: autosomes from 1 to 22, the sex chromosomes X=23 and Y=24, mitochondria MT=25 and a symbol NA=0 to indicate missing data.
POS - position: The reference position in the chromosome, with the first nucleotide having position 0. The largest expected value is less than 250 million, enough to represent the last base pair of chromosome 1.
REF - reference allele: String containing a sequence of reference nucleotide letters. The value in the POS field refers to the position of the first nucleotide in the string.
ALT - alternate allele: Single alternate non-reference allele. String containing a sequence of nucleotide letters. Multiallelic variants must be decomposed into individual biallelic variants.
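
The CHROM numbering listed above can be sketched in Python as follows. The function names encode_chrom and decode_chrom are hypothetical, and accepting a "chr" prefix or the alias "M" for the mitochondrial chromosome is an assumption made for illustration, not something stated in this document.

```python
def encode_chrom(chrom: str) -> int:
    """Map a chromosome label to its numeric code (0 means NA / missing)."""
    label = chrom.strip().upper()
    if label.startswith("CHR"):  # assumption: tolerate a "chr" prefix
        label = label[3:]
    special = {"X": 23, "Y": 24, "MT": 25, "M": 25}  # "M" alias is an assumption
    if label in special:
        return special[label]
    if label.isascii() and label.isdigit() and 1 <= int(label) <= 22:
        return int(label)  # autosomes 1..22
    return 0  # NA


def decode_chrom(code: int) -> str:
    """Map a numeric code back to a chromosome label."""
    if 1 <= code <= 22:
        return str(code)
    return {23: "X", 24: "Y", 25: "MT"}.get(code, "NA")
```

All 26 valid values fit in 5 bits, and any unrecognized label falls back to the NA code instead of raising an error.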

Variant Decomposition and Normalization

The VariantKey model assumes that the variants have been decomposed and normalized.

Decomposition

In the common Variant Call Format (VCF) the alternate field can contain comma-separated strings for multiallelic variants, while in this context we only consider biallelic variants to allow for allelic comparisons between different data sets.

For example, the multiallelic variant:

    {CHROM=1, POS=3759889, REF=TA, ALT=TAA,TAAA,T}

can be decomposed as three biallelic variants:

    {CHROM=1, POS=3759889, REF=TA, ALT=TAA}
    {CHROM=1, POS=3759889, REF=TA, ALT=TAAA}
    {CHROM=1, POS=3759889, REF=TA, ALT=T}
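
The same split can be sketched in a few lines of Python. The function name decompose is hypothetical, and the sketch only handles the four variant components; it does not redistribute INFO or genotype fields.

```python
def decompose(chrom, pos, ref, alt_field):
    """Split a comma-separated ALT field into one biallelic tuple per allele."""
    return [(chrom, pos, ref, alt) for alt in alt_field.split(",")]


# The multiallelic example above yields three biallelic variants.
for variant in decompose("1", 3759889, "TA", "TAA,TAAA,T"):
    print(variant)
```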

In VCF files the decomposition from multiallelic to biallelic variants can be performed using the 'vt' software tool with the command:

    vt decompose -s source.vcf -o decomposed.vcf

The -s option (smart decomposition) splits up INFO and GENOTYPE fields that have number counts of R and A appropriately.

Example:

input

  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                  FORMAT    S1                                      S2
  1       3759889 .    TA      TAA,TAAA,T  .      PASS    AF=0.342,0.173,0.037  GT:DP:PL  1/2:81:281,5,9,58,0,115,338,46,116,809  0/0:86:0,30,323,31,365,483,38,291,325,567

output

  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                                                 FORMAT   S1               S2
  1       3759889 .    TA      TAA         .      PASS    AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    1/.:281,5,9      0/0:0,30,323
  1       3759889 .    TA      TAAA        .      .       AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./1:281,58,115   0/0:0,31,483
  1       3759889 .    TA      T           .      .       AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./.:281,338,809  0/0:0,38,567

Normalization

A normalization step is required to ensure a consistent and unambiguous representation of variants. As shown in the following example, there are multiple ways to represent the same variant, but only one can be considered "normalized" as defined by Tan et al., 2015:

A variant representation is normalized if and only if it is left aligned and parsimonious.
A variant representation is left aligned if and only if its base position is smallest among all potential representations having the same allele length and representing the same variant.
A variant representation is parsimonious if and only if the entry has the shortest allele length among all VCF entries representing the same variant.

Example of entries representing the same variant:

                                                  DELETE
                                    POS: 0        ||
                         VARIANT    REF: GGGCACACACAGGG
                                    ALT: GGGCACACAGGG

                                    POS:      5
                  NOT-LEFT-ALIGNED  REF:      CAC
                                    ALT:      C

                                    POS:   2
NOT-LEFT-ALIGNED, NOT-PARSIMONIOUS  REF:   GCACA
                                    ALT:   GCA

                                    POS:  1
                  NOT-PARSIMONIOUS  REF:  GGCA
                                    ALT:  GG

                                    POS:   2
                      NORMALIZED    REF:   GCA
                                    ALT:   G
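
The equivalence of the entries above can be checked with a short Python sketch of the trim-and-extend procedure described by Tan et al. This is an illustration under stated assumptions, not the library's normalize_variant function: the name normalize is hypothetical, the reference is passed as a plain in-memory string, positions are 0-based, and cases such as allele flipping are not handled.

```python
def normalize(reference: str, pos: int, ref: str, alt: str):
    """Left-align and trim a biallelic variant; returns (pos, ref, alt)."""
    if reference[pos:pos + len(ref)] != ref:
        raise ValueError("REF does not match the reference sequence")
    # Step 1: trim identical trailing bases; when an allele becomes empty,
    # extend both alleles by one reference base to the left.
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if (not ref or not alt) and pos > 0:
            pos -= 1
            base = reference[pos]
            ref, alt = base + ref, base + alt
            changed = True
    if not ref or not alt:
        # Identical alleles, or an empty allele at position 0.
        raise ValueError("cannot normalize this variant")
    # Step 2: trim identical leading bases while both alleles keep a base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GGGCACACACAGGG"
entries = [
    (0, "GGGCACACACAGGG", "GGGCACACAGGG"),
    (5, "CAC", "C"),
    (2, "GCACA", "GCA"),
    (1, "GGCA", "GG"),
    (2, "GCA", "G"),
]
# Every entry collapses to the same normalized representation.
results = {normalize(reference, *entry) for entry in entries}
```

Trimming from the right first and extending to the left is what moves the variant to its smallest possible position; the final left trim then removes any redundant leading bases.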

In VCF files the variant normalization can be performed using the vt software tool with the command:

    vt normalize decomposed.vcf -m -r genome.fa -o no
