SkillAgentSearch skills...

Variantkey

Numerical Encoding for Human Genetic Variants

Install / Use

/learn @genomics-dev/Variantkey

README

VariantKey

Numerical Encoding for Human Genetic Variants

Master Build Status Master Coverage Status

  • category Libraries
  • author Nicola Asuni
  • copyright 2017-2018 GENOMICS plc
  • license MIT (see LICENSE)
  • link https://github.com/Genomicsplc/variantkey

How to cite

Nicola Asuni. VariantKey - A Reversible Numerical Representation of Human Genetic Variants., bioRxiv 473744; doi: https://doi.org/10.1101/473744


TOC


<a name="description"></a>

Description

Human genetic variants are usually represented by four values with variable length: chromosome, position, reference and alternate alleles. There is no guarantee that these components are represented in a consistent way across different data sources, and processing variant-based data can be inefficient because four different comparison operations are needed for each variant, three of which are string comparisons. Working with strings, in contrast to numbers, poses extra challenges on computer memory allocation and data-representation. Existing variant identifiers do not typically represent every possible variant we may be interested in, nor they are directly reversible.

VariantKey, a novel reversible numerical encoding schema for human genetic variants, overcomes these limitations by allowing to process variants as a single 64 bit numeric entities while preserving the ability to be searched and sorted per chromosome and position.

The individual components of short variants (up to 11 bases between REF and ALT alleles) can be directly read back from the VariantKey, while long variants requires a lookup table to retrieve the reference and alternate allele strings.

The VariantKey Format doesn't represent universal codes, it only encodes CHROM, POS, REF and ALT, so each code is unique for a given reference genome. The direct comparisons of two VariantKeys makes sense only if they both refer to the same genome reference.

This software library can be used to generate and reverse VariantKeys and RegionKeys.


<a name="quickstart"></a>

Quick Start

This project includes a Makefile that allows you to test and build the project in a Linux-compatible system with simple commands.

To see all available options, from the project root type:

make help

To build all the VariantKey versions inside a Docker container (requires Docker):

make dbuild

An arbitrary make target can be executed inside a Docker container by specifying the MAKETARGET parameter:

MAKETARGET='build' make dbuild

The list of make targets can be obtained by typing make

The base Docker building environment is defined in the following Dockerfile:

resources/Docker/Dockerfile.dev

To build and test only a specific language version, cd into the language directory and use the make command. For example:

cd c
make test

<a name="hgvdefinition"></a>

Human Genetic Variant Definition

In this context, the human genetic variant for a given genome assembly is defined as the set of four components compatible with the VCF format:

  • CHROM - chromosome: An identifier from the reference genome. It only has 26 valid values: autosomes from 1 to 22, the sex chromosomes X=23 and Y=24, mitochondria MT=25 and a symbol NA=0 to indicate missing data.
  • POS - position: The reference position in the chromosome, with the first nucleotide having position 0. The largest expected value is less than 250 million to represent the last base pair in the chromosome 1.
  • REF - reference allele: String containing a sequence of reference nucleotide letters. The value in the POS field refers to the position of the first nucleotide in the String.
  • ALT - alternate allele: Single alternate non-reference allele. String containing a sequence of nucleotide letters. Multiallelic variants must be decomposed in individual biallelic variants.

<a name="decompandnorm"></a>

Variant Decomposition and Normalization

The VariantKey model assumes that the variants have been decomposed and normalized.

<a name="decomposition"></a>

Decomposition

In the common Variant Call Format (VCF) the alternate field can contain comma-separated strings for multiallelic variants, while in this context we only consider biallelic variants to allow for allelic comparisons between different data sets.

For example, the multiallelic variant:

    {CHROM=1, POS=3759889, REF=TA, ALT=TAA,TAAA,T}

can be decomposed as three biallelic variants:

    {CHROM=1, POS=3759889, REF=TA, ALT=TAA}
    {CHROM=1, POS=3759889, REF=TA, ALT=TAAA}
    {CHROM=1, POS=3759889, REF=TA, ALT=T}

In VCF files the decomposition from multiallelic to biallelic variants can be performed using the 'vt' software tool with the command:

    vt decompose -s source.vcf -o decomposed.vcf

The -s option (smart decomposition) splits up INFO and GENOTYPE fields that have number counts of R and A appropriately.

Example:

  • input
  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                  FORMAT    S1                                      S2
  1       3759889 .    TA      TAA,TAAA,T  .      PASS    AF=0.342,0.173,0.037  GT:DP:PL  1/2:81:281,5,9,58,0,115,338,46,116,809  0/0:86:0,30,323,31,365,483,38,291,325,567
  • output
  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                                                 FORMAT   S1               S2
  1       3759889 .    TA      TAA         .      PASS    AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    1/.:281,5,9      0/0:0,30,323
  1       3759889 .    TA      TAAA        .      .       AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./1:281,58,115   0/0:0,31,483
  1       3759889 .    TA      T           .      .       AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./.:281,338,809  0/0:0,38,567

<a name="normalization"></a>

Normalization

A normalization step is required to ensure a consistent and unambiguous representation of variants. As shown in the following example, there are multiple ways to represent the same variant, but only one can be considered "normalized" as defined by Tan et al., 2015:

  • A variant representation is normalized if and only if it is left aligned and parsimonious.
  • A variant representation is left aligned if and only if its base position is smallest among all potential representations having the same allele length and representing the same variant.
  • A variant representation is parsimonious if and only if the entry has the shortest allele length among all VCF entries representing the same variant.

Example of entries representing the same variant:

                                                  DELETE
                                    POS: 0        ||
                         VARIANT    REF: GGGCACACACAGGG
                                    ALT: GGGCACACAGGG

                                    POS:      5
                  NOT-LEFT-ALIGNED  REF:      CAC
                                    ALT:      C

                                    POS:   2
NOT-LEFT-ALIGNED, NOT-PARSIMONIOUS  REF:   GCACA
                                    ALT:   GCA

                                    POS:  1
                  NOT-PARSIMONIOUS  REF:  GGCA
                                    ALT:  GG

                                    POS:   2
                      NORMALIZED    REF:   GCA
                                    ALT:   G

In VCF files the variant normalization can be performed using the vt software tool with the command:

    vt normalize decomposed.vcf -m -r genome.fa -o normalized.vcf

or the bcftools software with the command:

    bcftools norm -f genome.fa -o normalized.vcf decomposed.vcf

or decompose and normalize with a single command:

    bcftools norm --multiallelics -any -f genome.fa -o normalized.vcf source.vcf

<a name="normfunc"></a>

Normalization Function

Individual biallelic variants can be normalized using the normalize_variant function provided by this library.

The normalize_variant function first checks if the reference allele matches the genome reference. The match is considered valid and consistent if there is a perfect letter-by-letter match, and valid but not consistent if

View on GitHub
GitHub Stars42
CategoryDevelopment
Updated6mo ago
Forks9

Languages

C

Security Score

87/100

Audited on Sep 5, 2025

No findings