VAPr

VAPr: A Python package for NoSQL variant data storage, annotation and prioritization


Variant Annotation and Prioritization package

This package provides a way of retrieving variant information using ANNOVAR and MyVariant.info. In particular, it is suited for bioinformaticians interested in aggregating variant information into a single NoSQL database (currently MongoDB only).

Documentation now live at: http://vapr.readthedocs.io/en/latest/

DOI: Efficient population-scale variant analysis and prioritization with VAPr

Authors

  • Amanda Birmingham (abirmingham@ucsd.edu)
  • Adam Mark, M.S. (a1mark@ucsd.edu)
  • Carlo Mazzaferro
  • Guorong Xu, Ph.D.
  • Kathleen Fisch, Ph.D. (kfisch@ucsd.edu)

License

This project is licensed under the MIT License - see the LICENSE file for details

<a id='toc'></a>

Table of contents

  1. Background
    1.1. Data Models
  2. Getting Started
  3. Tutorial
    3.1. Workflow Overview
    3.2. VaprAnnotator - Tips on usage
        3.2.1 Arguments
    3.3. Core Methods
        3.3.1 Annovar
        3.3.2 Annotation
        3.3.3 Filtering
        3.3.4 Output Files

<a id='background'></a>

Background

VAPr was developed to simplify the steps required to get mutation data from a VCF file to a downstream analysis process. Its query system lets users quickly slice the genomic variant (GV) data and select variants according to their characteristics, so that researchers can focus their analysis on the subset of data that contains meaningful information. The query system also lets the user select the format in which the data is retrieved: most notably, CSV or VCF files can be exported from the database, allowing any researcher to quickly filter variants and retrieve them in commonly used formats. The package can also be installed and used without downloading ANNOVAR. In that case, variant data is retrieved solely from MyVariant.info and rapidly parsed into the MongoDB instance.
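As a rough illustration of the filter-then-export idea, the sketch below filters a list of variant documents in plain Python and writes the survivors to CSV text. It does not use VAPr's own API: the function names `filter_variants` and `variants_to_csv`, the column choice, and the second sample record are hypothetical, and the field names are borrowed from the sample entry in the Data Models section.

```python
import csv
import io

# Hypothetical, heavily abbreviated variant documents (illustration only).
variants = [
    {"hgvs_id": "chr1:g.1959699G>A", "chr": "1", "start": 1959699,
     "ref": "G", "alt": "A", "func_knowngene": "exonic"},
    {"hgvs_id": "chr2:g.100T>C", "chr": "2", "start": 100,
     "ref": "T", "alt": "C", "func_knowngene": "intronic"},
]


def filter_variants(docs, field, value):
    """Keep documents whose top-level `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


def variants_to_csv(docs, columns):
    """Serialize the selected columns of each document to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not requested columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(docs)
    return buffer.getvalue()


exonic = filter_variants(variants, "func_knowngene", "exonic")
csv_text = variants_to_csv(exonic, ["chr", "start", "ref", "alt"])
print(csv_text)
```

In VAPr itself the filtering runs against MongoDB rather than an in-memory list; the sketch only shows the shape of the operation.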

<a id='datamodels'></a>

Data Models

The annotation process identifies every unique variant in the union of variants found for the input samples; it then submits batches (of a user-specifiable size) of variant ids to MyVariant.info and stores the resulting annotation information to the local MongoDB. Subsequent filtering and output of the resulting annotations is done against the MongoDB rather than via additional calls to MyVariant.info, allowing the user to investigate multiple different filtering strategies on a given annotation run without additional overhead. Note that, by design, each run of annotate() performs new annotation calls to MyVariant.info rather than attempting to find potentially relevant past annotations in the MongoDB; this is because MyVariant.info is continually updated live, and we anticipate that users will want to receive the latest annotations each time they choose to annotate, rather than potentially “stale” annotations from past runs.
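The first two steps described above (taking the union of variants across samples, then splitting the ids into batches) can be sketched as follows. This is a minimal illustration, not VAPr's implementation: the helper names `unique_variant_ids` and `batches` and the per-sample ids are hypothetical, and the call that submits each batch to MyVariant.info is deliberately left out.

```python
from itertools import islice


def unique_variant_ids(samples):
    """Return the sorted union of variant ids across all samples."""
    ids = set()
    for variant_ids in samples.values():
        ids.update(variant_ids)
    return sorted(ids)


def batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical per-sample variant ids in HGVS notation.
samples = {
    "S1": ["chr1:g.1959699G>A", "chr1:g.200A>T"],
    "S2": ["chr1:g.1959699G>A", "chr2:g.300C>G"],
}

ids = unique_variant_ids(samples)
id_batches = list(batches(ids, batch_size=2))

for number, batch in enumerate(id_batches, start=1):
    # Each batch would be sent to MyVariant.info at this point.
    print(f"batch {number}: {batch}")
```

Because the variant shared by S1 and S2 is deduplicated before batching, it would be annotated once rather than once per sample.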

Intuitively, variant data could be stored in SQL-like databases, since annotation files are usually produced in VCF or CSV formats. However, a different approach may be more fruitful. As explained in our paper (currently under review), the abundance and diversity of genomic variant data cause SQL schemas to perform poorly for variant storage and querying. A single variant can have over 500 different fields and sub-fields, with even more diverse nested sub-fields. Creating a predefined schema (as required by SQL-like engines) becomes practically impossible: representing such a variant in a table format would result in highly sparse and inefficient storage. By instead representing a variant atomically, that is, as a standalone JSON object with no predefined schema, it is possible to compress the rich data into a more manageable format.

A sample entry in the Mongo Database will look like this. The variety of data that can be retrieved results from the richness of the databases that can be accessed through MyVariant.info. However, not every variant will have such data readily available. In some cases, the data will be restricted to what can be inferred from the VCF file and the annotation carried out with ANNOVAR. In that case, the entries found in the document will be the following:

{'1000g2015aug_all': 0.00579073,
 '_id': ObjectId('5a0d4c5b59f987f13d76aa17'),
 'alt': 'A',
 'cadd': {'1000g': {'af': 0.01, 'afr': 0.002, 'amr': 0.01, 'eur': 0.02},
          '_license': 'http://goo.gl/bkpNhq',
          'esp': {'af': 0.017, 'afr': 0.005, 'eur': 0.023},
          'gerp': {'n': 3.47, 'rs': 350.8, 'rs_pval': 8.50723e-58, 's': 1.47},
          'phred': 19.55,
          'polyphen': {'cat': 'benign', 'val': 0.017},
          'sift': {'cat': 'tolerated', 'val': 0.43}},
 'chr': '1',
 'clinvar': {'_license': 'https://goo.gl/OaHML9',
             'rcv': [{'accession': 'RCV000017600',
                      'clinical_significance': 'risk factor',
                      'conditions': {'identifiers': {'medgen': 'C2751604'},
                                     'name': 'Epilepsy, juvenile myoclonic 7 '
                                             '(EJM7)',
                                     'synonyms': ['EPILEPSY, JUVENILE '
                                                  'MYOCLONIC, SUSCEPTIBILITY '
                                                  'TO, 7',
                                                  'EPILEPSY, IDIOPATHIC '
                                                  'GENERALIZED, SUSCEPTIBILITY '
                                                  'TO, 10; EPILEPSY, JUVENILE '
                                                  'MYOCLONIC, SUSCEPTIBILITY '
                                                  'TO, 7']}},
                     {'accession': 'RCV000017599',
                      'clinical_significance': 'risk factor',
                      'conditions': {'identifiers': {'medgen': 'C3150401'},
                                     'name': 'Generalized epilepsy with '
                                             'febrile seizures plus type 5 '
                                             '(GEFSP5)'}},
                     {'accession': 'RCV000022558',
                      'clinical_significance': 'risk factor',
                      'conditions': {'identifiers': {'medgen': 'C2751603',
                                                     'omim': '613060'},
                                     'name': 'Epilepsy, idiopathic generalized '
                                             '10 (EIG10)',
                                     'synonyms': ['EPILEPSY, IDIOPATHIC '
                                                  'GENERALIZED, SUSCEPTIBILITY '
                                                  'TO, 10']}}]},
 'dbsnp': {'_license': 'https://goo.gl/Ztr5rl', 'rsid': 'rs41307846'},
 'end': 1959699,
 'exonicfunc_knowngene': 'nonsynonymous SNV',
 'func_knowngene': 'exonic',
 'gene_knowngene': 'GABRD',
 'hgvs_id': 'chr1:g.1959699G>A',
 'ref': 'G',
 'samples': [{'AD': [17, 20],
              'genotype': '0/1',
              'genotype_likelihoods': [400.0, 0.0, 314.0],
              'genotype_subclass_by_class': {'heterozygous': 'reference'},
              'sample_id': 'S1'}],
 'start': 1959699,
 'wellderly': {'_license': 'https://goo.gl/e8OO17',
               'alleles': [{'allele': 'A', 'freq': 0.015},
                           {'allele': 'G', 'freq': 0.985}]}}
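Since documents like this one have no fixed schema, code that reads them needs to tolerate missing fields. The sketch below shows one way to do that in plain Python with a dotted-path lookup, alongside the equivalent MongoDB dot-notation filter. The helper `get_path` and the threshold value are illustrative assumptions, not part of VAPr, and the `variant` dict is an abbreviated copy of the entry above.

```python
# Abbreviated copy of the sample entry shown above.
variant = {
    "hgvs_id": "chr1:g.1959699G>A",
    "gene_knowngene": "GABRD",
    "cadd": {"phred": 19.55, "sift": {"cat": "tolerated", "val": 0.43}},
    "samples": [{"sample_id": "S1", "genotype": "0/1", "AD": [17, 20]}],
}


def get_path(doc, dotted_path, default=None):
    """Follow a dotted path through nested dicts; return `default` if absent."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


print(get_path(variant, "cadd.sift.cat"))   # field present in this document
print(get_path(variant, "clinvar.rcv"))     # field absent: falls back to None

# MongoDB expresses the same nested access with dot notation in a filter.
# With pymongo and a running MongoDB, a filter like this could be passed to
# `collection.find(...)`; the threshold of 15 is an arbitrary example value.
mongo_filter = {"cadd.phred": {"$gte": 15}, "gene_knowngene": "GABRD"}
```

Variants that lack a `cadd` sub-document simply do not match such a filter, which is the behaviour a sparse SQL table would need NULL handling to reproduce.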

<a id='getstarted'></a>

Getting started

These instructions will get you a copy of the package up and running on your local machine and enable you to run annotation jobs on any number of VCF files while storing the data in MongoDB. See the workflow

<a id='setup'></a>

Prerequisites

Python 3 and MongoDB

VAPr is written in Python and stores variant annotations in a NoSQL database, using a locally installed instance of MongoDB. Installation instructions

BCFtools

BCFtools will be used for VCF file merging between samples. To download and install it (the configure step sets the install prefix that the final PATH export relies on):

wget https://github.com/samtools/bcftools/releases/download/1.6/bcftools-1.6.tar.bz2
tar -vxjf bcftools-1.6.tar.bz2
cd bcftools-1.6
./configure --prefix=/where/to/install
make
make install
export PATH=/where/to/install/bin:$PATH

Refer here for installation debugging.

Tabix

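Tabix and bgzip binaries are available through the HTSlib project. As with BCFtools, the configure step sets the install prefix that the final PATH export relies on:

wget https://github.com/samtools/htslib/releases/download/1.6/htslib-1.6.tar.bz2
tar -vxjf htslib-1.6.tar.bz2
cd htslib-1.6
./configure --prefix=/where/to/install
make
make install
export PATH=/where/to/install/bin:$PATH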

Refer here for installation debugging.
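Once the tools are installed, it can be useful to confirm that they are actually visible on the PATH before starting an annotation run. The following is a small optional check, not part of VAPr; the helper names `find_tools` and `missing_tools` are hypothetical, and it relies only on `shutil.which` from the Python standard library.

```python
import shutil

# Command-line tools named in the prerequisites above.
REQUIRED_TOOLS = ("bcftools", "tabix", "bgzip")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If a tool is reported missing right after installation, the usual cause is that the `export PATH=...` line was run in a different shell session.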

ANNOVAR

(It is possible to proceed without installing ANNOVAR, in which case variants will only be annotated with MyVariant.info. Users taking that route can skip the next steps and go straight to the section Known Variant Annotation and Storage.)

Users who wish to annotate novel variants will also need t
