VAPr
VAPr: A Python package for NoSQL variant data storage, annotation and prioritization
VAPr
Variant Annotation and Prioritization package
This package retrieves variant information using ANNOVAR and MyVariant.info. It is suited for bioinformaticians who want to aggregate variant information into a single NoSQL database (currently MongoDB only).
Documentation now live at: http://vapr.readthedocs.io/en/latest/
DOI: Efficient population-scale variant analysis and prioritization with VAPr
Authors
- Amanda Birmingham (abirmingham@ucsd.edu)
- Adam Mark, M.S. (a1mark@ucsd.edu)
- Carlo Mazzaferro
- Guorong Xu, Ph.D.
- Kathleen Fisch, Ph.D. (kfisch@ucsd.edu)
License
This project is licensed under the MIT License; see the LICENSE file for details.
<a id='toc'></a>
Table of contents
- Background
1.1. Data Models
- Getting Started
- Tutorial
3.1. Workflow Overview
3.2. VaprAnnotator - Tips on usage
3.2.1 Arguments
3.3. Core Methods
3.3.1 Annovar
3.3.2 Annotation
3.3.3 Filtering
3.3.4 Output Files
<a id='background'></a>
Background
VAPr was developed to simplify the steps required to get mutation data from a VCF file to a downstream analysis process. Its query system lets users quickly slice genomic variant (GV) data and select variants by their characteristics, so that analysis can focus on the subset of data that carries meaningful information. The query system also lets the user choose the format in which data is retrieved: most notably, CSV or VCF files can be exported from the database, so filtered variants are available in commonly used formats. The package can also be installed and used without downloading ANNOVAR. In that case, variant data is retrieved solely from MyVariant.info and rapidly parsed into the MongoDB instance.
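As a rough illustration of the CSV export idea, the sketch below serializes a few variant records to CSV text with the Python 3 standard library. It is not VAPr's own export code: the function name `variants_to_csv`, the chosen columns, and the second record are assumptions made for the example.

```python
import csv
import io

# Hypothetical column selection; real exports may include many more fields.
COLUMNS = ["chr", "start", "end", "ref", "alt", "gene_knowngene"]


def variants_to_csv(documents, columns=COLUMNS):
    """Serialize variant dicts to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not exported
        restval="",             # blank cell when a variant lacks a field
        lineterminator="\n",
    )
    writer.writeheader()
    for document in documents:
        writer.writerow(document)
    return buffer.getvalue()


documents = [
    {"chr": "1", "start": 1959699, "end": 1959699, "ref": "G", "alt": "A",
     "gene_knowngene": "GABRD", "hgvs_id": "chr1:g.1959699G>A"},
    # Hypothetical record with no gene annotation.
    {"chr": "2", "start": 500, "end": 500, "ref": "A", "alt": "G"},
]
csv_text = variants_to_csv(documents)
print(csv_text)
```

Missing annotations become empty cells, which is one reason a flat table gets sparse quickly when variants carry very different sets of fields.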
<a id='datamodels'></a>
Data Models
The annotation process identifies every unique variant in the union of variants found for the input samples; it then submits batches (of a user-specifiable size) of variant IDs to MyVariant.info and stores the resulting annotation information in the local MongoDB. Subsequent filtering and output of the resulting annotations are done against the MongoDB rather than via additional calls to MyVariant.info, allowing the user to investigate multiple filtering strategies on a given annotation run without additional overhead. Note that, by design, each run of annotate() performs new annotation calls to MyVariant.info rather than attempting to find potentially relevant past annotations in the MongoDB. This is because MyVariant.info is continually updated, and we anticipate that users will want to receive the latest annotations each time they choose to annotate, rather than potentially “stale” annotations from past runs.
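The deduplicate-then-batch step described above can be sketched in plain Python. This is a minimal illustration, not VAPr's actual implementation: the function names `snv_hgvs_id` and `batch_unique_ids` and the input layout (a dict mapping sample IDs to lists of `(chrom, pos, ref, alt)` tuples) are assumptions. The ID format mirrors the `hgvs_id` value used for single-nucleotide variants elsewhere in this document (for example `chr1:g.1959699G>A`).

```python
from itertools import islice


def snv_hgvs_id(chrom, pos, ref, alt):
    """Build an HGVS-style genomic ID for a single-nucleotide variant."""
    return "chr{}:g.{}{}>{}".format(chrom, pos, ref, alt)


def batch_unique_ids(variants_by_sample, batch_size):
    """Yield lists of at most `batch_size` unique variant IDs.

    `variants_by_sample` is a hypothetical layout:
    {sample_id: [(chrom, pos, ref, alt), ...]}.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The set comprehension takes the union of variants across all samples.
    unique_ids = sorted({
        snv_hgvs_id(*variant)
        for variants in variants_by_sample.values()
        for variant in variants
    })
    iterator = iter(unique_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


variants_by_sample = {
    "S1": [("1", 1959699, "G", "A"), ("1", 2000000, "C", "T")],
    "S2": [("1", 1959699, "G", "A"), ("2", 500, "A", "G")],
}
batches = list(batch_unique_ids(variants_by_sample, batch_size=2))
print(batches)
```

The variant shared by S1 and S2 appears in only one batch, so it would be submitted for annotation once regardless of how many samples carry it.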
Intuitively, variant data could be stored in SQL-like databases, since annotation files are usually produced in VCF or CSV formats. However, a different approach may be more fruitful. As explained in our paper (currently under review), the abundance and diversity of genomic variant data cause SQL schemas to perform poorly for variant storage and querying. A single variant can carry more than 500 different fields and sub-fields, with even more diverse nested sub-fields. Creating a predefined schema (as required by SQL-like engines) becomes nearly impossible, and representing such a variant in table format would result in highly sparse and inefficient storage. Representing a variant atomically instead, that is, as a standalone JSON object with no predefined schema, makes it possible to compress the rich data into a more manageable format.
A sample entry in the Mongo Database will look like this. The variety of data that can be retrieved results from the richness of the databases that can be accessed through MyVariant.info. However, not every variant will have such data readily available. In some cases, the data will be restricted to what can be inferred from the VCF file and the annotation carried out with ANNOVAR. In that case, the entries found in the document will be the following:
    {'1000g2015aug_all': 0.00579073,
     '_id': ObjectId('5a0d4c5b59f987f13d76aa17'),
     'alt': 'A',
     'cadd': {'1000g': {'af': 0.01, 'afr': 0.002, 'amr': 0.01, 'eur': 0.02},
              '_license': 'http://goo.gl/bkpNhq',
              'esp': {'af': 0.017, 'afr': 0.005, 'eur': 0.023},
              'gerp': {'n': 3.47, 'rs': 350.8, 'rs_pval': 8.50723e-58, 's': 1.47},
              'phred': 19.55,
              'polyphen': {'cat': 'benign', 'val': 0.017},
              'sift': {'cat': 'tolerated', 'val': 0.43}},
     'chr': '1',
     'clinvar': {'_license': 'https://goo.gl/OaHML9',
                 'rcv': [{'accession': 'RCV000017600',
                          'clinical_significance': 'risk factor',
                          'conditions': {'identifiers': {'medgen': 'C2751604'},
                                         'name': 'Epilepsy, juvenile myoclonic 7 '
                                                 '(EJM7)',
                                         'synonyms': ['EPILEPSY, JUVENILE '
                                                      'MYOCLONIC, SUSCEPTIBILITY '
                                                      'TO, 7',
                                                      'EPILEPSY, IDIOPATHIC '
                                                      'GENERALIZED, SUSCEPTIBILITY '
                                                      'TO, 10; EPILEPSY, JUVENILE '
                                                      'MYOCLONIC, SUSCEPTIBILITY '
                                                      'TO, 7']}},
                         {'accession': 'RCV000017599',
                          'clinical_significance': 'risk factor',
                          'conditions': {'identifiers': {'medgen': 'C3150401'},
                                         'name': 'Generalized epilepsy with '
                                                 'febrile seizures plus type 5 '
                                                 '(GEFSP5)'}},
                         {'accession': 'RCV000022558',
                          'clinical_significance': 'risk factor',
                          'conditions': {'identifiers': {'medgen': 'C2751603',
                                                         'omim': '613060'},
                                         'name': 'Epilepsy, idiopathic generalized '
                                                 '10 (EIG10)',
                                         'synonyms': ['EPILEPSY, IDIOPATHIC '
                                                      'GENERALIZED, SUSCEPTIBILITY '
                                                      'TO, 10']}}]},
     'dbsnp': {'_license': 'https://goo.gl/Ztr5rl', 'rsid': 'rs41307846'},
     'end': 1959699,
     'exonicfunc_knowngene': 'nonsynonymous SNV',
     'func_knowngene': 'exonic',
     'gene_knowngene': 'GABRD',
     'hgvs_id': 'chr1:g.1959699G>A',
     'ref': 'G',
     'samples': [{'AD': [17, 20],
                  'genotype': '0/1',
                  'genotype_likelihoods': [400.0, 0.0, 314.0],
                  'genotype_subclass_by_class': {'heterozygous': 'reference'},
                  'sample_id': 'S1'}],
     'start': 1959699,
     'wellderly': {'_license': 'https://goo.gl/e8OO17',
                   'alleles': [{'allele': 'A', 'freq': 0.015},
                               {'allele': 'G', 'freq': 0.985}]}}
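Because each variant is a self-contained nested document, any filter has to tolerate missing fields. The sketch below runs over plain Python dicts shaped like the entry above; it is not VAPr's own filtering code. The helper names (`get_nested`, `is_rare_and_damaging`), the thresholds (allele frequency below 0.05, CADD phred of at least 10), and the second record are assumptions chosen for illustration.

```python
def get_nested(document, dotted_path, default=None):
    """Follow a dotted path such as 'cadd.phred' through nested dicts."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def is_rare_and_damaging(document, max_freq=0.05, min_phred=10.0):
    """Illustrative filter; both thresholds are assumptions, not VAPr defaults."""
    freq = get_nested(document, "cadd.1000g.af")
    phred = get_nested(document, "cadd.phred")
    if freq is None or phred is None:
        return False  # variants lacking either annotation are skipped
    return freq < max_freq and phred >= min_phred


documents = [
    # Trimmed copy of the sample entry shown above.
    {"hgvs_id": "chr1:g.1959699G>A", "gene_knowngene": "GABRD",
     "cadd": {"1000g": {"af": 0.01}, "phred": 19.55}},
    # Hypothetical variant with no MyVariant.info annotations.
    {"hgvs_id": "chr1:g.2000000C>T", "gene_knowngene": "GENE_X"},
]

selected = [doc["hgvs_id"] for doc in documents if is_rare_and_damaging(doc)]
print(selected)
```

Against a MongoDB collection, a comparable condition could be written with dot notation as a query document along the lines of `{"cadd.1000g.af": {"$lt": 0.05}, "cadd.phred": {"$gte": 10}}`; documents that lack those fields would not match the comparison operators.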
<a id='getstarted'></a>
Getting started
These instructions will get you a copy of the package up and running on your local machine, and will enable you to run annotation jobs on any number of VCF files while storing the data in MongoDB. See the Workflow Overview in the Tutorial.
<a id='setup'></a>
Prerequisites
- MongoDB Community Edition. Installation instructions
- Python (2.7 and 3.5 currently supported and tested)
- BCFtools
- Tabix
- Annovar scripts (optional)
Python 3 and MongoDB
VAPr is written in Python and stores variant annotations in a NoSQL database, using a locally installed instance of MongoDB. Installation instructions
BCFtools
BCFtools will be used for VCF file merging between samples. To download and install:
wget https://github.com/samtools/bcftools/releases/download/1.6/bcftools-1.6.tar.bz2
tar -vxjf bcftools-1.6.tar.bz2
cd bcftools-1.6
make
make install
export PATH=/where/to/install/bin:$PATH
Refer here for installation debugging.
Tabix
Tabix and bgzip binaries are available through the HTSlib project:
wget https://github.com/samtools/htslib/releases/download/1.6/htslib-1.6.tar.bz2
tar -vxjf htslib-1.6.tar.bz2
cd htslib-1.6
make
make install
export PATH=/where/to/install/bin:$PATH
Refer here for installation debugging.
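After installing BCFtools and HTSlib, it can be useful to confirm that the executables are visible on `PATH` before running an annotation job. The following is a small sketch using only the Python 3 standard library (`shutil.which` is not available in Python 2.7); the helper name `find_tools` is an assumption, not part of VAPr.

```python
import shutil


def find_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


status = find_tools(["bcftools", "tabix", "bgzip"])
for name, path in sorted(status.items()):
    print("{:<10} {}".format(name, path or "NOT FOUND on PATH"))
```

If a tool is reported as not found, re-check the `export PATH=...` line used after `make install`.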
ANNOVAR
(It is possible to proceed without installing ANNOVAR, in which case variants will be annotated with MyVariant.info only. Users who choose this route can skip the next steps and go straight to the section Known Variant Annotation and Storage.)
Users who wish to annotate novel variants will also need t
