Clinvcf
Generate enhanced VCF files from ClinVar XML Full releases
ClinVCF

ClinVCF generates a VCF file from a ClinVar Full Release (XML format). It was first developed because we observed missing variants in the VCF files provided by NCBI. We later extended its capabilities to provide enhanced ClinVar VCF files by:
- Improving ClinVar's classification and aggregation method by resolving "conflicting interpretation" records where almost all submissions point in the same direction.
- Implementing a more robust gene annotation module based on NCBI GFF files.
ClinVCF is developed in Nim, is highly efficient (about 5 minutes to generate the VCF from the XML), and supports the GRCh37 and GRCh38 genome builds.
clinVCF is part of the Genome Alert! framework (website: https://genomealert.univ-grenoble-alpes.fr/).
Quick start
You need nim and hts-nim installed to compile and install clinVCF. If you use a Mac with an M1/M2 processor, please read the M1 install section.
Brent Pedersen provides a clean install script for nim and hts-nim.
```sh
# Git clone and install
git clone https://github.com/SeqOne/clinvcf.git && cd clinvcf && nimble install

# Download (latest) ClinVar XML release
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/RCV_release/ClinVarRCVRelease_00-latest.xml.gz

# Download GFF for gene annotation (GRCh37 or 38)
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz

# Generate ClinVar VCF
## For GRCh37
clinvcf --coding-first --genome GRCh37 ClinVarRCVRelease_00-latest.xml.gz | bgzip -c > clinvar_GRCh37.vcf.gz
## For GRCh38
clinvcf --coding-first --genome GRCh38 ClinVarRCVRelease_00-latest.xml.gz | bgzip -c > clinvar_GRCh38.vcf.gz
```
Usage
```
Usage: clinvcf [options] --genome <version> <clinvar.xml.gz>

Arguments:
  --genome <version>    Genome assembly to use

Options:
  --filename-date       Use xml filename date instead of inner date which may differ
  --hgnc <table>        HGNC table used for gene name alias corrections

Gene annotation:
  --gff <file>          NCBI GFF to annotate variations with genes
  --coding-first        Give priority to coding gene in annotation (even if intronic and exonic for another gene)
  --gene-padding <int>  Padding to annotation upstream/downstream genes (not applied for MT) [default: 5000]
```
Output format
ClinVCF generates a VCF in almost the same format as the original NCBI VCF.
However, not all VCF fields are currently supported by ClinVCF (see the table below), and additional fields are provided.
| VCF Info field | Status* | Format | Description | Example |
| -------------- | ------- | ------- | ----------- | ------- |
| ALLELEID | Same | Integer | The ClinVar Allele ID | 1234 |
| CLNREVSTAT | Same | String | ClinVar review status for the Variation ID | no_assertion_criteria_provided |
| CLNSIG | Same | String | Clinical significance for this single variant | Pathogenic/Likely_Pathogenic |
| SUBDETAILS | New | String | Equivalent to ClinVar's CLNSIGCONF but for all variants (not just conflicting classifications) | SUBDETAILS=Uncertain_significance(5)\|Likely_benign(2) |
| CLNDISEASE | New | String | Disease(s) referenced for the variant, ranked. Same as ClinVar's CLNDN but with all listed diseases; the first one is ClinVar's "preferred" one. | CLNDISEASE=breast_ovarian_cancer_familial_2\|hereditary_breast_and_ovarian_cancer_syndrome\|hereditary_cancer_predisposing_syndrome |
| OLD_CLNSIG | New | String | Original clinical significance if the variant was reclassified by the clinVCF correction module | Conflicting_interpretations_of_pathogenicity |
| CLNRECSTAT | New | Integer | [3-level star confidence](#clinicalsignificance-correction-module) of the Variant Alert! automatic reclassification | 3 |
| GENEINFO | Same | String | Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (\|) | FTCD:10841\|FTCD-AS1:100861507 |
| MC | Same | String | Comma-separated list of molecular consequences in the form Sequence Ontology ID\|molecular_consequence | SO:0001583\|missense_variant |
| RS | Same | String | dbSNP ID (i.e. rs number) | 80358507 |
| PUBMED | Same | String | PubMed IDs associated with the variant | 1612597\|2565038 |
Status: Same (identical to the original ClinVar VCF), New (new field added by clinVCF)
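To make the field formats above concrete, here is a minimal Python sketch that splits an INFO column and decodes the `SUBDETAILS` and `GENEINFO` values. It is not part of clinVCF: the helper names (`parse_info`, `parse_subdetails`, `parse_geneinfo`) are hypothetical, and the INFO string is assembled from the example values in the table rather than taken from a real output file.

```python
import re


def parse_info(info):
    """Split a VCF INFO column into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_subdetails(value):
    """Turn 'Uncertain_significance(5)|Likely_benign(2)' into {label: count}."""
    counts = {}
    for part in value.split("|"):
        match = re.fullmatch(r"(.+)\((\d+)\)", part.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def parse_geneinfo(value):
    """Turn 'FTCD:10841|FTCD-AS1:100861507' into [(symbol, gene_id), ...]."""
    pairs = []
    for part in value.split("|"):
        symbol, _, gene_id = part.partition(":")
        pairs.append((symbol, int(gene_id)))
    return pairs


# Hypothetical INFO column built from the example values in the table.
info = (
    "ALLELEID=1234;CLNSIG=Conflicting_interpretations_of_pathogenicity;"
    "SUBDETAILS=Uncertain_significance(5)|Likely_benign(2);"
    "GENEINFO=FTCD:10841|FTCD-AS1:100861507"
)
fields = parse_info(info)
print(parse_subdetails(fields["SUBDETAILS"]))
print(parse_geneinfo(fields["GENEINFO"]))
```

For real files, a dedicated VCF parser is likely a better choice than string splitting; the sketch only illustrates how the `|`-delimited values are laid out.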
Methodology
ClinicalSignificance correction module
Using the 1.5 * IQR rule, we remove outlier submissions and reclassify variants with a conflicting status according to ClinVar policies. We assign a 3-level star rating that reflects our confidence in the reclassification. At least 4 submissions are needed. We only reclassify variants from a conflicting status to benign, likely benign, likely pathogenic, or pathogenic.
- ⭐ (1 star) : default
- ⭐⭐ (2 stars) : reclassification remains even if we add a virtual VUS submission
- ⭐⭐⭐ (3 stars) : 2 stars requirements and at least 1 pathogenic (or benign) classification
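The procedure above can be sketched in Python as follows. This is an illustration under several assumptions, not clinVCF's actual implementation: the ordinal scores (Benign = 1 to Pathogenic = 5), the linear-interpolation quartiles, and the simplified `aggregate` rule (one-sided submissions only, otherwise conflicting) are all choices made for this example, and every function name is hypothetical.

```python
# Assumed ordinal encoding of ClinVar classes (not taken from clinVCF).
SCORE = {
    "Benign": 1,
    "Likely_benign": 2,
    "Uncertain_significance": 3,
    "Likely_pathogenic": 4,
    "Pathogenic": 5,
}
CONFLICT = "Conflicting_interpretations_of_pathogenicity"


def quantile(sorted_vals, q):
    """Linear-interpolation quantile on an already sorted list."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def drop_outliers(labels):
    """Keep submissions whose score lies in [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    vals = sorted(SCORE[label] for label in labels)
    q1, q3 = quantile(vals, 0.25), quantile(vals, 0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [label for label in labels if low <= SCORE[label] <= high]


def aggregate(labels):
    """Simplified aggregation: a single side wins, anything else conflicts."""
    kinds = set(labels)
    for side in (("Pathogenic", "Likely_pathogenic"), ("Benign", "Likely_benign")):
        if kinds and kinds <= set(side):
            return "/".join(k for k in side if k in kinds)
    return CONFLICT


def reclassify(labels):
    """Return (classification, stars); stars is 0 when nothing is reclassified."""
    if len(labels) < 4:  # at least 4 submissions are needed
        return CONFLICT, 0
    kept = drop_outliers(labels)
    result = aggregate(kept)
    if result == CONFLICT:
        return CONFLICT, 0
    stars = 1
    # 2 stars: the result survives one extra, virtual VUS submission.
    if aggregate(drop_outliers(labels + ["Uncertain_significance"])) == result:
        stars = 2
        # 3 stars: additionally, at least one Pathogenic (or Benign) call.
        if "Pathogenic" in kept or "Benign" in kept:
            stars = 3
    return result, stars
```

With these assumptions, three Pathogenic submissions plus one Uncertain_significance are reclassified as Pathogenic with 1 star, because adding a virtual VUS widens the IQR enough to keep both VUS submissions. Nine Pathogenic submissions plus one Uncertain_significance reach 3 stars.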
Gene annotation
- We load all genes from the input GFF and add them to the index with a padding (5000bp by default and 2bp for MT genes), to annotate upstream / downstream variants.
- For each variant we query the gene index and retrieve all overlapping genes.
- Overlapping genes are then prioritized in the `GENEINFO` field using one of two procedures (depending on the clinVCF parameters):
  - If the `--coding-first` option is activated:
    - We take coding genes over all other genes (except for the MT genome)
    - If there is a tie, we take exonic (+/- 20bp padding) over intronic/intergenic candidates
    - If none are exonic, we take the gene with the closest exon
    - If both are exonic, we take the oldest gene ID in the NCBI Entrez database
  - Default procedure:
    - We take coding genes over all other genes (except for the MT genome) if the variant is exonic (+/- 20bp)
    - If there is a tie, we take exonic (+/- 20bp padding) over intronic/intergenic candidates
    - If none are exonic, we take the gene with the closest exon
    - If both are exonic, we take the oldest gene ID in the NCBI Entrez database
    - If
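As a rough illustration of the `--coding-first` ordering described above, the rules can be expressed as a sort key. This is a sketch under assumptions, not clinVCF's code: the `Gene` record and its fields are hypothetical, "oldest gene ID" is assumed to mean the smallest numeric Entrez ID, and the exon distances in the usage example are invented for the demonstration.

```python
from dataclasses import dataclass

EXON_PADDING = 20  # +/- 20 bp around exons, as stated in the list above


@dataclass
class Gene:
    symbol: str
    gene_id: int        # NCBI Entrez gene ID
    coding: bool        # True for protein-coding genes
    exon_distance: int  # bp from the variant to the nearest exon (0 if inside)


def coding_first_key(gene, mitochondrial=False):
    """Sort key: coding genes, then exonic, then closest exon, then lowest ID."""
    exonic = gene.exon_distance <= EXON_PADDING
    # Coding priority is skipped for the MT genome.
    coding_rank = 0 if (gene.coding and not mitochondrial) else 1
    return (
        coding_rank,
        0 if exonic else 1,
        0 if exonic else gene.exon_distance,
        gene.gene_id,  # assumption: smaller ID means older Entrez record
    )


def prioritize(genes, mitochondrial=False):
    """Order overlapping genes from highest to lowest priority."""
    return sorted(genes, key=lambda g: coding_first_key(g, mitochondrial))


# Invented example: the variant is intronic in FTCD and exonic in FTCD-AS1.
candidates = [
    Gene("FTCD-AS1", 100861507, coding=False, exon_distance=0),
    Gene("FTCD", 10841, coding=True, exon_distance=300),
]
print([g.symbol for g in prioritize(candidates)])
```

Here the coding gene FTCD comes first even though the variant is exonic only in FTCD-AS1, which matches the `--coding-first` help text ("even if intronic and exonic for another gene"). The default procedure differs in its first rule and is not covered by this sketch.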
How to cite
If you use a tool of the Genome Alert! framework, please cite:
Yauy et al., Genome Alert!: a standardized procedure for genomic variant reinterpretation and automated genotype-phenotype reassessment in clinical routine. medRxiv (2021). https://doi.org/10.1101/2021.07.13.21260422
License
clinVCF is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.
Misc
clinVCF is a part of the Genome Alert! framework, a collaboration of :
Install ClinVCF on MacOS M1
First, install the correct version of nim with choosenim:
```sh
curl https://nim-lang.org/choosenim/init.sh -sSf | sh
choosenim 1.6.14
```
This version is x86 only, so we need the matching (x86_64) HTSlib dynamic library:
```sh
git clone https://github.com/samtools/htslib.git && cd htslib
git submodule update --init --recursive
brew install automake # if not already done
arch -x86_64 autoreconf -i # Build the configure script and install files it uses
arch -x86_64 ./configure # Optional but recommended, for choosing extra functionality
arch -x86_64 make
sudo make install
cd ..
```
Then everything