Vcfcache
Accelerate variant annotation by caching common variants. Build a normalized blueprint, annotate once, and reuse across samples. Genome-agnostic and tool-agnostic. Achieve 2.5-5× speed-ups for typical workflows.
Install / Use
/learn @julius-muller/VcfcacheREADME
VCFcache – cache once, annotate fast
VCFcache accelerates variant annotation by caching annotations for a user-defined variant set and reusing them across runs, so only novel variants are processed at runtime.
When VCFcache helps
VCFcache is useful when you either (a) repeatedly annotate many samples with a stable pipeline, or (b) want to quickly apply common annotations (e.g., VEP --everything) to a large VCF/BCF. Speedup depends on the cache hit rate of the input sample and the per-variant cost of the original annotation pipeline.
Key properties
- Drop-in integration: keep your existing annotation command; place it into a simple
annotation.yamland runvcfcache annotate. - Cache reuse with automatic fallback: cache hits are reused; cache misses are annotated with your configured command and merged into one output.
- Genome- and tool-agnostic: works with arbitrary reference builds and organisms, and with any annotator or pipeline that can be expressed as a command (stdin/stdout or file-based).
- Pre-built caches available: published caches built from public aggregation resources and annotation tools can be downloaded and used immediately (currently hg19/hg38 annotated with VEP --everything; additional configurations can be generated on request); the same tooling can generate highly efficient custom caches for your specific options and datasets.
- BCF-native I/O: VCFcache reads and writes BCF for performance and indexing; use
bcftools viewto convert VCF/VCF.gz at the boundaries.
Important: to use a prebuilt cache, you must have the same annotation tool (and compatible version) installed locally.
See WIKI.md for full documentation, performance notes, and cache distribution via Zenodo.
Quick Start
Installation
# Via pip (requires Python 3.11+ and bcftools >= 1.20)
uv pip install vcfcache
# Via Docker (includes bcftools)
docker pull ghcr.io/julius-muller/vcfcache:latest
# Via Apptainer
apptainer exec docker://ghcr.io/julius-muller/vcfcache:latest vcfcache --help
See WIKI.md - Section 2 (Quick Start) for development installation and troubleshooting.
Minimal example
This minimal example shows the complete workflow using a public cache.
1. List available caches
vcfcache list caches
2. Pull a specific cache
# Cache auto-downloads on first use, or download explicitly:
vcfcache cache-build --doi 10.5281/zenodo.18189447
3. Check cache requirements
vcfcache annotate --requirements -a cache-hg38-gnomad-4.1joint-AF0100-vep-115.2-basic
This shows:
- Required annotation tool version (e.g., VEP 115.2)
- Required params (reference cache paths, etc.)
- The exact annotation command that will run
Critical: Install the exact tool version shown here to ensure compatibility with the cache annotations.
4. Annotate your sample
vcfcache annotate \
-a cache-hg38-gnomad-4.1joint-AF0100-vep-115.2-basic \
--vcf sample.bcf \
--output sample_annotated.bcf \
--stats-dir ./results
Input format: The --vcf flag accepts VCF, VCF.gz, BCF, or stdin. VCFcache handles conversion automatically:
vcfcache annotate -a <cache-alias> --vcf sample.vcf.gz --output out.bcf # VCF.gz input
vcfcache annotate -a <cache-alias> --vcf sample.vcf --output out.bcf # VCF input
vcfcache annotate -a <cache-alias> --vcf sample.bcf --output out.bcf # BCF input
bcftools view sample.vcf | vcfcache annotate -a <cache-alias> -i - -o - # stdin/stdout
See WIKI.md - Section 7 (Using a cache to annotate samples) for all annotation options.
Building Your Own Cache
If you need different annotation settings (plugins, flags, tool version):
From a public cache's blueprint
# Downloaded caches include the blueprint! Use it to build a cache variant:
# 1. Download cache (includes blueprint)
vcfcache cache-build --doi 10.5281/zenodo.18189447
# 2. Create annotation.yaml with your pipeline (see below)
# 3. Build cache from the downloaded blueprint
vcfcache cache-build \
--db ~/.cache/vcfcache/caches/<cache-alias> \
--name my_vep_cache \
-a annotation.yaml \
-y params.yaml
From a public blueprint only
# 1. Download blueprint (smaller, no annotations)
vcfcache blueprint-init --doi <blueprint_DOI> -o ./cache_root
# 2. Create annotation.yaml with your pipeline (see below)
# 3. Build cache
vcfcache cache-build \
--db ./cache_root \
--name my_vep_cache \
-a annotation.yaml \
-y params.yaml
From your own variants
# Use your cohort's common variants for maximum cache hit rate
vcfcache blueprint-init --vcf cohort_common.bcf --output ./cache_root -y params.yaml
vcfcache cache-build --db ./cache_root --name my_cache -a annotation.yaml -y params.yaml
See WIKI.md - Section 6 (Building your own cache) for the complete workflow including sharing via Zenodo.
Setting Up annotation.yaml
The annotation.yaml defines your annotation pipeline at cache-build time. It is immutable: once baked into a cache, it cannot be changed without rebuilding.
Start with your existing command
Begin with the exact command you already run:
vep --offline --cache --vcf \
--dir_cache /data/vep/cache \
--plugin ExACpLI,/data/plugins/ExACpLI.tsv.gz \
--assembly GRCh38 \
-i my.vcf.gz -o my_annotated.vcf.gz
Adapt for VCFcache
Inside annotation.yaml, ${INPUT_BCF} is always BCF (regardless of what format you provide at runtime). Since most annotation tools (like VEP) require VCF input, you need to convert it:
Step 1: Wrap for BCF↔VCF conversion:
# ▼ Convert BCF input to VCF for your tool
${params.bcftools_cmd} view -Ov ${INPUT_BCF} | \
# ▼ Your existing VEP command (change files to stdin/stdout)
${params.annotation_tool_cmd} --offline --cache --vcf \ # ◄ Tool command (vep, docker run, etc.)
--dir_cache /data/vep/cache \ # ◄ Environment-specific path
--plugin ExACpLI,/data/plugins/ExACpLI/ExACpLI.tsv.gz \ # ◄ Environment-specific path
--assembly GRCh38 \
-i stdin -o stdout | \ # ◄ Changed from -i/-o files
# ▼ Convert VCF output back to BCF
${params.bcftools_cmd} view -Ob -o ${OUTPUT_BCF}
What to change:
- Replace
bcftools→${params.bcftools_cmd} - Replace
vep→${params.annotation_tool_cmd}(allows docker, custom paths, etc.) - Add
${params.bcftools_cmd} view -Ov ${INPUT_BCF} |at the start (BCF → VCF) - Add
| ${params.bcftools_cmd} view -Ob -o ${OUTPUT_BCF}at the end (VCF → BCF) - Change your tool's I/O from files to stdin/stdout
- Note which paths are environment-specific (marked with ◄)
Step 2: Final annotation.yaml with all substitutions:
# annotation.yaml
annotation_cmd: |
${params.bcftools_cmd} view -Ov ${INPUT_BCF} | \
${params.annotation_tool_cmd} --offline --cache --vcf \
--dir_cache ${params.vep_cache} \
--plugin ExACpLI,${params.vep_plugins}/ExACpLI/ExACpLI.tsv.gz \
--assembly GRCh38 \
-i stdin -o stdout | \
${params.bcftools_cmd} view -Ob -o ${OUTPUT_BCF}
must_contain_info_tag: CSQ # Required: VCFcache validates this tag exists
required_tool_version: "115.2" # Required: VCFcache checks tool version match
genome_build: "GRCh38" # Required: must match params.yaml
Step 3: Define environment-specific settings in params.yaml:
# params.yaml
# Required fields:
genome_build: "GRCh38" # Must match annotation.yaml
annotation_tool_cmd: "vep" # Command to invoke annotation tool
bcftools_cmd: "bcftools" # Command to invoke bcftools
temp_dir: "/tmp" # Directory for temporary files
threads: 8 # Number of threads for bcftools
# Optional fields:
tool_version_command: "vep | awk '/ensembl-vep/ {print $NF}'" # Command to check tool version (optional but recommended)
# Custom fields (for use in annotation.yaml):
vep_cache: "/data/vep/cache" # Machine-specific path
vep_plugins: "/data/plugins" # Machine-specific path
About tool_version_command: If provided, VCFcache runs this command and compares the output against required_tool_version in annotation.yaml. This validates that you have the correct tool version installed before annotating.
Key rules
Required fields in annotation.yaml:
annotation_cmd: Shell command to annotate variantsmust_contain_info_tag: INFO tag to validate (e.g.,CSQfor VEP)required_tool_version: Tool version string (e.g.,"115.2")genome_build: Reference build (e.g.,GRCh38) — must match params.yaml
Required fields in params.yaml:
annotation_tool_cmd: Command to invoke annotation toolbcftools_cmd: Command to invoke bcftoolstemp_dir: Directory for temporary filesthreads: Number of threads for bcftools operationsgenome_build: Reference build (e.g.,GRCh38) — must match annotation.yaml
VCFcache-provided variables (always use these):
${INPUT_BCF}: Path to input BCF that VCFcache provides (the variants to annotate)${OUTPUT_BCF}: Path where
