DeepVariant
DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
DeepVariant is a deep learning-based variant caller that takes aligned reads (in BAM or CRAM format), produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports the results in a standard VCF or gVCF file.
DeepVariant supports germline variant-calling in diploid organisms.
DeepVariant case-studies for germline variant calling:
- NGS (Illumina or Element) data for either a whole genome or whole exome.
- PacBio HiFi data, see the PacBio case study.
- Oxford Nanopore R10.4.1 Simplex case study.
- Complete Genomics T7 case study; G400 case study.
- Roche SBX case study for SBX-D and SBX-Fast data.
- Pangenome-mapping-based case-study: vg case study.
- RNA data for PacBio Iso-Seq/MAS-Seq case study and Illumina RNA-seq Case Study.
- Hybrid PacBio HiFi + Illumina WGS, see the hybrid case study.
Pangenome-aware DeepVariant case-studies:
- Pangenome-aware DeepVariant WGS (Illumina or Element): Mapped with BWA, Mapped with VG.
- Pangenome-aware DeepVariant WES (Illumina or Element): Mapped with BWA.
We have also adapted DeepVariant for somatic calling. See the DeepSomatic repo for details.
Please also note:
- DeepVariant currently supports variant calling on organisms where the ploidy/copy-number is two. This is because the genotypes supported are hom-alt, het, and hom-ref.
- The models included with DeepVariant are only trained on human data. For other organisms, see the blog post on non-human variant-calling for some possible pitfalls and how to handle them.
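As a rough illustration of the three supported genotype classes, the sketch below maps a diploid VCF `GT` string to hom-ref, het, or hom-alt. The helper name `classify_gt` is hypothetical and not part of DeepVariant; it assumes standard VCF genotype notation, where `0` is the reference allele and `/` or `|` separates the two alleles.

```shell
# Hypothetical helper (not part of DeepVariant): classify a diploid VCF GT value.
# Assumes standard VCF notation: 0 = reference allele, "." = missing allele.
classify_gt() (
  gt=$(printf '%s' "$1" | tr '|' '/')   # treat phased and unphased alike
  case "$gt" in
    */*/*) echo "not-diploid"; exit 0 ;;  # more than two alleles
    */*)   ;;                             # exactly two alleles, continue
    *)     echo "not-diploid"; exit 0 ;;  # single allele (haploid call)
  esac
  a=${gt%%/*}
  b=${gt#*/}
  if [ "$a" = "." ] || [ "$b" = "." ]; then
    echo "no-call"
  elif [ "$a" = "$b" ] && [ "$a" = "0" ]; then
    echo "hom-ref"
  elif [ "$a" = "$b" ]; then
    echo "hom-alt"
  else
    echo "het"
  fi
)

classify_gt "0/1"
```

A genotype with any other allele count falls outside these three classes, which is why organisms with a different ploidy are not supported.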
DeepTrio
DeepTrio is a deep learning-based trio variant caller built on top of DeepVariant. DeepTrio extends DeepVariant's functionality, allowing it to utilize the power of neural networks to predict genomic variants in trios or duos. See this page for more details and instructions on how to run DeepTrio.
DeepTrio supports germline variant-calling in diploid organisms for the following types of input data:
- NGS (Illumina) data for either whole genome or whole exome.
- PacBio HiFi data, see the PacBio case study.
Please also note:
- All DeepTrio models were trained on human data.
- It is possible to use DeepTrio with only 2 samples (child, and one parent).
- External tool GLnexus is used to merge output VCFs.
How to run DeepVariant
We recommend using our Docker solution. The command will look like this:
BIN_VERSION="1.10.0"
docker run \
-v "YOUR_INPUT_DIR":"/input" \
-v "YOUR_OUTPUT_DIR:/output" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS \ **Replace this string with exactly one of the following [WGS,WES,PACBIO,ONT_R104,HYBRID_PACBIO_ILLUMINA]**
--ref=/input/YOUR_REF \
--reads=/input/YOUR_BAM \
--output_vcf=/output/YOUR_OUTPUT_VCF \
--output_gvcf=/output/YOUR_OUTPUT_GVCF \
--num_shards=$(nproc) \ **This will use all your cores to run make_examples. Feel free to change.**
--vcf_stats_report=true \ **Optional. Creates a VCF statistics report as an HTML file. Default is false.**
--disable_small_model=true \ **Optional. Disables the small model in the make_examples stage. Default is false.**
--logging_dir=/output/logs \ **Optional. Saves the log output of each stage separately.**
--haploid_contigs="chrX,chrY" \ **Optional. Heterozygous variants in these contigs will be re-genotyped as the more likely of reference or homozygous alternate. For a sample with karyotype XY, set this to "chrX,chrY" for GRCh38 and "X,Y" for GRCh37. For a sample with karyotype XX, do not use this flag.**
--par_regions_bed="/input/GRCh3X_par.bed" \ **Optional. If --haploid_contigs is set, this can be used to provide PAR regions to be excluded from genotype adjustment. Download links for these files are given below.**
--dry_run=false **Default is false. If set to true, commands will be printed out but not executed.**
For details on X,Y support, please see DeepVariant haploid support and the case study in DeepVariant X, Y case study. You can download the PAR bed files from here: GRCh38_par.bed, GRCh37_par.bed.
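To make the `--haploid_contigs` behavior described above more concrete, here is a minimal sketch of the idea: a heterozygous call is replaced by whichever of hom-ref or hom-alt is more likely. This is not DeepVariant's implementation. The helper `regenotype_haploid` is hypothetical, it assumes a biallelic site with a Phred-scaled `PL` triple ordered as 0/0, 0/1, 1/1 (lower means more likely), it breaks ties toward the reference, and it ignores PAR regions.

```shell
# Hypothetical sketch (not DeepVariant's code): re-genotype a het call on a
# haploid contig using a biallelic PL triple "PL(0/0),PL(0/1),PL(1/1)".
regenotype_haploid() (
  gt="$1"
  pl="$2"
  case "$gt" in
    "0/1"|"1/0"|"0|1"|"1|0") ;;            # heterozygous: needs adjustment
    *) printf '%s\n' "$gt"; exit 0 ;;      # anything else is left unchanged
  esac
  pl_ref=${pl%%,*}   # first value: hom-ref
  pl_alt=${pl##*,}   # last value: hom-alt
  # Lower PL means higher likelihood; ties go to the reference (an assumption).
  if [ "$pl_ref" -le "$pl_alt" ]; then
    echo "0/0"
  else
    echo "1/1"
  fi
)

regenotype_haploid "0/1" "30,0,12"
```

In a real run, variants inside the regions given by `--par_regions_bed` would be excluded from this adjustment.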
To see all flags you can use, run: docker run google/deepvariant:"${BIN_VERSION}"
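Because the annotated command above carries its notes inline, it cannot be pasted into a shell as-is. One possible workaround is a small wrapper that checks the model type and prints a clean command for review before you run it. The function `print_dv_command` is a hypothetical helper, not something DeepVariant ships; the placeholders (`YOUR_REF`, `YOUR_BAM`, and so on) are the ones used above and still need to be replaced. Only the required flags are included.

```shell
# Hypothetical helper (not part of DeepVariant): print a clean, pasteable
# version of the Docker command instead of executing it.
print_dv_command() (
  bin_version="$1"
  model_type="$2"
  input_dir="$3"
  output_dir="$4"
  # Accept only the model types listed in the annotated command.
  case "$model_type" in
    WGS|WES|PACBIO|ONT_R104|HYBRID_PACBIO_ILLUMINA) ;;
    *) echo "unsupported model type: $model_type" >&2; exit 1 ;;
  esac
  # Every line except the last ends in a line-continuation backslash.
  printf '%s \\\n' \
    'docker run' \
    "  -v \"${input_dir}:/input\"" \
    "  -v \"${output_dir}:/output\"" \
    "  google/deepvariant:${bin_version}" \
    '  /opt/deepvariant/bin/run_deepvariant' \
    "  --model_type=${model_type}" \
    '  --ref=/input/YOUR_REF' \
    '  --reads=/input/YOUR_BAM' \
    '  --output_vcf=/output/YOUR_OUTPUT_VCF' \
    '  --output_gvcf=/output/YOUR_OUTPUT_GVCF'
  printf '%s\n' '  --num_shards=$(nproc)'
)

print_dv_command "1.10.0" WGS "YOUR_INPUT_DIR" "YOUR_OUTPUT_DIR"
```

The built-in `--dry_run=true` flag serves a related purpose at a later stage: it prints the commands for each pipeline stage without executing them.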
If you're using GPUs, or want to use Singularity instead, see Quick Start for more details.
If you are running on a machine with a GPU, an experimental mode is available
that enables running the make_examples stage on the CPU while the
call_variants stage runs on the GPU simultaneously.
For more details, refer to the Fast Pipeline case study.
For more information, also see:
- Full documentation list
- Detailed usage guide with more information on the input and output file formats and how to work with them.
- Best practices for multi-sample variant calling with DeepVariant
- (Advanced) Training tutorial
- DeepVariant's Frequently Asked Questions, FAQ
How to cite
If you're using DeepVariant in your work, please cite:
A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018).
Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T. Afshar, Sam S. Gross, Lizzie Dorfman, Cory Y. McLean, and Mark A. DePristo.
doi: https://doi.org/10.1038/nbt.4235
Additionally, if you are generating multi-sample calls using our DeepVariant and GLnexus Best Practices, please cite:
Accurate, scalable cohort variant calls using DeepVariant and GLnexus. Bioinformatics (2021).
Taedong Yun, Helen Li, Pi-Chuan Chang, Michael F. Lin, Andrew Carroll, and Cory Y. McLean.
doi: https://doi.org/10.1093/bioinformatics/btaa1081
Why Use DeepVariant?
- High accuracy - DeepVariant won the 2020 PrecisionFDA Truth Challenge V2 for All Benchmark Regions in the ONT, PacBio, and Multiple Technologies categories, and the 2016 PrecisionFDA Truth Challenge for best SNP Performance. DeepVariant maintains high accuracy across data from different sequencing technologies, prep methods, and species. Using DeepVariant makes an especially large difference at lower coverage. See metrics for the latest accuracy numbers on each of the sequencing types.
- Flexibility - Out-of-the-box use for PCR-positive samples and low quality sequencing runs, and easy adjustments for different sequencing technologies and non-human species.
- Ease of use - No filtering is needed beyond setting your preferred minimum quality threshold.
- Cost effectiveness - With a single non-preemptible n1-standard-16 machine on Google Cloud, it costs ~$11.8 to call a 30x whole genome and ~$0.89 to call an exome. With preemptible pricing, the cost is $2.84 for a 30x whole genome and $0.21 for whole exome (not considering preemption).
- Speed - See metrics for the runtime of all supported datatypes on a 96-core CPU-only machine. Multiple options for acceleration exist.
- Usage options - DeepVariant can be run via Docker or binaries, on on-premises hardware or in the cloud, with support for hardware accelerators like GPUs and TPUs.
(1): Time estimates do not include
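The "Ease of use" point above says that the only filtering needed is a minimum quality threshold. A minimal sketch of such a filter with plain `awk` follows, assuming an uncompressed VCF on standard input with the standard tab-separated layout (QUAL in column 6). The helper name `filter_by_qual` and the threshold of 20 are illustrative choices, not DeepVariant recommendations.

```shell
# Hypothetical helper: keep VCF header lines and records whose QUAL
# (column 6) is at least the given threshold. Reads VCF from stdin.
filter_by_qual() {
  awk -v min="$1" -F '\t' '
    /^#/ { print; next }                      # pass header lines through
    $6 != "." && ($6 + 0) >= (min + 0) { print }  # drop missing or low QUAL
  '
}

# Usage (file names are placeholders):
#   filter_by_qual 20 < YOUR_OUTPUT_VCF > filtered.vcf
```

For compressed or indexed VCF files, a dedicated tool such as bcftools is likely a better fit than this sketch.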
