VarDictJava
VarDict Java port
This is the Final Version of VarDict. No longer maintained.
Introduction
VarDictJava is a variant discovery program written in Java and Perl. It is a Java port of the VarDict variant caller.
The original Perl VarDict is a sensitive variant caller for both single and paired sample variant calling from BAM files. VarDict implements several novel features, such as amplicon bias aware variant calling from targeted sequencing experiments, rescue of long indels by realigning BWA soft-clipped reads, and better scalability than many other Java-based variant callers. The Java port is around 10x faster than the original Perl implementation.
Please cite VarDict:
Lai Z, Markovets A, Ahdesmaki M, Chapman B, Hofmann O, McEwen R, Johnson J, Dougherty B, Barrett JC, and Dry JR. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016, pii: gkw227.
The article is available at: https://academic.oup.com/nar/article/44/11/e108/2468301?searchresult=1
Originally coded by Zhongwu Lai, 2014.
VarDictJava can run in single sample (see Single sample mode section), paired sample (see Paired variant calling section), or amplicon bias aware modes. As input, VarDictJava takes reference genomes in FASTA format, aligned reads in BAM format, and target regions in BED format.
Requirements
- JDK 1.8 or later
- R language (uses /usr/bin/env R)
- Perl (uses /usr/bin/env perl)
- Internet connection to download dependencies using gradle.
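Before building, it can help to confirm that the tools listed above are visible on PATH. The following is a minimal sketch: the helper name check_tool is hypothetical and not part of VarDictJava, and it only checks that each command exists, not which version is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VarDictJava): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools named in the requirements list; presence only, versions are not checked.
for tool in java R perl; do
    check_tool "$tool" || true
done
```

Any tool reported as missing needs to be installed or added to PATH before the build and the downstream R and Perl scripts can run.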
To see the help page for the program, run:
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -H
Getting started
Getting source code
The VarDictJava source code is located at https://github.com/AstraZeneca-NGS/VarDictJava.
To load the project, execute the following command:
git clone --recursive https://github.com/AstraZeneca-NGS/VarDictJava.git
Note that the original VarDict project is included in this repository as a submodule, and its contents can be found in the VarDict sub-directory of the VarDictJava working folder. So when you use teststrandbias.R and var2vcf_valid.pl (see details and examples below), you have to add the VarDict prefix: VarDict/teststrandbias.R and VarDict/var2vcf_valid.pl.
Compiling
The project uses Gradle and already includes a gradlew script.
To build the project, in the root folder of the project, run the following command:
./gradlew clean installDist
The clean task removes all old files from the build folder.
To generate Javadoc in the build/docs/javadoc folder, run the following command. To keep the existing contents of the build folder (for example, after building the project), run it without the clean task:
./gradlew clean javadoc
To generate a release archive (zip or tar) in the build/distributions folder, run either of the following commands:
./gradlew distZip
or
./gradlew distTar
Distribution Package Structure
When the build command completes successfully, the build/install/VarDict folder contains the distribution package.
The distribution package has the following structure:
- bin/ - contains the launch scripts
- lib/ - has the jar file that contains the compiled project code and the jar files of the third-party libraries that the project uses.
You can move the distribution package (the content of the build/install/VarDict folder) to any convenient location.
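After building or moving the package, a quick layout check can catch an incomplete copy. This is a sketch under the assumption that the package contains a bin/VarDict launcher and a lib/ folder, as described above; check_dist is a hypothetical helper, not something shipped with the project.

```shell
# Hypothetical helper: check that a directory looks like the distribution package
# described above (a bin/ folder with the VarDict launcher and a lib/ folder).
check_dist() {
    [ -f "$1/bin/VarDict" ] || { echo "missing launcher: $1/bin/VarDict"; return 1; }
    [ -d "$1/lib" ] || { echo "missing folder: $1/lib"; return 1; }
    echo "ok: $1"
}

# Example call; adjust the path to wherever the package was built or moved.
check_dist "build/install/VarDict" || true
```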
Generated zip and tar releases will also contain scripts from the VarDict Perl repository in the bin/ directory (teststrandbias.R, testsomatic.R, var2vcf_valid.pl, var2vcf_paired.pl).
You can add VarDictJava to PATH by adding this line to .bashrc:
export PATH=/path/to/VarDict/bin:$PATH
After that, you can run VarDict with the VarDict command instead of the full path <path_to_vardict_folder>/build/install/VarDict/bin/VarDict.
Third-Party Libraries
Currently, the project uses the following third-party libraries:
- JRegex (http://jregex.sourceforge.net, BSD license) is a regular expressions library that is used instead of the standard Java library because its performance is much higher than that of the standard library.
- Commons CLI (http://commons.apache.org/proper/commons-cli, Apache License) – a library for parsing the command line.
- HTSJDK (http://samtools.github.io/htsjdk/) is an implementation of a unified Java library for accessing common file formats, such as SAM and VCF.
- Mockito and TestNG are the testing frameworks (not included in distribution, used only in tests).
Single sample mode
To run VarDictJava in single sample mode, use a BAM file specified without the | symbol and perform Steps 3 and 4
(see the Program workflow section) using teststrandbias.R and var2vcf_valid.pl.
The following is an example command to run in single sample mode with a BED file.
Set the options -c, -S, -E, and -g to the column numbers in your BED file that hold the chromosome, start, end, and gene of each region, respectively:
AF_THR="0.01" # minimum allele frequency
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N sample_name -b /path/to/my.bam -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f $AF_THR > vars.vcf
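To make the mapping between -c, -S, -E, -g and the BED columns concrete, the sketch below writes a one-line, 4-column BED file and prints the fields that -c 1 -S 2 -E 3 -g 4 would refer to. The file contents are illustrative values, show_bed_fields is a hypothetical helper, and awk is used only to show the column positions; it is not part of the VarDict pipeline.

```shell
# Write a tab-separated BED file with illustrative values:
# column 1 = chromosome, 2 = start, 3 = end, 4 = gene.
bed_file="$(mktemp)"
printf 'chr7\t55270300\t55270348\tEGFR\n' > "$bed_file"

# Column numbers as they would be passed to -c, -S, -E and -g.
chr_col=1; start_col=2; end_col=3; gene_col=4

# Hypothetical helper: print the fields those column numbers select.
show_bed_fields() {
    awk -F'\t' -v c="$chr_col" -v s="$start_col" -v e="$end_col" -v g="$gene_col" \
        '{ printf "chr=%s start=%s end=%s gene=%s\n", $c, $s, $e, $g }' "$1"
}

show_bed_fields "$bed_file"
```

If a BED file keeps the gene name in a different column, only the number given to -g would change.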
VarDictJava can also be invoked without a BED file if the region is specified on the command line with the -R option.
The following is an example command to run VarDictJava for a region (chromosome 7, positions 55270300 to 55270348, EGFR gene) with the -R option:
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f 0.001 -N sample_name -b /path/to/sample.bam -R chr7:55270300-55270348:EGFR | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f 0.001 > vars.vcf
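The -R value in the command above follows the pattern chromosome:start-end:gene. When regions come from a script, a small helper can assemble the string; make_region below is a hypothetical name used only for this sketch.

```shell
# Hypothetical helper: assemble the chromosome:start-end:gene string used with -R.
make_region() {
    printf '%s:%s-%s:%s\n' "$1" "$2" "$3" "$4"
}

region="$(make_region chr7 55270300 55270348 EGFR)"
echo "$region"   # prints chr7:55270300-55270348:EGFR
```

The result can then be passed as -R "$region" in place of the literal string.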
In single sample mode, the output columns contain a description and statistical info for variants in the single sample. See the Output Columns section for the list of columns in the output.
Paired variant calling
To run paired variant calling, use BAM files specified as BAM1|BAM2 and perform Steps 3 and 4
(see the Program Workflow section) using testsomatic.R and var2vcf_paired.pl.
In this mode, the number of statistics columns in the output is doubled: one set of columns is for the first sample, the other is for the second sample.
The following is an example command to run in paired mode.
Set the options -c, -S, -E, and -g to the column numbers in your BED file that hold the chromosome, start, end, and gene of each region, respectively:
AF_THR="0.01" # minimum allele frequency
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N tumor_sample_name -b "/path/to/tumor.bam|/path/to/normal.bam" -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/testsomatic.R | VarDict/var2vcf_paired.pl -N "tumor_sample_name|normal_sample_name" -f $AF_THR
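In the paired command, both the BAM pair and the sample-name pair contain a literal | character, which is why they appear in double quotes. The sketch below builds the two values in variables; join_pair is a hypothetical helper and the paths are the placeholders from the example above.

```shell
# Hypothetical helper: join two values with a literal | character.
join_pair() {
    printf '%s|%s\n' "$1" "$2"
}

bam_arg="$(join_pair /path/to/tumor.bam /path/to/normal.bam)"
name_arg="$(join_pair tumor_sample_name normal_sample_name)"

# Typed directly on the command line, an unquoted | is read by the shell as a pipe.
# Held in a variable it is passed through as-is; double quotes around the
# expansion still guard against word splitting in paths that contain spaces.
echo "-b \"$bam_arg\""
echo "-N \"$name_arg\""
```

The first value corresponds to the -b argument of VarDict and the second to the -N argument of var2vcf_paired.pl in the command above.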
Amplicon based calling
This mode is active if the BED file uses 8-column format and the -R option is not specified.
In this mode, only the first list of BAM files is used, even if the files are specified as BAM1|BAM2 as in paired variant calling.
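Because this mode is selected by the BED layout, it can be useful to check how many columns a BED file has before running. The sketch below counts tab-separated fields on the first line only. The helper name bed_columns and the sample row are hypothetical, and VarDictJava's own detection logic may differ (for example, in how it treats header or comment lines).

```shell
# Hypothetical helper: print the number of tab-separated columns
# on the first line of a BED file.
bed_columns() {
    awk -F'\t' 'NR == 1 { print NF; exit }' "$1"
}

# Illustrative 8-column row; the values are made up. Columns 7 and 8 hold the
# start and end positions used by amplicon based calling.
amplicon_bed="$(mktemp)"
printf 'chr7\t55270250\t55270400\tEGFR\t0\t.\t55270300\t55270348\n' > "$amplicon_bed"

if [ "$(bed_columns "$amplicon_bed")" -eq 8 ]; then
    echo "8 columns: amplicon based calling applies unless -R is given"
else
    echo "not an 8-column BED file"
fi
```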
For each segment, the BED file specifies the list of positions as start and end positions (columns 7 and 8 of
the BED file). The Amplicon based calling mode outputs a record for every position between start and end that has
any variant other than the reference one (all positions with the -p option). For any of these positions,
VarDict in amplicon based calling mode outputs the following:
- Same columns as in the single sample mode for the most frequent variant
- Good variants for this position with the prefixes GOOD1, GOOD2, etc.
- Bad variants for this position with the prefixes BAD1, etc.
