EVG

A rapid and accurate ensemble pipeline for graph-based variant genotyping with lower depth of short reads

Introduction

EVG is an ensemble genotyping pipeline built on a comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes.

Requirements

Please note the following requirements before building and running the software:

Linux operating system
CMake 3.12 or higher
Python 3.9
A C++ compiler that supports C++17 or higher (GCC 7.3.0 or newer is recommended) and the zlib library, both needed to build graphvcf and fastAQ
The following dependencies, installed and available on your PATH: tabix, bwa, samtools, VG, GraphAligner, Paragraph, BayesTyper, GraphTyper2, PanGenie

Recent major updates:

(2025/04/30, v1.2.2)

Updated Giraffe indexing and alignment commands for vg ≥1.63.0.
Pinned BayesTyper to 1.5=h176a8bc_0 due to bugs in newer conda versions.

(2024/06/25, v1.2.0)

Previous versions crashed with a segmentation fault when a sample's genotype information was missing from the VCF file. Since v1.2.0, a missing genotype is replaced with 0|0.
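The effect of this change can be illustrated with a minimal sketch. It is not taken from the EVG source code: the function name fill_missing_genotype is hypothetical, and the sketch assumes a tab-separated VCF data line in which GT is the first FORMAT field.

```python
def fill_missing_genotype(vcf_line: str, default: str = "0|0") -> str:
    """Replace empty or '.' sample genotypes in one VCF data line with `default`.

    Illustrative only: assumes tab-separated columns with GT as the first
    FORMAT key, and is not EVG's actual implementation.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF columns plus FORMAT; samples start at 9.
    for i in range(9, len(fields)):
        sample = fields[i].split(":")
        if sample[0] in ("", ".", "./.", ".|."):
            sample[0] = default
        fields[i] = ":".join(sample)
    return "\t".join(fields)


line = "chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t.\t0|1"
print(fill_missing_genotype(line))
```

The first sample's missing genotype becomes 0|0, while the second sample's 0|1 call is left unchanged.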

Installation

Install via Anaconda

The easiest way to install EVG is through Anaconda. EVG requires Python 3.9 in this case; conda selects that version automatically, so please make sure that Python 3.9 can be installed on your system.

# Create a new environment named evg_env
conda create -n evg_env
# Activate the environment
conda activate evg_env
# Install EVG with all dependencies
conda install -c bioconda -c conda-forge -c kdm801 -c duzezhen evg

Building on Linux

To build the software from source, first obtain the source code.

git clone https://github.com/JiaoLab2021/EVG.git
cd EVG

Next, compile the software and add the current directory to your PATH environment variable. EVG, graphvcf, and fastAQ must all be in the same folder, because EVG calls the other two programs from its own directory.

cmake ./
make
chmod +x EVG.py
ln -sf EVG.py EVG
echo 'export PATH="$PATH:'$(pwd)'"' >> ~/.bashrc
source ~/.bashrc

Make sure that all required dependencies are installed and either available on your PATH or provided by an activated environment. If you have not installed them yet, the following commands install all of them:

# Create a new environment named evg_env
conda create -n evg_env
# Activate the environment
conda activate evg_env
# Install software using conda
conda install -c bioconda -c conda-forge -c kdm801 tabix bwa samtools vg graphaligner paragraph 'bayestyper==1.5=h176a8bc_0' graphtyper kmc pangenie
# If you see "ModuleNotFoundError: No module named 'pysam.bcftools'", upgrade pysam
conda update pysam

Note

The default version of PanGenie installed by conda is 2.1.0, but EVG requires version 3.0 or higher. If you choose PanGenie as a downstream tool, please remove the conda-installed PanGenie from your environment, manually install the latest version of PanGenie, and add it to your PATH.

Test

To verify that the software has been installed correctly, check that each program starts, then run EVG on the test data:

EVG -h
graphvcf -h
fastAQ -h
tabix -h
bwa
samtools
vg -h
GraphAligner -h
paragraph -h
bayesTyper -h
graphtyper -h
PanGenie -h
kmc -h
jellyfish -h
# Run EVG on the test data in the background; output goes to log.txt
cd test
EVG -r test.fa -v test.vcf.gz -s sample.txt --software VG-MAP VG-Giraffe GraphAligner Paragraph BayesTyper GraphTyper2 PanGenie &>log.txt &
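Running each program by hand can be replaced by a quick PATH lookup. The following is a minimal sketch, not part of EVG: the function name find_missing_tools is hypothetical, and the executable names are copied from the commands above. It only checks that each executable resolves on PATH, not that its version is suitable.

```python
import shutil

# Executable names as they appear in the test commands above.
REQUIRED_TOOLS = [
    "EVG", "graphvcf", "fastAQ", "tabix", "bwa", "samtools", "vg",
    "GraphAligner", "paragraph", "bayesTyper", "graphtyper", "PanGenie",
    "kmc", "jellyfish",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```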

Usage

Input Files

Reference Genome
VCF File of Population Variants
Sample File:

# Sample File
sample1 sample1.r1.fq.gz sample1.r2.fq.gz
sample2 sample2.r1.fq.gz sample2.r2.fq.gz
...
sampleN sampleN.r1.fq.gz sampleN.r2.fq.gz

Please note that the Sample file must be formatted exactly as shown above, where each sample is listed with its corresponding read files.
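Because the format is strict, it can help to validate the sample file before starting a long run. The sketch below is not part of EVG; the function name read_sample_file is hypothetical, and it assumes whitespace-separated columns, exactly three columns per sample as in the example above, and that blank lines and lines starting with "#" can be skipped.

```python
from pathlib import Path


def read_sample_file(path):
    """Parse a sample file into {sample_name: (read1, read2)}.

    Assumptions of this sketch (not confirmed EVG behavior): columns are
    whitespace-separated, and blank lines and '#' lines are ignored.
    """
    samples = {}
    lines = Path(path).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(
                f"{path}:{line_no}: expected 3 columns, found {len(parts)}"
            )
        name, read1, read2 = parts
        if name in samples:
            raise ValueError(f"{path}:{line_no}: duplicate sample name {name!r}")
        samples[name] = (read1, read2)
    return samples
```

Calling read_sample_file("sample.txt") returns a dictionary keyed by sample name, or raises ValueError naming the first malformed line.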

Running

For convenience, let's assume the following file names for the input:

refgenome.fa
input.vcf.gz
sample.txt

EVG automatically selects suitable software based on the genome, the variants, and the sequencing data. If desired, users can also specify their preferred software with the "--software" option. The default running command is as follows:

EVG -r refgenome.fa -v input.vcf.gz -s sample.txt

The results are stored in the merge/ folder. Each file is named after the corresponding sample listed in sample.txt: sample1.vcf.gz, sample2.vcf.gz, ..., sampleN.vcf.gz. For example, a run with two samples named test1 and test2 produces:

$ tree merge/
merge/
├── test1.vcf.gz
└── test2.vcf.gz

0 directories, 2 files
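After a run, one way to confirm that every sample produced a result is to look for its file in merge/. This is a minimal sketch rather than an EVG feature: the function name missing_results is hypothetical, and treating a zero-byte file as missing is an assumption of the sketch.

```python
from pathlib import Path


def missing_results(sample_names, merge_dir="merge"):
    """Return sample names whose <name>.vcf.gz is absent or empty in merge_dir."""
    missing = []
    for name in sample_names:
        result = Path(merge_dir) / f"{name}.vcf.gz"
        # A zero-byte file usually means the run was interrupted.
        if not result.is_file() or result.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_results(["test1", "test2"]))
```

An empty list means that a non-empty result file exists for every listed sample.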

Parameters

--depth: The maximum sequencing depth allowed for downstream analysis. If the data exceed this depth, EVG randomly downsamples the reads to the specified level to speed up the run. The default is 15×, and it can be adjusted to meet specific requirements.
--mode: The operating mode of EVG. In fast mode, only a subset of the software is used to genotype SNPs and indels, while precise mode uses all software to genotype all variants.
--force: If the running directory of EVG already contains files, this parameter forcibly empties the folder. Without it, EVG reports an error and exits.
--restart: Resumes an interrupted run from where it stopped. Note that EVG decides whether a step has finished by checking whether its output files exist, so it is recommended to check for incomplete or empty files and delete them before using this parameter.
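The arithmetic behind --depth can be sketched as follows. How EVG estimates depth internally is not described here, so this example assumes the common estimate of total read bases divided by genome size; the function name downsample_fraction is hypothetical.

```python
def downsample_fraction(total_read_bases, genome_size, max_depth=15.0):
    """Return the fraction of reads to keep so depth does not exceed max_depth.

    Assumes depth = total_read_bases / genome_size, which is a common
    estimate and not necessarily the one EVG uses internally.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    depth = total_read_bases / genome_size
    if depth <= max_depth:
        return 1.0  # already at or below the cap: keep every read
    return max_depth / depth


# 12 Gb of reads on a 400 Mb genome is 30x, so half of the reads are kept.
print(downsample_fraction(12_000_000_000, 400_000_000))
```

With the default cap of 15×, data at or below 15× are left untouched and deeper data are reduced proportionally.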

graphvcf

If you already have results from different genotyping software and do not need to run the EVG pipeline, you can use graphvcf directly to merge them.

graphvcf merge -v merged.vcf.gz --Paragraph xx.vcf.gz --BayesTyper xx.vcf.gz --VG-Giraffe xx.vcf.gz -n sample1 -o sample.vcf.gz
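When merging results for many samples, the command can be assembled programmatically. The sketch below only builds the argument list and uses no options beyond those shown in the command above; the function name build_graphvcf_merge and the file names are hypothetical.

```python
import shlex


def build_graphvcf_merge(vcf, sample_name, output_vcf, results):
    """Build the argument list for `graphvcf merge`.

    `results` maps a genotyper option name (for example "Paragraph") to the
    VCF produced by that tool. Only the option names shown in the command
    above are known from this README; any other name is an assumption.
    """
    cmd = ["graphvcf", "merge", "-v", vcf]
    for tool, tool_vcf in results.items():
        cmd += [f"--{tool}", tool_vcf]
    cmd += ["-n", sample_name, "-o", output_vcf]
    return cmd


cmd = build_graphvcf_merge(
    "merged.vcf.gz",
    "sample1",
    "sample.vcf.gz",
    {"Paragraph": "paragraph.vcf.gz", "BayesTyper": "bayestyper.vcf.gz"},
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once graphvcf is on your PATH.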

Detailed instructions for using graphvcf can be found on the Wiki page.

Citation

When using the following tools, please cite the corresponding articles:

EVG:
- Du, ZZ., He, JB. & Jiao, WB. A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline. Genome Biol 25, 91 (2024).
vg map:
- Hickey, G., Heller, D., Monlong, J. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol 21, 35 (2020).
vg giraffe:
- Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
GraphAligner:
- Rautiainen, M., Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 21, 253 (2020).
Paragraph:
- Chen, S., Krusche, P., Dolzhenko, E. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol 20, 291 (2019).
- Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics. (2013).
BayesTyper:
- Sibbesen, J.A., Maretty, L., The Danish Pan-Genome Consortium. et al. Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet 50, 1054–1059 (2018).
- Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics. (2013).
GraphTyper2:
- Eggertsson, H.P., Kristmundsdottir, S., Beyter, D. et al. [GraphTyper2 enables population-sca
