Clair - deep neural network based variant caller

Contact: Ruibang Luo
Email: rbluo@cs.hku.hk

Introduction

Clair3, released in May 2021, is the successor of Clair. Please try out Clair3 at https://github.com/HKU-BAL/Clair3.

Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has kept these technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling using single-molecule sequencing data. For ONT data, Clair achieves the best precision, recall and speed compared to several competing programs, including Clairvoyante, Longshot and Medaka. By studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks.

This is the formal release of Clair (Clair v2, Dec 2019). You can find the experimental Clair v1 (Jan 2019) at https://github.com/aquaskyline/Clair. The preprint of Clair v2 is available on bioRxiv.

Contents

What are we working on right now
What's new
Installation
Quick Demo
Usage
Submodule Descriptions
Download Pretrained Models
Advanced Guides
Model Training
Post Processing

What are we working on right now?

A full-alignment representation for higher performance in low-complexity genomic regions.
Testing small techniques to resolve some complex variants, e.g. a deletion that spans a SNP that closely follows it.

What's new?

20200831
- added support for alternative allele "*". "GetTruth.py" now requires a reference genome as input. You don't need to change your usage if you use "callVarBam.py" for automatic script generation.
20200416
- added two new options for haploid calling, --haploid_precision and --haploid_sensitive (in #24)
- added a simple post-calling solution to handle overlapping variants (in #15)
- fixed haploid GT output (in #17)
20200309
- an ONT model trained with up to 578-fold coverage HG002 data from The Human Pangenome Reference Consortium is now available in Pretrained Models. The table below shows biased test results: the testing samples were included in training, so the numbers are not suitable for benchmarking, but they suggest the performance cap of each model at different coverages. The new model shows significantly improved performance at high coverages.

Installation

Option 1. Bioconda

# make sure channels are added in conda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create conda environment named "clair-env"
conda create -n clair-env -c bioconda clair
conda activate clair-env

# store clair.py PATH into $CLAIR variable
CLAIR=`which clair.py`

# run clair like this afterwards
python $CLAIR --help

The conda environment has the Pypy3 interpreter installed, but the Pypy3 package intervaltree is still missing. It is not installed by default because it is not yet available in any conda repository. To install the package for Pypy3, activate the conda environment and run the following commands:

pypy3 -m ensurepip
pypy3 -m pip install --no-cache-dir intervaltree==3.0.2

Then download the trained models:

# download the trained model for ONT
mkdir ont && cd ont
wget http://www.bio8.cs.hku.hk/clair_models/ont/122HD34.tar
tar -xf 122HD34.tar
cd ../

# download the trained model for PacBio CCS
mkdir pacbio && cd pacbio
wget http://www.bio8.cs.hku.hk/clair_models/pacbio/ccs/15.tar
tar -xf 15.tar
cd ../

# download the trained model for Illumina
mkdir illumina && cd illumina
wget http://www.bio8.cs.hku.hk/clair_models/illumina/12345.tar
tar -xf 12345.tar
cd ../

Option 2. Build an anaconda virtual environment step by step

Please install anaconda using the installation guide at https://docs.anaconda.com/anaconda/install/

# create and activate the environment named clair
conda create -n clair python=3.7
conda activate clair

# install pypy and packages on clair environment
conda install -c conda-forge pypy3.6
pypy3 -m ensurepip
pypy3 -m pip install intervaltree==3.0.2

# install python packages on clair environment
pip install numpy==1.18.0 blosc==1.8.3 intervaltree==3.0.2 tensorflow==1.13.2 pysam==0.15.3 matplotlib==3.1.2
conda install -c anaconda pigz==2.4
conda install -c conda-forge parallel=20191122 zstd=1.4.4
conda install -c bioconda samtools=1.10 vcflib=1.0.0 bcftools=1.10.2

# clone Clair
git clone --depth 1 https://github.com/HKU-BAL/Clair.git
cd Clair
chmod +x clair.py
export PATH=`pwd`":$PATH"

# store clair.py PATH into $CLAIR variable
CLAIR=`which clair.py`

# run clair like this afterwards
python $CLAIR --help

Then download the trained models as described under "download the trained models" in Installation - Option 1.

Option 3. Docker

# clone Clair
git clone --depth 1 https://github.com/HKU-BAL/Clair.git
cd Clair

# build a docker image named clair_docker_image
docker build -f ./Dockerfile -t clair_docker_image . # You might need root privilege

# run docker image
docker run -it clair_docker_image # You might need root privilege

# store clair.py PATH into $CLAIR variable
CLAIR=`which clair.py`

# run clair like this afterwards
python $CLAIR --help

Then download the trained models as described under "download the trained models" in Installation - Option 1.

After Installation

To check the version of Tensorflow you have installed:

python -c 'import tensorflow as tf; print(tf.__version__)'

A CPU is sufficient for variant calling with trained models. Clair uses 4 threads by default in callVarBam. The number of threads can be controlled with the parameter --threads. To train a new model, a high-end GPU and the GPU version of Tensorflow are needed. To install the GPU version of Tensorflow:

pip install tensorflow-gpu==1.13.2

The installation of the blosc library might fail if your CPU doesn't support the AVX2 instruction set. In that case, you can compile and install blosc from the latest source code available on GitHub with the DISABLE_BLOSC_AVX2 environment variable set.
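
Before installing blosc, it can help to know whether the CPU reports AVX2 at all. The following is a minimal sketch, assuming a Linux system where CPU flags are listed in /proc/cpuinfo; the function name has_avx2 is a hypothetical helper, not part of Clair or blosc.

```shell
# Hypothetical helper: succeed if a cpuinfo-style file lists the avx2 flag.
# Assumption: Linux, where /proc/cpuinfo contains a "flags" line per core.
has_avx2() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    grep -qw 'avx2' "$cpuinfo_file" 2>/dev/null
}

if has_avx2; then
    echo "AVX2 detected: the prebuilt blosc package should install normally."
else
    echo "AVX2 not detected: consider building blosc from source with DISABLE_BLOSC_AVX2 set."
fi
```

On systems without /proc/cpuinfo the check reports "not detected", so treat the result as a hint rather than a definitive answer.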

Quick demo

Step 1. Install Clair, preferably using Installation - Option 1
Step 2. Run

conda activate clair-env
mkdir clairDemo
cd clairDemo
wget 'http://www.bio8.cs.hku.hk/clair_models/demo/clairDemo.sh'
bash clairDemo.sh

Step 3. Check the results using less -S ./training/chr21.vcf
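
Besides paging through the file, a quick record count gives a rough sanity check of the demo output. This is a minimal sketch; count_vcf_records is a hypothetical helper that relies only on the VCF convention that header lines start with '#'.

```shell
# Hypothetical helper: count non-header records in a VCF file.
# VCF header lines start with '#'; every other line is one record.
count_vcf_records() {
    grep -vc '^#' "$1" || true   # grep exits non-zero when the count is 0
}

VCF="./training/chr21.vcf"       # output path mentioned in Step 3
if [ -f "$VCF" ]; then
    echo "variant records: $(count_vcf_records "$VCF")"
else
    echo "no VCF found at $VCF; check that clairDemo.sh finished"
fi
```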

Usage

General usage

CLAIR="[PATH_TO_CLAIR]/clair.py"

# to run a submodule using python
python $CLAIR [submodule] [options]

# to run a Pypy-able submodule using pypy (if `pypy3` is the executable command for Pypy)
pypy3 $CLAIR [submodule] [options]

Set up variables for the variant calling commands below

CLAIR="[PATH_TO_CLAIR]/clair.py"                         # e.g. clair.py
MODEL="[MODEL_PATH]"                                     # e.g. [PATH_TO_CLAIR_MODEL]/ont/model
BAM_FILE_PATH="[YOUR_BAM_FILE]"                          # e.g. chr21.bam
REFERENCE_FASTA_FILE_PATH="[YOUR_REFERENCE_FASTA_FILE]"  # e.g. chr21.fa
KNOWN_VARIANTS_VCF="[YOUR_VCF_FILE]"                     # e.g. chr21.vcf

Notes

Each model consists of three files: model.data-00000-of-00001, model.index and model.meta. Set the MODEL variable to their shared prefix, i.e. the path ending in model.
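
A wrong MODEL value is easy to catch before calling variants. The sketch below checks that all three files named above exist for a given prefix; check_model_prefix and the default path ./ont/model are assumptions made for illustration, not part of Clair.

```shell
# Hypothetical helper: succeed only if all three checkpoint files exist
# for the given prefix (e.g. [PATH_TO_CLAIR_MODEL]/ont/model).
check_model_prefix() {
    prefix="$1"
    for suffix in data-00000-of-00001 index meta; do
        if [ ! -f "${prefix}.${suffix}" ]; then
            echo "missing: ${prefix}.${suffix}" >&2
            return 1
        fi
    done
}

MODEL="${MODEL:-./ont/model}"    # assumed default; point this at your own model
if check_model_prefix "$MODEL"; then
    echo "model prefix looks complete: $MODEL"
else
    echo "model prefix is incomplete: $MODEL"
fi
```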

Call variants at known variant sites or in a chromosome (using `callVarBam`)

For whole genome variant calling, please use callVarBamParallel to generate multiple commands that invoke callVarBam on smaller chromosome chunks.

Call variants in a chromosome

# variables
VARIANT_CALLING_OUTPUT_PATH="[YOUR_OUTPUT_PATH]"         # e.g. calls/chr21.vcf (please make sure the directory exists)
CONTIG_NAME="[CONTIG_NAME_FOR_VARIANT_CALLING]"          # e.g. chr21
SAMPLE_NAME="[SAMPLE_NAME]"                              # e.g. HG001

python $CLAIR callVarBam \
--chkpnt_fn "$MODEL" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--bam_fn "$BAM_FILE_PATH" \
--ctgName "$CONTIG_NAME" \
--sampleName "$SAMPLE_NAME" \
--call_fn "$VARIANT_CALLING_OUTPUT_PATH"

# the output path is a file (e.g. calls/chr21.vcf), so change into its directory
cd "$(dirname "$VARIANT_CALLING_OUTPUT_PATH")"
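
To call several contigs one after another, the single-contig command above can be repeated in a plain loop. The sketch below only prints the commands and uses only the options shown above; it is not a substitute for callVarBamParallel and does not split contigs into chunks. The helper emit_callvarbam_commands, the contig names, the placeholder defaults and the `<out_dir>/<contig>.vcf` naming are all assumptions for illustration.

```shell
# Hypothetical helper: print one callVarBam command per contig name.
# Nothing is executed; only options shown in the command above are used.
emit_callvarbam_commands() {
    out_dir="$1"; shift
    for ctg in "$@"; do
        printf 'python %s callVarBam --chkpnt_fn "%s" --ref_fn "%s" --bam_fn "%s" --ctgName "%s" --sampleName "%s" --call_fn "%s/%s.vcf"\n' \
            "$CLAIR" "$MODEL" "$REFERENCE_FASTA_FILE_PATH" "$BAM_FILE_PATH" \
            "$ctg" "$SAMPLE_NAME" "$out_dir" "$ctg"
    done
}

# Placeholder defaults (assumptions); replace them with your own values.
CLAIR="${CLAIR:-clair.py}"
MODEL="${MODEL:-./ont/model}"
REFERENCE_FASTA_FILE_PATH="${REFERENCE_FASTA_FILE_PATH:-ref.fa}"
BAM_FILE_PATH="${BAM_FILE_PATH:-sample.bam}"
SAMPLE_NAME="${SAMPLE_NAME:-HG001}"

# Print commands for three example contigs, writing VCFs under calls/.
emit_callvarbam_commands calls chr20 chr21 chr22
```

Redirect the output to a file to review the commands before running them, and create the output directory first (for example with mkdir -p calls), since the output directory must exist.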

Call variants at known variant sites in a chromosome

# variables
VARIANT_CALLING_OUTPUT_PATH="[YOUR_OUTPUT_PATH]"         # e.g. calls/chr21.vcf (please make sure the directory exists)
CONTIG_NAME="[CONTIG_NAME_FOR_VARIANT_CALLING]"          # e.g. chr21
SAMPLE_NAME="[SAMPLE_NAME]"                              # e.g. HG001
KNOW
