ClairS

ClairS - a deep-learning method for long-read somatic small variant calling

<div align="center"> <img src="images/clairs_icon.png" width = "200" alt="ClairS"> </div>

Contact: Ruibang Luo, Zhenxian Zheng
Email: rbluo@cs.hku.hk, zxzheng@cs.hku.hk


Introduction

ClairS is a somatic variant caller designed for paired tumor/normal samples, primarily for ONT long reads. It uses Clair3 to eliminate germline variants. It ensembles the pileup and full-alignment models in Clair3, trusts them equally, and decides on the result using a set of rules and post-processing filters. Using 50-fold HCC1395 (tumor) and 25-fold HCC1395BL (normal) ONT R10.4.1 data and benchmarking against the truth SNVs (Fang et al., 2021), ClairS achieved 86.86%/93.01% recall/precision for SNVs when targeting VAF ≥0.05. For variants with VAF ≥0.2, the numbers go up to 94.65%/96.63%. Detailed performance figures are shown below.
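As a reading aid for the recall/precision pairs quoted above, the sketch below shows how the two metrics are defined from call counts and how they can be combined into an F1 score. The helper names are hypothetical, and F1 is not a figure reported in this README; it is derived here only for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are in the truth set."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truth variants that were called."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Recall/precision quoted in the introduction (SNVs, ONT R10.4.1 data).
f1_at_vaf_005 = f1_score(p=0.9301, r=0.8686)  # VAF >= 0.05
f1_at_vaf_02 = f1_score(p=0.9663, r=0.9465)   # VAF >= 0.2

print(f"F1 at VAF>=0.05: {f1_at_vaf_005:.4f}")
print(f"F1 at VAF>=0.2:  {f1_at_vaf_02:.4f}")
```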

ClairS means Clair-Somatic, or the masculine plural of "Clair" in French (thus, the 's' is silent).

The logo of ClairS was generated using DALL-E 2 with prompt "A DNA sequence with genetic variant that looks like a letter 'S'".

A preprint describing ClairS's algorithms and results is at bioRxiv.

For germline variant calling using a DNA-seq sample, please try Clair3.

For germline variant calling using a long-read RNA-seq sample, please try Clair3-RNA.

For somatic variant calling using a tumor-only sample, please try ClairS-TO.


Performance figures

ONT Q20+ chemistry performance

The latest performance figures as of Oct 10th, 2024 (ClairS v0.4.0) are available in this technical note.

Performance comparison between “ClairS v0.4.0 with the SS model”, “ClairS v0.4.0 with the SS+RS model”, and “DeepSomatic v1.7.0” at (a) different coverages and (b) different AF ranges, for SNV and Indel, respectively.

PacBio Revio SNV performance

  • HCC1395/HCC1395BL tumor/normal of PacBio Revio data, using SMRTbell prep kit 3.0
  • Truth: High-confidence (HighConf) and medium-confidence (MedConf) SNVs from the SEQC2 HCC1395/BL truths (Fang et al., 2021) whose TVAF (tumor variant allele frequency) is ≥0.05 in the above dataset

The performance of ClairS at multiple VAF ranges and multiple tumor coverages with the normal coverage fixed at 25x

The performance of ClairS at multiple VAF ranges and multiple normal coverages with the tumor coverage fixed at 50x

Illumina SNV performance

  • HCC1395/HCC1395BL tumor/normal of Illumina NovaSeq 6000 and HiSeq 4000 data
  • Truth: High-confidence (HighConf) and medium-confidence (MedConf) SNVs from the SEQC2 HCC1395/BL truths (Fang et al., 2021) whose TVAF (tumor variant allele frequency) is ≥0.05 in the above dataset

The precision-recall curve of different tumor/normal purity combinations with tumor coverage fixed at 50x and normal coverage fixed at 25x


Latest Updates

v0.4.4 (Nov 28, 2025) : Documentation update. Added a document illustrating how to integrate LongPhase-S as a post-filter. By reconstructing somatic haplotypes and inferring tumor purity, LongPhase-S identifies false somatic variants that are inconsistent with the somatic haplotypes and flags them as “LowQual”. ONT upgraded its sequencing kit and chemistry from 4 kHz to 5 kHz in early 2024. While ClairS offers a 4 kHz model (ont_r10_dorado_sup_4khz) for legacy data, the 4 kHz model will not receive future updates. Because some existing datasets were sequenced at 4 kHz, the following table still reports the performance of our 5 kHz model on COLO829/BL 4 kHz data as an anchor, although applying a 5 kHz model to 4 kHz data is not a recommended practice.

<div align="center"> <img src="images/longphase-s_benchmark.png" width = "700" alt="longphase_benchmark"> </div>

v0.4.4 (Nov 18, 2025) : Updated the ONT and PacBio ssrs models, now trained with base quality jittering and more training samples covering a wider range of tumor/normal coverages and tumor purities. Performance improved consistently compared with v0.4.3.

v0.4.3 (Jul 9, 2025) : Added parsing of the model_specific_settings.conf file in a model's folder; parameters are set according to it. In this version, snv_min_qual= and indel_min_qual= are supported in the configuration file.
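The behaviour described for v0.4.3 can be pictured with the minimal sketch below. It is not ClairS's actual parser: it assumes the file holds one `key=value` pair per line, that `#` starts a comment, and that values are numeric. The function names and the example values are hypothetical.

```python
from pathlib import Path

# Keys named in the v0.4.3 release note.
SUPPORTED_KEYS = {"snv_min_qual", "indel_min_qual"}


def parse_model_settings(text: str) -> dict:
    """Parse key=value lines, keeping only the supported keys (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in SUPPORTED_KEYS and value:
            settings[key] = float(value)
    return settings


def load_model_settings(model_dir: str) -> dict:
    """Read model_specific_settings.conf from a model folder if it exists."""
    conf = Path(model_dir) / "model_specific_settings.conf"
    return parse_model_settings(conf.read_text()) if conf.is_file() else {}


# Hypothetical file content; the numbers are placeholders, not ClairS defaults.
example = "snv_min_qual=8\nindel_min_qual=12\nunknown_key=1\n"
print(parse_model_settings(example))
```

Ignoring unknown keys keeps the sketch tolerant of settings that later versions might add.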

v0.4.2 (Jun 29, 2025) : Added --snv_min_qual and --indel_min_qual options to independently set the minimum QUAL threshold for SNVs and Indels to be marked as 'PASS', while deprecating the legacy --qual option.
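To illustrate what separate thresholds mean in practice, here is a small sketch that applies one minimum QUAL to SNVs and another to Indels. It is an assumption-laden illustration rather than ClairS code: the function names are hypothetical, the inclusive `>=` comparison and the "LowQual" label for non-passing calls are assumed, and the threshold values in the usage lines are placeholders.

```python
def variant_type(ref: str, alt: str) -> str:
    """Treat single-base REF and ALT as an SNV; anything else as an Indel."""
    return "snv" if len(ref) == 1 and len(alt) == 1 else "indel"


def assign_filter(ref: str, alt: str, qual: float,
                  snv_min_qual: float, indel_min_qual: float) -> str:
    """Return 'PASS' when QUAL meets the threshold for the variant's type."""
    if variant_type(ref, alt) == "snv":
        threshold = snv_min_qual
    else:
        threshold = indel_min_qual
    # Assumption: the comparison is inclusive and failures are labeled LowQual.
    return "PASS" if qual >= threshold else "LowQual"


# Placeholder thresholds, chosen only to show the two types diverging.
print(assign_filter("A", "T", qual=10, snv_min_qual=8, indel_min_qual=12))
print(assign_filter("A", "AT", qual=10, snv_min_qual=8, indel_min_qual=12))
```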

v0.4.1 (Nov 29) : Added ssrs models for the PacBio Revio (hifi_revio_ssrs) and Illumina (ilmn_ssrs) platforms.

v0.4.0 (Oct 11) : This version is a major update. The new features and benchmarks are explained in a technical note titled “Improving the performance of ClairS and ClairS-TO with new real cancer cell-line datasets and PoN”. A summary of changes: 1. Starting from this version, ClairS provides two model types. ssrs is a model trained initially with synthetic samples and then augmented with real samples (e.g., ont_r10_dorado_sup_5khz_ssrs); ss is a model trained from synthetic samples only (e.g., ont_r10_dorado_sup_5khz_ss). The ssrs model provides better performance and fits most usage scenarios. The ss model can be used when the absence of a cancer type from model training is a concern. In v0.4.0, four real cancer cell-line datasets (HCC1937/BL, HCC1954/BL, H1437/BL, and H2009/BL) covering two cancer types (breast cancer, lung cancer) published by Park et al. were used for ssrs model training. 2. Added BQ jittering in model training to address the BQ distribution difference between the training and calling datasets, which leads to a performance drop. 3. Added the --indel_min_af option and adjusted the default minimum allelic fraction requirement to 0.1 for Indels on the ONT platform.
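The minimum allelic fraction requirement in item 3 can be sketched as follows. This is only an illustration of the concept, assuming the allelic fraction is the number of reads supporting the alternative allele divided by the read depth; the function names are hypothetical, and only the 0.1 default for ONT Indels comes from the note above.

```python
def allelic_fraction(alt_reads: int, depth: int) -> float:
    """Share of reads at a site that support the alternative allele."""
    return alt_reads / depth if depth > 0 else 0.0


def indel_meets_min_af(alt_reads: int, depth: int,
                       indel_min_af: float = 0.1) -> bool:
    """Check an Indel candidate against the minimum allelic fraction."""
    return allelic_fraction(alt_reads, depth) >= indel_min_af


# Hypothetical read counts at 50x tumor coverage.
print(indel_meets_min_af(alt_reads=6, depth=50))  # AF = 0.12
print(indel_meets_min_af(alt_reads=4, depth=50))  # AF = 0.08
```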

v0.3.1 (Aug 16) : 1. Added four options: i. --use_heterozygous_snp_in_tumor_sample_and_normal_bam_for_intermediate_phasing, ii. --use_heterozygous_snp_in_normal_sample_and_normal_bam_for_intermediate_phasing, iii. --use_heterozygous_snp_in_tumor_sample_and_tumor_bam_for_intermediate_phasing, and iv. --use_heterozygous_snp_in_normal_sample_and_tumor_bam_for_intermediate_phasing. Option iii is equivalent to --use_heterozygous_snp_in_tumor_sample_for_intermediate_phasing added in v0.2.0. Option iv is equivalent to --use_heterozygous_snp_in_normal_sample_for_intermediate_phasing added in v0.2.0. Using the normal BAM for intermediate phasing was requested by @Sergey Aganezov. When the coverages of normal and tumor are similar, using the normal BAM for intermediate phasing made a negligible difference compared with using the tumor BAM in our experiments with HCC1395/BL. 2. Added --haplotagged_tumor_bam_provided_so_skip_intermediate_phasing_and_haplotagging to use the haplotype information provided in the tumor BAM directly and skip intermediate phasing and haplotagging. This option is useful when ClairS is used in a pipeline in which the tumor BAM is phased before running ClairS. BAMs haplotagged by WhatsHap or LongPhase are accepted. 3. Bumped up the Clair3 dependency to version 1.0.10 and LongPhase to version 1.7.3.
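The relationship between the two v0.2.0 options and the four v0.3.1 options can be written down as a small lookup. The sketch below is a hypothetical helper for pipeline authors, not part of ClairS: it maps the v0.2.0 names to their stated v0.3.1 equivalents and reports which sample supplies the heterozygous SNPs and which BAM is used for phasing.

```python
import re

# Equivalences stated in the v0.3.1 note (options iii and iv).
LEGACY_TO_CURRENT = {
    "--use_heterozygous_snp_in_tumor_sample_for_intermediate_phasing":
        "--use_heterozygous_snp_in_tumor_sample_and_tumor_bam_for_intermediate_phasing",
    "--use_heterozygous_snp_in_normal_sample_for_intermediate_phasing":
        "--use_heterozygous_snp_in_normal_sample_and_tumor_bam_for_intermediate_phasing",
}

OPTION_PATTERN = re.compile(
    r"^--use_heterozygous_snp_in_(tumor|normal)_sample"
    r"_and_(tumor|normal)_bam_for_intermediate_phasing$"
)


def normalize_phasing_option(option: str) -> str:
    """Rewrite a v0.2.0 option name to its v0.3.1 equivalent, if it has one."""
    return LEGACY_TO_CURRENT.get(option, option)


def phasing_sources(option: str) -> tuple:
    """Return (SNP sample, phasing BAM) encoded in the option name."""
    match = OPTION_PATTERN.match(normalize_phasing_option(option))
    if match is None:
        raise ValueError(f"not an intermediate-phasing option: {option}")
    return match.group(1), match.group(2)


print(phasing_sources(
    "--use_heterozygous_snp_in_normal_sample_for_intermediate_phasing"))
```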

v0.3.0 (Jul 5) : 1. Added a module called “verdict” (option --enable_verdict) to statistically classify a called variant as a germline, somatic, or subclonal somatic variant based on the CNV profile and tumor purity estimation. Please find more technical details about the Verdict module here. 2. Improved model training speed, making training about three times faster.

v0.2.0 (Apr 29) : 1. Added the --use_heterozygous_snp_in_normal_sample_for_intermediate_phasing/--use_heterozygous_snp_in_tumor_sample_for_intermediate_phasing options to support using heterozygous SNPs in either the normal sample or the tumor sample for intermediate phasing. Previous versions used in_tumor_sample for phasing. In this version, when testing with ONT 4kHz HCC1395/BL and using in_normal_sample for intermediate phasing, the SNV precision improved by ~2%, while recall remained unchanged. in_normal_sample becomes the default from this version. However, if the coverage of the normal sample is low, please consider switching back to in_tumor_sample (#22, idea contributed by the longphase team @sloth-eat-pudding). 2. Added --use_heterozygous_indel_for_intermediate_phasing to include high-quality heterozygous Indels for intermediate phasing. With
