SkillAgentSearch skills...

Wakhan

Haplotype-specific somatic copy number aberrations/profiling from long reads sequencing data

Install / Use

/learn @KolmogorovLab/Wakhan
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Wakhan

Version 0.4.2

A tool to analyze haplotype-specific chromosome-scale somatic copy number aberrations and aneuploidy using long reads (Oxford Nanopore, PacBio). Wakhan takes long-read alignment and phased heterozygous variants as input, and first extends the phased blocks and corrects phase-switch errors, taking advantage of the copy numbers differences between the haplotypes. Wakhan estimates purity and ploidy of the sample and generates inetractive haplotype-specific copy number and coverage plots.

A detailed algorithm description and evaluation is available in our preprint.

Breakpoints/SVs based segmentation and Copy numbers estimation:

<img width="1373" alt="plots_example" src="examples/images/1437.png">

Bioconda installation

conda create -n wakhan_env wakhan
conda activate wakhan_env

Source installation

git clone https://github.com/KolmogorovLab/Wakhan.git
cd Wakhan/
conda env create -f environment.yml -n wakhan
conda activate wakhan

Usage

Prerequisites

Wakhan takes tumor alignment (in bam format), a phased germline variant calls (in vcf format) and somatic SV calls (in vcf format). Please refer to prerequisite section to generate the required inputs.

Alternatively, you can use the Lumos pipeline that generates all prerequisites and runs Wakhan.

Note that if the organism has low-to-zero heterozygosity (e.g. mouse models), you should use the unphased input mode.

1. Standalone mode

Tumor-Normal Mode (requires tumor BAM and normal phased VCF)

python wakhan.py all --threads <24> --reference <ref.fa>  --target-bam <tumor.bam>  --normal-phased-vcf <normal_phased.vcf.gz>  --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>

Tumor-only (requires tumor BAM and tumor phased VCF)

python wakhan.py all --threads <24> --reference <ref.fa>  --target-bam <tumor.bam>  --tumor-phased-vcf <tumor_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>

2. Phased SVs/Breakpoints pipeline mode

Severus also produces phased breakpoints/structural variations after rephasing tumor (tumor-only mode) or normal (tumor/normal pair mode) phased VCF which can be used in Wakhan by setting --use-sv-haplotypes param. This option enables to segment copy numbers boundaries in only one appropriate haplotype.

To use phased SVs/breakpoints Wakhan works in two steps, in first step it uses hapcorrect() module for phase-correction and generates rephased VCF, which is used to haplotag BAMs through Whatshap, then Severus uses haplotagged BAMs to generate phased SVs. In second step, Wakhan takes this resultant Severus phased SVs VCF and runs cna() module with --use-sv-haplotypes param.

For this purpose, we have developed Nextflow based Long-Read Somatic Variant Calling pipeline. This pipeline (Wakhan - hapcorrect -> Whatshap - haplotagging -> Severus - sv/breakpoints -> Wakhan - cna) can generate Severus phased SVs/breakpoints, which could be used in Wakhan by setting --use-sv-haplotypes param.

Alternatively, user can use following commands to mimic this workflow:

Tumor-Normal Mode

#Wakhan hapcorrect()
python wakhan.py hapcorrect --threads 16 --reference  ${REF_FASTA}  --target-bam ${BAM_T}  --normal-phased-vcf ${VCF}  --genome-name ${SAMPLE_T}

VCF='<genome_abc_output>/phasing_output/rephased.vcf.gz'

#Index for rephased VCF
tabix ${VCF}

#Haplotag Normal with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_N} -o ${SAMPLE_N}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4

#Haplotag Tumor with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_T} -o ${SAMPLE_T}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4

#Index for Normal haplotagged BAM 
samtools index ${SAMPLE_N}.haplotagged.bam
#Index for Tumor haplotagged BAM 
samtools index ${SAMPLE_T}.haplotagged.bam

#Severus tumor-normal mode
severus --target-bam ${SAMPLE_T}.haplotagged.bam --control-bam ${SAMPLE_N}.haplotagged.bam --out-dir severus_out \
    -t 16 --phasing-vcf ${VCF} --vntr-bed ./vntrs/human_GRCh38_no_alt_analysis_set.trf.bed

#Wakhan cna() tumor-normal mode
python wakhan.py cna --threads 16 --reference  ${REF_FASTA}  --target-bam ${BAM_T}  --normal-phased-vcf ${VCF}  --genome-name ${SAMPLE_T} \
    --breakpoints severus_out/severus_somatic.vcf --use-sv-haplotypes

Tumor-only

#Wakhan hapcorrect()
python wakhan.py hapcorrect --threads 16 --reference  ${REF_FASTA}  --target-bam ${BAM_T}  --tumor-phased-vcf ${VCF}  --genome-name ${SAMPLE_T} --out-dir-plots ${SAMPLE_T}

VCF='<genome_abc_output>/phasing_output/rephased.vcf.gz'

#Index for rephased VCF
tabix ${VCF}

#Haplotag Tumor with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_T} -o ${SAMPLE_T}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4

#Index for Tumor haplotagged BAM 
samtools index  ${SAMPLE_T}.haplotagged.bam

#Severus tumor-only mode
severus --target-bam ${SAMPLE_T}.haplotagged.bam --out-dir severus_out -t 16 --phasing-vcf ${VCF} \
    --vntr-bed ./vntrs/human_GRCh38_no_alt_analysis_set.trf.bed --PON ./pon/PoN_1000G_hg38.tsv.gz

#Wakhan cna() tumor-only mode
python wakhan.py cna --threads 16 --reference  ${REF_FASTA}  --target-bam ${BAM_T}  --tumor-phased-vcf ${VCF}  --genome-name ${SAMPLE_T} \
    --breakpoints severus_out/severus_somatic.vcf --use-sv-haplotypes

Key required/recommended parameters

  • --reference Reference file path

  • --target-bam path to target bam files (must be indexed)

  • --out-dir-plots path to output coverage plots

  • --genome-name genome cellline/sample name to be displayed on plots

  • --normal-phased-vcf normal phased VCF file (tumor/normal pair mode) to generate het SNPs frequncies pileup for tumor BAM (if tumor-only mode, use phased --tumor-phased-vcf instead)

  • --tumor-phased-vcf phased VCF is required in tumor-only mode

  • --breakpoints (Highly recommended) Somatic SV calls in vcf format

  • --threads number of threads to use

Useful optional parameters

  • --contigs List of contigs (chromosomes, default: chr1-22,chrX) to be included in the plots [e.g., chr1-22,chrX,chrY], Note: Please use 1-22,X [e.g., 1-22,X,Y] in case REF, BAM, and VCFs entries don't contain chr name/notion, same should be observed in --centromere and --cancer-genes params in case no chr in names, use *_nochr.bed instead available in src/annotations or customized.

  • --use-sv-haplotypes To use phased Severus SV/breakpoints [default: disabled]

  • --cpd-internal-segments For change point detection algo on internal segments after breakpoint/cpd algo for more precise segmentation.

  • --cut-threshold Maximum cut threshold for coverage (readdepth) plots [default: 100]

  • --centromere Path to centromere annotations BED file [default: annotations/grch38.cen_coord.curated.bed]

  • --cancer-genes Path to Cancer Genes in TSV format to display in CNA plots [default: annotations/CancerGenes.tsv]

  • --pdf-enable Enabling PDF output for plots

3. Unphased mode

Wakhan can also be used in unphased mode, in case the phasing quality is low or genome is effectively haploid (e.g. mouse models). Use the --without-phasing to enable unphased mode.

A sample command-line for running a mouse genome analysis in unphased mode (note additional argument for centromere annotation)

python wakhan.py all --threads 16 --reference MM_10  --target-bam TUMOR.bam --tumor-phased-vcf TUMOR.vcf --out-dir-plots OUT_DIR --genome-name SAMPLE --breakpoints SEVERUS_SOMATIC.vcf --without-phasing --centromere WAKHAN_DIR/annotations/mouse_chr.bed> [--contigs chr1-19,chrX]

Here is a sample copy number/breakpoints output plot without phasing for a mouse subline dataset. <img width="1373" alt="plots_example" src="examples/images/C15.png">

Breakpoints/Structural variations or change point detection algo for copy number model

Wakhan accepts Severus structural variants VCF as breakpoints with param --breakpoints inputs to detect copy number changes and this option is highly recommended. However, if --breakpoints option is not used, --change-point-detection-for-cna should be used instead to use change point detection algorithm [ruptures](https://centre-borelli.github

View on GitHub
GitHub Stars71
CategoryProduct
Updated6d ago
Forks5

Languages

Python

Security Score

95/100

Audited on Mar 24, 2026

No findings