Wakhan
Haplotype-specific somatic copy number aberrations/profiling from long reads sequencing data
Install / Use
/learn @KolmogorovLab/WakhanREADME
Wakhan
Version 0.4.2
A tool to analyze haplotype-specific chromosome-scale somatic copy number aberrations and aneuploidy using long reads (Oxford Nanopore, PacBio). Wakhan takes long-read alignment and phased heterozygous variants as input, and first extends the phased blocks and corrects phase-switch errors, taking advantage of the copy numbers differences between the haplotypes. Wakhan estimates purity and ploidy of the sample and generates inetractive haplotype-specific copy number and coverage plots.
A detailed algorithm description and evaluation is available in our preprint.
Breakpoints/SVs based segmentation and Copy numbers estimation:
<img width="1373" alt="plots_example" src="examples/images/1437.png">Bioconda installation
conda create -n wakhan_env wakhan
conda activate wakhan_env
Source installation
git clone https://github.com/KolmogorovLab/Wakhan.git
cd Wakhan/
conda env create -f environment.yml -n wakhan
conda activate wakhan
Usage
Prerequisites
Wakhan takes tumor alignment (in bam format), a phased germline variant calls (in vcf format) and somatic SV calls (in vcf format). Please refer to prerequisite section to generate the required inputs.
Alternatively, you can use the Lumos pipeline that generates all prerequisites and runs Wakhan.
Note that if the organism has low-to-zero heterozygosity (e.g. mouse models), you should use the unphased input mode.
1. Standalone mode
Tumor-Normal Mode (requires tumor BAM and normal phased VCF)
python wakhan.py all --threads <24> --reference <ref.fa> --target-bam <tumor.bam> --normal-phased-vcf <normal_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
Tumor-only (requires tumor BAM and tumor phased VCF)
python wakhan.py all --threads <24> --reference <ref.fa> --target-bam <tumor.bam> --tumor-phased-vcf <tumor_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
2. Phased SVs/Breakpoints pipeline mode
Severus also produces phased breakpoints/structural variations after rephasing tumor (tumor-only mode) or normal (tumor/normal pair mode) phased VCF which can be used in Wakhan by setting --use-sv-haplotypes param.
This option enables to segment copy numbers boundaries in only one appropriate haplotype.
To use phased SVs/breakpoints Wakhan works in two steps, in first step it uses hapcorrect() module for phase-correction and generates rephased VCF, which is used to haplotag BAMs through Whatshap,
then Severus uses haplotagged BAMs to generate phased SVs. In second step, Wakhan takes this resultant Severus phased SVs VCF and runs cna() module with --use-sv-haplotypes param.
For this purpose, we have developed Nextflow based Long-Read Somatic Variant Calling pipeline.
This pipeline (Wakhan - hapcorrect -> Whatshap - haplotagging -> Severus - sv/breakpoints -> Wakhan - cna) can generate Severus phased SVs/breakpoints, which could be used in Wakhan by setting --use-sv-haplotypes param.
Alternatively, user can use following commands to mimic this workflow:
Tumor-Normal Mode
#Wakhan hapcorrect()
python wakhan.py hapcorrect --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --normal-phased-vcf ${VCF} --genome-name ${SAMPLE_T}
VCF='<genome_abc_output>/phasing_output/rephased.vcf.gz'
#Index for rephased VCF
tabix ${VCF}
#Haplotag Normal with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_N} -o ${SAMPLE_N}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4
#Haplotag Tumor with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_T} -o ${SAMPLE_T}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4
#Index for Normal haplotagged BAM
samtools index ${SAMPLE_N}.haplotagged.bam
#Index for Tumor haplotagged BAM
samtools index ${SAMPLE_T}.haplotagged.bam
#Severus tumor-normal mode
severus --target-bam ${SAMPLE_T}.haplotagged.bam --control-bam ${SAMPLE_N}.haplotagged.bam --out-dir severus_out \
-t 16 --phasing-vcf ${VCF} --vntr-bed ./vntrs/human_GRCh38_no_alt_analysis_set.trf.bed
#Wakhan cna() tumor-normal mode
python wakhan.py cna --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --normal-phased-vcf ${VCF} --genome-name ${SAMPLE_T} \
--breakpoints severus_out/severus_somatic.vcf --use-sv-haplotypes
Tumor-only
#Wakhan hapcorrect()
python wakhan.py hapcorrect --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --tumor-phased-vcf ${VCF} --genome-name ${SAMPLE_T} --out-dir-plots ${SAMPLE_T}
VCF='<genome_abc_output>/phasing_output/rephased.vcf.gz'
#Index for rephased VCF
tabix ${VCF}
#Haplotag Tumor with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_T} -o ${SAMPLE_T}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4
#Index for Tumor haplotagged BAM
samtools index ${SAMPLE_T}.haplotagged.bam
#Severus tumor-only mode
severus --target-bam ${SAMPLE_T}.haplotagged.bam --out-dir severus_out -t 16 --phasing-vcf ${VCF} \
--vntr-bed ./vntrs/human_GRCh38_no_alt_analysis_set.trf.bed --PON ./pon/PoN_1000G_hg38.tsv.gz
#Wakhan cna() tumor-only mode
python wakhan.py cna --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --tumor-phased-vcf ${VCF} --genome-name ${SAMPLE_T} \
--breakpoints severus_out/severus_somatic.vcf --use-sv-haplotypes
Key required/recommended parameters
-
--referenceReference file path -
--target-bampath to target bam files (must be indexed) -
--out-dir-plotspath to output coverage plots -
--genome-namegenome cellline/sample name to be displayed on plots -
--normal-phased-vcfnormal phased VCF file (tumor/normal pair mode) to generate het SNPs frequncies pileup for tumor BAM (if tumor-only mode, use phased--tumor-phased-vcfinstead) -
--tumor-phased-vcfphased VCF is required in tumor-only mode -
--breakpoints(Highly recommended) Somatic SV calls in vcf format -
--threadsnumber of threads to use
Useful optional parameters
-
--contigsList of contigs (chromosomes, default: chr1-22,chrX) to be included in the plots [e.g., chr1-22,chrX,chrY], Note: Please use 1-22,X [e.g., 1-22,X,Y] in case REF, BAM, and VCFs entries don't containchrname/notion, same should be observed in--centromereand--cancer-genesparams in case nochrin names, use*_nochr.bedinstead available insrc/annotationsor customized. -
--use-sv-haplotypesTo use phased Severus SV/breakpoints [default: disabled] -
--cpd-internal-segmentsFor change point detection algo on internal segments after breakpoint/cpd algo for more precise segmentation. -
--cut-thresholdMaximum cut threshold for coverage (readdepth) plots [default: 100] -
--centromerePath to centromere annotations BED file [default: annotations/grch38.cen_coord.curated.bed] -
--cancer-genesPath to Cancer Genes in TSV format to display in CNA plots [default: annotations/CancerGenes.tsv] -
--pdf-enableEnabling PDF output for plots
3. Unphased mode
Wakhan can also be used in unphased mode, in case the phasing quality is low or genome is effectively haploid (e.g. mouse models).
Use the --without-phasing to enable unphased mode.
A sample command-line for running a mouse genome analysis in unphased mode (note additional argument for centromere annotation)
python wakhan.py all --threads 16 --reference MM_10 --target-bam TUMOR.bam --tumor-phased-vcf TUMOR.vcf --out-dir-plots OUT_DIR --genome-name SAMPLE --breakpoints SEVERUS_SOMATIC.vcf --without-phasing --centromere WAKHAN_DIR/annotations/mouse_chr.bed> [--contigs chr1-19,chrX]
Here is a sample copy number/breakpoints output plot without phasing for a mouse subline dataset. <img width="1373" alt="plots_example" src="examples/images/C15.png">
Breakpoints/Structural variations or change point detection algo for copy number model
Wakhan accepts Severus structural variants VCF as breakpoints with param --breakpoints inputs to detect copy number changes and this option is highly recommended.
However, if --breakpoints option is not used, --change-point-detection-for-cna should be used instead to use change point detection algorithm [ruptures](https://centre-borelli.github
