SVclone
A computational method for inferring the cancer cell fraction of tumour structural variation from whole-genome sequencing data.
This package is used to cluster structural variants of similar cancer cell fraction (CCF). SVclone is divided into five components: annotate, count, filter, cluster and post-assign. The annotate step infers directionality of each breakpoint (if not supplied), recalibrates breakpoint position to the soft-clip boundary and subsequently classifies SVs using a rule-based approach. The count step counts the variant and non-variant reads from breakpoint locations. Both the annotate and count steps utilise BAM-level information. The filter step removes SVs based on a number of adjustable parameters and prepares the variants for clustering. SNVs can also be added at this step as well as CNV information, which is matched to SV and SNV loci. The post-assign step then allows SVs to be assigned to the derived model from SNV clustering. Optionally, SVs and SNVs can also be post-assigned to a joint SV + SNV model.
How do I get set up?
First install conda. SVclone can then be installed via:
conda install svclone -c bioconda -c conda-forge
svclone --help
Alternatively, you may wish to install SVclone in its own conda virtual environment:
conda create -n svclone -c bioconda -c conda-forge svclone
conda activate svclone
svclone --help
If your site supports Modules and EasyBuild, SVclone can be installed with:
eb SVclone-1.1.2-foss-2022b.eb
module load SVclone
Example data
Example data is provided to test your SVclone installation (data contains simulated clonal deletions). If you would like to run the tests, this can be done via:
git clone https://github.com/mcmero/SVclone.git
cd SVclone
./run_example.sh
You can check the following output plot, which will summarise the clustering result:
- tumour_p80_DEL/ccube_out/tumour
You can also test the simulated SNV data by running:
./run_example_wsnvs.sh
The simulated data contains a 100% CCF clone and a 30% subclone.
Annotate step
Before running SVclone on real data, first download the configuration file via:
wget https://raw.githubusercontent.com/mcmero/SVclone/master/svclone_config.ini
Check the settings carefully and make sure that the config is appropriate for your data set. Make sure that this file is in the directory from which you're running SVclone, or that you've specified its location using -cfg <config_file>.
An indexed whole-genome sequencing BAM and a list of paired breakpoints from an SV caller of choice are required. This step is required for clustering of SVs; however, classification and directionality information from your choice of SV caller can be used rather than being inferred.
svclone annotate -i <sv_input> -b <indexed_bamfile> -s <sample_name>
Input is expected in VCF format (directionality inferred from the ALT field is also supported). Each defined SV must have a matching mate, given in the MATEID value in the INFO section. Please see the VCF spec (section 3) for representing SVs using the VCF format. SVclone does not support unpaired break-ends, which means that the INFO field PARID must be specified (please see Section 5.4.4 in the VCF spec for an example).
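The mate requirement can be checked before running annotate. The following is a minimal sketch, not part of SVclone: it assumes tab-separated VCF lines with one MATEID per record, and the helper names (parse_info, unpaired_breakends) and the example records are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'SVTYPE=BND;MATEID=x' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def unpaired_breakends(vcf_lines):
    """Return IDs of records whose MATEID is missing or not reciprocated."""
    mates = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        record_id, info = cols[2], parse_info(cols[7])
        mates[record_id] = info.get("MATEID")
    return sorted(
        rid for rid, mate in mates.items()
        if mate is None or mates.get(mate) != rid
    )


# Hypothetical records, for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "2\t321681\tbnd_W\tG\tG]17:198982]\t6\tPASS\tSVTYPE=BND;MATEID=bnd_Y",
    "17\t198982\tbnd_Y\tA\tA]2:321681]\t6\tPASS\tSVTYPE=BND;MATEID=bnd_W",
    "13\t123456\tbnd_U\tC\tC[2:321682[\t6\tPASS\tSVTYPE=BND",
]
print(unpaired_breakends(example))  # ['bnd_U']
```

Any ID reported here belongs to a break-end without a reciprocal mate, which would need to be paired or removed from the input.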
Input may also be entered in Socrates or simple format (must be specified with --sv_format simple or --sv_format socrates). Simple format is as follows:
chr1 pos1 chr2 pos2
22 18240676 22 18232335
22 19940482 22 19937820
22 21383572 22 21382745
22 21395573 22 21395746
We recommend that directions from the SV caller of choice be used (use_dir must be set to True in the configuration file in this case). Optionally, if you already know the SV classifications, the name of the classification field can be specified in the config file (e.g. sv_class_field: classification).
An example of the 'full' SV simple format is as follows:
chr1 pos1 dir1 chr2 pos2 dir2 classification
22 18240676 - 22 18232335 - INV
22 19940482 - 22 19937820 + DEL
22 21383572 - 22 21382745 + DUP
22 21395573 + 22 21395746 + INV
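To sanity-check a 'full' simple-format file before passing it to annotate, something like the following sketch can be used. It is not SVclone's own parser: it assumes whitespace-delimited columns in the order shown above, and SimpleSV and parse_full_simple are hypothetical names.

```python
from typing import NamedTuple


class SimpleSV(NamedTuple):
    chr1: str
    pos1: int
    dir1: str
    chr2: str
    pos2: int
    dir2: str
    classification: str


def parse_full_simple(text):
    """Parse 'full' simple-format rows, skipping the header and blank lines."""
    records = []
    for line in text.splitlines():
        cols = line.split()
        if not cols or cols[0] == "chr1":
            continue  # blank line or header row
        if len(cols) != 7:
            raise ValueError(f"expected 7 columns, got {len(cols)}: {line!r}")
        if cols[2] not in {"+", "-"} or cols[5] not in {"+", "-"}:
            raise ValueError(f"directions must be '+' or '-': {line!r}")
        records.append(SimpleSV(cols[0], int(cols[1]), cols[2],
                                cols[3], int(cols[4]), cols[5], cols[6]))
    return records


example = """\
chr1 pos1 dir1 chr2 pos2 dir2 classification
22 18240676 - 22 18232335 - INV
22 19940482 - 22 19937820 + DEL
22 21383572 - 22 21382745 + DUP
22 21395573 + 22 21395746 + INV
"""
svs = parse_full_simple(example)
print(len(svs), svs[1].classification)  # 4 DEL
```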
Note that the classifications in your SV input must match those specified in the configuration file (multiple names may be given as a comma-separated list):
[SVclasses]
# Naming conventions used to label SV types.
inversion_class: INV
deletion_class: DEL
dna_gain_class: DUP,INTDUP
dna_loss_class: DEL,INV,TRX
itrx_class: INTRX
Note that dna_gain_class will include any SV classification involving DNA duplication and dna_loss_class is any intra-chromosomal rearrangement not involving a gain (including balanced rearrangements). itrx_class refers to all inter-chromosomal translocations.
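One way to confirm that the classification labels in an SV input are covered by the configuration is sketched below, using Python's standard configparser. This is an illustration only, not SVclone's own config handling; configured_classes and unknown_classes are hypothetical helper names, and TANDEM_DUP is a made-up label used to show a mismatch.

```python
import configparser

CONFIG_TEXT = """
[SVclasses]
# Naming conventions used to label SV types.
inversion_class: INV
deletion_class: DEL
dna_gain_class: DUP,INTDUP
dna_loss_class: DEL,INV,TRX
itrx_class: INTRX
"""


def configured_classes(config_text):
    """Collect every class name listed in the [SVclasses] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    names = set()
    for value in parser["SVclasses"].values():
        names.update(name.strip() for name in value.split(",") if name.strip())
    return names


def unknown_classes(input_classes, config_text):
    """Return input labels that do not appear anywhere in [SVclasses]."""
    return sorted(set(input_classes) - configured_classes(config_text))


print(unknown_classes(["INV", "DEL", "DUP", "INV"], CONFIG_TEXT))  # []
print(unknown_classes(["DEL", "TANDEM_DUP"], CONFIG_TEXT))  # ['TANDEM_DUP']
```

An empty list means every label in the input is named in the configuration; any label that is returned would need to be added to the appropriate class entry or renamed in the input.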
A blacklist (BED file) can also be supplied at this step. Any SV with a breakpoint falling inside one of the blacklisted regions is removed and not processed.
Annotate Output
The 'full' simple format shown above also corresponds to the output of this step (written to <out>/<sample>_svin.txt), with an added SV ID column. Breakpoints that are considered part of the same event share the same ID, so one ID may cover multiple breakpoints.
Required Parameters
- -i or --input : structural variants input file (see above).
- -b or --bam : bam file with corresponding index file.
- -s or --sample : Sample name. Will create processed output file as <out>/<sample>_svin.txt, parameters output as <out>/<sample>_params.txt.
Optional Parameters
- -o or --out <directory> : output directory to create files. Default: the sample name.
- -cfg or --config <config.ini> : SVclone configuration file with additional parameters (svclone_config.ini is the default).
- --sv_format <vcf, simple, socrates> : input format of SV calls, VCF by default, but may also be simple (see above) or from the SV caller Socrates.
- --blacklist <file.bed> : Takes a list of intervals in BED format. Skip processing of any break-pairs where either SV break-end overlaps an interval specified in the supplied bed file. Using something like the DAC blacklist is recommended.
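The blacklist rule described above (skip a break-pair if either break-end overlaps an interval) can be illustrated with the sketch below. It is not SVclone's implementation: it assumes standard BED coordinates (0-based start, exclusive end), 1-based SV positions, and matching chromosome names between the two files. The function names and the two intervals are hypothetical.

```python
def load_bed(lines):
    """Read BED intervals into {chrom: [(start, end), ...]}."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank, comment and UCSC header lines
        chrom, start, end = line.split()[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals


def in_blacklist(chrom, pos, intervals):
    """True if a 1-based position lies inside a 0-based half-open interval."""
    return any(start < pos <= end for start, end in intervals.get(chrom, []))


def keep_pairs(svs, intervals):
    """Drop break-pairs where either break-end overlaps a blacklisted interval."""
    return [
        sv for sv in svs
        if not (in_blacklist(sv[0], sv[1], intervals)
                or in_blacklist(sv[2], sv[3], intervals))
    ]


# Hypothetical blacklist intervals, for illustration only.
bed = ["22\t18232000\t18233000", "22\t21395000\t21395600"]
# (chr1, pos1, chr2, pos2) taken from the simple-format example above.
svs = [
    ("22", 18240676, "22", 18232335),
    ("22", 19940482, "22", 19937820),
    ("22", 21383572, "22", 21382745),
    ("22", 21395573, "22", 21395746),
]
kept = keep_pairs(svs, load_bed(bed))
print(len(kept))  # 2
```

Here the first pair is dropped because its second break-end falls in a blacklisted interval, and the last pair is dropped because its first break-end does, leaving the second and third pairs.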
Count step
Run the count submodule to obtain read counts around breakpoints in each sample BAM file like so:
svclone count -i <svs> -b <indexed_bamfile> -s <sample_name>
The classification strings are not used by the program, except for DNA-gain events (such as duplications). The classification names for these types of SVs should be specified in the svclone_config.ini file (see configuration file section).
Count output
The count step will create a tab-separated <out>/<sample>_svinfo.txt file containing count information. For example:
ID chr1 pos1 dir1 chr2 pos2 dir2 classification split_norm1 norm_olap_bp1 span_norm1 win_norm1 split1 sc_bases1 total_reads1 split_norm2 norm_olap_bp2 span_norm2 win_norm2 split2 sc_bases2 total_reads2 anomalous spanning norm1 norm2 support vaf1 vaf2
1 12 227543 + 12 228250 - DEL 12 405 13 96 4 215 189 15 473 8 94 4 149 190 32 4 25 23 12 0.32432432432432434 0.34285714285714286
2 12 333589 + 12 338298 - DEL 19 585 23 132 1 69 222 19 492 13 100 8 385 213 18 12 42 32 21 0.33333333333333331 0.39622641509433965
3 12 461142 + 12 465988 - DEL 14 490 12 120 6 247 202 12 374 16 104 6 149 214 20 6 26 28 18 0.40909090909090912 0.39130434782608697
4 12 526623 + 12 554937 - DEL 11 322 18 112 8 312 220 17 567 15 106 8 232 205 12 9 29 32 25 0.46296296296296297 0.43859649122807015
5 12 693710 + 12 696907 - DEL 13 433 15 104 9 329 212 16 446 21 138 5 245 229 20 9 28 37 23 0.45098039215686275 0.38333333333333336
The output fields are briefly described:
- split: split read count at each locus
- split_norm/span_norm: number of normal split and spanning reads crossing the boundary at locus 1 and 2 respectively.
- norm_olap_bp: count of normal read base-pairs overlapping the break (for normal reads that cross the break boundary).
- win_norm: normal read count (no soft-clips, normal insert size) for all normal reads extracted from the locus window (+/- insert size from locus).
- sc_bases: count of soft-clipped bases corresponding to split reads crossing the break.
- norm: normal read count at each locus.
- spanning: number of spanning reads supporting the break.
- support: split1 + split2 + spanning.
- anomalous: reads not counted in any other category.
- vaf: support / (norm + support).
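As a worked check of the support and vaf definitions, the sketch below recomputes both from the first example row above (ID 1). The function names are hypothetical, and the zero-denominator case is not handled.

```python
import math

# Values copied from the first example row of the count output (ID 1).
row = {
    "split1": 4, "split2": 4, "spanning": 4,
    "norm1": 25, "norm2": 23, "support": 12,
    "vaf1": 0.32432432432432434, "vaf2": 0.34285714285714286,
}


def compute_support(split1, split2, spanning):
    """support = split1 + split2 + spanning."""
    return split1 + split2 + spanning


def compute_vaf(support, norm):
    """vaf = support / (norm + support)."""
    return support / (norm + support)


support = compute_support(row["split1"], row["split2"], row["spanning"])
vaf1 = compute_vaf(support, row["norm1"])  # 12 / (25 + 12)
vaf2 = compute_vaf(support, row["norm2"])  # 12 / (23 + 12)

# The recomputed values agree with the reported columns.
assert support == row["support"]
assert math.isclose(vaf1, row["vaf1"])
assert math.isclose(vaf2, row["vaf2"])
print(support, round(vaf1, 4), round(vaf2, 4))  # 12 0.3243 0.3429
```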
Required Parameters
- -i or --input : structural variants input file. This should be the output file from the annotate step.
- -b or --bam : bam file with corresponding index file.
- -s or --sample : Sample name. Will create processed output file as <out>/<sample>_svinfo.txt, parameters output as <out>/<sample>_params.txt.
Optional Parameters
- -o or --out <directory> : output directory to create files. Default: the sample name.
- -cfg or --config <config.ini> : SVclone configuration file with additional parameters (svclone_config.ini is the default).
Filter step (Filter SVs and/or SNVs and attach CNV states)
The data obtained from the SV counting step and/or SNV data can be filtered like so:
svclone filter -i <sv_info.txt> -s <sample_name>
Note that read length and insert sizes used by the filter step are provided as outputs from the count step (<out>/read_params.txt), based on the first 50,000 sampled reads in the bam file.
Filter output
The filter step outputs the file <out>/<sample>_filtered_svs.tsv and/or <out>/<sample>_filtered_snvs.tsv depending on input. For SVs, the output is akin to the _svinfo.txt file format with added fields:
- norm_mean: average of norm1 and norm2
- gtype1/2: copy-number states at loci 1 and 2, given as "major, minor, CNV fraction", for example "1,1,1.0". May be subclonal if Battenberg input is supplied.