
SVclone

A computational method for inferring the cancer cell fraction of tumour structural variation from whole-genome sequencing data.



This package clusters structural variants (SVs) of similar cancer cell fraction (CCF). SVclone is divided into five components: annotate, count, filter, cluster and post-assign. The annotate step infers the directionality of each breakpoint (if not supplied), recalibrates the breakpoint position to the soft-clip boundary and then classifies SVs using a rule-based approach. The count step counts the variant and non-variant reads at breakpoint locations. Both the annotate and count steps use BAM-level information. The filter step removes SVs based on a number of adjustable parameters and prepares the variants for clustering. SNVs and CNV information can also be added at this step; the CNV information is matched to SV and SNV loci. The cluster step then groups variants of similar CCF. The post-assign step allows SVs to be assigned to the model derived from SNV clustering. Optionally, SVs and SNVs can also be post-assigned to a joint SV + SNV model.

How do I get set up?

First install conda. SVclone can then be installed via:

conda install svclone -c bioconda -c conda-forge
svclone --help

Alternatively, you may wish to install SVclone in its own conda virtual environment:

conda create -n svclone -c bioconda -c conda-forge svclone
conda activate svclone
svclone --help

If your site supports Modules and EasyBuild, SVclone can be installed with:

eb SVclone-1.1.2-foss-2022b.eb
module load SVclone

Example data

Example data is provided to test your SVclone installation (the data contains simulated clonal deletions). To run the tests:

git clone https://github.com/mcmero/SVclone.git
cd SVclone

./run_example.sh

You can check the following output plot, which will summarise the clustering result:

  • tumour_p80_DEL/ccube_out/tumour

You can also test the simulated SNV data by running:

./run_example_wsnvs.sh

The simulated data contains a 100% CCF clone and a 30% subclone.

Annotate step

Before running SVclone on real data, first download the configuration file via:

wget https://raw.githubusercontent.com/mcmero/SVclone/master/svclone_config.ini

Check the settings carefully and make sure that the config is appropriate for your data set. Make sure that this file is in the directory from which you're running SVclone, or that you've specified its location using -cfg <config_file>.

An indexed whole-genome sequencing BAM and a list of paired breakpoints from an SV caller of your choice are required. This step must be run before SVs can be clustered; however, classification and directionality information from your SV caller can be used rather than being inferred.

svclone annotate -i <sv_input> -b <indexed_bamfile> -s <sample_name>

Input is expected in VCF format (directionality inferred from the ALT field is also supported). Each defined SV must have a matching mate, given in the MATEID value in the INFO section. Please see the VCF spec (section 3) for representing SVs using the VCF format. SVclone does not support unpaired break-ends, which means that the INFO field PARID must be specified (please see Section 5.4.4 in the VCF spec for an example).
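
As a rough pre-flight check for the pairing requirement above, the following minimal sketch lists VCF records whose MATEID is missing or points at a record that is not in the file. It is not part of SVclone: the function name, the record IDs and the ALT strings are made up for illustration, and only the MATEID key is inspected.

```python
def find_unpaired_breakends(vcf_lines):
    """Return IDs of records whose MATEID is missing or refers to an absent record."""
    mates = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.rstrip("\n").split("\t")
        record_id, info = fields[2], fields[7]
        # INFO is a semicolon-separated list of KEY=VALUE pairs and bare flags.
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, True)
            for item in info.split(";")
        )
        mates[record_id] = info_map.get("MATEID")
    return [rid for rid, mate in mates.items() if mate is None or mate not in mates]


# Hypothetical records: bnd_1 and bnd_2 are mates, bnd_3 has no MATEID.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t18240676\tbnd_1\tN\tN]22:18232335]\t.\tPASS\tSVTYPE=BND;MATEID=bnd_2",
    "22\t18232335\tbnd_2\tN\tN]22:18240676]\t.\tPASS\tSVTYPE=BND;MATEID=bnd_1",
    "22\t19940482\tbnd_3\tN\tN[22:19937820[\t.\tPASS\tSVTYPE=BND",
]
print(find_unpaired_breakends(example))
```

Records reported by a check like this would need a mate added (or be removed) before being passed to the annotate step.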

Input may also be entered in Socrates or simple format (must be specified with --sv_format simple or --sv_format socrates). Simple format is as follows:

chr1	pos1	chr2	pos2
22	18240676	22	18232335
22	19940482	22	19937820
22	21383572	22	21382745
22	21395573	22	21395746

We recommend that directions from the SV caller of choice be used (use_dir must be set to True in the configuration file in this case). Optionally, if you already know the SV classifications, the name of the classification field can be specified in the config file (e.g. sv_class_field: classification).

An example of the 'full' SV simple format is as follows:

chr1	pos1	dir1	chr2	pos2	dir2	classification
22	18240676	-	22	18232335	-	INV
22	19940482	-	22	19937820	+	DEL
22	21383572	-	22	21382745	+	DUP
22	21395573	+	22	21395746	+	INV
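
If your SV calls come from a tool that SVclone does not read directly, a small script can write them out in the 'full' simple format. The sketch below assumes the calls are already available as Python tuples; the function name `write_simple_format` is hypothetical, and the two rows are taken from the example above.

```python
import csv
import io

HEADER = ["chr1", "pos1", "dir1", "chr2", "pos2", "dir2", "classification"]


def write_simple_format(svs, handle):
    """Write (chr1, pos1, dir1, chr2, pos2, dir2, classification) tuples as TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for chr1, pos1, dir1, chr2, pos2, dir2, sv_class in svs:
        # Directions in the example above are always '+' or '-'.
        if dir1 not in ("+", "-") or dir2 not in ("+", "-"):
            raise ValueError(f"direction must be '+' or '-': {dir1!r}, {dir2!r}")
        writer.writerow([chr1, int(pos1), dir1, chr2, int(pos2), dir2, sv_class])


svs = [
    ("22", 18240676, "-", "22", 18232335, "-", "INV"),
    ("22", 19940482, "-", "22", 19937820, "+", "DEL"),
]
buffer = io.StringIO()  # swap for open("svs.txt", "w", newline="") to write a file
write_simple_format(svs, buffer)
print(buffer.getvalue(), end="")
```

Writing through the csv module with a tab delimiter keeps exactly one tab between columns, which avoids the mixed tab and space separators that are easy to introduce by hand.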

Note that the classifications in your SV input must match those specified in the configuration file (these may be comma-separated):

[SVclasses]
# Naming conventions used to label SV types.
inversion_class: INV
deletion_class: DEL
dna_gain_class: DUP,INTDUP
dna_loss_class: DEL,INV,TRX
itrx_class: INTRX

Note that dna_gain_class will include any SV classification involving DNA duplication and dna_loss_class is any intra-chromosomal rearrangement not involving a gain (including balanced rearrangements). itrx_class refers to all inter-chromosomal translocations.
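
One way to catch mismatched labels before running SVclone is to compare the classifications in your input against the [SVclasses] section. The following is a minimal sketch using Python's standard configparser; the helper names are hypothetical, the config text is the excerpt shown above, and "TANDEM" is a made-up label used to show a mismatch.

```python
import configparser

CONFIG_TEXT = """\
[SVclasses]
# Naming conventions used to label SV types.
inversion_class: INV
deletion_class: DEL
dna_gain_class: DUP,INTDUP
dna_loss_class: DEL,INV,TRX
itrx_class: INTRX
"""


def known_classes(config_text):
    """Collect every comma-separated label listed in the [SVclasses] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    names = set()
    for value in parser["SVclasses"].values():
        names.update(v.strip() for v in value.split(",") if v.strip())
    return names


def unknown_classifications(classifications, config_text):
    """Return the input labels that do not appear anywhere in [SVclasses]."""
    return sorted(set(classifications) - known_classes(config_text))


print(unknown_classifications(["INV", "DEL", "DUP", "TANDEM"], CONFIG_TEXT))
```

To check a real configuration, replace `parser.read_string(config_text)` with `parser.read("svclone_config.ini")`.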

A blacklist (BED file) can also be supplied at this step. Any SV with a breakpoint that falls into one of the blacklisted regions is skipped.

Annotate output

The above input example also corresponds to the output of this step (written to <out>/<sample>_svin.txt), with an added SV ID column. Rows that are considered part of the same event share the same ID, so one event may span multiple breakpoints.

Required Parameters

  • -i or --input : structural variants input file (see above).
  • -b or --bam : bam file with corresponding index file.
  • -s or --sample : Sample name. Will create processed output file as <out>/<sample>_svin.txt, parameters output as <out>/<sample>_params.txt.

Optional Parameters

  • -o or --out <directory> : output directory to create files. Default: the sample name.
  • -cfg or --config <config.ini> : SVclone configuration file with additional parameters (svclone_config.ini is the default).
  • --sv_format <vcf, simple, socrates> : input format of SV calls, VCF by default, but may also be simple (see above) or from the SV caller Socrates.
  • --blacklist <file.bed> : Takes a list of intervals in BED format. Skip processing of any break-pairs where either SV break-end overlaps an interval specified in the supplied bed file. Using something like the DAC blacklist is recommended.
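
The blacklist rule (skip a break-pair when either break-end overlaps a BED interval) can be sketched as below. This illustrates the idea only and is not SVclone's implementation. It assumes SV positions are 1-based while BED intervals are 0-based and half-open; the interval and the function names are made up, and the break-pairs are the ones from the simple-format example.

```python
def load_bed_intervals(bed_lines):
    """Group (start, end) intervals by chromosome, ignoring BED header lines."""
    intervals = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals


def breakend_in_blacklist(chrom, pos, intervals):
    """True if a 1-based position falls inside any 0-based half-open interval."""
    return any(start <= pos - 1 < end for start, end in intervals.get(chrom, []))


def filter_break_pairs(svs, intervals):
    """Keep only (chr1, pos1, chr2, pos2) pairs with both break-ends outside the blacklist."""
    return [
        sv for sv in svs
        if not (breakend_in_blacklist(sv[0], sv[1], intervals)
                or breakend_in_blacklist(sv[2], sv[3], intervals))
    ]


bed = ["22\t19937000\t19938000"]  # hypothetical blacklisted region
svs = [
    ("22", 18240676, "22", 18232335),
    ("22", 19940482, "22", 19937820),  # second break-end lies in the region
    ("22", 21383572, "22", 21382745),
    ("22", 21395573, "22", 21395746),
]
kept = filter_break_pairs(svs, load_bed_intervals(bed))
print(len(kept), "of", len(svs), "break-pairs kept")
```

A linear scan per break-end is fine for a few thousand SVs; for much larger blacklists, sorting the intervals and using binary search would be the natural next step.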

Count step

Run the SV processing submodule to obtain read counts around the breakpoints in each sample BAM file:

svclone count -i <svs> -b <indexed_bamfile> -s <sample_name>

The classification strings are not used by the program, except for DNA-gain events (such as duplications). The classification names for these types of SVs should be specified in the svclone_config.ini file (see configuration file section).

Count output

The count step will create a tab-separated <out>/<sample>_svinfo.txt file containing count information. For example:

ID	chr1	pos1	dir1	chr2	pos2	dir2	classification	split_norm1	norm_olap_bp1	span_norm1	win_norm1	split1	sc_bases1	total_reads1	split_norm2	norm_olap_bp2	span_norm2	win_norm2	split2	sc_bases2	total_reads2	anomalous	spanning	norm1	norm2	support	vaf1	vaf2
1	12	227543	+	12	228250	-	DEL	12	405	13	96	4	215	189	15	473	8	94	4	149	190	32	4	25	23	12	0.32432432432432434	0.34285714285714286
2	12	333589	+	12	338298	-	DEL	19	585	23	132	1	69	222	19	492	13	100	8	385	213	18	12	42	32	21	0.33333333333333331	0.39622641509433965
3	12	461142	+	12	465988	-	DEL	14	490	12	120	6	247	202	12	374	16	104	6	149	214	20	6	26	28	18	0.40909090909090912	0.39130434782608697
4	12	526623	+	12	554937	-	DEL	11	322	18	112	8	312	220	17	567	15	106	8	232	205	12	9	29	32	25	0.46296296296296297	0.43859649122807015
5	12	693710	+	12	696907	-	DEL	13	433	15	104	9	329	212	16	446	21	138	5	245	229	20	9	28	37	23	0.45098039215686275	0.38333333333333336

The output fields are briefly described:

  • split: split read count at each locus
  • split_norm/span_norm: number of normal split and spanning reads crossing the boundary at locus 1 and 2 respectively.
  • norm_olap_bp: count of normal read base-pairs overlapping the break (for normal reads that cross the break boundary).
  • win_norm: normal read count (no soft-clips, normal insert size) for all normal reads extracted from the locus window (+/- insert size from locus).
  • sc_bases: count of soft-clipped bases corresponding to split reads crossing the break.
  • norm: normal read count at each locus.
  • spanning: number of spanning reads supporting the break.
  • support: split1 + split2 + spanning.
  • anomalous: reads not counted in any other category.
  • vaf: support / (norm + support).
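
The support and vaf definitions above can be checked against the example rows. The sketch below recomputes both values from the relevant columns of the first two rows of the example output; the function name is hypothetical, and only the formulas listed above are used.

```python
import math


def support_and_vafs(row):
    """Recompute support, vaf1 and vaf2 from count columns of one SV row."""
    support = row["split1"] + row["split2"] + row["spanning"]
    vaf1 = support / (row["norm1"] + support)
    vaf2 = support / (row["norm2"] + support)
    return support, vaf1, vaf2


# Relevant columns copied from rows 1 and 2 of the example output above.
rows = [
    {"ID": 1, "split1": 4, "split2": 4, "spanning": 4, "norm1": 25, "norm2": 23},
    {"ID": 2, "split1": 1, "split2": 8, "spanning": 12, "norm1": 42, "norm2": 32},
]
for row in rows:
    support, vaf1, vaf2 = support_and_vafs(row)
    print(row["ID"], support, round(vaf1, 4), round(vaf2, 4))
```

For row 1 this gives a support of 12 and VAFs of 12/37 and 12/35, matching the support, vaf1 and vaf2 columns in the table.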

Required Parameters

  • -i or --input : structural variants input file. This should be the output file from the annotate step.
  • -b or --bam : bam file with corresponding index file.
  • -s or --sample : Sample name. Will create processed output file as <out>/<sample>_svinfo.txt, parameters output as <out>/<sample>_params.txt.

Optional Parameters

  • -o or --out <directory> : output directory to create files. Default: the sample name.
  • -cfg or --config <config.ini>: SVclone configuration file with additional parameters (svclone_config.ini is the default).

Filter step (Filter SVs and/or SNVs and attach CNV states)

To filter the data obtained from the count step and/or filter SNV data, run:

svclone filter -i <sv_info.txt> -s <sample_name>

Note that read length and insert sizes used by the filter step are provided as outputs from the count step (<out>/read_params.txt), based on the first 50,000 sampled reads in the bam file.
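
The annotate, count and filter steps chain together through their output files, so it can help to generate the three commands from one place. The sketch below only builds and prints the command lines using the required flags documented above. It assumes the default output directory (the sample name) and the <sample>_svin.txt and <sample>_svinfo.txt file names described in the output sections; the input file names are hypothetical.

```python
import shlex


def build_commands(sv_input, bam, sample):
    """Return the annotate, count and filter command lines for one sample."""
    out_dir = sample  # default output directory is the sample name
    svin = f"{out_dir}/{sample}_svin.txt"      # written by annotate
    svinfo = f"{out_dir}/{sample}_svinfo.txt"  # written by count
    return [
        ["svclone", "annotate", "-i", sv_input, "-b", bam, "-s", sample],
        ["svclone", "count", "-i", svin, "-b", bam, "-s", sample],
        ["svclone", "filter", "-i", svinfo, "-s", sample],
    ]


# Hypothetical file and sample names.
cmds = build_commands("svs.vcf", "tumour.bam", "tumour")
for cmd in cmds:
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run a step: subprocess.run(cmd, check=True)
```

Keeping each command as a list of arguments avoids shell-quoting problems with unusual file names if the commands are later passed to subprocess.run.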

Filter output

The filter step outputs the file <out>/<sample>_filtered_svs.tsv and/or <out>/<sample>_filtered_snvs.tsv depending on input. For SVs, the output is akin to the _svinfo.txt file format with added fields:

  • norm_mean: average of norm1 and norm2
  • gtype1/2: copy-number states at loci 1 and 2, given as "major, minor, CNV fraction", for example "1,1,1.0". May be subclonal if Battenberg input is supplied.
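
When post-processing the filtered SV table, the two added fields can be handled as in the minimal sketch below. The function names are hypothetical, and only a single clonal state such as "1,1,1.0" is parsed; the encoding of subclonal states is not covered here.

```python
def parse_gtype(gtype):
    """Split a clonal 'major,minor,fraction' string into typed values."""
    major, minor, fraction = gtype.split(",")
    return int(major), int(minor), float(fraction)


def norm_mean(norm1, norm2):
    """Average of the normal read counts at the two loci."""
    return (norm1 + norm2) / 2


print(parse_gtype("1,1,1.0"))
print(norm_mean(25, 23))  # norm1 and norm2 from row 1 of the count example
```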
