SomaticSiMu
SomaticSiMu generates single and double base pair substitutions, and single base pair insertions and deletions of biologically representative mutation signature probabilities and combinations.
SomaticSiMu_GUI is the GUI version of SomaticSiMu.
Description
Simulated genomes with known, imposed cancer-associated mutational signatures can be useful for benchmarking machine learning-based classifiers of genomic sequences and tools that extract mutational signatures from mutational catalogs. SomaticSiMu extracts known signature data from a reference dataset of 2,780 whole cancer genomes and 36 cancer types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) 2020 database and generates novel mutations on an input reference sequence that faithfully simulate real mutational signatures. It outputs the simulated mutated DNA sequence as a machine-readable FASTA file, along with CSV metadata files recording the position, frequency, and local trinucleotide sequence context of each mutation.
SomaticSiMu is developed as a lightweight, standalone, parallel software tool with an optional graphical user interface, built-in documentation, and functions for plotting mutational signatures. The rich selection of input parameters and the graphical user interface make SomaticSiMu both easy to use and suitable for a wide range of experimental scenarios.
Requirements
SomaticSiMu has the following dependencies:
- Python version 3.8.8 or higher
- pandas 1.2.4 or higher
- numpy 1.19.2 or higher
- tqdm 4.60.0 or higher
- pillow 8.1.2 or higher
- matplotlib 3.4.1 or higher
Install the dependencies of SomaticSiMu to your working environment using the following command in a terminal.
pip install -r ./SomaticSiMu/requirements.txt
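After installing, it can help to confirm that the working environment meets the minimum versions listed above. The following is a minimal sketch using only the Python standard library; the helper names `version_tuple` and `check_dependencies` are illustrative and are not part of SomaticSiMu.

```python
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {
    "pandas": "1.2.4",
    "numpy": "1.19.2",
    "tqdm": "4.60.0",
    "pillow": "8.1.2",
    "matplotlib": "3.4.1",
}


def version_tuple(version):
    """Turn '1.19.2' into (1, 19, 2), dropping suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (need >= {minimum})")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems
```

Calling `check_dependencies()` and printing each returned string gives a quick report of anything that is missing or outdated.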
Installation
SomaticSiMu is freely available on GitHub. Installation requires git and git lfs.
Install SomaticSiMu to your working directory using the following command in a terminal.
git clone https://github.com/HillLab/SomaticSiMu
Usage
SomaticSiMu requires the absolute file path of a reference genomic sequence on the local machine as input to the simulation. Users then select the simulation parameters (shown below) to specify the cancer type (mutational signatures observed in whole genomes of the selected cancer type), the mutation rate, the region in which mutations are simulated, and the proportions of synonymous and non-synonymous mutations.
The same set of arguments is offered for both SomaticSiMu.py and SomaticSiMu_GUI.py. The main difference is that SomaticSiMu.py is run from a terminal, while SomaticSiMu_GUI.py uses a Tkinter graphical user interface to improve accessibility and adds a suite of built-in visualization functions for the simulated output data. Simulation speed and memory usage are comparable between SomaticSiMu.py and SomaticSiMu_GUI.py.
Simulations using SomaticSiMu.py (terminal interface)
To simulate 100 unique genomic sequences using NC_000022.11 as the reference input sequence and Skin-Melanoma associated mutational signatures, an example command in the terminal using SomaticSiMu.py would look like this:
python SomaticSiMu.py -g 100 -c Skin-Melanoma -r ./SomaticSiMu/Reference_genome/Homo_sapiens.GRCh38.dna.chromosome.22.fasta
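Because the `-r` argument is described as an absolute file path, a script that launches many simulations may want to resolve the reference path before calling the tool. The sketch below assembles the same command from Python using only the `-g`, `-c`, and `-r` flags shown above; `build_command` and `run_simulation` are hypothetical helper names, and the sketch assumes it is run from the directory that contains SomaticSiMu.py.

```python
import os
import subprocess
import sys


def build_command(reference, cancer="Skin-Melanoma", generation=100,
                  script="SomaticSiMu.py"):
    """Assemble the argument list with the reference path made absolute."""
    reference_path = os.path.abspath(os.path.expanduser(reference))
    return [
        sys.executable, script,
        "-g", str(generation),
        "-c", cancer,
        "-r", reference_path,
    ]


def run_simulation(reference, **kwargs):
    """Run the simulation and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(reference, **kwargs), check=True)
```

For example, `build_command("./SomaticSiMu/Reference_genome/Homo_sapiens.GRCh38.dna.chromosome.22.fasta")` returns the argument list for the command above with the reference path expanded to an absolute path.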
Simulations using SomaticSiMu_GUI.py (graphical user interface)
To conduct the same simulation using SomaticSiMu_GUI.py, first run:
python SomaticSiMu_GUI.py
Then, select the simulation parameters from the drop-down menus or type them in. Click on a simulation parameter name in the GUI to open a new tab with a visual representation and description of that parameter.
Simulation Parameters
Short-Form Argument Name| Long-Form Argument Name| Argument Type | Argument Description | Argument Options
--- | --- | --- | --- | ---
-g | --generation | Integer | Number of simulated sequences | Default = 10 ; Recommended Range: 1-100
-c | --cancer | Character | Simulated mutational signatures observed in whole genomes of the selected cancer type from PCAWG | Options: Bladder-TCC, Bone-Benign, Bone-Epith, Bone-Osteosarc, Breast-AdenoCA, Breast-DCIS, Breast-LobularCA, CNS-GBM, CNS-Medullo, CNS-Oligo, CNS-PiloAstro, Cervix-AdenoCA, Cervix-SCC, ColoRect-AdenoCA, Eso-AdenoCA, Head-SCC, Kidney-ChRCC, Kidney-RCC, Liver-HCC, Lung-AdenoCA, Lung-SCC, Lymph-BNHL, Lymph-CLL, Myeloid-AML, Myeloid-MDS, Myeloid-MPN, Ovary-AdenoCA, Panc-AdenoCA, Panc-Endocrine, Prost-AdenoCA, Skin-Melanoma, SoftTissue-Leiomyo, SoftTissue-Liposarc, Stomach-AdenoCA, Thy-AdenoCA, Uterus-AdenoCA
-f | --reading_frame | Integer | Index (1-start) of first base of the first codon in reading frame | Default = 1; Options: 1, 2, 3
-s | --std | Integer | Exclude mutational signature data from hypermutated tumors whose mutational burden is more than s standard deviations from the mean mutational burden of the selected cancer type in the PCAWG dataset | Default = 3; Recommended Range: 0-3
-a | --slice_start | Character/Integer | Simulate mutations starting from this base index (1-start) in the reference sequence | Default = all (simulate mutations anywhere in the reference sequence), Options: Any integer from 1 up to the length of the input reference sequence
-b | --slice_end | Character/Integer | Simulate mutations starting from the slice_start index in the reference sequence up to and including this base index (1-start) | Default = all (simulate mutations anywhere in the reference sequence), Options: Any integer greater than slice_start and up to the length of the input reference sequence
-p | --power | Integer | Multiply simulation mutation rate (baseline based on PCAWG dataset) by a scalar factor | Default = 1 (biologically representative) ; Recommended Range: 0.1-10
-x | --syn_rate | Float | Proportion of synonymous mutations out of all simulated mutations kept in the output simulated sequence | Default = 1 (keep all syn. mutations) ; Recommended Range: 0 (0% of syn mutations)-1 (100% of syn mutations)
-y | --non_syn_rate | Float | Proportion of non-synonymous mutations out of all simulated mutations kept in the output simulated sequence | Default = 1 (keep all non-syn. mutations) ; Recommended Range: 0 (0% of non-syn. mutations)-1 (100% of non-syn. mutations)
-r | --reference | Character | Absolute file path of reference sequence used as input for the simulation |
-n | --normalization | Character | Normalize mutation rates to simulate mutation types and proportions similar to the Homo sapiens GRCh38 whole genome. Different input reference sequences have different k-mer compositions compared to the whole genome, which may affect the simulation of specific mutation types and their proportions. | Default: False ; Options: True, False
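To illustrate why k-mer composition matters for the `--normalization` option, the sketch below counts overlapping trinucleotides in a sequence and compares their proportions between two sequences. It is only an illustration of the idea: the function names are hypothetical, and this is not a description of how SomaticSiMu actually performs its normalization.

```python
from collections import Counter


def trinucleotide_proportions(sequence):
    """Return the proportion of each overlapping 3-mer made only of A, C, G, T."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + 3]
        for i in range(len(sequence) - 2)
        if set(sequence[i:i + 3]) <= set("ACGT")  # skip 3-mers containing N etc.
    )
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()} if total else {}


def composition_ratio(reference_props, genome_props):
    """Ratio of whole-genome to reference proportion for each shared 3-mer."""
    return {
        kmer: genome_props[kmer] / reference_props[kmer]
        for kmer in reference_props
        if kmer in genome_props
    }
```

A ratio far from 1 for a given trinucleotide indicates that the input reference is enriched or depleted for that context relative to the comparison sequence, which is the situation the normalization option is meant to address.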
Visualizations using SomaticSiMu_GUI.py (graphical user interface)
Using the SomaticSiMu graphical user interface, built-in visualization functions can plot the mutation types and proportions as well as the total count of simulated mutations. These visualization functions work only after the simulation has completed successfully.
Argument Name | Argument Type | Description | Argument Options
--- | --- | --- | ---
Cancer Type | Character (Drop Down Menu) | Simulated Cancer Type | Options: Bladder-TCC, Bone-Benign, Bone-Epith, Bone-Osteosarc, Breast-AdenoCA, Breast-DCIS, Breast-LobularCA, CNS-GBM, CNS-Medullo, CNS-Oligo, CNS-PiloAstro, Cervix-AdenoCA, Cervix-SCC, ColoRect-AdenoCA, Eso-AdenoCA, Head-SCC, Kidney-ChRCC, Kidney-RCC, Liver-HCC, Lung-AdenoCA, Lung-SCC, Lymph-BNHL, Lymph-CLL, Myeloid-AML, Myeloid-MDS, Myeloid-MPN, Ovary-AdenoCA, Panc-AdenoCA, Panc-Endocrine, Prost-AdenoCA, Skin-Melanoma, SoftTissue-Leiomyo, SoftTissue-Liposarc, Stomach-AdenoCA, Thy-AdenoCA, Uterus-AdenoCA
Gen Start | Integer | Unique ID of the first simulated sequence in the range to plot | Default: 1
Gen End | Integer | Unique ID of the last simulated sequence in the range to plot | Default: Number of sequences simulated for the selected cancer type
Mut Type | Character (Drop Down Menu) | Type of plot to visualize | Options: SBS, DBS, Insertion, Deletion, Mutation Burden
Visualization Type | Character (Drop Down Menu) | Visualize all simulated mutations | Default: End
Quick Start
The following quick-start examples use SomaticSiMu.py in a terminal to simulate mutational signatures and the SomaticSiMu_GUI.py graphical user interface to visualize the simulated data. For readability, arguments left at their default values, as listed in the Simulation Parameters table, are not shown.
Example 1: Simulate 10 sequences with imposed mutational signatures associated with Biliary Adenocarcinoma. Exclude hypermutants with a mutational burden that is one standard deviation beyond the mean mutational burden of the selected cancer type.
python SomaticSiMu.py -g 10 -c Biliary-AdenoCA -r ./SomaticSiMu/Reference_genome/Homo_sapiens.GRCh38.dna.chromosome.22.fasta -s 1
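The effect of `-s` can be pictured as a threshold on mutational burden. The sketch below is a minimal illustration with made-up burden values; it assumes the cutoff is applied only above the mean and uses the population standard deviation, neither of which is confirmed by this document, and `exclude_hypermutants` is a hypothetical name rather than a SomaticSiMu function.

```python
import statistics


def exclude_hypermutants(burdens, s=3):
    """Keep burdens no more than s standard deviations above the mean.

    Assumption: an upper-side cutoff with the population standard deviation.
    """
    mean = statistics.mean(burdens)
    threshold = mean + s * statistics.pstdev(burdens)
    return [burden for burden in burdens if burden <= threshold]


# Made-up mutation counts for ten tumors, one of them hypermutated.
example_burdens = [10, 12, 11, 9, 13, 10, 11, 12, 10, 200]
kept_strict = exclude_hypermutants(example_burdens, s=1)
kept_default = exclude_hypermutants(example_burdens, s=3)
```

With these illustrative values, `s=1` drops the tumor with 200 mutations while the default `s=3` keeps all ten, showing how a smaller `-s` restricts the simulation to more typical tumors.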
Example 2: Simulate 50 sequences with imposed mutational signatures associated with Colorectal Adenocarcinoma. Only simulate mutations from base index 10,000,000 to 30,000,000.
python SomaticSiMu.py -g 50 -c ColoRect-AdenoCA -r ./SomaticSiMu/Reference_genome/Homo_sapiens.GRCh38.dna.chromosome.22.fasta -a 10000000 -b 30000000
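The `-a` and `-b` indices are 1-start and the end index is inclusive, whereas Python slices are 0-start with an exclusive end. The sketch below shows the conversion under those documented conventions; `slice_bounds` is a hypothetical helper, and it only checks ordering and range rather than reproducing the tool's own validation.

```python
def slice_bounds(sequence_length, slice_start="all", slice_end="all"):
    """Convert 1-start, inclusive indices into a Python (start, stop) pair."""
    start = 1 if slice_start == "all" else int(slice_start)
    end = sequence_length if slice_end == "all" else int(slice_end)
    if not 1 <= start <= end <= sequence_length:
        raise ValueError(
            f"need 1 <= slice_start <= slice_end <= {sequence_length}, "
            f"got {start} and {end}"
        )
    # Subtract 1 from the start only: the inclusive end becomes the exclusive stop.
    return start - 1, end


toy_sequence = "ACGTACGTAC"
start, stop = slice_bounds(len(toy_sequence), 3, 6)
region = toy_sequence[start:stop]  # bases 3 through 6 in 1-start numbering
```

Applied to Example 2, `-a 10000000 -b 30000000` corresponds to the Python slice `[9999999:30000000]`, a region of 20,000,001 bases.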

