SigProfilerExtractor
SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
Documentation
Full documentation is available in docs/ (rendered via MkDocs): https://sigprofilersuite.github.io/SigProfilerExtractor/
<a name="installation"></a> Installation
To install the current version from this GitHub repository, clone it with git or download the zip file, then unzip SigProfilerExtractor-master.zip (or the zip file of the corresponding branch).
In the command line, please run the following:
$ cd SigProfilerExtractor-master
$ pip install .
For the most recent stable PyPI version of this tool, run the following in the command line:
$ pip install SigProfilerExtractor
Install your desired reference genome from the command line/terminal as follows (available reference genomes are: GRCh37, GRCh38, mm9, and mm10):
$ python
from SigProfilerMatrixGenerator import install as genInstall
genInstall.install('GRCh37')
This will install the human 37 assembly as a reference genome. You may install as many genomes as you wish.
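Since several genomes can be installed, the call above can be wrapped in a small loop. The following is a minimal sketch and not part of SigProfilerMatrixGenerator: `install_genomes` and its `installer` argument are hypothetical names introduced here, and the tuple of genome names is copied from the list above. By default it calls `genInstall.install`, the same function used above.

```python
# Genome names listed in this README as available for installation.
AVAILABLE_GENOMES = ("GRCh37", "GRCh38", "mm9", "mm10")


def install_genomes(genomes, installer=None):
    """Install each requested reference genome once, in the given order.

    `installer` is a hypothetical hook (any callable taking a genome name);
    when omitted, SigProfilerMatrixGenerator's install function is used.
    """
    unknown = [g for g in genomes if g not in AVAILABLE_GENOMES]
    if unknown:
        raise ValueError(f"Unsupported genome(s): {', '.join(unknown)}")

    if installer is None:
        # Imported lazily so the helper can be defined without the package.
        from SigProfilerMatrixGenerator import install as genInstall
        installer = genInstall.install

    installed = []
    for genome in dict.fromkeys(genomes):  # drop duplicates, keep order
        installer(genome)
        installed.append(genome)
    return installed


# Example (downloads the genomes): install_genomes(["GRCh37", "mm10"])
```

Validating the names before the first download avoids discovering a typo only after an earlier genome has already been fetched.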
Next, open a Python interpreter and import the SigProfilerExtractor module. See the examples for each function below.
<a name="functions"></a> Functions
The available functions are:
- importdata
- sigProfilerExtractor
- estimate_solution
- decompose
And an additional script:
- plotActivity.py
<a name="importdata"></a> importdata
Returns the path to the example data.
importdata(datatype="matrix")
importdata Example
from SigProfilerExtractor import sigpro as sig
path_to_example_table = sig.importdata("matrix")
data = path_to_example_table
# This "data" variable can be passed as the "input_data" argument of the sigProfilerExtractor function.
# To get help on the parameters and outputs of the "importdata" function, please use the following:
help(sig.importdata)
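Before passing the path on, it can help to confirm that the file has the expected shape. The sketch below uses only the Python standard library. `matrix_shape` is a hypothetical helper, and it assumes the layout commonly used for SigProfiler matrices: a tab-separated file whose header row holds sample names and whose first column holds mutation types.

```python
import csv


def matrix_shape(path):
    """Return (number of mutation types, number of samples) of a tab-separated matrix.

    Assumes the first row is a header (one label cell followed by sample
    names) and every later row is a mutation type followed by its counts.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("matrix file is empty")

    header, body = rows[0], rows[1:]
    n_samples = len(header) - 1
    for row in body:
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row) - 1} values, expected {n_samples}"
            )
    return len(body), n_samples
```

With the `data` path from the example above, `matrix_shape(data)` would report how many mutation types and samples the example file contains.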
<a name="sigProfilerExtractor"></a> sigProfilerExtractor
Extracts mutational signatures from an array of samples.
sigProfilerExtractor(input_type, out_put, input_data, reference_genome="GRCh37", opportunity_genome = "GRCh37", context_type = "default", exome = False,
minimum_signatures=1, maximum_signatures=10, nmf_replicates=100, resample = True, batch_size=1, cpu=-1, gpu=False,
nmf_init="random", precision= "single", matrix_normalization= "gmm", seeds= "random",
min_nmf_iterations= 10000, max_nmf_iterations=1000000, nmf_test_conv= 10000, nmf_tolerance= 1e-15, get_all_signature_matrices= False)
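As a usage illustration, the sketch below checks a few arguments and then forwards them to `sigProfilerExtractor`. `check_extraction_args` and `run_example` are hypothetical helpers, not part of the package. The accepted `input_type` values and copy-number callers are copied from the parameter table in this README, and the requirement that `minimum_signatures` not exceed `maximum_signatures` is an assumption made here. The first three arguments are passed positionally, in the order given in the signature above.

```python
# Copy-number callers accepted after the "seg:" prefix, as listed in this README.
SEG_CALLERS = {"ASCAT", "ASCAT_NGS", "SEQUENZA", "ABSOLUTE",
               "BATTENBERG", "FACETS", "PURPLE", "TCGA"}


def check_extraction_args(input_type, minimum_signatures, maximum_signatures):
    """Raise ValueError if the arguments look inconsistent with the parameter table."""
    if input_type.startswith("seg:"):
        caller = input_type.split(":", 1)[1]
        if caller not in SEG_CALLERS:
            raise ValueError(f"unknown segmentation caller: {caller!r}")
    elif input_type not in {"vcf", "matrix", "bedpe"}:
        raise ValueError(f"unknown input_type: {input_type!r}")
    # Assumption: both bounds are positive and the range is not empty.
    if not 1 <= minimum_signatures <= maximum_signatures:
        raise ValueError("need 1 <= minimum_signatures <= maximum_signatures")


def run_example(output_dir="example_output"):
    """Extract 1 to 3 signatures from the bundled example matrix."""
    # Imported here so the checks above work without the package installed.
    from SigProfilerExtractor import sigpro as sig

    check_extraction_args("matrix", 1, 3)
    data = sig.importdata("matrix")
    sig.sigProfilerExtractor("matrix", output_dir, data,
                             minimum_signatures=1, maximum_signatures=3)
```

Calling `run_example()` performs a real extraction and writes its results to `output_dir`, so it may take a while with the default number of NMF replicates.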
| Category | Parameter | Variable Type | Parameter Description |
| --------- | --------------------- | -------- |-------- |
| Input Data | | | |
| | input_type | String | The type of input:<br><ul><li>"vcf": used for vcf format inputs.</li><li>"matrix": used for table format inputs using a tab-separated file.</li><li>"bedpe": used for bedpe files with each SV annotated with its type, size bin, and clustered/non-clustered status. Please check the required format at https://github.com/SigProfilerSuite/SigProfilerMatrixGenerator#structural-variant-matrix-generation.</li><li>"seg:TYPE": used for a multi-sample segmentation file for copy number analysis. Please check the required format at https://github.com/SigProfilerSuite/SigProfilerMatrixGenerator#copy-number-matrix-generation. The accepted callers for TYPE are: "ASCAT", "ASCAT_NGS", "SEQUENZA", "ABSOLUTE", "BATTENBERG", "FACETS", "PURPLE", and "TCGA". For example, when using a segmentation file from BATTENBERG, set input_type to "seg:BATTENBERG".</li></ul> |
| | output | String | The name of the output folder. The output folder will be generated in the current working directory. |
| | input_data | String | <br>Path to input folder for input_type:<ul><li>vcf</li><li>bedpe</li></ul>Path to file for input_type:<ul><li>matrix</li><li>seg:TYPE</li></ul> |
| | reference_genome | String | The name of the reference genome (default: "GRCh37"). This parameter is applicable only if the input_type is "vcf". |
| | opportunity_genome | String | The build or version of the reference genome for the reference signatures (default: "GRCh37"). When the input_type is "vcf", the opportunity_genome automatically matches the input reference genome value. Only the genomes available in COSMIC are supported (GRCh37, GRCh38, mm9, mm10, mm39, rn6, and rn7). If a different opportunity genome is selected, the default genome GRCh37 will be used. |
| | context_type | String | Mutation context name(s), separated by commas (,), that define the mutational contexts for signature extraction (default: "96,DINUC,ID"). In the default value, 96 represents the SBS96 context, DINUC represents the dinucleotide context, and ID represents the indel context. |
| | exome | Boolean | Defines if the exomes will be extracted (default: False). |
| NMF Replicates | | | |
| | minimum_signatures | Positive Integer | The minimum number of signatures to be extracted (default: 1). |
| | maximum_signatures | Positive Integer | The maximum number of signatures to be extracted (default: 25). |
| | nmf_replicates | Positive Integer | The number of NMF replicates to be performed for each number of signatures (default: 100). |
| | resample | Boolean | If True, add Poisson noise to samples by resampling (default: True). |
| | seeds | String | Ensures reproducible NMF replicate resamples. Provide the path to the Seeds.txt file (found in the results folder from a previous analysis) to reproduce results (default: "random"). |
| NMF Engines | | | |
| | matrix_normalization | String | Method of normalizing the genome matrix before it is analyzed by NMF (default: "gmm"). Options are "gmm", "log2", "custom", or "none". |
| | nmf_init | String | The initialization algorithm for W and H matrix of NMF (default: "random"). Options are "random", "nndsvd", "nndsvda", "nndsvdar" and "nndsvd_min". |
| | precision | String | Values should be single or double (default: "single"). |
| | min_nmf_iterations | Integer | Value defines the minimum number of iterations to be completed before NMF converges (default: 10000). |
| | max_nmf_iterations | Integer | Value defines the maximum number of iterations to be completed before NMF converges (default: 1000000). |
| | nmf_test_conv | Integer | Value defines the number of iterations to be completed between convergence checks (default: 10000). |
| | nmf_tolerance | Float | Value defines the tolerance that must be reached for NMF to converge (default: 1e-15). |
| Execution | | | |
| | cpu | Integer | The number of processors to be used to extract the signatures (default: all processors). |
| | assignment_cpu | Integer | Number of processors to be used by SigProfilerAssignment for the final signature assignment step (default: all available). This is independent of the cpu parameter. |
| | gpu | Boolean | Defines if the GPU resource will be used if available (default: False). If True, the GPU resources will be used in the computation. Note: All available CPU processors are used by default, which may cause a memory error. This error can be resolved by reducing the number of CPU processes through the cpu parameter. |
| | batch_size | Integer | Will be effective only if the GPU is used. Defines the number of NMF replicates to be performed by each CPU during the parallel processing (default: 1). Note: For batch_size values greater than 1, each NMF replicate will update until max_nmf_iterations is reached.|
| Solution Estimation Thresholds | | | |
| | stability | Float | The cutoff threshold of the average stability (default: 0.8). Solutions with average stabilities below this threshold will not be considered. |
| | min_stability | Float | The cutoff threshold of the minimum stability (default: 0.2). Solutions with minimum stabilities below this threshold will not be considered. |
| | combined_stability | Float | The cutoff threshold of the combined stability (sum of average and minimum stability) (default: 1.0). Solutions with combined stabilities below this threshold will not be considered. |
| | allow_stability_drop | Boolean | Defines if solutions with a drop in stability with respect to the highest stable number of signatures will be considered (default: False). |
| Decomposition | | | |
| | cosmic_version | Float | Defines the version of the COSMIC reference signatures (default: 3.5). Takes a positive float among 1, 2, 3, 3.1, 3.2, 3.3, 3.4, and 3.5.|
| | **make_decomposition_plots
