⚠️ DEPRECATION NOTICE ⚠️ This repository is deprecated and no longer maintained as of v1.2.6. Its functionality is maintained in the Python version SigProfilerExtractor.

SigProfilerExtractorR

An R wrapper for running the SigProfilerExtractor framework.

INTRODUCTION

This document is a guide to using the SigProfilerExtractor framework to extract de novo mutational signatures from a set of samples and to decompose those de novo signatures into COSMIC signatures. An extensive wiki page detailing the usage of this tool can be found at https://osf.io/t6j7u/wiki/home/. The tool itself is written in Python; users who prefer working in a Python environment can install it from https://github.com/AlexandrovLab/SigProfilerExtractor

Installation
Functions
Citation
Copyright
Contact Information

<a name="installation"></a> Installation

PREREQUISITES

devtools (R)

>> install.packages("devtools")

reticulate* (R)

>> install.packages("reticulate")

*reticulate has a known issue that prevents Python print statements from flushing to standard output. As a result, some of the usual progress messages are delayed.

QUICK START GUIDE

This section will guide you through the minimum steps required to extract mutational signatures from genomes:

First, install the Python package using pip. The R wrapper still requires the Python package:

pip install SigProfilerExtractor

Open an R session and ensure that your R interpreter recognizes the path to your Python installation:

$ R
>> library(reticulate)
>> use_python("path_to_your_python")
>> py_config()
python:         /anaconda3/bin/python
libpython:      /anaconda3/lib/libpython3.6m.dylib
pythonhome:     /anaconda3:/anaconda3
version:        3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
numpy:          /anaconda3/lib/python3.6/site-packages/numpy
numpy_version:  1.16.1

If your Python path is not listed in the py_config() output, restart your R session and rerun the above commands in order.
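
If you are unsure what to pass as "path_to_your_python", the path can be looked up from within R first. The following is a minimal sketch in base R; find_python is a hypothetical helper introduced here for illustration and is not part of reticulate or SigProfilerExtractorR. It only searches the system PATH, so it will not find interpreters inside environments that are not activated.

```r
# Hypothetical helper: return the first Python executable found on the PATH,
# or NA if none of the candidate names resolves.
find_python <- function(candidates = c("python3", "python")) {
  paths <- Sys.which(candidates)   # "" for names that are not found
  found <- paths[nzchar(paths)]
  if (length(found) == 0) {
    return(NA_character_)
  }
  unname(found[1])
}

python_path <- find_python()
if (is.na(python_path)) {
  message("No Python executable found on the PATH.")
} else {
  message("Python found at: ", python_path)
  # With reticulate loaded, this path could then be passed on:
  # use_python(python_path, required = TRUE)
}
```

The Python installation found this way must be the one where SigProfilerExtractor was installed with pip.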

Install SigProfilerExtractorR using devtools:

>> library(devtools)
>> install_github("AlexandrovLab/SigProfilerExtractorR")

Load the package in the same R session and install your desired reference genome as follows (available reference genomes are: GRCh37, GRCh38, mm9, and mm10):

>> library(SigProfilerExtractorR)
>> install("GRCh37", rsync=FALSE, bash=TRUE)

This will install the human GRCh37 assembly as a reference genome.

SUPPORTED GENOMES

Other available reference genomes are GRCh38, mm9, and mm10, along with the genomes supported by SigProfilerMatrixGenerator. Information about supported genomes can be found at https://github.com/AlexandrovLab/SigProfilerMatrixGeneratorR

Quick Example:

Signatures can be extracted from VCF files or from a tab-delimited mutational table using the sigprofilerextractor function.

>> help(sigprofilerextractor)

This will show the details of the sigprofilerextractor function.

>> library(SigProfilerExtractorR)
>> path_to_example_data <- importdata("matrix")
>> data <- path_to_example_data # here you can provide the path of your own data
>> sigprofilerextractor("matrix", 
                     "example_output", 
                     data, 
                     minimum_signatures=2,
                     maximum_signatures=3,
                     nmf_replicates=5,
                     min_nmf_iterations = 1000,
                     max_nmf_iterations =100000,
                     nmf_test_conv = 1000,
                     nmf_tolerance = 0.00000001)

The example output will be generated in the working directory. Note that the parameters used in the above example are not optimal for obtaining accurate signatures; they are chosen only to keep the quick example fast.
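
To try the "matrix" input type with your own data, you need a tab-delimited table. As a rough sketch, assuming the layout used by the bundled example data (a first column of mutation types followed by one column of counts per sample), such a file could be written in base R as shown below. The sample names, counts, and file name are made up for illustration, and only four of the 96 SBS channels are shown; a real input needs every channel of the chosen context.

```r
# Illustrative toy data: four SBS96 channels and two made-up samples.
mutation_types <- c("A[C>A]A", "A[C>A]C", "A[C>A]G", "A[C>A]T")

counts <- data.frame(
  "Mutation Types" = mutation_types,  # assumed header of the first column
  Sample_1 = c(12L, 7L, 3L, 9L),
  Sample_2 = c(5L, 11L, 0L, 4L),
  check.names = FALSE                 # keep the space in "Mutation Types"
)

# Write the table as a tab-delimited text file without quotes or row names.
matrix_path <- file.path(tempdir(), "toy_matrix.txt")
write.table(counts, file = matrix_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm that its shape survived the round trip.
check <- read.delim(matrix_path, check.names = FALSE)
print(dim(check))
```

The resulting path (here matrix_path) plays the same role as the data variable in the quick example.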

<a name="functions"></a> Functions

The available functions are:

importdata
sigprofilerextractor
estimate_solution

<a name="importdata"></a> importdata

Imports the path of example data.

importdata(datatype)

datatype: Type of example data. There are two types: 1. "vcf", 2. "matrix".

importdata Example

library(SigProfilerExtractorR)
path_to_example_table = importdata("matrix")
data = path_to_example_table 
# This "data" variable can be passed as the "input_data" argument of the sigprofilerextractor function.

# To get help on the parameters and outputs of the "importdata" function, please use the following:
help(importdata)

<a name="sigprofilerextractor"></a> sigprofilerextractor

Extracts mutational signatures from an array of samples.

sigprofilerextractor(input_type, output, input_data, reference_genome = "GRCh37",
                     opportunity_genome = "GRCh37", context_type = "default",
                     exome = FALSE, minimum_signatures = 1, maximum_signatures = 10,
                     nmf_replicates = 100, resample = TRUE, batch_size = 1, cpu = -1,
                     gpu = FALSE, nmf_init = "random", precision = "single",
                     matrix_normalization = "gmm", seeds = "random",
                     min_nmf_iterations = 10000, max_nmf_iterations = 1000000,
                     nmf_test_conv = 10000, nmf_tolerance = 1e-15,
                     nnls_add_penalty = 0.05, nnls_remove_penalty = 0.01,
                     initial_remove_penalty = 0.05, get_all_signature_matrices = FALSE)

| Category | Parameter | Variable Type | Parameter Description |
| --------- | --------------------- | -------- | -------- |
| Input Data | | | |
| | input_type | String | The type of input:<br><ul><li>"vcf": used for VCF format inputs.</li><li>"matrix": used for table format inputs using a tab-separated file.</li><li>"bedpe": used for a bedpe file with each SV annotated with its type, size bin, and clustered/non-clustered status.</li><li>"seg:TYPE": used for a multi-sample segmentation file for copy number analysis. The accepted callers for TYPE are {"ASCAT", "ASCAT_NGS", "SEQUENZA", "ABSOLUTE", "BATTENBERG", "FACETS", "PURPLE", "TCGA"}. For example, when using a segmentation file from BATTENBERG, set input_type to "seg:BATTENBERG".</li></ul> |
| | output | String | The name of the output folder. The output folder will be generated in the current working directory. |
| | input_data | String | Path to input folder for input_type:<ul><li>vcf</li><li>bedpe</li></ul>Path to file for input_type:<ul><li>matrix</li><li>seg:TYPE</li></ul> |
| | reference_genome | String | The name of the reference genome. The default reference genome is "GRCh37". This parameter is applicable only if the input_type is "vcf". |
| | opportunity_genome | String | The build or version of the reference genome for the reference signatures. The default opportunity genome is GRCh37. If the input_type is "vcf", the opportunity_genome automatically matches the input reference genome value. Only the genomes available in COSMIC are supported (GRCh37, GRCh38, mm9, mm10 and rn6). If a different opportunity genome is selected, the default genome GRCh37 will be used. |
| | context_type | String | A string of mutation context names separated by commas (","). The items in the list define the mutational contexts to be considered when extracting the signatures. The default value is "96,DINUC,ID", where "96" is the SBS96 context, "DINUC" is the DINUCLEOTIDE context and "ID" is the INDEL context. |
| | exome | Boolean | Defines if the exomes will be extracted. The default value is "False". |
| NMF Replicates | | | |
| | minimum_signatures | Positive Integer | The minimum number of signatures to be extracted. The default value is 1. |
| | maximum_signatures | Positive Integer | The maximum number of signatures to be extracted. The default value is 25. |
| | nmf_replicates | Positive Integer | The number of iterations to be performed to extract each number of signatures. The default value is 100. |
| | resample | Boolean | Default is True. If True, add Poisson noise to samples by resampling. |
| | seeds | String | Can be used to get reproducible resamples for the NMF replicates. The path to a tab-separated .txt file containing the replicate IDs and preset seeds in a two-column dataframe can be passed through this parameter. The Seeds.txt file in the results folder from a previous analysis can be used for the seeds parameter in a new analysis. The default value for this parameter is "random". When "random", the seeds for resampling will differ between analyses. |
| NMF Engines | | | |
| | matrix_normalization | String | Method of normalizing the genome matrix before it is analyzed by NMF. Default value is "gmm". Other options are "log2", "custom" or "none". |
| | nmf_init | String | The initialization algorithm for the W and H matrices of NMF. Options are 'random', 'nndsvd', 'nndsvda', 'nndsvdar' and 'nndsvd_min'. Default is 'random'. |
| | precision | String | Values should be single or double. Default is single. |
| | min_nmf_iterations | Integer | Value defines the minimum number of iterations to be completed before NMF converges. Default is 10000. |
| | max_nmf_iterations | Integer | Value defines the maximum number of iterations to be completed before NMF converges. Default is 1000000. |
| | nmf_test_conv | Integer | Value defines the number of iterations to be done between convergence checks. D
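
Because an extraction run can take a long time, it may be worth checking the input_type string before calling sigprofilerextractor. The following is a minimal sketch in base R that encodes the values listed in the parameter table; seg_callers and check_input_type are hypothetical names introduced here and are not part of SigProfilerExtractorR.

```r
# Callers that the parameter table lists as valid TYPE values for "seg:TYPE".
seg_callers <- c("ASCAT", "ASCAT_NGS", "SEQUENZA", "ABSOLUTE",
                 "BATTENBERG", "FACETS", "PURPLE", "TCGA")

# Hypothetical helper: TRUE if input_type is one of the documented values.
check_input_type <- function(input_type) {
  if (input_type %in% c("vcf", "matrix", "bedpe")) {
    return(TRUE)
  }
  if (startsWith(input_type, "seg:")) {
    caller <- sub("^seg:", "", input_type)  # text after the "seg:" prefix
    return(caller %in% seg_callers)
  }
  FALSE
}

check_input_type("seg:BATTENBERG")  # a documented caller
check_input_type("seg:UNKNOWN")     # not in the documented list
```

The check is case-sensitive on purpose, since the table gives the caller names in upper case; whether the package itself accepts other spellings is not stated here.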
