DeCiFer
DeCiFer is an algorithm that simultaneously selects mutation multiplicities and clusters SNVs by their corresponding descendant cell fractions (DCF), a statistic that quantifies the proportion of cells which acquired the SNV or whose ancestors acquired the SNV. DCF is related to the commonly used cancer cell fraction (CCF) but further accounts for SNVs which are lost due to deleterious somatic copy-number aberrations (CNAs), identifying clusters of SNVs which occur in the same phylogenetic branch of tumour evolution.
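The difference between DCF and CCF can be illustrated with a toy clone tree. The sketch below is not DeCiFer's implementation: the tree, the clone names, the proportions, and the helper functions are all hypothetical, chosen only to show how an SNV lost in a descendant clone still counts toward the DCF but not toward the CCF.

```python
# Hypothetical clone tree (child -> parent); "root" is the founding tumour clone.
PARENT = {"root": None, "A": "root", "B": "A", "C": "root"}
# Made-up proportions of tumour cells in each clone (they sum to 1).
PROPORTION = {"root": 0.2, "A": 0.3, "B": 0.4, "C": 0.1}


def descendants(clone, parent):
    """Return `clone` together with every clone below it in the tree."""
    found = {clone}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in found and child not in found:
                found.add(child)
                changed = True
    return found


def dcf(origin, parent, proportion):
    """Proportion of cells that acquired the SNV or descend from cells that did."""
    return sum(proportion[c] for c in descendants(origin, parent))


def ccf(origin, parent, proportion, lost_in=()):
    """Proportion of cells that still carry the SNV after losses in `lost_in`."""
    carriers = descendants(origin, parent)
    for clone in lost_in:
        carriers -= descendants(clone, parent)
    return sum(proportion[c] for c in carriers)


# The SNV arises in clone A and is deleted by a CNA in clone B.
snv_dcf = dcf("A", PARENT, PROPORTION)                 # clones A and B
snv_ccf = ccf("A", PARENT, PROPORTION, lost_in=["B"])  # clone A only
print("DCF = %.2f, CCF = %.2f" % (snv_dcf, snv_ccf))
```

When no loss occurs, the two quantities coincide in this toy model, which is why the DCF only departs from the CCF for SNVs affected by deletions.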
A full description of the algorithm and its application to published cancer datasets is given in
Gryte Satas†, Simone Zaccaria†, Mohammed El-Kebir†,* and Ben Raphael*, 2021
† Joint First Authors
* Corresponding Authors
The results of the related paper are available at:
This repository includes detailed instructions for installation and requirements, demos and tutorials of DeCiFer, a list of current issues, and contacts. This repository is currently in a preliminary release and improved versions are released frequently. During this stage, please keep checking for updates.
Contents
<a name="algorithm"></a>
Algorithm
<img src="doc/decifer.png" width="500">

DeCiFer uses the Single Split Copy Number (SSCN) assumption and evolutionary constraints to enumerate potential genotype sets. This allows DeCiFer to exclude genotype sets with constant mutation multiplicity (CMM) that are not biologically likely (red crosses) and include additional genotype sets (green star) that are. DeCiFer simultaneously selects a genotype set for each SNV and clusters all SNVs based on a probabilistic model of DCFs, which summarize both the prevalence of the SNV and its evolutionary history.
<a name="installation"></a>
Installation
DeCiFer is mostly written in Python 2.7 and has an optional component in C++. The recommended installation is through conda but we also provide custom instructions to install DeCiFer in any Python environment.
Automatic installation
The recommended installation is through bioconda and requires conda, which can be easily and locally obtained by installing one of the two most common freely available distributions: anaconda or miniconda. Please make sure to have executed the required channel setup for bioconda. Thus, the following one-time one-line command is sufficient to fully install DeCiFer within a virtual conda environment called decifer:
conda create -n decifer decifer -y -c bioconda
After such one-time installation, DeCiFer can be executed in every new session after activating the decifer environment as follows:
conda activate decifer
Manual installation
DeCiFer can also be installed in a conda environment directly from this repo. Thus, the following one-time commands are sufficient to fully install DeCiFer within a virtual conda environment called decifer from this Git repo:
git clone https://github.com/raphael-group/decifer.git && cd decifer/
conda create -c anaconda -c conda-forge -n decifer python=2.7 numpy scipy matplotlib-base pandas seaborn -y
pip install .
Custom installation
DeCiFer can be installed with pip by running `pip install .` in any Python 2.7 environment with the following packages or compatible versions:
| Package | Tested version | Comments |
|---------|----------------|----------|
| numpy | 1.16.1 | Efficient scientific computations |
| scipy | 1.2.1 | Efficient mathematical functions and methods |
| pandas | 0.20.1 | Dataframe management |
| matplotlib | 2.0.2 | Basic plotting utilities |
| seaborn | 0.7.1 | Advanced plotting utilities |
Installation of C++ component
DeCiFer includes C++ code to enumerate state/genotype trees. The dependencies for this code are as follows.
| Package | Tested version | Comments |
|---------|----------------|----------|
| cmake | >= 2.8 | Build environment |
| lemon | 1.3.1 | C++ graph library |
| boost | >= 1.69.0 | C++ library for scientific computing |
To build this code, enter the following commands from the root of the repository:
mkdir build
cd build
# OPTIONAL: specify lemon and/or Boost paths if not detected automatically.
cmake ../src/decifer/cpp/ -DLIBLEMON_ROOT=/usr/local/ -DBOOST_ROOT=/scratch/software/boost_1_69_0/
make
<a name="usage"></a>
Usage
DeCiFer can be executed using the command decifer, whose manual describes the available parameters and argument options. See more details below.
- Required input data
- Optional input data
- Output
- System requirements
- Demos
- Recommendations and quality control
<a name="requireddata"></a>
Required input data
DeCiFer requires two input files:
- Input mutations with nucleotide counts and related copy numbers in a tab-separated file (TSV) with three header lines: (1) the first specifies the number of mutations; (2) the second specifies the number of samples; and (3) the third is equal to `#sample_index sample_label character_index character_label ref var`. Every other row has the following values for every mutation in every sample:
| Name | Description | Mandatory |
|------|-------------|-----------|
| Sample index | a unique number identifying the sample | Yes |
| Sample label | a unique name for the sample | Yes |
| Mutation index | a unique number identifying the mutation | Yes |
| Mutation label | a unique name identifying the mutation | Yes |
| REF | Number of reads with reference allele for the mutation | Yes |
| ALT | Number of reads with alternate allele for the mutation | Yes |
| Copy numbers and proportions | Tab-separated A B U where A,B are the inferred allele-specific copy numbers for the segment harboring the mutation and U is the corresponding proportion of cells (normal and tumour) with those copy numbers. Groups of cells/clones with the same allele-specific copy numbers must be combined into a single proportion. | Yes |
| Additional copy numbers | An arbitrary number of fields in the same format as Copy numbers and proportions, describing the proportions of cells with different copy numbers. Note that all proportions must sum to 1. | No |
- Input tumour purity in a two-column tab-separated file where every row `SAMPLE-INDEX TUMOUR-PURITY` defines the tumour purity `TUMOUR-PURITY` of a sample with index `SAMPLE-INDEX`.
For generating the input files for DeCiFer, please see the scripts directory for more information. Examples may be found in the data directory.
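As a rough illustration of the row layout described above, the following sketch parses one data row of the mutation file and one row of the purity file. It is not part of DeCiFer: the function names, the example values, the integer indices, and the 1e-6 tolerance on the proportion sum are assumptions made for this example, and the three header lines are not handled here.

```python
def parse_mutation_row(line):
    """Parse one tab-separated data row of the mutation file into a dict."""
    fields = line.rstrip("\n").split("\t")
    # Six fixed fields, then one or more (A, B, U) triples.
    if len(fields) < 9 or (len(fields) - 6) % 3 != 0:
        raise ValueError("expected 6 fixed fields followed by A B U triples")
    states = []
    for i in range(6, len(fields), 3):
        states.append((int(fields[i]), int(fields[i + 1]), float(fields[i + 2])))
    total = sum(u for _, _, u in states)
    if abs(total - 1.0) > 1e-6:  # tolerance is an assumption of this sketch
        raise ValueError("copy-number proportions sum to %g, not 1" % total)
    return {
        "sample_index": int(fields[0]),
        "sample_label": fields[1],
        "mutation_index": int(fields[2]),
        "mutation_label": fields[3],
        "ref": int(fields[4]),
        "alt": int(fields[5]),
        "states": states,
    }


def fractional_copy_number(states):
    """Average total copy number (A + B) weighted by the proportions U."""
    return sum((a + b) * u for a, b, u in states)


def parse_purity_row(line):
    """Parse one 'SAMPLE-INDEX TUMOUR-PURITY' row of the purity file."""
    index, purity = line.split()
    return int(index), float(purity)


# Made-up row: 40% of cells have copy numbers (1, 1) and 60% have (2, 1).
EXAMPLE_ROW = "0\tS1\t0\tmut0\t80\t20\t1\t1\t0.4\t2\t1\t0.6"
row = parse_mutation_row(EXAMPLE_ROW)
print(row["mutation_label"], row["states"], fractional_copy_number(row["states"]))
```

Checking that the proportions sum to 1 before running the tool catches the most common formatting mistake early, since the normal-cell proportion must be included among the states.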
<a name="optionaldata"></a>
Optional input data
DeCiFer can use the following additional and optional input data:
1. Data for fitting beta-binomial distributions to read count data
To cluster mutations with beta-binomial distributions (the default is binomial), pass the --betabinomial flag to decifer along with two additional arguments, --snpfile and --segfile, which specify the locations of two files containing the information needed to parameterize the beta-binomial for each sample.
The file passed to DeCiFer via --snpfile contains information about the read counts of germline (not somatic) variants and has the following format:
| Field | Description |
|-------|-------------|
| SAMPLE | Name of a sample |
| CHR | Name of the chromosome |
| POS | Genomic position in CHR |
| REF_COUNT | Number of reads harboring reference allele in POS |
| ALT_COUNT | Number of reads harboring alternate allele in POS |
The file passed to DeCiFer via --segfile, which specifies the allele-specific copy number per segment, is the same as the best.seg.ucn file used by the vcf_2_decifer.py python script that generates the input files for DeCiFer. Please simply specify the location of this file.
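To see why a beta-binomial model can matter for read counts, the sketch below compares binomial and beta-binomial log-probabilities for the same variant read count. It is a generic illustration, not DeCiFer's fitting procedure: the mean/precision parameterization and the numbers used are assumptions of this example, and how DeCiFer derives its parameters from the germline SNP counts is not shown.

```python
from math import exp, lgamma, log


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binomial_logpmf(k, n, p):
    """Log-probability of k variant reads out of n under a binomial model."""
    return log_choose(n, k) + k * log(p) + (n - k) * log(1 - p)


def betabinomial_logpmf(k, n, p, precision):
    """Beta-binomial log-probability with mean p and precision alpha + beta.

    Smaller `precision` means more overdispersion; as it grows, the
    distribution approaches the binomial with success probability p.
    """
    alpha, beta = p * precision, (1 - p) * precision
    return (log_choose(n, k)
            + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta))


# 15 variant reads out of 20 when the expected variant fraction is 0.3:
# the overdispersed model finds this far less surprising than the binomial.
print(exp(binomial_logpmf(15, 20, 0.3)))
print(exp(betabinomial_logpmf(15, 20, 0.3, precision=10.0)))
```

Because the beta-binomial places more probability in the tails, noisy read counts are less likely to be split into spurious clusters than under a binomial model with the same mean.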
2. Custom state trees
Users may pass a file containing the set of all possible state trees for DeCiFer to evaluate. State trees have been pre-generated for the most common copy numbers; however, a dataset might contain a combination of copy numbers that has not been included. In this case, the user can run the command generatestatetrees to generate all the state trees needed for the dataset, for instance by following the instructions in the scripts directory. The script in that directory generates not only the input files for decifer but also a file called cn_states.txt that lists all the unique CN states in the data. This file may be used with generatestatetrees as shown in the scripts directory under the section "Addressing the "Skipping mutation warning"".
<a name="output"></a>
Output
DeCiFer's main output file (ending with _output.tsv) is a single TSV file encoding a dataframe in which every row corresponds to an input mutation, with the following fields:
| Name | Description |
|------|-------------|
| mut_index | Unique identifier for a mutation |
| VAR_{SAMPLE} | Variant sequencing read cou