Scdiff
____ ____ ____ _ _____ _____
/ ___\/ _\/ _ \/ \/ // /
| \| / | | \|| || __\| __\
\___ || \__| |_/|| || | | |
\____/\____/\____/\_/\_/ \_/
!!!!NEW!!!
For large single-cell datasets (e.g., > 2k cells), please use the new version of scdiff (scdiff2) at: https://github.com/phoenixding/scdiff2
SCDIFF 2.0 uses HDF5, sparse matrices, and multi-threading to reduce the program's resource requirements while improving efficiency. It also incorporates many new clustering and trajectory inference methods for more comprehensive and accurate predictions.
A few highlights:
(1) VERY EFFICIENT: Analyze 40k cells (~10k genes/cell) within 1-2 hours (--ncores 12 --maxloop 0)
(2) VERY FLEXIBLE: It is composed of many moving pieces, each of which can be customized by users.
INTRODUCTION
<div style="text-align: justify"> Most existing single-cell trajectory inference methods rely primarily on the assumption that descendant cells are similar to their parents in terms of gene expression levels. This assumption does not always hold for in-vivo studies, which often include infrequently sampled, unsynchronized, and diverse cell populations. Thus, additional information may be needed to determine the correct ordering and branching of progenitor cells and the set of transcription factors (TFs) that are active during advancing stages of organogenesis. To enable such modeling, we developed scdiff, which integrates expression similarity with regulatory information to reconstruct dynamic developmental cell trajectories. SCDIFF is a package written in Python and JavaScript, designed to analyze cell differentiation trajectories using time-series single-cell RNA-seq data. It can predict the transcription factors and differential genes associated with the cell differentiation trajectories. It also visualizes the trajectories using an interactive tree-structure graph, in which nodes represent different cell sub-populations (clusters).
</div>
PREREQUISITES
- python (python 2 and python 3 are both supported)
It is installed by default on most Linux distributions and MacOS.
If not, please check https://www.python.org/downloads/ for installation instructions.
- Python package dependencies:
-- scikit-learn >=0.20
-- scipy >=0.13.3
-- numpy >=1.8.2
-- matplotlib >=2.2.3
-- pydiffmap >=0.1.1,<0.2.0
-- imbalanced_learn >=0.4.2
The python setup.py script (or pip) will try to install these packages automatically. However, please install them manually if, for any reason, the automatic installation fails.
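Before installing, it can help to see which of these dependencies are already present. The following is a minimal sketch and not part of scdiff: `parse_version`, `installed_versions`, and `check_requirements` are hypothetical helpers, the version bounds are copied from the list above, and it assumes Python 3.8+ (for `importlib.metadata`) and mostly numeric version strings. The PyPI distribution name `imbalanced-learn` is assumed for the `imbalanced_learn` entry.

```python
from importlib import metadata  # available in Python 3.8+
from itertools import takewhile

# Lower bounds copied from the dependency list above.
MIN_VERSIONS = {
    "scikit-learn": "0.20",
    "scipy": "0.13.3",
    "numpy": "1.8.2",
    "matplotlib": "2.2.3",
    "pydiffmap": "0.1.1",
    "imbalanced-learn": "0.4.2",
}
# Exclusive upper bounds (only pydiffmap has one).
MAX_EXCLUSIVE = {"pydiffmap": "0.2.0"}


def parse_version(text, width=4):
    """Turn '1.8.2' into (1, 8, 2, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    parts = parts[:width]
    return tuple(parts + [0] * (width - len(parts)))


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def check_requirements(installed):
    """Return {package: problem} for every requirement that is not met."""
    problems = {}
    for name, minimum in MIN_VERSIONS.items():
        version = installed.get(name)
        if version is None:
            problems[name] = "not installed"
        elif parse_version(version) < parse_version(minimum):
            problems[name] = "found %s, need >=%s" % (version, minimum)
        elif (name in MAX_EXCLUSIVE
              and parse_version(version) >= parse_version(MAX_EXCLUSIVE[name])):
            problems[name] = "found %s, need <%s" % (version, MAX_EXCLUSIVE[name])
    return problems
```

Calling `check_requirements(installed_versions(MIN_VERSIONS))` returns an empty dict when every requirement is met. Any entry it reports is a candidate for manual installation.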
INSTALLATION
There are 3 options to install scdiff.
- Option 1: Install from the download directory
cd to the root directory of the downloaded scdiff package
$cd scdiff
run python setup to install
$python setup.py install
MacOS or Linux users might need sudo/root access to install. Users without root access can install the package using pip/easy_install with the --user parameter (installs python libraries without root).
$sudo python setup.py install
use python3 instead of python in the above commands if using python3.
- Option 2: Install from Github (recommended):
python 2:
$sudo pip install --upgrade https://github.com/phoenixding/scdiff/zipball/master
python 3:
$sudo pip3 install --upgrade https://github.com/phoenixding/scdiff/zipball/master
- Option 3: Install from PyPI:
python 2:
$sudo pip install --upgrade scdiff
python 3:
$sudo pip3 install --upgrade scdiff
The above pip installation options should work on Linux, Windows, and MacOS systems.
For MacOS users, installing with python3 is recommended. The default python2 on MacOS has
some compatibility issues with a few dependent libraries. Users who prefer python2 on MacOS
would have to install their own version of python2 (e.g. via Anaconda).
USAGE
scdiff.py [-h] -i INPUT -t TF_DNA -k CLUSTERS -o OUTPUT [-l LARGE]
[-s SPEEDUP] [-d DSYNC] [-a VIRTUALANCESTOR]
[-f LOG2FOLDCHANGECUT] [-e ETFLISTFILE] [--spcut SPCUT]
-h, --help show this help message and exit
-i INPUT, --input INPUT, required
input single cell RNA-seq expression data
-t TF_DNA, --tf_dna TF_DNA, required
TF-DNA interactions used in the analysis
-k CLUSTERS, --clusters CLUSTERS, required
how to learn the number of clusters for each time
point? user-defined or auto? if user-defined, please
specify the configuration file path. If set as "auto"
scdiff will learn the parameters automatically.
-o OUTPUT, --output OUTPUT, required
output folder to store all results
-s SPEEDUP, --speedup SPEEDUP(1/None), optional
If set as 'True' or '1', scdiff will speed up the run
by reducing the number of iterations.
-l LARGETYPE, --largetype LARGETYPE (1/None), optional
If specified as 'True' or '1', scdiff will use LargeType mode to
improve the running efficiency (both memory and time).
As spectral clustering does not scale to large data,
PCA+K-Means clustering is used instead. The running speed improves
significantly, but the performance is slightly worse. If there are
more than 2k cells at each time point on average, it is highly
recommended to use this parameter to improve time and memory efficiency.
-d DSYNC, --dsync DSYNC (1/None), optional
If specified as 'True' or '1', the cell synchronization will be disabled.
If the users believe that cells at the same time point are similar in terms of
differentiation/development, the synchronization can be disabled.
-a VIRTUALANCESTOR, --virtualAncestor VIRTUALANCESTOR (1/None), optional
scdiff requires an 'Ancestor' node (the starting node;
all other nodes are descendants). By default,
the 'Ancestor' node is set as the first time point. The underlying hypothesis is:
the cells at the first time point are not differentiated yet
(or are at a very early stage of differentiation, so there are no clear sub-groups
and all cells at the first time point belong to the same cluster).
If this is not the case, users can set -a as 'True' or '1' to enable
a virtual ancestor before the first time point. The expression of the
virtual ancestor is the median expression of all cells at the first time point.
-f LOG2FOLDCHANGECUT, --log2foldchangecut LOG2FOLDCHANGECUT (Float), optional
By default, scdiff uses a log2 fold change of 1 (i.e. a fold change of 2^1=2)
as the cutoff for differential genes (together with a t-test p-value cutoff of 0.05).
However, users can customize the cutoff based on their
application scenario (e.g. log2 fold change 1.5).
-e ETFLISTFILE, --etfListFile ETFLISTFILE (String), optional
By default, scdiff recognizes 1.6k
TFs (collected from human and mouse). Users can
provide a customized list of TFs instead using this
option. It specifies the path to the TF list file, in
which each line is a TF name. The file does not need
target information for the TFs; the listed TFs are used to infer
eTFs (TFs predicted based on their own expression instead of that of their targets).
--spcut SPCUT Float, optional
By default, scdiff uses p-value=0.05
as the cutoff to decide whether the DistanceToAncestor
(DTA) values of clusters are significantly different.
Clusters with similar DTA will be placed in the same
level.
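The options above can be assembled into a command line from a script. This is a minimal sketch rather than an scdiff API: `build_scdiff_command` is a hypothetical helper, the flags are the ones listed in the usage text, the executable name `scdiff` and the file names in the example are assumptions, and boolean options are passed as '1' as described above.

```python
import subprocess


def build_scdiff_command(input_file, tf_dna, clusters, output_dir,
                         large=False, speedup=False, disable_sync=False,
                         virtual_ancestor=False, log2_fold_change_cut=None,
                         etf_list_file=None, spcut=None,
                         executable="scdiff"):
    """Build the argument list for one scdiff run (hypothetical helper)."""
    # The four required arguments.
    cmd = [executable, "-i", input_file, "-t", tf_dna,
           "-k", clusters, "-o", output_dir]
    # Optional on/off switches are enabled by passing '1'.
    switches = (("-l", large), ("-s", speedup),
                ("-d", disable_sync), ("-a", virtual_ancestor))
    for flag, enabled in switches:
        if enabled:
            cmd += [flag, "1"]
    # Optional valued parameters are added only when given.
    if log2_fold_change_cut is not None:
        cmd += ["-f", str(log2_fold_change_cut)]
    if etf_list_file is not None:
        cmd += ["-e", etf_list_file]
    if spcut is not None:
        cmd += ["--spcut", str(spcut)]
    return cmd


# Example with placeholder file names; "auto" lets scdiff learn the clusters.
cmd = build_scdiff_command("expression.txt", "tf_dna.txt", "auto", "results",
                           large=True, log2_fold_change_cut=1.5)
print(" ".join(cmd))
# To actually launch the run, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Keeping the command as a list (rather than one shell string) avoids quoting problems when paths contain spaces.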
INPUTS AND PRE-PROCESSING
scdiff takes two required input files (-i/--input and -t/--tf_dna), two optional files (a -k/--clusters configuration file, which can be replaced by "auto", and -e/--etfListFile), and a few other optional parameters.
- -i/--input
(<span style="color:red">Note: The gene names in the expression file must be consistent with those in the TF_DNA file. If using the provided TF_DNA file, gene symbols must be used to represent the genes in the expression file.</span>)
This specifies the single cell RNA-Seq expression data.
If the RNA-Seq data is not processed, instructions on how to calculate expression from raw RNA-Seq reads can be found in many other studies, e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/. For example, users can use TopHat + Cufflinks to calculate gene expression in terms of FPKM. Please refer to the corresponding tools for instructions. Once the RNA-Seq gene expression is obtained, the expression data should be transformed to log space, for example by log2(x+1), where x could represent the gene expression in terms of RPKM, FPKM, TPM, or UMI count, depending on which tools are used to process the RNA-Seq expression data.
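The log2(x+1) transform mentioned above can be sketched as follows. `log2_transform` is a hypothetical helper, not part of scdiff, and the sketch assumes Python 3 (for `math.log2`) and that the values for one gene or cell are available as a list of non-negative numbers.

```python
import math


def log2_transform(values):
    """Apply log2(x + 1) to each non-negative expression value."""
    transformed = []
    for x in values:
        if x < 0:
            raise ValueError("expression values must be non-negative, got %r" % x)
        # The +1 keeps zero expression at zero in log space.
        transformed.append(math.log2(x + 1))
    return transformed


# Example: raw values such as FPKM or UMI counts.
print(log2_transform([0, 1, 3, 7]))  # [0.0, 1.0, 2.0, 3.0]
```

For data held in a numpy array `x`, the same transform is `numpy.log2(x + 1)`.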
Note: For large expression datasets (e.g. >1Gb), it's recommended to filter out genes with very low variance to speed up the run and save memory. We provide a script, utils/filterGenes.py, for this purpose (please use the "--help" parameter to show the usage information). The top 5000-10,000 genes are enough for most cases, as the expression of many genes is quite stable (OR all zeros/very small values for non-expressing genes) and thus non-inf
