____  ____  ____  _  _____ _____
		/ ___\/   _\/  _ \/ \/    //    /
		|    \|  /  | | \|| ||  __\|  __\
		\___ ||  \__| |_/|| || |   | |   
		\____/\____/\____/\_/\_/   \_/

!!!!NEW!!!

For large single-cell datasets (e.g, > 2k cells), please use the new version of scdiff (scdiff2) at : https://github.com/phoenixding/scdiff2

SCDIFF 2.0 utilizes HDF5, Sparse matrix, and multi-threading techniques to reduce the resource requirement of the program while improving the efficiency. It also incorperates many new clustering and trajectory inference methods for more comprehensive and accurate predictions.

A few highlights:
(1) VERY EFFICIENT: Analyze 40k cells (~10k genes/cell) within 1-2 hours (--ncores 12 --maxloop 0)
(2) VERY FLEXIBLE: It was composed of many moving pieces, each can be customized by the users.

INTRODUCTION

<div style="text-align: justify"> Most existing single-cell trajectory inference methods have relied primarily on the assumption that descendant cells are similar to their parents in terms of gene expression levels. These assumptions do not always hold for in-vivo studies which often include infrequently sampled, un-synchronized and diverse cell populations. Thus, additional information may be needed to determine the correct ordering and branching of progenitor cells and the set of transcription factors (TFs) that are active during advancing stages of organogenesis. To enable such modeling we developed scdiff, which integrates expression similarity with regulatory information to reconstruct the dynamic developmental cell trajectories.

SCDIFF is a package written in python and javascript, designed to analyze the cell differentiation trajectories using time-series single cell RNA-seq data. It is able to predict the transcription factors and differential genes associated with the cell differentiation trajectoreis. It also visualizes the trajectories using an interactive tree-stucture graph, in which nodes represent different sub-population cells (clusters).

</div>

flowchart

PREREQUISITES

python (python 2 and python 3 are both supported)
It was installed by default for most Linux distribution and MAC.
If not, please check https://www.python.org/downloads/ for installation instructions.
Python packages dependencies:
-- scikit-learn >=0.20
-- scipy >=0.13.3
-- numpy >=1.8.2
-- matplotlib >=2.2.3
-- pydiffmap >=0.1.1,<0.2.0
-- imbalanced_learn >=0.4.2

The python setup.py script (or pip) will try to install these packages automatically. However, please install them manually if, by any reason, the automatic installation fails.

INSTALLATION

There are 3 options to install scdiff.

Option 1: Install from download directory
cd to the downloaded scdiff package root directory
```
$cd scdiff
```
run python setup to install
```
$python setup.py install
```
MacOS or Linux users might need the sudo/root access to install. Users without the root access can install the package using the pip/easy_install with a --user parameter (install python libraries without root)．
```
$sudo python setup.py install 
```
use python3 instead of python in the above commands to install if using python3.

Option 2: Install from Github (recommended):

python 2:

$sudo pip install --upgrade https://github.com/phoenixding/scdiff/zipball/master

python 3:

$sudo pip3 install --upgrade https://github.com/phoenixding/scdiff/zipball/master

Option 3: Install from PyPI :

python2:

$sudo pip install --upgrade scdiff

python 3:

$sudo pip3 install --upgrade scdiff

The above pip installation options should be working for Linux, Window and MacOS systems.
For MacOS users, it's recommended to use python3 installation. The default python2 in MacOS has some compatibility issues with a few dependent libraries. The users would have to install their own version of python2 (e.g. via Anaconda) if they prefer to use python2 in MacOS.

USAGE

scdiff.py [-h] -i INPUT -t TF_DNA -k CLUSTERS -o OUTPUT [-l LARGE]
                 [-s SPEEDUP] [-d DSYNC] [-a VIRTUALANCESTOR]
                 [-f LOG2FOLDCHANGECUT] [-e ETFLISTFILE] [--spcut SPCUT]

	-h, --help            show this help message and exit

	-i INPUT, --input INPUT, required 
						input single cell RNA-seq expression data
						
	-t TF_DNA, --tf_dna TF_DNA, required
						TF-DNA interactions used in the analysis
						
	-k CLUSTERS, --clusters CLUSTERS, required
						how to learn the number of clusters for each time
						point? user-defined or auto? if user-defined, please
						specify the configuration file path. If set as "auto"
						scdiff will learn the parameters automatically.
						
	-o OUTPUT, --output OUTPUT, required
						output folder to store all results
						
	-s SPEEDUP, --speedup SPEEDUP(1/None), optional
						If set as 'True' or '1', SCIDFF will speedup the running
						by reducing the iteration times.
						
	-l LARGETYPE,  --largetype LARGETYPE (1/None), optional
						if specified as 'True' or '1', scdiff will use LargeType mode to 
						improve the running efficiency (both memory and time). 
						As spectral clustering is not scalable to large data,
						PCA+K-Means clustering was used instead. The running speed is improved 
						significantly but the performance is slightly worse. If there are
						more than 2k cells at each time point on average, it is highly 
						recommended to use this parameter to improve time and memory efficiency.
						
						
	-d DSYNC,  --dsync DSYNC (1/None), optional
						If specified as 'True' or '1', the cell synchronization will be disabled. 
						If the users believe that cells at the same time point are similar in terms of 
						differentiation/development. The synchronization can be disabled.

	-a VIRTUALANCESTOR, --virtualAncestor VIRTUALANCESTOR (1/None), optional
						scdiff requires a 'Ancestor' node (the starting node, 
						all other nodes are descendants).  By default, 
						the 'Ancestor' node is set as the first time point. The hypothesis behind is :  
						The cells at first time points are not differentiated yet
						( or at the very early stage of differentiation and thus no clear sub-groups, 
						all Cells at the first time point belong to the same cluster).  
						  
						If it is not the case, users can set -a as 'True' or '1' to enable
						a virtual ancestor before the first time point.  The expression of the 
						virtual ancestor is the median expression of all cells at first time point. 
						 
	-f LOG2FOLDCHANGECUT, --log2foldchangecut LOG2FOLDCHANGECUT (Float), optional
						By default, scdiff uses log2 Fold change 1(=>2^1=2)
						as the cutoff for differential genes (together with t-test p-value cutoff 0.05).
						However, users are allowed to customize the cutoff based on their 
						application scenario (e.g. log2 fold change 1.5). 
						
	-e ETFLISTFILE, --etfListFile ETFLISTFILE (String), optional  
						By default, scdiff recognizes 1.6k
						TFs (we collected in human and mouse). Users are able
						to provide a customized list of TFs instead using this
						option. It specifies the path to the TF list file, in
						which each line is a TF name. Here, it does not require 
						the targets information for the TFs, which will be used to infer
						eTFs (TFs predicted based on the expression of themselves instead of the their targets).
						
	--spcut SPCUT       Float, optional  
						By default, scdiff uses p-value=0.05
						as the cutoff to tell whether the DistanceToAncestor
						(DTA) of clusters are significantly different.
						Clusters with similar DTA will be placed in the same
						level.

INPUTS AND PRE-PROCESSING

scdiff takes the two required input files (-i/--input and -t/--tf_dna), two optional files (-k/--cluster, -e/--etfListFile) and a few other optional parameters.

-i/--input
(<span style="color:red">Note: The gene names in the expression file must be consistent with those in TF_DNA file. If using the provided TF_DNA file, gene symbols must be used to represent the genes in the expression file.</span>)
This specifies the single cell RNA-Seq expression data.
If the RNA-Seq data is not processed, the instruction about how to calculate expression based on RNA-Seq raw reads can be found in many other studies, e.g (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/). For example, users can use Tophat + Cufflink to calculate the gene expression in terms of FPKM. Please refer to corresponding tools for instructions. Once we get the RNA-Seq gene expression, the expression data should be transformed to log space for example by log2(x+1) where x could represent the gene expression in terms of RPKM, FPKM,TPM, umi-count depending on what tools are used to process the RNA-Seq expression data.
Note: For large expression datasets (e.g. >1Gb), it's recommended to filter the genes with very low variance to speed up and save memory. We provided a script utils/filterGenes.py in the utils folder for this purpose (please use "--help" parameter to show the usage information). Top 5000-10,000 genes are enough for most cases as the expression of many genes is quite stable (OR all zeros/very small values for non-expressing genes) and thus non-inf

Scdiff

Install / Use

README