Phantompeakqualtools
This package computes informative enrichment and quality measures for ChIP-seq/DNase-seq/FAIRE-seq/MNase-seq data. It can also be used to obtain robust estimates of the predominant fragment length or characteristic tag shift values in these assays.
Install / Use
/learn @kundajelab/PhantompeakqualtoolsREADME
WARNING: If you see the following error, then downgrade your
sppto 1.14 or 1.15.x.
Error in lwcc(x, y, s, e, return.peaks = return.peaks, bg.x = bg.x, bg.y = bg.y, :
INTEGER() can only be applied to a 'integer', not a 'character'
Summary
This package computes informative enrichment and quality measures for ChIP-seq/DNase-seq/FAIRE-seq/MNase-seq data. It can also be used to obtain robust estimates of the predominant fragment length or characteristic tag shift values in these assays.
Introduction
This set of programs operate on mapped Illumina single-end read datasets in tagAlign or BAM format. They can be used to
- Compute the predominant insert-size (fragment length) based on strand cross-correlation peak
- Compute Data quality measures based on relative phantom peak
- Call Peaks and regions for punctate binding datasets
Citations
If you are using the code or results in any formal publication please cite
[1] Landt SG1, Marinov GK, Kundaje A et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012 Sep;22(9):1813-31. doi: 10.1101/gr.136184.111.
[2] Kharchenko PK, Tolstorukov MY, Park PJ, Design and analysis of ChIP-seq experiments for DNA-binding proteins Nat Biotechnol. 2008 Dec;26(12):1351-9
Dependencies
NOTE: The current package does not run on a MacOS or Windows.
- unix, bash, R (>=3.1), awk, samtools, boost C++ libraries
- R packages: spp (1.14, 1.15.x), caTools, snow, snowfall, bitops, Rsamtools (in Bioconductor)
Files
spp_1.14.tar.gz: modified SPP peak-caller package (The original SPP-peak caller package was written by Peter Kharchenko [2], https://github.com/hms-dbmi/spp)run_spp.R: The script to compute the frag length, data quality characteristics based on cross-correlation analysis and/or peak calling
Installation
-
First make sure that you have installed R (version 3.1 or higher)
-
Also, you must have the Boost C++ libraries installed. Most linux distributions have these preinstalled. If not, you can easily get these from your standard package manager for your linux distribution. e.g synaptic package manager (
apt-get) for ubuntu or emerge for gentoo. -
Clone the repo.
$ git clone https://github.com/kundajelab/phantompeakqualtools -
Install the following R packages
- snow (if you want parallel processing)
- snowfall
- bitops
- caTools
- Rsamtools (in the Bioconductor package)
$ cd phantompeakqualtools $ R (From within R) > install.packages("snow", repos="http://cran.us.r-project.org") > install.packages("snowfall", repos="http://cran.us.r-project.org") > install.packages("bitops", repos="http://cran.us.r-project.org") > install.packages("caTools", repos="http://cran.us.r-project.org") > source("http://bioconductor.org/biocLite.R") > biocLite("Rsamtools",suppressUpdates=TRUE) > install.packages("./spp_1.14.tar.gz") -
If your alignment files are BAM, you must have the samtools executable in your path so that the R script
run_spp.Rcan call it using thesystem()command You can get samtools from (here)[http://samtools.sourceforge.net/] You can add the following line to your~/.bashrcfileexport PATH="<path_to_samtools_executable>:${PATH}" -
Run
run_spp.RRscript run_spp.R <options>
General usage
Usage: Rscript run_spp.R <options>
-
Mandatory arguments
| argument | description | |-------------------------|---------------------------------------------| |-c=<ChIP_alignFile> | full path and name (or URL) of tagAlign/BAM file (can be gzipped) (FILE EXTENSION MUST BE tagAlign.gz, tagAlign, bam or bam.gz) |
-
Mandatory arguments for peak calling
| argument | description | |-------------------------|---------------------------------------------| |-i=<Input_alignFile> | full path and name (or URL) of tagAlign/BAM file (can be gzipped) (FILE EXTENSION MUST BE tagAlign.gz, tagAlign, bam or bam.gz) |
-
Optional arguments
| argument | description | |---------------------------|-------------------------------------------------| |-s=<min>:<step>:<max>| strand shifts at which cross-correlation is evaluated, default=-500:5:1500 | |-speak=<strPeak> | user-defined cross-correlation peak strandshift | |-x=<min>:<max> | strand shifts to exclude (This is mainly to avoid region around phantom peak) default=10:(readlen+10) | |-p=<nodes> | number of parallel processing nodes, default=0 | |-fdr=<falseDisoveryRate> | false discovery rate threshold for peak calling | |-npeak=<numPeaks> | threshold on number of peaks to call | |-tmpdir=<tempdir> | temporary directory (if not specified R function
tempdir()is used) | |-filtchr=<chrnamePattern>| pattern to use to remove tags that map to specific chromosomes e.g. _ will remove all tags that map to chromosomes with _ in their name | -
Output arguments
| argument | description | |----------------------------|------------------------------------------------| |-odir=<outputDirectory> | name of output directory (If not set same as ChIP file directory is used) | |-savn=<narrowpeakfilename>| NarrowPeak file name (fixed width peaks) | |-savn | | |-savr=<regionpeakfilename>| RegionPeak file name (variable width peaks with regions of enrichment around peak summits) | |-savr | | |-savd=<rdatafile> | save Rdata file | |-savd | | |-savp=<plotdatafile> | save cross-correlation plot | |-savp | | |-out=<resultfile> | append peakshift/phantomPeak results to a file | |-rf | if plot or rdata or narrowPeak file exists replace it. If not used then the run is aborted if the plot or Rdata or narrowPeak file exists | |-clean | if used it will remove the original chip and control files after reading them in. CAUTION: Use only if the script calling
run_spp.Ris creating temporary files |
Typical usage
-
Determine strand cross-correlation peak / predominant fragment length OR print out quality measures.
-out=<outFile>will create and/or append to a file named <outFile> several important characteristics of the dataset.Rscript run_spp.R -c=<tagAlign/BAMfile> -savp -out=<outFile>The file contains 11 tab delimited columns.
|col.| abbreviation | description | |----|-----------------|------------------------------------------------------------------------------------------------------| |1 | Filename | tagAlign/BAM filename | |2 | numReads | effective sequencing depth i.e. total number of mapped reads in input file | |3 | estFragLen | comma separated strand cross-correlation peak(s) in decreasing order of correlation. | |4 | corr_estFragLen | comma separated strand cross-correlation value(s) in decreasing order (COL2 follows the same order) | |5 | phantomPeak | Read length/phantom peak strand shift | |6 | corr_phantomPeak| Correlation value at phantom peak | |7 | argmin_corr | strand shift at which cross-correlation is lowest | |8 | min_corr | minimum value of cross-correlation | |9 | NSC | Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8 | |10 | RSC | Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8) | |11 | QualityTag | Quality tag based on thresholded RSC (codes= -2:veryLow, -1:Low, 0:Medium, 1:High, 2:veryHigh) |
The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output. In almost all cases, the top (first) value in the list represents the predominant fragment length. If you want to keep only the top value simply run
sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>You can run the program on multiple datasets in parallel and append all the quality information to the same <outFile> for a summary analysis.
NSC values range from a minimum of 1 to larger positive numbers. 1.1 is the critical threshold. Datasets with NSC values much less than 1.1 (< 1.05) tend to have low signal to noise or few peaks (this could be biological eg.a factor that truly binds only a few sites in a particular tissue type OR it could be due to poor quality)
RSC values range from 0 to larger positive values. 1 is the critical threshold. RSC values significantly lower than 1 (< 0.8) tend to have low signal to noise. The low scores can be due to failed and poor quality ChIP, low read sequence quality and hence lots of mismappings, shallow sequencing depth (significantly below saturation) or a combination of these. Like the NSC, datasets with few binding sites (< 200) which is biologically justifiable also show lo
