[[https://travis-ci.org/tdhock/PeakSegPipeline][https://travis-ci.org/tdhock/PeakSegPipeline.png?branch=master]]

PeakSegPipeline: an R package for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.

Labeling. The first step of any supervised analysis is to label a few genomic regions with and without peaks in your data. The [[file:R/create_track_hub.R][PeakSegPipeline::create_track_hub]] function creates a track hub from bigWig files, which makes it easy to visualize data on the UCSC genome browser and then create labels (specific genomic regions where you have observed presence or absence of peaks in specific samples). For more details about labeling see [[#label-data][Input Files -> Label Data]] below.
Single-sample peak calling using PeakSegFPOP. We use the efficient [[https://github.com/tdhock/PeakSegDisk][PeakSegDisk]] implementation of Generalized Functional Pruning Optimal Partitioning ([[https://arxiv.org/abs/1810.00117][arXiv:1810.00117]]) to predict preliminary peaks for each sample independently.
Multiple-sample peak calling using PeakSegJoint. After single-sample analysis, peaks on different samples are clustered into genomic regions with overlapping peaks. In each cluster we run [[https://github.com/tdhock/PeakSegJoint][PeakSegJoint]] to approximately compute the most likely common peak positions across multiple samples -- this includes a prediction of which samples are up and down for each genomic region.

There are two major differences between PeakSegPipeline and all of the other peak detection algorithms for ChIP-seq data analysis:

Supervised machine learning: PeakSegFPOP and PeakSegJoint are trained by providing labels that indicate regions with and without peaks. So if you see false positives (a peak called where there is clearly only background noise) or false negatives (no peak called where there should be one) you can add labels to correct the models, and they learn and get more accurate as more labels are added. In contrast, other peak detection methods are unsupervised, meaning that they usually have 10-20 parameters, and no obvious way to train them, yielding arbitrarily inaccurate peaks that can NOT be corrected using labels.
Joint peak detection in any number of samples or cell types so the model can be easily interpreted to find similarities or differences between samples (PeakSegPipeline outputs a binary matrix, peaks x samples). In contrast, it is not easy to find similarities and differences between samples using single-sample peak detection methods (e.g. [[https://github.com/taoliu/MACS][MACS]]), and other multi-sample peak detection methods are limited to one (e.g. [[https://github.com/mahmoudibrahim/jamm][JAMM]]) or two (e.g. [[https://code.google.com/p/pepr-chip-seq/][PePr]]) cell types (assuming all samples of the same cell type are replicates with the same peak pattern).

** Installation and testing

Install UCSC command line programs, [[https://github.com/tdhock/PeakSegPipeline/wiki/FAQ#installing-ucsc-command-line-programs][as explained on our FAQ]]. These are necessary because the coverage data must be stored in bigWig files for efficient indexed access.

Then, install the PeakSegPipeline R package.

#+BEGIN_SRC R if(!require(devtools))install.packages("devtools") devtools::install_github("tdhock/PeakSegPipeline") #+END_SRC

Once everything has been installed, you can test your installation by executing the test script [[file:tests/testthat/test-pipeline-input.R]] -- in the shell, run:

#+BEGIN_SRC shell-script R -e 'source("https://raw.githubusercontent.com/tdhock/PeakSegPipeline/master/tests/testthat/test-pipeline-input.R")' #+END_SRC

This is the TEST_SUITE=pipeline-input test that is run by [[https://travis-ci.org/tdhock/PeakSegPipeline][Travis-CI]] after every push to this repo -- it should take about 20 minutes to run, depending on the speed of your computer ([[https://github.com/tdhock/PeakSegPipeline/wiki/test-pipeline-input][for debugging on your computer, it may be useful to look at the expected terminal output]]). The test script will first download some bigWigs and labels to the =~/PeakSegPipeline-test/input= directory, then run the PeakSegPipeline on them. If everything worked, you can view the results by opening =~/PeakSegPipeline-test/input/index.html= in a web browser, and it should be the same as the results shown on http://members.cbio.mines-paristech.fr/~thocking/hubs/test/input/

** Pipeline Input Files

PeakSegPipeline uses PeakSegFPOP + PeakSegJoint to predict common and different peaks in multiple samples. It requires three kinds of input data:

coverage data under =project_dir/samples=,
labels in =project_dir/labels=,
genomic segmentation problems in =project_dir/problems.bed=.

To give a concrete example, let us consider the data set that is used when you run [[file:tests/testthat/test-pipeline-demo.R]] -- you can do this via the shell command below: (note that this is not systematically run as a test since its runtime is usually over one hour).

#+BEGIN_SRC shell-script R -e 'source("https://raw.githubusercontent.com/tdhock/PeakSegPipeline/master/tests/testthat/test-pipeline-demo.R")' #+END_SRC

*** Coverage data

Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files must be in [[https://genome.ucsc.edu/goldenpath/help/bigWig.html][bigWig]] format, since it is indexed for fast lookup of coverage in arbitrary genomic regions. For example this test downloads 8 files:

#+BEGIN_SRC ~/PeakSegPipeline-test/demo/samples/bcell/MS026601/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/bcell_/MS010302/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input/MS002201/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input/MS026601/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input_/MS002202/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/Input_/MS010302/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/kidney/MS002201/coverage.bigWig ~/PeakSegPipeline-test/demo/samples/kidney_/MS002202/coverage.bigWig #+END_SRC

In the example above we have the =~/PeakSegPipeline-test/demo= directory which will contain all data sets, labels, and peak calls for this particular project. The =samples= directory contains a sub-directory for each sample group (experimental conditions or cell types, e.g. =bcell= or =kidney=). Each sample group directory should contain a sub-directory for each sample (e.g. =MS002201= or =MS010302=). Each sample sub-directory should contain a =coverage.bigWig= file with counts of aligned sequence reads (non-negative integers).

Note that in this demonstration project, the groups with underscores are un-labeled samples (e.g. =bcell_=), and the groups without underscores are labeled samples (e.g. =bcell=). In real projects typically you would combine those two groups into a single labeled group, but in this project we keep them separate in order to demonstrate the prediction accuracy of the learning algorithm.

*** Label Data

The =project_dir/labels/*.txt= files contain genomic regions with or without peaks. These labels will be used to train the peak prediction models (automatically select model parameters that yield optimal peak prediction accuracy). A quick and easy way to create labels is by visual inspection as in the [[http://cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/][McGill ChIP-seq peak detection benchmark]] (for details please read [[http://bioinformatics.oxfordjournals.org/content/early/2016/10/23/bioinformatics.btw672.abstract][Hocking et al, Bioinformatics 2016]]).

To visually label your data first create a project directory on a webserver. For example if your project directory is in your =~/public_html= directory, your directory structure should be =~/public_html/project_dir/samples/groupID/sampleID/coverage.bigWig=. To create a track hub, put the following in =~/public_html/project_dir/hub.sh=:

#+BEGIN_SRC shell.script #!/bin/bash Rscript -e 'PeakSegPipeline::create_track_hub("/home/tdhock/PeakSegPipeline-test/input", "http://members.cbio.mines-paristech.fr/~thocking/hubs/test/input", "hg19", "toby.hocking@r-project.org")' #+END_SRC

The arguments of the =create_track_hub= function are as follows:

The first argument =path/to/project_dir= is an absolute path to your data/project directory.
The second argument =http://your.server.com/~user/project_dir= is the URL where that directory will be made available on the web..
The third argument =hg19= is the UCSC genome ID for the genomes.txt file.
The fourth argument =email@domain.com= is your email address, which will be written to the hub.txt file.

If that command worked, then you should see a message =Created http://your.server.com/~user/project_dir/hub.txt= and then you can paste that URL into [[http://genome.ucsc.edu/cgi-bin/hgHubConnect#unlistedHubs][My Data -> Track Hubs -> My Hubs]] then click Add Hub to tell the UCSC genome browser to display your data. Navigate around the genome until you have found some peaks, then add positive and negative labels in =project_dir/labels/*.txt= files.

For example the test data set contains only one labels file,

#+BEGIN_SRC ~/PeakSegPipeline-test/demo/labels/some_labels.txt #+END_SRC

which contains lines such as the following

#+BEGIN_SRC chr10:33,061,897-33,162,814 noPeaks chr10:33,456,000-33,484,755 peakStart kidney chr10:33,597,317-33,635,209 peakEnd kidney chr10:33,662,034-33,974,942 noPeaks

chr10:35,182,820-35,261,001 noPeaks chr10:35,261,418-35,314,654 peakStart bcell kidney #+END_SRC

A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with

PeakSegPipeline

Install / Use

README