ATACProc - a pipeline for processing ATAC-seq data

Devloper: Sourya Bhattacharyya

Supervisors: Dr. Ferhat Ay and Dr. Pandurangan Vijayanand

La Jolla Institute for Immunology, CA 92037, USA

#######################

ATACProc is a pipeline to analyze ATAC-seq data. Currently datasets involving one of the four reference genomes, namely hg19, hg38, mm9 and mm10 are supported. Important features of this pipeline are:

Supports single or paired-end fastq or BAM formatted data.
Generates alignment summary and QC statistics.
Peak calls using MACS2, for multiple FDR thresholds (0.01 and 0.05)
Generating raw and coverage normalized BigWig tracks for visualizing the data in UCSC genome browser.
Irreproducible Discovery Rate (IDR) analysis (https://github.com/nboley/idr) between a set of peak calls or even a set of input alignment (BAM) files (in which case, peaks are estimated first) corresponding to a set of biological or technical ATAC-seq replicates.
New in version 2.0: Support discarding reads falling in blacklisted genomic regions
New in version 2.0: Support extracting nucleosome free reads (NFR), one or more nucleosome containing regions (denoted as +1M), for TF footprinting analysis.
New in version 2.0: Compatibility to the package ATAQV (https://github.com/ParkerLab/ataqv) for generating summary statistics across a set of samples.

#######################

Release notes

Version 2.2 - April 2022

Added -F option - corresponds to using different types of reads for footprinting.

Default = 1, means footprinting with nucleosome free reads (NFR) will be done.

Best for standard ATAC-seq protocols (Li et al. Genome Biology, 2019)

If -F option is 2, footprinting with nucleosome reads will also be separately computed in addition to the NFR based footprints (two different footprinting outputs).

If -F option is 3, footprinting with all the reads will also be separately computed in addition to the NFR based and nucleosome read based footprints (three different footprinting outputs).

Version 2.1 - July 2020

Minor change of picard duplicate removal syntax, according to the picard tool version 2.8.14 We recommend using this (or later) versions

Version 2.0 - November 2019

Included TF footprinting, optional discarding of blacklisted genomic regions, motif analysis
Updated summary statistics incorporating support for ATAQV package (https://github.com/ParkerLab/ataqv)
Discarded R package ATACseqQC (https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html) and corresponding operations, mainly due to its time complexity and reliability issues.

Version 1.0 - July 2018:

Released first version of ATAC-seq pipeline, supporting generation of QC metrics, peak calls, signal tracks for visualizing in UCSC genome browser.
Also supports IDR between a set of peaks / alignments for a set of replicates.

Theory

Papers / links for understanding ATAC-seq QCs:

https://github.com/crazyhottommy/ChIP-seq-analysis (very useful; contains many papers and links for understanding ChIP-seq and ATAC-seq data)
https://www.encodeproject.org/data-standards/terms/#library
https://www.biostars.org/p/187204/
http://seqanswers.com/forums/archive/index.php/t-59219.html
https://github.com/kundajelab/atac_dnase_pipelines
https://github.com/ParkerLab/bioinf525#sifting
https://github.com/taoliu/MACS/issues/145
https://www.biostars.org/p/207318/
https://www.biostars.org/p/209592/
https://www.biostars.org/p/205576/

Understanding peak calling

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r137

Understanding TF footprinting

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1642-2

Understanding IDR analysis

https://github.com/nboley/idr

Installation

Following packages / libraries should be installed before running this pipeline:

Python 2.7
R environment (we have used 3.4.3)

User should also install the following R packages, by running the following command inside R prompt:

install.packages(c(“optparse”, “ggplot2”, “data.table”, “plotly”))

Also user needs to install the bioconductor package GenomicRanges https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html
Bowtie2 (we have used version 2.3.3.1) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
samtools (we have used version 1.6) http://samtools.sourceforge.net/
PICARD tools (we have used 2.8.14 version now; previously we were using version 2.7.1) https://broadinstitute.github.io/picard/
Utilities "bedGraphToBigWig", "bedSort", "bigBedToBed", "hubCheck" and "fetchChromSizes" - to be downloaded from UCSC repository http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
deepTools (we have used version 2.0) https://deeptools.readthedocs.io/en/develop/
MACS2 (we have used version 2.1.1) https://github.com/taoliu/MACS
HOMER (we recommend using the latest version) http://homer.ucsd.edu/homer/
The package ataqv (https://github.com/ParkerLab/ataqv). User needs to download the GitHub release (.tar.gz) file in a convenient location, extract it, and provide corresponding path in a configuration file (mentioned below).
Regulatory genomics toolbox (https://www.regulatory-genomics.org/)

First user needs to install the module RGT using the following commands:
```
pip install --user cython numpy scipy
pip install --user RGT
```
A folder rgtdata would be created inside the home directory. Next step is to configure that folder by typing the following commands:
```
cd ~/rgtdata
python setupGenomicData.py --hg19
python setupGenomicData.py --hg38
python setupGenomicData.py --mm9
python setupGenomicData.py --mm10

(Note: it is better to run the last four commands together in a qsub / cluster environment, otherwise it'll be time consuming).
```
Then, user needs to set up the motif configuration data, via executing the following commands (preferable to run in qsub / cluster environment)
```
cd ~/rgtdata
python setupLogoData.py --all
```

User should include the PATH of above mentioned libraries / packages inside their SYSTEM PATH variable. Alternatively, installation PATHS for some of these packages are to be mentioned in a separate configuration file (described below)

Following packages / libraries are to be installed for executing IDR code

sambamba (we have used version 0.6.7) http://lomereiter.github.io/sambamba/
IDRCode (https://drive.google.com/file/d/0B_ssVVyXv8ZSX3luT0xhV3ZQNWc/view?usp=sharing). User should unzip the archieve and store in convenient location. Path of this archieve is to be provided for executing IDR code.

Execution

User should first clone this pipeline in a convenient location, using the following command:

git clone https://github.com/ay-lab/ATACProc.git

A sample script "pipeline_exec.sh" contains basic execution commands, to invoke the main executable "pipeline.sh" (located inside the folder "bin"). The executable has the following command line options:

Options:

Mandatory parameters:

-C  ConfigFile		    
     	Configuration file to be separately provided. Mandatory parameter. Current package includes four sample configuration files named "configfile_*" corresponding to the reference genomes hg19, hg38, mm9 and mm10. Detailed description of the entries in this configuration file are mentioned later.
              
-f  FASTQ1          
    	Read 1 (or forward strand) of paired-end sequencing data  [.fq|.gz|.bz2]. Or, even an aligned genome (.bam file; single or paired end alignment) can be provided.
        
-r  FASTQ2          
        R2 of pair-end sequencing data [.fq|.gz|.bz2]. If not provided, and the -f parameter is not a BAM file, the input is assumed to be single ended.

-n  PREFIX           
        Prefix string of output files. For example, -n "TEST" means that the output filenames start with the string "TEST". Generally, sample names with run ID, lane information, etc. can be used as a prefix string.

-g  BOWTIE2_GENOME   
        Bowtie2 indexed reference genome. Basically, the folder containing bwt2 indices (corresponding to the reference genome) are to be provided. Mandatory parameter if the user provides fastq files as input (-f and -r options). If user provides .bam files as an input (-f option) then this field is optional.

-d  OutDir 			  
        Output directory to store the results for the current sample.

-c  CONTROLBAM		 
     	Control file(s) used for peak calling using MACS2. One or more alignment files can be provided to be used as a control. It may not be specified at all, in which case MACS2 operates without any control. Control file can be either in *BAM* or in *tagalign.gz* format (the standalone script *bin/TagAlign.sh* in this repository converts BAM file to tagalign.gz format). For multiple control files, they all are required to be of the same format (i.e. either all BAM or all tagalign.gz). Example: -c control1.bam -c control2.bam puts two control files for using in MACS2.
	
-w BigWigGenome	 
		Reference genome as a string. Allowed values are hg19 (default), hg38, mm9 and mm10. If -g option is enabled (i.e. the Bowtie2 index genome is provided), this field is optional. Otherwise, mandatory parameter.				
	
-D  DEBUG_TXT		 
		Binary variable. If 1 (recommended), dumps QC statistics. For a set of samples, those QC statistics can be used later to profile QC variation among different samples.				
	
-O 	Overwrite		 
		Binary variable. If 1, overwrites the existing files (if any). Default = 0.

-F 	Footprint 	 	
		This flag specifies the footprinting option. Value can be 1 (default), 2, or 3
		1: footoprint using the nucleosome free reads (NFR) will be computed. 
		   Default setting. Best for default ATAC-seq protocol (check Li et. al. Genome Biology 2019)

ATACProc

Install / Use

README

ATACProc - a pipeline for processing ATAC-seq data

Release notes

Theory

Installation

Execution