ATACProc
ATAC-seq processing pipeline
Install / Use
/learn @ay-lab/ATACProcREADME
ATACProc - a pipeline for processing ATAC-seq data
Devloper: Sourya Bhattacharyya
Supervisors: Dr. Ferhat Ay and Dr. Pandurangan Vijayanand
La Jolla Institute for Immunology, CA 92037, USA
#######################
ATACProc is a pipeline to analyze ATAC-seq data. Currently datasets involving one of the four reference genomes, namely hg19, hg38, mm9 and mm10 are supported. Important features of this pipeline are:
-
Supports single or paired-end fastq or BAM formatted data.
-
Generates alignment summary and QC statistics.
-
Peak calls using MACS2, for multiple FDR thresholds (0.01 and 0.05)
-
Generating raw and coverage normalized BigWig tracks for visualizing the data in UCSC genome browser.
-
Irreproducible Discovery Rate (IDR) analysis (https://github.com/nboley/idr) between a set of peak calls or even a set of input alignment (BAM) files (in which case, peaks are estimated first) corresponding to a set of biological or technical ATAC-seq replicates.
-
New in version 2.0: Support discarding reads falling in blacklisted genomic regions
-
New in version 2.0: Support extracting nucleosome free reads (NFR), one or more nucleosome containing regions (denoted as +1M), for TF footprinting analysis.
-
New in version 2.0: Compatibility to the package ATAQV (https://github.com/ParkerLab/ataqv) for generating summary statistics across a set of samples.
#######################
Release notes
Version 2.2 - April 2022
Added -F option - corresponds to using different types of reads for footprinting.
Default = 1, means footprinting with nucleosome free reads (NFR) will be done.
Best for standard ATAC-seq protocols (Li et al. Genome Biology, 2019)
If -F option is 2, footprinting with nucleosome reads will also be separately computed in addition to the NFR based footprints (two different footprinting outputs).
If -F option is 3, footprinting with all the reads will also be separately computed in addition to the NFR based and nucleosome read based footprints (three different footprinting outputs).
Version 2.1 - July 2020
Minor change of picard duplicate removal syntax, according to the picard tool version 2.8.14 We recommend using this (or later) versions
Version 2.0 - November 2019
-
Included TF footprinting, optional discarding of blacklisted genomic regions, motif analysis
-
Updated summary statistics incorporating support for ATAQV package (https://github.com/ParkerLab/ataqv)
-
Discarded R package ATACseqQC (https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html) and corresponding operations, mainly due to its time complexity and reliability issues.
Version 1.0 - July 2018:
-
Released first version of ATAC-seq pipeline, supporting generation of QC metrics, peak calls, signal tracks for visualizing in UCSC genome browser.
-
Also supports IDR between a set of peaks / alignments for a set of replicates.
Theory
Papers / links for understanding ATAC-seq QCs:
-
https://github.com/crazyhottommy/ChIP-seq-analysis (very useful; contains many papers and links for understanding ChIP-seq and ATAC-seq data)
-
https://www.encodeproject.org/data-standards/terms/#library
-
https://www.biostars.org/p/187204/
-
http://seqanswers.com/forums/archive/index.php/t-59219.html
-
https://github.com/kundajelab/atac_dnase_pipelines
-
https://github.com/ParkerLab/bioinf525#sifting
-
https://github.com/taoliu/MACS/issues/145
-
https://www.biostars.org/p/207318/
-
https://www.biostars.org/p/209592/
-
https://www.biostars.org/p/205576/
Understanding peak calling
- https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r137
Understanding TF footprinting
- https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1642-2
Understanding IDR analysis
- https://github.com/nboley/idr
Installation
Following packages / libraries should be installed before running this pipeline:
-
Python 2.7
-
R environment (we have used 3.4.3)
User should also install the following R packages, by running the following command inside R prompt:
install.packages(c(“optparse”, “ggplot2”, “data.table”, “plotly”))
Also user needs to install the bioconductor package GenomicRanges https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html
-
Bowtie2 (we have used version 2.3.3.1) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
-
samtools (we have used version 1.6) http://samtools.sourceforge.net/
-
PICARD tools (we have used 2.8.14 version now; previously we were using version 2.7.1) https://broadinstitute.github.io/picard/
-
Utilities "bedGraphToBigWig", "bedSort", "bigBedToBed", "hubCheck" and "fetchChromSizes" - to be downloaded from UCSC repository http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
-
deepTools (we have used version 2.0) https://deeptools.readthedocs.io/en/develop/
-
MACS2 (we have used version 2.1.1) https://github.com/taoliu/MACS
-
HOMER (we recommend using the latest version) http://homer.ucsd.edu/homer/
-
The package ataqv (https://github.com/ParkerLab/ataqv). User needs to download the GitHub release (.tar.gz) file in a convenient location, extract it, and provide corresponding path in a configuration file (mentioned below).
-
Regulatory genomics toolbox (https://www.regulatory-genomics.org/)
First user needs to install the module RGT using the following commands:
pip install --user cython numpy scipy pip install --user RGTA folder rgtdata would be created inside the home directory. Next step is to configure that folder by typing the following commands:
cd ~/rgtdata python setupGenomicData.py --hg19 python setupGenomicData.py --hg38 python setupGenomicData.py --mm9 python setupGenomicData.py --mm10 (Note: it is better to run the last four commands together in a qsub / cluster environment, otherwise it'll be time consuming).Then, user needs to set up the motif configuration data, via executing the following commands (preferable to run in qsub / cluster environment)
cd ~/rgtdata python setupLogoData.py --all
User should include the PATH of above mentioned libraries / packages inside their SYSTEM PATH variable. Alternatively, installation PATHS for some of these packages are to be mentioned in a separate configuration file (described below)
Following packages / libraries are to be installed for executing IDR code
-
sambamba (we have used version 0.6.7) http://lomereiter.github.io/sambamba/
-
IDRCode (https://drive.google.com/file/d/0B_ssVVyXv8ZSX3luT0xhV3ZQNWc/view?usp=sharing). User should unzip the archieve and store in convenient location. Path of this archieve is to be provided for executing IDR code.
Execution
User should first clone this pipeline in a convenient location, using the following command:
git clone https://github.com/ay-lab/ATACProc.git
A sample script "pipeline_exec.sh" contains basic execution commands, to invoke the main executable "pipeline.sh" (located inside the folder "bin"). The executable has the following command line options:
Options:
Mandatory parameters:
-C ConfigFile
Configuration file to be separately provided. Mandatory parameter. Current package includes four sample configuration files named "configfile_*" corresponding to the reference genomes hg19, hg38, mm9 and mm10. Detailed description of the entries in this configuration file are mentioned later.
-f FASTQ1
Read 1 (or forward strand) of paired-end sequencing data [.fq|.gz|.bz2]. Or, even an aligned genome (.bam file; single or paired end alignment) can be provided.
-r FASTQ2
R2 of pair-end sequencing data [.fq|.gz|.bz2]. If not provided, and the -f parameter is not a BAM file, the input is assumed to be single ended.
-n PREFIX
Prefix string of output files. For example, -n "TEST" means that the output filenames start with the string "TEST". Generally, sample names with run ID, lane information, etc. can be used as a prefix string.
-g BOWTIE2_GENOME
Bowtie2 indexed reference genome. Basically, the folder containing bwt2 indices (corresponding to the reference genome) are to be provided. Mandatory parameter if the user provides fastq files as input (-f and -r options). If user provides .bam files as an input (-f option) then this field is optional.
-d OutDir
Output directory to store the results for the current sample.
-c CONTROLBAM
Control file(s) used for peak calling using MACS2. One or more alignment files can be provided to be used as a control. It may not be specified at all, in which case MACS2 operates without any control. Control file can be either in *BAM* or in *tagalign.gz* format (the standalone script *bin/TagAlign.sh* in this repository converts BAM file to tagalign.gz format). For multiple control files, they all are required to be of the same format (i.e. either all BAM or all tagalign.gz). Example: -c control1.bam -c control2.bam puts two control files for using in MACS2.
-w BigWigGenome
Reference genome as a string. Allowed values are hg19 (default), hg38, mm9 and mm10. If -g option is enabled (i.e. the Bowtie2 index genome is provided), this field is optional. Otherwise, mandatory parameter.
-D DEBUG_TXT
Binary variable. If 1 (recommended), dumps QC statistics. For a set of samples, those QC statistics can be used later to profile QC variation among different samples.
-O Overwrite
Binary variable. If 1, overwrites the existing files (if any). Default = 0.
-F Footprint
This flag specifies the footprinting option. Value can be 1 (default), 2, or 3
1: footoprint using the nucleosome free reads (NFR) will be computed.
Default setting. Best for default ATAC-seq protocol (check Li et. al. Genome Biology 2019)
