Snakemake Workflow for NGS Analysis

=================

Introduction
Usage
Description
- Common Tools
- ATAC-seq
- DNA-seq
- Assembly
- RNA-seq
Notes

Introduction

A Snakemake workflow for bioinformatics analysis, covering:

[x] ATAC-seq/Cut&Tag/ChIP-seq
[x] DNA-seq (WGS and WES)
[x] Assembly
[x] RNA-seq
[ ] HiC

To-Do list will be updated soon...

Usage

Configuration

Global Config

One-Click global environment configuration

sh setup.sh

The script performs the following steps:

It replaces the value of the root_dir parameter, which represents the project path.
It installs all dependency packages.
It copies the file base.py from the utils folder to the installation path of the snakemake-wrapper-utils package.

In the global configuration file, named config.yaml, both the pipeline and outdir parameters require manual configuration.

The pipeline parameter specifies which workflow to run.
The outdir parameter defines the path where output results are saved.

Workflow Config

Parameters in the global configuration file are overridden by identically named parameters in the workflow configuration file, so the workflow configuration file takes precedence.

Workflows include:

ATAC-seq: config file is ATAC-seq.yaml
DNA-seq: config file is DNA-seq.yaml
Assembly: config file is ATAC-seq.yaml
RNA-seq: config file is ATAC-seq.yaml
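
The precedence rule described above can be illustrated with a minimal sketch. It assumes a shallow, key-by-key override; whether the workflow merges nested sections differently is not stated here. The keys threads and mapping_tool are hypothetical and only serve the illustration.

```python
def merge_config(global_cfg: dict, workflow_cfg: dict) -> dict:
    """Return a new dict in which workflow values override global ones."""
    merged = dict(global_cfg)    # start from a copy of the global settings
    merged.update(workflow_cfg)  # identically named keys are replaced
    return merged


# Hypothetical values; only pipeline and outdir are named in this README.
global_cfg = {"pipeline": "ATAC-seq", "outdir": "results", "threads": 4}
workflow_cfg = {"threads": 8, "mapping_tool": "bowtie2"}

merged = merge_config(global_cfg, workflow_cfg)
print(merged)
```

Keys that appear only in the global file (here pipeline and outdir) survive unchanged, while threads takes the workflow value.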

Sample Information

Example files are provided in the directory example. You need to modify the file paths in sample_info.json and sample_list.txt.

sample_list.txt: generate it with the script generate_setting.py, or create it manually

python generate_setting.py path/to/fastq.gz

| sample | fastq1 | fastq2 (optional) | type (optional) | experiment (optional) |
| ------ | ------ | ------ | ------ | ------ |
| sp1 | path/to/sp1.R1.fq.gz | path/to/sp1.R2.fq.gz | control | chip |
| sp2 | path/to/sp2.R1.fq.gz | path/to/sp2.R2.fq.gz | control | chip |
| sp3 | path/to/sp3.R1.fq.gz | path/to/sp3.R2.fq.gz | test | atac |
| sp4 | path/to/sp4.R1.fq.gz | path/to/sp4.R2.fq.gz | test | atac |

sample_info.json: sample pairing; each key is a test sample and its value is the paired control sample

{
    "sp3": "sp1",
    "sp4": "sp2"
}
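
Before launching a run, it can help to check that the two files agree. The sketch below is not part of the workflow; it assumes sample_list.txt is tab-separated with the header row shown in the table, and it uses in-memory strings instead of real files. The helper names read_samples and check_pairs are hypothetical.

```python
import csv
import io
import json

# Assumption: tab-separated columns with a header row, as in the table above.
SAMPLE_LIST = (
    "sample\tfastq1\tfastq2\ttype\texperiment\n"
    "sp1\tpath/to/sp1.R1.fq.gz\tpath/to/sp1.R2.fq.gz\tcontrol\tchip\n"
    "sp3\tpath/to/sp3.R1.fq.gz\tpath/to/sp3.R2.fq.gz\ttest\tatac\n"
)
SAMPLE_INFO = '{"sp3": "sp1"}'


def read_samples(handle):
    """Index the rows of a sample list by sample name."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row for row in reader}


def check_pairs(samples, pairs):
    """Return a message for every paired name missing from the sample list."""
    problems = []
    for test_name, control_name in pairs.items():
        for name in (test_name, control_name):
            if name not in samples:
                problems.append(f"{name} is not in the sample list")
    return problems


samples = read_samples(io.StringIO(SAMPLE_LIST))
pairs = json.loads(SAMPLE_INFO)
print(check_pairs(samples, pairs))
```

With real files, io.StringIO(SAMPLE_LIST) would be replaced by an open file handle and json.loads by json.load on sample_info.json.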

Running

# run locally
snakemake -s path/to/Snakefile --use-conda -c4
# or run on a cluster managed by Slurm
snakemake -s path/to/Snakefile --profile path/to/config/slurm

If you use Slurm, you can set the parameters of the sbatch command as the value of SBATCH_DEFAULTS in the settings.json file, for example:

{
    "SBATCH_DEFAULTS": "--nodelist=node1",
    "CLUSTER_NAME": "",
    "CLUSTER_CONFIG": ""
}

Description

Common Tools

1. Trimming Tools

fastp: A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance.
trimmomatic: A flexible read trimming tool for Illumina NGS data.
cutadapt: Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
trim_galore: Trim Galore is a wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data.

2. Mapping Tools

bwa: BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
bwa-mem2: The tool bwa-mem2 is the next version of the bwa-mem algorithm in bwa. It produces alignment identical to bwa and is ~1.3-3.1x faster depending on the use-case, dataset and the running machine.
bowtie2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
hisat2: HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome).
star: STAR (Spliced Transcripts Alignment to a Reference) is a splice-aware aligner for RNA-seq reads. It is implemented as standalone C++ code and is free open source software distributed under the GPLv3 license.
minimap2: Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

3. Dedup Tools

Picard MarkDuplicates: MarkDuplicates identifies duplicate reads in SAM or BAM files. Picard is implemented using the HTSJDK Java library to support accessing file formats that are commonly used for high-throughput sequencing data, such as SAM and VCF.
sambamba: Sambamba is a high performance, highly parallel, robust and fast tool (and library), written in the D programming language, for working with SAM and BAM files.

4. Others

samtools: mpileup and other tools for handling SAM, BAM, CRAM.
bedtools: The swiss army knife for genome arithmetic.

ATAC-seq

1. Visualization

deeptools: deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats, that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.

2. Peak Calling

macs2: Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites.

3. Motif Discovery

homer: HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and ChIP-Seq analysis.

4. QC (Quality Control)

The QC metrics follow the methods used in the ENCODE ATAC-seq pipeline.

Mapping quality

Total reads
Mapped reads

Enrichment

Fraction of reads in peaks (FRiP): Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)
Fraction of reads in annotated regions
Reads count distribution of chromatin
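
The FRiP definition above reduces to a single ratio. A minimal sketch, assuming the two read counts have already been obtained elsewhere (for example by counting reads that overlap the called peaks); the function name frip and the example counts are hypothetical.

```python
def frip(reads_in_peaks: int, usable_reads: int) -> float:
    """Fraction of usable reads that fall into called peak regions."""
    if usable_reads <= 0:
        raise ValueError("usable_reads must be positive")
    if not 0 <= reads_in_peaks <= usable_reads:
        raise ValueError("reads_in_peaks must lie between 0 and usable_reads")
    return reads_in_peaks / usable_reads


# Hypothetical counts: 2.5 million of 10 million usable reads overlap peaks.
score = frip(2_500_000, 10_000_000)
print(f"FRiP = {score:.2f}")
```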

Library Complexity

ChIP-seq Standards

| PBC1 | PBC2 | Bottlenecking level | NRF | Complexity |
| :--------------: | :-----------: | :-----------------: | :-------------: | ---------- |
| < 0.5 | < 1 | Severe | < 0.5 | Concerning |
| 0.5 ≤ PBC1 < 0.8 | 1 ≤ PBC2 < 3 | Moderate | 0.5 ≤ NRF < 0.8 | Acceptable |
| 0.8 ≤ PBC1 < 0.9 | 3 ≤ PBC2 < 10 | Mild | 0.8 ≤ NRF < 0.9 | Compliant |
| ≥ 0.9 | ≥ 10 | None | > 0.9 | Ideal |
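
The ChIP-seq thresholds in the table above can be applied programmatically. The following is a sketch, not code from the workflow: the function names are hypothetical, and because the table leaves NRF = 0.9 exactly unassigned, the sketch assumes it counts as Ideal.

```python
from bisect import bisect_right

# Cutoffs and labels copied from the ChIP-seq standards table.
PBC1_CUTOFFS = [0.5, 0.8, 0.9]
PBC2_CUTOFFS = [1, 3, 10]
NRF_CUTOFFS = [0.5, 0.8, 0.9]
BOTTLENECK_LABELS = ["Severe", "Moderate", "Mild", "None"]
COMPLEXITY_LABELS = ["Concerning", "Acceptable", "Compliant", "Ideal"]


def _grade(value, cutoffs, labels):
    """Pick the label whose half-open interval [lower, upper) contains value."""
    return labels[bisect_right(cutoffs, value)]


def pbc1_level(pbc1):
    return _grade(pbc1, PBC1_CUTOFFS, BOTTLENECK_LABELS)


def pbc2_level(pbc2):
    return _grade(pbc2, PBC2_CUTOFFS, BOTTLENECK_LABELS)


def nrf_complexity(nrf):
    # Assumption: NRF == 0.9 is graded as Ideal (the table only lists > 0.9).
    return _grade(nrf, NRF_CUTOFFS, COMPLEXITY_LABELS)


print(pbc1_level(0.85), pbc2_level(0.5), nrf_complexity(0.95))
```

Each metric is graded separately; how to combine PBC1 and PBC2 when they land in different bottlenecking levels is left to the reader.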

ATAC-seq Standards

| PBC1 | PBC2 | Bott
