Workflow
The snakemake analysis workflow for bioinformatics
Snakemake Workflow for NGS Analysis
Introduction
A Snakemake workflow for bioinformatics analysis, including:
- [x] ATAC-seq/Cut&Tag/ChIP-seq
- [x] DNA-seq (WGS and WES)
- [x] Assembly
- [x] RNA-seq
- [ ] Hi-C
To-Do list will be updated soon...
Usage
Configuration
Global Config
One-click global environment configuration:
sh setup.sh
The script performs the following steps:
- replaces the value of the root_dir parameter, which represents the project path;
- installs all dependency packages;
- copies the file base.py from the utils folder to the installed path of the snakemake-wrapper-utils package.
In the global configuration file, config.yaml, the pipeline and outdir parameters must be set manually.
- the pipeline parameter specifies which workflow to execute.
- the outdir parameter defines the path where the output results are saved.
Workflow Config
Parameters in the workflow configuration file override identically named parameters in the global configuration file, so the workflow configuration file takes precedence.
Workflows include:
- ATAC-seq: config file is ATAC-seq.yaml
- DNA-seq: config file is DNA-seq.yaml
- Assembly: config file is ATAC-seq.yaml
- RNA-seq: config file is ATAC-seq.yaml
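The precedence rule described above can be sketched in Python as a simple dictionary merge. This is only an illustration of the rule, not the workflow's actual loading code: the function name `merge_config` and the `threads` key are hypothetical, and a shallow (top-level) merge is assumed.

```python
def merge_config(global_cfg: dict, workflow_cfg: dict) -> dict:
    """Return a new dict in which workflow values override global ones."""
    merged = dict(global_cfg)      # start from a copy of the global settings
    merged.update(workflow_cfg)    # identically named workflow keys win
    return merged


# Hypothetical values, for illustration only.
global_cfg = {"pipeline": "ATAC-seq", "outdir": "results", "threads": 4}
workflow_cfg = {"threads": 8}

config = merge_config(global_cfg, workflow_cfg)
print(config)  # {'pipeline': 'ATAC-seq', 'outdir': 'results', 'threads': 8}
```

Keys that appear only in the global file are kept unchanged, and the global dictionary itself is not modified.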
Sample Information
Example files are in the example directory; you need to modify the paths in the files sample_info.json and sample_list.txt.
- sample_list.txt: generate it with the script generate_setting.py, or create it manually
python generate_setting.py path/to/fastq.gz
| sample | fastq1 | fastq2 (optional) | type (optional) | experiment (optional) |
| ------ | ------ | ------ | ------ | ------ |
| sp1 | path/to/sp1.R1.fq.gz | path/to/sp1.R2.fq.gz | control | chip |
| sp2 | path/to/sp2.R1.fq.gz | path/to/sp2.R2.fq.gz | control | chip |
| sp3 | path/to/sp3.R1.fq.gz | path/to/sp3.R2.fq.gz | test | atac |
| sp4 | path/to/sp4.R1.fq.gz | path/to/sp4.R2.fq.gz | test | atac |
- sample_info.json: paired samples (each test sample mapped to its control sample)
{
"sp3": "sp1",
"sp4": "sp2"
}
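As a rough sketch of how the two files fit together, the snippet below parses a sample table and resolves each test sample to its control. It assumes that sample_list.txt is tab-separated with the header row shown above and that the JSON keys are test samples and the values are controls; the helper names `load_samples` and `pair_with_controls` are hypothetical and are not part of the workflow.

```python
import csv
import io
import json


def load_samples(table_text: str) -> dict:
    """Parse a tab-separated sample table into {sample name: row dict}."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row["sample"]: row for row in reader}


def pair_with_controls(samples: dict, pairs: dict) -> list:
    """Return (test, control) tuples, checking that both samples exist."""
    result = []
    for test, control in pairs.items():
        missing = [name for name in (test, control) if name not in samples]
        if missing:
            raise KeyError(f"unknown sample(s) in pairing: {missing}")
        result.append((test, control))
    return result


# Inline stand-ins for sample_list.txt and sample_info.json (assumed layout).
sample_list = (
    "sample\tfastq1\tfastq2\ttype\texperiment\n"
    "sp1\tpath/to/sp1.R1.fq.gz\tpath/to/sp1.R2.fq.gz\tcontrol\tchip\n"
    "sp3\tpath/to/sp3.R1.fq.gz\tpath/to/sp3.R2.fq.gz\ttest\tatac\n"
)
sample_info = json.loads('{"sp3": "sp1"}')

samples = load_samples(sample_list)
pairs = pair_with_controls(samples, sample_info)
print(pairs)  # [('sp3', 'sp1')]
```

Validating the pairing up front catches a misspelled sample name before any job is submitted.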
Running
# run locally
snakemake -s path/to/Snakefile --use-conda -c4
# or run on a Slurm cluster
snakemake -s path/to/Snakefile --profile path/to/config/slurm
If you use the Slurm workload manager, you can set the parameters of the sbatch command as the value of SBATCH_DEFAULTS in the settings.json file, for example:
{
"SBATCH_DEFAULTS": "--nodelist=node1",
"CLUSTER_NAME": "",
"CLUSTER_CONFIG": ""
}
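How the Slurm profile consumes SBATCH_DEFAULTS is not shown here, so the following is only a sketch of one plausible approach: the string is split like a shell command line and placed between `sbatch` and the job script. The function name `build_sbatch_command` and the script name `job.sh` are hypothetical.

```python
import json
import shlex


def build_sbatch_command(settings: dict, job_script: str) -> list:
    """Build an sbatch argument list with the default options prepended."""
    # shlex.split handles quoting the way a POSIX shell would.
    defaults = shlex.split(settings.get("SBATCH_DEFAULTS", ""))
    return ["sbatch", *defaults, job_script]


# Inline stand-in for settings.json.
settings = json.loads(
    '{"SBATCH_DEFAULTS": "--nodelist=node1", "CLUSTER_NAME": "", "CLUSTER_CONFIG": ""}'
)

cmd = build_sbatch_command(settings, "job.sh")
print(cmd)  # ['sbatch', '--nodelist=node1', 'job.sh']
```

Passing the command as a list (for example to `subprocess.run`) avoids a second round of shell quoting.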
Description
Common Tools
1. Trimming Tools
- fastp: A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance.
- trimmomatic: A flexible read trimming tool for Illumina NGS data.
- cutadapt: Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
- trim_galore: Trim Galore is a wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data.
2. Mapping Tools
- bwa: BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
- bwa-mem2: The tool bwa-mem2 is the next version of the bwa-mem algorithm in bwa. It produces alignment identical to bwa and is ~1.3-3.1x faster depending on the use-case, dataset and the running machine.
- bowtie2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
- hisat2: HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome).
- star: STAR (Spliced Transcripts Alignment to a Reference) is a splice-aware aligner for RNA-seq reads. It is implemented as standalone C++ code and is free, open-source software distributed under the GPLv3 license.
- minimap2: Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.
3. Dedup Tools
- Picard markduplicates: Picard is implemented using the HTSJDK Java library to support accessing file formats that are commonly used for high-throughput sequencing data such as SAM and VCF.
- sambamba: Sambamba is a high performance, highly parallel, robust and fast tool (and library), written in the D programming language, for working with SAM and BAM files.
4. Others
- samtools: mpileup and other tools for handling SAM, BAM, CRAM.
- bedtools: The swiss army knife for genome arithmetic.
ATAC-seq
1. Visualization
- deeptools: deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats, that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.
2. Peak Calling
- macs2: Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites.
3. Motif Discovery
- homer: HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and ChIP-Seq analysis.
4. QC (Quality Control)
The QC metrics follow the methods used in the ENCODE ATAC-seq pipeline.
Mapping quality
- Total reads
- Mapped reads
Enrichment
- Fraction of reads in peaks (FRiP): Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)
- Fraction of reads in annotated regions
- Reads count distribution of chromatin
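The FRiP definition in the list above reduces to a single division. A minimal sketch, assuming the two read counts have already been obtained (for example by counting reads that overlap the peak file); the function name `frip` and the numbers are illustrative only.

```python
def frip(reads_in_peaks: int, total_usable_reads: int) -> float:
    """Fraction of usable reads that fall inside called peak regions."""
    if total_usable_reads <= 0:
        raise ValueError("total_usable_reads must be positive")
    if not 0 <= reads_in_peaks <= total_usable_reads:
        raise ValueError("reads_in_peaks must be between 0 and total_usable_reads")
    return reads_in_peaks / total_usable_reads


# Made-up counts: 3 million of 10 million usable reads fall in peaks.
score = frip(reads_in_peaks=3_000_000, total_usable_reads=10_000_000)
print(f"FRiP = {score:.2f}")  # FRiP = 0.30
```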
Library Complexity
- ChIP-seq Standards
| PBC1 | PBC2 | Bottlenecking level | NRF | Complexity |
| :--------------: | :-----------: | :-----------------: | :-------------: | ---------- |
| < 0.5 | < 1 | Severe | < 0.5 | Concerning |
| 0.5 ≤ PBC1 < 0.8 | 1 ≤ PBC2 < 3 | Moderate | 0.5 ≤ NRF < 0.8 | Acceptable |
| 0.8 ≤ PBC1 < 0.9 | 3 ≤ PBC2 < 10 | Mild | 0.8 ≤ NRF < 0.9 | Compliant |
| ≥ 0.9 | ≥ 10 | None | > 0.9 | Ideal |
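The ChIP-seq thresholds in the table above can be applied programmatically. The sketch below classifies each metric independently using the table's cut-offs; the helper name `classify` is hypothetical, and an NRF of exactly 0.9 (which the table leaves unassigned) is assumed here to count as Ideal.

```python
from bisect import bisect_right

# Labels ordered from worst to best, matching the table rows.
BOTTLENECKING = ["Severe", "Moderate", "Mild", "None"]
COMPLEXITY = ["Concerning", "Acceptable", "Compliant", "Ideal"]

# Lower bounds of the second, third and fourth rows for each metric.
PBC1_CUTS = [0.5, 0.8, 0.9]
PBC2_CUTS = [1, 3, 10]
NRF_CUTS = [0.5, 0.8, 0.9]  # assumption: NRF == 0.9 is treated as Ideal


def classify(value: float, cuts: list, labels: list) -> str:
    """Return the label of the interval containing value (lower bounds inclusive)."""
    return labels[bisect_right(cuts, value)]


# Made-up metric values for illustration.
print(classify(0.85, PBC1_CUTS, BOTTLENECKING))  # Mild
print(classify(4.2, PBC2_CUTS, BOTTLENECKING))   # Mild
print(classify(0.92, NRF_CUTS, COMPLEXITY))      # Ideal
```

PBC1 and PBC2 are reported separately because the table does not say how to combine them when they fall into different rows.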
- ATAC-seq Standards
| PBC1 | PBC2 | Bott
