MethylSnake

A Snakemake pipeline for RRBS data analysis

Overview

This pipeline implements a DNA methylation data analysis workflow. Currently, it is taylored toward RRBS data analysis but it can be easity tweaked to support other data types.

Starting from raw reads in FASTQ format, it runs basic QC reporting, reads trimming, alignment, methylation extraction, filtering of incomplete conversions and differential methylation calls both at single base resolution and fixed-size tiles. Moreover, it runs basic downstream genomic annotation.

The final output of the pipeline are a set of files for alignment, detected features, annotation and reports in a variety of formats (bam, bed, pdf, html). R objects are also returned for retrospective inspection and further downstream analysis.

Quick start

Prepare genome indexes for the Bismark aligner.
Write a sample sheet with per-sample specifications
Generate config file
Start the pipeline

Tools

QC: FastQC
Trimming: Trim_galore (Cutadapt wrapper)
Alignment + methlyation extraction: Bismark (using Bowtie2)
Differential methylation detection: MethylKit (Bioconductor package)

Dependencies

snakemake 5.20.1: pip install --user snakemake or via Conda
pyaml 20.4 (PyYAML 5.3.1): pip install --user pyaml
singularity 3.6.1 has to be installed system wide, ask your admin
slurm 18.08.8 has to be installed system wide, ask your admin

How to write the sample sheet

The sample sheet has to be saved in csv format, quotes are not mandatory but field separator has to a semicolon and headers are mandatory.

The table is composed of two columns:

sample_name: this field should match the input file name(s), e.g. if reads for a sample are stored in sample1.fq.gz, the sample_name column value should be sample1. In case of paired end reads, sample1_R1.fq.gz and sample1_R2.fq.gz the value of this column should be sample1.
treatment: a numerical value representing the treatment. The value 0 (zero) always represent control samples.

An example with two treatments, control and triplicate replicates per treatment:

sample_name;treatment
sample1;0
sample2;0
sample3;0
sample4;1
sample5;1
sample6;1
sample7;2
sample8;2
sample9;2

In this setup, samples 1, 2 and 3 are control; samples 4, 5, 6 correspond to treatment 1 and samples 7, 8, 9 to treatment 2.

How to generate the config file

The Snakemake config file holds all tunable parameters of the pipeline. Check this section for a complete list and relative explanation. For simplicity, all paths should be absolute.

A wrapper script with an example invocation can be found in run_make_config.sh, It can be tweaked and run to generate a valid config file.

How to run the pipeline

The pipeline runs inside a Singularity container. This container should contain all the necessary tools to run the analysis. See below for additional details.

The pipeline can be started with the wrapper script run_snakemake.sh.

The scripts parses the config YAML file searching for four folder to bind mount in the Singularity container: working directory, the genome directory, the genome index directory, the directory holding annotations and the temporary directory. All these paths can be specified when running the config script.

Slurm configuration

A Slurm configuration is given in cluster-config/cluster.json. In the json files, keys correspont to Snakemake rule names. Values are in turn objects with keys corresponding to Slurm parameters and values corresponding to desired values for the parameter:

...
"trim":
    {
        "time"      : "6:00:00",
        "mem"       : "8G"
    },
...

The example above assigns maximum runtime of six hours and 8GB or memory to the trim rule. The "__default__" key provides default parameters applied to all rules:

...
"__default__":
    {
        "jobname"   : "{rule}",
        "log"       : "slurm-%x-%j.log",
        "ntasks"    : 1,
        "cpus"      : 1,
        "mem"       : "4G",
        "time"      : "00-00:20:00",
        "partition" : "normal,parallel"
    }
...

By default, each job is spawned with the rule name and logs to a file named after the rule and the jobid generated by Slurm. Each job is assigned 1 task, 1 CPU and 4GB of RAM. Maximum runtime is 20 minutes. Allowed partitions are normal and parallel.

Singularity container

The pipeline is designed to run within a Singularity container. The path to the container can be configured with the config script.

A working container can be pulled from SingularityHub:

singularity pull --name methylsnake.sif library://ftabaro/default/methylsnake

Checkout the singularity folder for more info.

The container path has be specified in the command line arguments of the config script.

Output description

This pipeline generates a number of output files:

FastQC reports for input and trimmed reads
Fastq files for trimmed reads files with reports (Fastqc and text)
Alignment files with report and nucleotide statistics
Alignment files with incomplete conversions removed
Methylation files in GpG, CHG and CHH contexts
Bismark HTML reports
Bismark HTML summary
rds files with every object computed by MethylKit
bed files of differential methylation calls
Figures
- prelimimary coverage and methylation state histograms
- samples PCA and correlations
- number of differentially methylated features per chromosome (DMC and DMR)
- pie charts of annotation classes
  - all differentially methylated features
  - hyper methylated
  - hypo methylated
Tables
- distance to TSS for all, hyper and hypo methylated features (csv)

Ouput folder hierarchy

An example of output directory (all directory names can be changed from the config file)

.
├── alignments
├── bed
│   ├── dmc
│   └── dmr
├── fastqc
├── log
├── pictures
│   ├── dmc
│   └── dmr
├── reads
│   └── fastqc
├── RData
│   ├── dmc
│   └── dmr
├── reports
├── tables
│   ├── dmc
│   └── dmr
└── trimmed

Useful links

https://www.nature.com/articles/nmeth.1828
https://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md
https://www.bioinformatics.babraham.ac.uk/projects/bismark/
https://www.bioinformatics.babraham.ac.uk/projects/bismark/Bismark_User_Guide.pdf
https://github.com/FelixKrueger/Bismark/tree/master/Docs

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/ftabaro/rrbs-pipeline/issues

Complete list of config parameters

  --config-path CONFIG_PATH
                        Path for the config file
                        
  --wd WD               Working directory. Root directory of the project
  
  --genome-path GENOME_PATH
                        Path to a fasta file with genome sequence
                        
  --bismark-index-path BISMARK_INDEX_PATH
                        Path to Bismark index files (only index name required:
                        /path/to/index/folder/hg38*)
                        
  --sample-sheet SAMPLE_SHEET
                        Path to csv file holding samples information
                        
  --annotation-file ANNOTATION_FILE
                        Path to a gzipped GTF file.
                        
  --dmr-window-size DMR_WINDOW_SIZE
                        Window size for tiled differential methylation
                        analysis
                        
  --dmr-step-size DMR_STEP_SIZE
                        Step size for tiled differential methylation analysis
                        
  --dmr-difference DMR_DIFFERENCE [DMR_DIFFERENCE ...]
                        Difference in reads coverage threshold for
                        differential methylation analysis
                        
  --dmr-qvalue DMR_QVALUE [DMR_QVALUE ...]
                        Q-value threshold for differential methylation
                        analysis
                        
  --min-per-group MIN_PER_GROUP
                        An integer denoting minimum number of samples per
                        replicate needed to cover a region/base
                        
  --mate1-pattern MATE1_PATTERN
                        Pattern to identify mate 1 in paired sequencing files e.g. R1 or _R1 or _1
                        
  --mate2-pattern MATE2_PATTERN
                        Pattern to identify mate 2 in paired sequencing files e.g. R2 or _R2 or _2
                        
  --fastq-extension FASTQ_EXTENSION
                        File extension of reads fastq files e.g. fastq.gz or fq.gz or fastq
                        
  --genome-version GENOME_VERSION
                        Label for genome version used in the analysis e.g. hg38 or hg19
                        
  --singularity-container SINGULARITY_CON

MethylSnake

Install / Use

README

MethylSnake

Overview

Quick start

Tools

Dependencies

How to write the sample sheet

How to generate the config file

How to run the pipeline

Slurm configuration

Singularity container

Output description

Ouput folder hierarchy

Useful links

Contributing

Complete list of config parameters