MethylSnake
A Snakemake pipeline for RRBS data analysis
Install / Use
/learn @ftabaro/MethylSnakeREADME
MethylSnake
A Snakemake pipeline for RRBS data analysis
Overview
This pipeline implements a DNA methylation data analysis workflow. Currently, it is taylored toward RRBS data analysis but it can be easity tweaked to support other data types.
Starting from raw reads in FASTQ format, it runs basic QC reporting, reads trimming, alignment, methylation extraction, filtering of incomplete conversions and differential methylation calls both at single base resolution and fixed-size tiles. Moreover, it runs basic downstream genomic annotation.
The final output of the pipeline are a set of files for alignment, detected features, annotation and reports in a variety of formats (bam, bed, pdf, html). R objects are also returned for retrospective inspection and further downstream analysis.

Quick start
- Prepare genome indexes for the Bismark aligner.
- Write a sample sheet with per-sample specifications
- Generate config file
- Start the pipeline
Tools
- QC: FastQC
- Trimming: Trim_galore (Cutadapt wrapper)
- Alignment + methlyation extraction: Bismark (using Bowtie2)
- Differential methylation detection: MethylKit (Bioconductor package)
Dependencies
snakemake5.20.1:pip install --user snakemakeor via Condapyaml20.4 (PyYAML 5.3.1):pip install --user pyamlsingularity3.6.1 has to be installed system wide, ask your adminslurm18.08.8 has to be installed system wide, ask your admin
How to write the sample sheet
The sample sheet has to be saved in csv format, quotes are not mandatory but field separator has to a semicolon and headers are mandatory.
The table is composed of two columns:
sample_name: this field should match the input file name(s), e.g. if reads for a sample are stored insample1.fq.gz, thesample_namecolumn value should besample1. In case of paired end reads,sample1_R1.fq.gzandsample1_R2.fq.gzthe value of this column should besample1.treatment: a numerical value representing the treatment. The value 0 (zero) always represent control samples.
An example with two treatments, control and triplicate replicates per treatment:
sample_name;treatment
sample1;0
sample2;0
sample3;0
sample4;1
sample5;1
sample6;1
sample7;2
sample8;2
sample9;2
In this setup, samples 1, 2 and 3 are control; samples 4, 5, 6 correspond to treatment 1 and samples 7, 8, 9 to treatment 2.
How to generate the config file
The Snakemake config file holds all tunable parameters of the pipeline. Check this section for a complete list and relative explanation. For simplicity, all paths should be absolute.
A wrapper script with an example invocation can be found in run_make_config.sh, It can be tweaked and run to generate a valid config file.
How to run the pipeline
The pipeline runs inside a Singularity container. This container should contain all the necessary tools to run the analysis. See below for additional details.
The pipeline can be started with the wrapper script run_snakemake.sh.
The scripts parses the config YAML file searching for four folder to bind mount in the Singularity container: working directory, the genome directory, the genome index directory, the directory holding annotations and the temporary directory. All these paths can be specified when running the config script.
Slurm configuration
A Slurm configuration is given in cluster-config/cluster.json. In the json files, keys correspont to Snakemake rule names. Values are in turn objects with keys corresponding to Slurm parameters and values corresponding to desired values for the parameter:
...
"trim":
{
"time" : "6:00:00",
"mem" : "8G"
},
...
The example above assigns maximum runtime of six hours and 8GB or memory to the trim rule. The "__default__" key provides default parameters applied to all rules:
...
"__default__":
{
"jobname" : "{rule}",
"log" : "slurm-%x-%j.log",
"ntasks" : 1,
"cpus" : 1,
"mem" : "4G",
"time" : "00-00:20:00",
"partition" : "normal,parallel"
}
...
By default, each job is spawned with the rule name and logs to a file named after the rule and the jobid generated by Slurm. Each job is assigned 1 task, 1 CPU and 4GB of RAM. Maximum runtime is 20 minutes. Allowed partitions are normal and parallel.
Singularity container
The pipeline is designed to run within a Singularity container. The path to the container can be configured with the config script.
A working container can be pulled from SingularityHub:
singularity pull --name methylsnake.sif library://ftabaro/default/methylsnake
Checkout the singularity folder for more info.
The container path has be specified in the command line arguments of the config script.
Output description
This pipeline generates a number of output files:
- FastQC reports for input and trimmed reads
- Fastq files for trimmed reads files with reports (Fastqc and text)
- Alignment files with report and nucleotide statistics
- Alignment files with incomplete conversions removed
- Methylation files in GpG, CHG and CHH contexts
- Bismark HTML reports
- Bismark HTML summary
- rds files with every object computed by MethylKit
- bed files of differential methylation calls
- Figures
- prelimimary coverage and methylation state histograms
- samples PCA and correlations
- number of differentially methylated features per chromosome (DMC and DMR)
- pie charts of annotation classes
- all differentially methylated features
- hyper methylated
- hypo methylated
- Tables
- distance to TSS for all, hyper and hypo methylated features (csv)
Ouput folder hierarchy
An example of output directory (all directory names can be changed from the config file)
.
├── alignments
├── bed
│ ├── dmc
│ └── dmr
├── fastqc
├── log
├── pictures
│ ├── dmc
│ └── dmr
├── reads
│ └── fastqc
├── RData
│ ├── dmc
│ └── dmr
├── reports
├── tables
│ ├── dmc
│ └── dmr
└── trimmed
Useful links
- https://www.nature.com/articles/nmeth.1828
- https://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf
- https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
- https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md
- https://www.bioinformatics.babraham.ac.uk/projects/bismark/
- https://www.bioinformatics.babraham.ac.uk/projects/bismark/Bismark_User_Guide.pdf
- https://github.com/FelixKrueger/Bismark/tree/master/Docs
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/ftabaro/rrbs-pipeline/issues
Complete list of config parameters
--config-path CONFIG_PATH
Path for the config file
--wd WD Working directory. Root directory of the project
--genome-path GENOME_PATH
Path to a fasta file with genome sequence
--bismark-index-path BISMARK_INDEX_PATH
Path to Bismark index files (only index name required:
/path/to/index/folder/hg38*)
--sample-sheet SAMPLE_SHEET
Path to csv file holding samples information
--annotation-file ANNOTATION_FILE
Path to a gzipped GTF file.
--dmr-window-size DMR_WINDOW_SIZE
Window size for tiled differential methylation
analysis
--dmr-step-size DMR_STEP_SIZE
Step size for tiled differential methylation analysis
--dmr-difference DMR_DIFFERENCE [DMR_DIFFERENCE ...]
Difference in reads coverage threshold for
differential methylation analysis
--dmr-qvalue DMR_QVALUE [DMR_QVALUE ...]
Q-value threshold for differential methylation
analysis
--min-per-group MIN_PER_GROUP
An integer denoting minimum number of samples per
replicate needed to cover a region/base
--mate1-pattern MATE1_PATTERN
Pattern to identify mate 1 in paired sequencing files e.g. R1 or _R1 or _1
--mate2-pattern MATE2_PATTERN
Pattern to identify mate 2 in paired sequencing files e.g. R2 or _R2 or _2
--fastq-extension FASTQ_EXTENSION
File extension of reads fastq files e.g. fastq.gz or fq.gz or fastq
--genome-version GENOME_VERSION
Label for genome version used in the analysis e.g. hg38 or hg19
--singularity-container SINGULARITY_CON
