
METEORE: MEthylation deTEction with nanopORE sequencing

About METEORE

METEORE provides Snakemake pipelines for various tools that detect DNA methylation from Nanopore sequencing reads. It also provides new predictive models (random forest and multiple linear regression) that combine the outputs of those tools to produce a consensus prediction with higher accuracy than the individual tools.

NEW UPDATES (Mar-2021)

METEORE can now produce two per-site result files in an augmented BED format for each tool except for DeepMod (which will be updated very soon). The first output file contains the following fields:

Reference chromosome
Start position in chromosome
End position in chromosome
Read coverage
Methylation (i.e. methylation frequency)
Strandedness

In the second output file, we combine the methylation predictions from both strands on CpG sites by averaging the methylation frequencies and adding up the coverage. This output file contains the following fields:

Reference chromosome
Start position in chromosome
End position in chromosome
Read coverage
Methylation (i.e. methylation frequency)
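
The strand-combining step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own script: it assumes that the '-' strand call of a CpG sits one base downstream of its '+' strand partner, that the frequencies are averaged without weighting by coverage, and that the merged site is reported at the '+' strand coordinates. The function name `combine_strands` and the toy values are hypothetical.

```python
from collections import defaultdict


def combine_strands(rows):
    """Merge per-strand calls into per-CpG calls.

    rows: iterable of (chrom, start, end, coverage, methylation, strand).
    Returns a sorted list of (chrom, start, end, coverage, methylation).
    """
    groups = defaultdict(list)
    for chrom, start, _end, coverage, methylation, strand in rows:
        # Assumption: shift '-' strand calls back by one base so both
        # strands of the same CpG share one key.
        cpg_start = start if strand == "+" else start - 1
        groups[(chrom, cpg_start)].append((coverage, methylation))

    combined = []
    for (chrom, cpg_start), calls in sorted(groups.items()):
        total_coverage = sum(cov for cov, _ in calls)            # add up the coverage
        mean_methylation = sum(m for _, m in calls) / len(calls)  # unweighted average
        # Assumption: report the merged site at the '+' strand coordinates.
        combined.append((chrom, cpg_start, cpg_start + 1, total_coverage, mean_methylation))
    return combined


# Toy values for illustration only
toy_rows = [
    ("NC_000913.3", 100, 101, 7, 1.0, "+"),
    ("NC_000913.3", 101, 102, 5, 0.5, "-"),
]
print(combine_strands(toy_rows))
# -> [('NC_000913.3', 100, 101, 12, 0.75)]
```

A site that is covered on only one strand passes through with its own coverage and frequency.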

Pipeline
Installation
Tutorial on an example dataset
Combined model (random forest) usage
Combined model (multiple linear regression) usage

Pipeline

Fig 1. Pipeline for CpG methylation detection from nanopore sequencing data. All tools take the input fast5 files, detect modified bases (5-methylcytosine at CG dinucleotides in this case) in reads and predict per-site methylation frequency at genome level.

Installation

We recommend installing software dependencies via Conda on Linux. Miniconda installation instructions for Linux are available in the Conda documentation. Make sure you install the Miniconda Python 3 distribution.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Accept the license terms during installation.

For performance and compatibility reasons, install Mamba via Conda and use it later to install Snakemake for each pipeline. See the Snakemake documentation for more details.

conda install -c conda-forge mamba

Once you have installed Conda and Mamba, you can download the Snakemake pipelines and the example datasets.

git clone https://github.com/comprna/METEORE.git
cd METEORE/

Tutorial on an example dataset

We provide an example dataset data/example along with a reference genome data/ecoli_k12_mg1655.fasta for you to try the pipelines with. The example contains 50 single-read fast5 files from the positive control dataset for E. coli generated by Simpson et al. (2017).

Run the pipelines with your own data:

You can run the pipeline with your own dataset by replacing the example folder in the data directory with your own folder containing the fast5 files. The fast5 folder name is used to specify the target output file in the Snakemake pipeline, so replace example in the output file name with your fast5 folder name in the commands below.
Place the reference genome file in .fasta format in the folder named data, and redefine the reference genome file within the Snakefile (Nanopolish, Deepsignal1, Tombo, Guppy) by replacing ecoli_k12_mg1655.fasta with the name of your reference genome file.

Nanopolish Snakemake pipeline

Create and activate the Conda environment

To install the packages for the Nanopolish pipeline, run one of the following:

Installing packages via Mamba

# Create an environment with Snakemake installed
mamba create -c conda-forge -c bioconda -n meteore_nanopolish_env snakemake
# Activate
conda activate meteore_nanopolish_env
# Install all required conda packages with mamba
mamba install -c bioconda nanopolish minimap2 samtools r-data.table r-dplyr r-plyr

Installing packages using the .yml file

mamba env create -f nanopolish.yml
conda activate meteore_nanopolish_env

Run Snakemake

Before executing the workflow below, make sure you have the basecalled fastq file in the METEORE directory. Nanopolish needs to link the read ids from the fastq file with their signal-level data in the fast5 files. An example fastq file example.fastq is provided.

A Snakefile named Nanopolish contains all rules for the Snakemake workflow. Run Snakemake to create the output files:

snakemake -s Nanopolish nanopolish_results/example_nanopolish-freq-perCG.tsv --cores all
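
Because the target file name is derived from the fast5 folder name, the command for another dataset can be built programmatically. The sketch below only assembles the command shown above; the helper name `nanopolish_command` and the folder name `my_run` are hypothetical.

```python
import shlex


def nanopolish_command(fast5_folder, cores="all"):
    """Build the Snakemake call for a fast5 folder located in data/."""
    # The target follows the pattern of the example command:
    # nanopolish_results/<folder>_nanopolish-freq-perCG.tsv
    target = f"nanopolish_results/{fast5_folder}_nanopolish-freq-perCG.tsv"
    return ["snakemake", "-s", "Nanopolish", target, "--cores", str(cores)]


print(shlex.join(nanopolish_command("example")))
# -> snakemake -s Nanopolish nanopolish_results/example_nanopolish-freq-perCG.tsv --cores all

# Hypothetical folder data/my_run
print(shlex.join(nanopolish_command("my_run")))
```

The returned list can be passed directly to `subprocess.run` from the METEORE directory if you prefer to launch the workflow from Python.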

This will produce four index files (example.fastq.index, example.fastq.index.fai, example.fastq.index.gzi and example.fastq.index.readdb) and the nanopolish_results output directory containing all output files.
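
If a run fails partway, it can help to check which of the index files exist. The following is a small sketch that uses only the file names listed above; the helper name `missing_index_files` is hypothetical.

```python
from pathlib import Path

# Suffixes of the four index files listed above
INDEX_SUFFIXES = (".index", ".index.fai", ".index.gzi", ".index.readdb")


def missing_index_files(fastq_name, directory="."):
    """Return the expected index file names that are absent from directory."""
    base = Path(directory)
    return [
        fastq_name + suffix
        for suffix in INDEX_SUFFIXES
        if not (base / (fastq_name + suffix)).exists()
    ]


# Run from the METEORE directory; an empty list means all four files are present.
print(missing_index_files("example.fastq"))
```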

example_nanopolish-log.tsv is the raw output after running nanopolish call-methylation.
example_nanopolish-log-perCG.tsv contains per-read per-site data, which splits up the CpG group containing multiple nearby sites into its constituent CpG sites.

Chr           Pos         Strand    Log.like.ratio  Read_ID
NC_000913.3   3499494     +         -0.62           094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3   3499526     +         -0.33           094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3   3499546     +         -0.12           094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3   3499563     +         8.26            094dfe6b-23ed-4195-8876-805a399fade5
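
To show how per-read rows like these relate to a per-site methylation frequency, here is a minimal sketch. It is not the pipeline's own script: it assumes a read counts as methylated when its log-likelihood ratio is positive and that calls with an absolute ratio below a cutoff are skipped. The cutoff value and the function name `per_site_frequency` are hypothetical; the rule actually applied by the pipeline is not stated here.

```python
from collections import defaultdict


def per_site_frequency(calls, threshold=2.0):
    """Summarise per-read calls into {site: (coverage, methylation frequency)}.

    calls: iterable of (chrom, pos, strand, log_lik_ratio, read_id).
    threshold: hypothetical cutoff; calls with |ratio| below it are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [methylated reads, counted reads]
    for chrom, pos, strand, llr, _read_id in calls:
        if abs(llr) < threshold:
            continue  # ambiguous call, not counted towards coverage
        site = (chrom, pos, strand)
        counts[site][1] += 1
        if llr > 0:  # positive ratio supports methylation
            counts[site][0] += 1
    return {site: (total, meth / total) for site, (meth, total) in counts.items()}


read_id = "094dfe6b-23ed-4195-8876-805a399fade5"
example_calls = [
    ("NC_000913.3", 3499494, "+", -0.62, read_id),
    ("NC_000913.3", 3499526, "+", -0.33, read_id),
    ("NC_000913.3", 3499546, "+", -0.12, read_id),
    ("NC_000913.3", 3499563, "+", 8.26, read_id),
]
print(per_site_frequency(example_calls))
# -> {('NC_000913.3', 3499563, '+'): (1, 1.0)}
```

With this cutoff only the last row is counted; setting `threshold=0` would count every row instead.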

example_nanopolish-freq-perCG.tsv stores the final per-site data in an augmented BED format where the columns represent:
1. Reference chromosome
2. Start position in chromosome
3. End position in chromosome
4. Read coverage
5. Methylation (i.e. methylation frequency)
6. Strandedness

Chr             Pos_start   Pos_end   Coverage    Methylation   Strand
NC_000913.3     3503839     3503840   7           1             +
NC_000913.3     3503840     3503841   7           1             -
NC_000913.3     3503849     3503850   7           1             +
NC_000913.3     3503850     3503851   7           1             -
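
For downstream analysis, rows in this format can be loaded into typed records. The sketch below assumes the file is whitespace-separated and starts with the header line shown above; the names `SiteCall` and `parse_per_site` are hypothetical.

```python
from typing import NamedTuple


class SiteCall(NamedTuple):
    chrom: str
    start: int
    end: int
    coverage: int
    methylation: float
    strand: str


def parse_per_site(lines):
    """Parse per-site rows, skipping blank lines and the header line."""
    calls = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "Chr":  # assumption: header starts with "Chr"
            continue
        chrom, start, end, coverage, methylation, strand = fields
        calls.append(
            SiteCall(chrom, int(start), int(end), int(coverage), float(methylation), strand)
        )
    return calls


sample = """\
Chr             Pos_start   Pos_end   Coverage    Methylation   Strand
NC_000913.3     3503839     3503840   7           1             +
NC_000913.3     3503840     3503841   7           1             -
"""
for call in parse_per_site(sample.splitlines()):
    print(call.chrom, call.start, call.coverage, call.methylation, call.strand)
```

To read a results file, pass an open file handle instead of `sample.splitlines()`.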

example_nanopolish-freq-perCG-combStrand.tsv also stores the final per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand by averaging the methylation frequencies and adding up the coverage for a CpG site. Each column represents:
1. Reference chromosome
2. Start position in chromosome
3. End position in chromosome
4. Read coverage
5. Methylation (i.e. methylation frequency)

Chr             Pos_start   Pos_end   Coverage
