Upgrades

EpiNano 1.2 - current version

Includes pretrained m6A models derived from sequences base-called with Guppy v3.1.5.
Pretrained models can also be used to detect other RNA modifications (tested on pseudouridine; other modifications have not been tested).
This version of EpiNano can make predictions using two different strategies: EpiNano-Error and EpiNano-SVM.
This version now includes modules for visualizing your RNA modification predictions (EpiNano_Plot).

EpiNano-Error can only be run in pairwise mode (e.g. WT and KO or KD). It combines the different types of base-calling errors that appear in a given dataset (mismatches, deletions, insertions) with alterations in per-base qualities. RNA modification predictions are based on the differences in error patterns observed between two matched samples. This strategy can be used with FASTQ data base-called with any base-calling algorithm or version.

EpiNano-SVM can be run using either pre-trained models for a given RNA modification, or by building your own models. However, we should note that using a matched control (e.g. KO or KD) is still highly recommended, due to the noisy nature of direct RNA sequencing reads, which are 'error'-rich. Moreover, in addition to SVM models trained with "raw" base-calling 'error' features (same as in EpiNano 1.0 and 1.1), in EpiNano 1.2 we now provide SVM models trained with features that capture differences between samples (i.e. difference in mismatch, rather than absolute mismatch frequency), which we find have improved performance.

EpiNano 1.1 - a slimmer version of EpiNano 1.0, written in Python 3, is available here.

This version is the one currently implemented in MasterOfPores, a workflow to analyze direct RNA sequencing data.
The major differences from EpiNano 1.0 are: (i) it is faster; (ii) it uses Python 3 instead of Python 2; (iii) it does not extract current intensity into the feature table, as this feature was not used to train the final models.
Includes pre-trained m6A models trained on data base-called with Albacore version 2.1.7.
Works with both Guppy and Albacore base-called data, but the SVM predictions will only be accurate if your data has been base-called using Albacore 2.1.7.
Regardless of the base-caller used, EpiNano can be used as a toolkit to extract per-k-mer base-calling 'errors' (mismatch, insertion, deletion, quality), which are a proxy for the RNA modifications present in a given dataset. We recommend running EpiNano in paired mode, i.e. computing the features in two datasets (WT-KO) and then predicting the RNA-modified sites as those showing the largest differences in their base-calling 'error' features.
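As a rough illustration of the paired-mode idea, the sketch below ranks reference positions by how much a single 'error' feature differs between two samples. The function name, the dictionary layout and the numbers are all hypothetical; this is not EpiNano's own code or file format, only a minimal picture of "largest difference first".

```python
# Illustrative only: the input layout and the function name are assumptions,
# not EpiNano's real file format or API.

def rank_sites_by_difference(wt, ko, top_n=3):
    """Rank positions by |WT - KO| for a single error feature.

    wt, ko: dicts mapping a reference position to a feature value
    (for example mismatch frequency) in each sample.
    Only positions present in both samples are compared.
    """
    shared = set(wt) & set(ko)
    diffs = {pos: abs(wt[pos] - ko[pos]) for pos in shared}
    # Largest difference first; ties are broken by position for stable output.
    ranked = sorted(diffs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical mismatch frequencies per reference position.
wt_mismatch = {101: 0.05, 102: 0.42, 103: 0.08, 104: 0.30}
ko_mismatch = {101: 0.04, 102: 0.06, 103: 0.07, 104: 0.28}

top_sites = rank_sites_by_difference(wt_mismatch, ko_mismatch, top_n=2)
print(top_sites)
```

In this toy input, position 102 stands out because its mismatch frequency drops sharply in the KO sample, which is the kind of signal the paired mode is meant to expose.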

EpiNano 1.0 - original code used in Liu, Begik et al., Nature Comm 2019, which is available here.

Includes pre-trained m6A models trained on data base-called with Albacore version 2.1.7.
It extracts both base-calling 'errors' (mismatch, insertion, deletion, per-base quality) and current intensity values.
Current intensity information is extracted from the base-called Albacore FAST5 files.
Does not have models trained with Guppy base-called datasets.

About EpiNano

EpiNano is a tool to identify RNA modifications present in direct RNA sequencing reads.

EpiNano extracts a set of 'features' from direct RNA sequencing reads, which are in turn used to predict whether the observed 'errors' are caused by the presence of an RNA modification. Features directly extracted and derived include:

current intensity and duration
read quality
base quality scores
mismatch frequency
deletion frequency
insertion frequency
sumErr

These features can be organized in per-base and per-k-mer formats.
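To make the difference between the two layouts concrete, here is a minimal sketch that turns per-base records into overlapping k-mer windows. The tuple layout and the function name are assumptions made for this example; it does not reproduce EpiNano's own scripts or output columns.

```python
# Illustrative only: record layout and function name are assumptions.

def per_site_to_kmer(sites, k=5):
    """Group consecutive per-site records into overlapping k-mer windows.

    sites: list of (position, base, feature_value) tuples, sorted by
    position, with one record per reference position.
    Windows that span a gap in the reference positions are skipped.
    """
    windows = []
    for start in range(len(sites) - k + 1):
        chunk = sites[start:start + k]
        positions = [pos for pos, _, _ in chunk]
        # Require k contiguous reference positions.
        if positions[-1] - positions[0] != k - 1:
            continue
        windows.append({
            "kmer": "".join(base for _, base, _ in chunk),
            "positions": positions,
            "values": [value for _, _, value in chunk],
        })
    return windows


# Hypothetical per-base mismatch frequencies.
per_base = [
    (10, "G", 0.02), (11, "G", 0.03), (12, "A", 0.40),
    (13, "C", 0.05), (14, "T", 0.01), (15, "A", 0.02),
]
kmer_windows = per_site_to_kmer(per_base, k=5)
print([w["kmer"] for w in kmer_windows])  # ['GGACT', 'GACTA']
```

Each window keeps the per-base values of all k positions, so a modified base in the middle of a k-mer can be considered together with its neighbours.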

Modes of Running EpiNano

In EpiNano 1.2, we introduce delta-features, which capture the difference between modified and unmodified sites, and sum_err, a metric computed by combining different types of errors and even base quality scores. These new metrics are our attempt to work around the fact that different types of RNA base modifications tend to introduce different types of sequencing errors.
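The sketch below shows one possible reading of these two metrics. It assumes that the combined error is a plain sum of mismatch, insertion and deletion frequencies (base quality is left out here), and that a delta-feature is a simple per-feature subtraction between two samples. The exact formulas used by EpiNano may differ, and all names and values are hypothetical.

```python
# Illustrative only: the summation rule and the feature names are assumptions,
# not the definitions used inside EpiNano.

ERROR_FEATURES = ("mismatch", "insertion", "deletion")


def summed_error(site):
    """Assumed definition: plain sum of the three error frequencies."""
    return sum(site[name] for name in ERROR_FEATURES)


def delta_features(modified, unmodified):
    """Per-feature difference between two samples at the same site."""
    delta = {
        f"delta_{name}": modified[name] - unmodified[name]
        for name in ERROR_FEATURES
    }
    delta["delta_sum_err"] = summed_error(modified) - summed_error(unmodified)
    return delta


# Hypothetical error frequencies for one site in two matched samples.
wt_site = {"mismatch": 0.30, "insertion": 0.05, "deletion": 0.15}
ko_site = {"mismatch": 0.05, "insertion": 0.05, "deletion": 0.05}

features = delta_features(wt_site, ko_site)
print(features)
```

A combined metric of this kind is useful because a modification that mainly causes deletions and one that mainly causes mismatches can both raise the summed value, even though neither single feature would flag both.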

EpiNano version 1.2 can predict RNA-modified sites in two different ways:

EpiNano-Error

Base-calling algorithm independent.
Applicable to any given RNA modification that affects the base-calling features.

EpiNano-SVM

Base-calling algorithm dependent (data must be base-called with Guppy 3.1.5).
Can use both base-calling error features and current signal features.
It can be used to train your own models as well as be applied to datasets for which a pre-trained model is available (m6A).
The available m6A SVM models have been trained and tested on a set of 'unmodified' and 'modified' sequences containing either A or m6A at known sites.
We also offer SVM models trained with delta features, i.e., features capturing the difference between modified and unmodified samples. These models can be applied to detect other RNA modifications apart from m6A (tested on pseudouridine).

Considerations when using EpiNano

EpiNano relies on the use of base-calling 'errors' to detect RNA modifications; however, direct RNA sequencing base-calling produces a significant amount of 'errors' in unmodified sequences. Therefore, to obtain higher-confidence m6A-modified sites, we recommend sequencing both modified and unmodified datasets (e.g. treated with demethylase, or comparing a wild-type vs knockout/knockdown). Coupling a "control" (KD/KO) is not required in earlier EpiNano versions, but is highly recommended.
You can use EpiNano as a feature extractor to predict RNA modifications based on alterations in base-called features (i.e., EpiNano-Error, as used here), as well as use the pre-trained SVMs to detect m6A RNA modifications (i.e., EpiNano-SVM, as used here).
EpiNano does not have per-read resolution. We are currently working on an improved version of EpiNano to obtain predictions at per-read level.
The performance of the algorithm depends on the stoichiometry of the site (i.e. sites with very low stoichiometry will often be missed by the algorithm).
Pre-trained models to predict m6A sites are included in each release. Please note that if you use pre-trained m6A models, your data should be base-called with the SAME base-calling algorithm and version (i.e. Guppy 3.1.5 if you use EpiNano 1.2, and Albacore 2.1.7 if you use EpiNano 1.0 or 1.1).
If you are using a different base-calling algorithm or version, we recommend using EpiNano-Error rather than EpiNano-SVM.
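The base-caller matching rule above can be written down as a small lookup. The table only restates the versions listed in this README; the function name and the returned strings are hypothetical conveniences and are not part of EpiNano.

```python
# Illustrative only: a helper that encodes the recommendation above.
# The function is hypothetical and not shipped with EpiNano.

PRETRAINED_BASECALLERS = {
    "1.0": ("albacore", "2.1.7"),
    "1.1": ("albacore", "2.1.7"),
    "1.2": ("guppy", "3.1.5"),
}


def suggest_strategy(epinano_version, basecaller, basecaller_version):
    """Suggest EpiNano-SVM only when the base-caller matches the models."""
    expected = PRETRAINED_BASECALLERS.get(epinano_version)
    if expected == (basecaller.lower(), basecaller_version):
        return "EpiNano-SVM"
    # Any other base-caller or version: fall back to the pairwise approach.
    return "EpiNano-Error"


print(suggest_strategy("1.2", "Guppy", "3.1.5"))
print(suggest_strategy("1.2", "Guppy", "4.0.11"))
```

Even when the helper suggests EpiNano-SVM, a matched control sample is still recommended, as noted above.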

Installation

To download the latest version of EpiNano, clone the repo:

git clone git@github.com:novoalab/EpiNano.git

You can also find the latest tagged releases HERE

The easiest way to install the program is to use either a Conda environment or a Docker container.

For the former we suggest using micromamba.

micromamba create -f environment.yaml -y
micromamba activate epinano

A Docker recipe is provided in the dockerfile/ directory, along with build instructions.

Running EpiNano

a) Running EpiNano 1.2

To train models and assess prediction accuracies, please refer to commands in test_data/train_models/train_test.sh.

To make predictions with pre-trained models, please refer to commands in /test_data/make_predictions.

We will also update the wiki with specific examples of how to use the different EpiNano components.

Below is a brief introduction to each program's usage.

STEP 1. Extract base-calling error features

Epinano_Variants outputs a feature table, sample.per.site.var.csv, which contains base-calling 'error' information for each reference position. Please note that the feature table sample.per_site.5mer.csv, which EpiNano 1.1 generated by default (and which contains the same base-called features organized in 5-mer windows), is no longer generated by default. If you want to generate this file, please use the script Slide_Variants.py.

IMPORTANT: the code requires both the BAM file and the reference file to be indexed (.bai and .fai respectively).
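A quick pre-flight check can catch a missing index before a long run starts. The sketch below assumes the usual naming conventions (samtools index writes sample.bam.bai, some tools write sample.bai, and samtools faidx writes reference.fasta.fai); the function name and the file names are hypothetical, and the check is not part of EpiNano.

```python
import os

# Illustrative only: a hypothetical helper, not part of EpiNano.


def missing_indexes(bam_path, reference_path):
    """Return the index files that could not be found next to the inputs."""
    missing = []
    # Accept both <file>.bam.bai and <file>.bai as a BAM index.
    bam_candidates = [
        bam_path + ".bai",
        os.path.splitext(bam_path)[0] + ".bai",
    ]
    if not any(os.path.exists(path) for path in bam_candidates):
        missing.append(bam_candidates[0])
    # The FASTA index is expected at <reference>.fai.
    fai_path = reference_path + ".fai"
    if not os.path.exists(fai_path):
        missing.append(fai_path)
    return missing


if __name__ == "__main__":
    for path in missing_indexes("sample.bam", "reference.fasta"):
        print("missing index:", path)
```

If anything is reported as missing, the indexes can be created with samtools index (for the BAM file) and samtools faidx (for the reference) before running Epinano_Variants.py.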

python Epinano_Variants.py -h
usage: Epinano_Variants.py [-h] -b BAM -r REFERENCE [-c CPUS] [-o OUTPUT]


optional arguments:
  -h, --help            show this help message and exit
  -c CPUS, --cpus CPUS  Number of CPUs to use [4]
  -o OUTPUT, --output OUTPUT
                        Output directory. Default working directory.

Required Arguments:
  -r REFERENCE, --reference REFERENCE
                        reference file indexed with samtools faidx and with
                        sequence dictionary created using
