m6anet

m6Anet is a Python tool that leverages the Multiple Instance Learning framework to detect m6A modifications from Nanopore Direct RNA Sequencing data.

Running m6Anet<br>
- Installation<br>
- Dataprep<br>
- Inference<br>
Release Notes<br>
- 2.1.0<br>
- 2.0.0<br>
Getting Help<br>
Citing<br>
Contributors<br>
License<br>

Running m6Anet

Installation

m6Anet requires Python version 3.7 or higher. To install the latest release with PyPI (recommended), run:

$ pip install m6anet

Alternatively, install via conda with the following command:

$ conda install m6anet

See our documentation here!

Dataprep

m6Anet dataprep requires eventalign.txt from nanopolish eventalign:

    nanopolish eventalign --reads reads.fastq --bam reads.sorted.bam --genome transcript.fa --scale-events --signal-index --summary /path/to/summary.txt  --threads 50 > /path/to/eventalign.txt

This function segments raw fast5 signals to each position within the transcriptome, allowing m6Anet to predict modification based on the segmented signals. In order to run eventalign, users will need:

- reads.fastq: fastq file generated from basecalling the raw .fast5 files
- reads.sorted.bam: sorted bam file obtained from aligning reads.fastq to the reference transcriptome file
- transcript.fa: reference transcriptome file

We have also provided a demo eventalign.txt dataset in the repository under /path/to/m6anet/m6anet/tests/data/eventalign.txt. Please see Nanopolish for more information.

After running nanopolish eventalign, we need to preprocess the segmented raw signal file using m6anet dataprep:

    m6anet dataprep --eventalign /path/to/m6anet/m6anet/tests/data/eventalign.txt \
                    --out_dir /path/to/output --n_processes 4

The output files are stored in /path/to/output:

- data.json: json file containing the features to feed into the m6Anet model for prediction
- data.log: log file containing all the transcripts that have been successfully preprocessed
- data.info: file containing indexing information of data.json for faster file access and the number of reads for each DRACH position in eventalign.txt
- eventalign.index: index file created during dataprep to allow faster access to the Nanopolish eventalign.txt during dataprep
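
Before moving on to inference, it can help to confirm that dataprep produced every file listed above. The following is a minimal sketch using only the Python standard library; the helper name `missing_dataprep_outputs` is hypothetical and not part of m6Anet, and `/path/to/output` is the placeholder directory from the command above.

```python
from pathlib import Path

# File names listed above as the outputs of `m6anet dataprep`.
EXPECTED_DATAPREP_FILES = ("data.json", "data.log", "data.info", "eventalign.index")


def missing_dataprep_outputs(out_dir):
    """Return the expected dataprep files that are absent from out_dir.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that their contents are valid.
    """
    out_path = Path(out_dir)
    return [name for name in EXPECTED_DATAPREP_FILES
            if not (out_path / name).is_file()]


if __name__ == "__main__":
    # Placeholder directory taken from the dataprep command above.
    missing = missing_dataprep_outputs("/path/to/output")
    if missing:
        print("Missing dataprep outputs:", ", ".join(missing))
    else:
        print("All dataprep outputs are present.")
```

An empty list means all four files are in place and the folder can be passed to `m6anet inference` via `--input_dir`.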

Inference

Once m6anet dataprep finishes running, we can run m6anet inference on the dataprep output:

    m6anet inference --input_dir path/to/output --out_dir path/to/output  --n_processes 4 --num_iterations 1000

m6anet inference runs the default human model, trained on the HCT116 cell line. To run the Arabidopsis-based model or the HEK293T-RNA004-based model, supply the --pretrained_model argument:

    ## For the Arabidopsis-based model

    m6anet inference --input_dir path/to/output --out_dir path/to/output  --pretrained_model arabidopsis_RNA002 --n_processes 4 --num_iterations 1000

    ## For the HEK293T-RNA004-based model

    m6anet inference --input_dir path/to/output --out_dir path/to/output  --pretrained_model HEK293T_RNA004 --n_processes 4 --num_iterations 1000

m6Anet samples 20 reads from each candidate site and averages the probability of modification across several rounds of sampling, as set by the --num_iterations parameter. The output file data.indiv_proba.csv contains the probability of modification for each read:

- transcript_id: The transcript id of the predicted position
- transcript_position: The transcript position of the predicted position
- read_index: The read identifier from nanopolish that corresponds to the actual read_id from nanopolish summary.txt
- probability_modified: The probability that a given read is modified
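
The idea of sampling 20 reads per site and averaging over `--num_iterations` rounds can be illustrated with a toy sketch. This is not m6Anet's implementation: the real model works on signal features, whereas the sketch starts from per-read probabilities. The noisy-OR pooling rule, the fallback to sampling with replacement, and the function names are all assumptions made for illustration.

```python
import random


def noisy_or(read_probs):
    """Assumed pooling rule: probability that at least one sampled read is modified."""
    unmodified = 1.0
    for p in read_probs:
        unmodified *= 1.0 - p
    return 1.0 - unmodified


def averaged_site_probability(read_probs, n_reads=20, num_iterations=1000,
                              pool=noisy_or, seed=0):
    """Average a pooled site score over repeated random draws of n_reads reads."""
    if not read_probs:
        raise ValueError("read_probs must contain at least one probability")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_iterations):
        if len(read_probs) >= n_reads:
            # Enough reads: draw n_reads distinct reads.
            sample = rng.sample(read_probs, n_reads)
        else:
            # Assumption for this sketch only: too few reads, so draw with replacement.
            sample = rng.choices(read_probs, k=n_reads)
        total += pool(sample)
    return total / num_iterations


if __name__ == "__main__":
    # Made-up per-read probabilities for one candidate site.
    reads = [0.01] * 40 + [0.8] * 5
    print(averaged_site_probability(reads, num_iterations=200))
```

Averaging over many rounds reduces the dependence of the site score on which 20 reads happen to be drawn, which is why a larger `--num_iterations` gives more stable estimates at the cost of run time.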

The output file data.site_proba.csv contains the probability of modification at each individual position for each transcript. The output file has 6 columns:

- transcript_id: The transcript id of the predicted position
- transcript_position: The transcript position of the predicted position
- n_reads: The number of reads for that particular position
- probability_modified: The probability that a given site is modified
- kmer: The 5-mer motif of a given site
- mod_ratio: The estimated percentage of reads in a given site that are modified

The mod_ratio column is calculated by thresholding probability_modified from data.indiv_proba.csv with the --read_proba_threshold parameter of the m6anet inference call. The default value is 0.033379376 for the default human model HCT116_RNA002 and 0.0032978046219796 for the arabidopsis_RNA002 model. We also recommend a threshold of 0.9 to select m6A sites from the probability_modified column in data.site_proba.csv. The total run time should not exceed 10 minutes on a normal laptop.
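
The two thresholds described above can be applied to the output files with the standard library alone. The sketch below is illustrative rather than m6Anet's own code: the function names are hypothetical, the use of `>=` for both comparisons is an assumption, and the ratio is returned as a fraction between 0 and 1.

```python
import csv
from collections import defaultdict

# Default read-level thresholds quoted above, keyed by pretrained model name.
READ_PROBA_THRESHOLDS = {
    "HCT116_RNA002": 0.033379376,
    "arabidopsis_RNA002": 0.0032978046219796,
}


def load_rows(csv_path):
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def mod_ratio_per_site(read_rows, threshold):
    """Fraction of reads per site whose probability_modified reaches threshold.

    Sites are keyed by (transcript_id, transcript_position). Using >= rather
    than > is an assumption of this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [modified reads, total reads]
    for row in read_rows:
        site = (row["transcript_id"], int(row["transcript_position"]))
        counts[site][1] += 1
        if float(row["probability_modified"]) >= threshold:
            counts[site][0] += 1
    return {site: modified / total for site, (modified, total) in counts.items()}


def select_sites(site_rows, min_probability=0.9):
    """Keep rows of data.site_proba.csv at or above the recommended 0.9 cutoff."""
    return [row for row in site_rows
            if float(row["probability_modified"]) >= min_probability]


# Example usage, assuming inference wrote its files to path/to/output:
#   reads = load_rows("path/to/output/data.indiv_proba.csv")
#   ratios = mod_ratio_per_site(reads, READ_PROBA_THRESHOLDS["HCT116_RNA002"])
#   sites = select_sites(load_rows("path/to/output/data.site_proba.csv"))
```

Keeping the threshold as an explicit argument makes it easy to re-derive the ratio under a different cutoff without rerunning inference.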

m6Anet also supports pooling over multiple replicates. To do this, simply input multiple folders containing m6anet dataprep outputs:

        m6anet inference --input_dir data_folder_1 data_folder_2 ... --out_dir output_folder --n_processes 4 --num_iterations 1000
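
When the replicate folders are collected programmatically, the same command can be assembled from Python. This is a small sketch, not an m6Anet API: `build_inference_command` and `run_inference` are hypothetical helper names, and they use only the flags shown in this README.

```python
import subprocess


def build_inference_command(input_dirs, out_dir, n_processes=4,
                            num_iterations=1000, pretrained_model=None):
    """Assemble the `m6anet inference` argument list for one or more replicates."""
    if not input_dirs:
        raise ValueError("at least one dataprep output folder is required")
    command = [
        "m6anet", "inference",
        "--input_dir", *input_dirs,  # several folders are pooled as replicates
        "--out_dir", out_dir,
        "--n_processes", str(n_processes),
        "--num_iterations", str(num_iterations),
    ]
    if pretrained_model is not None:
        command += ["--pretrained_model", pretrained_model]
    return command


def run_inference(input_dirs, out_dir, **options):
    """Run the command and raise CalledProcessError if m6anet exits with an error."""
    subprocess.run(build_inference_command(input_dirs, out_dir, **options), check=True)


if __name__ == "__main__":
    # Prints the command without executing it.
    print(" ".join(build_inference_command(
        ["data_folder_1", "data_folder_2"], "output_folder")))
```

Passing the arguments as a list avoids shell quoting problems when folder names contain spaces.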

Release Notes

Release Note 2.1.0

m6anet model trained with RNA004 chemistry (development version)

The default m6Anet model was trained with the currently available RNA002 direct RNA-Seq kit. Oxford Nanopore is currently providing access to the development version of the next kit, RNA004. To make m6A detection possible with RNA004, we now provide an m6Anet model trained on direct RNA-Seq data from the HEK293T cell line using the development version of RNA004. To call m6A on data from the RNA004 kit, the following commands can be used:

<b>Pre-processing/segmentation/dataprep.</b> <br>
- Please use f5c with the RNA004 kmer model, as described here
- The kmer model can be downloaded here
Then execute eventalign with --kmer-model pointing to the path to the downloaded k-mer model as follows:
```
f5c eventalign --rna -b reads.bam -r reads.fastq -g transcriptome.fa -o eventalign.tsv \
--kmer-model /path/to/rna004.nucleotide.5mer.model --slow5 reads.blow5 --signal-index \
--scale-events
```
The output can then be used with m6Anet dataprep (see https://m6anet.readthedocs.io/en/latest/quickstart.html)
<b>Inference</b> <br> In order to identify m6A from RNA004 data, the RNA004 model has to be specified:
```
    m6anet inference --input_dir [INPUT_DIR] --out_dir [OUT_DIR] --pretrained_model HEK293T_RNA004
```
The RNA004 model was trained on the development version of the kit and has undergone only limited evaluation on site-level prediction compared to the RNA002 model. The individual read probability accuracy for RNA004 has not been tested. Please report any feedback to us (https://github.com/GoekeLab/m6anet/discussions).

Training and evaluating the RNA004 m6anet

We trained m6Anet using an RNA004 direct RNA-Seq run of the HEK293T cell line, with m6A positions defined by m6ACE-Seq. We then evaluated the RNA004-based m6Anet performance on RNA004 data from the HEK293T and HCT116 cell lines. Here, we used the intersection of all sites identified in both the RNA002 and the RNA004 data to compare the RNA004 model (tested on RNA004 data) and the RNA002 model (tested on RNA002 data), using m6ACE-Seq as ground truth (Figures 1-2). The results suggest comparable performance between the RNA002-trained and the RNA004-trained m6Anet.

Please note that the RNA004 kit generates higher read numbers, which leads to a higher number of sites being tested.

Figure 1: ROC curve comparing the m6Anet model trained on RNA002 and evaluated on RNA002 data with the model trained on RNA004 and evaluated on RNA004. Only sites that were detected in both data sets are used in this comparison. Here, a MAPQ filter of 20 was applied.

Figure 2: ROC curve comparing the m6Anet model trained on RNA002 and evaluated on RNA002 data with the model trained on RNA004 and evaluated on RNA004. Only sites that were detected in both data sets are used in this comparison. Here, a MAPQ filter of 0 was applied to the RNA004 data, leading to a higher number of sites which are detected.

Acknowledgments

We thank Hasindu Gamaarachchi, Hiruna Samarakoon, James Ferguson, and Ira Deveson from the Garvan Institute of Medical Research in Sydney, Australia for enabling the eventalign of the RNA004 data with f5c. We thank Bing Shao Chia, Wei Leong Chew, Arnaud Perrin, Jay Shin, and Hwee Meng Low from the Genome Institute of Singapore for providing the RNA and generating the direct RNA-Seq data, and we thank Paola Florez De Sessions, Lin Yang, Adrien Leger, Lakmal Jayasinghe, Libby Snell, Etienne Raimondeau, and Oxford Nanopore
