# DeepRM
Deep learning for RNA Modification

## Table of Contents
- ✨ Introduction
- 🎯 Key Features
- 📦 Installation
- 🚀 Quickstart
- 💻 Usage
- 🔧 Troubleshooting
- 📐 Architecture
- 📝 Citation
- 📝 License
- 🏛️ Contributors
- 🏛️ Acknowledgements
## ✨ Introduction
DeepRM is a deep learning-based framework for RNA modification detection using Nanopore direct RNA sequencing. This repository contains the source code for training and running DeepRM.
## 🎯 Key Features
- High accuracy: Achieves state-of-the-art accuracy in RNA modification detection and stoichiometry measurement.
- Single-molecule resolution: Provides single-molecule level predictions for RNA modifications.
- End-to-end pipeline: Easy-to-use pipeline from raw reads to site-level predictions.
- Customizable: Supports training of custom models.
## 📦 Installation
### Prerequisites
- Linux x86_64
- Python 3.9+
- PyTorch 2.3+ (https://pytorch.org/get-started/locally/)
- If you want to use a GPU for inference or training, please ensure that you have installed a PyTorch build with CUDA support.
### Optional
- Torchmetrics 0.9.0+ (only for training):
  ```shell
  python -m pip install torchmetrics
  ```
- Dorado 0.7.3+ (optional, for basecalling): https://github.com/nanoporetech/dorado
- SAMtools 1.16.1+ (optional, for BAM file processing): http://www.htslib.org/
- Python package requirements are listed in `requirements.txt` and will be installed automatically when you install DeepRM.
### Installation options
- Estimated time: ~10 minutes
- Install via pip (recommended):
  ```shell
  python -m pip install deeprm
  ```
- Install from source (GitHub):
  ```shell
  git clone https://github.com/vadanamu/deeprm
  cd deeprm
  python -m pip install -U pip
  python -m pip install -e .
  ```
- If installation fails on an old OS (e.g., CentOS 7) due to NumPy, you can try installing an older version of NumPy first:
  ```shell
  python -m pip install "numpy<2.3.0,>2.0.0"
  python -m pip install -e .
  ```
### Verify Installation
```shell
deeprm --version
deeprm check
```
- If everything is installed correctly, you should see the version of DeepRM and a message indicating that the installation is successful.
- If you encounter CUDA or torch-related errors, make sure you have installed a PyTorch build with CUDA support.
### Build from Source
- DeepRM can use a C++-based preprocessing tool for acceleration, which is provided both as a precompiled binary and as source code.
- Depending on your system configuration, you may need to build the C++ preprocessing tool from source; it is located in the `cpp` directory of the DeepRM repository.
- Please refer to the `cpp/README.md` page for detailed build instructions.
## 🚀 Quickstart
- For demonstration purposes, you can use the example POD5 and BAM files provided in the `examples` directory of the repository.
- You can also use your own POD5 and BAM files.
### RNA Modification Detection
- Estimated time: ~1 hour
#### 1️⃣ Prepare data
```shell
deeprm call prep -p inference_example.pod5 -b inference_example.bam -o <prep_dir>
```
- (Alternative) To supply your own POD5 file:
  ```shell
  dorado basecaller --reference <ref_fasta> --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 <pod5_dir> \
    | tee >(samtools sort -@ <threads> -O BAM -o <bam_path> - && samtools index -@ <threads> <bam_path>) \
    | deeprm call prep -p <pod5_dir> -b - -o <prep_dir>
  ```
- If Dorado fails due to "illegal memory access", try adding the `--chunksize <chunk_size>` option (e.g., `chunk_size=12000`).
#### 2️⃣ Run inference
```shell
deeprm call run -b inference_example.bam -i <prep_dir> -o <pred_dir> -s 1000
```
- Adjust the `-s` (batch size) parameter according to your GPU memory capacity (default: 10000).
- Expected output files:
  - Site-level detection result file (`.bed`)
  - Molecule-level detection result file (`.npz`)
### Model Training
- Estimated time: ~1 hour
#### 1️⃣ Prepare unmodified & modified training data
```shell
deeprm train prep -p training_a_example.pod5 -b training_a_example.bam -o <prep_dir>/a
deeprm train prep -p training_m6a_example.pod5 -b training_m6a_example.bam -o <prep_dir>/m6a
```
#### 2️⃣ Compile training data
```shell
deeprm train compile -n <prep_dir>/a/data -p <prep_dir>/m6a/data -o <prep_dir>/compiled
```
#### 3️⃣ Run training
```shell
deeprm train run -d <prep_dir>/compiled -o <output_dir> --batch 64
```
- Adjust the `--batch` parameter according to your GPU memory capacity (default: 1024).
- Expected output file:
  - Trained DeepRM model file (`.pt`)
## 💻 Usage
### Inference usage

#### Prepare Data
##### Accelerated preparation (recommended, default)
- This method uses a precompiled C++ binary to accelerate the preprocessing step.
  ```shell
  dorado basecaller --reference <ref_fasta> --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 <pod5_dir> \
    | tee >(samtools sort -@ <threads> -O BAM -o <bam_path> - && samtools index -@ <threads> <bam_path>) \
    | deeprm call prep -p <pod5_dir> -b - -o <prep_dir>
  ```
- If Dorado fails due to "illegal memory access", try adding the `--chunksize <chunk_size>` option (e.g., `chunk_size=12000`).
- If the precompiled binary does not work on your system, please refer to the `cpp/README.md` page for detailed build instructions.
- Adjust the `-g` (`--filter-flag`) parameter according to your needs. If using a genomic reference, you may want to use `-g 260`.
##### Sequential preparation
- This method is slower than the accelerated preparation method, but is supported for cases such as:
  - The POD5 files are already basecalled to BAM files with move tags.
  - You want to run basecalling and preprocessing on separate machines.
- Basecall the POD5 files to BAM files with move tags (skip if already done):
  ```shell
  dorado basecaller --reference <reference_path> --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 <pod5_dir> > <raw_bam_path>
  ```
- If Dorado fails due to "illegal memory access", try adding the `--chunksize <chunk_size>` option (e.g., `chunk_size=12000`).
- Filter, sort, and index the BAM files:
  ```shell
  samtools view -@ <threads> -bh -F 276 -o <bam_path> <raw_bam_path>
  samtools sort -@ <threads> -o <bam_path> <bam_path>
  samtools index -@ <threads> <bam_path>
  ```
- Adjust the `-F` parameter according to your needs. If using a genomic reference, you may want to use `-F 260`.
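For reference, the `-F 276` and `-F 260` values above are bitwise combinations of standard SAM FLAG bits, which you can verify in Python (the bit names below come from the SAM specification, not from DeepRM):

```python
# Standard SAM FLAG bits (per the SAM specification).
UNMAPPED = 4     # segment unmapped
REVERSE = 16     # read mapped to the reverse strand
SECONDARY = 256  # secondary alignment

# -F 276 drops unmapped, reverse-strand, and secondary reads.
assert 276 == UNMAPPED | REVERSE | SECONDARY

# -F 260 drops only unmapped and secondary reads, keeping reverse-strand
# alignments (both strands are meaningful against a genomic reference).
assert 260 == UNMAPPED | SECONDARY
```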
- To preprocess the inference data (transcriptome), run the following command:
  ```shell
  deeprm call prep -p <input_POD5_dir> -b <bam_path> -o <prep_dir>
  ```
- This will create the NPZ files for inference.
#### Run Inference
- The trained DeepRM model file is included in the repository: `weight/deeprm_weights.pt`.
- For inference, run the following command:
  ```shell
  deeprm call run --model <model_file> --data <data_dir> --output <prediction_dir> --gpu-pool <gpu_pool>
  ```
- Adjust the `-s` (batch size) parameter according to your GPU memory capacity (default: 10000).
- This will create a directory with the site-level and molecule-level result files.
- Optionally, if you used a transcriptomic reference for alignment, you can convert the results to genomic coordinates by supplying a RefFlat/GenePred/RefGene file (`--annot <annotation_file>`).
#### Site-level BED file format
- The output BED file follows the standard bedMethyl format; see https://genome.ucsc.edu/goldenpath/help/bedMethyl.html for a description.
- Please note that columns 14 to 18 are zero-filled for compatibility. These columns will be used in a planned future update.
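As an illustration, a line of this output can be parsed with plain Python. The column names below follow the UCSC bedMethyl description linked above; they are a generic bedMethyl layout, not a DeepRM-specific schema:

```python
# Column names per the UCSC bedMethyl (9+9) description.
BEDMETHYL_COLS = [
    "chrom", "start", "end", "mod_code", "score", "strand",
    "thick_start", "thick_end", "color", "valid_cov", "percent_mod",
    "n_mod", "n_canonical", "n_other_mod", "n_delete", "n_fail",
    "n_diff", "n_nocall",
]

def parse_bedmethyl_line(line):
    """Split one tab-separated bedMethyl line into a typed dict."""
    rec = dict(zip(BEDMETHYL_COLS, line.rstrip("\n").split("\t")))
    for key in ("start", "end", "valid_cov", "n_mod", "n_canonical"):
        rec[key] = int(rec[key])
    rec["percent_mod"] = float(rec["percent_mod"])
    return rec
```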
#### Molecule-level BAM file format
- The output BAM file stores modification information in the MM and ML tags; see https://samtools.github.io/hts-specs/SAMtags.pdf for a description.
#### Molecule-level NPZ file format (advanced usage)
- The output NPZ file contains the following arrays:
  1. `read_id`
  2. `label_id`
  3. `pred`: modification score (between 0 and 1)
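Since `pred` holds per-molecule scores and `label_id` identifies the site, a per-site stoichiometry estimate can be sketched by thresholding and grouping. The helper below and the 0.5 cutoff are illustrative, not part of DeepRM:

```python
import numpy as np

def site_stoichiometry(label_id, pred, threshold=0.5):
    """Fraction of molecules called modified (pred >= threshold) per site."""
    sites, inverse = np.unique(label_id, return_inverse=True)
    n_mod = np.bincount(inverse, weights=(pred >= threshold).astype(float))
    n_total = np.bincount(inverse)
    return dict(zip(sites.tolist(), (n_mod / n_total).tolist()))

# Loading the arrays from a prediction NPZ would look like
# (the file name here is hypothetical):
# data = np.load("<pred_dir>/predictions.npz")
# stoich = site_stoichiometry(data["label_id"], data["pred"])
```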
- Read ID specification:
  - The UUID4-format read ID (128 bits) is converted to two 64-bit integers for NumPy compatibility.
  - You can convert the two 64-bit integers back to a UUID4 using the following Python code:
    ```python
    import uuid

    import numpy as np

    def int_to_uuid(high, low):
        return uuid.UUID(bytes=b"".join([high.tobytes(), low.tobytes()]))
    ```
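A round trip illustrates the conversion; the packing step below (splitting the 16 UUID bytes into two int64s) is inferred from `int_to_uuid` and is only a sketch:

```python
import uuid

import numpy as np

def int_to_uuid(high, low):
    # Same helper as above: concatenate the two raw 8-byte halves.
    return uuid.UUID(bytes=b"".join([high.tobytes(), low.tobytes()]))

# Hypothetical inverse: reinterpret the 16 UUID bytes as two int64s.
read_id = uuid.uuid4()
high, low = np.frombuffer(read_id.bytes, dtype=np.int64)

assert int_to_uuid(high, low) == read_id
```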
- Label ID specification:
  - The label ID encodes the reference, position, and strand information.
  - You can decode the label ID using the following Python code:
    ```python
    import numpy as np

    def decode_label_id(label_id, label_div=10**9):
        strand = np.sign(label_id)
        label_id_abs = np.abs(label_id) - 1
        ref_id = label_id_abs // label_div
        pos = label_id_abs % label_div
        return ref_id, pos, strand
    ```
  - The reference ID is extracted from the input BAM file.
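For intuition, an inverse of `decode_label_id` can be sketched from its definition; this `encode_label_id` helper is illustrative, not part of DeepRM:

```python
import numpy as np

def decode_label_id(label_id, label_div=10**9):
    strand = np.sign(label_id)
    label_id_abs = np.abs(label_id) - 1
    return label_id_abs // label_div, label_id_abs % label_div, strand

def encode_label_id(ref_id, pos, strand, label_div=10**9):
    # Mirror of the decoder: the sign carries the strand, and the magnitude
    # packs (ref_id, pos); the +1 offset keeps position 0 on reference 0
    # representable with a sign.
    return strand * (ref_id * label_div + pos + 1)

ref_id, pos, strand = decode_label_id(np.int64(-3_000_000_042))
assert (ref_id, pos, strand) == (3, 41, -1)
assert encode_label_id(ref_id, pos, strand) == -3_000_000_042
```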
