DeepRM

Deep learning for RNA Modification

License: CC BY-NC-SA 4.0

deeprm.png

✨ Introduction

DeepRM is a deep learning-based framework for RNA modification detection using Nanopore direct RNA sequencing. This repository contains the source code for training and running DeepRM.

🎯 Key Features

  • High accuracy: Achieves state-of-the-art accuracy in RNA modification detection and stoichiometry measurement.
  • Single-molecule resolution: Provides single-molecule level predictions for RNA modifications.
  • End-to-end pipeline: Easy-to-use pipeline from raw reads to site-level predictions.
  • Customizable: Supports training of custom models.

📦 Installation

Prerequisites

  • Linux x86_64
  • Python 3.9+
  • PyTorch 2.3+
    • https://pytorch.org/get-started/locally/
    • Please ensure that you have installed the correct version of PyTorch with CUDA support if you want to use GPU for inference or training.

Optional

  • Torchmetrics 0.9.0+ (only for training)

    • python -m pip install torchmetrics
      
  • Dorado 0.7.3+ (optional, for basecalling)

    • https://github.com/nanoporetech/dorado
  • SAMtools 1.16.1+ (optional, for BAM file processing)

    • http://www.htslib.org/
  • Python package requirements are listed in requirements.txt and will be installed automatically when you install DeepRM.

Installation options

  • Estimated time: ~10 minutes
  1. Install via pip (recommended)
python -m pip install deeprm
  2. Install from source (GitHub)
git clone https://github.com/vadanamu/deeprm
cd deeprm
python -m pip install -U pip
python -m pip install -e .
  • If installation fails on an older OS (e.g., CentOS 7) due to NumPy, try installing an older NumPy version first:
     python -m pip install "numpy<2.3.0,>2.0.0"
     python -m pip install -e .
    

Verify Installation

deeprm --version
deeprm check
  • If everything is installed correctly, you should see the version of DeepRM and a message indicating that the installation is successful.
  • If you encounter CUDA or torch-related errors, make sure you have installed the correct version of PyTorch with CUDA support.

Build from Source

  • DeepRM can use a C++-based preprocessing tool for acceleration, which is provided both as a precompiled binary and as source code.
  • Depending on your system configuration, you may need to build the C++ preprocessing tool from source, located in the cpp directory of the DeepRM repository.
  • Please refer to the cpp/README.md page for detailed build instructions.

🚀 Quickstart

  • For demonstration purposes, you can use the example POD5 and BAM files provided in the examples directory of the repository.
  • You can also use your own POD5 and BAM files.

RNA Modification Detection

  • Estimated time: ~1 hour

1️⃣ Prepare data

deeprm call prep -p inference_example.pod5 -b inference_example.bam -o <prep_dir>
  • (Alternative) To supply your own POD5 file:
    dorado basecaller --reference <ref_fasta> --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 <pod5_dir> \
    | tee >(samtools sort -@ <threads> -O BAM -o <bam_path> - && samtools index -@ <threads> <bam_path>) \
    | deeprm call prep -p <pod5_dir> -b - -o <prep_dir>
    
    • If Dorado fails due to "illegal memory access", try adding the --chunksize <chunk_size> option (e.g., chunk_size=12000).

2️⃣ Run inference

deeprm call run -b inference_example.bam -i <prep_dir> -o <pred_dir> -s 1000
  • Adjust the -s (batch size) parameter according to your GPU memory capacity (default: 10000).
  • Expected output file:
    • Site-level detection result file (.bed)
    • Molecule-level detection result file (.npz)

Model Training

  • Estimated time: ~1 hour

1️⃣ Prepare unmodified & modified training data

deeprm train prep -p training_a_example.pod5 -b training_a_example.bam -o <prep_dir>/a
deeprm train prep -p training_m6a_example.pod5 -b training_m6a_example.bam -o <prep_dir>/m6a

2️⃣ Compile training data

deeprm train compile -n <prep_dir>/a/data -p <prep_dir>/m6a/data -o <prep_dir>/compiled

3️⃣ Run training

deeprm train run -d <prep_dir>/compiled -o <output_dir> --batch 64
  • Adjust the --batch parameter according to your GPU memory capacity (default: 1024).
  • Expected output file:
    • Trained DeepRM model file (.pt)

💻 Usage

Inference usage

deeprm_inference_pipeline.png

Prepare Data

Accelerated preparation (recommended, default)
  • This method uses a precompiled C++ binary to accelerate the preprocessing step.
    dorado basecaller --reference <ref_fasta> --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 <pod5_dir> \
    | tee >(samtools sort -@ <threads> -O BAM -o <bam_path> - && samtools index -@ <threads> <bam_path>) \
    | deeprm call prep -p <pod5_dir> -b - -o <prep_dir>
    
  • If Dorado fails due to "illegal memory access", try adding the --chunksize <chunk_size> option (e.g., chunk_size=12000).
  • If the precompiled binary does not work on your system, please refer to the cpp/README.md page for detailed build instructions.
  • Adjust the -g (--filter-flag) parameter according to your needs. If using a genomic reference, you may want to use -g 260.
Sequential preparation
  • This method is slower than the accelerated preparation method, but is supported for cases such as:

    • The POD5 files are already basecalled to BAM files with move tags.
    • You want to run basecalling and preprocessing in separate machines.
  • Basecall the POD5 files to BAM files with move tags (skip if already done):

    • If Dorado fails due to "illegal memory access", try adding the --chunksize <chunk_size> option (e.g., chunk_size=12000).
dorado basecaller --reference <reference_path> --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 <pod5_dir> > <raw_bam_path>
  • Filter, sort, and index the BAM files:
    • Adjust the -F parameter according to your needs. If using a genomic reference, you may want to use -F 260.
samtools view -@ <threads> -bh -F 276 <raw_bam_path> | samtools sort -@ <threads> -o <bam_path> -
samtools index -@ <threads> <bam_path>
  • To preprocess the inference data (transcriptome), run the following command:
deeprm call prep -p <input_POD5_dir> -b <bam_path> -o <prep_dir>
  • This will create the npz files for inference.
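The samtools -F filter values used above (276 for a transcriptomic reference, 260 for a genomic one) are bitmasks of SAM FLAG bits. A small sketch decoding them, with bit names taken from the SAM specification:

```python
# SAM FLAG bits (SAM specification, section 1.4)
FLAG_BITS = {
    0x4:   "unmapped",
    0x10:  "reverse strand",
    0x100: "secondary alignment",
    0x200: "QC fail",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag):
    """Return the names of the FLAG bits set in `flag`."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(276))  # -F 276 drops unmapped, reverse-strand, and secondary reads
print(decode_flag(260))  # -F 260 drops unmapped and secondary reads only
```

Dropping the reverse-strand bit makes sense for a transcriptomic reference, where all real reads should map to the forward strand; with a genomic reference, reverse-strand alignments are legitimate, hence 260.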

Run Inference

  • The trained DeepRM model file is included in the repository: weight/deeprm_weights.pt.
  • For inference, run the following command:
    • Adjust the -s (batch size) parameter according to your GPU memory capacity (default: 10000).
deeprm call run --model <model_file> --data <data_dir> --output <prediction_dir> --gpu-pool <gpu_pool>
  • This will create a directory with the site-level and molecule-level result files.
  • Optionally, if you used a transcriptomic reference for alignment, you can convert the result to genomic coordinates by supplying a RefFlat/GenePred/RefGene file (--annot <annotation_file>).

Site-level BED file format

  • The output BED file follows the standard bedMethyl format. Please see https://genome.ucsc.edu/goldenpath/help/bedMethyl.html for a description.
  • Please note that columns 14 to 18 are zero-filled for compatibility; they are reserved for a planned future update.
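For downstream analysis, a bedMethyl line can be unpacked with a few lines of Python. A minimal sketch; the field names below follow the UCSC bedMethyl description rather than any DeepRM-specific naming:

```python
# bedMethyl column layout per the UCSC description (18 tab-separated fields).
# Columns 14-18 (n_other_mod .. n_nocall) are zero-filled in DeepRM output.
FIELDS = ["chrom", "start", "end", "mod_code", "score", "strand",
          "thick_start", "thick_end", "color", "valid_cov",
          "percent_mod", "n_mod", "n_canonical", "n_other_mod",
          "n_delete", "n_fail", "n_diff", "n_nocall"]

def parse_bedmethyl_line(line):
    """Split one bedMethyl line into a dict, converting the numeric fields."""
    rec = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("start", "end", "valid_cov", "n_mod", "n_canonical"):
        rec[key] = int(rec[key])
    rec["percent_mod"] = float(rec["percent_mod"])
    return rec

example = "\t".join(["chr1", "100", "101", "m6A", "1000", "+", "100", "101",
                     "255,0,0", "42", "85.7", "36", "6", "0", "0", "0", "0", "0"])
print(parse_bedmethyl_line(example)["percent_mod"])  # prints 85.7
```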

Molecule-level BAM file format

  • The output BAM file contains modification information in MM and ML tags. Please see https://samtools.github.io/hts-specs/SAMtags.pdf for a description.
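The MM/ML encoding can be illustrated with a small standalone decoder. This is a simplified sketch assuming a single modification type on a forward-strand read; reverse-strand reads and multi-modification tags need the full SAMtags rules:

```python
def decode_mm(seq, mm, ml):
    """Decode modified-base positions from MM/ML tag values.

    seq: read sequence as stored in the SEQ field
    mm:  MM tag value, e.g. "A+a,1,0;" (single mod type assumed)
    ml:  8-bit qualities, one per delta in the MM tag
    Returns a list of (position_in_seq, probability) tuples.
    """
    head, deltas = mm.rstrip(";").split(",", 1)
    base = head[0]                        # canonical base, e.g. "A" in "A+a"
    skips = [int(d) for d in deltas.split(",")]
    base_positions = [i for i, b in enumerate(seq) if b == base]
    calls, idx = [], -1
    for skip, q in zip(skips, ml):
        idx += skip + 1                   # skip `skip` unmodified occurrences
        # ML stores an 8-bit probability bin; use the bin midpoint
        calls.append((base_positions[idx], (q + 0.5) / 256))
    return calls

# "A+a,1,0;" means: skip 1 A, call the next A modified; then call the very
# next A modified. In "AACATA" that flags the A at positions 1 and 3.
print(decode_mm("AACATA", "A+a,1,0;", [255, 128]))
```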

Molecule-level NPZ file format (advanced usage)

  • The output NPZ file contains the following arrays:
    1. read_id
    2. label_id
    3. pred: modification score (between 0 and 1)
  • Read ID specification:
    • The UUID4 format read ID (128 bits) is converted to two 64-bit integers for NumPy compatibility.
    • You can convert the two 64-bit integers back to UUID4 using the following Python code:
      import uuid
      import numpy as np

      def int_to_uuid(high, low):
          # high, low: the two np.uint64 halves of the 128-bit read ID
          return uuid.UUID(bytes=b"".join([high.tobytes(), low.tobytes()]))
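For illustration, the conversion round-trips cleanly. The uuid_to_int helper below is hypothetical (how DeepRM actually splits the 128 bits is an assumption here); the sketch only shows that splitting into two NumPy uint64 halves and rejoining them recovers the original UUID:

```python
import uuid
import numpy as np

def int_to_uuid(high, low):
    # Rejoin the two np.uint64 halves into a 16-byte UUID.
    return uuid.UUID(bytes=b"".join([high.tobytes(), low.tobytes()]))

def uuid_to_int(u):
    # Hypothetical forward conversion, shown only to demonstrate the
    # round trip; byte order follows NumPy's native layout.
    high = np.frombuffer(u.bytes[:8], dtype=np.uint64)[0]
    low = np.frombuffer(u.bytes[8:], dtype=np.uint64)[0]
    return high, low

u = uuid.uuid4()
assert int_to_uuid(*uuid_to_int(u)) == u
```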
  • Label ID specification:
    • Label ID contains the reference, position, and strand information.
    • You can decode the label ID using the following Python code:
    import numpy as np
    def decode_label_id(label_id, label_div = 10**9):
        strand = np.sign(label_id)
        label_id_abs = np.abs(label_id) - 1
        ref_id = label_id_abs // label_div
        pos = label_id_abs % label_div
        return ref_id, pos, strand
    
    • Reference ID is extracted from the input BAM file.
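As an advanced-usage sketch, the NPZ arrays can be combined to estimate per-site stoichiometry. The synthetic label_id/pred arrays and the 0.5 call threshold below are illustrative assumptions; real arrays would come from np.load on the output NPZ file:

```python
import numpy as np

def decode_label_id(label_id, label_div=10**9):
    # Same decoding as above: sign encodes strand, the remainder packs
    # reference ID and position.
    strand = np.sign(label_id)
    label_id_abs = np.abs(label_id) - 1
    ref_id = label_id_abs // label_div
    pos = label_id_abs % label_div
    return ref_id, pos, strand

# Synthetic stand-ins for data["label_id"] and data["pred"]:
label_id = np.array([1001, 1001, 1001, 2002, 2002])
pred = np.array([0.9, 0.8, 0.1, 0.2, 0.95])

# Fraction of molecules called modified (score > 0.5) at each site.
for site in np.unique(label_id):
    mask = label_id == site
    ref_id, pos, strand = decode_label_id(site)
    print(int(ref_id), int(pos), int(strand), float(np.mean(pred[mask] > 0.5)))
```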