MultiNano: A Deep Learning Framework for m6A Prediction

MultiNano is a deep learning framework designed for predicting m6A RNA modifications using raw electrical signals from Oxford Nanopore sequencing. It provides high accuracy across species and conditions, offering a user-friendly pipeline for researchers. MultiNano supports both training from scratch and direct prediction modes, detailed as follows.

For more details, see our paper.

1. Environment Setup

First, create the Conda environment using the provided MultiNano.yml file. This will install all necessary dependencies.

conda env create -f demo/MultiNano.yml

Tip: After creation, activate the new environment with conda activate MultiNano (the environment name is defined within the .yml file).

Data Pre-processing Pipeline

This section details the steps required to convert raw Nanopore data into a feature format suitable for model training or prediction.

2. Convert multi-fast5 to single-fast5

The first step is to convert the default multi-fast5 files from the sequencer into single-fast5 format, as required by Tombo.

multi_to_single_fast5   -i demo/input_fast5   -s demo/output_fast5_single   --recursive   -t 30

3. Base-calling with Guppy

Next, perform base-calling on the single-fast5 files. This step generates base calls and writes them back into the fast5 files, which is essential for the subsequent re-squiggling step.

guppy_basecaller   -i /input/single_fast5   -s /output/basecall   --num_callers 30   --recursive   --fast5_out   --config rna_r9.4.1_70bps_hac.cfg

4. Re-squiggle with Tombo

Tombo's resquiggle command aligns the raw electrical signal to a reference genome or transcriptome. This step is critical for accurately mapping signal segments to specific reference positions.

tombo resquiggle   /path/to/workspace   /path/to/reference.fa   --rna   --overwrite   --processes 50   --corrected-group RawGenomeCorrected_001   --basecall-group Basecall_1D_001

| Option | Purpose |
| :--- | :--- |
| Positional 1 | Workspace directory containing the base-called single-fast5 files from the previous step. |
| Positional 2 | The reference transcriptome in FASTA format. |
| --rna | Informs Tombo to use RNA-specific models and expectations for signal alignment. |
| --overwrite | If the command was run before, this will overwrite the previous Tombo output within the fast5 files. |
| --processes <INT> | Number of worker processes for parallel execution. |
| --corrected-group <STR> | Important. The name of the group within the fast5 file where Tombo will store the re-squiggled alignment data. This name is needed for feature extraction. |
| --basecall-group <STR> | The name of the group within the fast5 file where Guppy stored its base calls. This must match the output from Guppy. Basecall_1D_001 is a common default. |
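Because the two group names must agree with what is actually stored in the fast5 files, it can help to inspect one file before and after re-squiggling. The following is a minimal sketch and not part of MultiNano. It assumes single-read fast5 files that keep analysis results under a top-level Analyses group, and that the third-party h5py package is installed. The helper names missing_groups and check_fast5 are hypothetical.

```python
from typing import Iterable, List


def missing_groups(container, group_names: Iterable[str]) -> List[str]:
    """Return the names that have no 'Analyses/<name>' entry in container.

    container is anything that supports the `in` operator on HDF5-style
    paths, for example an open h5py.File.
    """
    return [name for name in group_names if f"Analyses/{name}" not in container]


def check_fast5(path: str, group_names: Iterable[str]) -> List[str]:
    """Open one single-read fast5 file and report which groups are absent."""
    # Imported lazily so that missing_groups itself has no dependency.
    import h5py  # assumption: h5py is available in the environment

    with h5py.File(path, "r") as handle:
        return missing_groups(handle, list(group_names))
```

Under these assumptions, calling check_fast5 on a file with ["Basecall_1D_001"] after base-calling, and with ["Basecall_1D_001", "RawGenomeCorrected_001"] after re-squiggling, should return an empty list in both cases. A non-empty result suggests the group names passed to Tombo need adjusting.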

5. Map Reads and Generate Error Profiles

This series of commands aligns the base-called reads to the reference, filters them, and generates a detailed TSV file that describes matches, mismatches, and indels for each read.

# 1. Concatenate all FASTQ files into one
cat /path/to/fast5_guppy/*.fastq > test.fastq

# 2. Align reads, convert to BAM, and sort
minimap2 -t 30 -ax map-ont ref.transcript.fa test.fastq |   samtools view -hSb |   samtools sort -@ 30 -o test.bam

# 3. Index the BAM file
samtools index test.bam

# 4. Generate a detailed error profile in TSV format
samtools view -h -F 3844 test.bam |   java -jar sam2tsv.jar -r ref.transcript.fa > test.tsv

minimap2: A fast aligner optimized for noisy long-read data. -ax map-ont is a preset for mapping Oxford Nanopore reads.
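samtools view -F 3844: This filter excludes any alignment with one of these SAM flag bits set: unmapped (4), secondary (256), failed quality checks (512), PCR or optical duplicate (1024), or supplementary (2048). Only primary alignments of mapped reads remain.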
sam2tsv.jar: A tool that converts SAM/BAM format to a detailed TSV, which is used in the next step to guide feature extraction.
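The value 3844 passed to -F is a bitmask, and decoding it shows which alignments are dropped. The sketch below is illustrative only: the bit meanings follow the SAM format specification, while the function names decode_flag and passes_filter are hypothetical.

```python
# Bit values and meanings as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def passes_filter(read_flag, exclude_mask=3844):
    """Mimic `samtools view -F exclude_mask`: keep a read only if it has
    none of the bits in exclude_mask set."""
    return (read_flag & exclude_mask) == 0


print(decode_flag(3844))
# ['unmapped', 'secondary', 'qc_fail', 'duplicate', 'supplementary']
```

A primary alignment on the reverse strand has flag 16 and passes the filter, while a supplementary alignment on the reverse strand (2064) does not.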

6. Split Error Profiles for Parallel Processing

This awk command splits the master test.tsv file into smaller files, one for each read. This allows for massive parallelization in the feature extraction step.

mkdir -p tmp
awk 'NR==1{ h=$0 } NR>1 {
  print (!a[$1]++ ? h ORS $0 : $0) > "tmp/"$1".txt"
}' test.tsv

How it works: The script reads test.tsv line by line and saves the header line (h=$0). Each data line is written to a file named after the read ID in column 1 (tmp/$1.txt). The expression !a[$1]++ is true only the first time a read ID is seen, so the header is written exactly once, at the top of each file. The counter is keyed on the same column as the file name, which guarantees that every per-read file starts with a header.
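For readers who prefer Python to awk, the same per-read split can be sketched as follows. This is an illustration rather than part of the MultiNano scripts. It assumes the first line of the TSV is a header and that column 1 holds the read ID, and the function name split_by_read is hypothetical.

```python
from pathlib import Path


def split_by_read(tsv_path, out_dir):
    """Write one <read_id>.txt per read, each starting with the header line.

    Assumes the first line of tsv_path is a header and that column 1 is the
    read ID. Returns the number of per-read files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()
    with open(tsv_path) as src:
        header = src.readline()
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            read_id = line.split("\t", 1)[0]
            first_time = read_id not in seen
            # Truncate on first sight, append afterwards, so rows of one read
            # end up in one file even if they are not contiguous.
            with open(out / f"{read_id}.txt", "w" if first_time else "a") as dst:
                if first_time:
                    dst.write(header)
                    seen.add(read_id)
                dst.write(line)
    return len(seen)
```

Reopening the output file for every row is slow on large inputs but keeps the number of simultaneously open files at one, which avoids per-process file-handle limits when there are many reads.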

7. Feature Extraction

This is the core script that extracts signal features, k-mers, and quality information for each potential m6A site (defined by the DRACH motif).
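DRACH is an IUPAC pattern: D is A, G, or U; R is A or G; H is A, C, or U; and the central A is the candidate m6A position. The sketch below shows one way to locate such sites in a reference sequence. It is not the logic used by scripts/extract.py, which is not shown here, and the function name find_drach_sites is hypothetical. It assumes the reference is written in the DNA alphabet, so U is converted to T before matching.

```python
import re

# IUPAC codes on a DNA-alphabet reference: D = A/G/T, R = A/G, H = A/C/T.
# The lookahead lets motifs that overlap by one base both be reported.
DRACH = re.compile(r"(?=([AGT][AG]AC[ACT]))")


def find_drach_sites(sequence):
    """Return (0-based position of the central A, matched 5-mer) for every
    DRACH occurrence. The returned 5-mer uses T rather than U."""
    seq = sequence.upper().replace("U", "T")
    return [(m.start() + 2, m.group(1)) for m in DRACH.finditer(seq)]


print(find_drach_sites("TTGGACTAA"))
# [(4, 'GGACT')]
```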

python scripts/extract.py   -i /path/to/workspace   -o test/features/   --errors_dir test/tmp/   --corrected_group RawGenomeCorrected_001   --w_is_dir yes   -k 5   -s 65   -n 30

| Option | Purpose |
| :--- | :--- |
| -i <DIR> | Path to the directory containing the re-squiggled fast5 files. |
| -o <PATH> | Output path. Since --w_is_dir is set, this will be a directory where feature files are saved. |
| --errors_dir <DIR> | Path to the directory containing the per-read error profiles (.txt files) created in the previous step. |
| --corrected_group <STR> | Must match the group name used in the tombo resquiggle command (RawGenomeCorrected_001). This tells the script where to find the signal data. |
| --w_is_dir <yes/no> | If yes, treats the output path as a directory and saves features in batches to multiple files, which is efficient for large datasets. |
| -k <INT> | K-mer length. The length of the nucleotide sequence to extract around a site (e.g., 5 for NN[DRACH]NN). Must be an odd number. |
| -s <INT> | Signal length. The number of raw signal values to extract for each base. The script will pad or sample the signals to meet this fixed length. |
| -n <INT> | Number of parallel processes to use for feature extraction. |
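The -s option implies that each base's variable-length signal is brought to a fixed length. How extract.py does this is not shown here, so the following is only a sketch of one plausible approach under stated assumptions: short signals are padded on the right with zeros, and long signals are reduced by taking evenly spaced samples. The function name fix_length, the padding value, and the sampling scheme are all assumptions.

```python
def fix_length(signal, target_len=65, pad_value=0.0):
    """Return a list of exactly target_len values derived from signal.

    Assumptions (not confirmed against extract.py): short signals are padded
    on the right with pad_value; long signals are subsampled at evenly
    spaced indices that include the first and last value.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    values = list(signal)
    n = len(values)
    if n <= target_len:
        return values + [pad_value] * (target_len - n)
    if target_len == 1:
        return [values[0]]
    # n > target_len, so step > 1 and the rounded indices are all distinct.
    step = (n - 1) / (target_len - 1)
    return [values[round(i * step)] for i in range(target_len)]


print(fix_length([1, 2, 3], target_len=5))
# [1, 2, 3, 0.0, 0.0]
print(fix_length(range(10), target_len=4))
# [0, 3, 6, 9]
```

A fixed length per base lets the features for every site be stacked into arrays of identical shape, which is what a neural network input layer typically expects.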

8. Aggre

