ModkitOpt

ModkitOpt finds the best --mod-threshold and --filter-threshold parameters to use when running modkit pileup, and the best stoichiometry cutoff for filtering modkit's bedMethyl output, to maximise the precision and recall of nanopore direct RNA modification calls.

Why use ModkitOpt?

By default, modkit uses a heuristic and unvalidated algorithm for quantifying RNA modifications, which can mislead reporting of RNA modification landscapes. ModkitOpt identifies optimised settings (--mod-threshold, --filter-threshold and stoichiometry cutoff) that rescue modkit performance, substantially increasing the acccuracy of RNA modification landscapes. Further details about modkit's default heuristic and ModkitOpt can be found in our paper, cited below.

How ModkitOpt works

ModkitOpt takes as input a modBAM file containing dorado per-read modification calls, efficiently and systematically scans 36,000 combinations of modkit thresholds (--filter-threshold and --mod-threshold) and downstream stoichiometry cutoffs, and evaluates predicted sites against validated reference sites to quantify precision and recall. ModkitOpt identifies the optimal threshold combination, and corresponding stoichiometry cutoff, that maximises the F1 score ( $2 \cdot precision \cdot recall/(precision+recall)$ ).

Validated reference sites are supplied for mammalian N6-methyladenosine (m6A) and pseudouridine (pseU), which can be used for nanopore datasets that originate from a different biological sample, provided a subset of validated sites are shared. For other modification types, validated sites can be supplied by the user.

Citation

If you use this software, please cite:

Sneddon, Prodic & Eyras. (2025). Resolving systematic bias in nanopore-based RNA modification detection. bioRxiv preprint. DOI: 10.64898/2025.12.19.695383

Quick start

Step 1: Clone the repository

git clone https://github.com/comprna/modkitopt.git

Step 2: Install dependencies

Download modkit

modkit >= v0.6.0
- Download modkit_vXYZ.tar.gz from the modkit release page
- Extract the archive contents to your preferred location
```
tar -xvzf modkit_vXYZ.tar.gz
```

Install conda & nextflow

conda (Miniconda installation guide)
nextflow (installation guide)

Install conda environment

Note that this may take a few minutes. Also, internet is required to install packages, so if running this in an HPC environment make sure you run on a node with internet access.

cd /path/to/modkitopt

nextflow run main.nf --install

Step 3: Run ModkitOpt demo

The demo runs on a small example modBAM file and should complete in <5 mins.

cd /path/to/modkitopt

nextflow run main.nf --demo --modkit /path/to/modkit

Step 4: Run ModkitOpt on your own dataset

Command example

Briefly, the required input files are:

modBAM file output by dorado
FASTA and GTF/GFF3 annotation for modkit to use (the same FASTA as you provided to dorado, with corresponding GTF/GFF3 annotation).
TSV file containing ground truth sites (optional if your nanopore dataset is mammalian and your modification type is m6A or pseU)

See Command details for more information.

See below for how to specify your HPC environment details.

nextflow run main.nf                                           \
  --modbam          /path/to/modbam.bam                        \
  --mod_type        m6A                                        \
  --modkit          /path/to/modkit                            \
  --fasta           /path/to/ref.fa                            \
  --annotation      /path/to/annotation.gff3                   \
  -profile          pbspro                                     \
  --hpc_queue       normal                                     \
  --hpc_project     ab12                                       \
  --hpc_storage     gdata/ab12

Recommended: Run in an HPC environment

Nextflow handles spawning jobs to split up the workflow, to significantly speed up execution time. If you don't have access to an HPC cluster, then you need to use a computer with at least 30GB RAM and at least 8 CPUs.

Resources to request

Since Nextflow handles job creation and resource allocation for running the workflow, you only need to create one top-level job script containing the ModkitOpt command. The top-level job only requires 1 CPU and 4GB RAM.

Tested HPC environments

We have so far tested ModkitOpt in a pbspro environment (NCI's gadi).

Your specific HPC system may use different Nextflow directives, these can be updated in modkitopt/profiles/pbspro.config.

While we have written profiles for pbs and slurm, these have not been tested. We welcome contributions from the community to improve these profiles, which can be found in modkitopt/profiles/.

If you aren't familiar with your system's expected Nextflow directives, you can also run ModkitOpt using -profile local.

Specifying your HPC environment details

When running in an HPC environment, you need to specify:

1. Your HPC environment profile

This tells Nextflow what type of workload manager it is dealing with, allowing Nextflow to automatically handle creating and submitting jobs.

nextflow run main.nf                                           \
  ...
  -profile pbspro                                              \
  ...

2. Your HPC queue name

You must specify the queue that Nextflow can schedule jobs to. Since ModkitOpt only uses CPUs, and up to 30GB memory for some of the tasks, the standard queue should suffice.

nextflow run main.nf                                           \
  ...
  --hpc_queue       normal                                     \
  ...

3. Your HPC project code

You must specify the HPC project code that Nextflow can schedule jobs to.

nextflow run main.nf                                           \
  ...
  --hpc_project     ab12                                       \
  ...

4. Your HPC storage location

You must specify your HPC storage location(s) that ModkitOpt will use. This includes all storage locations for your input files, conda environment, and the modkit repo.

nextflow run main.nf                                           \
  ...                                    
  --hpc_storage     gdata/ab12+gdata/cd34+scratch/ab12         \
  ...

Nextflow not working? Try running with local profile

If you aren't familiar with your system's expected Nextflow directives, or Nextflow is having trouble creating jobs, you can also run ModkitOpt using -profile local.

Using the local profile means that Nextflow won't spawn jobs to run processes in parallel, so it may take a little longer to run but will produce the same results. Using this approach, your job needs at least 8 CPUs, at least 30GB of RAM and at least 5GB of job filesystem disk space.

cd /path/to/modkitopt

nextflow run main.nf                                          \
  --modbam           ./resources/example.bam                  \
  --mod_type         m6A                                      \
  --modkit           /path/to/modkit                          \
  --fasta            /path/to/gencode.v45.transcripts.fa      \
  --annotation       /path/to/gencode.v45.annotation.gff3     \
  -profile local

Resuming an interrupted run

If your run gets interrupted, Nextflow automatically supports checkpointing and resuming runs. Simply add -resume to the Nextflow command that didn't complete and run again!

For example:

nextflow run main.nf                                           \
  ...
  -resume

Command details

Usage:
The typical command structure for running the pipeline is as follows:
nextflow run main.nf --modbam sample.bam
                     --mod_type <m6A|pseU|m5C|inosine>
                     --modkit /path/to/modkit
                     --fasta /path/to/transcriptome.fa
                     --annotation /path/to/annotation.gff3

Mandatory arguments:
    --modbam             .bam file containing per-read modification calls
    --mod_type           Modification type (options: m6A, pseU, m5C, inosine)
    --modkit             Path to modkit executable
    --fasta              Path to reference transcriptome
    --annotation         Path to corresponding reference annotation (.gtf or .gff3)

Mandatory arguments if running on an HPC system (-profile is pbs, pbspro or slurm):
    --hpc_queue          Name of the queue that Nextflow can schedule jobs to (e.g., 'normal')
    --hpc_project        HPC project code that Nextflow can schedule jobs to (e.g., 'ab12')
    --hpc_storage        HPC storage location that outputs can be written to (e.g., 'gdata/ab12')
    --help               This usage statement

Optional arguments:
    --truth_sites        .tsv file containing known modification sites (genomic coordinates, expected columns 1 and 2: [chr, pos], mandatory if mod_type is m5C or inosine)
    --top_outdir         Path to preferred output directory (default: 'results')

Output details

The optimal --filter_threshold, --mod_threshold and stoichiometry cutoff to use are written to standard out:

The optimal modkit pileup parameters are:

 >>> filter_threshold:  0.5
 >>> mod_threshold:     0.99

 With the optimal stoichiometry cutoff to classify modified sites:

 >>> Threshold: 0.086

 Achieving an F1 score of 0.008

A bar plot showing performance across the parameter space is written to `r

Modkitopt

Install / Use

README