Modkitopt
Optimise modkit parameters and stoichiometry cutoff for accurate nanopore-based RNA modification detection
Install / Use
/learn @comprna/ModkitoptREADME
ModkitOpt
ModkitOpt finds the best --mod-threshold and --filter-threshold parameters to use when running modkit pileup, and the best stoichiometry cutoff for filtering modkit's bedMethyl output, to maximise the precision and recall of nanopore direct RNA modification calls.
Why use ModkitOpt?
By default, modkit uses a heuristic and unvalidated algorithm for quantifying RNA modifications, which can mislead reporting of RNA modification landscapes. ModkitOpt identifies optimised settings (--mod-threshold, --filter-threshold and stoichiometry cutoff) that rescue modkit performance, substantially increasing the acccuracy of RNA modification landscapes. Further details about modkit's default heuristic and ModkitOpt can be found in our paper, cited below.
How ModkitOpt works
ModkitOpt takes as input a modBAM file containing dorado per-read modification calls, efficiently and systematically scans 36,000 combinations of modkit thresholds (--filter-threshold and --mod-threshold) and downstream stoichiometry cutoffs, and evaluates predicted sites against validated reference sites to quantify precision and recall. ModkitOpt identifies the optimal threshold combination, and corresponding stoichiometry cutoff, that maximises the F1 score ( $2 \cdot precision \cdot recall/(precision+recall)$ ).
Validated reference sites are supplied for mammalian N6-methyladenosine (m6A) and pseudouridine (pseU), which can be used for nanopore datasets that originate from a different biological sample, provided a subset of validated sites are shared. For other modification types, validated sites can be supplied by the user.
<p align="center"> <img src="images/design.png" width="1000"> </p>Citation
If you use this software, please cite:
Sneddon, Prodic & Eyras. (2025). Resolving systematic bias in nanopore-based RNA modification detection. bioRxiv preprint. DOI: 10.64898/2025.12.19.695383
Quick start
Step 1: Clone the repository
git clone https://github.com/comprna/modkitopt.git
Step 2: Install dependencies
Download modkit
-
modkit >= v0.6.0
- Download modkit_vXYZ.tar.gz from the modkit release page
- Extract the archive contents to your preferred location
tar -xvzf modkit_vXYZ.tar.gz
Install conda & nextflow
- conda (Miniconda installation guide)
- nextflow (installation guide)
Install conda environment
Note that this may take a few minutes. Also, internet is required to install packages, so if running this in an HPC environment make sure you run on a node with internet access.
cd /path/to/modkitopt
nextflow run main.nf --install
Step 3: Run ModkitOpt demo
The demo runs on a small example modBAM file and should complete in <5 mins.
cd /path/to/modkitopt
nextflow run main.nf --demo --modkit /path/to/modkit
Step 4: Run ModkitOpt on your own dataset
Command example
Briefly, the required input files are:
- modBAM file output by dorado
- FASTA and GTF/GFF3 annotation for modkit to use (the same FASTA as you provided to dorado, with corresponding GTF/GFF3 annotation).
- TSV file containing ground truth sites (optional if your nanopore dataset is mammalian and your modification type is m6A or pseU)
See Command details for more information.
See below for how to specify your HPC environment details.
nextflow run main.nf \
--modbam /path/to/modbam.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/ref.fa \
--annotation /path/to/annotation.gff3 \
-profile pbspro \
--hpc_queue normal \
--hpc_project ab12 \
--hpc_storage gdata/ab12
Recommended: Run in an HPC environment
Nextflow handles spawning jobs to split up the workflow, to significantly speed up execution time. If you don't have access to an HPC cluster, then you need to use a computer with at least 30GB RAM and at least 8 CPUs.
Resources to request
Since Nextflow handles job creation and resource allocation for running the workflow, you only need to create one top-level job script containing the ModkitOpt command. The top-level job only requires 1 CPU and 4GB RAM.
Tested HPC environments
We have so far tested ModkitOpt in a pbspro environment (NCI's gadi).
Your specific HPC system may use different Nextflow directives, these can be updated in modkitopt/profiles/pbspro.config.
While we have written profiles for pbs and slurm, these have not been tested. We welcome contributions from the community to improve these profiles, which can be found in modkitopt/profiles/.
If you aren't familiar with your system's expected Nextflow directives, you can also run ModkitOpt using -profile local.
Specifying your HPC environment details
When running in an HPC environment, you need to specify:
1. Your HPC environment profile
This tells Nextflow what type of workload manager it is dealing with, allowing Nextflow to automatically handle creating and submitting jobs.
nextflow run main.nf \
...
-profile pbspro \
...
2. Your HPC queue name
You must specify the queue that Nextflow can schedule jobs to. Since ModkitOpt only uses CPUs, and up to 30GB memory for some of the tasks, the standard queue should suffice.
nextflow run main.nf \
...
--hpc_queue normal \
...
3. Your HPC project code
You must specify the HPC project code that Nextflow can schedule jobs to.
nextflow run main.nf \
...
--hpc_project ab12 \
...
4. Your HPC storage location
You must specify your HPC storage location(s) that ModkitOpt will use. This includes all storage locations for your input files, conda environment, and the modkit repo.
nextflow run main.nf \
...
--hpc_storage gdata/ab12+gdata/cd34+scratch/ab12 \
...
Nextflow not working? Try running with local profile
If you aren't familiar with your system's expected Nextflow directives, or Nextflow is having trouble creating jobs, you can also run ModkitOpt using -profile local.
Using the local profile means that Nextflow won't spawn jobs to run processes in parallel, so it may take a little longer to run but will produce the same results. Using this approach, your job needs at least 8 CPUs, at least 30GB of RAM and at least 5GB of job filesystem disk space.
cd /path/to/modkitopt
nextflow run main.nf \
--modbam ./resources/example.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/gencode.v45.transcripts.fa \
--annotation /path/to/gencode.v45.annotation.gff3 \
-profile local
Resuming an interrupted run
If your run gets interrupted, Nextflow automatically supports checkpointing and resuming runs. Simply add -resume to the Nextflow command that didn't complete and run again!
For example:
nextflow run main.nf \
...
-resume
Command details
Usage:
The typical command structure for running the pipeline is as follows:
nextflow run main.nf --modbam sample.bam
--mod_type <m6A|pseU|m5C|inosine>
--modkit /path/to/modkit
--fasta /path/to/transcriptome.fa
--annotation /path/to/annotation.gff3
Mandatory arguments:
--modbam .bam file containing per-read modification calls
--mod_type Modification type (options: m6A, pseU, m5C, inosine)
--modkit Path to modkit executable
--fasta Path to reference transcriptome
--annotation Path to corresponding reference annotation (.gtf or .gff3)
Mandatory arguments if running on an HPC system (-profile is pbs, pbspro or slurm):
--hpc_queue Name of the queue that Nextflow can schedule jobs to (e.g., 'normal')
--hpc_project HPC project code that Nextflow can schedule jobs to (e.g., 'ab12')
--hpc_storage HPC storage location that outputs can be written to (e.g., 'gdata/ab12')
--help This usage statement
Optional arguments:
--truth_sites .tsv file containing known modification sites (genomic coordinates, expected columns 1 and 2: [chr, pos], mandatory if mod_type is m5C or inosine)
--top_outdir Path to preferred output directory (default: 'results')
Output details
- The optimal --filter_threshold, --mod_threshold and stoichiometry cutoff to use are written to standard out:
The optimal modkit pileup parameters are:
>>> filter_threshold: 0.5
>>> mod_threshold: 0.99
With the optimal stoichiometry cutoff to classify modified sites:
>>> Threshold: 0.086
Achieving an F1 score of 0.008
- A bar plot showing performance across the parameter space is written to `r
