RawBench
A comprehensive benchmarking framework for raw nanopore signal analysis, as described by Eris et al. (https://arxiv.org/pdf/2510.03629)
Install / Use
/learn @CMU-SAFARI/RawBenchREADME
RawBench
Benchmarking framework for raw signal analysis (RSA) of nanopore sequencing data. RSA methods skip basecalling and work directly on the electrical signal, which is faster but involves different tradeoffs in accuracy and resource usage.
RawBench decomposes RSA into three stages and benchmarks different methods at each:
| Stage | Methods compared | |---|---| | Reference encoding | ONT pore model, uncalled4 pore model | | Signal segmentation | t-test event detection | | Representation matching | hash-based (RawHash2), FM-index (uncalled), r-index (Sigmoni), DTW, vector distances |
The baseline is the traditional basecall-then-map approach (Dorado + minimap2).
Associated paper: Eris et al., 2025
Datasets
Three organisms at different genome sizes, plus a mock community for classification:
| ID | Organism | Reference | Chemistry | |---|---|---|---| | d8 | H. sapiens | CHM13v2 | R10.4.1 | | d9 | E. coli | CFT073 | R10.4.1 | | d10 | D. melanogaster | BDGP6.32 | R10.4.1 | | zymo | Zymo mock community | combined refs | R9.4.1 |
Signal data: https://huggingface.co/collections/nappenstance/rawbench-datasets
Methods benchmarked
Read mapping (signal → genomic coordinates)
Each method indexes a reference genome, then maps raw signal reads against it:
| Script pattern | Method | Stages |
|---|---|---|
| d*_mm2.sh | minimap2 | basecall → map (baseline) |
| d*_uncalled4_hash_ttest.sh | RawHash2 hash | t-test segmentation → hash matching |
| d*_uncalled4_fmindex_ttest.sh | uncalled FM-index | t-test segmentation → FM-index matching |
| d*_uncalled4_dtw.sh | uncalled4 DTW | signal storage → DTW alignment |
| d*_uncalled4_vectordistances_ttest.sh | uncalled4 vector | t-test segmentation → vector distance matching |
| d*_ont_hash_ttest.sh | ONT hash | t-test segmentation → hash matching (ONT model) |
Chunk-limited variants (d*_Nchunksmax_mm2.sh) test how quickly each method reaches a mapping decision using only the first N signal chunks.
Read classification (signal → species label)
Binary classification on the Zymo mock community (positive vs negative reference set):
| Script | Method |
|---|---|
| zymo_ont_rindex_ttest.sh | Sigmoni with ONT pore model + r-index |
| zymo_uncalled4_rindex_ttest.sh | Sigmoni with uncalled4 pore model + r-index |
| zymo_uncalled4_fmindex_ttest.sh | uncalled FM-index |
| zymo_uncalled4_hashbased_ttest.sh | RawHash2 hash-based |
| zymo_uncalled4_dtw_ttest.sh | uncalled4 DTW |
| zymo_uncalled4_vectordistances_ttest.sh | uncalled4 vector distances |
Evaluation
All methods produce PAF files. Evaluate with uncalled pafstats:
uncalled pafstats -r ground_truth.paf --annotate tool_output.paf \
> annotated.paf 2> metrics.throughput
Setup
1. Clone with submodules
git clone --recursive https://github.com/CMU-SAFARI/RawBench.git
cd RawBench
If you already cloned without --recursive:
git submodule update --init --recursive
2. Install tools
Conda environments:
# Sigmoni (r-index classification) -- needs ont-fast5-api to read fast5
conda create --name sigmoni python=3.8 -y
conda activate sigmoni
conda install h5py numpy scipy ont-fast5-api -y
pip install uncalled4
# minimap2 (basecall-then-map baseline)
conda create --name mm2 -y && conda activate mm2 && conda install minimap2 -y
# BAM conversion (basecalling benchmarks only)
conda create --name bamtofastq -y && conda activate bamtofastq && conda install -c bioconda bamtofastq -y
Build SPUMONI (r-index backend for Sigmoni):
cd spumoni_submodule
mkdir -p build && cd build
cmake ..
make -j$(nproc)
make install # required -- copies helper programs to build/bin/
cd ../..
External tools (not included, install separately):
- Dorado -- ONT basecaller. Set
DORADO_PATH. - RawHash2 -- hash-based signal mapper. Install to
../bin/rawhash2or setRAWHASH2_PATH.
3. Download data
# Reference genomes
cd refs && bash download_refs.sh && cd ..
# Signal data (fast5)
cd fast5 && bash download.sh && cd ..
Or download individual organisms: bash download_refs.sh ecoli, bash download.sh ecoli.
4. Load environment
source scripts/setup_env.sh
validate_environment
This sets SPUMONI_BUILD_DIR, adds SPUMONI and its helpers to PATH, and checks that tools exist. Override paths before sourcing if needed:
export DORADO_PATH="/your/path/to/dorado"
export RAWHASH2_PATH="/your/path/to/rawhash2"
source scripts/setup_env.sh
5. Smoke test
source scripts/setup_env.sh
conda activate sigmoni
mkdir -p /tmp/rawbench_test/output
# Build Sigmoni index (ecoli positive, dmelanogaster negative)
python sigmoni_submodule/index.py \
-p refs/ecoli.fa -n refs/dmelanogaster.fa \
-b 6 --shred 100000 \
-o /tmp/rawbench_test --ref-prefix ecoli_test
# Classify ecoli fast5 reads
python sigmoni_submodule/main.py \
-i fast5/ecoli/ \
-r /tmp/rawbench_test/refs/ecoli_test \
-b 6 -t $(nproc) \
-o /tmp/rawbench_test/output \
--complexity --sp
# Should print read_id / class columns
head /tmp/rawbench_test/output/reads_binary.report
Running benchmarks
Edit #SBATCH headers in the job scripts for your cluster (partition names, node exclusions are site-specific).
source scripts/setup_env.sh
# Basecalling (Dorado)
sbatch job_scripts/basecalling/d9_basecall_sup.sh
# Read mapping -- different methods on same dataset
sbatch job_scripts/read_mapping/d9_mm2.sh # baseline
sbatch job_scripts/read_mapping/d9_uncalled4_fmindex_ttest.sh # FM-index
sbatch job_scripts/read_mapping/d9_uncalled4_vectordistances_ttest.sh # vector distances
# Read classification -- different methods on Zymo
sbatch job_scripts/read_classification/zymo_ont_rindex_ttest.sh # r-index
sbatch job_scripts/read_classification/zymo_uncalled4_hashbased_ttest.sh # hash-based
sbatch job_scripts/read_classification/zymo_uncalled4_fmindex_ttest.sh # FM-index
Nextflow pipeline
The nextflow-pipeline/ directory contains a Nextflow workflow that decomposes RawHash2 into its three stages (reference encoding, signal segmentation, representation matching) as separate processes. See its README for details. Currently only implements RawHash2's methods.
Repository layout
RawBench/
├── scripts/setup_env.sh # environment config
├── sigmoni_submodule/ # Sigmoni r-index classifier (submodule)
├── spumoni_submodule/ # SPUMONI r-index backend (submodule)
├── job_scripts/
│ ├── basecalling/ # Dorado basecalling (d8, d9, d10)
│ ├── read_mapping/ # mm2, hash, FM-index, DTW, vector dist.
│ └── read_classification/ # Sigmoni, hash, FM-index, DTW, vector dist.
├── nextflow-pipeline/ # modular Nextflow decomposition of RawHash2
├── refs/
│ ├── download_refs.sh # downloads ecoli, hsapiens, dmelanogaster
│ └── download_references.md
├── fast5/
│ ├── download.sh # downloads signal data
│ └── ecoli_filenames.txt
├── kmer_models/ # pore chemistry models (included)
│ ├── ont_r10.4.1.txt
│ └── uncalled4_r10.4.1.txt
├── outputs/ # benchmark results
├── basecalled_reads/ # generated FASTQs
└── DEPENDENCIES.md
Troubleshooting
Run validate_environment to check what's missing.
- SPUMONI crashes with "helper program paths are invalid" -- run
make installinspumoni_submodule/build/. sigmoni_submodule/is empty -- rungit submodule update --init --recursive.- Sigmoni
main.pycrashes withFileNotFoundError-- create the output directory first (mkdir -p). - SLURM jobs fail -- edit
#SBATCHheaders. Partition names (gpu_part,cpu_part) and node exclusions (--exclude=kratos...) are site-specific. uncalled/rawhash2not found -- install them and set paths, or put binaries in../bin/.
Output files
*.report-- per-read classification (TSV: read_id, class)*.paf-- alignments in PAF format*_ann.paf-- PAF annotated with evaluation metrics*.throughput-- accuracy, speed, and throughput metrics*.pseudo_lengths-- PML profiles (Sigmoni)*_timing.log-- resource usage from/usr/bin/time -v*.out/*.err-- SLURM logs
Citation
@software{rawbench2025,
title = {RawBench: A comprehensive benchmarking framework for raw nanopore signal analysis},
author = {Eris, Furkan and McConnell, Ulysse and Firtina, Can and Mutlu, Onur},
year = {2025},
url = {https://github.com/CMU-SAFARI/RawBench}
}
License
MIT
Related Skills
node-connect
347.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
108.0kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
347.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
347.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
