RawHash2 Overview

RawHash2 is a hash-based mechanism to map raw nanopore signals to a reference genome in real time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome, so that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., the human genome).

Rawsamble is a mechanism that finds overlaps between raw signals without a reference genome (all-vs-all overlapping). The overlap information is written in PAF format and can be used by assemblers such as miniasm to construct de novo assemblies.

The figure below shows an overview of the steps that RawHash2 takes to find matching regions between a reference genome and a raw nanopore signal.

To efficiently identify similarities between a reference genome and reads, RawHash has two steps, similar to regular read mapping tools: 1) indexing and 2) mapping. The indexing step generates hash values from the expected signal representation of a reference genome and stores them in a hash table. In the mapping step, RawHash generates hash values from the raw signals and queries the hash table built in the indexing step to find seed matches. To map the raw signal to the reference genome, RawHash then performs chaining over the seed matches.
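The index-then-lookup idea can be illustrated with a toy sketch. It is not RawHash2's actual quantization or hash function: the bucket width of 10, the window length of 3, the helper names `seeds` and `match_seeds`, and all signal values are assumptions made up for illustration, and chaining is left out.

```shell
# Toy illustration only: not RawHash2's actual quantization or hash function.
# seeds K: read one signal value per line, print "key<TAB>position" per window.
seeds() {
  awk -v k="$1" '
    { q[NR] = int($1 / 10) }          # coarse quantization, bucket width 10
    END {
      for (i = 1; i + k - 1 <= NR; i++) {
        key = q[i]
        for (j = 1; j < k; j++) key = key "-" q[i + j]
        print key "\t" i
      }
    }'
}

# match_seeds INDEX: look up read seeds (stdin) in the index file and
# print "read_position<TAB>reference_position" for every hit.
# A repeated key keeps only its last reference position (toy simplification).
match_seeds() {
  awk -F'\t' 'NR == FNR { ref[$1] = $2; next }
              ($1 in ref) { print $2 "\t" ref[$1] }' "$1" -
}

index=$(mktemp)
# 1) Indexing: seeds from a made-up expected reference signal.
printf '%s\n' 81 95 102 76 88 110 93 | seeds 3 > "$index"
# 2) Mapping: seeds from a made-up noisy read that covers reference positions 3-6.
printf '%s\n' 104 79 86 112 | seeds 3 | match_seeds "$index"
rm -f "$index"
```

The two hits printed here (read position 1 with reference position 3, and 2 with 4) have the same offset between read and reference, which is the kind of collinear evidence a chaining step would group into one mapping.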

RawHash2 can be used to map reads from FAST5, POD5, SLOW5, or BLOW5 files to a reference genome in sequence format. POD5 is the recommended format as it is the current default for Oxford Nanopore sequencers.

RawHash performs real-time mapping of nanopore raw signals. As soon as the prefix of a read can be mapped to the reference genome, RawHash stops mapping that read and reports the mapping information in PAF format. The PAF template is similar to the one used in UNCALLED and Sigmap.
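Because the output is PAF, standard text tools are enough to post-process it. The sketch below assumes the standard PAF column layout (query name in column 1, target name in column 6, target start and end in columns 8 and 9) and further assumes that unmapped reads carry `*` in the target column, as in UNCALLED-style output. The helper name `paf_mapped` and the two sample records are hypothetical.

```shell
# Print read name, target name, and target interval for mapped PAF records.
# Assumption: unmapped records have "*" in column 6 (target name).
paf_mapped() {
  awk -F'\t' -v OFS='\t' '$6 != "*" { print $1, $6, $8, $9 }' "$@"
}

# Two made-up records: one mapped read and one unmapped read.
{
  printf 'read1\t4000\t0\t1800\t+\tchr1\t50000\t1200\t3000\t90\t1800\t60\n'
  printf 'read2\t3500\t*\t*\t*\t*\t*\t*\t*\t*\t*\t255\n'
} | paf_mapped
```

With a real run, the same function can be applied to the output file directly, for example `paf_mapped mapping.paf`.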

Recent changes

MinKNOW gRPC real-time streaming (BETA): Live adaptive sampling via MinKNOW, pod5 replay, or Icarust simulator. See MinKNOW Integration and Adaptive Sampling below and the full guide at live/LIVE.md.
Rawsamble overlapping: All-vs-all raw signal overlapping for de novo assembly. See Rawsamble section.
Adaptive quantization: Dynamic bucket sizing based on signal distribution — significant accuracy and performance gains.
RawAlign DTW integration: Signal alignment via Dynamic Time Warping (experimental). See citation below.
Pure C codebase: All source is C; compiled as C++ only when POD5/HDF5 are enabled.
RawHash2 release: Substantial improvements over RawHash v1, which is still available from this release.
Selective format compilation: HDF5, SLOW5, and POD5 can each be enabled/disabled independently.

Installation

Prerequisites

| Requirement | Linux | macOS |
|---|---|---|
| C++ compiler | GCC 11+ (g++ 11 or later) | Xcode Command Line Tools (Apple Clang) |
| CMake | 3.16+ (for CMake build only) | 3.16+ (for CMake build only) |
| GNU Make | Required | Required |

Note: When POD5 support is enabled (the default), all source files are compiled as C++ and linked against POD5 v0.3.36's pre-built static libraries. These libraries require GCC 11 or later on Linux. GCC 8.x is known to fail at link time. If you cannot upgrade GCC, you can disable POD5 with make NOPOD5=1 to compile with any C99-compatible compiler, but POD5 signal input will not be available.
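On Linux, the choice described in the note can be scripted. This is a minimal sketch: only the GCC 11 threshold and the `NOPOD5=1` variable come from the note, while the helper name `suggest_make_flags` is hypothetical and the version string is expected in the form printed by `gcc -dumpversion` (for example `11` or `11.4.0`).

```shell
# Suggest Makefile variables from a GCC version string such as "11.4.0".
# Hypothetical helper; threshold and NOPOD5=1 are taken from the note above.
suggest_make_flags() {
  major=${1%%.*}              # keep the text before the first dot
  if [ "$major" -ge 11 ]; then
    echo ""                   # default build, POD5 enabled
  else
    echo "NOPOD5=1"           # older GCC: build without POD5 support
  fi
}

# Typical use (gcc must be on PATH):
#   make $(suggest_make_flags "$(gcc -dumpversion)")
suggest_make_flags "8.5.0"
suggest_make_flags "11.4.0"
```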

Quick Start

Clone the code from its GitHub repository (--recursive must be used):

git clone --recursive https://github.com/STORMgroup/RawHash2.git rawhash2
cd rawhash2

Recommended: Build with CMake (see Prerequisites above):

make cmake

Alternative: Build with Make only (no CMake required, see Prerequisites):

make

Both methods produce the binary at bin/rawhash2. By default, RawHash2 compiles with POD5 support only. To enable HDF5/FAST5 or SLOW5/BLOW5 support, see the section below.

Compiling with HDF5, SLOW5, and POD5

RawHash2 provides two build systems. The recommended approach is CMake, which provides the most flexibility. The standalone Makefile is an alternative for systems without CMake.

Using CMake (Recommended)

Default build (POD5 only):

make cmake

Enable additional formats:

# Enable all three formats (HDF5, SLOW5, POD5)
make cmake CMAKE_OPTS="-DENABLE_HDF5=ON -DENABLE_SLOW5=ON"

# Enable only HDF5 and POD5
make cmake CMAKE_OPTS="-DENABLE_HDF5=ON"

# Enable only SLOW5 and POD5
make cmake CMAKE_OPTS="-DENABLE_SLOW5=ON"

# Disable POD5, enable HDF5 and SLOW5
make cmake CMAKE_OPTS="-DENABLE_HDF5=ON -DENABLE_SLOW5=ON -DENABLE_POD5=OFF"

Debug, profiling, and sanitizer builds:

# Debug build with debug symbols (-O2 -g)
make cmake CMAKE_OPTS="-DCMAKE_BUILD_TYPE=Debug"

# Enable profiling (-g -fno-omit-frame-pointer -DPROFILERH=1)
make cmake CMAKE_OPTS="-DENABLE_PROFILING=ON"

# Enable AddressSanitizer
make cmake CMAKE_OPTS="-DENABLE_ASAN=ON"

# Enable ThreadSanitizer
make cmake CMAKE_OPTS="-DENABLE_TSAN=ON"

# Combine options
make cmake CMAKE_OPTS="-DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON -DENABLE_HDF5=ON"

Or invoke CMake directly for full control:

mkdir build && cd build
cmake -DENABLE_HDF5=ON -DENABLE_SLOW5=ON ..
cmake --build . -j4
cp src/rawhash2 ../bin/

Use system-installed HDF5 instead of building from submodule:

make cmake CMAKE_OPTS="-DENABLE_HDF5=ON -DUSE_SYSTEM_HDF5=ON"

Using Make (No CMake Required)

Default build (POD5 only):

make

Disable/enable formats:

# Disable POD5 (compile with no external signal format libraries)
make NOPOD5=1

# Enable HDF5 along with POD5
make NOHDF5=0

# Enable SLOW5 along with POD5
make NOSLOW5=0

# Enable all formats
make NOHDF5=0 NOSLOW5=0
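
The Make variables mirror the CMake options from the previous section. As a sketch of that correspondence, the hypothetical helper `format_flags` below turns a list of wanted formats into the flags for either build system. It uses only the flag names documented here and assumes that POD5 is on by default while HDF5 and SLOW5 are off by default.

```shell
# Translate a list of wanted formats into build flags.
# Usage: format_flags cmake|make [hdf5] [slow5] [pod5]
# Hypothetical helper; flag names are the ones documented in this section.
format_flags() {
  system=$1; shift
  hdf5=0; slow5=0; pod5=0
  for fmt in "$@"; do
    case $fmt in
      hdf5)  hdf5=1 ;;
      slow5) slow5=1 ;;
      pod5)  pod5=1 ;;
      *) echo "unknown format: $fmt" >&2; return 1 ;;
    esac
  done
  flags=""
  if [ "$system" = cmake ]; then
    if [ "$hdf5" -eq 1 ];  then flags="$flags -DENABLE_HDF5=ON"; fi
    if [ "$slow5" -eq 1 ]; then flags="$flags -DENABLE_SLOW5=ON"; fi
    if [ "$pod5" -eq 0 ];  then flags="$flags -DENABLE_POD5=OFF"; fi
  else
    if [ "$hdf5" -eq 1 ];  then flags="$flags NOHDF5=0"; fi
    if [ "$slow5" -eq 1 ]; then flags="$flags NOSLOW5=0"; fi
    if [ "$pod5" -eq 0 ];  then flags="$flags NOPOD5=1"; fi
  fi
  echo "${flags# }"           # drop the leading space
}

format_flags cmake hdf5 slow5   # CMake options: HDF5 and SLOW5, no POD5
format_flags make hdf5 pod5     # Make variables: HDF5 along with POD5
```

The output can be passed straight to the build, for example `make cmake CMAKE_OPTS="$(format_flags cmake hdf5 pod5)"` or `make $(format_flags make hdf5 slow5 pod5)`.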

Debug, profiling, and sanitizer builds:

# Debug build with AddressSanitizer (-O2 -fsanitize=address -g)
make DEBUG=1

# Enable profiling (-g -fno-omit-frame-pointer -DPROFILERH=1)
make PROFILE=1

# Enable AddressSanitizer without full debug mode
make asan=1

# Enable ThreadSanitizer
make tsan=1

# Combine options
make DEBUG=1 NOHDF5=0

Rebuild without recompiling external dependencies:

make subset

MinKNOW Integration and Adaptive Sampling

RawHash2 supports real-time adaptive sampling from Oxford Nanopore sequencers via the MinKNOW gRPC API. Signal chunks stream in as reads are being sequenced, RawHash2 maps each chunk incrementally using multi-threaded parallel processing, and mapping decisions (keep or eject) are sent back to the sequencer.

Three signal sources are supported — all use the same --live interface:

| Mode | Signal source | Use case |
|------|--------------|----------|
| Pod5 replay | Python replay server replays a real pod5 file over gRPC | Deterministic validation and benchmarking |
| Icarust | Icarust synthesizes signal from a reference | Integration testing with simulated hardware |
| Real MinKNOW | Physical nanopore sequencer | Production adaptive sampling |

Building with gRPC support

# 1. Install dependencies (Linux via conda)
conda create -n rawhash2-live cmake make cxx-compiler grpcio grpcio-tools libgrpc protobuf
conda activate rawhash2-live

# macOS alternative: brew install grpc cmake

# 2. Build
make cmake CMAKE_OPTS="-DENABLE_GRPC=ON"

Quick start: pod5 replay

# Terminal 1: Start replay server
pip install grpcio grpcio-tools pod5
python3 pod5_replay_server.py --pod5 reads.pod5 --port 10111 --mode uncalibrated

# Terminal 2: Index (one-time) + map live
bin/rawhash2 -x viral -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model \
  -d ref.idx ref.fa
bin/rawhash2 --live --live-port 10111 --live-uncalibrated -x viral -t 16 ref.idx > live.paf

Quick start: real MinKNOW

bin/rawhash2 --live \
  --live-host sequencer01 --live-port 8004 \
  --live-tls --live-tls-cert /opt/ont/minknow/conf/rpc-certs/ca.crt \
  --live-last-channel 512 \
  -x fast --r10 -t 16 ref.idx > live.paf

For the complete guide covering all three modes, installation, configuration, validation, threading architecture, and troubleshooting, see live/LIVE.md.

Usage

Getting help

rawhash2 -h   # print full usage and options

Indexing

Indexing is similar to minimap2's usage. Pore models are located under ./extern. You can jump directly to mapping (the index is built on-the-fly), but pre-building is recommended for real-time applications.

# R9.4 indexing
rawhash2 -d ref.ind \
  -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model \
  -t 32 ref.fasta

# R10.4.1 indexing (different pore model + --r10 flag)
rawhash2 -d ref.ind \
  -p extern/local_kmer_models/uncalled_r1041_model_only_means.txt \
  --r10 -t 32 ref.fasta
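
Since pre-building is recommended for real-time use, a wrapper script may want to rebuild the index only when it is missing or out of date. The sketch below is one way to make that check. The helper name `needs_index` is hypothetical, and the `-nt` file comparison is a widely supported shell extension (bash, dash, ksh) rather than strict POSIX.

```shell
# Succeed when the index must be (re)built: it is missing, or the
# reference file is newer than the index. Hypothetical helper.
needs_index() {
  ref=$1; idx=$2
  [ ! -e "$idx" ] || [ "$ref" -nt "$idx" ]
}

# Typical use before a real-time run, with the R9.4 command shown above:
#   if needs_index ref.fasta ref.ind; then
#     rawhash2 -d ref.ind \
#       -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model \
#       -t 32 ref.fasta
#   fi
```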

Mapping

Inputs can be directories of signal files (FAST5, POD5, SLOW5, BLOW5), individual files, or glob patterns. When there are many files (thousands+), pass directories rather than individual files. Use -o mapping.paf to write output to a file, or redirect stdout.

Mapping presets:

| Preset | Use case | Flag |
|--------|----------|------|
| viral | Viral genomes | -x viral |
| sensitive | Small-medium genomes (<500M) | -x sensitive |
| fast | Large genom
