RawHash
RawHash can accurately and efficiently map raw nanopore signals to reference genomes of varying sizes (e.g., from viral to human genomes) in real time without basecalling. Described by Firtina et al. (published at https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440).
RawHash and Rawsamble Overview
RawHash (and RawHash2) is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., the human genome).
Rawsamble is a mechanism that finds overlaps between raw signals without a reference genome (all-vs-all overlapping). The overlap information is written in PAF format and can be used by assemblers such as miniasm to construct de novo assemblies.
The figure below shows an overview of the steps that RawHash takes to find matching regions between a reference genome and a raw nanopore signal.
<p align="center" width="100%"> <img width="50%" src="./gitfigures/overview.png"> </p>

To efficiently identify similarities between a reference genome and reads, RawHash has two steps, similar to regular read mapping tools: 1) indexing and 2) mapping. The indexing step generates hash values from the expected signal representation of a reference genome and stores them in a hash table. In the mapping step, RawHash generates the hash values from raw signals and queries the hash table generated in the indexing step to find seed matches. To map the raw signal to a reference genome, RawHash performs chaining over the seed matches.
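The two steps described above can be illustrated with a deliberately simplified Python sketch. It is not RawHash's implementation: the uniform quantization, the window length `span`, the use of Python's built-in `hash`, and the "most common diagonal" vote that stands in for chaining are all assumptions chosen to keep the example short.

```python
from collections import defaultdict

def quantize(values, lo=-3.0, hi=3.0, levels=16):
    """Map each normalized signal value to an integer bucket in [0, levels)."""
    width = (hi - lo) / levels
    buckets = []
    for v in values:
        clipped = min(max(v, lo), hi - 1e-9)  # keep the value inside [lo, hi)
        buckets.append(int((clipped - lo) / width))
    return buckets

def seeds(values, span=4):
    """Yield (hash value, position) for every window of `span` quantized values."""
    q = quantize(values)
    for i in range(len(q) - span + 1):
        yield hash(tuple(q[i:i + span])), i

def build_index(expected_signal, span=4):
    """Indexing step: hash table from hash value to reference positions."""
    index = defaultdict(list)
    for h, pos in seeds(expected_signal, span):
        index[h].append(pos)
    return index

def seed_matches(index, raw_events, span=4):
    """Mapping step: look up every hash of the read in the reference index."""
    return [(qpos, rpos)
            for h, qpos in seeds(raw_events, span)
            for rpos in index.get(h, [])]

def best_diagonal(matches):
    """Crude stand-in for chaining: the most common (reference - query) offset."""
    counts = defaultdict(int)
    for qpos, rpos in matches:
        counts[rpos - qpos] += 1
    return max(counts.items(), key=lambda kv: kv[1]) if counts else None

# Toy data: the "read" is a slightly noisy copy of reference positions 5..9.
reference = [-2.5, -1.2, 0.3, 1.7, 2.2, -0.4, 0.9, -1.8, 1.1, 0.0, -2.0, 2.6]
read = [v + 0.01 for v in reference[5:10]]
matches = seed_matches(build_index(reference), read)
```

Quantizing before hashing is what lets a noisy raw signal and the expected signal produce identical hash values; small signal differences that stay within one bucket do not change the seed.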
RawHash can be used to map reads from FAST5, POD5, SLOW5, or BLOW5 files to a reference genome in sequence format.
RawHash performs real-time mapping of nanopore raw signals. When the prefix of a read can be mapped to a reference genome, RawHash stops mapping that read and reports the mapping information in PAF format. We follow a PAF template similar to the one used in UNCALLED and Sigmap to report the mapping information.
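As a rough illustration of how this output could be consumed, the sketch below parses one PAF line into the twelve mandatory PAF columns plus optional tags. It assumes that unmapped reads carry `*` placeholders in the coordinate columns, as UNCALLED-style output does; the tag name `zz` in the usage line is hypothetical and only demonstrates the `key:type:value` layout.

```python
from typing import Dict, NamedTuple, Optional

class PafRecord(NamedTuple):
    query_name: str
    query_length: Optional[int]
    query_start: Optional[int]
    query_end: Optional[int]
    strand: str
    target_name: str
    target_length: Optional[int]
    target_start: Optional[int]
    target_end: Optional[int]
    matches: Optional[int]
    block_length: Optional[int]
    mapq: Optional[int]
    tags: Dict[str, str]

def _num(field: str) -> Optional[int]:
    """Convert a numeric column, treating '*' as 'not available' (assumption)."""
    return None if field == "*" else int(field)

def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory PAF columns and any trailing key:type:value tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    tags = {}
    for tag in f[12:]:
        key, _type, value = tag.split(":", 2)
        tags[key] = value
    return PafRecord(f[0], _num(f[1]), _num(f[2]), _num(f[3]), f[4], f[5],
                     _num(f[6]), _num(f[7]), _num(f[8]), _num(f[9]),
                     _num(f[10]), _num(f[11]), tags)

# Hypothetical example line; the read, contig, and tag names are made up.
record = parse_paf_line(
    "read1\t1000\t10\t500\t+\tchr1\t30000\t100\t600\t50\t500\t60\tzz:f:12.5")
```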
Recent changes
- We have integrated a new overlapping mechanism, called Rawsamble, along with its presets. Please see the corresponding section below to run Rawsamble (i.e., overlapping) with RawHash.
- We have developed a more accurate quantization mechanism in RawHash2. The new mechanism dynamically adjusts the size of the buckets into which signal values are quantized, depending on the normalized distribution of the signal values. This provides significant improvements in both accuracy and performance.
- We have integrated signal alignment with DTW, as proposed in RawAlign (see the citation below). The parameters may not yet be well optimized because this feature is still experimental. Use it with caution.
- rmap.c is now rmap.cpp (it needs to be compiled with C++) due to the recent DTW integration. We are planning to make it a C-compatible implementation again.
- We have released RawHash2, a more sensitive and faster raw signal mapping mechanism with substantial improvements over RawHash. RawHash2 is available within this repository. You can still use the earlier version, RawHash v1, from this release.
- It is now possible to disable compiling HDF5, SLOW5, and POD5. Please check the "Compiling with HDF5, SLOW5, and POD5" section below for details.
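For readers unfamiliar with DTW (dynamic time warping), which the signal alignment item above refers to, here is a minimal textbook sketch of the DTW distance between two signals. It is not the code in rmap.cpp; the absolute-difference cost and the lack of any band or early-termination heuristic are simplifying assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # prev[j] holds the best cost of aligning a[:i-1] with b[:j].
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: a advances, b advances, both advance.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]
```

Because one sample may be matched to several samples of the other sequence, DTW tolerates the variable dwell times that make raw nanopore signals stretch or compress relative to the expected signal.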
Installation
- Clone the code from its GitHub repository (--recursive must be used):
git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash2
- Compile (Make sure you have a C++ compiler and GNU make):
cd rawhash2 && make
If the compilation is successful, the path to the binary will be bin/rawhash2.
Compiling with HDF5, SLOW5, and POD5
We are aware that some of the pre-compiled libraries (e.g., POD5) may not work on your system, in which case you may need to compile these libraries from scratch. You may also want to skip compiling any of the HDF5, SLOW5, or POD5 libraries that you are not going to use. RawHash2 provides a flexible Makefile to enable custom compilation of these libraries.
- It is possible to provide your own include and lib directories for any of the HDF5, SLOW5, and POD5 libraries if you do not want to use the source code or the pre-compiled binaries that come with RawHash2. To use your own include and lib directories, pass them to make when compiling as follows:
#Provide the path to all of the HDF5/SLOW5/POD5 include and lib directories during compilation
make HDF5_INCLUDE_DIR=/path/to/hdf5/include HDF5_LIB_DIR=/path/to/hdf5/lib \
SLOW5_INCLUDE_DIR=/path/to/slow5/include SLOW5_LIB_DIR=/path/to/slow5/lib \
POD5_INCLUDE_DIR=/path/to/pod5/include POD5_LIB_DIR=/path/to/pod5/lib
#Provide the path to only POD5 include and lib directories during compilation
make POD5_INCLUDE_DIR=/path/to/pod5/include POD5_LIB_DIR=/path/to/pod5/lib
- It is possible to disable compiling any of the HDF5, SLOW5, and POD5 libraries. To disable them, use the following variables:
#Disables compiling HDF5
make NOHDF5=1
#Disables compiling SLOW5 and POD5
make NOSLOW5=1 NOPOD5=1
Usage
Getting help
You can print the help message to learn how to use rawhash2:
rawhash2
or
rawhash2 -h
Indexing
Indexing is similar to minimap2's usage. We additionally include the pore models located under ./extern.
Below is an example that generates an index file ref.ind for the reference genome ref.fasta using a certain k-mer model located under extern and 32 threads.
rawhash2 -d ref.ind -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 ref.fasta
Note that you can jump directly to mapping without creating the index, because RawHash2 can generate the index relatively quickly on-the-fly within the mapping step. However, a real-time genome analysis application may still benefit from having the index generated before the mapping step. Thus, we suggest creating the index before the mapping step.
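The suggestion to build the index once and reuse it can be scripted. The sketch below only assembles the argument list for the indexing command shown above (using the -d, -p, and -t options from that example) and skips it when the index file already exists. The function names are hypothetical, and actually launching the command, for example with `subprocess.run(cmd, check=True)`, is left to the caller.

```python
import os

def index_command(ref_fasta, index_path, pore_model, threads=32):
    """Return the argv list mirroring the indexing example in this section."""
    return ["rawhash2", "-d", index_path, "-p", pore_model,
            "-t", str(threads), ref_fasta]

def index_command_if_missing(ref_fasta, index_path, pore_model, threads=32):
    """Return the indexing argv, or None when the index file already exists."""
    if os.path.exists(index_path):
        return None
    return index_command(ref_fasta, index_path, pore_model, threads)
```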
Mapping
It is possible to provide inputs as FAST5 files from multiple directories. It is also possible to provide a list of files matching a certain pattern such as test/data/contamination/fast5_files/Min*.fast5.
- Example usage where multiple files matching the pattern test/data/contamination/fast5_files/Min*.fast5 and the fast5 files inside the test/data/d1_sars-cov-2_r94/fast5_files directory are passed to rawhash2 using 32 threads and the previously generated ref.ind index:
rawhash2 -t 32 ref.ind test/data/contamination/fast5_files/Min*.fast5 test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
- Another example usage where 1) we only input a directory including FAST5 files as the set of raw signals and 2) the output is directly saved in a file.
rawhash2 -t 32 -o mapping.paf ref.ind test/data/d1_sars-cov-2_r94/fast5_files
IMPORTANT: If there are many fast5 files that rawhash2 needs to process (e.g., thousands of them), we suggest that you specify only the directories that contain these fast5 files.
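If you already hold a long list of individual fast5 paths, one way to follow this advice is to collapse the list into the directories that contain the files. This is a small helper sketch; the function name is hypothetical and the paths are treated as plain POSIX-style strings without touching the file system.

```python
from pathlib import PurePosixPath

def containing_directories(paths):
    """Return the unique parent directories of `paths`, in first-seen order."""
    seen = {}
    for p in paths:
        seen.setdefault(str(PurePosixPath(p).parent), None)
    return list(seen)
```

The resulting directories can then be passed to rawhash2 in place of the individual files.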
RawHash2 also provides a set of default parameters that can be preset automatically.
- Mapping reads to a viral reference genome using its corresponding preset:
rawhash2 -t 32 -x viral ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
- Mapping reads to small reference genomes (<500M bases) using its corresponding preset:
rawhash2 -t 32 -x sensitive ref.ind test/data/d4_green_algae_r94/fast5_files > mapping.paf
- If you want to map an R10.4.1 dataset (or R10 in general), please add the --r10 option along with the other presets:
rawhash2 -t 32 -x sensitive --r10 ref.ind test/data/d6_ecoli_r104/fast5_files > mapping.paf
For indexing, please use the k-mer model generated by UNCALLED4.
- Mapping reads to large reference genomes (>500M bases) using its corresponding preset:
rawhash2 -t 32 -x fast ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
RawHash2 provides another set of default parameters that can be used for very large metagenomic samples (>10G). To search efficiently, this preset uses minimizer seeding, which is slightly less accurate than the non-minimizer mode but much faster (around 3X).
rawhash2 -t 32 -x faster ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
The output will be saved to mapping.paf in a modified PAF format used by UNCALLED.
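The preset guidance above can be summarized in a short helper. This is a sketch under stated assumptions: the function names are hypothetical, the `viral` and `metagenomic` flags must be set by the caller because the text gives no size cutoff for viral genomes, and a reference of exactly 500M bases is arbitrarily treated as "small". Only options that appear in the examples above (-t, -x, --r10, -o) are used.

```python
def choose_preset(reference_bases, viral=False, metagenomic=False):
    """Pick a preset name following the size guidance in this section."""
    if viral:
        return "viral"
    if metagenomic and reference_bases > 10_000_000_000:
        return "faster"      # very large metagenomic samples (>10G)
    if reference_bases > 500_000_000:
        return "fast"        # large reference genomes (>500M bases)
    return "sensitive"       # small reference genomes

def map_command(index_path, inputs, preset, threads=32, r10=False, output=None):
    """Return the argv list for a mapping run; it does not execute anything."""
    cmd = ["rawhash2", "-t", str(threads), "-x", preset]
    if r10:
        cmd.append("--r10")
    if output is not None:
        cmd += ["-o", output]
    cmd.append(index_path)
    cmd += list(inputs)
    return cmd
```

When `output` is left as None, the mapping is written to standard output, so the caller would redirect it to a file as in the examples above.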
Rawsamble (for overlapping and assembly construction)
Our new overlapping mechanism, Rawsamble, is now integrated in RawHash. To create overlaps, you can construct the index from signals and perform overlapping using this index as follows:
rawhash2 -x ava -p ../../rawhash2/extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -d ava.ind -t32 test/data/d3_yeast_r94/fast5_files/
Then perform overlapping using this index:
rawhash2 -x ava -t32 ava.ind test/data/d3_yeast_r94/fast5_files/ > ava.paf
We provide the following presets for Rawsamble to enable the overlapping mode (shown in the help message):
Rawsamble Presets:
- ava All-vs-all overlapping mode (default for Rawsamble)
- ava-sensitive More sensitive all-vs-all overlapping mode. Can be slightly slower than ava but is likely to generate longer unitigs in downstream assembly
- ava-viral All-vs-all overlapping for very small genomes such as viral genomes.
- ava-large All-vs-all overlapping for large genomes of size > 10Gb
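Before handing ava.paf to an assembler, it can be useful to check how many distinct overlap partners each read has. The sketch below assumes the overlap file uses the standard PAF column order (query name in column 1, target name in column 6) and that lines with `*` as the target carry no overlap; the function name is hypothetical.

```python
from collections import defaultdict

def overlap_partners(paf_lines):
    """Build a read -> set of overlapping reads map from PAF lines."""
    partners = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        query, target = fields[0], fields[5]
        if target == "*" or query == target:
            continue  # skip lines without an overlap and self-overlaps
        partners[query].add(target)
        partners[target].add(query)
    return dict(partners)
```

For example, `overlap_partners(open("ava.paf"))` gives a quick view of reads that overlap with nothing, which tend to end up outside the assembled unitigs.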
Potential issue
