Tuna
kmer-counter based on kache-hash
tuna
tuna is a fast, streaming k-mer counter for FASTA/FASTQ input. It partitions k-mers by minimizer into superkmer files, then counts them using a streaming hash table — keeping memory usage low and throughput high.
It uses kache-hash as its streaming k-mer hash table. Phase 1 parsing uses a C++ port of helicase (SIMD FASTX parser), and minimizer hashing uses a C++ port of simd-minimizers (canonical ntHash, two-stack sliding window minimum).
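For readers unfamiliar with the two-stack sliding window minimum mentioned above, the idea can be sketched as follows. This is a scalar illustration only, not the simd-minimizers port: it works on plain 64-bit values instead of ntHash values, it does not track minimizer positions or tie-breaking, and the names MinQueue and sliding_min are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// FIFO queue that reports its minimum in amortised O(1), built from two stacks.
// Each stack entry stores (value, minimum of this entry and everything below it).
class MinQueue {
    using Stack = std::vector<std::pair<std::uint64_t, std::uint64_t>>;
    Stack in_, out_;

    static void push_to(Stack& s, std::uint64_t v) {
        const std::uint64_t m = s.empty() ? v : std::min(v, s.back().second);
        s.emplace_back(v, m);
    }

public:
    void push(std::uint64_t v) { push_to(in_, v); }

    // Remove the oldest element. Precondition: the queue is not empty.
    void pop() {
        if (out_.empty()) {
            // Reverse the input stack so the oldest element ends up on top.
            while (!in_.empty()) {
                push_to(out_, in_.back().first);
                in_.pop_back();
            }
        }
        out_.pop_back();
    }

    // Minimum of all queued elements. Precondition: the queue is not empty.
    std::uint64_t min() const {
        if (in_.empty()) return out_.back().second;
        if (out_.empty()) return in_.back().second;
        return std::min(in_.back().second, out_.back().second);
    }
};

// Minimum of every window of w consecutive values.
std::vector<std::uint64_t> sliding_min(const std::vector<std::uint64_t>& values,
                                       std::size_t w) {
    assert(w >= 1);
    std::vector<std::uint64_t> result;
    MinQueue q;
    for (std::size_t i = 0; i < values.size(); ++i) {
        q.push(values[i]);
        if (i + 1 > w) q.pop();                     // drop the value that left the window
        if (i + 1 >= w) result.push_back(q.min());  // window is full: report its minimum
    }
    return result;
}
```

Every value is pushed and popped at most twice, so the cost per window is constant on average regardless of the window length.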
How it works
tuna runs a two-phase pipeline:
- Partition (Phase 1): streams each input file through a minimizer iterator. Whenever the minimizer changes, the current superkmer is flushed to a per-partition binary file (kept on disk when the RAM budget is insufficient). This groups k-mers that share a minimizer into the same bucket. The number of partitions is auto-tuned from input size (targeting ~2 MB input per partition) or set explicitly with -n.
- Count (Phase 2): replays each partition, upserting every k-mer into a kache-hash table with increment semantics. Each partition is processed independently, so the hash table only ever holds one partition's k-mers at a time.
- Output (Phase 2, cont.): iterates the table, applies the -ci/-cx count filters, and writes results to the output file in TSV or KFF format.
Partitions are processed in parallel across threads (up to -t partitions at a time), keeping each thread's peak memory proportional to a single partition's k-mer set.
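The pipeline above can be sketched as a toy model in a few lines of C++. This is not tuna's implementation. It makes several simplifying assumptions: m-mers are ordered lexicographically instead of by canonical ntHash, partitions are kept in memory rather than written to binary files, reverse complements are ignored, and a std::unordered_map stands in for kache-hash. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Smallest m-mer inside one k-mer. Lexicographic order stands in for a hash order.
std::string minimizer_of(const std::string& kmer, std::size_t m) {
    std::string best = kmer.substr(0, m);
    for (std::size_t i = 1; i + m <= kmer.size(); ++i)
        best = std::min(best, kmer.substr(i, m));
    return best;
}

using Partitions = std::vector<std::vector<std::string>>;

// Phase 1: cut seq into superkmers (runs of consecutive k-mers that share a
// minimizer) and route each one to a partition chosen by hashing its minimizer.
Partitions partition_superkmers(const std::string& seq, std::size_t k,
                                std::size_t m, std::size_t n_parts) {
    assert(n_parts >= 1 && m >= 1 && m <= k);
    Partitions parts(n_parts);
    if (seq.size() < k) return parts;
    const std::hash<std::string> hash{};
    std::size_t start = 0;  // first base of the current superkmer
    std::string current = minimizer_of(seq.substr(0, k), m);
    for (std::size_t i = 1; i + k <= seq.size(); ++i) {
        std::string next = minimizer_of(seq.substr(i, k), m);
        if (next != current) {
            // The superkmer spans k-mers start..i-1, i.e. bases [start, i - 1 + k).
            parts[hash(current) % n_parts].push_back(seq.substr(start, i - 1 + k - start));
            start = i;
            current = std::move(next);
        }
    }
    parts[hash(current) % n_parts].push_back(seq.substr(start));
    return parts;
}

// Phase 2: count the k-mers of a single partition with increment semantics.
std::unordered_map<std::string, unsigned> count_partition(
        const std::vector<std::string>& superkmers, std::size_t k) {
    std::unordered_map<std::string, unsigned> table;
    for (const std::string& s : superkmers)
        for (std::size_t i = 0; i + k <= s.size(); ++i)
            ++table[s.substr(i, k)];
    return table;
}

// Run both phases. Only one partition table is alive at any time.
std::map<std::string, unsigned> count_all(const std::string& seq, std::size_t k,
                                          std::size_t m, std::size_t n_parts) {
    std::map<std::string, unsigned> merged;
    for (const auto& part : partition_superkmers(seq, k, m, n_parts))
        for (const auto& [kmer, count] : count_partition(part, k))
            merged[kmer] += count;
    return merged;
}
```

Because a k-mer's minimizer depends only on the k-mer itself, every occurrence of a given k-mer lands in the same partition, which is why each partition can be counted independently.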
Dependencies
- Platform: Linux or macOS, x86_64 only (kache-hash uses x86 SIMD intrinsics)
- C++20 compiler: GCC >= 9.1 or Clang >= 9.0
- CMake >= 3.17
- zlib-ng (fetched automatically by CMake; a system zlib is no longer required)
- kff-cpp-api (fetched automatically by CMake; required for KFF output)
Debian/Ubuntu:
sudo apt-get install build-essential cmake
Fedora/RHEL:
sudo dnf install gcc-c++ cmake
macOS:
brew install llvm cmake
Installation
git clone https://github.com/vicLeva/tuna.git
cd tuna/
mkdir build && cd build/
cmake ..
make -j$(nproc)
The tuna binary will be at build/tuna.
Single-k binary — compile only the templates for one k value. Roughly 10× faster to build and produces a much smaller binary. Passing any other -k at runtime prints an error:
cmake .. -DFIXED_K=31
Debug build — disables optimisations, enables debug symbols for gdb/valgrind:
cmake .. -DCMAKE_BUILD_TYPE=Debug
Usage
tuna [options] <input1.fa [input2.fa ...]> <output_file>
tuna [options] @<input_list_file> <output_file>
Input files can be FASTA or FASTQ, plain or gzipped.
Instead of listing files directly, you can pass @list.txt where list.txt is a newline-separated file of paths.
Options
| Flag | Argument | Default | Description |
|------|----------|---------|-------------|
| -k | <int> | 31 | k-mer length. Any odd value in [11,31] (fits in 64-bit word) |
| -m | <int> | 21 | Minimizer length. Any odd value in [9, k-2]. m=21 is a good default; use m=23–25 for highly repetitive or low-complexity data (e.g. individual human genomes) |
| -t | <int> | 1 | Number of threads. Phase 1 parallelises over input files; Phase 2 over partitions |
| -ci | <int> | 1 | Minimum count to report |
| -cx | <int> | max | Maximum count to report |
| -w | <dir> | next to output | Working directory for temporary partition files. |
| -kff | — | off | Write output in KFF binary format instead of TSV. Auto-detected from a .kff output extension. |
| -h / --help | — | — | Print usage |
| -n | <int> | auto | Number of partitions. Auto-tuned to ~2 MB input/partition when omitted |
| -hp | — | off | Hide progress messages (phase timings are always emitted to stderr) |
| -kt | — | off | Keep temporary partition files after the run |
| -tp | — | off | Stop after partitioning — Phase 1 only |
| -dbg | — | off | Per-partition table summary + minimizer coverage CSV written to <work_dir>/debug_min_coverage.csv |
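As a rough guide to what the automatic -n value looks like, the sketch below divides the total input size by the ~2 MB target and rounds up. The exact rule tuna applies (rounding, clamping, whether MB means 10^6 or 2^20 bytes, how gzipped input is sized) is not documented here, so treat the constant and the helper name estimate_partitions as assumptions.

```cpp
#include <cstdint>

// Assumed target, taken from the option description: about 2 MB of input per partition.
// Using 2 * 2^20 bytes is a choice made for this sketch.
constexpr std::uint64_t kTargetBytesPerPartition = 2ull * 1024 * 1024;

// Hypothetical helper: ceil(total_input_bytes / target), never less than 1.
constexpr std::uint64_t estimate_partitions(std::uint64_t total_input_bytes) {
    const std::uint64_t n =
        (total_input_bytes + kTargetBytesPerPartition - 1) / kTargetBytesPerPartition;
    return n == 0 ? 1 : n;
}
```

Under these assumptions a 100 MiB input would be split into 50 partitions.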
Examples
Count k-mers in a reference genome, k=31, 4 threads:
tuna -k 31 -t 4 genome.fa counts.tsv
Count only k-mers seen at least twice:
tuna -k 31 -t 4 -ci 2 genome.fa counts.tsv
Count from a list of files:
tuna -k 31 -t 8 @genomes.list counts.tsv
Write KFF binary output (auto-detected from extension):
tuna -k 31 -t 8 @genomes.list counts.kff
Large genomes — counting a human-scale genome (3 Gbp) produces ~500 million unique k-mers. In TSV this reaches ~20–30 GB; in KFF binary (~12 bytes/k-mer) it is ~6 GB.
Output format
TSV (default)
Plain text, tab-separated, one k-mer per line:
ACGTACGTACGTACGTACGTACGTACGTACG 42
TGCATGCATGCATGCATGCATGCATGCATGC 7
...
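Reading the TSV output back into a program takes only a few lines. The sketch below is one possible reader, assuming each line holds a k-mer and a non-negative integer count separated by whitespace. The function name read_counts_tsv is hypothetical and not part of tuna.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Parse "kmer<TAB>count" lines into a map. Lines that do not match are skipped.
std::unordered_map<std::string, std::uint64_t> read_counts_tsv(std::istream& in) {
    std::unordered_map<std::string, std::uint64_t> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string kmer;
        std::uint64_t count = 0;
        if (fields >> kmer >> count) counts[kmer] = count;
    }
    return counts;
}
```

For large outputs, holding everything in a map may not be practical. Processing each line as it is read avoids that.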
KFF binary (-kff or .kff extension)
K-mer File Format binary output. Each k-mer is stored as a 2-bit packed sequence (A=0, C=1, G=2, T=3, MSB-first) with a 4-byte big-endian count. The file is marked canonical=true and unique=true. Roughly 3–4× smaller than TSV for k=31.
KFF files can be read with kff-cpp-api or any other KFF-compatible tool.
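To make the per-k-mer size concrete, here is a minimal sketch of the record encoding described above: 2 bits per base (A=0, C=1, G=2, T=3), first base in the most significant position, followed by a 4-byte big-endian count. It is not a KFF writer and does not use kff-cpp-api. The function names are hypothetical, and putting the unused padding bits at the top of the first byte is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>
#include <vector>

// 2-bit code per base, using the encoding stated above.
inline std::uint8_t base_code(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw std::invalid_argument("non-ACGT base");
    }
}

// Pack up to 32 bases into one 64-bit word, first base in the highest used bits.
std::uint64_t pack_kmer(std::string_view kmer) {
    if (kmer.size() > 32) throw std::invalid_argument("k must be <= 32");
    std::uint64_t word = 0;
    for (char b : kmer) word = (word << 2) | base_code(b);
    return word;
}

// Append the low n_bytes bytes of value, most significant byte first.
void append_big_endian(std::vector<std::uint8_t>& out, std::uint64_t value, int n_bytes) {
    for (int i = n_bytes - 1; i >= 0; --i)
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
}

// One record: ceil(k / 4) sequence bytes followed by a 4-byte count.
std::vector<std::uint8_t> encode_record(std::string_view kmer, std::uint32_t count) {
    std::vector<std::uint8_t> record;
    append_big_endian(record, pack_kmer(kmer), static_cast<int>((kmer.size() + 3) / 4));
    append_big_endian(record, count, 4);
    return record;
}
```

For k=31 this gives 8 sequence bytes plus 4 count bytes, i.e. 12 bytes per k-mer.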
Only k-mers with counts in [ci, cx] are written. The canonical (lexicographically smaller of forward/reverse-complement) form of each k-mer is reported.
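The canonical form and the count filter can be illustrated with plain strings. This sketch only mirrors the definitions given above. tuna itself works on packed k-mers, and the names reverse_complement, canonical and passes_filter are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Reverse the k-mer and complement each base (A<->T, C<->G).
std::string reverse_complement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    for (char& b : rc) {
        switch (b) {
            case 'A': b = 'T'; break;
            case 'C': b = 'G'; break;
            case 'G': b = 'C'; break;
            case 'T': b = 'A'; break;
            default: break;  // other symbols are left untouched in this sketch
        }
    }
    return rc;
}

// Canonical form: the lexicographically smaller of the k-mer and its reverse complement.
std::string canonical(const std::string& kmer) {
    return std::min(kmer, reverse_complement(kmer));
}

// A k-mer is reported only if its count lies in the closed interval [ci, cx].
bool passes_filter(std::uint32_t count, std::uint32_t ci, std::uint32_t cx) {
    return ci <= count && count <= cx;
}
```

With an odd k, a k-mer can never equal its own reverse complement, so the canonical form is always one of two distinct strings.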
C++ library API
tuna can be embedded directly in a C++ project:
#include <tuna/tuna.hpp>
// Collect all k-mers into a map (simple)
auto kmers = tuna::count_to<31>({"genome.fa"}); // std::unordered_map<std::string, uint32_t>
// Stream k-mers through a callback (memory-efficient)
tuna::count<31>({"genome.fa"}, [](std::string_view kmer, uint32_t count) {
    // called for every canonical k-mer; may run from multiple threads
});
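Since the callback may run from multiple threads, anything it updates needs to be thread-safe. One possible pattern, sketched below without depending on tuna itself, is an accumulator built on atomics whose call operator takes the same (std::string_view, uint32_t) arguments as the lambda above. The type KmerStats and its fields are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <string_view>

// Summary statistics that can be updated concurrently from several threads.
struct KmerStats {
    std::atomic<std::uint64_t> distinct{0};    // number of k-mers seen
    std::atomic<std::uint64_t> total{0};       // sum of all counts
    std::atomic<std::uint64_t> singletons{0};  // k-mers with count == 1

    void operator()(std::string_view /*kmer*/, std::uint32_t count) {
        // Relaxed ordering is enough here: the counters are independent tallies.
        distinct.fetch_add(1, std::memory_order_relaxed);
        total.fetch_add(count, std::memory_order_relaxed);
        if (count == 1) singletons.fetch_add(1, std::memory_order_relaxed);
    }
};
```

Because atomics are not copyable, such an object would be captured by reference in the callback, for example with a lambda of the form [&stats](std::string_view kmer, uint32_t count) { stats(kmer, count); }. Read the totals only after counting has returned.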
CMake integration:
add_subdirectory(tuna) # or use FetchContent
target_link_libraries(my_target PRIVATE tuna::tuna)
For a full walkthrough (CMake setup, FetchContent, container customisation, thread safety), see the wiki: Using tuna as a library.
Benchmarks
Comparison with KMC 3.2.4, k=31, m=21, 8 threads, on a cluster node. Each row shows the median wall time over per-file runs (100 files for bacteria/metagenomes, 10 for human and Tara).
| dataset | type | tuna median | KMC median | speedup | tuna p1 | tuna p2 |
|---------|------|-------------|------------|---------|---------|---------|
| E. coli | genomes (plain FASTA) | 0.50 s | 1.24 s | 2.5× | 0.14 s | 0.30 s |
| Salmonella | genomes (gz) | 0.47 s | 1.25 s | 2.7× | 0.12 s | 0.29 s |
| Gut | metagenome assemblies (plain FASTA) | 0.23 s | 0.75 s | 3.3× | 0.07 s | 0.13 s |
| Human | genomes (gz) | 134 s | 208 s | 1.5× | 42 s | 88 s |
| Tara | metagenome reads (gz, 5.9 GB) | 93 s | 177 s | 1.9× | 28 s | 62 s |
In these per-file runs, tuna is faster than KMC on every dataset type. Memory usage scales with the number of unique k-mers per partition rather than with total input size. KMC tends to do better when counting over a growing collection of datasets than when counting k-mers inside individual files.

