tuna

tuna is a fast, streaming k-mer counter for FASTA/FASTQ input. It partitions k-mers by minimizer into superkmer files, then counts them using a streaming hash table — keeping memory usage low and throughput high.

It uses kache-hash as its streaming k-mer hash table. Phase 1 parsing uses a C++ port of helicase (SIMD FASTX parser), and minimizer hashing uses a C++ port of simd-minimizers (canonical ntHash, two-stack sliding window minimum).

How it works
Dependencies
Installation
Usage
Output format
- TSV (default)
- KFF binary
C++ library API
Benchmarks

How it works

tuna runs a two-phase pipeline:

Partition (Phase 1): streams each input file through a minimizer iterator. Whenever the minimizer changes, the current superkmer is flushed to a per-partition binary file (spilled to disk when the RAM budget is insufficient). This groups k-mers that share a minimizer into the same bucket. The number of partitions is auto-tuned from input size (targeting ~2 MB of input per partition) or set explicitly with -n.
Count (Phase 2): replays each partition, upserting every k-mer into a kache-hash table with increment semantics. Each partition is processed independently, so a hash table only ever holds one partition's k-mers at a time.
Output (Phase 2, cont.): iterates the table, applies the -ci/-cx count filters, and writes the results to the output file in TSV or KFF format.

Partitions are processed in parallel across threads (up to -t partitions at a time), so peak memory is proportional to the k-mer sets of the partitions being counted concurrently rather than to the whole input.
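
The pipeline above can be sketched in plain C++. This is a toy illustration under loud assumptions, not tuna's implementation: the minimizer is the lexicographically smallest m-mer (tuna uses canonical ntHash with a sliding-window minimum), reverse complements are ignored, partitions are in-memory vectors rather than binary files, and `std::unordered_map` stands in for kache-hash. The names `toy_minimizer`, `partition_superkmers` and `count_partition` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Toy minimizer: lexicographically smallest m-mer of a k-mer (leftmost on ties).
std::string_view toy_minimizer(std::string_view kmer, std::size_t m) {
    assert(m <= kmer.size());
    std::string_view best = kmer.substr(0, m);
    for (std::size_t i = 1; i + m <= kmer.size(); ++i) {
        std::string_view cand = kmer.substr(i, m);
        if (cand < best) best = cand;
    }
    return best;
}

// Phase 1: cut the sequence into superkmers (maximal runs of consecutive
// k-mers that share a minimizer) and append each one to its partition.
std::vector<std::vector<std::string>> partition_superkmers(
        std::string_view seq, std::size_t k, std::size_t m, std::size_t n_parts) {
    assert(n_parts > 0 && m <= k);
    std::vector<std::vector<std::string>> parts(n_parts);
    if (seq.size() < k) return parts;
    const std::hash<std::string_view> hasher;
    std::size_t start = 0;  // first base of the current superkmer
    std::string_view cur = toy_minimizer(seq.substr(0, k), m);
    for (std::size_t i = 1; i + k <= seq.size(); ++i) {
        std::string_view mini = toy_minimizer(seq.substr(i, k), m);
        if (mini != cur) {
            // Minimizer changed: flush k-mers start..i-1, i.e. bases [start, i-1+k).
            parts[hasher(cur) % n_parts].emplace_back(seq.substr(start, i - 1 + k - start));
            start = i;
            cur = mini;
        }
    }
    parts[hasher(cur) % n_parts].emplace_back(seq.substr(start));  // last superkmer
    return parts;
}

// Phase 2: count the k-mers of one partition in a table of its own.
std::unordered_map<std::string, unsigned> count_partition(
        const std::vector<std::string>& superkmers, std::size_t k) {
    std::unordered_map<std::string, unsigned> table;
    for (const std::string& sk : superkmers)
        for (std::size_t i = 0; i + k <= sk.size(); ++i)
            ++table[sk.substr(i, k)];  // upsert with increment semantics
    return table;
}
```

Because every occurrence of a given k-mer has the same minimizer, all of its copies land in the same partition, so the per-partition tables never need to be merged and can be counted independently.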

Dependencies

Platform: Linux or macOS, x86_64 only (kache-hash uses x86 SIMD intrinsics)
C++20 compiler: GCC >= 9.1 or Clang >= 9.0
CMake >= 3.17
zlib-ng (fetched automatically by CMake; a system zlib is no longer required)
kff-cpp-api (fetched automatically by CMake; required for KFF output)

Debian/Ubuntu:

sudo apt-get install build-essential cmake

Fedora/RHEL:

sudo dnf install gcc-c++ cmake

macOS:

brew install llvm cmake

Installation

git clone https://github.com/vicLeva/tuna.git
cd tuna/
mkdir build && cd build/
cmake ..
make -j$(nproc)

The tuna binary will be at build/tuna.

<details> <summary><strong>Compile-time options</strong></summary>

Single-k binary: compile only the templates for one k value. This is roughly 10× faster to build and produces a much smaller binary; passing any other -k at runtime prints an error. For example, to build for k=31 only:

cmake .. -DFIXED_K=31

Debug build — disables optimisations, enables debug symbols for gdb/valgrind:

cmake .. -DCMAKE_BUILD_TYPE=Debug

</details>

Usage

tuna [options] <input1.fa [input2.fa ...]> <output_file>
tuna [options] @<input_list_file>          <output_file>

Input files can be FASTA or FASTQ, plain or gzipped. Instead of listing files directly, you can pass @list.txt where list.txt is a newline-separated file of paths.

Options

| Flag | Argument | Default | Description |
|------|----------|---------|-------------|
| -k | <int> | 31 | k-mer length. Any odd value in [11,31] (fits in 64-bit word) |
| -m | <int> | 21 | Minimizer length. Any odd value in [9, k-2]. m=21 is a good default; use m=23–25 for highly repetitive or low-complexity data (e.g. individual human genomes) |
| -t | <int> | 1 | Number of threads. Phase 1 parallelises over input files; Phase 2 over partitions |
| -ci | <int> | 1 | Minimum count to report |
| -cx | <int> | max | Maximum count to report |
| -w | <dir> | next to output | Working directory for temporary partition files. |
| -kff | — | off | Write output in KFF binary format instead of TSV. Auto-detected from a .kff output extension. |
| -h / --help | — | — | Print usage |

<details> <summary><strong>Advanced / benchmarking flags</strong></summary>

| Flag | Argument | Default | Description |
|------|----------|---------|-------------|
| -n | <int> | auto | Number of partitions. Auto-tuned to ~2 MB input/partition when omitted |
| -hp | — | off | Hide progress messages (phase timings are always emitted to stderr) |
| -kt | — | off | Keep temporary partition files after the run |
| -tp | — | off | Stop after partitioning — Phase 1 only |
| -dbg | — | off | Per-partition table summary + minimizer coverage CSV written to <work_dir>/debug_min_coverage.csv |

</details>
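
To get a feel for what the automatic -n value might look like, the ~2 MB target can be turned into simple arithmetic. The helper below is a hypothetical sketch, not tuna's tuning code: it assumes "2 MB" means 2 MiB, rounds up, and applies no other clamping, whereas the real heuristic may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical estimate: ceil(total_input_bytes / target_bytes), at least 1.
std::uint64_t estimate_partitions(std::uint64_t total_input_bytes,
                                  std::uint64_t target_bytes = 2ull * 1024 * 1024) {
    const std::uint64_t n = (total_input_bytes + target_bytes - 1) / target_bytes;
    return std::max<std::uint64_t>(1, n);
}
```

Under these assumptions a 3 GiB input gives `estimate_partitions(3ull << 30) == 1536`, while any input of 2 MiB or less gives a single partition.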

Examples

Count k-mers in a reference genome, k=31, 4 threads:

tuna -k 31 -t 4 genome.fa counts.tsv

Count only k-mers seen at least twice:

tuna -k 31 -t 4 -ci 2 genome.fa counts.tsv

Count from a list of files:

tuna -k 31 -t 8 @genomes.list counts.tsv

Write KFF binary output (auto-detected from extension):

tuna -k 31 -t 8 @genomes.list counts.kff

Large genomes — counting a human-scale genome (3 Gbp) produces ~500 million unique k-mers. In TSV this reaches ~20–30 GB; in KFF binary (~12 bytes/k-mer) it is ~6 GB.

Output format

TSV (default)

Plain text, tab-separated, one k-mer per line:

ACGTACGTACGTACGTACGTACGTACGTACG	42
TGCATGCATGCATGCATGCATGCATGCATGC	7
...
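
Downstream tools can read this format line by line. A minimal parsing sketch follows, assuming exactly one tab per line and an unsigned decimal count with nothing after it; `parse_tsv_line` is a hypothetical helper and not part of tuna.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>

// Split "KMER<TAB>COUNT" into its two fields; returns nullopt on malformed input.
std::optional<std::pair<std::string, std::uint64_t>> parse_tsv_line(std::string_view line) {
    const std::size_t tab = line.find('\t');
    if (tab == std::string_view::npos || tab == 0) return std::nullopt;
    const std::string_view num = line.substr(tab + 1);
    std::uint64_t count = 0;
    const char* const end = num.data() + num.size();
    auto [ptr, ec] = std::from_chars(num.data(), end, count);
    if (ec != std::errc{} || ptr != end) return std::nullopt;  // not a clean integer
    return std::make_pair(std::string(line.substr(0, tab)), count);
}
```

Lines read with `std::getline` can be passed in directly; a line with a trailing carriage return is rejected by this sketch rather than silently trimmed.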

KFF binary (`-kff` or `.kff` extension)

K-mer File Format binary output. Each k-mer is stored as a 2-bit packed sequence (A=0, C=1, G=2, T=3, MSB-first) with a 4-byte big-endian count. The file is marked canonical=true and unique=true. Roughly 3–4× smaller than TSV for k=31.
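
The encoding described above can be illustrated at the integer level. The sketch below is an assumption-laden illustration, not the kff-cpp-api interface or the exact on-disk byte layout: `pack_kmer`, `unpack_kmer` and `count_big_endian` are hypothetical helpers that pack a k-mer into a 64-bit word with the first base in the most significant used bits, and serialise a count as 4 big-endian bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// 2-bit code of one base, using the A=0, C=1, G=2, T=3 mapping.
inline std::uint64_t encode_base(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  assert(false && "unexpected base"); return 0;
    }
}

// Pack a k-mer (k <= 32) into one word, first base most significant.
std::uint64_t pack_kmer(std::string_view kmer) {
    assert(kmer.size() <= 32);
    std::uint64_t word = 0;
    for (char c : kmer) word = (word << 2) | encode_base(c);
    return word;
}

// Inverse of pack_kmer for a known k.
std::string unpack_kmer(std::uint64_t word, std::size_t k) {
    std::string out(k, 'A');
    for (std::size_t i = 0; i < k; ++i) {
        out[k - 1 - i] = "ACGT"[word & 3];
        word >>= 2;
    }
    return out;
}

// Serialise a count as 4 bytes, most significant byte first.
std::array<std::uint8_t, 4> count_big_endian(std::uint32_t count) {
    return {static_cast<std::uint8_t>(count >> 24), static_cast<std::uint8_t>(count >> 16),
            static_cast<std::uint8_t>(count >> 8),  static_cast<std::uint8_t>(count)};
}
```

For k=31 the sequence needs 62 bits, which fits in 8 bytes; with the 4-byte count that gives 12 bytes per k-mer, consistent with the ~12 bytes/k-mer figure quoted for large genomes.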

KFF files can be read with kff-cpp-api or any other KFF-compatible tool.

Only k-mers with counts in [ci, cx] are written. The canonical (lexicographically smaller of forward/reverse-complement) form of each k-mer is reported.
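
For readers who want to look up a k-mer in tuna's output, the canonical form can be computed as follows. This is a straightforward string-based sketch of the definition above (tuna itself works on hashed, packed representations); the function names are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <string_view>

// Watson-Crick complement of one base; anything else maps to 'N'.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Reverse the k-mer, then complement every base.
std::string reverse_complement(std::string_view kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Canonical form: the lexicographically smaller of the k-mer and its
// reverse complement.
std::string canonical(std::string_view kmer) {
    std::string fwd(kmer);
    std::string rc = reverse_complement(kmer);
    return rc < fwd ? rc : fwd;
}
```

To query a k-mer read from a sequence, pass it through `canonical` first: both strands then map to the same reported entry.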

C++ library API

tuna can be embedded directly in a C++ project:

#include <tuna/tuna.hpp>

// Collect all k-mers into a map (simple)
auto kmers = tuna::count_to<31>({"genome.fa"});   // std::unordered_map<std::string, uint32_t>

// Stream k-mers through a callback (memory-efficient)
tuna::count<31>({"genome.fa"}, [](std::string_view kmer, uint32_t count) {
    // called for every canonical k-mer; may run from multiple threads
});
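
Since the streaming callback may run from multiple threads, any state it updates needs its own synchronisation. One possible pattern is sketched below: a hypothetical `AbundanceHistogram` functor (not part of tuna) that takes the same parameters as the callback above and guards a count histogram with a mutex.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string_view>

// Records, for each count value, how many distinct k-mers had that count.
class AbundanceHistogram {
public:
    // Same parameter list as the callback shown above; safe to call concurrently.
    void operator()(std::string_view /*kmer*/, std::uint32_t count) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++histogram_[count];
    }

    // Copy of the histogram taken under the lock.
    std::map<std::uint32_t, std::uint64_t> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return histogram_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::uint32_t, std::uint64_t> histogram_;
};
```

The functor owns a mutex and is therefore not copyable, so it would be wrapped in a lambda that captures it by reference, for example `[&hist](std::string_view kmer, uint32_t count) { hist(kmer, count); }`, assuming the callback parameter accepts any callable with this signature. A single lock is the simplest choice; per-thread histograms merged at the end would reduce contention on large inputs.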

CMake integration:

add_subdirectory(tuna)                                    # or use FetchContent
target_link_libraries(my_target PRIVATE tuna::tuna)

For a full walkthrough (CMake setup, FetchContent, container customisation, thread safety), see the wiki: Using tuna as a library.

Benchmarks

Comparison with KMC 3.2.4, k=31, m=21, 8 threads, on a cluster node. Each row shows the median wall time over per-file runs (100 files for bacteria/metagenomes, 10 for human and Tara).

| dataset | type | tuna median | KMC median | speedup | tuna Phase 1 | tuna Phase 2 |
|---------|------|-------------|------------|---------|--------------|--------------|
| E. coli | genomes (plain FASTA) | 0.50 s | 1.24 s | 2.5× | 0.14 s | 0.30 s |
| Salmonella | genomes (gz) | 0.47 s | 1.25 s | 2.7× | 0.12 s | 0.29 s |
| Gut | metagenome assemblies (plain FASTA) | 0.23 s | 0.75 s | 3.3× | 0.07 s | 0.13 s |
| Human | genomes (gz) | 134 s | 208 s | 1.5× | 42 s | 88 s |
| Tara | metagenome reads (gz, 5.9 GB) | 93 s | 177 s | 1.9× | 28 s | 62 s |

In these per-file runs, tuna is faster than KMC on every dataset type, with median speedups from 1.5× to 3.3×. Memory usage scales with the number of unique k-mers per partition rather than with total input size. KMC tends to be faster when counting over a growing set of datasets together, rather than when counting k-mers within individual files.

Per-file benchmark: wall time distributions, phase breakdown, and speedup across 5 datasets
