Hamt
A hash array-mapped trie implementation in C
Install / Use
/learn @mkirchner/HamtREADME
libhamt
A hash array-mapped trie (HAMT) implementation in C99. A HAMT is a data structure that can be used to efficiently implement [persistent][wiki_persistent_data_structure] associative arrays (aka maps, dicts) and sets, see the Introduction. The implementation here loosely follows Bagwell's 2000 paper[[1]][bagwell_00_ideal], with a focus on code clarity.
What prompted the somewhat detailed writeup was the realization that there is
not a lot of in-depth documentation for HAMTs beyond the original Bagwell
paper[[1][bagwell_00_ideal]]. Some of the more helpful posts are [Karl Krukow's
intro to Clojure's PersistentHashMap][krukov_09_understanding], [C. S. Lim's
C++ template implementation][chaelim_hamt], [Adrian Coyler's morning paper
post][coyler_15_champ] and the original [Steindoerfer/Vinju compressed HAMT
article it summarizes][steindoerfer_15_optimizing]. The rest mostly seems to be
all bits and pieces and this document is an attempt to (partially) improve that
situation.
FIXME: Complete docs (removal, persistence, path copying)
Quickstart
To build the library and run the tests:
$ git clone git@github.com:mkirchner/hamt.git
$ cd hamt
$ make
$ make test
$ make runtest
In order to use libhamt in your own projects, copy the required sources to
your own source tree.
Benchmarks

The current HAMT implementation consistently outperforms the libavl AVL-tree
and red-black tree implementations by 5x for querying, 1.5x-4x for node insert,
and 1.5x-5x for node removal. Persistent insert and remove implementations
scale roughly similar to the classic trees, with more favorable scaling
behavior for HAMT. Where table caching is an option, the persistent HAMT
implementation reaches better insert performance than red-black trees and
better removal performance than red-black and AVL trees at appoximately 10e5
elements.
Compared to hash tables, HAMT query times start at 2x vs. GLib's HashTable
and 20x vs. hsearch(3) (the latter still being investigated) and then get
progressively worse. This makes sense, given the O(1) vs. O(log N) expectations
between the different approaches.
Note that benchmarking and optimization is an ongoing effort and please take all numbers with a pinch of salt. All measurements have so far been collected on a single system (Apple MBP M2 Max under Ventura 13.4.1).
For detailed performance comparison with AVL and red-black trees (from libavl)
and the HashTree from GLib, see [the benchmarking repo][hamt_bench_github].
Introduction
A hash array mapped trie (HAMT) is a data structure that can be used to implement [associative arrays][wiki_associative_array] (aka maps) and [sets][wiki_set_adt].
Structurally, HAMTs are [hash trees][wiki_hash_tree] that combine favorable characteristics of [hash tables][wiki_hash_table] and array mapped [tries][wiki_trie], namely almost hash table-like time complexity guarantees[[1]][bagwell_00_ideal] (O(log<sub>32</sub>n)) and economic use of memory.
An additional benefit, and a key motivation for the work presented here, is that augmentation of HAMTs with path copying and garbage collection allows for a straightforward and efficient implementation of [persistent][wiki_persistent] versions of maps and sets.
The remaining documentation starts with a description of the libhamt API and
two examples that demonstrate the use of a HAMT as an ephemeral and persistent
data structure, respectively. It then details the implementation: starting from
the foundational data structures and the helper code required for hash
exhaustion and table management, we cover search, insertion, removal, and
iterators. The final implementation section introduces path copying and explains
the changes required to support persistent insert and remove operations. It
closes with an outlook and an appendix.
API
HAMT lifecycle
The core data type exported in the libhamt interface is struct hamt. In order to
create a struct hamt instance, one must call hamt_create(), which requires a
hash function of type hamt_key_hash_fn to hash keys, a comparison function of
type hamt_cmp_fn to compare keys, and a pointer to a hamt_allocator instance.
hamt_delete() deletes struct hamt instances that were created with hamt_create().
/* The libhamt core data structure is a handle to a hash array-mapped trie */
/* Function signature definitions for key comparison and hashing */
typedef int (*hamt_cmp_fn)(const void *lhs, const void *rhs);
typedef uint32_t (*hamt_key_hash_fn)(const void *key, const size_t gen);
/* API functions for lifecycle management */
struct hamt *hamt_create(hamt_key_hash_fn key_hash, hamt_cmp_fn key_cmp, struct hamt_allocator *ator);
void hamt_delete(struct hamt *);
The hamt_key_hash_fn takes a key and a generation gen. The expectation is
that the supplied hash function returns different hashes for the same key but
different generations. Depending on the choice of hash function this can be
implemented using gen as a seed or modifying a copy of key on the fly.
See the examples section for a murmur3-based implementation and
the hashing section for more information on suitable hash functions.
Memory management
libhamt exports its internal memory management API through the hamt_allocator
struct. The struct specifies the functions that the HAMT implementation uses to
allocate, re-allocate and deallocate system memory. The API provides a default
hamt_allocator_default which refers to the standard malloc(), realloc()
and free() functions.
struct hamt_allocator {
void *(*malloc)(const size_t size);
void *(*realloc)(void *chunk, const size_t size);
void (*free)(void *chunk);
};
extern struct hamt_allocator hamt_allocator_default;
Exporting the libhamt memory management API enables library clients to make
use of alternate memory management solutions, most notably of garbage collection
solutions (e.g. the [Boehm-Demers-Weiser GC][boehm_gc]) which are required when
using the HAMT as a persistent data structure (see the structural sharing
example).
Query
size_t hamt_size(const struct hamt *trie);
const void *hamt_get(const struct hamt *trie, void *key);
The hamt_size() function returns the size of the HAMT in O(1). Querying the
HAMT (i.e. searching a key) is done with hamt_get() which takes a pointer to a
key and returns a result in O(log<sub>32</sub> n) - or NULL if the key does
not exist in the HAMT.
Iterators
The API also provides key/value pair access through the hamt_iterator struct.
size_t hamt_size(const struct hamt *trie);
const void *hamt_get(const struct hamt *trie, void *key);
Iterators are tied to a specific HAMT and are created using the
hamt_it_create() function, passing the HAMT instance the iterator should refer
to. Iterators can be advanced with the hamt_it_next() function and as long as
hamt_it_valid() returns true, the hamt_it_get_key() and
hamt_it_get_value() functions will return the pointers to the current
key/value pair. In order to delete an existing and/or exhausted iterator, call
hamt_it_delete().
typedef struct hamt_iterator_impl *hamt_iterator;
hamt_iterator hamt_it_create(const struct hamt *trie);
void hamt_it_delete(hamt_iterator it);
bool hamt_it_valid(hamt_iterator it);
hamt_iterator hamt_it_next(hamt_iterator it);
const void *hamt_it_get_key(hamt_iterator it);
const void *hamt_it_get_value(hamt_iterator it);
Iterators maintain state about their traversal path and changes to the HAMT that an iterator refers to implicitly invalidate the iteration (i.e. undefined behavior).
The order in which iterators return the key value pairs is fully defined by the structure of the trie, which, in turn, is completely defined by the choice of hash function and (where applicable) seed.
Insert & Remove
libhamt supports ephemeral and
[persistent][wiki_persistent] (aka not ephemeral) HAMTs through two different interfaces:
hamt_set() and hamt_remove() for ephemeral use, and their p-versions
hamt_pset() and hamt_premove() for persistent use.
Ephemeral modification
const void *hamt_set(struct hamt *trie, void *key, void *value);
void *hamt_remove(struct hamt *trie, void *key);
hamt_set() takes a pair of key and value pointers and adds the pair to the HAMT,
returning a pointer to the value. If the key already exists, hamt_set()
updates the pointer to the value.
hamt_remove() takes a key and removes the key/value pair with the
respective key from the HAMT, returning a pointer to the value that was
just removed. If the key does not exist, hamt_remove() returns NULL.
Persistent HAMTs
The semantics of persistent HAMTs are different from their ephemeral counterparts: since every modification creates a new version of a HAMT, the modificiation functions return a new HAMT. Modification of a persistent HAMT therefore requires a reassignment idiom if the goal is modification only:
const struct hamt *h = hamt_create(...)
...
/* Set a value and drop the reference to the old HAMT; the GC
* will take care of cleaning up remaining unreachable allocations.
*/
h = hamt_pset(h, some_key, some_value);
...
This seems wasteful at first glance but the respective functions implement structural sharing such that the overhead is limited to ~log<sub>32</sub>(N) nodes (where N is the number of nodes in the graph).
const struct hamt *hamt_pset(const struct hamt *trie, void *key, void *value);
const struct hamt *hamt_premove(const struct hamt *trie, void *key);
hamt_pset() inserts or updates the key with value and returns an opaque
handle to the new HAMT. The new HAMT is guaranteed to contain the new
key/value pair.
hamt_premove() attempts to remove the value with the key `k
