This is the original README. For the adpatation to raku/MoarVM see HAMT-for-MoarVM.md
A hash array-mapped trie (HAMT) implementation in C99. A HAMT is a data structure that can be used to efficiently implement persistent associative arrays (aka maps, dicts) and sets, see the Introduction. The implementation here loosely follows Bagwell's 2000 paper[1], with a focus on code clarity.
What prompted the somewhat detailed writeup was the realization that there is
not a lot of in-depth documentation for HAMTs beyond the original Bagwell
paper[1]. Some of the more helpful posts are Karl Krukow's
intro to Clojure's PersistentHashMap
, C. S. Lim's
C++ template implementation, Adrian Coyler's morning paper
post and the original Steindoerfer/Vinju compressed HAMT
article it summarizes. The rest mostly seems to be
all bits and pieces and this document is an attempt to (partially) improve that
situation.
FIXME: Complete docs (removal, persistence, path copying)
To build the library and run the tests:
$ git clone git@github.com:mkirchner/hamt.git
$ cd hamt
$ make
$ make test
$ make runtest
In order to use libhamt
in your own projects, copy the required sources to
your own source tree.
The current HAMT implementation consistently outperforms the libavl
AVL-tree
and red-black tree implementations by 5x for querying, 1.5x-4x for node insert,
and 1.5x-5x for node removal. Persistent insert and remove implementations
scale roughly similar to the classic trees, with more favorable scaling
behavior for HAMT. Where table caching is an option, the persistent HAMT
implementation reaches better insert performance than red-black trees and
better removal performance than red-black and AVL trees at appoximately 10e5
elements.
Compared to hash tables, HAMT query times start at 2x vs. GLib's HashTable
and 20x vs. hsearch(3)
(the latter still being investigated) and then get
progressively worse. This makes sense, given the O(1) vs. O(log N) expectations
between the different approaches.
Note that benchmarking and optimization is an ongoing effort and please take all numbers with a pinch of salt. All measurements have so far been collected on a single system (Apple MBP M2 Max under Ventura 13.4.1).
For detailed performance comparison with AVL and red-black trees (from libavl
)
and the HashTree from GLib, see the benchmarking repo.
A hash array mapped trie (HAMT) is a data structure that can be used to implement associative arrays (aka maps) and sets.
Structurally, HAMTs are hash trees that combine favorable characteristics of hash tables and array mapped tries, namely almost hash table-like time complexity guarantees[1] (O(log32n)) and economic use of memory.
An additional benefit, and a key motivation for the work presented here, is that augmentation of HAMTs with path copying and garbage collection allows for a straightforward and efficient implementation of persistent versions of maps and sets.
The remaining documentation starts with a description of the libhamt
API and
two examples that demonstrate the use of a HAMT as an ephemeral and persistent
data structure, respectively. It then details the implementation: starting from
the foundational data structures and the helper code required for hash
exhaustion and table management, we cover search, insertion, removal, and
iterators. The final implementation section introduces path copying and explains
the changes required to support persistent insert and remove operations. It
closes with an outlook and an appendix.
The core data type exported in the libhamt
interface is struct hamt
. In order to
create a struct hamt
instance, one must call hamt_create()
, which requires a
hash function of type hamt_key_hash_fn
to hash keys, a comparison function of
type hamt_cmp_fn
to compare keys, and a pointer to a hamt_allocator
instance.
hamt_delete()
deletes struct hamt
instances that were created with hamt_create()
.
/* The libhamt core data structure is a handle to a hash array-mapped trie */
/* Function signature definitions for key comparison and hashing */
typedef int (*hamt_cmp_fn)(const void *lhs, const void *rhs);
typedef uint32_t (*hamt_key_hash_fn)(const void *key, const size_t gen);
/* API functions for lifecycle management */
struct hamt *hamt_create(hamt_key_hash_fn key_hash, hamt_cmp_fn key_cmp, struct hamt_allocator *ator);
void hamt_delete(struct hamt *);
The hamt_key_hash_fn
takes a key
and a generation gen
. The expectation is
that the supplied hash function returns different hashes for the same key but
different generations. Depending on the choice of hash function this can be
implemented using gen
as a seed or modifying a copy of key
on the fly.
See the examples section for a murmur3
-based implementation and
the hashing section for more information on suitable hash functions.
libhamt
exports its internal memory management API through the hamt_allocator
struct. The struct specifies the functions that the HAMT implementation uses to
allocate, re-allocate and deallocate system memory. The API provides a default
hamt_allocator_default
which refers to the standard malloc()
, realloc()
and free()
functions.
struct hamt_allocator {
void *(*malloc)(const size_t size);
void *(*realloc)(void *chunk, const size_t size);
void (*free)(void *chunk);
};
extern struct hamt_allocator hamt_allocator_default;
Exporting the libhamt
memory management API enables library clients to make
use of alternate memory management solutions, most notably of garbage collection
solutions (e.g. the Boehm-Demers-Weiser GC) which are required when
using the HAMT as a persistent data structure (see the structural sharing
example).
size_t hamt_size(const struct hamt *trie);
const void *hamt_get(const struct hamt *trie, void *key);
The hamt_size()
function returns the size of the HAMT in O(1). Querying the
HAMT (i.e. searching a key) is done with hamt_get()
which takes a pointer to a
key and returns a result in O(log32 n) - or NULL
if the key does
not exist in the HAMT.
The API also provides key/value pair access through the hamt_iterator
struct.
size_t hamt_size(const struct hamt *trie);
const void *hamt_get(const struct hamt *trie, void *key);
Iterators are tied to a specific HAMT and are created using the
hamt_it_create()
function, passing the HAMT instance the iterator should refer
to. Iterators can be advanced with the hamt_it_next()
function and as long as
hamt_it_valid()
returns true
, the hamt_it_get_key()
and
hamt_it_get_value()
functions will return the pointers to the current
key/value pair. In order to delete an existing and/or exhausted iterator, call
hamt_it_delete()
.
typedef struct hamt_iterator_impl *hamt_iterator;
hamt_iterator hamt_it_create(const struct hamt *trie);
void hamt_it_delete(hamt_iterator it);
bool hamt_it_valid(hamt_iterator it);
hamt_iterator hamt_it_next(hamt_iterator it);
const void *hamt_it_get_key(hamt_iterator it);
const void *hamt_it_get_value(hamt_iterator it);
Iterators maintain state about their traversal path and changes to the HAMT that an iterator refers to implicitly invalidate the iteration (i.e. undefined behavior).
The order in which iterators return the key value pairs is fully defined by the structure of the trie, which, in turn, is completely defined by the choice of hash function and (where applicable) seed.
libhamt
supports ephemeral and
persistent (aka not ephemeral) HAMTs through two different interfaces:
hamt_set()
and hamt_remove()
for ephemeral use, and their p
-versions
hamt_pset()
and hamt_premove()
for persistent use.
const void *hamt_set(struct hamt *trie, void *key, void *value);
void *hamt_remove(struct hamt *trie, void *key);
hamt_set()
takes a pair of key
and value
pointers and adds the pair to the HAMT,
returning a pointer to the value
. If the key
already exists, hamt_set()
updates the pointer to the value
.
hamt_remove()
takes a key
and removes the key/value pair with the
respective key
from the HAMT, returning a pointer to the value
that was
just removed. If the key
does not exist, hamt_remove()
returns NULL
.
The semantics of persistent HAMTs are different from their ephemeral counterparts: since every modification creates a new version of a HAMT, the modificiation functions return a new HAMT. Modification of a persistent HAMT therefore requires a reassignment idiom if the goal is modification only:
const struct hamt *h = hamt_create(...)
...
/* Set a value and drop the reference to the old HAMT; the GC
* will take care of cleaning up remaining unreachable allocations.
*/
h = hamt_pset(h, some_key, some_value);
...
This seems wasteful at first glance but the respective functions implement structural sharing such that the overhead is limited to ~log32(N) nodes (where N is the number of nodes in the graph).
const struct hamt *hamt_pset(const struct hamt *trie, void *key, void *value);
const struct hamt *hamt_premove(const struct hamt *trie, void *key);
hamt_pset()
inserts or updates the key
with value
and returns an opaque
handle to the new HAMT. The new HAMT is guaranteed to contain the new
key/value pair.
hamt_premove()
attempts to remove the value with the key key
. It is not
an error if the key does not exist; the new HAMT is guaranteed to not contain
the key key
.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "hamt.h"
#include "murmur3.h"
static uint32_t hash_string(const void *key, const size_t gen)
{
return murmur3_32((uint8_t *)key, strlen((const char *)key), gen);
}
int main(int argn, char *argv[])
{
enum { N = 5; };
struct {
char *country;
char *capital;
} cities[N] = {
{"Germany", "Berlin"},
{"Spain", "Madrid"},
{"Italy", "Rome"},
{"France", "Paris"},
{"Romania", "Bucharest"}
/* ... */
};
struct hamt *t;
/* create table */
t = hamt_create(hash_string, strcmp, &hamt_allocator_default);
/* load table */
for (size_t i = 0; i < N; i++) {
hamt_set(t, cities[i].country, cities[i].capital);
}
/* query table */
for (size_t i = 0; i < N; i++) {
printf("%s has capital %s\n", cities[i].country,
hamt_get(t, cities[i].country));
}
/* cleanup */
hamt_delete(t);
return 0;
}
The key to making use of structural sharing is to provide libhamt
with a
struct hamt_allocator
instance that implements garbage collection.
The example below uses the the Boehm-Demers-Weiser GC. For GC installation, compilation and linking instructions, please refer to the GC documentation.
In brief, the Boehm GC provides a gc.h
include file and drop-in replacements
for the standard memory management functions, including malloc
, realloc
and free
.
The following snippet illustrates the required changes:
...
#include "gc.h" /* Boehm-Demers-Weiser GC */
...
inline void nop(void *_) { return; }
int main(int argc, char *argv[]) {
...
/*
Set up garbage collection. We set the function pointer for `free` to
NULL to avoid explicit freeing of memory.
*/
struct hamt_allocator gc_alloc = {GC_malloc, GC_realloc, nop};
const struct hamt *t = hamt_create(hash_string, strcmp, &gc_alloc);
...
}
We set the gc_alloc.free
function pointer to point to nop()
, a
no-operation function. This is necessary to ensure that we rely on the garbage
collector. If we were to provide a pointer to GC_free()
(i.e. GC's drop-in
replacement for the free()
function), we would still implement explicit
deallocation, just with a different free function.
The following snipped illustrates how to create, test, exhaust and dispose of
an iterator. We first create the iterator using hamt_it_create()
, jump into
a while
loop and advance the iterator using hamt_it_next()
while the
iterator is valid. In every interation we print the current key/value pair to
stdout
. Once we exit the loop, we clean up using hamt_it_delete()
.
...
struct hamt *t = hamt_create(hash_string, strcmp, &hamt_allocator_default);
/* load table */
...
/* create iterator */
hamt_iterator it = hamt_it_create(t);
while (hamt_it_valid(it)) {
printf("(%s, %s)\n", (char *)hamt_it_get_key(it),
(char *)hamt_it_get_value(it));
hamt_it_next(it);
}
/* clean up */
hamt_it_delete(it);
...
hamt_delete(t);
...
This concludes the description of the libhamt
interface and we now move on
to detailed implementation notes.
The hamt
source tree has the following structure:
hamt/
build/ Out-of-source build destination
include/ Header files that are part of the interface
src/ Source and header files
test/ Test and utility headers & sources
Makefile
Sources are organized in three folders: the include
folder, for all header
files that are part of the public interface; the src
folder, for the
actual implementation and private header files; and the test
folder, for all
test code, including headers and sources for testing utilities (e.g. data
loading and benchmarking functions).
The build process is governed by a single Makefile
in the project root
directory.
To build the library and run the tests:
$ make && make test
Hash tables. A common and practical answer to efficient value retrieval from a collection given a key is to "use a hash table". This is good advice. Hash tables provide insert, modification, and retrieval in amortized constant average time, using space linear in the number of elements they store. They have been the subject of intensive research and optimization and are part of every introductory CS textbook. Chances are that the standard library of the languange at hand contains a readily available, tried and tested implementation.
For instance, std::unordered_set
and std::unordered_map
(and their
*_multiset
cousins) are hash table implementations for C++ 1; for C, multiple
libc implementations (e.g. glibc, musl,
BSD libc) provide POSIX-compliant hsearch
facilities,
GNOME's GLib
and others provide hash table implementations2. Python has the dict
type
for associative arrays which is implemented as a hash
table3. Java has
Hashtable
, HashMap
, and HashSet
4 and JavaScript has
Map
.
One property of the classical hash table implementations is that they do not provide support for persistence (in the sense of persistent data structures, not persistent storage). They are a place-oriented solution to associative storage and make destructive modifications to the data structure when the data changes (note that this is independent of any particular conflict resolution and capacity maintenance strategies).
Persistent associative containers require a different approach.
Persistent data structures. (Full) persistence is the property of a data structure to always preserve (all) previous versions if itself under modification. The property is related to immutability: from the perspective of the client, every update yields a new copy, making instances practically immutable. This is a huge conceptual change: if data structures are immutable, functions using these data structures are pure (i.e. side effect-free). That in turn enables value semantics, referential transparency, and, consequently, substantial reduction in programming complexity when dealing with paralellism and synchronization (see e.g. Rich Hickey's presentations on The Value of Values and Are We There Yet?).
The catch is that classical hash tables set a high bar in terms of time and space performance characteristics, and persistent data structures need to approximate that bar.
Efficient persistence. Persistent associative data structures need to minimize the memory overhead introduced by value semantics (i.e. returning copies as opposed to modified originals) and, at the same time, provide practically average constant time insert, retrieve and delete capabilities to minimize the performance gap to classical hash tables.
It turns out that the data structure of choice to tackle these challenges is a tree. Trees support efficient structural sharing strategies for efficient memory management and, if they are balanced and have large branching factors, provide O(logk N) average performance guarantees.
Persistent hash array-mapped tries are, in essence, a sophisticated, practical implementation of such a data structure.
One way to understand hash array-mapped tries is to look at them as an evolution of k-ary trees (Fig. 1) that follows from a series of real-world tradeoffs.
Figure 1: *k*-ary tree, hash tree, and hash array-mapped trie.
In classic k-ary trees Ⓐ, Internal and leaf nodes have different types: internal nodes point to n internal or leaf nodes and leaf nodes hold or point to data (i.e. the keys/value pairs). In their basic form, n-ary trees (just like binary trees) are not balanced and their performance characteristics can easily degrade from O(logk n) to O(n) for degenerate input sequences.
One approach to balanced trees are explicit implementations of tree rebalancing (as in e.g. Red-black trees, AVL trees, or B-trees).
Another option is to use a hash tree Ⓑ: like the name implies, it uses the hash of the key, interpreted as a sequence of b-bit groups, to detetermine the location of the leaf node that stores the key/value pair. The group size b determines the branching factor 2b, i.e. for b=5, every node can have 25=32 child nodes. Instead of implementing explicit tree rebalancing, hash trees rely on the distributional properties of a (good) hash function to place nodes uniformly. While this saves some effort for rebalancing, note that hash trees do require a strategy to deal with hash exhaustion, a topic covered below.
The challenge with vanilla hash trees is that they reserve space for k children in every internal node. If the tree is sparsely populated this will cause significant memory overhead and impact performance due to cache misses.
For that reason, HAMTs implement array mapping Ⓒ: instead of reserving space for n pointers to children in each internal node, the parent node stores a bitmap that indicates which children are present and the actual node only allocates the memory required to refer to its children. This is an important optimization that makes trees with a high branching factor more memory efficient and cache-friendly.
In order to implement a persistent map or set, every modification operation must return a modified copy and maintain the source data structure. And returning actual copies is prohibitively expensive in time and memory.
This, finally, is where HAMTs really shine and the true reason why we build them in the first place.
HAMTs are trees and trees are compatible with
structural sharing strategies. Common
techniques are copy-on-write, fat nodes, path
copying, and there are complex
combinations of the previous three. Path copying is
simple, efficient and general and therefore the technique of choice for
libhamt
: Instead of returning an actual copy of the tree during an insert,
update or delete operations, we follow the search path to the item in
question, maintaining a path copy with all the nodes along the way, make our
modification along this path and return it to the caller.
Note that enabling persistence requires the use of a garbage collection
strategy. Under stanard malloc()
memory management, there is no way for
the HAMT nodes to know how many descendants of a HAMT refer to them.
In the following we will address these concepts in turn: we first define the foundational data structure used to build a tree and introduce the concept of an anchor. We then dive into hash functions and the hash state management required to make hashing work for trees of arbitrary depths and in the presence of hash collisions. We then turn to table management, introducing a set of functions used to create, modify, query and dispose of mapped arrays. With these pieces in place, we are ready to implement the insert/update, query, and delete functions for non-persistent HAMTs. And lastly, we introduce the concept of path copying and close with the implementation of persistent insert/update and delete functions for HAMTs.
libhamt
uses different types to implement internal and leaf nodes.
Leaf nodes contain two fields, called value
and key
(the rationale for the
reverse ordering of the two fields will become evident shortly).
struct {
void *value;
void *key;
} kv;
Both fields are
defined as void*
pointers to support referring to arbitrary data types via
type casting
5.
libhamt
's internal nodes are where the magic happens, based on Bagwell's Ideal Hash
Trees paper and according to the design principles
outlined above.
With a branching factor k, internal nodes have at most k successors but
can be sparsely populated. To allow for a memory-efficient representation,
internal nodes have a pointer ptr
that points to a fixed-size, right-sized
array of child nodes (also known as a table) and a k-bit index
bitmap field that
keeps track of the size and occupancy of that array.
libhamt
uses k=32 and because index
is a 32-bit bitmap field, the number
of one-bits in index
yields the size of the array that ptr
points to (also
known as the population count or popcount()
of index
).
This suggests an initial (incomplete) definition along the following lines:
struct {
struct T *ptr; /* incomplete */
uint32_t index;
} table;
The specification of T
must provide the ability for that datatype to point to
internal and external nodes alike, using only a single pointer type.
A solution is to wrap the two types into a union
(and then to wrap
the union
into a typedef
for convenience):
typedef struct hamt_node {
union {
struct {
void *value;
void *key;
} kv;
struct {
struct hamt_node *ptr;
uint32_t index;
} table;
} as;
} hamt_node;
With this structure, given a pointer hamt_node *p
to a hamt_node
instance, p->as.kv
addresses the leaf node, and p->as.table
addresses the
internal node and p->as.kv.value
, p->as.kv.key
, p->as.table.ptr
, and
p->as.table.index
provide access to the respective fields.
To maintain sanity, we define the following convenience macros:
#define TABLE(node) node->as.table.ptr
#define INDEX(node) node->as.table.index
#define VALUE(node) node->as.kv.value
#define KEY(node) node->as.kv.key
Figure 2:
Memory structure of an internal node. If node
is a pointer
to an internal node, TABLE(node)
(or, equivalently,
node->as.table.ptr
) points to the first field of the successor table.
The libhamt
codebase makes liberal use of the concept of an anchor. An
anchor is a hamt_node*
pointer to an internal node (i.e.
is_value(VALUE(anchor))
evaluates to false). An anchor
provides access to
all information relevant to manage the table of child nodes: INDEX(anchor)
returns the bitmap that encodes the array mapping, applying a popcount to the
bitmap gives the size of the table and indexing is implemented using partial
popcounts. Table elements can be accessed through
TABLE(anchor)[i]
, where i
must be in the valid range.
The definition of hamt_node
enables the construction of trees with a mix of
internal and leaf nodes. What the definition does not provide, is a way to
determine if a concrete hamt_node*
pointer points to an internal or a leaf
node. One solution would be to specify an enum
that indicates the type
(i.e. NODE_LEAF
, etc.) and to add a type
field to struct hamt_node
. While
valid, this would also increase the size of the struct by 50% just to maintain
a single bit of information. Luckily, there is a more memory-efficient
solution: pointer tagging.
Since pointers need to be word-aligned, that leaves the lower 3 bits of all pointers on 64-bit architectures always set to zero. It is possible to make use of these bits under two conditions: (1) we know we are looking at a pointer (the bottom three bits for the integer 1 are zero, too); and (2) we carefully mask the bits in question whenever we actually use the pointer (since it would point to the wrong location otherwise). The first is not a problem since we own the code; the second requires diligence and some helper functions:
#define HAMT_TAG_MASK 0x3
#define HAMT_TAG_VALUE 0x1
#define tagged(__p) (hamt_node *)((uintptr_t)__p | HAMT_TAG_VALUE)
#define untagged(__p) (hamt_node *)((uintptr_t)__p & ~HAMT_TAG_MASK)
#define is_value(__p) (((uintptr_t)__p & HAMT_TAG_MASK) == HAMT_TAG_VALUE)
In order to mark a leaf node as such, we set key
as usual and tag the value
pointer before assigning it to value
:
p->as.kv.key = key_ptr;
p->as.kv.value = tagged(value_ptr);
Given a pointer to a leaf (e.g. a search result), we untag value
before
returning it:
...
if (status == SEARCH_SUCCESS) {
return untagged(p->as.kv.value);
}
...
And, in order to determine what we are looking at, we use is_value
:
if (is_value(p->as.kv.value)) {
/* this is a leaf */
...
} else {
/* this is an internal node */
...
}
Pointer tagging is the reason why the value
and key
fields in the struct kv
struct are ordered the way they are.
The union
in hamt_node
causes the
memory locations of the struct kv
and struct table
structs to overlap. Since
the table.index
field is not a pointer (and the bottom-three-bits-are-zero
guarantee does not apply), its storage location cannot be used for pointer
tagging, leaving the table.ptr
to the task. Putting kv.value
first,
aligns the value field with table.ptr
. The reverse order would work, but the
kv.key
pointer is dereferenced much more often in the code and so it is more
convenient to use kv.value
.
The principal idea behing array mapping is to project a sparse bitmap index onto the index of a dense array, where the size of the array corresponds to the number of non-zero bits in the bitmap index.
Given hamt_node *p
is a valid pointer to a node, INDEX(p)
corresponds to a
sparse bitmap index. The dense array is located at TABLE(p)
and its size is
determined by the population count of INDEX(p)
.
The mapping itself is conceptually trivial: to determine the dense index for every non-zero bit in the bitmap index, count the number of non-zero bits to the right of it. In other words, the first set bit goes to index 0, the second to index 1, and so forth.
Efficiently implementing population counting (also known as the hamming
weight) of a bitset is not trivial. libhamt
falls back on a
GCC/Clang intrinsic:
static inline int get_popcount(uint32_t n) { return __builtin_popcount(n); }
With get_popcount()
available, determining the position (i.e. dense index)
for a sparse index in a bitmap reduces to calculating the population count of
the bitmap masked off above the sparse index:
static inline int get_pos(uint32_t sparse_index, uint32_t bitmap)
{
return get_popcount(bitmap & ((1 << sparse_index) - 1));
}
Lastly, to determine if a node has a child at a particular index index
, we
check if the bit at that index is set in the bitmap:
static inline bool has_index(const hamt_node *anchor, size_t index)
{
return INDEX(anchor) & (1 << index);
}
A hash function is a function that takes data of arbitrary size and maps it to a fixed-size value (often machine word sizes). Good hash functions are fast to compute and produce uniform output, they map their inputs as evenly as possible over the output range. If it is practically infeasible to invert the mapping (i.e. determine which hash corresponds to which input value), the hash function is called a cryptographic hash function.
For the purpose of implementing a HAMT, cryptographical security is not a design goal. However, the uniformity of the hash function has direct impact on the balance of the tree: it is the hash that pre-determines all key positions in the fully populated tree and it is its distribution properties that determines the number of collisions (and hence depth extensions) we introduce.
libhamt
does not force clients to use a particular hash function. The
libary exposes a hash function signature of the form
typedef uint32_t (*hamt_key_hash_fn)(const void *key, const size_t gen);
and expects users to provide a suitable function pointer as part of the call to
hamt_create()
which, among other parameters, takes a hash function:
/* ... see below for a practical definition of my_keyhash_string */
struct hamt *t = hamt_create(my_keyhash_string, my_keycmp_string,
&hamt_allocator_default);
There are multiple good, practical choices
for the HAMT. Per default libhamt
includes its own,
tested implementation of 32-bit
MurmurHash3:
/* from include/murmur3.h */
uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed);
This declares the murmur hash function. In its standard form murmur3_32
takes a pointer key
to byte-sized objects, a count of len
that speficies
the number of bytes to hash and a random seed seed
.
In order to use murmur3 as a hamt
hash function, we need to wrap it into a
helper function:
static uint32_t my_keyhash_string(const void *key, const size_t gen)
{
uint32_t hash = murmur3_32((uint8_t *)key, strlen((const char *)key), gen);
return hash;
}
Here, the wrapper makes use of strlen(3)
, assuming valid C strings as keys.
Note the use of gen
as a seed for the hash (see below for the hash exhaustion
discussion).
Here is a full example:
#include "murmur3.h"
/* ... */
static uint32_t my_keyhash_string(const void *key, const size_t gen)
{
uint32_t hash = murmur3_32((uint8_t *)key, strlen((const char *)key), gen);
return hash;
}
/* ... */
struct hamt *t = hamt_create(my_keyhash_string, my_keycmp_string,
&hamt_allocator_default);
For a hash trie, the number of elements in the trie is limited by the total number
of hashes that fits into a 32-bit uint32_t
, i.e. 2^32-1. Since the HAMT only
uses 30 bits (in 6 chunks of 5 bits), the number of unique keys in the trie is
limited to 230-1 = 1,073,741,823 keys.
At the same time, since every layer of the
tree uses 5 bits of the hash, the trie depth is limited to 32/5 = 6 layers.
Neither the hard limit to the number of elements in the trie,
nor the inability to build a trie beyond depth of 6 are desirable properties.
To address both issues, libhamt
recalculates the hash with a different seed every
6 layers. This requires a bit of state management and motivates the
existence of the hash_state
data type and functions that operate on it:
typedef struct hash_state {
const void *key;
hamt_key_hash_fn hash_fn;
uint32_t hash;
size_t depth;
size_t shift;
} hash_state;
The struct maintains the pointers key
to the key that is being hashed and
hash_fn
to the hash function used to calculate the current hash hash
. At
the same time, it tracks the current depth depth
in the tree (this is the
hash generation) and the bitshift shift
of the current 5-bit hash chunk.
The interface provides two functions: the means to step from the current 5-bit
hash to the next in hash_next()
; and the ability query the current index of a
key at the current trie depth in hash_get_index()
.
hash_next()
takes a pointer to a hash_state
instance and steps that instance
from the current to the next chunk. Taking a step involves increasing the
depth
and shift
, and initiating a rehash if the shift
indicates
that the hash has been exhausted:
static inline hash_state *hash_next(hash_state *h)
{
h->depth += 1;
h->shift += 5;
if (h->shift > 25) {
h->hash = h->hash_fn(h->key, h->depth / 5);
h->shift = 0;
}
return h;
}
The index of a hash at its current depth corresponds to the decimal
representation of the current chunk. To determine the current chunk,
we right-shift the hash by h->shift
to right-align the desired
LSB and then mask with 0x11111
which equals 0x1f
:
static inline uint32_t hash_get_index(const hash_state *h)
{
return (h->hash >> h->shift) & 0x1f;
}
In order to facilitate memory management for tables (aka the internal nodes),
libhamt
defines a set of helper functions. Each of these functions takes a
hamt_allocator
and calls the user-supplied allocation, re-allocation and
deallocation functions as appropriate.
We start by defining a simple memory abstraction (it would also be correct to use real functions instead of preprocessor macros for this):
#define mem_alloc(ator, size) (ator)->malloc(size)
#define mem_realloc(ator, ptr, size) (ator)->realloc(ptr, size)
#define mem_free(ator, ptr) (ator)->free(ptr)
This will make it easier to add optimizations (e.g. table caching) in the future. On top of these macros, table lifecycle management is accomplished with a few dedicated allocation and de-allocation functions.
table_allocate()
allocates tables with size size
and returns a pointer to
the newly allocated table.
hamt_node *table_allocate(struct hamt_allocator *ator, size_t size)
{
return (hamt_node *)mem_alloc(ator, (size * sizeof(hamt_node)));
}
table_free()
deallocates the allocation referenced by ptr
. It also
supports taking a size
parameter for future extension (e.g. provide a hint
for allocation pool management) that is currently ignored by the underlying
mem_free()
implementation.
void table_free(struct hamt_allocator *ator, hamt_node *ptr, size_t size)
{
mem_free(ator, ptr);
}
While it is possible to implement table re- and right-sizing with the two functions introduced above, it makes a lot of sense to provide specialized functionality for the key allocation/de-allocation use cases: extending, shrinking and gathering a table.
Table extension. Since the tables in a HAMT are right-sized to minimize memory overhead, item insertion must necessarily add an additional row to an existing table. As illustrated in figure 3, the table extension function takes an anchor for an existing table, allocates a new table with increased size, copies over the exsiting entries (leaving a gap at the appropriate position for the new row), assigns the new key and value to the fields in the new row, updates the anchor with the new memory location of the table and the new index, and eventually frees the memory of the old table.
Figure 3: Extending a table creates a new copy of the existing table with an additional row for the new node.
Looking at the code, this is implemented in verbatim in the table_extend()
function. table_extend()
takes an anchor
pointer to a table of
size n_rows
, then uses the allocator ator
to create a new table of size n_rows + 1
with an empty row at position pos
and the bitmap index bit index
set. It
uses memcpy()
to copy memory ranges into the the appropriate positions in
the new allocation, frees the old table and assignes the new table ptr
and
index
in the anchor:
hamt_node *table_extend(struct hamt_allocator *ator, hamt_node *anchor,
size_t n_rows, uint32_t index, uint32_t pos)
{
hamt_node *new_table = table_allocate(ator, n_rows + 1);
if (!new_table)
return NULL;
if (n_rows > 0) {
/* copy over table */
memcpy(&new_table[0], &TABLE(anchor)[0], pos * sizeof(hamt_node));
/* note: this works since (n_rows - pos) == 0 for cases
* where we're adding the new k/v pair at the end and memcpy(a, b, 0)
* is a nop */
memcpy(&new_table[pos + 1], &TABLE(anchor)[pos],
(n_rows - pos) * sizeof(hamt_node));
}
table_free(ator, TABLE(anchor), n_rows);
TABLE(anchor) = new_table;
INDEX(anchor) |= (1 << index);
return anchor;
}
Shrinking a table. Shrinking a table is the inverse operation of table extension: since we maintain right-sized tables as an invariant, we need to adjust table sizes the moment the client deletes a key/value pair from the HAMT.
Figure 4 illustrates the concept: given an anchor, the shrinking function returns a new table with the specified row removed.
Figure 4: Shrinking a table creates a new copy of the table with the specified row removed.
In the code, this is what table_shrink()
does. In the same way as
table_extend()
the function takes a pointer ator
to the global allocator,
a pointer anchor
to the current anchor, the size of the current tables as
n_rows
, and the pair of one-hot bitmap index index
and storage array
position pos
. And, in analogy to table extension, the function allocation a
right-sized table, copies the data to keep using range copies with memcpy()
,
frees up the old table and updates the anchor to reflect the changes.
hamt_node *table_shrink(struct hamt_allocator *ator, hamt_node *anchor,
size_t n_rows, uint32_t index, uint32_t pos)
{
hamt_node *new_table = NULL;
uint32_t new_index = 0;
if (n_rows > 0) {
new_table = table_allocate(ator, n_rows - 1);
if (!new_table)
return NULL;
new_index = INDEX(anchor) & ~(1 << index);
memcpy(&new_table[0], &TABLE(anchor)[0], pos * sizeof(hamt_node));
memcpy(&new_table[pos], &TABLE(anchor)[pos + 1],
(n_rows - pos - 1) * sizeof(hamt_node));
}
table_free(ator, TABLE(anchor), n_rows);
INDEX(anchor) = new_index;
TABLE(anchor) = new_table;
return anchor;
}
Table gathering. As we are deleting entries from the HAMT, we may end up
with the table structure shown in Figure 5: a table in which one of the
entries is a single-row table. What we want to do in these cases is to replace
the table entry in TABLE(anchor)[1]
with the key/value pair from
TABLE(TABLE(anchor)[1])
and gather the one-row table into its parent.
While this comes at additional computational cost upon delete, it maintains
the logarithmic depth properties as the HAMT changes its size.
Figure 5: Gathering pulls a one-row-sized table into its parent table (essentially converting an internal node into a leaf node).
The code is straightforward: we take the allocator alloc
, the anchor
pointer, and the position pos
of the single-row table inside the parent
table, copy over the key and value from the child table to the parent
(maintaining a temporary handle on the child) and then free the child table:
hamt_node *table_gather(struct hamt_allocator *ator, hamt_node *anchor,
uint32_t pos)
{
int n_rows = get_popcount(INDEX(anchor));
hamt_node *table = TABLE(anchor);
KEY(anchor) = table[pos].as.kv.key;
VALUE(anchor) = table[pos].as.kv.value; /* already tagged */
table_free(ator, table, n_rows);
return anchor;
}
Table duplication. Lastly, table duplication. This will be required for path
copying when we implement persistency and it is so straightforward that there
is no diagram: given an anchor, table_dup()
determines the size of the table
that the anchor points to, allocates the required memory and performs a range
copy using memcpy()
to duplicate the table contents:
hamt_node *table_dup(struct hamt_allocator *ator, hamt_node *anchor)
{
int n_rows = get_popcount(INDEX(anchor));
hamt_node *new_table = table_allocate(ator, n_rows);
if (new_table) {
memcpy(&new_table[0], &TABLE(anchor)[0], n_rows * sizeof(hamt_node));
}
return new_table;
}
The following subsections detail the implementations of search, insertion and
removal of key/value pairs in our HAMT implementation. Note that, while the
implementations shown here have been thoroughly tested and are deemed correct,
they may have been replaced by faster or more capable implementations in the
actual libhamt
source. An attempt is being made to keep this section up to
date with the actual implementation but the choice here is in favor of
conceptual clarity and will not necessarily cover every implementation detail.
PRs welcome.
key | key hash | binary key hash | 5-bit ints |
---|---|---|---|
"0" | d271c07f | 11 01001 00111 00011 10000 00011 11111 |
[ 31 3 16 3 7 9 ] |
"2" | 0129e217 | 00 00000 10010 10011 11000 10000 10111 |
[ 23 16 24 19 18 0 ] |
"4" | e131cc88 | 11 10000 10011 00011 10011 00100 01000 |
[ 8 4 19 3 19 16 ] |
"7" | 23ea8628 | 00 10001 11110 10101 00001 10001 01000 |
[ 8 17 1 21 30 17 ] |
"8" | bd920017 | 10 11110 11001 00100 00000 00000 10111 |
[ 23 0 0 4 25 30 ] |
Search plays a double role: finding a HAMT entry is a fundamental part of the
HAMT interface (exposed by hamt_get()
); and the first step in the insert and remove
functions is finding the anchors to operate on.
It is therefore desirable to approach the search implementation from a more generic perspective such that we do not need to re-invent the wheel for each of these use cases. We therefore define an internal search function
static ... search_recursive(...);
that is called from internal and the API functions alike. As the name implies, we implement search in a recursive manner (this is for clarity; conversion to an iterative solution is straightforward).
When we search for a key in the HAMT, there are two fundamental outcomes: the
key is either there, or it is not (note that these are exactly the semantics
of the user-facing hamt_get()
function: it either returns a pointer to the
value stored under the key or it returns NULL
). However, looking more
closely, searches can fail for two reasons: the search can be unsuccessful
because a key does not exist in the HAMT or it can be unsuccessful because
there is a key value pair that happens to have the same partial hash but a
different key (i.e. there is a hash collission or the hash has not been
sufficiently exhausted to differentiate between the two keys). And each of
these three cases is meaningful (the latter two corresponding directly to the
two different insertion strategies described below).
A good approach here is to define a ternary return value (as opposed to
the usual, binary use-NULL
-as-a-failure-indicator approach that is often
prevalent in C code) to allow us to signal each of these cases clearly.
We create a suitable three-value enum
called search_status
typedef enum {
SEARCH_SUCCESS,
SEARCH_FAIL_NOTFOUND,
SEARCH_FAIL_KEYMISMATCH
} search_status;
where SEARCH_SUCCESS
indicates that the key in question was
found, SEARCH_FAIL_NOTFOUND
indicates a search failure due to a missing key,
and SEARCH_FAIL_KEYMISMATCH
signals a hash conflict.
In order to return the result of a search (and not only its status), we introduce a search result data type that is a bit more heavy-weight:
struct search_result {
search_status status;
hamt_node *anchor;
hamt_node *value;
hash_state *hash;
};
Here, anchor
always points to the anchor at which the search was terminated;
if the search was successful, value
points to the table row that holds the
key/value pair with matching key; if it was unsuccessful with a key mismatch,
value
points to the mismatching key/value pair; and if it was unsuccessful
because the key did not exist, value
equals NULL
. Depending on the depth
that the search reached, we may have hit hash exhaustion and the hash may have
been recalculated, so we are returning this here, too.
Given struct search_result
, the return value of search_recursive()
becomes:
static struct search_result search_recursive(...)
{
// ...
}
With these prerequisites out of the way, we can tackle the actual search algorithm:
search_recursive(anchor, hash, eq, key, ...):
if the current 5-bit sub-hash is a valid index in the current table:
if the index refers to a key/value pair:
if the key matches the search key:
return SEARCH_SUCCESS
else:
return SEARCH_FAIL_KEYMISMATCH
else (i.e. it refers to a sub-table):
search_recursive(sub-table, hash_next(hash), eq, key)
else:
return SEARCH_FAIL_NOTFOUND
The basic idea is to start from the root of the HAMT and then, at every level, test if the curret sub-hash of the key is present in the current sub-trie. If not, bail and report failure immediately. If yes, check if the entry refers to a key/value pair or to another table. If this is true as well, check if the keys match and return success or failure accordingly. If the entry refers to a sub-table, repeat the search at the level of the sub-table.
With the conceptual approach lined out, let's get into the implementation
details.
We start with deriving the table index for the current search level from the
hash. This is accomplished using
hash_get_index()
, which encapsulates the bit-fiddling required to extract
the correct 5-bit hash for the current search level and returns the index as
an unsigned integer.
static search_result search_recursive(hamt_node *anchor,
hash_state *hash,
hamt_cmp_fn cmp_eq,
const void *key, ...)
{
uint32_t expected_index = hash_get_index(hash);
...
}
The code then checks if the expected_index
exists in the current table:
...
if (has_index(anchor, expected_index)) {
...
}
Here, has_index()
is a simple helper function that checks if
the INDEX(anchor)
bitfield has the bit set at expected_index
:
static inline bool has_expected_index(const hamt_node *anchor, size_t expected_index)
{
return INDEX(anchor) & (1 << expected_index);
}
If has_index()
evaluates to false, the key does not exist in the HAMT and we
can immediately fail the search and return the result:
{
uint32_t expected_index = hash_get_index(hash);
if (has_index(anchor, expected_index)) {
...
...
...
}
search_result result = {.status = SEARCH_FAIL_NOTFOUND,
.anchor = anchor,
.value = NULL,
.hash = hash};
return result;
}
If has_index()
evaluates to true, we find the array index using
get_pos()
(see above), store it into pos
and then acquire a pointer to the
next
node by addressing pos
indices into the anchor
's table.
{
...
if (has_index(anchor, expected_index)) {
/* If yes, get the compact index to address the array */
int pos = get_pos(expected_index, INDEX(anchor));
/* Index into the table */
hamt_node *next = &TABLE(anchor)[pos];
...
}
...
}
If the next
node is not a value, we advance the hash state and recurse the
search. If it is, we compare the keys and return success or failure
accordingly:
{
...
/* Index into the table */
hamt_node *next = &TABLE(anchor)[pos];
/* Are we looking at a value or another level of tables? */
if (is_value(VALUE(next))) {
if ((*cmp_eq)(key, KEY(next)) == 0) {
/* Found: keys match */
search_result result = {.status = SEARCH_SUCCESS,
.anchor = anchor,
.value = next,
.hash = hash};
return result;
}
/* Not found: same hash but different key */
search_result result = {.status = SEARCH_FAIL_KEYMISMATCH,
.anchor = anchor,
.value = next,
.hash = hash};
return result;
} else {
/* For table entries, recurse to the next level */
return search_recursive(next, hash_next(hash), cmp_eq, key);
}
That concludes the implementation of the recursive search function and the complete implementation looks like this:
static search_result search_recursive(hamt_node *anchor, hash_state *hash,
hamt_cmp_fn cmp_eq, const void *key)
{
/* Determine the expected index in table */
uint32_t expected_index = hash_get_index(hash);
/* Check if the expected index is set */
if (has_index(anchor, expected_index)) {
/* If yes, get the compact index to address the array */
int pos = get_pos(expected_index, INDEX(anchor));
/* Index into the table */
hamt_node *next = &TABLE(anchor)[pos];
/* Are we looking at a value or another level of tables? */
if (is_value(VALUE(next))) {
if ((*cmp_eq)(key, KEY(next)) == 0) {
/* Found: keys match */
search_result result = {.status = SEARCH_SUCCESS,
.anchor = anchor,
.value = next,
.hash = hash};
return result;
}
/* Not found: same hash but different key */
search_result result = {.status = SEARCH_FAIL_KEYMISMATCH,
.anchor = anchor,
.value = next,
.hash = hash};
return result;
} else {
/* For table entries, recurse to the next level */
return search_recursive(next, hash_next(hash), cmp_eq, key);
}
}
/* Not found: expected index is not set, key does not exist */
search_result result = {.status = SEARCH_FAIL_NOTFOUND,
.anchor = anchor,
.value = NULL,
.hash = hash};
return result;
}
The external API for search is hamt_get(trie, key)
which takes a trie
and attempts to find (and return) a key/value pair specified by key
. Its
implementation uses search_recursive()
from above:
const void *hamt_get(const struct hamt *trie, void *key)
{
hash_state *hash = &(hash_state){.key = key,
.hash_fn = trie->key_hash,
.hash = trie->key_hash(key, 0),
.depth = 0,
.shift = 0};
search_result sr = search_recursive(trie->root, hash, trie->key_cmp, key,
NULL, trie->ator);
if (sr.status == SEARCH_SUCCESS) {
return untagged(sr.VALUE(value));
}
return NULL;
}
In order to use search_recursive()
, it is necessary to set up the hash state
management, initializing it with the key
, the hashed key
, and starting
search from level 0
(corresponding to a shift of 0
). If the search is
not successful, the function returns NULL
, if it is successful, it passes
a void
pointer to the value that corresponds to key
. Note the untagging
of the value
field since we're using it as a tagged pointer to indicate
field types.
libhamt
does not support an explicit insertion function; all insertions into
the HAMT are upserts, i.e. after calling hamt_set()
the API guarantees
that the requested key/value pair exists, irrespective of potential previous
entries that may have had the same key but a different value.
The internal function that implements this behavior is set()
:
static const hamt_node *set(struct hamt *h, hamt_node *anchor, hamt_key_hash_fn hash_fn,
hamt_cmp_fn cmp_fn, void *key, void *value)
set()
takes a HAMT, an anchor in that HAMT, hashing and comparison
functions as well as a key/value pair. After initializing the hash state, the
function makes use of search_recursive
to find the specified key
. It deals
with three different search outcomes: (1) if the search is successful, the
value of key
gets replaced with the new value
; (2) if the search is
unsuccessful because the key does not exist, it attempts to insert a new
key/value pair at the appropriate position; and (3) if the search fails due to
a key mismatch (i.e. there is an entry at the expected hash position but its
key does not equal key
), it extends the hash trie until the new key/value
pair can be placed correctly. Cases (2) and (3) are covered by the
insert_kv()
and insert_table()
helper functions, respectively.
static const hamt_node *set(struct hamt *h, hamt_node *anchor, hamt_key_hash_fn hash_fn,
hamt_cmp_fn cmp_fn, void *key, void *value)
{
hash_state *hash = &(hash_state){.key = key,
.hash_fn = hash_fn,
.hash = hash_fn(key, 0),
.depth = 0,
.shift = 0};
search_result sr =
search_recursive(anchor, hash, cmp_fn, key, NULL, h->ator);
const hamt_node *inserted;
switch (sr.status) {
case SEARCH_SUCCESS:
sr.VALUE(value) = tagged(value);
inserted = sr.value;
break;
case SEARCH_FAIL_NOTFOUND:
if ((inserted = insert_kv(sr.anchor, sr.hash, key, value, h->ator)) !=
NULL) {
h->size += 1;
}
break;
case SEARCH_FAIL_KEYMISMATCH:
if ((inserted = insert_table(sr.value, sr.hash, key, value, h->ator)) !=
NULL) {
h->size += 1;
}
break;
}
return inserted;
}
If the call to search_recursive()
fails with SEARCH_FAIL_NOTFOUND
, we know
that there is a free row in the table of sr.anchor
. To insert the new
key
/value
pair, we calculate the position of the key
in the current
table: it extracts the 0-31 index position for the current key and stores it
into ix
, extends the existing INDEX(anchor)
index bitmap to include the
new key by setting the ix
-th bit, and then calculates the dense index
position of the new entry via get_pos()
. It then uses table_extend()
to
extend the table to the correct size and populates the key
and value
entries to reflect the new key/value pair. Note the pointer tagging on the
value field to mark it as a key/value row in the table (as opposed to a row
that points to a sub-table).
static const hamt_node *insert_kv(hamt_node *anchor, hash_state *hash,
void *key, void *value,
struct hamt_allocator *ator)
{
/* calculate position in new table */
uint32_t ix = hash_get_index(hash);
uint32_t new_index = INDEX(anchor) | (1 << ix);
int pos = get_pos(ix, new_index);
/* extend table */
size_t n_rows = get_popcount(INDEX(anchor));
anchor = table_extend(ator, anchor, n_rows, ix, pos);
if (!anchor)
return NULL;
hamt_node *new_table = TABLE(anchor);
/* set new k/v pair */
new_table[pos].as.kv.key = key;
new_table[pos].as.kv.value = tagged(value);
/* return a pointer to the inserted k/v pair */
return &new_table[pos];
}
When the call to search_recursive()
in set()
fails with
SEARCH_FAIL_KEYMISMATCH
, the situation is different: there is another entry
(either a key/value pair or a reference to a sub-table) in the HAMT that
currently occupies a transitionary trie location for key
. This is expected
to happen regularly: keys are always inserted with the shortest possible trie
path that resolves hashing conflicts between existing keys. As more and more
entries are added to the HAMT, these paths necessarily must increase in
length. This situation is handled by insert_table()
:
static const hamt_node *insert_table(hamt_node *anchor, hash_state *hash,
void *key, void *value,
struct hamt_allocator *ator)
{
/* Collect everything we know about the existing value */
hash_state *x_hash =
&(hash_state){.key = KEY(anchor),
.hash_fn = hash->hash_fn,
.hash = hash->hash_fn(KEY(anchor), hash->depth / 5),
.depth = hash->depth,
.shift = hash->shift};
void *x_value = VALUE(anchor); /* tagged (!) value ptr */
/* increase depth until the hashes diverge, building a list
* of tables along the way */
hash_state *next_hash = hash_next(hash);
hash_state *x_next_hash = hash_next(x_hash);
uint32_t next_index = hash_get_index(next_hash);
uint32_t x_next_index = hash_get_index(x_next_hash);
while (x_next_index == next_index) {
TABLE(anchor) = table_allocate(ator, 1);
INDEX(anchor) = (1 << next_index);
next_hash = hash_next(next_hash);
x_next_hash = hash_next(x_next_hash);
next_index = hash_get_index(next_hash);
x_next_index = hash_get_index(x_next_hash);
anchor = TABLE(anchor);
}
/* the hashes are different, let's allocate a table with two
* entries to store the existing and new values */
TABLE(anchor) = table_allocate(ator, 2);
INDEX(anchor) = (1 << next_index) | (1 << x_next_index);
/* determine the proper position in the allocated table */
int x_pos = get_pos(x_next_index, INDEX(anchor));
int pos = get_pos(next_index, INDEX(anchor));
/* fill in the existing value; no need to tag the value pointer
* since it is already tagged. */
TABLE(anchor)[x_pos].as.kv.key = (void *)x_hash->key;
TABLE(anchor)[x_pos].as.kv.value = x_value;
/* fill in the new key/value pair, tagging the pointer to the
* new value to mark it as a value ptr */
TABLE(anchor)[pos].as.kv.key = key;
TABLE(anchor)[pos].as.kv.value = tagged(value);
return &TABLE(anchor)[pos];
}
insert_table()
works in three stages: (1) it initiatlizes the hash_state
for the current anchor; (2) creates a series of single-entry tables until the
hashes of the current and new keys diverge; and (3) finally creates a new
table of size 2 that holds the old entry as well as the new key/value pair.
The implementation of the external API for inserting and updating values in the HAMT is straighforward:
const void *hamt_set(struct hamt *trie, void *key, void *value)
{
const hamt_node *n =
set(trie, trie->root, trie->key_hash, trie->key_cmp, key, value);
return VALUE(n);
}
hamt_set()
uses a vanilla call to the internal set()
function and returns
a pointer to the value of the new key.
For testing, hamt
uses a variant of John Brewer's minunit
testing
framework. Minunit is extremely minimalistic and its
header-only implementation easily fits on a single page:
// test/minunit.h
#ifndef MINUNIT_H
#define MINUNIT_H
#define MU_ASSERT(test, message) \
do { \
if (!(test)) \
return message; \
} while (0)
#define MU_RUN_TEST(test) \
do { \
char *message = test(); \
mu_tests_run++; \
if (message) \
return message; \
} while (0)
#define MU_TEST_CASE(name) static char *name()
#define MU_TEST_SUITE(name) static char *name()
extern int mu_tests_run;
#endif /* !MINUNIT_H */
With minunit
, every unit test is a MU_TEST_CASE
We use MU_ASSERT
to test
the test invariants. Test cases are grouped into MU_TEST_SUITE
s as
sequential calls to MU_RUN_TEST
. When an assertion fails, the return
statement in MU_ASSERT
short-circuts test execution and returns a non-null
pointer to the respective message
(generally a static string). This, in turn,
causes MU_RUN_TEST
to issue a return
call with the string pointer,
short-circuting the remaining test suite. The header also declares a global
variable mu_tests_run
that keeps track of the total number of executed
tests.
The following listing illustrates the basic structure of unit test
implementations with minunit
, check the actual tests for
a full listing.
// test/test_hamt.c
#include "minunit.h"
#include "../src/hamt.c"
int mu_tests_run = 0;
MU_TEST_CASE(test_dummy)
{
/* do something here */
MU_ASSERT(0 == 0, "Oops X-{");
return 0;
}
MU_TEST_SUITE(test_suite)
{
/* Add tests here */
MU_RUN_TEST(test_dummy);
/*
* ... many more ...
*/
return 0;
}
int main()
{
printf("---=[ Hash array mapped trie tests\n");
char *result = test_suite();
if (result != 0) {
printf("%s\n", result);
} else {
printf("All tests passed.\n");
}
printf("Tests run: %d\n", tests_run);
return result != 0;
}
Note that the test setup include
s the hamt.c
implementation file. This is a
common trick used in unit testing to gain easy access to testing static
functions that would otherwise be inaccessible since they are local to the
hamt.c
compilation unit. This requires some care in
the Makefile setup in order to avoid symbol duplication.
[1]
The std::unordered_*
methods implement open hashing (aka separate chaining),
with the hash table being an array of buckets, each pointing to the head of a
linked list. This is a deliberate and reasonable compromise for general use;
gaining an order of magnitude of speed improvements for specialized use cases
(e.g. append-only, guaranteed high-quality hash functions) is possible. See
this stackoverflow post for a summary of the standard
proposal.
↩
[2]
musl
provides a hsearch
implementation that uses closed hashing with
quadratic probing for conflict resolution. The
documentation states that they use powers of two for
table sizing which seems wrong due to the impact on the modulo (table sizes
should ideally be prime). The GLib GHashTable
has surprisingly little
documentation in its implementation details but appears to be
using a separate chaining approach similar to the C++
solution.
↩
[3] Python's dict
implementation uses
closed hashing (aka open addressing) with pseudo-random probing to mitigate
the poor hashing properties of standard python hash()
function for some data
types (from here). Python keeps the load factor below
0.66; this avoids gradual performance degradation associated w/ high load
factors in closed hashing but comes at increased memory footprint. The
codebase was refactored to split the actual data from the
hash table in 3.6, resulting in better memory efficiency and GC friendliness
(see here and here).
↩
[4] Java provides Hashtable<K,V>
and
HashMap<K,V>
, both of which implement Map
and Collection
interfaces; in
addition, Hashtable
is synchronized. The HashSet
type internally uses a
HashMap
. Hashtable
and HashMap
implement open hashing
(separate chaining) with a default load factor of 0.75; The OpenJDK
implementation of HashMap
converts
between linked list and tree representations in the hash buckets, depending on
bucket size, see the source.
↩
[5] There are alternative approaches to enable (somewhat) typesafe templating in C, mainly by implementing what basically amounts to virtual method tables using the C preprocessor. See e.g. here for a useful stackoverflow summary or here for a more in-depth treatise. ↩