This article delves into the complexities of implementing an intrusive hash table, highlighting the challenges of managing data structures without embedding data. It discusses the necessity of tracking multiple tables during incremental resizing and the architectural decisions involved in separating node and table-level functionalities. The author also touches upon the choice of hash functions, favoring MurmurHash for better distribution over FNV.
Read original on Dev.to #systemdesignAn intrusive data structure offers a way to organize data without embedding the data directly within the structure itself. Instead, the data structure holds pointers or references to the data, and the data itself contains metadata (like pointers for linked lists or hash table entries) that allows it to participate in the structure. This approach can be beneficial for memory efficiency, avoiding unnecessary copies, and managing complex object lifecycles, especially in performance-critical applications or when integrating with existing C-style data without modification.
The implementation of an intrusive hash table, particularly with features like incremental resizing, presents significant architectural hurdles. A key challenge is managing multiple tables concurrently during the rehashing process. While resizing, new entries go into the new, larger table, but existing entries remain in the old table until they are gradually migrated. This necessitates:
Why Incremental Resizing?
Incremental resizing avoids the performance spike associated with reallocating and rehashing an entire table at once. Instead, rehashing is spread out over multiple operations, leading to more predictable performance, which is crucial for real-time or low-latency systems. However, as the article notes, this complexity is often only justified when strictly necessary.
The choice of hash function is critical for hash table performance. For non-cryptographic hash tables, the primary goals are fast computation and good distribution to minimize collisions and maintain O(1) average-case lookup times. The article notes that while FNV is a common choice, MurmurHash often provides better distribution, which can be beneficial for minimizing chain lengths in collision resolution strategies.
/* Example conceptual structure for an intrusive node */
struct MyDataNode {
// Application data fields
int value;
std::string key;
// Intrusive hash table metadata
MyDataNode* next_in_bucket;
size_t hash_code;
};
/* Concept of container_of() */
#define container_of(ptr, type, member) \
((type *)((char *)(ptr) - offsetof(type, member)))