This article provides a comprehensive explanation of consistent hashing, a fundamental distributed systems technique. It details how consistent hashing minimizes data remapping when nodes are added or removed, contrasting it with less scalable partitioning methods. The core mechanism involves mapping both nodes and data keys onto a virtual hash ring to achieve better load distribution and fault tolerance in dynamic environments.
Caching is crucial for improving latency and reducing load on origin servers in high-traffic applications. To handle dynamic loads and scale horizontally, cache servers must distribute data across multiple nodes. This distribution, known as data partitioning or sharding, aims to maximize throughput and fault tolerance by preventing a single cache server from becoming a bottleneck. The article first outlines various partitioning techniques, highlighting their limitations before introducing consistent hashing.
Before delving into consistent hashing, the article discusses several simpler partitioning strategies and their drawbacks. These include random assignment, single global cache, key range partitioning, and static hash partitioning. Each method suffers from issues like poor data retrieval efficiency, lack of scalability, or inefficient data rebalancing when the number of nodes changes. Static hash partitioning, for instance, requires rehashing and massive data movement across all nodes upon a server addition or removal, leading to significant cache misses and potential overload on the origin server.
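The rebalancing problem with static hash partitioning can be seen in a few lines. The sketch below (node counts and key names are illustrative, not from the article) measures how many keys land on a different node when a cluster grows from 4 to 5 servers under simple modulo hashing:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    # Static hash partitioning: map a key to a node index via hash(key) mod N.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]
# Count keys whose owner changes when N goes from 4 to 5.
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # typically around 80%
```

Because the modulus changes, most keys hash to a different node, which is exactly the mass cache-miss scenario the article describes.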
Consistent hashing is a distributed systems technique designed to minimize the number of keys remapped when nodes are added or removed from a system. It achieves this by mapping both node identifiers (e.g., IP addresses) and data keys onto a virtual circular space called a 'hash ring' using the same hash function (e.g., MD5).
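A minimal hash ring can be built from this description. The sketch below uses MD5 as the article suggests; the node addresses and the bisect-based clockwise lookup are illustrative choices, not a specific library's API:

```python
import bisect
import hashlib

def h(value: str) -> int:
    # Hash any string (node ID or data key) to a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # The ring is a sorted list of (position, node) pairs.
        self._ring = sorted((h(n), n) for n in nodes)
        self._positions = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect_right(self._positions, h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.lookup("user:42"))  # one of the three node addresses
```

Because nodes and keys share one hash function, a key's owner is determined purely by ring position, with no central directory to consult.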
The primary advantage of consistent hashing is its efficiency in handling dynamic node changes. When a node fails, only the data objects mapped to that node's segment of the ring are reassigned to its immediate clockwise neighbor. Similarly, when a new node is added, it takes over a portion of keys from its clockwise neighbor. In either case, only about k/N of the k keys need to be remapped on average, where N is the number of nodes. This is a dramatic improvement over static hashing, where any change in N forces nearly all k keys to be rehashed and moved.
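This locality of remapping can be verified directly. The sketch below (addresses and helper names are illustrative) adds a fifth node to a four-node ring and checks which keys change owner; every remapped key must belong to the new node, since no other segment of the ring moves:

```python
import bisect
import hashlib

def h(v: str) -> int:
    return int(hashlib.md5(v.encode()).hexdigest(), 16)

def owner(ring, key):
    # Clockwise lookup: first node at or after the key's position, with wraparound.
    positions = [p for p, _ in ring]
    i = bisect.bisect_right(positions, h(key)) % len(ring)
    return ring[i][1]

nodes = [f"10.0.0.{i}" for i in range(1, 5)]                 # 4 nodes
ring_before = sorted((h(n), n) for n in nodes)
ring_after = sorted(ring_before + [(h("10.0.0.5"), "10.0.0.5")])

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(owner(ring_before, k) != owner(ring_after, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")
```

With a single point per node the exact fraction varies run to run of the key set, but only the keys falling on the new node's arc move; the rest of the ring is untouched.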
A potential issue with basic consistent hashing is non-uniform distribution of nodes on the hash ring, which can lead to 'hotspots' where a few nodes receive a disproportionately large share of traffic. To mitigate this, the concept of 'virtual nodes' is introduced. Instead of mapping each physical node to a single point on the ring, each node is mapped to multiple positions by hashing its ID through distinct hash functions or appending different suffixes. This ensures a more uniform distribution of keys across physical nodes, improving load balancing and preventing cascading failures due to overloaded servers.
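Virtual nodes can be sketched by hashing each physical node at many ring positions via a replica suffix. The suffix scheme ("#0", "#1", ...) and replica count below are illustrative choices; real systems tune these per deployment:

```python
import bisect
import hashlib
from collections import Counter

def h(v: str) -> int:
    return int(hashlib.md5(v.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=200):
    # Place each physical node at `vnodes` positions by hashing suffixed IDs.
    ring = []
    for node in nodes:
        for r in range(vnodes):
            ring.append((h(f"{node}#{r}"), node))
    return sorted(ring)

def owner(ring, key):
    positions = [p for p, _ in ring]
    i = bisect.bisect_right(positions, h(key)) % len(ring)
    return ring[i][1]

ring = build_ring(["a", "b", "c"])
# Tally which physical node each key lands on.
counts = Counter(owner(ring, f"user:{i}") for i in range(30_000))
print(counts)  # load is close to even across a, b, c
```

More virtual nodes per server smooths the distribution further, at the cost of a larger ring to store and search.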