This article provides a comprehensive explanation of consistent hashing, a fundamental distributed systems technique. It details how consistent hashing minimizes data remapping when nodes are added or removed, contrasting it with less scalable partitioning methods. The core mechanism involves mapping both nodes and data keys onto a virtual hash ring to achieve better load distribution and fault tolerance in dynamic environments.
Caching is crucial for improving latency and reducing load on origin servers in high-traffic applications. To handle dynamic loads and scale horizontally, cache servers must distribute data across multiple nodes. This distribution, known as data partitioning or sharding, aims to maximize throughput and fault tolerance by preventing a single cache server from becoming a bottleneck. The article first outlines various partitioning techniques, highlighting their limitations before introducing consistent hashing.
Before delving into consistent hashing, the article discusses several simpler partitioning strategies and their drawbacks. These include random assignment, single global cache, key range partitioning, and static hash partitioning. Each method suffers from issues like poor data retrieval efficiency, lack of scalability, or inefficient data rebalancing when the number of nodes changes. Static hash partitioning, for instance, requires rehashing and massive data movement across all nodes upon a server addition or removal, leading to significant cache misses and potential overload on the origin server.
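The rebalancing problem with static hash partitioning can be seen in a few lines. The sketch below (node counts and key names are illustrative, not from the article) measures how many keys land on a different node when a cluster grows from 4 to 5 servers under simple modulo hashing:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    # Static hash partitioning: map a key to a node index via hash(key) mod N.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]
# Count keys whose owner changes when N goes from 4 to 5.
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # typically around 80%
```

Because the modulus changes, most keys hash to a different node, which is exactly the mass cache-miss scenario the article describes.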
Consistent hashing is a distributed systems technique designed to minimize the number of keys remapped when nodes are added or removed from a system. It achieves this by mapping both node identifiers (e.g., IP addresses) and data keys onto a virtual circular space called a 'hash ring' using the same hash function (e.g., MD5).
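A minimal hash ring can be built from this description. The sketch below uses MD5 as the article suggests; the node addresses and the bisect-based clockwise lookup are illustrative choices, not a specific library's API:

```python
import bisect
import hashlib

def h(value: str) -> int:
    # Hash any string (node ID or data key) to a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # The ring is a sorted list of (position, node) pairs.
        self._ring = sorted((h(n), n) for n in nodes)
        self._positions = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect_right(self._positions, h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.lookup("user:42"))  # one of the three node addresses
```

Because nodes and keys share one hash function, a key's owner is determined purely by ring position, with no central directory to consult.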
The primary advantage of consistent hashing is its efficiency in handling dynamic node changes. When a node fails, only the data objects mapped to that node's segment of the ring are reassigned to its immediate clockwise neighbor. Similarly, when a new node is added, it takes over a portion of keys from its clockwise neighbor. In either case, only about k/N of the k keys need to be remapped on average, where N is the number of nodes. This is a dramatic improvement over static hashing, where any change in N forces nearly all k keys to be rehashed and moved.
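This locality of remapping can be verified directly. The sketch below (addresses and helper names are illustrative) adds a fifth node to a four-node ring and checks which keys change owner; every remapped key must belong to the new node, since no other segment of the ring moves:

```python
import bisect
import hashlib

def h(v: str) -> int:
    return int(hashlib.md5(v.encode()).hexdigest(), 16)

def owner(ring, key):
    # Clockwise lookup: first node at or after the key's position, with wraparound.
    positions = [p for p, _ in ring]
    i = bisect.bisect_right(positions, h(key)) % len(ring)
    return ring[i][1]

nodes = [f"10.0.0.{i}" for i in range(1, 5)]                 # 4 nodes
ring_before = sorted((h(n), n) for n in nodes)
ring_after = sorted(ring_before + [(h("10.0.0.5"), "10.0.0.5")])

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(owner(ring_before, k) != owner(ring_after, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")
```

With a single point per node the exact fraction varies run to run of the key set, but only the keys falling on the new node's arc move; the rest of the ring is untouched.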
A potential issue with basic consistent hashing is non-uniform distribution of nodes on the hash ring, which can lead to 'hotspots' where a few nodes receive a disproportionately large share of traffic. To mitigate this, the concept of 'virtual nodes' is introduced. Instead of mapping each physical node to a single point on the ring, each node is mapped to multiple positions by hashing its ID through distinct hash functions or appending different suffixes. This ensures a more uniform distribution of keys across physical nodes, improving load balancing and preventing cascading failures due to overloaded servers.
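Virtual nodes can be sketched by hashing each physical node at many ring positions via a replica suffix. The suffix scheme ("#0", "#1", ...) and replica count below are illustrative choices; real systems tune these per deployment:

```python
import bisect
import hashlib
from collections import Counter

def h(v: str) -> int:
    return int(hashlib.md5(v.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=200):
    # Place each physical node at `vnodes` positions by hashing suffixed IDs.
    ring = []
    for node in nodes:
        for r in range(vnodes):
            ring.append((h(f"{node}#{r}"), node))
    return sorted(ring)

def owner(ring, key):
    positions = [p for p, _ in ring]
    i = bisect.bisect_right(positions, h(key)) % len(ring)
    return ring[i][1]

ring = build_ring(["a", "b", "c"])
# Tally which physical node each key lands on.
counts = Counter(owner(ring, f"user:{i}") for i in range(30_000))
print(counts)  # load is close to even across a, b, c
```

More virtual nodes per server smooths the distribution further, at the cost of a larger ring to store and search.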