This article explores the challenges of building a distributed rate limiter, particularly during network partitions. It highlights how a naive centralized Redis-based limiter can fail due to prioritizing consistency over availability, leading to service outages. The author proposes an AP (Availability and Partition Tolerance) approach using local token buckets and a gossip protocol for eventual consistency, ensuring the service remains available even during network disruptions.
Read original on Dev.to #architectureThe article begins with a common scenario: a microservice powering a flash-sale site experiencing traffic spikes. An initial attempt at rate limiting using a single Redis counter failed catastrophically during a network hiccup. The Redis node became unreachable, leading the rate limiter to deny all requests, effectively causing a total service outage. This illustrates the critical problem with highly consistent, centralized components in distributed systems when facing network partitions.
package main
import (
"github.com/go-redis/redis/v8"
"time"
)
var rdb = redis.NewClient(&redis.Options{Addr: "redis:6379"})
func allow(userID string) bool {
key := fmt.Sprintf("rl:%s", userID)
// INCR and EXPIRE are a single atomic script (Lua)
script := redis.NewScript(` local val = redis.call('INCR', KEYS[1]) if val == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end return val `)
count, _ := script.Run(ctx, rdb, []string{key}, strconv.Itoa(limit)).Int()
return count <= limit
}This Go example demonstrates a naive Redis-backed rate limiter. The issue here is that the `INCR` operation blocks until Redis responds. If Redis is unreachable due to a network partition, the operation times out, leading the limiter to deny requests, thus sacrificing Availability (A) in favor of Consistency (C) and Partition Tolerance (P) (CP). While correctness is maintained, the service becomes unusable.
The core insight is that CAP theorem is not a one-time choice but a continuous lens for system design. For most public-facing APIs, an occasional overflow of requests during a partition is less damaging than a complete service outage. Therefore, the optimal choice for a distributed rate limiter is often to prioritize Availability (A) and Partition Tolerance (P), accepting eventual consistency for the rate counters.
CAP Theorem for Rate Limiters
For critical user-facing services, preferring Availability (A) and Partition Tolerance (P) (AP) over strong Consistency (C) for rate limiting ensures the service remains operational during network partitions, even if it allows a few extra requests. A total outage caused by a strict CP limiter is often far more detrimental than a slight overage of requests.
The proposed solution leverages local token buckets on each service instance combined with a lightweight gossip protocol for synchronization. Each instance maintains its own token bucket, allowing it to respond immediately to requests without blocking on remote calls. This inherently provides high availability locally. To achieve eventual consistency and overall rate limiting, the state of these local buckets is periodically 'gossiped' between instances via a push-pull mechanism.
type Bucket struct {
Capacity int64
Tokens int64
LastRefill time.Time
Rate time.Duration // refill interval per token
mu sync.Mutex
}
func NewBucket(capacity int64, rate time.Duration) *Bucket {
return &Bucket{
Capacity: capacity,
Tokens: capacity,
LastRefill: time.Now(),
Rate: rate,
}
}
func (b *Bucket) Allow() bool {
b.mu.Lock()
defer b.mu.Unlock()
now := time.Now()
// refill based on elapsed time
elapsed := now.Sub(b.LastRefill)
add := int64(elapsed / b.Rate)
if add > 0 {
b.Tokens += add
if b.Tokens > b.Capacity {
b.Tokens = b.Capacity
}
b.LastRefill = now
}
if b.Tokens > 0 {
b.Tokens--
return true
}
return false
}