Rate Limiting & Throttling
Algorithms in depth: token bucket, leaky bucket, fixed window, sliding window log, sliding window counter. Distributed rate limiting across multiple servers.
Why Rate Limiting Matters
Rate limiting is a foundational defense mechanism that protects your system from abuse, overload, and resource exhaustion. Without it, a single misbehaving client (whether intentional or due to a bug) can degrade service for everyone. Rate limiting serves multiple purposes: preventing API abuse, enforcing fair usage policies, mitigating brute-force attacks, and protecting downstream services from being overwhelmed. It is one of the highest-value, highest-frequency system design topics — know the algorithms cold.
Rate Limiting Algorithms
1. Token Bucket
The Token Bucket is the most widely used algorithm (used by AWS API Gateway, Stripe, and many others). Tokens accumulate in a bucket at a fixed refill rate up to a maximum capacity. Each request consumes one token. If the bucket has tokens, the request proceeds; otherwise it is rejected. This allows controlled bursting: a client that has been idle accumulates tokens and can then burst up to the bucket capacity.
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # Max tokens (burst size)
        self.refill_rate = refill_rate  # Tokens added per second
        self.tokens = capacity          # Start full
        self.last_refill = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        # Refill tokens based on elapsed time
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # Request allowed
        return False      # Rate limited

# Example: 100 req/s steady state, bursts up to 200
limiter = TokenBucket(capacity=200, refill_rate=100)
```
2. Leaky Bucket
The Leaky Bucket (as a queue) enforces a strictly constant output rate regardless of input bursts. Incoming requests enter a fixed-size queue. A processor drains the queue at a constant rate. If the queue is full, new requests are dropped. This is a traffic shaper — ideal for protecting downstream services that cannot handle any bursts (e.g., a payment processor).
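The queue-based variant can be sketched in a few lines. This is a minimal in-process sketch (class and method names are my own), using a bounded deque as the queue and a lazy drain on each call rather than a background processor:

```python
import time
from collections import deque

class LeakyBucket:
    """Queue-based leaky bucket: admits requests into a bounded queue
    and drains them at a constant rate. Minimal sketch, not production code."""
    def __init__(self, queue_size: int, drain_rate: float):
        self.queue = deque()            # Pending requests, oldest first
        self.queue_size = queue_size
        self.drain_rate = drain_rate    # Requests processed per second
        self.last_drain = time.time()

    def allow_request(self) -> bool:
        self._drain()
        if len(self.queue) < self.queue_size:
            self.queue.append(time.time())  # Enqueue for processing
            return True
        return False  # Queue full: drop the request

    def _drain(self):
        # Remove as many queued requests as the constant rate permits
        now = time.time()
        leaked = int((now - self.last_drain) * self.drain_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
```

In a real traffic shaper the drain would be a worker that actually executes the queued requests; here it only models the constant output rate.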
3. Fixed Window Counter
Fixed Window divides time into equal-sized windows (e.g., 1-minute intervals). A counter is incremented for each request. When the counter exceeds the limit, requests are rejected until the next window starts. It is extremely simple (a single Redis `INCR` with `EXPIRE`) but has the boundary burst problem: a client can send 2x the limit in 60 seconds by sending requests at the end of one window and the beginning of the next.
4. Sliding Window Log
Sliding Window Log stores the timestamp of every request in a sorted set. To check a request: remove timestamps older than `now - window_size`, count remaining timestamps, and reject if the count exceeds the limit. This is perfectly accurate (no boundary burst problem) but stores one timestamp per request, making it memory-intensive for high-volume APIs.
5. Sliding Window Counter
Sliding Window Counter is the best practical compromise. It uses two fixed-window counters (current and previous window) and estimates the request count for the rolling window by interpolating: `count = prev_window_count * overlap_ratio + current_window_count`, where `overlap_ratio` is the fraction of the sliding window that still overlaps the previous fixed window. This approximation is accurate to within ~0.003% for most traffic patterns and requires only two integers per user.
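The interpolation can be sketched in-process before moving it to Redis (the distributed Lua version later in this section applies the same formula). Class and parameter names here are my own:

```python
import time
from typing import Optional

class SlidingWindowCounter:
    """Two counters plus interpolation; approximate but O(1) memory."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.curr_start = 0   # Start of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def allow_request(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        if window_start != self.curr_start:
            # Roll windows forward; if more than one window passed, prev is 0
            self.prev_count = (
                self.curr_count
                if window_start - self.curr_start == self.window else 0
            )
            self.curr_start = window_start
            self.curr_count = 0
        # Weight the previous window by how much of it still overlaps
        overlap = (self.window - (now - window_start)) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```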
| Algorithm | Memory | Accuracy | Allows Bursts | Implementation Complexity |
|---|---|---|---|---|
| Token Bucket | O(1) per user | High | Yes (up to capacity) | Low |
| Leaky Bucket | O(queue_size) | High | No (traffic shaped) | Medium |
| Fixed Window Counter | O(1) per user | Medium (boundary burst) | Yes (2x at boundary) | Very Low |
| Sliding Window Log | O(limit) per user | Exact | No | Medium |
| Sliding Window Counter | O(1) per user | High (~0.003% error) | Yes (controlled) | Low |
Distributed Rate Limiting
When your API is served by multiple servers behind a load balancer, each server cannot maintain its own local counter — a client routed to different servers could exceed the limit by a factor equal to the number of servers. You need a shared, atomic counter that all servers update. Redis is the standard solution due to its atomic operations (`INCR`, Lua scripts) and sub-millisecond latency.
```lua
-- Redis Lua script for atomic sliding-window-counter rate limiting
-- Called with: EVAL script 1 user:123 1000 60 current_time
local key = KEYS[1]                -- e.g., "ratelimit:user:123"
local limit = tonumber(ARGV[1])    -- e.g., 1000 requests
local window = tonumber(ARGV[2])   -- e.g., 60 seconds
local now = tonumber(ARGV[3])      -- current Unix timestamp

local current_window = math.floor(now / window) * window
local prev_window = current_window - window
local curr_key = key .. ":" .. current_window
local prev_key = key .. ":" .. prev_window

local curr_count = tonumber(redis.call("GET", curr_key) or "0")
local prev_count = tonumber(redis.call("GET", prev_key) or "0")

-- Interpolate: weight previous window by how much of current window remains
local elapsed_in_window = now - current_window
local prev_weight = (window - elapsed_in_window) / window
local estimated_count = math.floor(prev_count * prev_weight) + curr_count

if estimated_count >= limit then
  return 0  -- Rate limited
end

-- Increment current window counter
redis.call("INCR", curr_key)
redis.call("EXPIRE", curr_key, window * 2)
return 1  -- Allowed
```
Rate Limiting Dimensions
- By IP address: Protects unauthenticated endpoints. Easy to implement, but can penalize users behind NAT sharing a single IP.
- By user ID: More accurate for authenticated APIs. Requires extracting user ID from JWT or session.
- By API key: Standard for public APIs (e.g., Stripe, Twilio). Each key has its own limit, enabling tiered plans.
- By endpoint: Some endpoints (e.g., SMS sending) warrant stricter limits than others (e.g., read operations).
- By geographic region: Rate limit separately per data center to prevent one region's traffic from consuming global quota.
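In practice these dimensions are combined into the Redis key itself. A hypothetical sketch of one key-building scheme (the format and precedence order are assumptions, not a standard):

```python
from typing import Optional

def rate_limit_key(user_id: Optional[str], api_key: Optional[str],
                   ip: str, endpoint: str) -> str:
    """Pick the most specific identity available, then scope by endpoint.
    The key format here is hypothetical; any stable scheme works."""
    if api_key:
        identity = f"key:{api_key}"
    elif user_id:
        identity = f"user:{user_id}"
    else:
        identity = f"ip:{ip}"  # Fallback for unauthenticated traffic
    return f"ratelimit:{identity}:{endpoint}"
```

Because each distinct key gets its own counters, per-endpoint and per-tier limits fall out of the key scheme rather than requiring separate limiter code.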
Interview Tip
When asked to design a rate limiter, walk through: (1) what algorithm you choose and why (Token Bucket for flexibility, Sliding Window Counter for accuracy with low memory), (2) where state is stored (Redis with atomic Lua scripts), (3) what the rate limit key is (user ID, API key, IP), (4) how you handle Redis failure (fail open vs fail closed — most systems fail open to avoid blocking legitimate traffic during an outage), and (5) how you surface rate limit info to clients (standard `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `Retry-After` headers).
Practice this pattern
Design a distributed rate limiter for a global API