Rate Limiting & Throttling
Algorithms in depth: token bucket, leaky bucket, fixed window, sliding window log, sliding window counter. Distributed rate limiting across multiple servers.
Why Rate Limiting Matters
Rate limiting is a foundational defense mechanism that protects your system from abuse, overload, and resource exhaustion. Without it, a single misbehaving client (whether intentional or due to a bug) can degrade service for everyone. Rate limiting serves multiple purposes: preventing API abuse, enforcing fair usage policies, mitigating brute-force attacks, and protecting downstream services from being overwhelmed. It is one of the highest-value, highest-frequency system design topics — know the algorithms cold.
Rate Limiting Algorithms
1. Token Bucket
The Token Bucket is the most widely used algorithm (used by AWS API Gateway, Stripe, and many others). Tokens accumulate in a bucket at a fixed refill rate up to a maximum capacity. Each request consumes one token. If the bucket has tokens, the request proceeds; otherwise it is rejected. This allows controlled bursting: a client that has been idle accumulates tokens and can then burst up to the bucket capacity.
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # Max tokens (burst size)
        self.refill_rate = refill_rate  # Tokens added per second
        self.tokens = capacity          # Start full
        self.last_refill = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        # Refill tokens based on elapsed time
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # Request allowed
        return False      # Rate limited

# Example: 100 req/s steady state, bursts up to 200
limiter = TokenBucket(capacity=200, refill_rate=100)
```
2. Leaky Bucket
The Leaky Bucket (as a queue) enforces a strictly constant output rate regardless of input bursts. Incoming requests enter a fixed-size queue. A processor drains the queue at a constant rate. If the queue is full, new requests are dropped. This is a traffic shaper — ideal for protecting downstream services that cannot handle any bursts (e.g., a payment processor).
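The queue-based variant can be sketched in a few lines. This is a minimal in-process sketch (class and method names are my own), using a bounded deque as the queue and a lazy drain on each call rather than a background processor:

```python
import time
from collections import deque

class LeakyBucket:
    """Queue-based leaky bucket: admits requests into a bounded queue
    and drains them at a constant rate. Minimal sketch, not production code."""
    def __init__(self, queue_size: int, drain_rate: float):
        self.queue = deque()            # Pending requests, oldest first
        self.queue_size = queue_size
        self.drain_rate = drain_rate    # Requests processed per second
        self.last_drain = time.time()

    def allow_request(self) -> bool:
        self._drain()
        if len(self.queue) < self.queue_size:
            self.queue.append(time.time())  # Enqueue for processing
            return True
        return False  # Queue full: drop the request

    def _drain(self):
        # Remove as many queued requests as the constant rate permits
        now = time.time()
        leaked = int((now - self.last_drain) * self.drain_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
```

In a real traffic shaper the drain would be a worker that actually executes the queued requests; here it only models the constant output rate.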
3. Fixed Window Counter
Fixed Window divides time into equal-sized windows (e.g., 1-minute intervals). A counter is incremented for each request. When the counter exceeds the limit, requests are rejected until the next window starts. It is extremely simple (a single Redis `INCR` with `EXPIRE`) but has the boundary burst problem: a client can send 2x the limit in 60 seconds by sending requests at the end of one window and the beginning of the next.
4. Sliding Window Log
Sliding Window Log stores the timestamp of every request in a sorted set. To check a request: remove timestamps older than `now - window_size`, count remaining timestamps, and reject if the count exceeds the limit. This is perfectly accurate (no boundary burst problem) but stores one timestamp per request, making it memory-intensive for high-volume APIs.
5. Sliding Window Counter
Sliding Window Counter is the best practical compromise. It uses two fixed-window counters (current and previous window) and estimates the request count for the rolling window by interpolating: `count = prev_window_count * overlap_ratio + current_window_count`, where `overlap_ratio` is the fraction of the sliding window that still overlaps the previous fixed window. This approximation is accurate to within ~0.003% for most traffic patterns and requires only two integers per user.
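The interpolation can be sketched in-process before moving it to Redis (the distributed Lua version later in this section applies the same formula). Class and parameter names here are my own:

```python
import time
from typing import Optional

class SlidingWindowCounter:
    """Two counters plus interpolation; approximate but O(1) memory."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.curr_start = 0   # Start of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def allow_request(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        if window_start != self.curr_start:
            # Roll windows forward; if more than one window passed, prev is 0
            self.prev_count = (
                self.curr_count
                if window_start - self.curr_start == self.window else 0
            )
            self.curr_start = window_start
            self.curr_count = 0
        # Weight the previous window by how much of it still overlaps
        overlap = (self.window - (now - window_start)) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```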
| Algorithm | Memory | Accuracy | Allows Bursts | Implementation Complexity |
|---|---|---|---|---|
| Token Bucket | O(1) per user | High | Yes (up to capacity) | Low |
| Leaky Bucket | O(queue_size) | High | No (traffic shaped) | Medium |
| Fixed Window Counter | O(1) per user | Medium (boundary burst) | Yes (2x at boundary) | Very Low |
| Sliding Window Log | O(limit) per user | Exact | No | Medium |
| Sliding Window Counter | O(1) per user | High (~0.003% error) | Yes (controlled) | Low |
Distributed Rate Limiting
When your API is served by multiple servers behind a load balancer, each server cannot maintain its own local counter — a client routed to different servers could exceed the limit by a factor equal to the number of servers. You need a shared, atomic counter that all servers update. Redis is the standard solution due to its atomic operations (`INCR`, Lua scripts) and sub-millisecond latency.
```lua
-- Redis Lua script for atomic sliding-window-counter rate limiting
-- Called with: EVAL script 1 user:123 1000 60 current_time
local key = KEYS[1]                -- e.g., "ratelimit:user:123"
local limit = tonumber(ARGV[1])    -- e.g., 1000 requests
local window = tonumber(ARGV[2])   -- e.g., 60 seconds
local now = tonumber(ARGV[3])      -- current Unix timestamp

local current_window = math.floor(now / window) * window
local prev_window = current_window - window
local curr_key = key .. ":" .. current_window
local prev_key = key .. ":" .. prev_window

local curr_count = tonumber(redis.call("GET", curr_key) or "0")
local prev_count = tonumber(redis.call("GET", prev_key) or "0")

-- Interpolate: weight previous window by how much of current window remains
local elapsed_in_window = now - current_window
local prev_weight = (window - elapsed_in_window) / window
local estimated_count = math.floor(prev_count * prev_weight) + curr_count

if estimated_count >= limit then
  return 0  -- Rate limited
end

-- Increment current window counter
redis.call("INCR", curr_key)
redis.call("EXPIRE", curr_key, window * 2)
return 1  -- Allowed
```
Rate Limiting Dimensions
- By IP address: Protects unauthenticated endpoints. Easy to implement, but can penalize users behind NAT sharing a single IP.
- By user ID: More accurate for authenticated APIs. Requires extracting user ID from JWT or session.
- By API key: Standard for public APIs (e.g., Stripe, Twilio). Each key has its own limit, enabling tiered plans.
- By endpoint: Some endpoints (e.g., SMS sending) warrant stricter limits than others (e.g., read operations).
- By geographic region: Rate limit separately per data center to prevent one region's traffic from consuming global quota.
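In practice these dimensions are combined into the Redis key itself. A hypothetical sketch of one key-building scheme (the format and precedence order are assumptions, not a standard):

```python
from typing import Optional

def rate_limit_key(user_id: Optional[str], api_key: Optional[str],
                   ip: str, endpoint: str) -> str:
    """Pick the most specific identity available, then scope by endpoint.
    The key format here is hypothetical; any stable scheme works."""
    if api_key:
        identity = f"key:{api_key}"
    elif user_id:
        identity = f"user:{user_id}"
    else:
        identity = f"ip:{ip}"  # Fallback for unauthenticated traffic
    return f"ratelimit:{identity}:{endpoint}"
```

Because each distinct key gets its own counters, per-endpoint and per-tier limits fall out of the key scheme rather than requiring separate limiter code.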
Interview Tip
When asked to design a rate limiter, walk through: (1) what algorithm you choose and why (Token Bucket for flexibility, Sliding Window Counter for accuracy with low memory), (2) where state is stored (Redis with atomic Lua scripts), (3) what the rate limit key is (user ID, API key, IP), (4) how you handle Redis failure (fail open vs fail closed — most systems fail open to avoid blocking legitimate traffic during an outage), and (5) how you surface rate limit info to clients (standard `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `Retry-After` headers).
Practice this pattern
Design a distributed rate limiter for a global API