Medium #system-design·May 24, 2026

Building a Production-Grade Distributed Rate Limiter

This article provides a deep dive into designing and implementing a production-grade rate limiter in Java. It covers crucial system design considerations such as concurrency, pluggable algorithms, and choosing between in-memory solutions like Caffeine and distributed options like Redis to handle different scaling requirements and trade-offs.

Distributed Systems Performance & Scaling Tools & Frameworks

Read original on Medium #system-design

Understanding Rate Limiting Fundamentals

Rate limiting is a critical component in distributed systems to protect services from abuse, manage resource consumption, and ensure fair usage. It involves controlling the rate at which an API or service can be accessed. Key design considerations include accuracy, low latency, high availability, and distributed consistency.

Core Rate Limiting Algorithms

Token Bucket: A bucket of tokens is refilled at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected or queued. Good for handling bursts.
Leaky Bucket: Requests are added to a queue and processed at a fixed rate. If the queue is full, new requests are dropped. Smooths out bursts but can introduce latency.
Fixed Window Counter: Divides time into fixed windows. Increments a counter for each request within a window. Simple, but susceptible to bursty traffic at window boundaries.
Sliding Window Log: Stores timestamps of requests. Filters out requests older than the current window. Highly accurate but memory-intensive.
Sliding Window Counter: Combines fixed window counters with interpolation to approximate the sliding window log's accuracy while being more memory-efficient.

Architectural Choices for Distributed Rate Limiting

For distributed environments, a centralized store is often necessary to synchronize rate limit states across multiple service instances. The article explores using Redis as a high-performance, distributed data store for maintaining rate limit counters or token buckets.

💡

Design Trade-offs

When choosing a rate limiting algorithm and its implementation, consider your system's specific requirements for burst handling, accuracy, memory footprint, and the acceptable level of eventual consistency versus strong consistency in a distributed setting.

In-Memory vs. Distributed Solutions

Caffeine (an in-memory caching library) can be used for local, per-instance rate limiting, offering very low latency. However, it does not provide global rate limits across multiple service instances. For true distributed rate limiting, a shared state mechanism like Redis is essential. Using Redis with Lua scripts allows for atomic operations, crucial for maintaining accurate counters in a concurrent environment.

lua

local current = redis.call('incr', KEYS[1])
local expiry = tonumber(ARGV[1])
if current == 1 then
    redis.call('expire', KEYS[1], expiry)
end
return current

This Lua script demonstrates an atomic increment and expire pattern in Redis, ensuring that the counter is incremented and expired correctly even under high concurrency, which is vital for algorithms like Fixed Window Counter.

rate limitingconcurrencydistributed systemsRedisCaffeinesystem designJavaAPI Gateway