ByteByteGo · May 13, 2026

Databricks' High-Performance Rate Limiting Redesign

Databricks overhauled its rate limiting architecture to address scalability and tail latency issues encountered with its original Redis-based setup. The redesign introduced an in-memory, sharded counter system (Dicer) and an optimistic, asynchronous batch-reporting model, significantly reducing critical path latency and improving system predictability. This involved a crucial trade-off: sacrificing strict real-time accuracy for dramatically improved performance and scalability.

Databricks faced significant scaling challenges with its initial rate limiting architecture, which relied on a single Redis instance. The real-time model serving traffic exposed limitations, particularly in tail latency due to multiple network hops and a single point of failure. The redesign prioritizes performance and scalability over strict real-time accuracy, offering valuable lessons in distributed system design and trade-offs.

Original Architecture Bottlenecks

The initial design used an Envoy ingress gateway that called a Ratelimit Service, which in turn queried Redis for counts. This put two network hops on the critical path of every request, driving P99 latencies to 10-20 ms. Optimizing with Envoy's consistent hashing helped distribute load but introduced new problems, including fragmentation, churn during scaling events, and hotspotting, making horizontal scaling inefficient beyond a certain point.
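
To make that critical-path cost concrete, here is a minimal sketch of a synchronous, Redis-backed fixed-window check, assuming the go-redis client; the key format, one-minute window, and function name are illustrative rather than Databricks' actual implementation:

```go
package ratelimit

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allowFixedWindow sketches the original synchronous check: every
// request pays a network round trip to Redis before it can be admitted.
func allowFixedWindow(ctx context.Context, rdb *redis.Client, key string, limit int64) (bool, error) {
	// One counter per (key, minute). Fixed windows can admit up to 2x
	// the limit across a window boundary, a weakness discussed later.
	windowKey := fmt.Sprintf("%s:%d", key, time.Now().Unix()/60)

	count, err := rdb.Incr(ctx, windowKey).Result()
	if err != nil {
		// Redis sits on the critical path and is a single point of failure.
		return false, err
	}
	if count == 1 {
		rdb.Expire(ctx, windowKey, time.Minute) // let stale windows expire
	}
	return count <= limit, nil
}
```

Every admission decision pays the Envoy-to-Ratelimit-Service hop plus this Redis round trip, so tail latency in either service or in Redis lands directly on user requests.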

Moving to In-Memory and Sharded Counters

Recognizing that rate limit counts are transient, Databricks moved to an in-memory storage model. This eliminated the Redis network hop. To achieve horizontal scalability and fault tolerance for these in-memory counters, they introduced a routing layer called Dicer. Dicer partitions keys across multiple servers, allowing each server to be the authoritative owner for its slice of keys. This design significantly reduced server-side tail latency and removed the single point of failure.
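
The post doesn't publish Dicer's internals, so the sketch below only illustrates the core idea: hash each key to an owning shard and serve increments from that shard's memory. All names here are hypothetical, and the static shard list glosses over the membership changes a real routing layer must survive:

```go
package dicer

import (
	"hash/fnv"
	"sync"
)

// shard owns the counters for its slice of the keyspace. Because each
// key has exactly one authoritative owner, a mutex-guarded map is
// enough; no cross-node coordination happens on the hot path.
type shard struct {
	mu     sync.Mutex
	counts map[string]int64
}

// Router plays the role the article attributes to Dicer: it maps each
// key to its owning shard.
type Router struct {
	shards []*shard
}

func NewRouter(n int) *Router {
	r := &Router{}
	for i := 0; i < n; i++ {
		r.shards = append(r.shards, &shard{counts: make(map[string]int64)})
	}
	return r
}

func (r *Router) ownerOf(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.shards[h.Sum32()%uint32(len(r.shards))]
}

// Increment is served entirely from the owning shard's memory: no
// Redis hop, and losing one server loses only its slice of counts.
func (r *Router) Increment(key string) int64 {
	s := r.ownerOf(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[key]++
	return s.counts[key]
}
```

Because ownership is exclusive, the hot path needs no cross-node coordination, and a server failure loses only the transient counts for its slice of the keyspace rather than taking down rate limiting as a whole.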

Optimistic Rate Limiting with Batch Reporting

The most impactful change was the adoption of an optimistic, asynchronous batch-reporting model. Instead of synchronous per-request checks, clients now make local decisions (defaulting to allow), count requests, and asynchronously report aggregates to the Ratelimit Service every 100 milliseconds. The server then instructs clients on which keys to reject and at what rate. This effectively removes the rate limit check from the critical path, drastically reducing client-side latency and making server load predictable.
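
A minimal sketch of the client side of this model, assuming a 100-millisecond flush interval and a report RPC that returns per-key rejection rates; the type, field, and method names are illustrative, not Databricks' API:

```go
package optimistic

import (
	"math/rand"
	"sync"
	"time"
)

// Client makes local allow/deny decisions and reports aggregates in
// the background, so no rate-limit RPC sits on the request path.
type Client struct {
	mu         sync.Mutex
	counts     map[string]int64   // requests observed since the last report
	rejectRate map[string]float64 // server-provided fraction to reject; 0 = allow all
	report     func(map[string]int64) map[string]float64
}

func New(report func(map[string]int64) map[string]float64) *Client {
	c := &Client{
		counts:     make(map[string]int64),
		rejectRate: make(map[string]float64),
		report:     report,
	}
	go c.flushLoop()
	return c
}

// Allow is purely local: count the request and apply the rejection
// rate the server sent last interval. Unknown keys default to allow.
func (c *Client) Allow(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[key]++
	return rand.Float64() >= c.rejectRate[key]
}

// flushLoop ships aggregates every 100ms and installs the per-key
// rejection rates the server wants applied during the next interval.
func (c *Client) flushLoop() {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		c.mu.Lock()
		batch := c.counts
		c.counts = make(map[string]int64)
		c.mu.Unlock()

		rates := c.report(batch) // the only network call, off the request path
		c.mu.Lock()
		c.rejectRate = rates
		c.mu.Unlock()
	}
}
```

Allow never touches the network: the only remote call happens in the background loop, which is precisely what removes the rate limit check from the critical path.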

Key Trade-off

The core of this redesign is the explicit trade-off of strict rate limit accuracy for significantly improved performance and scalability. Databricks accepts that some requests might temporarily exceed the limit between reports, designing their backends to tolerate this overshoot.

Bounding Overshoot and Algorithm Choice

To manage the overshoot introduced by optimistic reporting, Databricks implemented several layers: a server-provided rejection rate based on past traffic, a client-side local rate limiter for extreme spikes, and a switch to the token bucket algorithm. The token bucket, feasible with cheap in-memory compare-and-set operations, offers a more consistent rate limiting shape and handles bursts better than fixed window counters, which have boundary issues. This highlights how algorithm choice is often gated by storage and performance characteristics.
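
As a sketch of why the token bucket becomes affordable in memory, here is one whose admit path is a single compare-and-swap on a local counter; the per-bucket refill goroutine and integer token granularity are simplifications, not the production design:

```go
package tokenbucket

import (
	"sync/atomic"
	"time"
)

// Bucket holds up to burst tokens and refills at rate tokens/second.
// Admitting a request is a compare-and-swap on process-local memory,
// the cheap operation that makes this algorithm practical in-memory.
type Bucket struct {
	tokens atomic.Int64
	burst  int64
}

// New starts a refill goroutine per bucket, a simplification; a real
// implementation would refill lazily from timestamps on each Take.
func New(rate, burst int64) *Bucket {
	b := &Bucket{burst: burst}
	b.tokens.Store(burst)
	go func() {
		ticker := time.NewTicker(time.Second / time.Duration(rate))
		defer ticker.Stop()
		for range ticker.C {
			for {
				cur := b.tokens.Load()
				if cur >= b.burst {
					break // bucket is full; drop the token
				}
				if b.tokens.CompareAndSwap(cur, cur+1) {
					break
				}
			}
		}
	}()
	return b
}

// Take admits a request iff a token is available. Unlike a fixed
// window, there is no boundary at which two full windows' worth of
// requests can slip through back to back.
func (b *Bucket) Take() bool {
	for {
		cur := b.tokens.Load()
		if cur <= 0 {
			return false
		}
		if b.tokens.CompareAndSwap(cur, cur-1) {
			return true
		}
	}
}
```

Against Redis, each Take would have been a network round trip; in process memory it is a handful of nanoseconds, so the algorithm's smoother shaping comes essentially for free.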

Stepping back, three architectural decisions shaped this redesign:

  • Algorithm Selection: Fixed window vs. sliding window vs. token bucket – each has different boundary behaviors.
  • State Location: External shared store (Redis) vs. in-memory on a single server vs. sharded in-memory across a cluster.
  • Sync Model: Synchronous per-request checks vs. asynchronous batch reporting with optimistic local decisions.

The Databricks redesign moved from a synchronous, fixed-window, Redis-backed approach to an asynchronous, token-bucket-based, sharded in-memory system, illustrating how these three architectural decisions are deeply coupled and influence system behavior and scalability.

Tags: rate limiting, distributed systems, scalability, latency, redis, in-memory, optimistic concurrency, architecture redesign
