Netflix implemented an interval-aware caching strategy for Apache Druid to optimize real-time analytics queries. This approach significantly boosts cache hit rates and reduces query load by decomposing query results into time-aligned segments, allowing reuse across overlapping rolling window queries. The architectural solution involves an external proxy layer that intercepts, processes, and caches historical data while recomputing only the most recent intervals.
Traditional caching mechanisms struggle with the rolling window queries common in real-time analytics dashboards. Even when time boundaries shift only slightly (e.g., "errors in the last 3 hours" refreshed every minute), conventional caches treat each query as a distinct request. This leads to redundant computation, low cache reuse, and repeated scans over large datasets, which is especially problematic at Netflix's scale with trillions of rows in Apache Druid.
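To see why alignment helps, consider a minimal sketch (the 5-minute granularity and helper names here are illustrative assumptions, not Netflix's actual parameters): two rolling-window queries issued a minute apart decompose into nearly identical sets of fixed interval boundaries, so keying the cache on those boundaries instead of raw query ranges makes the overlap visible.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)  # hypothetical alignment granularity

def align_down(ts: datetime, interval: timedelta) -> datetime:
    """Floor a timestamp to the nearest interval boundary."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // interval) * interval

def window_intervals(start: datetime, end: datetime, interval: timedelta):
    """Decompose a query window into fixed, time-aligned intervals."""
    cursor = align_down(start, interval)
    while cursor < end:
        yield cursor
        cursor += interval

# Two "last 3 hours" queries issued one minute apart:
now = datetime(2024, 1, 1, 12, 0)
a = set(window_intervals(now - timedelta(hours=3), now, INTERVAL))
b = set(window_intervals(now - timedelta(hours=3) + timedelta(minutes=1),
                         now + timedelta(minutes=1), INTERVAL))
print(f"{len(a & b)} of {len(a | b)} intervals shared")  # → 36 of 37
```

Under a naive cache keyed on the full query range, these two requests would share nothing; keyed on aligned intervals, all but the newest interval is reusable.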
Netflix's solution decomposes query results into smaller, time-aligned segments. Instead of caching the full query output, intermediate aggregates for fixed time intervals are stored. When a new query arrives, relevant historical segments are retrieved from the cache and merged with newly computed data for only the most recent interval. This strategy maximizes reuse and minimizes computation.
Impact and Benefits
This caching layer led to a 33% drop in queries to Druid and a 66% improvement in P90 query times. In some workloads, it achieved up to a 14x reduction in result bytes and substantial reductions in segment scans by Druid. This significantly reduces the load on the underlying analytics database and improves dashboard responsiveness.
Netflix is exploring tighter integration of this interval-aware caching directly into Apache Druid to eliminate the need for an external proxy layer and further improve query planning efficiency. This would streamline the architecture and potentially unlock even greater performance gains by making the cache a first-class citizen within the query engine.