Thundering herd problem: our cache expired and 10K requests hit the database simultaneously
Takeshi Tanaka
We just got hit by a classic thundering herd problem last night. We have a highly popular dashboard that's heavily cached in Redis, with a 5-minute TTL. Around 2 AM PST, the cache expired, and thousands of concurrent requests for that dashboard all hit our PostgreSQL database simultaneously. The database immediately fell over, and it took us a good 20 minutes to recover fully. It was painful.
We've discussed a few strategies to prevent this from happening again. One is a simple distributed mutex: when the cache key expires, only one request is allowed to recompute and populate the cache, while others wait. Another is probabilistic early expiration, where a small percentage of requests proactively refresh the cache before it fully expires. And then there's the `stale-while-revalidate` pattern, where old data is served while a background refresh happens. Each has its pros and cons regarding latency, complexity, and how fresh the data is. For critical dashboards, what approach do you find most robust and least error-prone for preventing these cache stampedes?
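To make the trade-offs concrete, here is a minimal sketch of the distributed-mutex idea. All names here are mine, and the in-process `FakeCache` is just a stand-in so the example is self-contained; against real Redis the same pattern maps onto `SET key value NX PX` for the lock and a plain `GET`/`SET` for the data key.

```python
import threading
import time

class FakeCache:
    """In-process stand-in for Redis, just for illustration."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry_timestamp)
        self._mu = threading.Lock()

    def get(self, key):
        with self._mu:
            entry = self._data.get(key)
            if entry is None or entry[1] <= time.time():
                return None
            return entry[0]

    def set(self, key, value, ttl):
        with self._mu:
            self._data[key] = (value, time.time() + ttl)

    def set_nx(self, key, value, ttl):
        """Set only if absent/expired; True means we won the lock."""
        with self._mu:
            entry = self._data.get(key)
            if entry is not None and entry[1] > time.time():
                return False
            self._data[key] = (value, time.time() + ttl)
            return True

    def delete(self, key):
        with self._mu:
            self._data.pop(key, None)

def get_or_recompute(cache, key, recompute, ttl=300.0,
                     lock_ttl=10.0, wait=0.01, max_waits=500):
    """Only the lock winner hits the database; everyone else polls the cache."""
    value = cache.get(key)
    if value is not None:
        return value
    lock_key = "lock:" + key
    if cache.set_nx(lock_key, "1", lock_ttl):
        try:
            # Double-check: another worker may have filled the key
            # between our miss and our lock acquisition.
            value = cache.get(key)
            if value is not None:
                return value
            value = recompute()
            cache.set(key, value, ttl)
            return value
        finally:
            cache.delete(lock_key)
    # Lost the lock: wait for the winner to populate the cache.
    for _ in range(max_waits):
        time.sleep(wait)
        value = cache.get(key)
        if value is not None:
            return value
    # Lock holder died or is very slow: fall back to recomputing ourselves.
    return recompute()
```

The lock TTL is the subtle part: it must outlive the slowest plausible recompute, or a second worker grabs the expired lock and you stampede anyway, just with two workers instead of ten thousand.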
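The probabilistic early expiration option is surprisingly little code. This sketch follows the "XFetch" formula (recompute when `now - delta * beta * log(rand()) >= expiry`, where `delta` is the observed recompute time); the function name and parameters are my own framing, not anything from a particular library.

```python
import math
import random
import time

def should_refresh_early(expiry_ts, delta, beta=1.0, now=None, rng=random.random):
    """Decide whether this request should proactively refresh the cache.

    expiry_ts: absolute expiry timestamp of the cached entry.
    delta:     how long a recompute takes, in seconds.
    beta:      eagerness knob; > 1 refreshes earlier, < 1 later.
    """
    now = time.time() if now is None else now
    # -log(rng()) is an exponentially distributed positive number, so a
    # few requests volunteer to refresh as expiry approaches while the
    # vast majority keep serving the cached value.
    return now - delta * beta * math.log(rng()) >= expiry_ts
```

Callers that get `True` recompute and rewrite the cache entry; everyone else reads the still-valid value. The appeal is that there is no lock and no coordination at all, at the cost of occasionally doing one redundant refresh.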
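And for comparison, a minimal stale-while-revalidate sketch, again with hypothetical names of my own. Past a soft TTL the stale value is served immediately and a single background thread refreshes it, so no reader ever blocks on the database; only a true cold miss computes synchronously.

```python
import threading
import time

class SWRCache:
    """Serve stale data past a soft TTL while one background refresh runs."""

    def __init__(self, recompute, soft_ttl=300.0):
        self._recompute = recompute      # key -> fresh value (e.g. a DB query)
        self._soft_ttl = soft_ttl
        self._entries = {}               # key -> (value, fresh_until)
        self._refreshing = set()         # keys with a refresh in flight
        self._mu = threading.Lock()

    def get(self, key):
        with self._mu:
            entry = self._entries.get(key)
            if entry is not None:
                value, fresh_until = entry
                if time.time() < fresh_until:
                    return value         # fresh hit
                if key not in self._refreshing:
                    # First stale reader kicks off exactly one refresh.
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,),
                                     daemon=True).start()
                return value             # stale hit: serve old data now
        # Cold miss: someone has to pay the database cost once.
        value = self._recompute(key)
        with self._mu:
            self._entries[key] = (value, time.time() + self._soft_ttl)
        return value

    def _refresh(self, key):
        try:
            value = self._recompute(key)
            with self._mu:
                self._entries[key] = (value, time.time() + self._soft_ttl)
        finally:
            with self._mu:
                self._refreshing.discard(key)
```

Note the `_refreshing` set is doing the same deduplication job as the distributed lock, just in-process; with multiple app servers each server refreshes independently, which for a dashboard is usually an acceptable handful of concurrent queries rather than a stampede.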