Thundering herd problem: our cache expired and 10K requests hit the database simultaneously
Elena Andersen
We recently experienced a classic "thundering herd" problem. A popular cache key expired simultaneously across all our instances, and immediately thousands of concurrent requests hammered our primary database for the same piece of data. The database buckled, and our API response times went through the roof.

We're evaluating solutions to prevent this. One approach is a mutex in the application layer, ensuring only one request regenerates the cache while the rest wait for (or reuse) its result. Another is probabilistic early expiration, where each reader treats an entry as expired slightly before its actual TTL with some probability, giving a single instance a head start on regeneration. A third is stale-while-revalidate: serve the stale value immediately and refresh the cache in the background.

What's your preferred strategy for cache stampede prevention, especially at high concurrency? What practical trade-offs have you observed with each method in terms of latency, consistency, and implementation complexity?
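For context on the mutex approach, here is a minimal single-process sketch in Python. It is not our production code, just an illustration of the idea: a per-key lock so that on a miss only one thread calls the (hypothetical) `regenerate` callback, while concurrent requests for the same key block and then read the freshly cached value. A multi-instance deployment would need a distributed lock (e.g. in Redis) instead, with all the timeout/fencing complexity that brings.

```python
import threading
import time

# Toy in-process cache: {key: (value, expiry_timestamp)}.
_cache = {}
_cache_guard = threading.Lock()  # protects _cache and _key_locks
_key_locks = {}                  # one regeneration lock per cache key

def get_with_singleflight(key, ttl, regenerate):
    """Return the cached value for `key`; on a miss, let exactly one
    thread call regenerate() while the others wait for its result."""
    now = time.monotonic()
    with _cache_guard:
        entry = _cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit, no lock contention
        lock = _key_locks.setdefault(key, threading.Lock())

    with lock:  # only one thread per key proceeds past this point
        # Re-check under the lock: another thread may have already
        # regenerated the value while we were waiting.
        with _cache_guard:
            entry = _cache.get(key)
            if entry is not None and entry[1] > time.monotonic():
                return entry[0]
        value = regenerate()  # the single database hit
        with _cache_guard:
            _cache[key] = (value, time.monotonic() + ttl)
        return value
```

The trade-off is visible in the structure: waiters pay the full regeneration latency (no stale serving), and the per-key lock table itself needs eviction in a long-running process.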