Cache Invalidation Patterns
The hardest problem in CS: TTL-based, event-driven, and versioned invalidation. Cache stampede prevention and stale-while-revalidate.
The Hardest Problem
Phil Karlton's famous quote: "There are only two hard things in Computer Science: cache invalidation and naming things." Cache invalidation is hard because knowing when to expire a cached value requires knowing when the underlying data has changed — and in distributed systems, that change may happen on a different server, in a different service, or through a path that doesn't touch the cache at all.
The consequences of getting invalidation wrong are visible and painful: users see stale prices after a discount is applied, outdated profile photos after an edit, or incorrect inventory counts after a purchase. This lesson covers the major invalidation strategies and their production failure modes.
TTL-Based Invalidation
TTL (Time-To-Live) is the simplest and most common invalidation mechanism. Every cache entry is given a maximum age; after expiry, the next request triggers a fresh fetch from the source of truth. TTL-based invalidation requires no coordination — the cache handles expiry independently.
Setting the right TTL is a business decision: how stale is acceptable? For stock prices: 1 second. For product descriptions: 1 hour. For country codes: 24 hours. For static assets: indefinite (use versioned URLs instead). The key insight is that TTL trades staleness for simplicity — you are accepting that data may be out of date by up to TTL seconds.
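As a concrete baseline, TTL expiry needs nothing more than a timestamp check at read time. A minimal in-process sketch (a stand-in for Redis's `EX` option; the keys and values are illustrative):

```python
import time

class TTLCache:
    """Minimal in-process TTL cache (illustrative sketch, not production code)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl: float):
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiry: evict on the next read
            return None
        return value

cache = TTLCache()
cache.set("stock:AAPL", 189.52, ttl=1)            # volatile: 1 second
cache.set("product:42:desc", "Widget", ttl=3600)  # stable: 1 hour
```

Note that expiry here is lazy (checked on read), which is also how Redis handles most expired keys; nothing happens at the exact expiry instant.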
Cache stampede (thundering herd)
When a popular cache key expires, all concurrent requests for that key simultaneously get a cache miss and race to regenerate it — each firing an expensive database query. This can overload the database exactly when it is most vulnerable (high traffic). This is called a cache stampede or thundering herd.
Cache Stampede Prevention
Several techniques prevent cache stampedes:
- Mutex / lock on cache miss: Only one process regenerates the key; others wait or serve stale data. Redis `SET NX EX` (set if not exists with expiry) is a standard distributed lock primitive.
- Probabilistic early expiry: Before the TTL expires, some requests proactively refresh the cache. The probability of refreshing increases as the remaining TTL approaches zero — this is the XFetch (probabilistic early expiration) algorithm.
- Stale-while-revalidate: Serve the stale value to all callers while one background process refreshes it. Users never see a miss; the background refresh absorbs the recomputation cost.
- Jittered TTLs: Add random jitter to TTLs (`ttl + random(0, jitter)`) so all entries for a batch load don't expire at the same moment.
```python
import redis
import time
import random

r = redis.Redis()

def get_with_mutex(key: str, recompute_fn, ttl: int):
    """Get from cache, using a mutex lock to prevent stampede."""
    value = r.get(key)
    if value:
        return value
    # Cache miss — try to acquire lock
    lock_key = f"lock:{key}"
    acquired = r.set(lock_key, "1", nx=True, ex=10)  # 10s lock timeout
    if acquired:
        try:
            # We hold the lock — recompute
            value = recompute_fn()
            jitter = random.randint(0, int(ttl * 0.1))  # 10% jitter
            r.set(key, value, ex=ttl + jitter)
            return value
        finally:
            r.delete(lock_key)
    else:
        # Another process is recomputing — wait briefly and retry
        time.sleep(0.05)
        return r.get(key) or recompute_fn()  # fallback if lock holder failed
```

Event-Driven Invalidation
Event-driven invalidation invalidates cache entries in response to data change events rather than waiting for TTL expiry. When the user updates their profile in the database, the service publishes a `user.updated` event; a cache invalidation subscriber deletes or regenerates the affected cache keys.
Event-driven invalidation is more accurate than TTL — cache entries are invalidated immediately after a change rather than waiting up to TTL seconds. But it adds infrastructure complexity: you need a reliable event bus, the invalidation subscriber must handle duplicates (at-least-once delivery), and the event must carry enough information to identify the affected cache keys.
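A minimal sketch of the flow, with a plain dict standing in for the cache and a callback registry standing in for the message broker (the topic name `user.updated` comes from the text above; the handler and key names are illustrative):

```python
from collections import defaultdict

cache = {}                        # stand-in for Redis/Memcached
subscribers = defaultdict(list)   # stand-in for a broker like Kafka or SQS

def publish(topic: str, event: dict):
    for handler in subscribers[topic]:
        handler(event)

def on_user_updated(event: dict):
    # Invalidation subscriber: delete every cache key derived from the user id.
    # The event must carry enough information to name the affected keys.
    user_id = event["user_id"]
    for key in (f"user:{user_id}", f"user:{user_id}:profile"):
        cache.pop(key, None)  # idempotent: safe under at-least-once delivery

subscribers["user.updated"].append(on_user_updated)

cache["user:42"] = {"name": "Old Name"}
publish("user.updated", {"user_id": 42})  # stale entry is deleted
```

Deleting (rather than regenerating) on the event is the simpler choice: the next read misses and repopulates from the source of truth.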
Event ordering and race conditions
With async invalidation, there is a window between the database write and the cache deletion where stale data is served. Worse: if the cache is regenerated before the delete arrives, you can cache the stale pre-update value. Mitigate by using versioned values or a conditional write: only write to cache if the version in the event is newer than the cached version.
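The conditional-write mitigation can be sketched as a compare-on-version before overwriting (a simplified in-process version; with Redis you would typically make this check-and-set atomic, e.g. in a Lua script):

```python
def write_if_newer(cache: dict, key: str, value, version: int) -> bool:
    """Only write to the cache if the incoming version is newer than what's cached."""
    current = cache.get(key)
    if current is not None and current["version"] >= version:
        return False  # a stale event arrived late — keep the newer cached value
    cache[key] = {"version": version, "value": value}
    return True

cache = {}
write_if_newer(cache, "user:42", {"name": "New"}, version=7)
accepted = write_if_newer(cache, "user:42", {"name": "Old"}, version=6)  # late, stale event
```

The late event with version 6 is rejected, so the cache keeps the post-update value even when events are delivered out of order.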
Versioned / Tag-Based Invalidation
Versioned invalidation embeds a version number in the cache key. When the underlying data changes, the version is incremented; all cache lookups use the new version and get a miss, causing a natural refresh. Old entries are left to expire via TTL (they are orphaned and never accessed again).
```python
def get_user_cache_key(user_id: str) -> str:
    # Version stored in a fast counter (Redis or DB)
    version = get_user_version(user_id)  # e.g., "v7"
    return f"user:{user_id}:{version}"

def invalidate_user(user_id: str):
    # Increment version — all old keys become orphaned
    increment_user_version(user_id)
    # Old cached keys (e.g., user:42:v6) will expire via their TTL
    # New requests will miss on user:42:v7 and repopulate
```

Cache tags (supported by Fastly and Cloudflare) are a CDN-level form of grouped invalidation. Tag a response with multiple identifiers: `Surrogate-Key: product:42 category:electronics`. A single purge API call for `product:42` invalidates all responses tagged with that product — even across many different URLs.
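A sketch of how an origin might emit the tagging header — the handler and renderer names here are hypothetical; only the `Surrogate-Key` header format comes from the text above:

```python
def render_product(product_id: int) -> str:
    # Stub renderer standing in for real template logic.
    return f"<html>product {product_id}</html>"

def product_page_response(product_id: int, category: str):
    """Hypothetical origin handler: tag the response so the CDN can later purge by key."""
    headers = {
        # One response can carry several tags, space-separated.
        "Surrogate-Key": f"product:{product_id} category:{category}",
        "Cache-Control": "max-age=86400",
    }
    return headers, render_product(product_id)
```

Because the same product can appear on many URLs (detail page, category listing, search results), purging by tag is far more reliable than trying to enumerate every URL that mentions it.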
Stale-While-Revalidate
Stale-while-revalidate is a pattern (and HTTP directive) that allows serving a stale cached response while simultaneously refreshing it in the background. This effectively eliminates cache miss latency for end users at the cost of accepting brief staleness.
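In HTTP this is the `Cache-Control: max-age=60, stale-while-revalidate=300` directive (RFC 5861). The same idea can be sketched in-process — a sketch, assuming a single-process cache where one background thread refreshes while readers keep getting the stale value:

```python
import threading
import time

class SWRCache:
    """Sketch: serve stale values while one background thread revalidates."""
    def __init__(self, recompute_fn, ttl: float):
        self._recompute = recompute_fn
        self._ttl = ttl
        self._store = {}          # key -> (value, fresh_until)
        self._refreshing = set()  # keys with an in-flight background refresh
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._store.get(key)
            if entry is not None:
                value, fresh_until = entry
                if time.monotonic() < fresh_until:
                    return value  # fresh hit
                if key not in self._refreshing:
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                return value  # stale hit: caller never waits for the refresh
        # True miss (nothing cached yet): compute synchronously once
        value = self._recompute(key)
        with self._lock:
            self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def _refresh(self, key):
        value = self._recompute(key)
        with self._lock:
            self._store[key] = (value, time.monotonic() + self._ttl)
            self._refreshing.discard(key)
```

Only the very first request for a key ever pays the recomputation latency; every later caller gets an immediate (possibly stale) answer.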
| Invalidation Strategy | Staleness | Complexity | Infrastructure Needed | Best For |
|---|---|---|---|---|
| TTL expiry | Up to TTL seconds | None | Nothing extra | Tolerable staleness, simple systems |
| Event-driven (pub/sub) | Seconds (async) | Medium | Message broker (Kafka/SQS) | User-facing mutations that must be reflected quickly |
| Versioned keys | Zero (on version bump) | Low | Version counter store | Content where you control all write paths |
| Cache tags (CDN) | Near-zero (API purge) | Low | CDN with surrogate key support | CDN edge caching of product/content pages |
| Stale-while-revalidate | Up to revalidation interval | Low | Nothing extra | High read QPS where miss latency is unacceptable |
Interview Tip
Cache invalidation questions are a litmus test for senior candidates. A junior answer: 'set a TTL.' A senior answer: 'TTL is my baseline; I'd set it based on the business tolerance for staleness. For user-facing mutations (profile updates, inventory changes), I'd add event-driven invalidation via Kafka so changes propagate within seconds rather than minutes. I'd also add stampede prevention — either a mutex lock for a hot key or stale-while-revalidate to serve the stale value during regeneration. The choice depends on whether 100ms staleness is acceptable for the background refresh.' Show you know the failure modes, not just the happy path.