Cache Invalidation Patterns
The hardest problem in CS: TTL-based, event-driven, and versioned invalidation. Cache stampede prevention and stale-while-revalidate.
The Hardest Problem
Phil Karlton's famous quote: "There are only two hard things in Computer Science: cache invalidation and naming things." Cache invalidation is hard because knowing when to expire a cached value requires knowing when the underlying data has changed — and in distributed systems, that change may happen on a different server, in a different service, or through a path that doesn't touch the cache at all.
The consequences of getting invalidation wrong are visible and painful: users see stale prices after a discount is applied, outdated profile photos after an edit, or incorrect inventory counts after a purchase. This lesson covers the major invalidation strategies and their production failure modes.
TTL-Based Invalidation
TTL (Time-To-Live) is the simplest and most common invalidation mechanism. Every cache entry is given a maximum age; after expiry, the next request triggers a fresh fetch from the source of truth. TTL-based invalidation requires no coordination — the cache handles expiry independently.
Setting the right TTL is a business decision: how stale is acceptable? For stock prices: 1 second. For product descriptions: 1 hour. For country codes: 24 hours. For static assets: indefinite (use versioned URLs instead). The key insight is that TTL trades staleness for simplicity — you are accepting that data may be out of date by up to TTL seconds.
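As a concrete baseline, TTL expiry needs nothing more than a timestamp check at read time. A minimal in-process sketch (a stand-in for Redis's `EX` option; the keys and values are illustrative):

```python
import time

class TTLCache:
    """Minimal in-process TTL cache (illustrative sketch, not production code)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl: float):
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiry: evict on the next read
            return None
        return value

cache = TTLCache()
cache.set("stock:AAPL", 189.52, ttl=1)            # volatile: 1 second
cache.set("product:42:desc", "Widget", ttl=3600)  # stable: 1 hour
```

Note that expiry here is lazy (checked on read), which is also how Redis handles most expired keys; nothing happens at the exact expiry instant.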
Cache stampede (thundering herd)
When a popular cache key expires, all concurrent requests for that key simultaneously get a cache miss and race to regenerate it — each firing an expensive database query. This can overload the database exactly when it is most vulnerable (high traffic). This is called a cache stampede or thundering herd.
Cache Stampede Prevention
Several techniques prevent cache stampedes:
- Mutex / lock on cache miss: Only one process regenerates the key; others wait or serve stale data. Redis `SET NX EX` (set if not exists with expiry) is a standard distributed lock primitive.
- Probabilistic early expiry: Before the TTL expires, some requests proactively refresh the cache. The probability of refreshing increases as the remaining TTL approaches zero — this is the XFetch (probabilistic early expiration) algorithm.
- Stale-while-revalidate: Serve the stale value to all callers while one background process refreshes it. Users never see a miss; the background refresh absorbs the recomputation cost.
- Jittered TTLs: Add random jitter to TTLs (`ttl + random(0, jitter)`) so all entries for a batch load don't expire at the same moment.
```python
import redis
import time
import random

r = redis.Redis()

def get_with_mutex(key: str, recompute_fn, ttl: int):
    """Get from cache, using a mutex lock to prevent stampede."""
    value = r.get(key)
    if value:
        return value
    # Cache miss — try to acquire lock
    lock_key = f"lock:{key}"
    acquired = r.set(lock_key, "1", nx=True, ex=10)  # 10s lock timeout
    if acquired:
        try:
            # We hold the lock — recompute
            value = recompute_fn()
            jitter = random.randint(0, int(ttl * 0.1))  # 10% jitter
            r.set(key, value, ex=ttl + jitter)
            return value
        finally:
            r.delete(lock_key)
    else:
        # Another process is recomputing — wait briefly and retry
        time.sleep(0.05)
        return r.get(key) or recompute_fn()  # fallback if lock holder failed
```

Event-Driven Invalidation
Event-driven invalidation invalidates cache entries in response to data change events rather than waiting for TTL expiry. When the user updates their profile in the database, the service publishes a `user.updated` event; a cache invalidation subscriber deletes or regenerates the affected cache keys.
Event-driven invalidation is more accurate than TTL — cache entries are invalidated immediately after a change rather than waiting up to TTL seconds. But it adds infrastructure complexity: you need a reliable event bus, the invalidation subscriber must handle duplicates (at-least-once delivery), and the event must carry enough information to identify the affected cache keys.
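A minimal sketch of the flow, with a plain dict standing in for the cache and a callback registry standing in for the message broker (the topic name `user.updated` comes from the text above; the handler and key names are illustrative):

```python
from collections import defaultdict

cache = {}                        # stand-in for Redis/Memcached
subscribers = defaultdict(list)   # stand-in for a broker like Kafka or SQS

def publish(topic: str, event: dict):
    for handler in subscribers[topic]:
        handler(event)

def on_user_updated(event: dict):
    # Invalidation subscriber: delete every cache key derived from the user id.
    # The event must carry enough information to name the affected keys.
    user_id = event["user_id"]
    for key in (f"user:{user_id}", f"user:{user_id}:profile"):
        cache.pop(key, None)  # idempotent: safe under at-least-once delivery

subscribers["user.updated"].append(on_user_updated)

cache["user:42"] = {"name": "Old Name"}
publish("user.updated", {"user_id": 42})  # stale entry is deleted
```

Deleting (rather than regenerating) on the event is the simpler choice: the next read misses and repopulates from the source of truth.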
Event ordering and race conditions
With async invalidation, there is a window between the database write and the cache deletion where stale data is served. Worse: if the cache is regenerated before the delete arrives, you can cache the stale pre-update value. Mitigate by using versioned values or a conditional write: only write to cache if the version in the event is newer than the cached version.
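The conditional-write mitigation can be sketched as a compare-on-version before overwriting (a simplified in-process version; with Redis you would typically make this check-and-set atomic, e.g. in a Lua script):

```python
def write_if_newer(cache: dict, key: str, value, version: int) -> bool:
    """Only write to the cache if the incoming version is newer than what's cached."""
    current = cache.get(key)
    if current is not None and current["version"] >= version:
        return False  # a stale event arrived late — keep the newer cached value
    cache[key] = {"version": version, "value": value}
    return True

cache = {}
write_if_newer(cache, "user:42", {"name": "New"}, version=7)
accepted = write_if_newer(cache, "user:42", {"name": "Old"}, version=6)  # late, stale event
```

The late event with version 6 is rejected, so the cache keeps the post-update value even when events are delivered out of order.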
Versioned / Tag-Based Invalidation
Versioned invalidation embeds a version number in the cache key. When the underlying data changes, the version is incremented; all cache lookups use the new version and get a miss, causing a natural refresh. Old entries are left to expire via TTL (they are orphaned and never accessed again).
```python
def get_user_cache_key(user_id: str) -> str:
    # Version stored in a fast counter (Redis or DB)
    version = get_user_version(user_id)  # e.g., "v7"
    return f"user:{user_id}:{version}"

def invalidate_user(user_id: str):
    # Increment version — all old keys become orphaned
    increment_user_version(user_id)
    # Old cached keys (e.g., user:42:v6) will expire via their TTL
    # New requests will miss on user:42:v7 and repopulate
```

Cache tags (supported by Fastly and Cloudflare) are a CDN-level form of grouped invalidation. Tag a response with multiple identifiers: `Surrogate-Key: product:42 category:electronics`. A single purge API call for `product:42` invalidates all responses tagged with that product — even across many different URLs.
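A sketch of how an origin might emit the tagging header — the handler and renderer names here are hypothetical; only the `Surrogate-Key` header format comes from the text above:

```python
def render_product(product_id: int) -> str:
    # Stub renderer standing in for real template logic.
    return f"<html>product {product_id}</html>"

def product_page_response(product_id: int, category: str):
    """Hypothetical origin handler: tag the response so the CDN can later purge by key."""
    headers = {
        # One response can carry several tags, space-separated.
        "Surrogate-Key": f"product:{product_id} category:{category}",
        "Cache-Control": "max-age=86400",
    }
    return headers, render_product(product_id)
```

Because the same product can appear on many URLs (detail page, category listing, search results), purging by tag is far more reliable than trying to enumerate every URL that mentions it.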
Stale-While-Revalidate
Stale-while-revalidate is a pattern (and HTTP directive) that allows serving a stale cached response while simultaneously refreshing it in the background. This effectively eliminates cache miss latency for end users at the cost of accepting brief staleness.
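In HTTP this is the `Cache-Control: max-age=60, stale-while-revalidate=300` directive (RFC 5861). The same idea can be sketched in-process — a sketch, assuming a single-process cache where one background thread refreshes while readers keep getting the stale value:

```python
import threading
import time

class SWRCache:
    """Sketch: serve stale values while one background thread revalidates."""
    def __init__(self, recompute_fn, ttl: float):
        self._recompute = recompute_fn
        self._ttl = ttl
        self._store = {}          # key -> (value, fresh_until)
        self._refreshing = set()  # keys with an in-flight background refresh
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._store.get(key)
            if entry is not None:
                value, fresh_until = entry
                if time.monotonic() < fresh_until:
                    return value  # fresh hit
                if key not in self._refreshing:
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                return value  # stale hit: caller never waits for the refresh
        # True miss (nothing cached yet): compute synchronously once
        value = self._recompute(key)
        with self._lock:
            self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def _refresh(self, key):
        value = self._recompute(key)
        with self._lock:
            self._store[key] = (value, time.monotonic() + self._ttl)
            self._refreshing.discard(key)
```

Only the very first request for a key ever pays the recomputation latency; every later caller gets an immediate (possibly stale) answer.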
| Invalidation Strategy | Staleness | Complexity | Infrastructure Needed | Best For |
|---|---|---|---|---|
| TTL expiry | Up to TTL seconds | None | Nothing extra | Tolerable staleness, simple systems |
| Event-driven (pub/sub) | Seconds (async) | Medium | Message broker (Kafka/SQS) | User-facing mutations that must be reflected quickly |
| Versioned keys | Zero (on version bump) | Low | Version counter store | Content where you control all write paths |
| Cache tags (CDN) | Near-zero (API purge) | Low | CDN with surrogate key support | CDN edge caching of product/content pages |
| Stale-while-revalidate | Up to revalidation interval | Low | Nothing extra | High read QPS where miss latency is unacceptable |
Interview Tip
Cache invalidation questions are a litmus test for senior candidates. A junior answer: 'set a TTL.' A senior answer: 'TTL is my baseline; I'd set it based on the business tolerance for staleness. For user-facing mutations (profile updates, inventory changes), I'd add event-driven invalidation via Kafka so changes propagate within seconds rather than minutes. I'd also add stampede prevention — either a mutex lock for a hot key or stale-while-revalidate to serve the stale value during regeneration. The choice depends on whether 100ms staleness is acceptable for the background refresh.' Show you know the failure modes, not just the happy path.