
Bloom Filters & HyperLogLog

Probabilistic data structures for scale: Bloom filters for membership testing, counting Bloom filters, HyperLogLog for cardinality estimation, and practical applications.

12 min read

Why Probabilistic Data Structures?

At large scale, exact answers are expensive. Storing every URL seen by a web crawler to check for duplicates might require terabytes of memory. Counting distinct users across billions of events is impractical with a hash set. Probabilistic data structures trade a small, bounded error rate for dramatic reductions in memory and CPU usage — often by several orders of magnitude.

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure that answers membership queries: 'is element X in the set?' It can produce false positives (saying X is in the set when it is not) but never false negatives (if it says X is not in the set, X is definitely not there). This asymmetry is exploitable in many systems.

A Bloom filter consists of a bit array of size m (initially all zeros) and k hash functions. To add an element, hash it with all k functions and set those bit positions to 1. To query, hash it with all k functions and check if all those bit positions are 1 — if any are 0, the element is definitely not in the set.

(Diagram) Bloom filter: set bits on insert, check all k bits on query

False Positive Rate

The false positive rate depends on the bit array size m, the number of hash functions k, and the number of inserted elements n: after n insertions it is approximately `(1 - e^(-kn/m))^k`. Practically: with m = 10 bits/element and the optimal k = 7 hash functions, the false positive rate is about 1%. Doubling to 20 bits/element (optimal k = 14) drops it to roughly 0.007%.
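The formula can be checked numerically; a quick sketch:

```typescript
// False-positive rate for a Bloom filter with m bits, k hash functions,
// and n inserted items: (1 - e^(-kn/m))^k
function bloomFpr(m: number, k: number, n: number): number {
  return (1 - Math.exp((-k * n) / m)) ** k;
}

bloomFpr(10_000_000, 7, 1_000_000);  // 10 bits/element, k = 7  → ≈ 0.0082 (~1%)
bloomFpr(20_000_000, 14, 1_000_000); // 20 bits/element, k = 14 → ≈ 0.00007
```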

```typescript
class BloomFilter {
  private bits: Uint8Array;
  private m: number;       // bit array size
  private k: number;       // number of hash functions

  constructor(expectedItems: number, falsePositiveRate: number) {
    // Optimal sizing: m = -n·ln(p) / (ln 2)^2, k = (m/n)·ln 2
    this.m = Math.ceil(-(expectedItems * Math.log(falsePositiveRate)) / Math.LN2 ** 2);
    this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.LN2));
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  add(item: string): void {
    for (const pos of this.hashPositions(item)) {
      this.bits[Math.floor(pos / 8)] |= 1 << (pos % 8);
    }
  }

  mightContain(item: string): boolean {
    return this.hashPositions(item).every(pos =>
      ((this.bits[Math.floor(pos / 8)] >> (pos % 8)) & 1) === 1
    );
  }

  private hashPositions(item: string): number[] {
    // In production: use MurmurHash3 with different seeds
    return Array.from({ length: this.k }, (_, i) =>
      Math.abs(this.simpleHash(item, i)) % this.m
    );
  }

  private simpleHash(s: string, seed: number): number {
    let h = seed;
    for (const c of s) h = Math.imul(h ^ c.charCodeAt(0), 0x9e3779b9);
    return h;
  }
}

// Usage: avoid DB lookups for definitely-absent keys
const filter = new BloomFilter(1_000_000, 0.01); // 1M items, 1% FPR
// ~1.2 MB of bits vs tens of MB for a hash set of 1M strings
```

Real-World Bloom Filter Usage

| System | Bloom Filter Use |
| --- | --- |
| Apache Cassandra | Each SSTable has a Bloom filter; avoids disk reads for keys not in that SSTable |
| Google Bigtable | Per-tablet Bloom filter reduces disk seeks for non-existent rows |
| PostgreSQL | Query planner uses a Bloom filter-based index for multi-column lookups |
| Web browsers | Safe Browsing API: local Bloom filter for malicious URLs; server query only on a positive |
| Akamai CDN | One-hit-wonder detection: only cache objects seen at least twice (avoids cache pollution) |
💡

Counting Bloom filters

Standard Bloom filters do not support deletion — you cannot unset bits because multiple elements may share a bit position. Counting Bloom filters replace each bit with a small counter (typically 4 bits). Increment on add, decrement on delete. This supports deletion at the cost of 4x memory. Used in network routers for flow counting.
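The idea can be sketched in a few lines. This hypothetical version uses byte-wide counters for simplicity (real implementations pack 4-bit counters) and an inline toy hash:

```typescript
// Counting Bloom filter sketch: counters instead of bits, so deletion works.
class CountingBloomFilter {
  private counters: Uint8Array;

  constructor(private m: number, private k: number) {
    this.counters = new Uint8Array(m); // one byte per slot (4 bits in practice)
  }

  add(item: string): void {
    for (const pos of this.positions(item)) {
      if (this.counters[pos] < 255) this.counters[pos]++;
    }
  }

  remove(item: string): void {
    // Only safe for items previously added; otherwise counters go inconsistent.
    for (const pos of this.positions(item)) {
      if (this.counters[pos] > 0) this.counters[pos]--;
    }
  }

  mightContain(item: string): boolean {
    return this.positions(item).every(pos => this.counters[pos] > 0);
  }

  private positions(item: string): number[] {
    // Toy seeded hash; production code would use MurmurHash3 or similar.
    return Array.from({ length: this.k }, (_, seed) => {
      let h = seed;
      for (const c of item) h = Math.imul(h ^ c.charCodeAt(0), 0x9e3779b9);
      return Math.abs(h) % this.m;
    });
  }
}
```

Note the caveat on `remove`: deleting an item that was never added can silently break other entries, which is why some systems pair the filter with an exact side structure.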

HyperLogLog

HyperLogLog (HLL) estimates the cardinality (count of distinct elements) of a multiset using a tiny, fixed amount of memory. The key insight: if you hash all elements and track the maximum number of leading zeros seen in the hashes, you can estimate the cardinality (roughly 2^max). Using the first p bits of each hash to select one of 2^p registers, and averaging the estimates across registers, HLL achieves a standard error of `1.04 / sqrt(2^p)`.

  • 12 KB memory estimates cardinality up to billions with ~0.81% standard error (p=14).
  • HLL++ (Google): improved HLL with bias correction for small and large cardinalities. Used in BigQuery.
  • Redis HyperLogLog: `PFADD key element...` / `PFCOUNT key`. Uses up to ~12 KB per key (dense encoding), 0.81% standard error.
  • Mergeability: two HLL sketches can be merged via element-wise max — enabling distributed counting across shards without communication overhead.
| Data Structure | Use Case | Memory (1M items) | Error Rate |
| --- | --- | --- | --- |
| Hash Set | Exact membership | ~32 MB | 0% (exact) |
| Bloom Filter | Membership testing | ~1.2 MB (1% FPR) | 1% false positives, 0% false negatives |
| Counting Bloom Filter | Membership + deletion | ~4.8 MB | ~1% false positives |
| HyperLogLog | Cardinality estimation | 12 KB (fixed!) | ~0.81% standard error |
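The memory column can be sanity-checked from the sizing formulas. A quick sketch, assuming the HLL figure comes from 2^14 six-bit registers (p = 14, as in Redis):

```typescript
// Optimal Bloom filter size in bits for n items at target FPR p:
// m = -n * ln(p) / (ln 2)^2
function bloomBits(n: number, fpr: number): number {
  return Math.ceil((-n * Math.log(fpr)) / Math.LN2 ** 2);
}

const bloomBytes = Math.ceil(bloomBits(1_000_000, 0.01) / 8); // ≈ 1.2 MB
const hllBytes = ((1 << 14) * 6) / 8;                         // 12,288 bytes = 12 KB
```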
💡

Interview Tip

Bloom filters come up in storage system design questions ('how does Cassandra avoid disk reads for missing keys?'). HyperLogLog comes up in analytics questions ('how does YouTube count unique video views without storing every user ID?'). The key interview point for both: these structures are compact approximations with mathematically bounded error. Mention that Redis supports HyperLogLog natively (`PFADD`, `PFCOUNT`) and that Cassandra uses a Bloom filter per SSTable to avoid unnecessary disk I/O.
