Change Data Capture (CDC)

Capture database changes as events: Debezium, log-based CDC, trigger-based CDC, and using CDC to keep materialized views and search indexes in sync.

What Is Change Data Capture?

Change Data Capture (CDC) is a pattern that detects and captures changes made to a database (inserts, updates, and deletes) and makes those changes available to downstream systems as a stream of events. In its preferred, log-based form, CDC reads the database's own transaction log instead of repeatedly querying tables, which keeps latency low and adds little load to the database.

CDC is the infrastructure that powers many other patterns in this module: it is how the Transactional Outbox relay reads outbox inserts, how CQRS read projections stay current, how Elasticsearch indexes are kept in sync with Postgres, and how data pipelines replicate changes to data warehouses.

CDC with Debezium: database changes flow from the WAL to multiple downstream consumers.

CDC Implementation Approaches

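| Approach | How It Works | Latency | Overhead | Use When |
| --- | --- | --- | --- | --- |
| Log-based CDC | Reads the DB transaction log (WAL, binlog) | Sub-second | Very low | Production systems; preferred approach |
| Trigger-based CDC | DB triggers write changes to a shadow table | Low | Moderate write overhead | Simple setups without log access |
| Query-based polling | Periodically runs SELECT ... WHERE updated_at > last_poll | Seconds to minutes | DB query load | Simple legacy systems; limited, since hard deletes are invisible to the query |
| Dual write with capture | Application writes to both DB and event stream | No capture lag | None on the DB (logic lives in the app) | When DB log access is unavailable; the two writes are not atomic, so the DB and the stream can diverge |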
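To make the trigger-based row concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `orders_changes` and its columns are hypothetical, chosen only for illustration; a production setup would use the trigger syntax of its own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);

-- Shadow table (hypothetical name): one row per captured change.
CREATE TABLE orders_changes (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    op         TEXT NOT NULL,
    id         TEXT NOT NULL,
    old_status TEXT,
    new_status TEXT
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, id, old_status, new_status)
    VALUES ('c', NEW.id, NULL, NEW.status);
END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, id, old_status, new_status)
    VALUES ('u', NEW.id, OLD.status, NEW.status);
END;

CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, id, old_status, new_status)
    VALUES ('d', OLD.id, OLD.status, NULL);
END;
""")


def read_changes(conn, after_seq=0):
    """Return captured changes with seq greater than after_seq, in order."""
    return conn.execute(
        "SELECT seq, op, id, old_status, new_status "
        "FROM orders_changes WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


# Ordinary writes; the triggers record each one in the shadow table.
conn.execute("INSERT INTO orders VALUES ('order-123', 'pending')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 'order-123'")
conn.execute("DELETE FROM orders WHERE id = 'order-123'")
conn.commit()

changes = read_changes(conn)
for change in changes:
    print(change)
```

The shadow table still records the delete even though the row is gone from `orders`, which a query on `updated_at` could not observe. The cost is that every write to `orders` now performs a second insert inside the same transaction.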

Debezium: Log-Based CDC in Practice

Debezium is a widely used open-source CDC platform. It is most commonly deployed as a set of Kafka Connect connectors that tail the database transaction log and publish change events to Kafka topics, by default one topic per table. Each row change becomes a structured event with the before and after state, the operation type, the table name, and a timestamp.

Sample Debezium change event for a Postgres UPDATE:

```json
{
  "before": {
    "id": "order-123",
    "status": "pending",
    "total": 49.99
  },
  "after": {
    "id": "order-123",
    "status": "shipped",
    "total": 49.99
  },
  "op": "u",
  "ts_ms": 1708300800000,
  "source": {
    "db": "orders_db",
    "table": "orders",
    "txId": 789012,
    "lsn": 24567890
  }
}
```

The `op` field is `"c"` for create, `"u"` for update, `"d"` for delete, or `"r"` for a read emitted during a snapshot. `source.lsn` is the Log Sequence Number, the event's position in the WAL.
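As a rough sketch of how a consumer might inspect an event shaped like the sample above, the following Python parses the payload and reports which columns changed. The helper name `changed_fields` is hypothetical, and the event is trimmed to the fields the sketch uses.

```python
import json

OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "read"}


def changed_fields(event):
    """Map each column whose value differs to a (before, after) pair."""
    before = event.get("before") or {}
    after = event.get("after") or {}
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(before.keys() | after.keys())
        if before.get(key) != after.get(key)
    }


raw = """
{
  "before": {"id": "order-123", "status": "pending", "total": 49.99},
  "after":  {"id": "order-123", "status": "shipped", "total": 49.99},
  "op": "u",
  "ts_ms": 1708300800000,
  "source": {"db": "orders_db", "table": "orders", "txId": 789012, "lsn": 24567890}
}
"""

event = json.loads(raw)
summary = (OP_NAMES[event["op"]], event["source"]["table"], changed_fields(event))
print(summary)
```

Treating a missing `before` or `after` as an empty mapping lets the same helper handle creates and deletes, where one side of the event is null.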

Use Cases for CDC

  • Search index synchronization — Keep an Elasticsearch or Solr index up to date with a Postgres source table. When a product is updated in Postgres, Debezium publishes the change, and a Kafka consumer updates the Elasticsearch document.
  • Data warehouse ETL — Replicate OLTP database changes to a data warehouse (Snowflake, BigQuery) in near-real time, replacing slow nightly batch ETL jobs.
  • Cache invalidation — When source data changes, CDC events trigger cache eviction in Redis, so entries are evicted when the underlying row actually changes rather than only when a fixed TTL expires.
  • CQRS read model updates — Maintain read-model projections by consuming CDC events from the write-side database.
  • Microservice data replication — Replicate data from one service's database to another service's local copy for autonomous querying.
  • Transactional outbox relay — A common production implementation of the outbox pattern: Debezium captures inserts into the outbox table and its outbox event router forwards them as messages.
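Several of the use cases above (search index sync, cache maintenance, read-model updates) reduce to the same consumer logic: upsert on create, update, and snapshot read; remove on delete. The sketch below uses a plain dictionary as a stand-in for the downstream store and assumes every row carries an `id` key; `apply_change` is a hypothetical helper, not part of any library.

```python
def apply_change(index, event):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # Creates, updates, and snapshot reads are all upserts by key.
        row = event["after"]
        index[row["id"]] = row
    elif op == "d":
        # Deleting a key that is already absent is a no-op.
        index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


update = {
    "op": "u",
    "before": {"id": "p-1", "name": "Mug"},
    "after": {"id": "p-1", "name": "Large Mug"},
}
events = [
    {"op": "c", "before": None, "after": {"id": "p-1", "name": "Mug"}},
    update,
    update,  # simulated redelivery of the same event
    {"op": "c", "before": None, "after": {"id": "p-2", "name": "Plate"}},
    {"op": "d", "before": {"id": "p-2", "name": "Plate"}, "after": None},
]

index = {}
for event in events:
    apply_change(index, event)

print(index)
```

Because both branches are keyed by primary key, applying the same event twice leaves the store unchanged, which matters when the event stream delivers at least once.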

Handling Initial Snapshot

When you first set up CDC on an existing database, downstream consumers need the existing data before they can apply live changes. Debezium handles this with an initial snapshot: it reads all existing rows, emitting each one as a read event (`"op": "r"`), and then switches to streaming live changes from the WAL. Consumers can process snapshot rows with the same logic they use for change events, typically treating a read as an upsert.
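The snapshot-then-stream handoff can be sketched as a toy model. This is a simplification under stated assumptions, not Debezium's actual algorithm: the change log is a Python list, and `snapshot_position` is the log offset recorded when the snapshot was taken, so entries before it are already reflected in the snapshot rows.

```python
def snapshot_then_stream(table_rows, change_log, snapshot_position):
    """Yield read events for existing rows, then changes logged after the snapshot."""
    # Phase 1: emit every existing row as a read ("r") event.
    for row in table_rows:
        yield {"op": "r", "before": None, "after": row}
    # Phase 2: resume from the recorded position so no later change is skipped.
    for event in change_log[snapshot_position:]:
        yield event


# Table contents at snapshot time (the first two log entries are already applied).
table_rows = [
    {"id": "order-1", "status": "shipped"},
    {"id": "order-2", "status": "pending"},
]
change_log = [
    {"op": "c", "before": None, "after": {"id": "order-1", "status": "pending"}},
    {"op": "u", "before": {"id": "order-1", "status": "pending"},
     "after": {"id": "order-1", "status": "shipped"}},
    # Written after the snapshot position was recorded:
    {"op": "c", "before": None, "after": {"id": "order-2", "status": "pending"}},
    {"op": "u", "before": {"id": "order-2", "status": "pending"},
     "after": {"id": "order-2", "status": "paid"}},
]

emitted = list(snapshot_then_stream(table_rows[:1], change_log, snapshot_position=2))
print([e["op"] for e in emitted])
```

The design point is that the log position is fixed before rows are read: anything written afterwards arrives in the streaming phase, so the consumer sees each later change exactly where the snapshot left off.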

ℹ️

CDC Requires Database Configuration

Log-based CDC requires specific database settings. For Postgres, you need `wal_level = logical`. For MySQL, you need `binlog_format = ROW`. These settings may require a database restart and are not always available on managed cloud databases. Always verify CDC support before committing to this approach in your architecture.
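A preflight check for these settings might look like the sketch below. It assumes the current values have already been fetched, for example with `SHOW wal_level` on Postgres or `SHOW VARIABLES LIKE 'binlog_format'` on MySQL; the function name and the shape of the settings mapping are hypothetical, and only the two settings named above are checked.

```python
# Settings named in this lesson; real deployments usually need more than these.
REQUIRED_SETTINGS = {
    "postgres": {"wal_level": "logical"},
    "mysql": {"binlog_format": "ROW"},
}


def missing_cdc_settings(engine, current):
    """Return {setting: (actual, expected)} for each setting that does not match."""
    problems = {}
    for name, expected in REQUIRED_SETTINGS[engine].items():
        actual = current.get(name)
        # Compare case-insensitively; a missing setting counts as a mismatch.
        if actual is None or str(actual).lower() != expected.lower():
            problems[name] = (actual, expected)
    return problems


postgres_problems = missing_cdc_settings("postgres", {"wal_level": "replica"})
mysql_problems = missing_cdc_settings("mysql", {"binlog_format": "ROW"})
print(postgres_problems)
print(mysql_problems)
```

Running a check like this before deployment surfaces a wrong setting early, while a restart or a managed-database parameter change is still cheap to schedule.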

Real-World Examples

LinkedIn built Databus, an internal CDC system that predates Debezium. Airbnb uses CDC to synchronize listing data from MySQL to their search infrastructure. Shopify uses CDC to keep Elasticsearch product indexes in sync. Netflix uses CDC for its data mesh, where hundreds of microservices' databases feed into central streaming pipelines.

💡

Interview Tip

CDC is often the right answer when an interviewer asks "how do you keep your Elasticsearch index in sync with your database?" or "how do you keep a read model up to date without polling?" Frame it as three steps: watch the database's transaction log, publish changes as events, and consume those events to update downstream stores. Mention Debezium as the de facto tool and note that it requires database-level configuration (logical replication for Postgres).
