Pinterest Engineering · February 5, 2026

Designing Pinterest's Real-time CDC-based Database Ingestion Framework

This article details Pinterest's journey in building a unified, real-time database ingestion framework using Change Data Capture (CDC). It addresses the challenges of their legacy batch systems, outlines the new architecture leveraging Kafka, Flink, Spark, and Iceberg, and discusses key optimizations for performance, cost efficiency, and data compliance at petabyte scale.


The Need for a Unified Real-time Ingestion System

Pinterest's growth led to a critical demand for a robust, real-time, and cost-effective database ingestion platform. The existing batch-oriented, fragmented landscape suffered from high data latency (often >24 hours), inefficient full-table processing for minor changes, lack of row-level deletion support (hindering compliance), and significant operational complexity due to disparate pipelines. These issues severely impacted analytics, machine learning, and product features requiring timely data.

Architectural Principles and Components

The new framework is built on a Change Data Capture (CDC) paradigm, aiming for a unified, generic, reliable, low-latency, scalable, and config-driven solution. It integrates several core technologies:

  • <b>Source Databases:</b> MySQL, TiDB, and KVStore.
  • <b>CDC Layer:</b> A custom service captures database changes and streams them to Kafka with sub-second latency.
  • <b>Streaming Layer:</b> Flink jobs process CDC events in near real-time, persisting them into append-only CDC Iceberg tables on S3.
  • <b>Batch Layer:</b> Periodically, Spark jobs read from CDC tables and perform upserts on Base Iceberg tables using `MERGE INTO` statements.
  • <b>Bootstrap Pipeline:</b> Initializes Base Iceberg tables with historical data from DB dumps.
  • <b>Maintenance Jobs:</b> Handles tasks like compaction and snapshot expiration for Iceberg tables.
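To make the flow through these components concrete, the sketch below models a CDC change event and the append-only semantics of the streaming layer. The event fields (`op`, `key`, `after`) and the `publish` helper are illustrative assumptions, not Pinterest's actual schema or APIs.

```python
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class CdcEvent:
    op: str                       # "insert", "update", or "delete"
    table: str                    # source table name
    key: int                      # primary key of the affected row
    after: Optional[dict] = None  # row image after the change (None for deletes)
    ts_ms: int = field(default_factory=lambda: int(time.time() * 1000))

# The streaming layer only ever appends to the CDC table; it never mutates it.
cdc_ledger: list = []

def publish(event: CdcEvent) -> None:
    """Stand-in for the Kafka produce + Flink append into the CDC Iceberg table."""
    cdc_ledger.append(event)

publish(CdcEvent("insert", "pins", key=1, after={"id": 1, "title": "cats"}))
publish(CdcEvent("update", "pins", key=1, after={"id": 1, "title": "dogs"}))
publish(CdcEvent("delete", "pins", key=1))
```

Because the ledger is append-only, downstream consumers can replay any window of changes idempotently, which is what enables the incremental batch upserts described next.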

CDC Table vs. Base Table

The architecture distinguishes between two types of Iceberg tables: <b>CDC tables</b>, which are append-only time-series ledgers of all change events (latency <5 mins), and <b>Base tables</b>, which mirror the online database, preserving historical records and supporting upserts (latency 15 mins to an hour). This separation allows for efficient incremental processing and historical data retention.
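The relationship between the two table types can be illustrated with a small replay: the Base table is what you get by folding the append-only CDC ledger, key by key, with the last change winning and deletes removing rows (the row-level deletion needed for compliance). The event shape here is a simplified assumption.

```python
def upsert_into_base(base: dict, cdc_events: list) -> dict:
    """Apply CDC events in order: the last change per key wins,
    and deletes remove the row entirely."""
    for ev in cdc_events:
        if ev["op"] == "delete":
            base.pop(ev["key"], None)
        else:  # insert and update behave identically under upsert semantics
            base[ev["key"]] = ev["row"]
    return base

base_table = {1: {"id": 1, "title": "cats"}}
ledger = [
    {"op": "update", "key": 1, "row": {"id": 1, "title": "dogs"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "title": "birds"}},
    {"op": "delete", "key": 1, "row": None},
]
upsert_into_base(base_table, ledger)
# base_table now holds only key 2: key 1 was updated, then deleted
```

In the real system this fold is performed by the periodic Spark `MERGE INTO` job rather than in-memory, but the semantics are the same.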

Key Optimizations for Performance and Cost Efficiency

Several crucial optimizations were implemented to enhance the efficiency of the Upsert operation and overall system performance:

  • <b>Merge-on-Read (MOR) Strategy:</b> Standardized on MOR for Iceberg's `MERGE INTO` operations, as Copy-on-Write (COW) incurred significantly higher storage costs for Pinterest's use cases.
  • <b>Partitioning Base Table by Primary Key Hash:</b> Using `bucket(N, id)` evenly distributes records, allowing Spark to parallelize updates across partitions, drastically improving upsert performance by reducing data scans and rewrites.
  • <b>Addressing the Small Files Problem:</b> Set the Iceberg table write property `WRITE DISTRIBUTED BY PARTITION` so that Spark routes all rows destined for the same partition to the same writer task. This significantly reduced the number of small files generated during upserts, which would otherwise degrade read performance.
  • <b>Bucket Join for Large Tables:</b> For very large CDC and Base tables, standard `MERGE INTO` joins were prohibitively expensive due to full-table shuffles. The workaround buckets a temporary CDC table by primary key hash so that Spark can use an efficient bucket join, bypassing the full shuffle and reducing compute costs by over 40%.
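The bucketing optimizations above share one idea: hash the primary key into N buckets so that matching CDC and Base rows are co-located, letting each bucket be merged independently with no cross-partition shuffle. The sketch below simulates this in plain Python; the modulo hash is a stand-in for Iceberg's actual Murmur3-based `bucket` transform, and the row shapes are illustrative.

```python
N_BUCKETS = 4

def bucket(n: int, key: int) -> int:
    # Stand-in for Iceberg's bucket(N, id) transform (which uses Murmur3)
    return hash(key) % n

def group_by_bucket(rows, n=N_BUCKETS):
    """Partition rows by the bucket of their primary key."""
    groups = {b: [] for b in range(n)}
    for row in rows:
        groups[bucket(n, row["id"])].append(row)
    return groups

base_rows = [{"id": i, "v": "old"} for i in range(8)]
cdc_rows = [{"id": 3, "v": "new"}, {"id": 5, "v": "new"}]

base_by_bucket = group_by_bucket(base_rows)
cdc_by_bucket = group_by_bucket(cdc_rows)

# Each bucket merges independently (in Spark, in parallel tasks);
# buckets with no CDC rows are never scanned or rewritten.
for b, changes in cdc_by_bucket.items():
    changed = {r["id"]: r for r in changes}
    base_by_bucket[b] = [changed.get(r["id"], r) for r in base_by_bucket[b]]
```

Because both sides are partitioned by the same key transform, the join for one bucket never needs rows from another bucket, which is exactly what lets the bucket join skip the full-table shuffle.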
