This article details Pinterest's journey in building a unified, real-time database ingestion framework using Change Data Capture (CDC). It addresses the challenges of their legacy batch systems, outlines the new architecture leveraging Kafka, Flink, Spark, and Iceberg, and discusses key optimizations for performance, cost efficiency, and data compliance at petabyte scale.
Pinterest's growth led to a critical demand for a robust, real-time, and cost-effective database ingestion platform. The existing batch-oriented, fragmented landscape suffered from high data latency (often >24 hours), inefficient full-table processing for minor changes, lack of row-level deletion support (hindering compliance), and significant operational complexity due to disparate pipelines. These issues severely impacted analytics, machine learning, and product features requiring timely data.
The new framework is built on a Change Data Capture (CDC) paradigm, aiming for a unified, generic, reliable, low-latency, scalable, and config-driven solution. It integrates several core technologies: Kafka for streaming change events, Flink for low-latency stream processing, Spark for batch processing, and Iceberg as the table format.
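To make the CDC paradigm concrete, the sketch below models a single change event as it might flow through Kafka into the ingestion pipeline. This is an illustrative envelope only: the field names (`op`, `key`, `before`, `after`, `ts_ms`) are assumptions in the spirit of common CDC formats, not Pinterest's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical CDC event envelope; field names are illustrative,
# not Pinterest's actual wire format.
@dataclass
class ChangeEvent:
    op: str                 # "INSERT", "UPDATE", or "DELETE"
    key: int                # primary key of the affected row
    before: Optional[dict]  # row image before the change (None for INSERT)
    after: Optional[dict]   # row image after the change (None for DELETE)
    ts_ms: int              # commit timestamp from the source database

# An append-only CDC table is simply a time-ordered ledger of such events.
ledger = [
    ChangeEvent("INSERT", 1, None, {"id": 1, "name": "pin"}, 1000),
    ChangeEvent("UPDATE", 1, {"id": 1, "name": "pin"},
                {"id": 1, "name": "board"}, 2000),
]
```

Carrying both before and after row images is what later enables row-level deletion and incremental upserts downstream, since each event is self-describing.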
CDC Table vs. Base Table
The architecture distinguishes between two types of Iceberg tables: <b>CDC tables</b>, which are append-only, time-series ledgers of all change events (latency under 5 minutes), and <b>Base tables</b>, which mirror the online database, preserve historical records, and support upserts (latency of 15 minutes to an hour). This separation enables efficient incremental processing alongside historical data retention.
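The relationship between the two table types can be sketched as replaying the append-only ledger to materialize the base snapshot: inserts and updates upsert the latest row image by primary key, and deletes remove the row. This is a minimal, in-memory illustration of the concept, not Pinterest's Flink/Spark implementation; the function and field names are hypothetical.

```python
def materialize_base(events):
    """Derive the base-table snapshot by replaying the CDC ledger in
    commit order: INSERT/UPDATE upsert the row, DELETE removes it."""
    base = {}
    for ev in sorted(events, key=lambda e: e["ts_ms"]):
        if ev["op"] == "DELETE":
            base.pop(ev["key"], None)      # row-level deletion (compliance)
        else:
            base[ev["key"]] = ev["after"]  # upsert the latest row image
    return base

# A toy ledger of change events, ordered by commit timestamp.
ledger = [
    {"op": "INSERT", "key": 1, "after": {"id": 1, "name": "pin"},   "ts_ms": 1},
    {"op": "INSERT", "key": 2, "after": {"id": 2, "name": "board"}, "ts_ms": 2},
    {"op": "UPDATE", "key": 1, "after": {"id": 1, "name": "repin"}, "ts_ms": 3},
    {"op": "DELETE", "key": 2, "after": None,                       "ts_ms": 4},
]
print(materialize_base(ledger))  # {1: {'id': 1, 'name': 'repin'}}
```

Because the base table is derivable from the ledger, the pipeline can recompute or incrementally advance it on its own schedule, which is why the base table tolerates higher latency than the CDC table.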
Several crucial optimizations were implemented to enhance the efficiency of the Upsert operation and overall system performance: