Pinterest engineered a next-generation data ingestion framework using Change Data Capture (CDC) to overcome the limitations of its legacy batch processing systems. The new architecture reduces data latency from over 24 hours to roughly 15 minutes, enabling near real-time analytics and machine learning applications. By processing only changed records, the system achieves substantial cost savings and improved operational efficiency on petabyte-scale datasets.
Pinterest's previous data infrastructure struggled with high latency, operational complexity, and inefficient resource utilization due to its reliance on multiple, independently maintained batch pipelines and full-table processing. This architecture resulted in data delays exceeding 24 hours, impacting critical use cases like analytics and machine learning that demanded fresher data.
The new framework leverages Change Data Capture (CDC) with Debezium/TiCDC to capture database changes, streaming them through Kafka. Flink and Spark process these streams, and Iceberg tables on AWS S3 serve as the unified data lake. This architecture supports various data sources (MySQL, TiDB, KVStore) and is configuration-driven for easy integration.
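To make the capture step concrete, the sketch below normalizes a Debezium-style change-event envelope (the `before`/`after`/`op`/`ts_ms` fields are Debezium's documented format) as a downstream consumer might before writing to a CDC table. The topic, table, and column names are illustrative assumptions, not Pinterest's actual configuration.

```python
import json

def parse_cdc_event(raw: bytes) -> dict:
    """Flatten a Debezium-style change event into a single record.

    op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read.
    For deletes, only the "before" image carries the row's key and values.
    """
    event = json.loads(raw)
    op = event["op"]
    row = event["before"] if op == "d" else event["after"]
    return {"op": op, "ts_ms": event["ts_ms"], **row}

# Example: an update event for a hypothetical `pins` table
raw_event = json.dumps({
    "before": {"pin_id": 42, "title": "old title"},
    "after": {"pin_id": 42, "title": "new title"},
    "op": "u",
    "ts_ms": 1700000000000,
}).encode()

record = parse_cdc_event(raw_event)
print(record)  # {'op': 'u', 'ts_ms': 1700000000000, 'pin_id': 42, 'title': 'new title'}
```

In the real pipeline, records like this would arrive from Kafka and be appended to the CDC table by a Flink or Spark job rather than parsed one at a time.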
Key Architectural Components
The system separates CDC tables (append-only ledgers for change events with sub-5-minute latency) from base tables (full historical snapshots). Base tables are updated incrementally via Spark MERGE INTO operations, typically every 15 minutes to an hour.
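The reconciliation step can be sketched as follows. The Spark SQL shape in the comment mirrors a generic MERGE INTO between a CDC table and a base table, and the Python function emulates its semantics on in-memory dictionaries; all table, column, and watermark names here are assumptions for illustration.

```python
# Roughly equivalent Spark SQL (names are hypothetical):
#
#   MERGE INTO base_table t
#   USING (SELECT * FROM cdc_table WHERE ts_ms > :last_watermark) s
#   ON t.pin_id = s.pin_id
#   WHEN MATCHED AND s.op = 'd' THEN DELETE
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED AND s.op != 'd' THEN INSERT *

def merge_into(base: dict, changes: list) -> dict:
    """Apply CDC change events, in commit order, onto a base-table snapshot keyed by id."""
    for ch in sorted(changes, key=lambda c: c["ts_ms"]):
        key = ch["pin_id"]
        if ch["op"] == "d":
            base.pop(key, None)    # WHEN MATCHED AND op = 'd' THEN DELETE
        else:
            base[key] = ch["row"]  # UPDATE on match, INSERT otherwise
    return base

base = {1: {"title": "a"}, 2: {"title": "b"}}
changes = [
    {"pin_id": 2, "op": "u", "ts_ms": 2, "row": {"title": "b2"}},
    {"pin_id": 3, "op": "c", "ts_ms": 1, "row": {"title": "c"}},
    {"pin_id": 1, "op": "d", "ts_ms": 3, "row": None},
]
result = merge_into(base, changes)
print(result)  # {2: {'title': 'b2'}, 3: {'title': 'c'}}
```

Running the merge on a 15-minute-to-hourly cadence is what lets the base table lag only slightly behind the low-latency CDC ledger while touching only changed keys.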
Pinterest evaluated Iceberg's Copy-on-Write (COW) and Merge-on-Read (MOR) strategies for updating base tables. COW rewrites entire data files, leading to higher storage and compute costs for updates. MOR writes changes to separate files and applies them during read time, reducing write amplification. Pinterest standardized on MOR to manage infrastructure costs at petabyte scale, as COW's storage overhead outweighed its benefits for most workloads.
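The write-amplification difference between the two strategies can be illustrated with a toy model. Plain Python lists stand in for Iceberg data, delete, and delta files here; this is a conceptual sketch of the trade-off, not Iceberg's actual file format or API.

```python
def cow_update(data_files, key, new_row):
    """Copy-on-Write: any file containing the key is rewritten in full."""
    rewritten = []
    for f in data_files:
        if any(r["id"] == key for r in f):
            rewritten.append([new_row if r["id"] == key else r for r in f])
        else:
            rewritten.append(f)
    return rewritten

def mor_update(delete_file, delta_file, key, new_row):
    """Merge-on-Read: base files stay untouched; only small side files grow."""
    delete_file.append(key)      # mask the old version of the row
    delta_file.append(new_row)   # new version lands in a small insert file

def mor_read(data_files, delete_file, delta_file):
    """Apply deletes and deltas at read time to reconstruct the current table."""
    live = [r for f in data_files for r in f if r["id"] not in delete_file]
    return live + delta_file

files = [[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], [{"id": 3, "v": "c"}]]
deletes, deltas = [], []
mor_update(deletes, deltas, 2, {"id": 2, "v": "b2"})
print(mor_read(files, deletes, deltas))
# [{'id': 1, 'v': 'a'}, {'id': 3, 'v': 'c'}, {'id': 2, 'v': 'b2'}]
```

The model shows why MOR wins at petabyte scale: a single-row update under COW rewrites a whole data file, while under MOR it appends two small entries, deferring the merge cost to readers (and to periodic compaction).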