InfoQ Architecture·February 26, 2026

Pinterest's CDC-Powered Data Ingestion for Real-time Analytics at Scale

Pinterest engineered a next-generation data ingestion framework using Change Data Capture (CDC) to overcome the limitations of their legacy batch processing systems. This new architecture significantly reduces data latency from over 24 hours to 15 minutes, enabling near real-time analytics and machine learning applications. By processing only changed records, the system achieves substantial cost savings and improved operational efficiency across petabyte-scale data.


Pinterest's previous data infrastructure struggled with high latency, operational complexity, and inefficient resource utilization due to its reliance on multiple, independently maintained batch pipelines and full-table processing. This architecture resulted in data delays exceeding 24 hours, impacting critical use cases like analytics and machine learning that demanded fresher data.

Challenges of Legacy Batch Ingestion

  • Data Latency: Over 24 hours, delaying time-sensitive insights and ML model training.
  • Resource Inefficiency: Full-table batch jobs reprocessed unchanged records (daily changes often below 5%), wasting compute and storage.
  • Lack of Native Deletion Support: Row-level deletions were not natively handled.
  • Operational Fragmentation: Inconsistent data quality and high maintenance overhead due to disparate pipelines.

Next-Generation CDC-Powered Architecture

The new framework leverages Change Data Capture (CDC) with Debezium/TiCDC to capture database changes, streaming them through Kafka. Flink and Spark process these streams, and Iceberg tables on AWS S3 serve as the unified data lake. This architecture supports various data sources (MySQL, TiDB, KVStore) and is configuration-driven for easy integration.
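The "configuration-driven" onboarding can be pictured as deriving a full pipeline from a few declarative fields. The sketch below is an illustrative assumption, not Pinterest's actual schema; field names such as `source_type`, `kafka_topic`, and `merge_interval_minutes` are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IngestionConfig:
    source_type: str             # e.g. "mysql", "tidb", "kvstore"
    database: str
    table: str
    kafka_topic: str             # topic carrying Debezium/TiCDC change events
    merge_interval_minutes: int  # how often CDC events merge into the base table

def make_config(source_type: str, database: str, table: str,
                merge_interval_minutes: int = 15) -> IngestionConfig:
    """Derive a pipeline config from a few declarative fields, so onboarding
    a new table is a config change rather than a new bespoke pipeline."""
    topic = f"cdc.{source_type}.{database}.{table}"
    return IngestionConfig(source_type, database, table, topic,
                           merge_interval_minutes)

cfg = make_config("mysql", "pins", "boards")
print(cfg.kafka_topic)  # cdc.mysql.pins.boards
```

In this style, adding a new source table means registering one config entry; the framework resolves the connector, topic, and merge cadence from it.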


Key Architectural Components

The system separates CDC tables (append-only ledgers for change events with sub-5-minute latency) from base tables (full historical snapshots). Base tables are updated incrementally via Spark MERGE INTO operations, typically every 15 minutes to an hour.
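The merge step can be sketched in miniature: fold an ordered batch of change events from the append-only ledger into a keyed snapshot, with deletes dropping rows outright (the native deletion support the legacy batch pipelines lacked). The real system does this with Spark MERGE INTO on Iceberg tables; the event shape (`op`/`key`/`row`) below is an illustrative assumption.

```python
def merge_into(base: dict, cdc_events: list) -> dict:
    """Apply an append-only batch of change events to a keyed snapshot.
    Events are applied in commit order, so later events win; deletes
    remove the row instead of leaving a tombstone the reader must skip."""
    for ev in cdc_events:
        if ev["op"] == "delete":
            base.pop(ev["key"], None)
        else:  # "insert" and "update" are both upserts on the primary key
            base[ev["key"]] = ev["row"]
    return base

base = {1: {"name": "a"}, 2: {"name": "b"}}
events = [
    {"op": "update", "key": 1, "row": {"name": "a2"}},
    {"op": "delete", "key": 2, "row": None},
    {"op": "insert", "key": 3, "row": {"name": "c"}},
]
print(merge_into(base, events))  # {1: {'name': 'a2'}, 3: {'name': 'c'}}
```

Because only the keys present in the batch are touched, each 15-minute merge does work proportional to the change volume, not the table size.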

Iceberg Merge Strategies: Copy-on-Write vs. Merge-on-Read

Pinterest evaluated Iceberg's Copy-on-Write (COW) and Merge-on-Read (MOR) strategies for updating base tables. COW rewrites entire data files, leading to higher storage and compute costs for updates. MOR writes changes to separate files and applies them during read time, reducing write amplification. Pinterest standardized on MOR to manage infrastructure costs at petabyte scale, as COW's storage overhead outweighed its benefits for most workloads.
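The write-amplification difference is easy to see with a back-of-the-envelope model: COW pays per data file touched, MOR pays per changed row. The numbers below are illustrative assumptions, not Pinterest's measurements.

```python
def bytes_written_cow(file_size: int, files_touched: int) -> int:
    """Copy-on-Write rewrites every data file containing at least one
    changed row, so even a single-row update rewrites a whole file."""
    return file_size * files_touched

def bytes_written_mor(changed_rows: int, row_size: int) -> int:
    """Merge-on-Read writes only the changed rows (plus delete files),
    deferring reconciliation to read time and periodic compaction."""
    return changed_rows * row_size

# ~5% daily churn scattered across many files: COW ends up rewriting
# most touched files, while MOR writes only the deltas.
cow = bytes_written_cow(file_size=512 * 2**20, files_touched=1000)
mor = bytes_written_mor(changed_rows=5_000_000, row_size=200)
print(cow // mor)  # COW writes ~536x more bytes in this scenario
```

The trade-off is read-side cost: MOR readers must merge delta files on the fly until compaction runs, which Pinterest accepted in exchange for much cheaper writes at petabyte scale.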

Optimizations and Outcomes

  • Partitioning base tables by a hash of the primary key using Iceberg bucketing for parallelized upserts and reduced data scans.
  • Distributing Spark writes by partition to mitigate the 'small files problem'.
  • Reduced data latency from >24 hours to ~15 minutes.
  • Processing only ~5% of changed records daily, leading to significant infrastructure cost savings.
  • Handles petabyte-scale data across thousands of pipelines with incremental updates and deletions.
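The first optimization above is the idea behind Iceberg's bucket partition transform: hash the primary key to a stable bucket so every upsert for a given key lands in, and only scans, one partition. Iceberg itself uses a 32-bit Murmur3 hash; the MD5-based stand-in below is an illustrative assumption.

```python
import hashlib

def bucket(primary_key: str, num_buckets: int = 64) -> int:
    """Map a primary key to a stable bucket (stand-in for Iceberg's
    bucket transform). Stability matters: the same key must always hash
    to the same partition, or the merge would scan the whole table."""
    digest = hashlib.md5(primary_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# Distributing Spark writers by bucket parallelizes the merge and keeps
# each task writing one large file per partition instead of many small ones.
keys = ["pin:1", "pin:2", "pin:3"]
print({k: bucket(k) for k in keys})
```

Choosing the bucket count is a balance: too few buckets limits merge parallelism, too many recreates the small-files problem the write distribution is meant to avoid.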