Pinterest engineered a next-generation data ingestion framework using Change Data Capture (CDC) to overcome the limitations of its legacy batch processing systems. The new architecture reduces data latency from over 24 hours to roughly 15 minutes, enabling near real-time analytics and machine learning applications. By processing only changed records, the system achieves substantial cost savings and improved operational efficiency on petabyte-scale datasets.
Pinterest's previous data infrastructure struggled with high latency, operational complexity, and inefficient resource utilization due to its reliance on multiple, independently maintained batch pipelines and full-table processing. This architecture resulted in data delays exceeding 24 hours, impacting critical use cases like analytics and machine learning that demanded fresher data.
The new framework leverages Change Data Capture (CDC) with Debezium/TiCDC to capture database changes, streaming them through Kafka. Flink and Spark process these streams, and Iceberg tables on AWS S3 serve as the unified data lake. This architecture supports various data sources (MySQL, TiDB, KVStore) and is configuration-driven for easy integration.
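To make the capture step concrete, the sketch below normalizes a Debezium-style change-event envelope (the `before`/`after`/`op`/`ts_ms` fields are Debezium's documented format) as a downstream consumer might before writing to a CDC table. The topic, table, and column names are illustrative assumptions, not Pinterest's actual configuration.

```python
import json

def parse_cdc_event(raw: bytes) -> dict:
    """Flatten a Debezium-style change event into a single record.

    op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read.
    For deletes, only the "before" image carries the row's key and values.
    """
    event = json.loads(raw)
    op = event["op"]
    row = event["before"] if op == "d" else event["after"]
    return {"op": op, "ts_ms": event["ts_ms"], **row}

# Example: an update event for a hypothetical `pins` table
raw_event = json.dumps({
    "before": {"pin_id": 42, "title": "old title"},
    "after": {"pin_id": 42, "title": "new title"},
    "op": "u",
    "ts_ms": 1700000000000,
}).encode()

record = parse_cdc_event(raw_event)
print(record)  # {'op': 'u', 'ts_ms': 1700000000000, 'pin_id': 42, 'title': 'new title'}
```

In the real pipeline, records like this would arrive from Kafka and be appended to the CDC table by a Flink or Spark job rather than parsed one at a time.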
Key Architectural Components
The system separates CDC tables (append-only ledgers for change events with sub-5-minute latency) from base tables (full historical snapshots). Base tables are updated incrementally via Spark MERGE INTO operations, typically every 15 minutes to an hour.
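The reconciliation step can be sketched as follows. The Spark SQL shape in the comment mirrors a generic MERGE INTO between a CDC table and a base table, and the Python function emulates its semantics on in-memory dictionaries; all table, column, and watermark names here are assumptions for illustration.

```python
# Roughly equivalent Spark SQL (names are hypothetical):
#
#   MERGE INTO base_table t
#   USING (SELECT * FROM cdc_table WHERE ts_ms > :last_watermark) s
#   ON t.pin_id = s.pin_id
#   WHEN MATCHED AND s.op = 'd' THEN DELETE
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED AND s.op != 'd' THEN INSERT *

def merge_into(base: dict, changes: list) -> dict:
    """Apply CDC change events, in commit order, onto a base-table snapshot keyed by id."""
    for ch in sorted(changes, key=lambda c: c["ts_ms"]):
        key = ch["pin_id"]
        if ch["op"] == "d":
            base.pop(key, None)    # WHEN MATCHED AND op = 'd' THEN DELETE
        else:
            base[key] = ch["row"]  # UPDATE on match, INSERT otherwise
    return base

base = {1: {"title": "a"}, 2: {"title": "b"}}
changes = [
    {"pin_id": 2, "op": "u", "ts_ms": 2, "row": {"title": "b2"}},
    {"pin_id": 3, "op": "c", "ts_ms": 1, "row": {"title": "c"}},
    {"pin_id": 1, "op": "d", "ts_ms": 3, "row": None},
]
result = merge_into(base, changes)
print(result)  # {2: {'title': 'b2'}, 3: {'title': 'c'}}
```

Running the merge on a 15-minute-to-hourly cadence is what lets the base table lag only slightly behind the low-latency CDC ledger while touching only changed keys.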
Pinterest evaluated Iceberg's Copy-on-Write (COW) and Merge-on-Read (MOR) strategies for updating base tables. COW rewrites entire data files, leading to higher storage and compute costs for updates. MOR writes changes to separate files and applies them during read time, reducing write amplification. Pinterest standardized on MOR to manage infrastructure costs at petabyte scale, as COW's storage overhead outweighed its benefits for most workloads.
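The write-amplification difference between the two strategies can be illustrated with a toy model. Plain Python lists stand in for Iceberg data, delete, and delta files here; this is a conceptual sketch of the trade-off, not Iceberg's actual file format or API.

```python
def cow_update(data_files, key, new_row):
    """Copy-on-Write: any file containing the key is rewritten in full."""
    rewritten = []
    for f in data_files:
        if any(r["id"] == key for r in f):
            rewritten.append([new_row if r["id"] == key else r for r in f])
        else:
            rewritten.append(f)
    return rewritten

def mor_update(delete_file, delta_file, key, new_row):
    """Merge-on-Read: base files stay untouched; only small side files grow."""
    delete_file.append(key)      # mask the old version of the row
    delta_file.append(new_row)   # new version lands in a small insert file

def mor_read(data_files, delete_file, delta_file):
    """Apply deletes and deltas at read time to reconstruct the current table."""
    live = [r for f in data_files for r in f if r["id"] not in delete_file]
    return live + delta_file

files = [[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], [{"id": 3, "v": "c"}]]
deletes, deltas = [], []
mor_update(deletes, deltas, 2, {"id": 2, "v": "b2"})
print(mor_read(files, deletes, deltas))
# [{'id': 1, 'v': 'a'}, {'id': 3, 'v': 'c'}, {'id': 2, 'v': 'b2'}]
```

The model shows why MOR wins at petabyte scale: a single-row update under COW rewrites a whole data file, while under MOR it appends two small entries, deferring the merge cost to readers (and to periodic compaction).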