This article details Pinterest's approach to automated schema evolution within their next-generation change data capture (CDC) ingestion framework, built on Kafka, Flink, Spark, and Iceberg. It highlights the challenges of schema changes in a multi-stage distributed pipeline and outlines a solution that prioritizes availability and eventual consistency through a phased convergence model. The system supports additive-only schema changes to maintain backward compatibility and provides clear recovery paths for unsupported or ambiguous cases.
Read original on Pinterest EngineeringSchema evolution in large-scale distributed systems, particularly those relying on Change Data Capture (CDC), presents significant challenges. A single schema change can impact multiple tightly coupled stages, from data ingestion and transformation to storage and historical backfill. Pinterest's next-generation ingestion platform, leveraging Kafka, Flink, Spark, and Iceberg, faced these complexities, necessitating an automated framework to manage schema changes safely and scalably across its pipeline.
Pinterest's solution treats schema evolution not as an atomic operation, but as a multi-stage convergence process across both the control plane and data plane. This design choice is crucial for preserving pipeline availability while gradually restoring schema and data correctness. The system automates the propagation of supported schema changes across Kafka, Flink, Spark, and Iceberg, incorporating PR-based rollouts with versioning and auditing, and ensuring SLA-based eventual consistency between online and offline schemas.
The framework relies on a dedicated schema definition file, acting as the source of truth, where each column has a stable numeric identifier to track changes unambiguously. This, combined with a sink configuration (defining mappings, type overrides, and conversion functions), drives the automated generation of Flink transformation code, Spark writer code, Iceberg table definitions, and associated queries. This ensures consistency between initial onboarding and subsequent schema evolutions.
Trade-offs in Schema Evolution
To maintain reliability and operational manageability, Pinterest deliberately restricts automated schema evolution to additive changes only. This preserves backward compatibility, avoids complex historical data replay, and minimizes risks to existing consumers. Type changes are highly restricted, generally limited to widening numeric precision, as other changes could introduce widespread incompatibilities and necessitate manual intervention or full data backfills.
A key innovation is the three-phase convergence model, which decouples schema propagation from immediate data correctness, allowing for temporary, bounded divergence:
The system includes mechanisms for challenging scenarios: new columns with default values (manual bootstrap needed), sensitive data changes (blocked, requires pipeline migration), primary key changes (partially automated, may need manual intervention), ambiguous `CREATE TABLE` Diffs (resolved using a two-layer strategy combining build-time static analysis with deployment-time audit trails grounded in database DDL history), and concurrent schema changes (sequentially queued).