Figma successfully transitioned its data pipeline from a multi-day latency batch process to a real-time, incremental synchronization system. This overhaul, driven by rapid data growth and unsustainable costs, leveraged Change Data Capture (CDC) with Kafka and Snowflake to ensure data freshness, scalability, and significant cost savings, while maintaining high data integrity through rigorous validation.
Figma's initial data pipeline was a simple daily cron job performing full table synchronization: querying all rows, dumping to S3, and loading into Snowflake. While straightforward initially, this approach became a major bottleneck as user data grew, leading to multi-day latencies, synchronization tasks taking hours, and millions in annual costs due to dedicated database replicas needed to handle the export load. This scenario highlights the common scalability issues of naive batch processing for rapidly growing datasets.
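The cost problem is easy to see in a sketch. The following is an illustrative model of the original job's shape (the function and variable names are hypothetical, not Figma's code): every run re-exports the entire table, so work scales with total table size rather than with the day's changes.

```python
"""Sketch of a naive daily full-table sync: query every row, dump to
object storage, reload the warehouse. All names are illustrative."""

def export_full_table(db: dict) -> list:
    # Step 1: query ALL rows, regardless of what changed since yesterday.
    return [{"id": key, **row} for key, row in db.items()]

def daily_sync(db: dict, object_store: dict, warehouse: dict, table: str = "users") -> None:
    dump = export_full_table(db)                          # full scan of the live table
    object_store[f"s3://exports/{table}.json"] = dump     # dump to S3
    warehouse[table] = {row["id"]: row for row in dump}   # full reload into warehouse

# Even if only one row changed today, the whole table is re-exported:
db = {i: {"name": f"user{i}"} for i in range(5)}
s3, snowflake = {}, {}
daily_sync(db, s3, snowflake)
```

Because the export reads every row on every run, the load on the source grows linearly with data volume, which is why dedicated read replicas (and their cost) became necessary.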
Figma chose to overhaul its pipeline by implementing incremental synchronization using Change Data Capture (CDC). Instead of copying entire tables, CDC reads the database's write-ahead log to capture only inserts, updates, and deletes as they occur. These change events are then published to Kafka, a distributed streaming platform, which acts as a durable buffer between the production database (Amazon RDS PostgreSQL) and the analytics warehouse (Snowflake). This decoupling is crucial for resilience and allows consumers to process data at their own pace without impacting the source database.
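The core CDC pattern described above can be sketched in a few lines: change events decoded from the write-ahead log flow through a stream, and a consumer applies each insert, update, or delete to the downstream copy. The event shape here is a simplified stand-in for real WAL-decoded Kafka payloads.

```python
"""Minimal sketch of applying a CDC event stream (illustrative event shape)."""

def apply_change_event(replica: dict, event: dict) -> None:
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]   # upsert the new row image
    elif op == "delete":
        replica.pop(key, None)        # drop the row if present

replica = {}
stream = [  # events arrive in WAL (commit) order
    {"op": "insert", "key": 1, "row": {"name": "ada"}},
    {"op": "update", "key": 1, "row": {"name": "ada lovelace"}},
    {"op": "insert", "key": 2, "row": {"name": "grace"}},
    {"op": "delete", "key": 2},
]
for event in stream:
    apply_change_event(replica, event)
```

The work per run is proportional to the number of changes, not the table size, and with Kafka buffering the stream, the consumer can fall behind and catch up without ever re-scanning the source database.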
Key Components of Figma's Data Pipeline
- Amazon RDS (PostgreSQL): Source database for live user traffic, leveraging its APIs for snapshot exports to S3.
- Change Data Capture (CDC): Mechanism to capture incremental changes from the database's write-ahead log.
- Kafka: Distributed streaming platform for buffering and reliably transporting change events.
- Snowflake: Data warehouse for analytics, consuming events from Kafka and applying incremental merges via stored procedures.
- S3: Object storage for initial full snapshots.
A critical aspect of CDC-based pipelines is ensuring data integrity, especially during the initial load or schema evolution. Figma tackled this by performing an initial full snapshot and ensuring the Kafka CDC stream's start offset *precedes* the snapshot timestamp. The overlap produces some duplicate events, but it guarantees no data is lost; duplicates are discarded during the merge step in Snowflake. For ongoing data quality, Figma developed a rigorous validation workflow that independently bootstraps a separate copy of the data weekly and performs cell-by-cell comparisons against the main pipeline's output, catching silent data corruption that would otherwise go unnoticed.
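The bootstrap-overlap trick works because the merge step is idempotent. A rough sketch of the idea, using a last-writer-wins merge keyed on log position (the `lsn` field and event shape are illustrative; Figma's actual merge runs as Snowflake stored procedures): replaying an event the snapshot already contains is simply a no-op.

```python
"""Sketch of an idempotent merge: load a snapshot, then replay CDC events
from an offset *before* the snapshot time. Duplicates are harmless because
stale log positions are skipped. Field names are illustrative."""

def merge_event(table: dict, event: dict) -> None:
    key, lsn = event["key"], event["lsn"]
    current = table.get(key)
    if current is not None and current["lsn"] >= lsn:
        return  # duplicate or stale event: merge is a no-op
    table[key] = {"row": event["row"], "lsn": lsn}

# The snapshot already contains key 1 as of log position 20.
table = {1: {"row": {"name": "ada"}, "lsn": 20}}

# CDC replay starts *before* the snapshot, so position 20 arrives again.
events = [
    {"key": 1, "lsn": 20, "row": {"name": "ada"}},     # duplicate: skipped
    {"key": 1, "lsn": 25, "row": {"name": "ada l."}},  # newer: applied
]
for event in events:
    merge_event(table, event)
```

Starting the stream before the snapshot timestamp and tolerating duplicates is the safe side of the trade-off: a gap between snapshot and stream would silently drop changes, while an overlap only costs a few redundant, idempotent merges.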
Figma evaluated off-the-shelf CDC solutions but opted for an in-house assembly due to limitations with vendor tools regarding RDS-specific APIs, prohibitive costs at their scale, and unproven reliability for their growing data volume. They integrated lower-level components like RDS snapshot exports, Kafka for streaming, and Snowflake stored procedures for merging. This decision highlights the trade-offs between flexibility, cost, and control versus convenience and speed of off-the-shelf solutions for large-scale, specific requirements.