Figma successfully transitioned its data pipeline from a multi-day latency batch process to a real-time, incremental synchronization system. This overhaul, driven by rapid data growth and unsustainable costs, leveraged Change Data Capture (CDC) with Kafka and Snowflake to ensure data freshness, scalability, and significant cost savings, while maintaining high data integrity through rigorous validation.
Figma's initial data pipeline was a simple daily cron job performing full table synchronization: querying all rows, dumping to S3, and loading into Snowflake. While straightforward initially, this approach became a major bottleneck as user data grew, leading to multi-day latencies, synchronization tasks taking hours, and millions in annual costs due to dedicated database replicas needed to handle the export load. This scenario highlights the common scalability issues of naive batch processing for rapidly growing datasets.
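The cost problem is easy to see in a sketch. The following is an illustrative model of the original job's shape (the function and variable names are hypothetical, not Figma's code): every run re-exports the entire table, so work scales with total table size rather than with the day's changes.

```python
"""Sketch of a naive daily full-table sync: query every row, dump to
object storage, reload the warehouse. All names are illustrative."""

def export_full_table(db: dict) -> list:
    # Step 1: query ALL rows, regardless of what changed since yesterday.
    return [{"id": key, **row} for key, row in db.items()]

def daily_sync(db: dict, object_store: dict, warehouse: dict, table: str = "users") -> None:
    dump = export_full_table(db)                          # full scan of the live table
    object_store[f"s3://exports/{table}.json"] = dump     # dump to S3
    warehouse[table] = {row["id"]: row for row in dump}   # full reload into warehouse

# Even if only one row changed today, the whole table is re-exported:
db = {i: {"name": f"user{i}"} for i in range(5)}
s3, snowflake = {}, {}
daily_sync(db, s3, snowflake)
```

Because the export reads every row on every run, the load on the source grows linearly with data volume, which is why dedicated read replicas (and their cost) became necessary.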
Figma chose to overhaul its pipeline by implementing incremental synchronization using Change Data Capture (CDC). Instead of copying entire tables, CDC reads the database's write-ahead log to capture only inserts, updates, and deletes as they occur. These change events are then published to Kafka, a distributed streaming platform, which acts as a durable buffer between the production database (Amazon RDS PostgreSQL) and the analytics warehouse (Snowflake). This decoupling is crucial for resilience and allows consumers to process data at their own pace without impacting the source database.
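The core CDC pattern described above can be sketched in a few lines: change events decoded from the write-ahead log flow through a stream, and a consumer applies each insert, update, or delete to the downstream copy. The event shape here is a simplified stand-in for real WAL-decoded Kafka payloads.

```python
"""Minimal sketch of applying a CDC event stream (illustrative event shape)."""

def apply_change_event(replica: dict, event: dict) -> None:
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]   # upsert the new row image
    elif op == "delete":
        replica.pop(key, None)        # drop the row if present

replica = {}
stream = [  # events arrive in WAL (commit) order
    {"op": "insert", "key": 1, "row": {"name": "ada"}},
    {"op": "update", "key": 1, "row": {"name": "ada lovelace"}},
    {"op": "insert", "key": 2, "row": {"name": "grace"}},
    {"op": "delete", "key": 2},
]
for event in stream:
    apply_change_event(replica, event)
```

The work per run is proportional to the number of changes, not the table size, and with Kafka buffering the stream, the consumer can fall behind and catch up without ever re-scanning the source database.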
Key Components of Figma's Data Pipeline
- Amazon RDS (PostgreSQL): Source database for live user traffic, leveraging its APIs for snapshot exports to S3.
- Change Data Capture (CDC): Mechanism to capture incremental changes from the database's write-ahead log.
- Kafka: Distributed streaming platform for buffering and reliably transporting change events.
- Snowflake: Data warehouse for analytics, consuming events from Kafka and applying incremental merges via stored procedures.
- S3: Object storage for initial full snapshots.
A critical aspect of CDC-based pipelines is ensuring data integrity, especially during the initial load or schema evolution. Figma tackled this by performing an initial full snapshot and ensuring the Kafka CDC stream's start offset *precedes* the snapshot timestamp. The overlap produces some duplicate events, but it guarantees no data is lost; duplicates are discarded during the merge step in Snowflake. For ongoing data quality, Figma developed a rigorous validation workflow that independently bootstraps a separate copy of the data weekly and performs cell-by-cell comparisons against the main pipeline's output, catching silent data corruption that would otherwise go unnoticed.
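The bootstrap-overlap trick works because the merge step is idempotent. A rough sketch of the idea, using a last-writer-wins merge keyed on log position (the `lsn` field and event shape are illustrative; Figma's actual merge runs as Snowflake stored procedures): replaying an event the snapshot already contains is simply a no-op.

```python
"""Sketch of an idempotent merge: load a snapshot, then replay CDC events
from an offset *before* the snapshot time. Duplicates are harmless because
stale log positions are skipped. Field names are illustrative."""

def merge_event(table: dict, event: dict) -> None:
    key, lsn = event["key"], event["lsn"]
    current = table.get(key)
    if current is not None and current["lsn"] >= lsn:
        return  # duplicate or stale event: merge is a no-op
    table[key] = {"row": event["row"], "lsn": lsn}

# The snapshot already contains key 1 as of log position 20.
table = {1: {"row": {"name": "ada"}, "lsn": 20}}

# CDC replay starts *before* the snapshot, so position 20 arrives again.
events = [
    {"key": 1, "lsn": 20, "row": {"name": "ada"}},     # duplicate: skipped
    {"key": 1, "lsn": 25, "row": {"name": "ada l."}},  # newer: applied
]
for event in events:
    merge_event(table, event)
```

Starting the stream before the snapshot timestamp and tolerating duplicates is the safe side of the trade-off: a gap between snapshot and stream would silently drop changes, while an overlap only costs a few redundant, idempotent merges.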
Figma evaluated off-the-shelf CDC solutions but opted for an in-house assembly due to limitations with vendor tools regarding RDS-specific APIs, prohibitive costs at their scale, and unproven reliability for their growing data volume. They integrated lower-level components like RDS snapshot exports, Kafka for streaming, and Snowflake stored procedures for merging. This decision highlights the trade-offs between flexibility, cost, and control versus convenience and speed of off-the-shelf solutions for large-scale, specific requirements.