This article details how Datadog tackled performance bottlenecks in their shared PostgreSQL database by implementing a robust, asynchronous data replication platform. They shifted from using Postgres for real-time search workloads to a dedicated search platform, leveraging Change Data Capture (CDC) with Debezium, Kafka, and a Schema Registry. The solution highlights critical distributed system trade-offs, particularly between consistency and availability, and the importance of automation for managing complex data pipelines at scale.
Datadog initially relied on a shared PostgreSQL database for many services. While suitable for OLTP (Online Transaction Processing) workloads, it struggled significantly with real-time search queries that required large joins to produce denormalized result sets. One 'Metrics Summary' page experienced p90 latencies of 7 seconds due to expensive joins between 82,000 active metrics and 817,000 configurations. Standard optimizations like indexing and query tuning proved insufficient, indicating a fundamental mismatch between the workload and the database's design.
To resolve the issue, Datadog adopted a Change Data Capture (CDC) approach. Instead of forcing PostgreSQL to act as a search engine, they replicated data from Postgres into a specialized search platform (like Elasticsearch). This involved flattening relational data into denormalized documents suitable for search queries. The core components of this replication pipeline are: Postgres logical replication, which streams changes from the write-ahead log; Debezium, which converts those changes into structured change events; Kafka, which transports and buffers the events; a Schema Registry, which enforces event schema compatibility; and sink connectors, which write the denormalized documents into the search platform.
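As a concrete illustration of the capture side, a Debezium Postgres connector is registered with Kafka Connect via its REST API. The sketch below uses real Debezium property names, but the host names, credentials, and table list are hypothetical placeholders, not Datadog's actual configuration:

```python
import json
import urllib.request

def build_connector_config(server_name, table_list):
    """Build a Debezium PostgresConnector registration payload.

    server_name and table_list are illustrative; in practice these would
    come from the team requesting the pipeline.
    """
    return {
        "name": f"{server_name}-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",            # Postgres built-in logical decoding plugin
            "database.hostname": "postgres.internal",  # placeholder host
            "database.port": "5432",
            "database.user": "cdc_user",          # placeholder credentials
            "database.password": "changeme",
            "database.dbname": server_name,
            "topic.prefix": server_name,          # prefix for the Kafka topics Debezium creates
            "table.include.list": table_list,     # only capture the tables search needs
        },
    }

def register_connector(connect_url, payload):
    """POST the connector config to a Kafka Connect worker."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

Once registered, Debezium begins emitting one change event per row modification to Kafka topics named after the captured tables.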
Key Benefit
This architecture allows applications to continue writing to PostgreSQL as before, while search queries are directed to the purpose-built search platform, effectively separating OLTP and OLAP workloads.
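The flattening step is where the join cost moves from query time to write time. A minimal sketch of denormalization, with hypothetical row shapes standing in for the real metrics and configurations tables:

```python
from collections import defaultdict

def denormalize(metrics, configurations):
    """Flatten metric rows and their configuration rows into one
    search document per metric, so queries never join at read time."""
    by_metric = defaultdict(list)
    for cfg in configurations:
        by_metric[cfg["metric_id"]].append(cfg)

    docs = []
    for m in metrics:
        docs.append({
            "metric_name": m["name"],
            "team": m["team"],
            # Embed the related configurations directly in the document.
            "configurations": [
                {k: v for k, v in c.items() if k != "metric_id"}
                for c in by_metric[m["id"]]
            ],
        })
    return docs
```

Each document is self-contained, so the search platform answers the 'Metrics Summary' query with a single indexed lookup instead of an 82,000 × 817,000-row join.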
Datadog chose asynchronous replication over synchronous. Synchronous replication guarantees strong consistency but introduces significant latency and couples system performance to the slowest replica. Asynchronous replication introduces replication lag, a window in which the replica trails the source, but offers superior speed and resilience. For Datadog's use cases (search, filtering, analytics), a few hundred milliseconds of replication lag was an acceptable trade-off for eliminating 7-second page loads. This decision is a practical application of the CAP theorem, prioritizing availability and partition tolerance over strong consistency across the entire distributed system.
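Because events carry the source's commit timestamp, the replica can measure its own lag as it applies each change. A minimal sketch of an asynchronous apply loop, with a hypothetical event shape:

```python
import time

def apply_change_event(event, replica, now=None):
    """Apply one CDC event to an in-memory replica and return the
    replication lag observed for that event (seconds behind the source)."""
    now = time.time() if now is None else now
    # Events for the same key arrive in source order (same Kafka partition),
    # so applying the latest value is safe.
    replica[event["key"]] = event["value"]
    return now - event["source_ts"]
```

Monitoring this per-event lag is what lets a team confirm that eventual consistency stays in the hundreds-of-milliseconds range rather than drifting unbounded.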
Schema changes in a CDC pipeline are complex, as they must propagate to all downstream consumers. Datadog implemented a two-pronged defense: compatibility checks enforced by the Schema Registry, which reject breaking changes before they enter the pipeline, and constraints on schema evolution itself, limiting producers to backward-compatible changes such as adding fields with defaults.
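The compatibility rule can be sketched as a simple check. This is an illustrative, stricter-than-Avro version of backward compatibility (the schema representation here is hypothetical, not the Schema Registry's actual model):

```python
def is_backward_compatible(old_schema, new_schema):
    """Additive-changes-only rule: data written with the old schema must
    still be readable under the new one. Schemas are dicts of
    field name -> {"type": ..., optional "default": ...}."""
    # Stricter rule for this sketch: no field the old schema wrote may vanish.
    for name in old_schema:
        if name not in new_schema:
            return False
    # Any brand-new field must carry a default, so old records still decode.
    for name, field in new_schema.items():
        if name not in old_schema and "default" not in field:
            return False
    return True
```

A registry running a check like this at publish time stops a producer from breaking every downstream consumer with one migration.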
Manually configuring each CDC pipeline (Postgres logical replication, Debezium instances, Kafka topics, sink connectors, etc.) is operationally intensive. Datadog leveraged Temporal, a workflow orchestration engine, to automate the entire provisioning process. This turned a one-off fix into a company-wide, self-service platform, enabling teams to request pipelines through an automated system, significantly reducing operational burden and human error. This automation expanded the initial Postgres-to-search pipeline to other use cases like Postgres-to-Postgres (for database unwinding), Postgres-to-Iceberg, Cassandra replication, and cross-region Kafka replication.
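Temporal workflows are ordinary code whose steps are retried durably by the engine. The plain-Python sketch below is a stand-in for that orchestration, not Datadog's actual workflow; the step names are hypothetical:

```python
import time

PROVISIONING_STEPS = [
    "create_postgres_publication",  # enable logical replication for the tables
    "deploy_debezium_connector",    # start change capture
    "create_kafka_topics",          # durable transport for change events
    "register_schemas",             # seed the Schema Registry
    "deploy_sink_connector",        # write into the destination system
]

def provision_pipeline(execute_step, max_attempts=3):
    """Run each provisioning step in order, retrying transient failures --
    a plain-Python stand-in for what a Temporal workflow does durably."""
    completed = []
    for step in PROVISIONING_STEPS:
        for attempt in range(1, max_attempts + 1):
            try:
                execute_step(step)
                completed.append(step)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(0)  # a real workflow would use exponential backoff
    return completed
```

Encoding provisioning as a workflow is what makes the platform self-service: a team's request triggers the same audited sequence every time, whether the sink is search, another Postgres, or Iceberg.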
Considerations for CDC Platforms
While powerful, a CDC replication platform introduces overhead: eventual consistency, schema evolution constraints, and significant operational burden for the underlying infrastructure (Debezium, Kafka, Schema Registry, Temporal). It's justified when workloads don't fit primary databases, multiple teams need data in different shapes, and manual management is untenable at scale. For simpler needs, batch syncs or read replicas might suffice.