This article details how Datadog tackled performance bottlenecks in their shared PostgreSQL database by implementing a robust, asynchronous data replication platform. They shifted from using Postgres for real-time search workloads to a dedicated search platform, leveraging Change Data Capture (CDC) with Debezium, Kafka, and a Schema Registry. The solution highlights critical distributed system trade-offs, particularly between consistency and availability, and the importance of automation for managing complex data pipelines at scale.
Datadog initially relied on a shared PostgreSQL database for many services. While suitable for OLTP (Online Transaction Processing) workloads, it struggled significantly with real-time search queries that required large joins to produce denormalized result sets. One 'Metrics Summary' page experienced p90 latencies of 7 seconds due to expensive joins between 82,000 active metrics and 817,000 configurations. Standard optimizations like indexing and query tuning proved insufficient, indicating a fundamental mismatch between the workload and the database's design.
To resolve the issue, Datadog adopted a Change Data Capture (CDC) approach. Instead of forcing PostgreSQL to act as a search engine, they replicated data from Postgres into a specialized search platform (like Elasticsearch). This involved flattening relational data into denormalized documents suitable for search queries. The core components of this replication pipeline are: Postgres logical replication, which streams changes from the write-ahead log; Debezium, which converts those changes into structured change events; Kafka, which transports and buffers the events; a Schema Registry, which enforces event schema compatibility; and sink connectors, which write the denormalized documents into the search platform.
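As a concrete illustration of the capture side, a Debezium Postgres connector is registered with Kafka Connect via its REST API. The sketch below uses real Debezium property names, but the host names, credentials, and table list are hypothetical placeholders, not Datadog's actual configuration:

```python
import json
import urllib.request

def build_connector_config(server_name, table_list):
    """Build a Debezium PostgresConnector registration payload.

    server_name and table_list are illustrative; in practice these would
    come from the team requesting the pipeline.
    """
    return {
        "name": f"{server_name}-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",            # Postgres built-in logical decoding plugin
            "database.hostname": "postgres.internal",  # placeholder host
            "database.port": "5432",
            "database.user": "cdc_user",          # placeholder credentials
            "database.password": "changeme",
            "database.dbname": server_name,
            "topic.prefix": server_name,          # prefix for the Kafka topics Debezium creates
            "table.include.list": table_list,     # only capture the tables search needs
        },
    }

def register_connector(connect_url, payload):
    """POST the connector config to a Kafka Connect worker."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

Once registered, Debezium begins emitting one change event per row modification to Kafka topics named after the captured tables.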
Key Benefit
This architecture allows applications to continue writing to PostgreSQL as before, while search queries are directed to the purpose-built search platform, effectively separating OLTP and OLAP workloads.
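The flattening step is where the join cost moves from query time to write time. A minimal sketch of denormalization, with hypothetical row shapes standing in for the real metrics and configurations tables:

```python
from collections import defaultdict

def denormalize(metrics, configurations):
    """Flatten metric rows and their configuration rows into one
    search document per metric, so queries never join at read time."""
    by_metric = defaultdict(list)
    for cfg in configurations:
        by_metric[cfg["metric_id"]].append(cfg)

    docs = []
    for m in metrics:
        docs.append({
            "metric_name": m["name"],
            "team": m["team"],
            # Embed the related configurations directly in the document.
            "configurations": [
                {k: v for k, v in c.items() if k != "metric_id"}
                for c in by_metric[m["id"]]
            ],
        })
    return docs
```

Each document is self-contained, so the search platform answers the 'Metrics Summary' query with a single indexed lookup instead of an 82,000 × 817,000-row join.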
Datadog chose asynchronous replication over synchronous. Synchronous replication guarantees strong consistency but introduces significant latency and couples system performance to the slowest replica. Asynchronous replication introduces replication lag, a window in which the replica trails the source, but offers superior speed and resilience. For Datadog's use cases (search, filtering, analytics), a few hundred milliseconds of replication lag was an acceptable trade-off for eliminating 7-second page loads. This decision is a practical application of the CAP theorem, prioritizing availability and partition tolerance over strong consistency across the entire distributed system.
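Because events carry the source's commit timestamp, the replica can measure its own lag as it applies each change. A minimal sketch of an asynchronous apply loop, with a hypothetical event shape:

```python
import time

def apply_change_event(event, replica, now=None):
    """Apply one CDC event to an in-memory replica and return the
    replication lag observed for that event (seconds behind the source)."""
    now = time.time() if now is None else now
    # Events for the same key arrive in source order (same Kafka partition),
    # so applying the latest value is safe.
    replica[event["key"]] = event["value"]
    return now - event["source_ts"]
```

Monitoring this per-event lag is what lets a team confirm that eventual consistency stays in the hundreds-of-milliseconds range rather than drifting unbounded.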
Schema changes in a CDC pipeline are complex, as they must propagate to all downstream consumers. Datadog implemented a two-pronged defense: compatibility checks enforced by the Schema Registry, which reject breaking changes before they enter the pipeline, and constraints on schema evolution itself, limiting producers to backward-compatible changes such as adding fields with defaults.
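The compatibility rule can be sketched as a simple check. This is an illustrative, stricter-than-Avro version of backward compatibility (the schema representation here is hypothetical, not the Schema Registry's actual model):

```python
def is_backward_compatible(old_schema, new_schema):
    """Additive-changes-only rule: data written with the old schema must
    still be readable under the new one. Schemas are dicts of
    field name -> {"type": ..., optional "default": ...}."""
    # Stricter rule for this sketch: no field the old schema wrote may vanish.
    for name in old_schema:
        if name not in new_schema:
            return False
    # Any brand-new field must carry a default, so old records still decode.
    for name, field in new_schema.items():
        if name not in old_schema and "default" not in field:
            return False
    return True
```

A registry running a check like this at publish time stops a producer from breaking every downstream consumer with one migration.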
Manually configuring each CDC pipeline (Postgres logical replication, Debezium instances, Kafka topics, sink connectors, etc.) is operationally intensive. Datadog leveraged Temporal, a workflow orchestration engine, to automate the entire provisioning process. This turned a one-off fix into a company-wide, self-service platform, enabling teams to request pipelines through an automated system, significantly reducing operational burden and human error. This automation expanded the initial Postgres-to-search pipeline to other use cases like Postgres-to-Postgres (for database unwinding), Postgres-to-Iceberg, Cassandra replication, and cross-region Kafka replication.
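Temporal workflows are ordinary code whose steps are retried durably by the engine. The plain-Python sketch below is a stand-in for that orchestration, not Datadog's actual workflow; the step names are hypothetical:

```python
import time

PROVISIONING_STEPS = [
    "create_postgres_publication",  # enable logical replication for the tables
    "deploy_debezium_connector",    # start change capture
    "create_kafka_topics",          # durable transport for change events
    "register_schemas",             # seed the Schema Registry
    "deploy_sink_connector",        # write into the destination system
]

def provision_pipeline(execute_step, max_attempts=3):
    """Run each provisioning step in order, retrying transient failures --
    a plain-Python stand-in for what a Temporal workflow does durably."""
    completed = []
    for step in PROVISIONING_STEPS:
        for attempt in range(1, max_attempts + 1):
            try:
                execute_step(step)
                completed.append(step)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(0)  # a real workflow would use exponential backoff
    return completed
```

Encoding provisioning as a workflow is what makes the platform self-service: a team's request triggers the same audited sequence every time, whether the sink is search, another Postgres, or Iceberg.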
Considerations for CDC Platforms
While powerful, a CDC replication platform introduces overhead: eventual consistency, schema evolution constraints, and significant operational burden for the underlying infrastructure (Debezium, Kafka, Schema Registry, Temporal). It's justified when workloads don't fit primary databases, multiple teams need data in different shapes, and manual management is untenable at scale. For simpler needs, batch syncs or read replicas might suffice.