Datadog Blog·June 4, 2026

Achieving High Availability for PostgreSQL on Kubernetes with Patroni and Synchronous Replication

This article details Datadog's journey to improve PostgreSQL high availability on Kubernetes after discovering failover issues during a gameday. It focuses on the architectural redesign using Patroni for cluster management and synchronous replication to prevent data loss, highlighting key considerations for reliable database operations in a cloud-native environment.


Ensuring high availability for stateful applications like PostgreSQL databases in a Kubernetes environment presents unique challenges. This case study from Datadog illustrates how crucial it is to thoroughly test failover scenarios and design robust solutions to prevent data loss and minimize downtime, especially when operating critical services.

The Failover Challenge in Kubernetes

Datadog's original PostgreSQL setup on Kubernetes risked losing data during failover. That risk arises when replication is asynchronous: the primary acknowledges a commit before the change has reached a replica, so if the primary fails in that window, the promoted replica is missing the transaction. A critical lesson is that high availability isn't just about having replicas; it's about ensuring a *safe* and *consistent* transition of leadership without data compromise.
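The failure window can be made concrete with a toy model. The sketch below is not Datadog's code and does not talk to PostgreSQL; the `Node` class, the node names, and the transaction labels are hypothetical, and "asynchronous" is simplified to "the record never reached the replica before the crash".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy database instance that only tracks which transactions it holds."""
    name: str
    wal: list = field(default_factory=list)


def commit(primary: Node, replica: Node, txn: str, synchronous: bool) -> None:
    """Acknowledge a commit on the primary.

    In synchronous mode the replica holds the record before the client is
    acknowledged. In asynchronous mode the record would be shipped later;
    here it is never shipped, to model a primary that crashes first.
    """
    primary.wal.append(txn)
    if synchronous:
        replica.wal.append(txn)


def lost_on_failover(primary: Node, replica: Node) -> list:
    """Transactions the old primary acknowledged that the promoted replica lacks."""
    return [t for t in primary.wal if t not in replica.wal]


def run(synchronous: bool) -> list:
    primary, replica = Node("pg-0"), Node("pg-1")
    # The replica is fully caught up through t2.
    for txn in ("t1", "t2"):
        commit(primary, replica, txn, synchronous=True)
    # t3 is acknowledged to the client right before the primary crashes.
    commit(primary, replica, "t3", synchronous=synchronous)
    return lost_on_failover(primary, replica)


print(run(synchronous=False))  # ['t3']
print(run(synchronous=True))   # []
```

In the asynchronous run the client was told `t3` succeeded, yet the new primary has no record of it. That is the gap a safe leadership transition has to close.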

Leveraging Patroni for PostgreSQL HA

Patroni is an open-source high-availability solution for PostgreSQL. It acts as a clustering and failover manager: it coordinates the PostgreSQL instances in a cluster and promotes a healthy replica when the primary becomes unavailable. Key features include automatic failover, replica management, and integration with distributed configuration stores such as etcd or ZooKeeper, which hold the cluster state.
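One way to picture how a failover manager uses the configuration store is as a leader lease with a time-to-live: the primary keeps renewing a key, and a replica can take over only once that key expires. The sketch below is a simplified in-memory illustration of that idea, not Patroni's implementation. `LeaseStore`, `tick`, the node names, and the 30-second TTL are assumptions made for the example, and the sketch ignores how the best replica is chosen.

```python
class LeaseStore:
    """In-memory stand-in for a leader key with a time-to-live (TTL)."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Grant the lease if it is free, expired, or already held by `node`."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


def tick(store: LeaseStore, healthy_nodes: list, now: float):
    """One heartbeat round: each healthy node tries to take or renew the lease."""
    for node in healthy_nodes:
        store.acquire_or_renew(node, now)
    return store.holder


store = LeaseStore(ttl=30)
print(tick(store, ["pg-0", "pg-1"], now=0))   # pg-0 takes the free lease
print(tick(store, ["pg-0", "pg-1"], now=10))  # pg-0 renews; lease now ends at 40
# pg-0 crashes here and stops renewing.
print(tick(store, ["pg-1"], now=20))          # pg-0: the lease has not expired yet
print(tick(store, ["pg-1"], now=40))          # pg-1 takes over after expiry
```

The round at `now=20` shows why failover is not instantaneous: until the lease expires, no other node may promote itself, which is what prevents two primaries from accepting writes at once.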

Synchronous Replication for Data Safety

To prevent data loss, Datadog implemented synchronous replication. With it, the primary does not report a transaction as committed until the corresponding write-ahead log (WAL) record has been written on at least one replica, so a committed transaction survives the loss of the primary as long as a synchronous replica remains available to promote. The trade-off is higher write latency, because every commit now waits on a round trip to the replica.
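The latency cost can be estimated with a first-order model: a synchronous commit waits for the local WAL flush, then for the record to reach the replica and be flushed there, then for the acknowledgment to come back. The function and the millisecond figures below are illustrative assumptions, not measurements from the article.

```python
def commit_latency_ms(local_flush_ms: float,
                      replica_rtt_ms: float = 0.0,
                      replica_flush_ms: float = 0.0,
                      synchronous: bool = False) -> float:
    """Rough per-commit latency, in milliseconds.

    Asynchronous: the client waits only for the local WAL flush.
    Synchronous: the client also waits for the network round trip to the
    replica and for the replica to flush the record.
    """
    if not synchronous:
        return local_flush_ms
    return local_flush_ms + replica_rtt_ms + replica_flush_ms


# Hypothetical numbers: 1 ms local flush, 0.5 ms round trip, 1 ms replica flush.
print(commit_latency_ms(1.0))                              # 1.0
print(commit_latency_ms(1.0, 0.5, 1.0, synchronous=True))  # 2.5
```

Under these assumed figures, commit latency more than doubles, and the network term grows if the synchronous replica sits in another availability zone or region.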

The Patroni capabilities this design relies on:

  • Automatic Failover: Patroni monitors cluster health and promotes a replica if the primary fails.
  • Replica Management: It automatically handles base backups and replica creation.
  • Consensus: It uses a distributed consensus store (e.g., etcd) for leader election, so only one node acts as primary at a time.
  • Synchronous Mode: It can be configured so that a commit is acknowledged to the client only after a replica has confirmed it.
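As a rough illustration of the last item, Patroni's dynamic configuration accepts the keys `synchronous_mode`, `synchronous_mode_strict`, and `synchronous_node_count`, and it can be changed through the REST API's `PATCH /config` endpoint or with `patronictl edit-config`. The sketch below only builds such a request body without sending it. The helper name and the chosen values are assumptions, since the article does not state which settings Datadog uses.

```python
import json


def synchronous_mode_patch(strict: bool = False, node_count: int = 1) -> str:
    """Build a JSON body that enables synchronous mode in Patroni's dynamic config.

    strict=True asks Patroni to refuse writes when no synchronous replica is
    available, instead of temporarily falling back to asynchronous replication.
    node_count is the number of synchronous replicas to maintain.
    """
    settings = {
        "synchronous_mode": True,
        "synchronous_mode_strict": strict,
        "synchronous_node_count": node_count,
    }
    return json.dumps(settings, sort_keys=True)


# Hypothetical choice: favor durability over write availability.
print(synchronous_mode_patch(strict=True))
```

The `strict` flag is the main design decision: without it, writes keep flowing when the last synchronous replica disappears, at the cost of reopening the data-loss window; with it, writes block until a synchronous replica returns.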

The successful redesign highlights the importance of choosing the right tools (Patroni), understanding replication modes (synchronous vs. asynchronous), and rigorously testing failure scenarios to build truly resilient database systems in dynamic environments like Kubernetes.

