Airbnb engineered Sitar-agent, a Kubernetes sidecar, to reliably and quickly deliver dynamic configurations to thousands of service instances without requiring redeployments. This article delves into the architecture, design decisions, and trade-offs made to ensure high availability, performance, and multi-language support for configuration delivery within their infrastructure, emphasizing the journey from initial sync to continuous updates.
This article, originally published on Airbnb Engineering, explores the architectural evolution of Airbnb's Sitar-agent, a critical Kubernetes sidecar designed for dynamic configuration delivery. It highlights the challenges of distributing configuration changes reliably and quickly across a large, diverse service fleet, and the design choices made to overcome them while balancing reliability, performance, scalability, and multi-language support. The system keeps configurations available even when the central Sitar Service is down, and updates propagate within tens of seconds.
The configuration delivery process involves several steps:

1. Config Creation/Update: Developers commit changes via Git or UI, stored with versioning and ACLs in the Sitar Service.
2. Hourly Snapshot Upload: A Snapshot Service periodically uploads compressed full-state config snapshots to AWS S3.
3. Pod Startup Preload: On pod startup, `sitar-agent` first downloads the latest snapshot from S3, then performs an initial sync with the Sitar Service for any subsequent changes. This dual-phase preload ensures fast restarts and resilience to Sitar Service unavailability.
4. Periodic Update: After startup, the agent continuously polls the Sitar Service for updates every few seconds, incorporating jitter to avoid thundering herd issues.
5. Config Read: Applications read configurations from a local disk via a Sitar client library, which maintains an in-memory cache and detects file changes to refresh values transparently.
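Steps 3 to 5 can be illustrated with a minimal Python sketch. It is not Airbnb's implementation: every name in it (`preload`, `next_poll_delay`, `write_atomically`, `FileBackedConfig`) is hypothetical, and the JSON file format, the 20% jitter fraction, the key-by-key merge of changes, and the use of file metadata to detect changes are all assumptions made for illustration.

```python
import json
import os
import random
import tempfile


def preload(load_snapshot, fetch_changes_since):
    """Dual-phase preload: snapshot first, then a best-effort sync.

    Both arguments are hypothetical callables. load_snapshot() returns
    (version, values); fetch_changes_since(version) returns
    (new_version, changed_values) or raises ConnectionError.
    """
    version, values = load_snapshot()
    try:
        version, changes = fetch_changes_since(version)
        values = {**values, **changes}  # assumed: changes overwrite by key
    except ConnectionError:
        pass  # service unreachable: start from the snapshot alone
    return version, values


def next_poll_delay(base_seconds: float, jitter_fraction: float = 0.2) -> float:
    """Return the base interval shifted randomly by up to +/- jitter_fraction,
    so that many agents do not poll at the same instant."""
    spread = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-spread, spread)


def write_atomically(path: str, values: dict) -> None:
    """Agent side: write to a temp file, then rename over the target,
    so a reader never observes a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(values, handle)
    os.replace(tmp_path, path)


class FileBackedConfig:
    """Client side: serve values from memory, reload when the file changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._signature = None  # (inode, mtime_ns, size) of the last parsed file
        self._values: dict = {}

    def get(self, key: str, default=None):
        stat = os.stat(self._path)
        signature = (stat.st_ino, stat.st_mtime_ns, stat.st_size)
        if signature != self._signature:
            with open(self._path, "r", encoding="utf-8") as handle:
                self._values = json.load(handle)
            self._signature = signature
        return self._values.get(key, default)
```

In this sketch the application only ever touches the local file, so a read succeeds whether or not the central service is reachable. Checking file metadata on each read is just one simple way to notice a change; a real client might instead use a file-watching mechanism or a background refresh.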
System Design Lessons
When designing distributed systems, carefully weigh the trade-offs between resource efficiency, operational complexity, development effort, and multi-language support. A slightly less performant but more maintainable and reliable solution (like the sidecar model or SQLite) can be a superior choice for large-scale, diverse environments. Prioritize isolation for critical components to prevent cascading failures.