Airbnb Engineering·June 4, 2026

Building a Reliable Dynamic Configuration Sidecar at Scale with Sitar-agent

Airbnb engineered Sitar-agent, a Kubernetes sidecar, to reliably and quickly deliver dynamic configurations to thousands of service instances without requiring redeployments. This article delves into the architecture, design decisions, and trade-offs made to ensure high availability, performance, and multi-language support for configuration delivery within their infrastructure, emphasizing the journey from initial sync to continuous updates.

Read original on Airbnb Engineering

This article explores the architectural evolution of Airbnb's Sitar-agent, a critical Kubernetes sidecar designed for dynamic configuration delivery. It highlights the challenges of distributing configuration changes reliably and quickly across a large, diverse service fleet and the design choices made to overcome these, focusing on balancing reliability, performance, scalability, and multi-language support. The system ensures configurations are always available, even when the central Sitar Service is down, and updates propagate within tens of seconds.

Sitar Configuration Delivery Lifecycle

The configuration delivery process involves several steps:

  1. Config Creation/Update: Developers commit changes via Git or UI, stored with versioning and ACLs in the Sitar Service.
  2. Hourly Snapshot Upload: A Snapshot Service periodically uploads compressed full-state config snapshots to AWS S3.
  3. Pod Startup Preload: On pod startup, `sitar-agent` first downloads the latest snapshot from S3, then performs an initial sync with the Sitar Service for any subsequent changes. This dual-phase preload ensures fast restarts and resilience to Sitar Service unavailability.
  4. Periodic Update: After startup, the agent continuously polls the Sitar Service for updates every few seconds, incorporating jitter to avoid thundering herd issues.
  5. Config Read: Applications read configurations from a local disk via a Sitar client library, which maintains an in-memory cache and detects file changes to refresh values transparently.
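
The startup preload and jittered polling steps can be sketched roughly as follows. This is a minimal illustration, not Airbnb's implementation: the function names (`preload`, `next_poll_delay`), the integer version marker, the injected fetch callables, and the 20% jitter fraction are all assumptions made for the example. Only the two-phase order (snapshot first, then incremental sync) and the use of jitter come from the description above.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Config = Dict[str, str]


def preload(
    fetch_snapshot: Callable[[], Tuple[Config, int]],
    fetch_changes: Callable[[int], Tuple[Config, int]],
) -> Tuple[Config, int]:
    """Two-phase startup: load the snapshot, then sync anything newer.

    Both callables are hypothetical stand-ins: fetch_snapshot for the S3
    download, fetch_changes for the initial sync with the config service.
    """
    # Phase 1: bulk state from the snapshot (works even if the service is down).
    configs, version = fetch_snapshot()
    try:
        # Phase 2: only the changes made after the snapshot was taken.
        delta, version = fetch_changes(version)
        configs = {**configs, **delta}
    except ConnectionError:
        # Service unavailable: start from the snapshot alone and let the
        # periodic polls catch up later.
        pass
    return configs, version


def next_poll_delay(
    base_seconds: float = 10.0,
    jitter_fraction: float = 0.2,  # assumed value, not from the article
    rng: Optional[random.Random] = None,
) -> float:
    """Return the base interval shifted by a random offset so that pods
    started at the same moment do not all poll at the same moment."""
    rng = rng or random
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)
```

With these defaults each delay falls between 8 and 12 seconds. A real agent would sleep for `next_poll_delay()` between polls and would likely retry the phase-two sync rather than silently skipping it.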

Key Architectural Decisions

  • Sidecar vs. Main Container: The decision to keep `sitar-agent` as a separate sidecar container was crucial. While a library in the main container would reduce cost and operational surface, the significant downsides included increased multi-language complexity (requiring implementations in Java, Python, Go, TypeScript, Ruby), lack of isolation (a bug in Sitar logic could crash the main app), and blurred operational metrics. The sidecar approach prioritizes reliability, operational clarity, and development efficiency despite higher resource consumption.
  • Pull Model with Server-side Optimizations: `sitar-agent` uses a pull-based model, polling the Sitar Service every 10 seconds. To manage the load from tens of thousands of pods, Airbnb implemented two server-side optimizations: a short-TTL cache (10s) that absorbs most requests without hitting the database, and a token-based mechanism in which the token marks the agent's last fetch, so the service can skip scanning changes made before that point. Together these significantly reduce database access and processing time, preserving the pull model's simplicity while keeping update latency acceptable at scale.
  • Local Datastore Selection (SQLite vs. RocksDB): The legacy Sparkey-backed datastore, designed for write-once, read-many workloads, struggled with Sitar's frequent writes and multi-language needs. After evaluating SQLite and RocksDB, Airbnb chose SQLite despite its reads being 2-3x slower than RocksDB's. The primary reasons were SQLite's mature multi-language support (official bindings for all of Airbnb's primary languages), its built-in WAL mode for concurrent reads and writes (eliminating complex custom locking), and a simpler operational model. RocksDB's higher performance came with increased operational complexity and a less mature multi-language ecosystem.
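
To make the token-based fetch and the SQLite choice more concrete, here is a small sketch using Python's standard `sqlite3` module. The schema (`configs`, `sync_state`), the integer token, and the in-memory change log standing in for the Sitar Service are all hypothetical; the article does not describe the actual token format or table layout. The sketch only shows the general idea: the agent sends its last token, receives changes newer than it, and writes them to a WAL-mode database that other processes can read while the write is in progress.

```python
import sqlite3
from typing import List, Tuple

Change = Tuple[int, str, str]  # (version, key, value); an assumed shape


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the local store with write-ahead logging enabled."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS configs ("
        "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sync_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), token INTEGER NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO sync_state (id, token) VALUES (1, 0)")
    conn.commit()
    return conn


def read_token(conn: sqlite3.Connection) -> int:
    """Return the token recorded by the last successful sync."""
    return conn.execute("SELECT token FROM sync_state WHERE id = 1").fetchone()[0]


def changes_since(log: List[Change], token: int) -> Tuple[List[Change], int]:
    """Stand-in for the server side: return only entries newer than the
    caller's token, plus the token to send on the next request."""
    newer = [c for c in log if c[0] > token]
    new_token = max((c[0] for c in newer), default=token)
    return newer, new_token


def apply_changes(conn: sqlite3.Connection, changes: List[Change], token: int) -> None:
    """Write the changes and the new token in one transaction, so the stored
    token never gets ahead of (or behind) the stored data."""
    with conn:  # commits on success, rolls back on exception
        conn.executemany(
            "INSERT OR REPLACE INTO configs (key, value) VALUES (?, ?)",
            [(key, value) for _, key, value in changes],
        )
        conn.execute("UPDATE sync_state SET token = ? WHERE id = 1", (token,))
```

A poll cycle in this sketch would be `changes, token = changes_since(log, read_token(conn))` followed by `apply_changes(conn, changes, token)`. When nothing has changed, the list is empty and the token is unchanged. Note that WAL mode requires a file-backed database, so the path must point to a real file rather than `:memory:`.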

System Design Lessons

When designing distributed systems, carefully weigh the trade-offs between resource efficiency, operational complexity, development effort, and multi-language support. A slightly less performant but more maintainable and reliable solution (like the sidecar model or SQLite) can be a superior choice for large-scale, diverse environments. Prioritize isolation for critical components to prevent cascading failures.

Tags: Kubernetes, sidecar, dynamic configuration, scalability, reliability, distributed systems, AWS S3, SQLite
