Netflix developed an event-driven orchestration platform to automate code changes and migrations across its vast and diverse software fleet, aiming to reduce migration times from months to days. This platform uses composable, 'Lego-like' steps, integrates automated canary validation, and incorporates compliance checks to ensure safety and confidence in large-scale changes. The core architectural challenge was to balance flexibility for unique migrations with the need for standardized, repeatable processes for common updates.
Read original on InfoQ Cloud
Netflix faced a significant challenge with software migrations: libraries often had many active versions due to slow adoption of updates, leading to a "long tail" of maintenance. Critical vulnerabilities, like Log4j, highlighted the need for rapid, fleet-wide code changes. The goal was to automate all code changes within a week, and critical vulnerabilities within two days, with minimal effort for both platform teams and software owners. Key requirements included handling the diverse characteristics of the fleet (languages, security, business units, monorepos vs. microservices) and ensuring changes were applied safely without breaking production systems.
Netflix's solution is a fleet-wide automation platform centered around an event-driven orchestration engine. This system allows platform teams to create "campaigns" to update "targets" (software units) along a defined "path" of automated steps. The architecture decouples the state machine from the event consumer, enabling flexibility for events to originate from various internal and external systems. This design ensures the system can react to diverse triggers and progress changes asynchronously.
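The decoupling described above can be illustrated with a minimal Python sketch. It is not Netflix's implementation: the class names (`Event`, `Campaign`, `StateMachine`, `EventConsumer`), the field names, and the in-memory queue are all assumptions made for illustration. The point it shows is that the consumer only receives and forwards events, while the state machine alone decides how a target moves along its path.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass(frozen=True)
class Event:
    """A normalized event; every field name here is hypothetical."""
    source: str   # originating system, e.g. "scm" or "ci"
    target: str   # the software unit the event refers to
    kind: str     # what happened, e.g. "pr_checks_passed"


@dataclass
class Campaign:
    """A campaign moves targets along a path of named steps."""
    name: str
    path: list
    progress: dict = field(default_factory=dict)  # target -> index into path


class StateMachine:
    """Owns campaign state and knows nothing about where events come from."""

    def __init__(self, campaign):
        self.campaign = campaign

    def handle(self, event):
        # Toy rule: every event advances the target by one step,
        # stopping at the last step of the path.
        index = self.campaign.progress.get(event.target, 0)
        index = min(index + 1, len(self.campaign.path) - 1)
        self.campaign.progress[event.target] = index
        return self.campaign.path[index]


class EventConsumer:
    """Drains a queue that any internal or external system may publish to."""

    def __init__(self, machine):
        self.queue = Queue()
        self.machine = machine

    def publish(self, event):
        self.queue.put(event)

    def drain(self):
        reached = []
        while not self.queue.empty():
            reached.append(self.machine.handle(self.queue.get()))
        return reached
```

Because `StateMachine.handle` takes only a normalized `Event`, a new event source could be added in this sketch by publishing to the queue, without touching the transition logic.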
At the heart of the platform are composable, predefined units of automation, likened to Lego bricks. Each step has its own state, allowing for flexible path creation to accommodate unique migration requirements while also offering pre-configured paths for common updates (e.g., dependency updates). The state machine processes incoming events, determines the next step, updates step states, launches child workflows (step handlers) for specific automation tasks, and manages edge cases like pausing, resuming, and failure handling.
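As a rough sketch of the behavior just described, the following Python class tracks a state per step, launches the next step handler when an event allows it, and handles pause, resume, and failure. The event names (`start`, `step_succeeded`, `step_failed`, `pause`, `resume`), the `TargetWorkflow` class, and the plain callables standing in for child workflows are assumptions for illustration, not the platform's actual API.

```python
from enum import Enum


class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class TargetWorkflow:
    """Tracks one target's progress along a path of steps (illustrative)."""

    def __init__(self, path, handlers):
        self.path = list(path)
        self.handlers = handlers  # step name -> callable, a child-workflow stand-in
        self.states = {step: StepState.PENDING for step in self.path}
        self.paused = False
        self.launched = []        # order in which step handlers were launched

    def on_event(self, kind, step=None):
        """Update state from one event, then launch the next step if allowed."""
        if kind == "pause":
            self.paused = True
            return
        if kind == "resume":
            self.paused = False
        elif kind == "step_succeeded":
            self.states[step] = StepState.SUCCEEDED
        elif kind == "step_failed":
            self.states[step] = StepState.FAILED
        elif kind != "start":
            raise ValueError(f"unknown event kind: {kind}")
        self._launch_next()

    def _launch_next(self):
        # Hold position while paused, after a failure, or while a step runs.
        blocking = {StepState.FAILED, StepState.RUNNING}
        if self.paused or blocking & set(self.states.values()):
            return
        for name in self.path:
            if self.states[name] is StepState.PENDING:
                self.states[name] = StepState.RUNNING
                self.launched.append(name)
                self.handlers[name]()
                return
```

In this sketch a success event that arrives while the workflow is paused is still recorded, but the next step is launched only after a `resume` event, and a failed step blocks all later steps.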
Safety First: Automated Canary and Compliance Checks
To ensure safety, the platform implements several checks:
- Draft Pull Requests: Changes are initially made in draft PRs, awaiting all PR checks to pass.
- Automated Canary Validation: Integration with resilience teams enables canary deployments. If a canary fails, the rollout stops, preventing broader impact.
- Phased Rollouts: Changes are rolled out by criticality, allowing early detection of issues in lower-risk applications.
- Compliance Checks: Ensures changes align with team preferences and security requirements.
- Easy Interventions: Provides a 'big red stop button' for manual pauses at any point.
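Three of the checks above (phased rollout by criticality, the canary gate, and the manual stop) can be combined in a short Python sketch. The `App` type, the numeric `criticality` field (assumed here to mean lower number, lower risk), and the `canary_passes` and `stop_requested` callbacks are hypothetical names; the real platform's interfaces are not described in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    criticality: int  # assumption: a lower number means a lower-risk application


def phased_rollout(apps, canary_passes, stop_requested=lambda: False):
    """Roll out to apps in ascending criticality.

    Halts at the first canary failure or as soon as a manual stop is
    requested. Returns (names deployed so far, reason the rollout ended).
    """
    deployed = []
    for app in sorted(apps, key=lambda a: a.criticality):
        if stop_requested():                 # the "big red stop button"
            return deployed, "stopped manually"
        if not canary_passes(app):           # canary gate before wider impact
            return deployed, f"canary failed for {app.name}"
        deployed.append(app.name)
    return deployed, "completed"
```

Ordering by criticality means that, in this sketch, a faulty change is most likely to be caught on a lower-risk application before any higher-risk one is touched.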
The typical migration path involves a code transform step (using custom scripts, GenAI-prompted containers, or pre-configured codemods), followed by draft pull request creation, and then an extensive validation step. This validation leverages automated canaries, a crucial mechanism to test changes in a small production subset before widespread deployment, significantly boosting confidence in automated changes.
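The three-step path above can be sketched as a list of composable step functions. This is a deliberately synchronous simplification of what the source describes as an asynchronous, event-driven process, and every name in it (`code_transform`, `open_draft_pr`, `validate`, `run_path`, and the keys of the `ctx` dictionary) is an assumption for illustration.

```python
def code_transform(ctx):
    # Stand-in for a custom script, GenAI-prompted container, or codemod.
    ctx["diff"] = f"bump {ctx['dependency']} to {ctx['version']}"
    return True


def open_draft_pr(ctx):
    # The change is first proposed as a draft pull request.
    ctx["pr"] = {"draft": True, "diff": ctx["diff"]}
    return True


def validate(ctx):
    # Stand-in for the validation step; here it only consults a canary callback.
    return ctx["canary"](ctx["pr"])


# A pre-configured path for a common case; a custom migration could
# assemble a different list from the same kind of building blocks.
DEPENDENCY_UPDATE_PATH = [code_transform, open_draft_pr, validate]


def run_path(path, ctx):
    """Run steps in order; return the names of the steps that succeeded."""
    done = []
    for step in path:
        if not step(ctx):
            break  # a failed step stops the path
        done.append(step.__name__)
    return done
```

Keeping each step as an independent function with a shared context is one way to get the "Lego-like" composability: a path is just an ordered list, so steps can be reordered, swapped, or reused across campaigns.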