Airbnb Engineering · April 28, 2026

Airbnb's Skipper: An Embedded Workflow Engine for Durable Execution

Airbnb developed Skipper, an embedded workflow engine, to address the challenges of durable execution for long-running, multi-step business processes. It provides reliable, exactly-once semantics and crash recovery without the operational overhead of external orchestration clusters or the vendor lock-in of cloud-managed services. Skipper's design focuses on simplicity, embedding within services, leveraging existing infrastructure, and enabling developers to write business logic cohesively.

Read original on Airbnb Engineering

The Challenge of Durable Execution

Many critical business processes, such as processing insurance claims or payments, involve multiple steps that span minutes, hours, or even days. If a server crashes partway through such a workflow, the result can be duplicate processing, corrupted state, or an operation left incomplete. Traditional architectures struggle to guarantee exactly-once semantics and reliable state management across these failures. Existing solutions, such as external orchestration engines (e.g., Cadence, Temporal) or cloud-managed workflow services, introduce operational complexity, infrastructure dependencies, or vendor lock-in, all of which were problematic for Airbnb's Tier 0 services, which require high availability and minimal external dependencies. Homegrown queue-based systems avoid external dependencies, but they fragment domain logic and force each team to solve idempotency and retries again.
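
To illustrate the kind of idempotency plumbing that queue-based designs tend to repeat, here is a minimal sketch of a handler that deduplicates redelivered messages by key. The names (`ChargeMessage`, `ChargeHandler`, `idempotencyKey`) are hypothetical and not taken from Airbnb's systems, and the in-memory map stands in for a real dedup table.

```kotlin
// Hypothetical message type; the idempotency key identifies one logical charge.
data class ChargeMessage(val idempotencyKey: String, val amountCents: Long)

class ChargeHandler {
    // Stand-in for a dedup table in the service's database.
    private val processed = mutableMapOf<String, String>()

    // Counts how many times the side effect actually ran.
    var chargesIssued = 0
        private set

    // Returns the stored receipt for a key that was already handled,
    // and performs the (simulated) charge only on first delivery.
    fun handle(msg: ChargeMessage): String =
        processed.getOrPut(msg.idempotencyKey) {
            chargesIssued++
            "receipt-${msg.idempotencyKey}"
        }
}

// Simulates a queue that redelivers the same message after a crash before ack.
fun redeliveryDemo(): Int {
    val handler = ChargeHandler()
    val msg = ChargeMessage("reservation-7:charge", 12_500)
    handler.handle(msg)
    handler.handle(msg) // redelivery
    return handler.chargesIssued
}
```

Every team that builds on a raw queue ends up writing some variant of this table and lookup, which is the duplication of effort the paragraph above describes.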

Introducing Skipper: An Embedded Workflow Engine

Skipper is Airbnb's solution: a lightweight, embedded workflow engine designed to provide durable execution capabilities directly within existing services. Unlike external orchestrators, Skipper runs as a library within the application, eliminating the need for a separate cluster or external critical dependency. Its primary goal is to simplify the development of robust, multi-step business processes by abstracting away the complexities of state management, crash recovery, and retries, allowing developers to focus on domain logic.

  • No new critical dependencies: Operates as an embedded library, preventing a single point of failure that a central orchestration cluster would present.
  • Leverages existing infrastructure: Utilizes the host service's existing database (MySQL, Unified Data Store) for state persistence, avoiding new data store management.
  • Self-service integration: Simple library integration with minimal configuration for Java/Kotlin services.
  • Simple programming model: Promotes writing cohesive business logic with an annotation-based contract, reducing boilerplate.
  • Performance neutrality: Designed to coexist with latency-sensitive services using separate thread pools and efficient hibernation.
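
The thread-pool isolation mentioned in the last bullet can be sketched with the standard `java.util.concurrent` executors. This is only an illustration of the general technique: the pool names, sizes, and the `namedPool` helper are assumptions, and the source does not describe how Skipper configures its pools.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.ThreadFactory
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: a fixed-size pool whose threads carry a recognizable name prefix.
fun namedPool(prefix: String, size: Int): ExecutorService {
    val counter = AtomicInteger()
    val factory = ThreadFactory { task ->
        Thread(task, "$prefix-${counter.incrementAndGet()}").apply { isDaemon = true }
    }
    return Executors.newFixedThreadPool(size, factory)
}

// Runs one task on each pool and reports which thread executed it.
fun threadIsolationDemo(): Pair<String, String> {
    val requestPool = namedPool("request", 4)   // latency-sensitive request handling
    val workflowPool = namedPool("workflow", 2) // background workflow execution
    try {
        val requestThread = requestPool
            .submit(Callable { Thread.currentThread().name })
            .get(5, TimeUnit.SECONDS)
        val workflowThread = workflowPool
            .submit(Callable { Thread.currentThread().name })
            .get(5, TimeUnit.SECONDS)
        return requestThread to workflowThread
    } finally {
        requestPool.shutdown()
        workflowPool.shutdown()
    }
}
```

Because workflow work is bounded by its own pool, a burst of workflow replays cannot consume the threads that serve requests.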

How Skipper Achieves Durability: Replay with Checkpointing

Skipper's durability mechanism relies on a replay model with checkpointed actions. Workflows are defined as sequences of "Actions" (e.g., API calls, database updates), and the results of these actions are checkpointed to the database. If a workflow needs to wait (e.g., for an external event or timer), its current state is persisted, and the workflow hibernates. Upon restart or when a condition changes (e.g., a signal arrives), Skipper replays the workflow from the beginning. Crucially, previously executed and checkpointed actions do not re-execute; their stored results are returned instantly, allowing the workflow to pick up exactly where it left off. This approach simplifies crash recovery and ensures forward progress. Unlike some event-sourced systems, Skipper directly persists state fields rather than reconstructing state from a full event log, prioritizing efficiency for long-running workflows.
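
The replay model can be made concrete with a small in-memory sketch. It is not Skipper's implementation: `CheckpointStore`, `ReplayContext`, and `SimulatedCrash` are hypothetical names, and a map stands in for the database. Each action's result is saved under a key derived from its position in the workflow, so a second run returns the saved result instead of executing the action again.

```kotlin
// Stand-in for the database table that holds checkpointed action results.
class CheckpointStore {
    private val results = mutableMapOf<String, Any?>()
    fun contains(key: String): Boolean = results.containsKey(key)
    fun load(key: String): Any? = results[key]
    fun save(key: String, value: Any?) { results[key] = value }
}

// One context per run; the step counter restarts at 0 on every replay,
// so the same action maps to the same checkpoint key each time.
class ReplayContext(private val workflowId: String, private val store: CheckpointStore) {
    private var nextStep = 0

    @Suppress("UNCHECKED_CAST")
    fun <T> action(name: String, body: () -> T): T {
        val key = "$workflowId/${nextStep++}/$name"
        if (store.contains(key)) return store.load(key) as T // replay: skip execution
        val result = body()                                   // first run: execute
        store.save(key, result)                               // then checkpoint
        return result
    }
}

class SimulatedCrash : RuntimeException("process died mid-workflow")

// Workflow body: two actions, with an optional crash between them.
fun chargeAndAccept(ctx: ReplayContext, log: MutableList<String>, crashAfterCharge: Boolean): String {
    val chargeId = ctx.action("charge") {
        log.add("charge")
        "charge-42"
    }
    if (crashAfterCharge) throw SimulatedCrash()
    return ctx.action("markAccepted") {
        log.add("markAccepted")
        "accepted:$chargeId"
    }
}

// First run crashes after the charge; the second run replays from the beginning.
fun replayDemo(): Pair<String, List<String>> {
    val store = CheckpointStore()
    val log = mutableListOf<String>()
    try {
        chargeAndAccept(ReplayContext("wf-1", store), log, crashAfterCharge = true)
    } catch (e: SimulatedCrash) {
        // The process "restarts" here; only the store survives.
    }
    val result = chargeAndAccept(ReplayContext("wf-1", store), log, crashAfterCharge = false)
    return result to log
}
```

In `replayDemo`, the log records `charge` once even though the workflow body ran twice. The sketch does not cover a crash between running an action and saving its checkpoint, which a real engine has to handle separately.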

kotlin
// Define workflow logic as a normal-looking Kotlin class
class ChargeAndAccept : Workflow() {
    private val billing = actions<BillingActions>()
    private val reservations = actions<ReservationActions>()

    @StateParam var paymentCaptured = false

    @WorkflowMethod
    suspend fun execute(r0: Reservation): Reservation {
        val r1 = billing.charge(r0) // durable side-effect boundary
        waitUntil { paymentCaptured } // durable wait (resumes after restart)
        return reservations.markAccepted(r1)
    }
}

// Side effects live in Actions; one annotation makes a method checkpointable
class BillingActions : Actions() {
    // billingApi is not defined in this excerpt; it is assumed to be an injected billing client
    @Execute(checkpoint = true)
    suspend fun charge(r: Reservation): Reservation = billingApi.chargeAsync(r.id, r.amount).await()
}
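
The durable wait in `ChargeAndAccept` can be approximated with the following sketch, which shows hibernation and signal-driven replay in isolation. It assumes a simplified model: a `Hibernate` exception stands in for suspending the workflow, `PersistedWorkflow` stands in for the persisted `@StateParam` fields and checkpoints, and none of these names come from Skipper itself.

```kotlin
// Thrown to unwind the workflow when a wait condition is not yet met.
class Hibernate : RuntimeException("workflow is waiting")

// Stand-in for what would be stored in the database between runs.
class PersistedWorkflow(
    var paymentCaptured: Boolean = false,
    val checkpoints: MutableMap<String, String> = mutableMapOf()
)

class WorkflowRun(private val persisted: PersistedWorkflow, private val log: MutableList<String>) {
    // Executes the action once; later runs return the checkpointed result.
    fun action(name: String, body: () -> String): String =
        persisted.checkpoints.getOrPut(name) {
            log.add(name)
            body()
        }

    // Hibernates the run if the condition does not hold yet.
    fun waitUntil(condition: () -> Boolean) {
        if (!condition()) throw Hibernate()
    }
}

// Returns null when the workflow hibernated, or the final result when it completed.
fun runChargeThenWait(persisted: PersistedWorkflow, log: MutableList<String>): String? =
    try {
        val run = WorkflowRun(persisted, log)
        val charged = run.action("charge") { "charged" }
        run.waitUntil { persisted.paymentCaptured }
        run.action("markAccepted") { "$charged, accepted" }
    } catch (e: Hibernate) {
        null
    }

// First run hibernates at the wait; a signal flips the state field and triggers a replay.
fun hibernationDemo(): Triple<String?, String?, List<String>> {
    val persisted = PersistedWorkflow()
    val log = mutableListOf<String>()
    val beforeSignal = runChargeThenWait(persisted, log)
    persisted.paymentCaptured = true // the "payment captured" signal arrives
    val afterSignal = runChargeThenWait(persisted, log)
    return Triple(beforeSignal, afterSignal, log)
}
```

The first run returns `null` after charging, and the replay after the signal skips the charge and finishes with `markAccepted`, so each action appears in the log exactly once.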

Tags: workflow engine, durable execution, idempotency, crash recovery, embedded systems, orchestration, microservices architecture, state management
