Dev.to #systemdesign·June 24, 2026

Architecting Durable Runtimes for AI Agents: Beyond Sandboxes

This article, based on a presentation by Viren Baraiya of Orkes, argues that running AI agents in production requires moving beyond traditional sandboxes. It advocates a "durable runtime" that provides agent state persistence, crash recovery, and auditability. The core architectural idea is to separate agent "reasoning" from "execution" and to use a workflow engine such as Netflix Conductor for robust orchestration.

AI & ML · Infrastructure · Distributed Systems · Performance & Scaling

Limitations of Traditional AI Agent Runtimes

Traditional AI agent development often treats agents as entities running in an in-memory loop (LLM in the loop), where the LLM observes state, decides on tool calls, updates state, and continues the loop. While simple, this model has significant drawbacks in production environments, particularly for long-running or complex tasks.

Lack of long-running task support: Sandboxes or micro-VMs are inefficient for tasks spanning days or weeks, especially those that wait on human intervention (human-in-the-loop). Keeping the entire process alive for the whole duration consumes CPU and memory while nothing is happening.
Absence of crash recovery: In-memory state is lost when the process dies (e.g., after a crash, network timeout, or power failure), so the agent cannot resume accurately from its last completed step.
Complex multi-agent coordination: Inter-agent communication within sandboxes requires intricate glue code for IP addressing, retry logic, and state management.
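
The in-memory loop described above can be sketched roughly as follows. This is a minimal illustration, not code from the presentation: `fake_llm`, `get_weather`, and `run_agent` are hypothetical names, and the LLM is replaced by a hard-coded function. Note that `state` is a local variable, so a crash anywhere inside the loop discards all progress.

```python
from typing import Callable

def fake_llm(state: dict) -> dict:
    """Hypothetical stand-in for an LLM call: inspect state, pick the next action."""
    if "get_weather" not in state:
        return {"tool": "get_weather", "args": {"city": "Paris"}}
    return {"tool": "finish", "args": {"answer": f"It is {state['get_weather']} in Paris"}}

def get_weather(city: str) -> str:
    return "sunny"  # canned value, for illustration only

TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}

def run_agent(llm: Callable[[dict], dict], max_steps: int = 10) -> dict:
    state: dict = {}  # lives only in this process's memory
    for _ in range(max_steps):
        action = llm(state)  # observe state and decide
        if action["tool"] == "finish":
            state["answer"] = action["args"]["answer"]
            return state
        # Execute the tool in-process and store its result under the tool name.
        state[action["tool"]] = TOOLS[action["tool"]](**action["args"])
    raise RuntimeError("step budget exhausted")
```

Nothing here is written to disk, which is exactly why the three drawbacks listed above apply to it.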

Core Architectural Principle: Separate Reasoning and Execution

The proposed architecture addresses these issues through a clear separation of concerns. The sandbox is repurposed primarily to isolate the execution of agent-generated tools, which may contain bugs or pose security risks. The key insight is that the LLM should only *plan* or *propose* actions, never execute them directly; a resilient underlying runtime performs the actual tool invocation.
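
One way to picture this split is a planner that returns plain data and a runtime that owns every tool call. The sketch below is an assumption-laden illustration: `ProposedAction`, `plan`, and `Runtime` are hypothetical names, and `plan` stands in for an LLM with a fixed list of proposals.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ProposedAction:
    """What the reasoning side emits: a description of an action, not a call."""
    tool: str
    args: dict

def plan(goal: str) -> list[ProposedAction]:
    # Hypothetical stand-in for the LLM. It only returns data.
    return [
        ProposedAction("add", {"a": 2, "b": 3}),
        ProposedAction("shell", {"cmd": "rm -rf /"}),  # never registered below
    ]

class Runtime:
    """The execution side: the only component allowed to invoke tools."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def execute(self, action: ProposedAction) -> dict:
        fn = self._tools.get(action.tool)
        if fn is None:
            # Unknown tools are rejected by the runtime, not by the model.
            return {"tool": action.tool, "status": "rejected", "result": None}
        # A real runtime could run fn inside a sandbox at this point.
        return {"tool": action.tool, "status": "ok", "result": fn(**action.args)}

runtime = Runtime()
runtime.register("add", lambda a, b: a + b)
results = [runtime.execute(action) for action in plan("demo goal")]
```

Because the planner has no reference to the tool functions, anything it proposes must pass through `Runtime.execute` before it has any effect.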

Agents as Dynamic Sagas

Instead of static, predefined workflows, AI agents can be viewed as dynamically constructed sagas: long-running transactions whose steps the LLM generates on the fly (late-bound sagas). The underlying durable runtime records each step, which enables persistence and recovery in the same way that traditional sagas preserve transactional integrity across distributed services.
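
A late-bound saga can be sketched as a journal that accepts steps one at a time instead of from a fixed definition. The example below is illustrative only: `Step` and `DynamicSaga` are hypothetical names, the journal is an in-memory list standing in for a durable store, and the compensation-on-failure behavior is borrowed from the classic saga pattern rather than described in the source.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # undo logic, as in the classic saga pattern

class DynamicSaga:
    """Steps arrive one at a time (for example, proposed by an LLM)."""

    def __init__(self) -> None:
        self.journal: list[str] = []  # stand-in for a durable ledger
        self._completed: list[Step] = []

    def run_step(self, step: Step) -> bool:
        try:
            step.action()
        except Exception:
            self.journal.append(f"failed:{step.name}")
            self._rollback()
            return False
        self._completed.append(step)
        self.journal.append(f"done:{step.name}")
        return True

    def _rollback(self) -> None:
        # Compensate completed steps in reverse order.
        while self._completed:
            step = self._completed.pop()
            step.compensate()
            self.journal.append(f"compensated:{step.name}")

bookings: list[str] = []

def book_hotel() -> None:
    raise RuntimeError("hotel unavailable")

saga = DynamicSaga()
saga.run_step(Step("book_flight",
                   lambda: bookings.append("flight"),
                   lambda: bookings.remove("flight")))
saga.run_step(Step("book_hotel", book_hotel, lambda: None))
```

After the second step fails, the journal holds the full history and the flight booking has been undone.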

Solution Components: Conductor and Agent Span

Netflix Conductor (Microservice Workflow Engine): This open-source project (originally from Netflix, now maintained by Orkes) serves as the durable foundation. It records every step of an agent's execution (LLM calls, tool invocations, state transitions, human inputs) in a persistent ledger backed by a database (e.g., PostgreSQL, Redis). Conductor provides:
  On-demand suspension: Workflows can pause and release all CPU and memory during long waits (e.g., human approval), then resume exactly where they left off when triggered, even months later.
  Idempotency handling: Because tasks are delivered with at-least-once semantics, a tool may be invoked more than once, so tools (especially non-idempotent ones) need careful retry logic.
Agent Span (Agent Runtime): Built atop Conductor, Agent Span acts as a specialized runtime for AI agents. It functions as a 'compiler' that converts agents defined using popular SDKs (e.g., LangGraph, OpenAI Agents) into Conductor's durable workflows without modifying the agent's business logic. This allows agents to seamlessly leverage Conductor's persistence and orchestration capabilities.

Architectural Benefits of Durable Runtimes

Deterministic Guardrails: Security and safety checks are enforced by the framework, not reliant on the LLM's decision-making. This prevents agents from bypassing critical safeguards due to hallucinations (e.g., accidental database deletion).
Full Auditability and Replay: Every action and decision made by the agent is logged in the ledger, enabling complete auditing months later. Developers can "replay" execution, even mocking LLM outputs, for debugging and compliance.
Efficient Testing and Evaluation: The ability to alter LLM outputs at specific steps and observe subsequent tool chains and business logic provides more deterministic and effective testing for agent behavior.
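
The guardrail and replay benefits above can be combined in one small sketch: recorded LLM outputs are replayed, one of them is swapped for a destructive proposal, and a plain-code check stops it. All names here (`check_guardrails`, `replay`, `run_sql`) are hypothetical, and the substring check is a deliberately simple placeholder for a real policy.

```python
from typing import Optional

class GuardrailViolation(Exception):
    pass

def check_guardrails(action: dict) -> None:
    """Ordinary code, not a prompt: it behaves the same whatever the LLM says."""
    if action["tool"] == "run_sql" and "drop table" in action["args"]["query"].lower():
        raise GuardrailViolation("destructive SQL is not allowed")

def run_sql(query: str) -> str:
    return f"executed: {query}"  # fake tool; a real replay might mock this too

TOOLS = {"run_sql": run_sql}

def replay(recorded: list[dict],
           overrides: Optional[dict[int, dict]] = None) -> list[str]:
    """Re-run recorded LLM outputs, optionally replacing some by step index."""
    overrides = overrides or {}
    trace: list[str] = []
    for index, original in enumerate(recorded):
        action = overrides.get(index, original)
        try:
            check_guardrails(action)
        except GuardrailViolation as exc:
            trace.append(f"blocked: {exc}")
            break  # stop the run at the first violation
        trace.append(TOOLS[action["tool"]](**action["args"]))
    return trace

recorded_outputs = [
    {"tool": "run_sql", "args": {"query": "SELECT 1"}},
    {"tool": "run_sql", "args": {"query": "SELECT 2"}},
]
baseline = replay(recorded_outputs)
altered = replay(
    recorded_outputs,
    overrides={1: {"tool": "run_sql", "args": {"query": "DROP TABLE users"}}},
)
```

Here `baseline` reproduces the recorded run, while `altered` shows what the downstream tool chain does when step 1 is replaced by a simulated hallucination.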

AI Agents · Durable Runtimes · Workflow Orchestration · Netflix Conductor · Distributed Sagas · Crash Recovery · Scalability · System Design for AI