Cloudflare Blog·June 17, 2026

Architecting Production-Grade AI Agents with Cloudflare's Agents SDK and Flue

This article discusses the emerging architectural stack for building production-grade AI agents, focusing on the Cloudflare Agents SDK and the Flue framework. It addresses common distributed systems challenges like durable execution, secure code execution, and persistent storage that agents face in cloud environments. The solution involves a three-layer architecture: framework, harness, and a platform that provides core primitives for reliability and scalability.

The article highlights the maturation of agent harnesses, moving AI agents from prototypes to load-bearing infrastructure. This shift introduces significant distributed systems challenges, particularly concerning reliability and state management in a cloud environment. Key problems include graceful resumption from interruptions, secure execution of untrusted code, and persistent access to tools and data without losing context or wasting resources.

The Three-Layer Stack for Production AI Agents

To address the complexities of scaling AI agents, a new architectural stack is emerging, comprising three distinct layers:

The Framework (e.g., Flue): Provides project structure, conventions, integrations, CLI, and developer experience. It focuses on how agents are built and integrated into existing workflows.
The Harness (e.g., Pi, Project Think): Implements the core agentic loop, managing tool calls, result processing, context management, and task progression.
The Runtime/Platform (Cloudflare Agents SDK): Offers the foundational compute, state, and storage primitives necessary for durable and scalable agent execution.

Architectural Pattern

This layered approach separates concerns, allowing frameworks to focus on developer experience, harnesses on agent logic, and the platform on fundamental distributed systems challenges like durability and security. This mirrors traditional software architecture patterns where infrastructure provides robust primitives to higher-level application logic.
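
The separation of concerns described above can be sketched in a few dozen lines. This is an illustrative model only, not the API of Flue, Pi, Project Think, or the Agents SDK: `Platform`, `runHarness`, `createAgent`, and the scripted `decide` function are hypothetical names, and the "platform" is an in-memory map standing in for durable storage.

```typescript
// Platform layer: storage primitives (here an in-memory stand-in for durable state).
interface Platform {
  load(key: string): string | undefined;
  save(key: string, value: string): void;
}

// A tool is any named function the harness may call on the model's behalf.
type Tool = (input: string) => string;

// One decision from the model: call a tool, or finish with an answer.
type ModelStep =
  | { kind: "tool"; name: string; input: string }
  | { kind: "done"; answer: string };

// Harness layer: the agentic loop. It knows nothing about project layout
// or about how the platform stores data.
function runHarness(
  decide: (history: string[]) => ModelStep,
  tools: Record<string, Tool>,
  platform: Platform,
  maxSteps = 10,
): string {
  const history: string[] = JSON.parse(platform.load("history") ?? "[]");
  for (let i = 0; i < maxSteps; i++) {
    const step = decide(history);
    if (step.kind === "done") return step.answer;
    const tool = tools[step.name];
    const result = tool ? tool(step.input) : `unknown tool: ${step.name}`;
    history.push(`${step.name}(${step.input}) -> ${result}`);
    platform.save("history", JSON.stringify(history)); // persist after every tool call
  }
  throw new Error("step budget exhausted");
}

// Framework layer: conventions and wiring that tie a harness to a platform.
function createAgent(config: {
  tools: Record<string, Tool>;
  decide: (history: string[]) => ModelStep;
}) {
  const store = new Map<string, string>();
  const platform: Platform = {
    load: (key) => store.get(key),
    save: (key, value) => { store.set(key, value); },
  };
  return { run: () => runHarness(config.decide, config.tools, platform), store };
}

// Usage with a scripted "model" that calls one tool and then answers.
const agent = createAgent({
  tools: {
    add: (input) => String(input.split("+").map(Number).reduce((a, b) => a + b, 0)),
  },
  decide: (history) =>
    history.length === 0
      ? { kind: "tool", name: "add", input: "2+3" }
      : { kind: "done", answer: history.join("; ") },
});
const answer = agent.run();
```

Because the harness only sees the `Platform` interface, swapping the in-memory map for real durable storage would not require touching the loop.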

Durable Execution for AI Agents

Agent turns can be long-running and multi-step, making them susceptible to interruptions or crashes. Losing in-memory state during such events leads to poor user experience and wasted compute. The Cloudflare Agents SDK tackles this with Durable Objects and a Fiber mechanism.

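```typescript
import { Agent } from "agents";
import type { FiberRecoveryContext } from "agents";

// Placeholders for the application's own long-running steps (not part of the SDK).
declare function expensiveOperation(): Promise<unknown>;
declare function anotherExpensiveOperation(input: unknown): Promise<unknown>;

class MyAgent extends Agent {
  async doWork() {
    await this.runFiber("my-task", async (ctx) => {
      const step1 = await expensiveOperation();
      ctx.stash({ step1 }); // Checkpoint progress before starting the next step
      const step2 = await anotherExpensiveOperation(step1);
      this.setState({ ...this.state, result: step2 });
    });
  }

  // Called after an interruption, with the last stashed snapshot (if any).
  async onFiberRecovered(ctx: FiberRecoveryContext) {
    if (ctx.name !== "my-task") return;
    const { step1 } = (ctx.snapshot ?? {}) as { step1?: unknown };
    if (step1 !== undefined) {
      // Resume from the checkpoint: step 1 is not repeated.
      const step2 = await anotherExpensiveOperation(step1);
      this.setState({ ...this.state, result: step2 });
    }
  }
}
```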

In this mechanism, `runFiber()` wraps the task, `ctx.stash()` checkpoints intermediate results to the Durable Object's SQLite storage, and `onFiberRecovered()` resumes execution from the last valid checkpoint after an interruption. Checkpointed progress therefore survives a crash or restart, and only the work done after the most recent checkpoint has to be repeated. This kind of fault tolerance is critical for production systems.
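
To make the checkpoint-and-resume idea concrete without depending on the SDK, here is a minimal, synchronous simulation. Everything in it is an assumption made for illustration: `durableStore` is an in-memory map standing in for durable storage, and `runTask` and `FiberContext` are hypothetical names, not the Agents SDK's API or implementation.

```typescript
// In-memory stand-in for durable storage; a real platform would persist this across restarts.
const durableStore = new Map<string, unknown>();

interface FiberContext {
  snapshot: unknown;            // last stashed value, if any
  stash(value: unknown): void;  // persist a checkpoint
}

// Hypothetical helper: run a named task, handing it the last checkpoint.
function runTask<T>(name: string, body: (ctx: FiberContext) => T): T {
  const ctx: FiberContext = {
    snapshot: durableStore.get(name),
    stash: (value) => { durableStore.set(name, value); },
  };
  const result = body(ctx);   // if this throws, the checkpoint stays in the store
  durableStore.delete(name);  // finished: the checkpoint is no longer needed
  return result;
}

let step1Runs = 0;
let crashOnce = true;

function work(ctx: FiberContext): number {
  const saved = ctx.snapshot as { step1?: number } | undefined;
  let step1 = saved?.step1;
  if (step1 === undefined) {
    step1Runs++;           // the "expensive" first step
    step1 = 21;
    ctx.stash({ step1 });  // checkpoint before moving on
  }
  if (crashOnce) {
    crashOnce = false;
    throw new Error("simulated interruption"); // crash between step 1 and step 2
  }
  return step1 * 2;        // second step reuses the checkpointed value
}

let result: number | undefined;
try {
  result = runTask("my-task", work);
} catch {
  // After the "restart", run the task again; it resumes from the stash.
  result = runTask("my-task", work);
}
```

The first attempt fails after the checkpoint is written, so the second attempt skips step 1 entirely; `step1Runs` stays at 1.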

Secure and Efficient Code Execution

Instead of an ever-growing list of tools, the platform allows agents to execute generated code. For secure execution, `@cloudflare/codemode` wraps Dynamic Workers to run LLM-generated code in isolated, ephemeral Worker isolates. This approach offers significant advantages over traditional container-based sandboxes:

Speed: Isolates start in under 10ms, much faster than booting containers.
Cost-Efficiency: Significantly cheaper per execution ($0.002 per load) compared to containers.
Security: Each code snippet runs in its own Worker isolate with strictly controlled bindings, preventing malicious code from affecting the host or other agents.
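
The "strictly controlled bindings" idea can be illustrated with a small capability-passing sketch: generated code receives a single `env` object and is written to work only through it. This is not how `@cloudflare/codemode` or Dynamic Workers are implemented, and `new Function` is not a security boundary (code run this way can still reach global objects). The `Bindings` interface, `runGenerated`, and the sample snippet are hypothetical.

```typescript
// The only capabilities the generated code is meant to use.
interface Bindings {
  fetchTitle(id: number): string;
  log(message: string): void;
}

// Runs a snippet with a frozen copy of the bindings as its sole argument.
// NOTE: illustration of the binding interface only; a real sandbox needs
// process- or isolate-level isolation, which this does not provide.
function runGenerated(source: string, bindings: Bindings): unknown {
  const fn = new Function("env", `"use strict"; ${source}`);
  return fn(Object.freeze({ ...bindings }));
}

// Host-side implementations of the bindings.
const logs: string[] = [];
const titles: Record<number, string> = { 1: "Durable execution", 2: "Code mode" };
const bindings: Bindings = {
  fetchTitle: (id) => titles[id] ?? "unknown",
  log: (message) => { logs.push(message); },
};

// Stand-in for LLM-generated code: one script replaces three separate tool calls.
const generated = `
  const names = [1, 2, 3].map((id) => env.fetchTitle(id));
  env.log("looked up " + names.length + " titles");
  return names.join(", ");
`;

const output = runGenerated(generated, bindings);
```

The design point is that the host decides what goes into `env`, so the surface area available to generated code is an explicit, reviewable list rather than whatever happens to be in scope.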

Durable Filesystem and Dynamic Workflows

Agents, especially coding agents, often require a persistent filesystem. The Agents SDK provides `@cloudflare/shell`, offering a durable virtual filesystem backed by SQLite within the Durable Object. This enables common file operations (read, write, grep, diff) without the overhead of a full container. For more complex scenarios requiring a full OS, Cloudflare Containers are available, and `@cloudflare/workspace` aims to bridge the virtual filesystem with container environments.
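
As a rough picture of what a virtual filesystem offers an agent, the sketch below implements read, write, grep, and a naive line-by-line diff over an in-memory map. It is not the `@cloudflare/shell` API: the `VirtualFs` class and its method names are hypothetical, and a durable version would keep the same path-to-contents mapping in database rows rather than in memory.

```typescript
interface GrepHit {
  path: string;
  line: number; // 1-based line number
  text: string;
}

class VirtualFs {
  // path -> contents; a durable implementation would store these as rows in a table.
  private files = new Map<string, string>();

  write(path: string, contents: string): void {
    this.files.set(path, contents);
  }

  read(path: string): string {
    const contents = this.files.get(path);
    if (contents === undefined) throw new Error(`ENOENT: ${path}`);
    return contents;
  }

  // Return every line in every file that matches the pattern.
  grep(pattern: RegExp): GrepHit[] {
    const hits: GrepHit[] = [];
    this.files.forEach((contents, path) => {
      contents.split("\n").forEach((text, index) => {
        if (text.search(pattern) !== -1) hits.push({ path, line: index + 1, text });
      });
    });
    return hits;
  }

  // Naive positional diff: compares line i of one file with line i of the other.
  diff(pathA: string, pathB: string): string[] {
    const left = this.read(pathA).split("\n");
    const right = this.read(pathB).split("\n");
    const out: string[] = [];
    for (let i = 0; i < Math.max(left.length, right.length); i++) {
      if (left[i] === right[i]) continue;
      if (left[i] !== undefined) out.push(`-${left[i]}`);
      if (right[i] !== undefined) out.push(`+${right[i]}`);
    }
    return out;
  }
}

// Usage: two small files that differ in their first line.
const fs = new VirtualFs();
fs.write("src/a.ts", "const a = 1;\n// TODO: remove\nexport { a };");
fs.write("src/b.ts", "const a = 2;\n// TODO: remove\nexport { a };");
const todoHits = fs.grep(/TODO/);
const changes = fs.diff("src/a.ts", "src/b.ts");
```

Operations like these need no process, kernel, or disk image, which is why they can run without a full container.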

Additionally, for orchestrating multi-step, complex tasks, `@cloudflare/dynamic-workflows` allows agents to generate and execute durable workflows. This feature enables agents to reliably coordinate sequences of operations, persist intermediate steps, and retry failures, making them suitable for intricate tasks like code reviews or research pipelines.
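
The three properties named above (coordinating a sequence, persisting intermediate steps, and retrying failures) can be sketched with a step log. This is an illustrative model under stated assumptions, not the `@cloudflare/dynamic-workflows` API: `Workflow`, `StepLog`, and `reviewPipeline` are hypothetical names, the log is an in-memory map standing in for durable storage, and steps are synchronous for brevity.

```typescript
// Persisted results keyed by step name; a durable version would live in storage.
type StepLog = Map<string, unknown>;

class Workflow {
  private log: StepLog;
  private maxAttempts: number;

  constructor(log: StepLog, maxAttempts = 3) {
    this.log = log;
    this.maxAttempts = maxAttempts;
  }

  // Run a named step, retrying on failure; once it succeeds, replay the stored result.
  step<T>(name: string, fn: () => T): T {
    if (this.log.has(name)) return this.log.get(name) as T;
    let lastError: unknown;
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        const value = fn();
        this.log.set(name, value); // persist the intermediate result
        return value;
      } catch (error) {
        lastError = error;         // remember the failure and try again
      }
    }
    throw lastError;
  }
}

const stepLog: StepLog = new Map();
let fetchCalls = 0;
let lintCalls = 0;

// A toy code-review pipeline: fetch a diff, lint it, summarize.
function reviewPipeline(wf: Workflow): string {
  const diff = wf.step("fetch-diff", () => {
    fetchCalls++;
    return "+ console.log(1)";
  });
  const warnings = wf.step("lint", () => {
    lintCalls++;
    if (lintCalls === 1) throw new Error("transient failure"); // fails once, then succeeds
    return diff.includes("console.log") ? 1 : 0;
  });
  return wf.step("summarize", () => `${warnings} warning(s)`);
}

const first = reviewPipeline(new Workflow(stepLog));
// Simulate a restart: a new Workflow over the same log replays stored results.
const replay = reviewPipeline(new Workflow(stepLog));
```

On the replay, no step body runs again: the fetch happens once and the lint step runs twice in total (one failure, one success), both during the first pass.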
