This article discusses the emerging architectural stack for building production-grade AI agents, focusing on the Cloudflare Agents SDK and the Flue framework. It addresses common distributed systems challenges like durable execution, secure code execution, and persistent storage that agents face in cloud environments. The solution involves a three-layer architecture: framework, harness, and a platform that provides core primitives for reliability and scalability.
Read original on Cloudflare Blog
The article highlights the maturation of agent harnesses, moving AI agents from prototypes to load-bearing infrastructure. This shift introduces significant distributed systems challenges, particularly concerning reliability and state management in a cloud environment. Key problems include graceful resumption from interruptions, secure execution of untrusted code, and persistent access to tools and data without losing context or wasting resources.
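To address the complexities of scaling AI agents, a new architectural stack is emerging, comprising three distinct layers: a framework, a harness, and a platform that supplies the core primitives for reliability and scalability.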
Architectural Pattern
This layered approach separates concerns, allowing frameworks to focus on developer experience, harnesses on agent logic, and the platform on fundamental distributed systems challenges like durability and security. This mirrors traditional software architecture patterns where infrastructure provides robust primitives to higher-level application logic.
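The separation of concerns can be sketched in a few lines of TypeScript. Every name here (`Platform`, `Harness`, `createAgent`, `memoryPlatform`) is hypothetical and chosen for illustration; none of them comes from the Agents SDK or Flue, and the in-memory map merely stands in for durable storage.

```typescript
// Hypothetical names throughout; not Agents SDK or Flue APIs.

// Platform layer: owns durability. Here, a minimal key-value contract.
interface Platform {
  load(key: string): string | undefined;
  save(key: string, value: string): void;
}

// Harness layer: the agent loop, written only against Platform primitives.
class Harness {
  platform: Platform;
  constructor(platform: Platform) {
    this.platform = platform;
  }

  // Run one turn and persist the transcript so a restart does not lose it.
  turn(sessionId: string, userMessage: string): string[] {
    const transcript: string[] = JSON.parse(this.platform.load(sessionId) ?? "[]");
    transcript.push(`user: ${userMessage}`);
    transcript.push(`agent: echo ${userMessage}`); // stand-in for a model call
    this.platform.save(sessionId, JSON.stringify(transcript));
    return transcript;
  }
}

// Framework layer: the small developer-facing surface over the harness.
function createAgent(platform: Platform) {
  const harness = new Harness(platform);
  return {
    send: (sessionId: string, text: string) => harness.turn(sessionId, text),
  };
}

// In-memory Platform, standing in for real durable storage.
function memoryPlatform(): Platform {
  const store = new Map<string, string>();
  return {
    load: (key) => store.get(key),
    save: (key, value) => {
      store.set(key, value);
    },
  };
}

// A second agent built on the same platform sees the first agent's history.
const sharedPlatform = memoryPlatform();
createAgent(sharedPlatform).send("s1", "hello");
const sessionTranscript = createAgent(sharedPlatform).send("s1", "again");
```

Because the harness only talks to the `Platform` interface, swapping the in-memory map for a durable backend would not touch the agent loop or the developer-facing `send` call, which is the point of the layering.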
Agent turns can be long-running and multi-step, making them susceptible to interruptions or crashes. Losing in-memory state during such events leads to poor user experience and wasted compute. The Cloudflare Agents SDK tackles this with Durable Objects and a Fiber mechanism.
```typescript
import { Agent } from "agents";
import type { FiberRecoveryContext } from "agents";

// expensiveOperation() and anotherExpensiveOperation() are placeholders for the agent's real work.
class MyAgent extends Agent {
  async doWork() {
    await this.runFiber("my-task", async (ctx) => {
      const step1 = await expensiveOperation();
      ctx.stash({ step1 }); // Checkpoint progress
      const step2 = await anotherExpensiveOperation(step1);
      this.setState({ ...this.state, result: step2 });
    });
  }

  async onFiberRecovered(ctx: FiberRecoveryContext) {
    if (ctx.name !== "my-task") return;
    const { step1 } = (ctx.snapshot ?? {}) as { step1?: unknown };
    // Compare against undefined so a falsy checkpoint (0, "") still resumes.
    if (step1 !== undefined) {
      const step2 = await anotherExpensiveOperation(step1);
      this.setState({ ...this.state, result: step2 }); // Resume from checkpoint
    }
  }
}
```
This mechanism uses `runFiber()` to wrap the work, `ctx.stash()` to checkpoint progress to the Durable Object's SQLite storage, and `onFiberRecovered()` to resume execution from the last valid checkpoint after an interruption. Checkpointed state therefore survives a crash or eviction, which gives the agent the fault tolerance that production systems need.
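To see the checkpoint-and-resume idea without any SDK dependency, here is a minimal simulation. It is a sketch under stated assumptions: `durableStore`, `runTask`, and the two step functions are hypothetical, a `Map` stands in for the Durable Object's SQLite storage, and a thrown error stands in for a crash or eviction.

```typescript
// Hypothetical stand-in for durable storage (the SDK persists to SQLite).
const durableStore = new Map<string, string>();

let step1Runs = 0;
function expensiveStep1(): number {
  step1Runs++; // count executions to show the step is not repeated
  return 21;
}
function expensiveStep2(input: number): number {
  return input * 2;
}

// Run the task, reusing a stashed snapshot if one survived an earlier crash.
function runTask(taskName: string, crashAfterStep1: boolean): number {
  const saved = durableStore.get(taskName);
  const snapshot: { step1?: number } = saved ? JSON.parse(saved) : {};

  // Skip step 1 when a checkpoint already holds its result.
  const step1 = snapshot.step1 ?? expensiveStep1();
  durableStore.set(taskName, JSON.stringify({ step1 })); // checkpoint

  if (crashAfterStep1) throw new Error("simulated eviction");

  const value = expensiveStep2(step1);
  durableStore.delete(taskName); // finished: the snapshot is no longer needed
  return value;
}

// First attempt dies after the checkpoint; the second resumes from it.
let firstAttemptFailed = false;
try {
  runTask("my-task", true);
} catch {
  firstAttemptFailed = true;
}
const finalResult = runTask("my-task", false); // 42, with step 1 executed once
```

The second call finds the stashed `step1` and goes straight to the second step, so the expensive first step runs only once across both attempts.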
Instead of an ever-growing list of tools, the platform allows agents to execute generated code. For secure execution, `@cloudflare/codemode` wraps Dynamic Workers to run LLM-generated code in isolated, ephemeral Worker isolates. The article presents this approach as a significant improvement over traditional container-based sandboxes.
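The "code instead of tool calls" pattern can be illustrated as follows: the model emits one short program that composes several tool bindings, rather than issuing one tool call per step. This sketch is not the `@cloudflare/codemode` API. The `tools` object, `generatedCode` string, and `runGenerated` helper are all hypothetical, and `new Function` provides no isolation whatsoever; it only shows the calling pattern that a real isolate would make safe.

```typescript
// Hypothetical tool bindings that a generated program is allowed to call.
const tools = {
  getIssue: (id: number) => ({ id, title: `Issue ${id}`, open: id % 2 === 1 }),
  countOpen: (issues: { open: boolean }[]) => issues.filter((i) => i.open).length,
};

// Stand-in for LLM output: one program that chains several tool calls.
const generatedCode = `
  const issues = [1, 2, 3].map((id) => tools.getIssue(id));
  return tools.countOpen(issues);
`;

// WARNING: new Function is NOT a sandbox. Generated code runs with the full
// privileges of this process. A production setup would run it in an isolate.
function runGenerated(code: string, bindings: typeof tools): unknown {
  const program = new Function("tools", `"use strict";${code}`);
  return program(bindings);
}

const openCount = runGenerated(generatedCode, tools); // 2 (issues 1 and 3 are open)
```

The generated program reaches the outside world only through the `tools` argument it is handed, which is the property an isolate-based runner can actually enforce.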
Agents, especially coding agents, often require a persistent filesystem. The Agents SDK provides `@cloudflare/shell`, offering a durable virtual filesystem backed by SQLite within the Durable Object. This enables common file operations (read, write, grep, diff) without the overhead of a full container. For more complex scenarios requiring a full OS, Cloudflare Containers are available, and `@cloudflare/workspace` aims to bridge the virtual filesystem with container environments.
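As a rough picture of what a virtual filesystem offers an agent, the sketch below implements write, read, and grep over an in-memory map. It is not the `@cloudflare/shell` API: the `VirtualFs` class and its method names are hypothetical, and a `Map` replaces the SQLite backing described above, so nothing here is actually durable.

```typescript
// Hypothetical in-memory stand-in; the article describes a SQLite-backed store.
class VirtualFs {
  files = new Map<string, string>();

  write(path: string, content: string): void {
    this.files.set(path, content);
  }

  read(path: string): string {
    const content = this.files.get(path);
    if (content === undefined) throw new Error(`ENOENT: ${path}`);
    return content;
  }

  // Return "path:line:text" for every line that matches the pattern.
  grep(pattern: RegExp): string[] {
    // Drop the g/y flags so RegExp.test() stays stateless across lines.
    const re = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
    const hits: string[] = [];
    this.files.forEach((content, path) => {
      content.split("\n").forEach((line, index) => {
        if (re.test(line)) hits.push(`${path}:${index + 1}:${line}`);
      });
    });
    return hits;
  }
}

const vfs = new VirtualFs();
vfs.write("src/a.ts", "const x = 1;\n// TODO: fix");
vfs.write("src/b.ts", "export {};");
const todoHits = vfs.grep(/TODO/); // ["src/a.ts:2:// TODO: fix"]
```

File operations like these are plain reads and writes against a keyed store, which is why they do not need a container or a real operating system underneath.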
Additionally, for orchestrating multi-step, complex tasks, `@cloudflare/dynamic-workflows` allows agents to generate and execute durable workflows. This feature enables agents to reliably coordinate sequences of operations, persist intermediate steps, and retry failures, making them suitable for intricate tasks like code reviews or research pipelines.
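The two properties named above, persisted intermediate steps and retried failures, can be sketched with a small step runner. This is an illustrative assumption rather than the `@cloudflare/dynamic-workflows` API: `WorkflowRun`, `step`, `flakyFetch`, and `reviewWorkflow` are hypothetical names, the `Map` stands in for durable step storage, and retry delays or backoff are omitted for brevity.

```typescript
// Hypothetical step runner; not the @cloudflare/dynamic-workflows API.
class WorkflowRun {
  results: Map<string, unknown>;
  constructor(persisted: Map<string, unknown>) {
    this.results = persisted;
  }

  // Run a named step once; replay its stored result on later runs.
  step<T>(stepName: string, fn: () => T, maxAttempts = 3): T {
    if (this.results.has(stepName)) return this.results.get(stepName) as T;
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        const value = fn();
        this.results.set(stepName, value); // persist before moving on
        return value;
      } catch (err) {
        lastError = err; // retry until attempts are exhausted
      }
    }
    throw lastError;
  }
}

// A step that fails twice before succeeding.
let fetchCalls = 0;
function flakyFetch(): string {
  fetchCalls++;
  if (fetchCalls < 3) throw new Error("transient failure");
  return "diff contents";
}

function reviewWorkflow(run: WorkflowRun): string {
  const diff = run.step("fetch-diff", flakyFetch);
  return run.step("summarize", () => `reviewed: ${diff}`);
}

const persistedSteps = new Map<string, unknown>();
const firstRun = reviewWorkflow(new WorkflowRun(persistedSteps));
const replayedRun = reviewWorkflow(new WorkflowRun(persistedSteps)); // no new fetches
```

On the replay, both steps are served from the persisted map, so `flakyFetch` is not called again. That is the behavior that lets a workflow resume after an interruption without redoing completed work.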