Cloudflare Blog · March 19, 2026

Cloudflare Workers AI: Scaling Large Language Models for Agentic Workloads

This article details Cloudflare's enhancements to Workers AI to support large language models (LLMs) like Kimi K2.5, focusing on the underlying infrastructure changes for efficient inference. It highlights architectural optimizations such as custom kernels, prefix caching, and redesigned asynchronous APIs to improve performance, reduce costs, and ensure reliability for AI agent workloads at scale.

AI & ML Infrastructure · Performance & Scaling · Distributed Systems

Read original on Cloudflare Blog

Introduction to Cloudflare Workers AI for LLMs

Cloudflare is positioning its Developer Platform as a place to build and deploy AI agents. That means providing an execution environment through Durable Objects, Workflows, and Dynamic Workers, and also running model inference on the same platform. The latest Workers AI update adds support for large, frontier open-source models such as Moonshot AI's Kimi K2.5. An agent's whole lifecycle, from execution to model inference, can then stay on one platform, which matters as agentic tasks increasingly depend on larger LLMs.

Architectural Optimizations for Large Model Inference

Serving large LLMs efficiently requires significant architectural changes to an inference stack. Cloudflare's approach for Kimi K2.5 on Workers AI involves several key optimizations:

Custom Kernels: Cloudflare wrote custom kernels on top of its Infire inference engine to improve model performance and GPU utilization beyond what off-the-shelf solutions provide.
Parallelization Techniques: Data, tensor, and expert parallelism distribute the computational load across GPUs to maximize throughput.
Disaggregated Prefill: The prefill stage (processing input tokens) and the generation stage (producing output tokens) run on different machines to achieve better throughput and higher GPU utilization. Prefill processes the whole input in parallel and is compute-bound, while generation emits one token at a time and is limited mainly by memory bandwidth, so separating them keeps one stage from stalling the other on the same GPU.
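
The need for parallelization comes down to simple arithmetic: the weights of a frontier-scale model do not fit in a single GPU's memory. The rough sketch below estimates the minimum number of GPUs needed just to hold the weights. The parameter count, bytes per parameter, and 80 GB of memory per GPU are illustrative assumptions, not figures from the article, and `min_gpus` is a hypothetical helper.

```shell
#!/bin/sh
# Rough lower bound on the GPUs needed to hold model weights.
# All inputs are illustrative assumptions. KV cache and activations,
# which need additional memory, are ignored.

# $1 = parameters in billions, $2 = bytes per parameter, $3 = GPU memory in GB
min_gpus() {
  weights_gb=$(( $1 * $2 ))                 # 1e9 params * bytes/param = GB
  echo $(( (weights_gb + $3 - 1) / $3 ))    # ceiling division
}

min_gpus 1000 1 80   # hypothetical 1,000B-parameter model at 1 byte/param
min_gpus 1000 2 80   # the same model at 2 bytes/param
```

Under these assumptions the two calls print 13 and 25. Real deployments need more than this lower bound, which is why the weights and the work have to be split across many GPUs.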

The Value of Managed Inference Platforms

The article emphasizes that these optimizations are complex and require deep ML and DevOps expertise. Platforms like Workers AI hide this complexity: developers consume LLM inference as a service without having to be ML engineers or SREs, which reduces operational overhead and shortens development time.

Platform Improvements for Agentic Workloads

Beyond model-specific optimizations, Cloudflare has introduced platform-level features critical for agentic workloads:

Prefix Caching and Session Affinity: In multi-turn conversations, each request resends a large and growing context window, which is costly and slow to reprocess. Workers AI addresses this with prefix caching: it caches the input tensors from previous requests and processes only the new input tokens. A new `x-session-affinity` header routes a session's requests to the same model instance, which maximizes cache hit rates and yields a faster Time to First Token (TTFT) and higher Tokens Per Second (TPS).
Redesigned Asynchronous APIs: Serverless inference can run into capacity constraints, so Cloudflare revamped its asynchronous API into a pull-based system that processes queued requests when model instances have headroom. This gives durable execution for non-real-time use cases such as code scanning or research agents, mitigates 'Out of Capacity' errors, and makes throughput more predictable for batch-like workloads. Event notifications are available so that clients do not have to poll.
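
The pull-based idea can be illustrated with a small simulation. This is a conceptual sketch only, not Cloudflare's implementation: `enqueue`, `pull`, and the request IDs are hypothetical names. The point is that a model instance takes only as many requests as it has free slots, and the remainder stays queued instead of being rejected.

```shell
#!/bin/sh
# Conceptual sketch of pull-based scheduling (hypothetical names throughout).
queue=""    # pending request IDs, space-separated, oldest first
pulled=""   # request IDs taken by the most recent pull

# Add one request ID to the back of the queue.
enqueue() { queue="${queue:+$queue }$1"; }

# Take up to $1 requests (the instance's current headroom) from the front.
# Requests beyond the headroom stay in the queue for a later pull.
pull() {
  headroom=$1
  pulled=""
  while [ "$headroom" -gt 0 ] && [ -n "$queue" ]; do
    next=${queue%% *}                 # oldest request ID
    case $queue in
      *" "*) queue=${queue#* } ;;     # drop it from the queue
      *)     queue="" ;;              # it was the last one
    esac
    pulled="${pulled:+$pulled }$next"
    headroom=$((headroom - 1))
  done
}
```

For example, after enqueueing `scan-1`, `scan-2`, and `scan-3`, a call to `pull 2` leaves `scan-1 scan-2` in `pulled` and `scan-3` in `queue`, ready for the next pull once capacity frees up.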

shell

curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "x-session-affinity: ses_12345678" \
  -d '{ "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is prefix caching and why does it matter?" } ], "max_tokens": 2400, "stream": true }'
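
Prefix caching pays off only if every turn of a conversation carries the same `x-session-affinity` value. The sketch below shows one way a client might arrange that. `session_for` and `send_turn` are hypothetical helpers, the `ses_` prefix simply mirrors the example above, and the assumption that any stable string is accepted as a session value is not confirmed by the article. `ACCOUNT_ID` and `API_TOKEN` are placeholders to be set in the environment.

```shell
#!/bin/sh
# Sketch: reuse one x-session-affinity value for every turn of a conversation.
MODEL="@cf/moonshotai/kimi-k2.5"
API_BASE="https://api.cloudflare.com/client/v4/accounts"

# Derive a stable session value from a conversation key (hypothetical scheme;
# what matters here is that the same key always yields the same value).
session_for() {
  printf 'ses_%s' "$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
}

# Send one turn. $1 = conversation key, $2 = JSON request body.
send_turn() {
  curl -sS -X POST "${API_BASE}/${ACCOUNT_ID}/ai/run/${MODEL}" \
    -H "Authorization: Bearer ${API_TOKEN}" \
    -H "Content-Type: application/json" \
    -H "x-session-affinity: $(session_for "$1")" \
    -d "$2"
}
```

Each turn would then call `send_turn chat-42 "$body"` with the full message history in `$body`. Because the header value does not change between turns, the request can be routed to the instance that already holds the cached prefix.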

Cloudflare Workers · LLM Inference · AI Agents · Distributed Inference · GPU Optimization · Caching · Asynchronous API · Serverless AI

Architecture Design

Design this yourself

Design a highly scalable and cost-efficient distributed LLM inference platform capable of serving frontier-scale models for agentic workloads. Your design should incorporate strategies for maximizing GPU utilization, implementing effective caching mechanisms like prefix caching, and supporting both synchronous and asynchronous inference requests while managing capacity and ensuring reliability. Consider how clients would interact with this platform, including session management and error handling.

Other design angles

· Design only the caching layer for an LLM inference service, focusing on prefix caching, cache invalidation, and ensuring high cache hit rates across multiple inference nodes.
· Design an API gateway specifically for AI agents that routes requests to various LLM providers, handles rate limiting, and optimizes for multi-turn conversations through session affinity and caching.
· Architect a serverless function platform with integrated AI inference capabilities, detailing how GPU resources are managed, scaled, and allocated for diverse synchronous and asynchronous ML workloads.