Cloudflare Blog·March 19, 2026

Cloudflare Workers AI: Scaling Large Language Models for Agentic Workloads

This article details Cloudflare's enhancements to Workers AI to support large language models (LLMs) like Kimi K2.5, focusing on the underlying infrastructure changes for efficient inference. It highlights architectural optimizations such as custom kernels, prefix caching, and redesigned asynchronous APIs to improve performance, reduce costs, and ensure reliability for AI agent workloads at scale.

Read original on Cloudflare Blog

Introduction to Cloudflare Workers AI for LLMs

Cloudflare is positioning its Developer Platform as an environment for building and deploying AI agents. That means offering not only execution primitives (Durable Objects, Workflows, and Dynamic Workers) but also model inference on the same platform. The latest update to Workers AI adds support for large, frontier open-source models such as Moonshot AI's Kimi K2.5. A single platform can therefore cover the agent lifecycle from execution to model inference, which matters as more agentic tasks come to depend on large LLMs.

Architectural Optimizations for Large Model Inference

Serving large models efficiently requires significant changes to the inference stack. Cloudflare's approach for Kimi K2.5 on Workers AI involves several key optimizations:

  • Custom Kernels: Cloudflare wrote custom kernels on top of its Infire inference engine to improve model performance and GPU utilization beyond what off-the-shelf solutions provide.
  • Parallelization Techniques: Data, tensor, and expert parallelism distribute the computational load across GPUs to maximize throughput.
  • Disaggregated Prefill: The prefill stage (processing input tokens) and the generation stage (producing output tokens) run on different machines to achieve better throughput and higher GPU utilization. Prefill handles the whole prompt in parallel and tends to be compute-bound, whereas generation emits one token at a time and tends to be memory-bound, so separating the two lets each stage be sized and tuned for its own bottleneck.
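
The disaggregated prefill idea can be illustrated with a toy two-stage pipeline. This is only a sketch under stated assumptions, not Cloudflare's implementation: `KVCache`, `prefill`, `decode`, and `run_disaggregated` are hypothetical names, the arithmetic stands in for a real model, and an in-process queue stands in for the transfer of cached state between machines.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from queue import Queue


@dataclass
class KVCache:
    """Hypothetical stand-in for the per-request state built during prefill."""
    request_id: str
    entries: list[int] = field(default_factory=list)


def prefill(request_id: str, prompt_tokens: list[int]) -> KVCache:
    # Prefill sees the whole prompt at once, so every position can be
    # handled in one batched pass (compute-heavy on real hardware).
    return KVCache(request_id, [t * 2 for t in prompt_tokens])


def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    # Generation is sequential: each new token depends on everything
    # already in the cache, and extends the cache by one entry.
    output = []
    for _ in range(max_new_tokens):
        next_token = sum(cache.entries) % 100  # toy "model", not real inference
        output.append(next_token)
        cache.entries.append(next_token * 2)
    return output


def run_disaggregated(requests: dict[str, list[int]],
                      max_new_tokens: int = 3) -> dict[str, list[int]]:
    handoff = Queue()  # assumption: models the cache transfer between machines

    # Stage 1: the "prefill pool" processes prompts and hands off the caches.
    for request_id, prompt in requests.items():
        handoff.put(prefill(request_id, prompt))

    # Stage 2: the "decode pool" picks up caches and generates tokens.
    results = {}
    while not handoff.empty():
        cache = handoff.get()
        results[cache.request_id] = decode(cache, max_new_tokens)
    return results


results = run_disaggregated({"req-1": [1, 2, 3], "req-2": [5, 5]})
```

The two stages communicate only through the handed-off cache, which is why they can, in principle, run on separately sized pools of machines.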

The Value of Managed Inference Platforms

The article emphasizes that these optimizations are complex and require deep ML and DevOps expertise. Platforms like Workers AI abstract this complexity, so developers can consume LLM inference as a service without acting as ML engineers or SREs, which reduces operational overhead and shortens development time.

Platform Improvements for Agentic Workloads

Beyond model-specific optimizations, Cloudflare has introduced platform-level features critical for agentic workloads:

  • Prefix Caching and Session Affinity: Processing large context windows in multi-turn conversations is costly and slow, so Workers AI implements prefix caching. The cache holds the input tensors from previous requests, so only the new input tokens need to be processed. A new `x-session-affinity` header routes requests to the same model instance, which maximizes cache hit rates and yields a faster Time to First Token (TTFT) and higher Tokens Per Second (TPS).
  • Redesigned Asynchronous APIs: Serverless inference faces challenges such as capacity constraints, so Cloudflare revamped its asynchronous API. The new system is pull-based: queued requests are processed when model instances have headroom. This gives durable execution for non-real-time use cases like code scanning or research agents, mitigates 'Out of Capacity' errors, and provides more predictable throughput for batch-like workloads. Event notifications are also available to avoid polling.
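
The pull-based pattern described for the asynchronous API can be sketched as a small in-memory model. Everything here is an assumption made for illustration: `PullBasedQueue`, `submit`, `pull`, and the `on_complete` callback are hypothetical names, not part of the Workers AI API, and the "inference" is a placeholder string operation.

```python
from __future__ import annotations

from collections import deque
from typing import Callable


class PullBasedQueue:
    """Toy model of a queue that model instances drain when they have headroom."""

    def __init__(self, on_complete: Callable[[str, str], None]) -> None:
        self.pending = deque()          # (request_id, prompt) pairs awaiting capacity
        self.results: dict[str, str] = {}
        self.on_complete = on_complete  # stands in for an event notification
        self._next_id = 0

    def submit(self, prompt: str) -> str:
        # Submitting never fails for lack of capacity; the request simply waits.
        self._next_id += 1
        request_id = f"req_{self._next_id}"
        self.pending.append((request_id, prompt))
        return request_id

    def pull(self, headroom: int) -> int:
        # Called by a model instance; it takes at most `headroom` requests.
        processed = 0
        while self.pending and processed < headroom:
            request_id, prompt = self.pending.popleft()
            result = prompt.upper()     # placeholder for real inference
            self.results[request_id] = result
            self.on_complete(request_id, result)  # caller is notified, no polling
            processed += 1
        return processed


notified = []
queue = PullBasedQueue(on_complete=lambda request_id, result: notified.append(request_id))
ids = [queue.submit(p) for p in ["scan repo a", "scan repo b", "scan repo c"]]

first = queue.pull(headroom=2)   # instance has room for two requests
busy = queue.pull(headroom=0)    # instance is saturated, nothing is taken
later = queue.pull(headroom=5)   # remaining request is processed once capacity returns
```

In this model, a burst of submissions lengthens the queue instead of producing errors, and completion is signalled through the callback, so the caller does not have to poll.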
An example request that sets the `x-session-affinity` header:

```shell
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "x-session-affinity: ses_12345678" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "What is prefix caching and why does it matter?" }
    ],
    "max_tokens": 2400, "stream": true
  }'
```
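
A toy model shows why routing a conversation to the same instance matters for prefix caching. It is a simplified sketch, not how Workers AI is implemented: the hash-based `route_by_session`, the `ModelInstance` class, and the token-level cache are all assumptions. Real systems typically cache attention state in blocks, not lists of strings.

```python
from __future__ import annotations

import hashlib
from itertools import count


def common_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ModelInstance:
    """Hypothetical instance that remembers the token sequences it has seen."""

    def __init__(self) -> None:
        self.cached: list[list[str]] = []

    def process(self, tokens: list[str]) -> int:
        # Only tokens beyond the longest cached prefix need fresh computation.
        reused = max((common_prefix_len(c, tokens) for c in self.cached), default=0)
        self.cached.append(list(tokens))
        return len(tokens) - reused


def route_by_session(session_id: str, pool_size: int) -> int:
    # Assumption: a stable hash of the session ID picks the instance.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return int(digest, 16) % pool_size


def tokens_processed(turns: list[list[str]], pool_size: int,
                     session_id: str | None = None) -> int:
    pool = [ModelInstance() for _ in range(pool_size)]
    round_robin = count()
    total = 0
    for tokens in turns:
        if session_id is not None:
            index = route_by_session(session_id, pool_size)
        else:
            index = next(round_robin) % pool_size  # no affinity: spread requests
        total += pool[index].process(tokens)
    return total


# Each turn resends the full conversation so far, as in a multi-turn chat.
turn1 = ["system", "you", "are", "helpful"]
turn2 = turn1 + ["user", "hi", "there"]
turn3 = turn2 + ["what", "next"]
turns = [turn1, turn2, turn3]

with_affinity = tokens_processed(turns, pool_size=3, session_id="ses_12345678")  # 9
without_affinity = tokens_processed(turns, pool_size=3)                          # 20
```

With affinity, each turn pays only for its new tokens (4 + 3 + 2). Without it, every turn lands on a cold instance and the whole conversation is reprocessed (4 + 7 + 9).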
