This article details Cloudflare's enhancements to Workers AI to support large language models (LLMs) like Kimi K2.5, focusing on the underlying infrastructure changes for efficient inference. It highlights architectural optimizations such as custom kernels, prefix caching, and redesigned asynchronous APIs to improve performance, reduce costs, and ensure reliability for AI agent workloads at scale.
Read original on Cloudflare Blog
Cloudflare is positioning its Developer Platform as a robust environment for building and deploying AI agents. This involves not just providing an execution environment via Durable Objects, Workflows, and Dynamic Workers, but also integrating powerful AI inference capabilities. The latest update to Workers AI enables the execution of large, frontier open-source models, exemplified by Moonshot AI's Kimi K2.5, directly within the platform. This allows for a unified platform experience, handling the entire agent lifecycle from execution to model inference, addressing the growing demand for complex agentic tasks powered by sophisticated LLMs.
Serving LLMs of this size efficiently requires significant architectural changes to an inference stack. Cloudflare's approach for Kimi K2.5 on Workers AI involves several key optimizations, including custom kernels and prefix caching.
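To make the idea behind prefix caching concrete, here is a minimal toy sketch. It is not Cloudflare's implementation: the `PrefixCache` class, the `BLOCK_SIZE` value, and the whitespace "tokenizer" are all assumptions made for illustration. It only shows the general principle that when two prompts share a leading run of tokens (for example the same system prompt), the work done for that shared prefix can be looked up instead of recomputed.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cached block (illustrative value, not a real setting)


class PrefixCache:
    """Toy prefix cache keyed by a hash of each block-aligned token prefix.

    A real inference server would store attention key/value tensors;
    here a placeholder string stands in for that state.
    """

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    @staticmethod
    def _key(tokens: List[str]) -> str:
        # Hash the whole prefix so a block only matches if everything before it matches.
        return hashlib.sha256("\x1f".join(tokens).encode("utf-8")).hexdigest()

    def process(self, tokens: List[str]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for one prompt."""
        full_blocks = len(tokens) // BLOCK_SIZE
        hit_blocks = 0
        # Walk block-aligned prefixes from the start and stop at the first miss.
        for i in range(1, full_blocks + 1):
            if self._key(tokens[: i * BLOCK_SIZE]) in self._blocks:
                hit_blocks = i
            else:
                break
        # "Compute" the remaining blocks and remember them for later requests.
        for i in range(hit_blocks + 1, full_blocks + 1):
            self._blocks[self._key(tokens[: i * BLOCK_SIZE])] = "placeholder-state"
        reused = hit_blocks * BLOCK_SIZE
        return reused, len(tokens) - reused


# Two requests that share the same 8-token system prompt.
system = "You are a helpful assistant for Cloudflare developers".split()
first = system + "What is prefix caching ?".split()
second = system + "Why does it matter ?".split()

cache = PrefixCache()
print(cache.process(first))   # (0, 13): nothing cached yet
print(cache.process(second))  # (8, 5): the shared system prompt is reused
```

The second request only pays for its 5 new tokens because the 8-token system prompt was already processed. This reuse only helps if both requests reach the same cache, which is why request routing matters for cache hit rates.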
The Value of Managed Inference Platforms
The article emphasizes that these optimizations are complex and require deep ML and DevOps expertise. Platforms like Workers AI abstract this complexity, allowing developers to consume LLM inference as a service without needing to be ML Engineers or SREs, drastically reducing operational overhead and accelerating development.
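Beyond model-specific optimizations, Cloudflare has introduced platform-level features critical for agentic workloads. The request below, for example, passes an x-session-affinity header alongside a streaming chat request: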
curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
-H "Authorization: Bearer {API_TOKEN}" \
-H "Content-Type: application/json" \
-H "x-session-affinity: ses_12345678" \
-d '{ "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is prefix caching and why does it matter?" } ], "max_tokens": 2400, "stream": true }'
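For readers calling the endpoint from code rather than the shell, the same request can be sketched in Python using only the standard library. The URL, headers, and JSON body mirror the curl command above; the function names `build_chat_request` and `stream_response` are hypothetical helpers, the account ID, API token, and session ID are placeholders you would supply, and the shape of the streamed response is not specified here, so the sketch simply prints raw lines as they arrive.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"
MODEL = "@cf/moonshotai/kimi-k2.5"


def build_chat_request(account_id, api_token, session_id, user_prompt, max_tokens=2400):
    """Build (but do not send) a POST request mirroring the curl example."""
    url = f"{API_BASE}/accounts/{account_id}/ai/run/{MODEL}"
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "stream": True,
    }
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
        # Reuse the same value across the turns of one agent session.
        "x-session-affinity": session_id,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def stream_response(request):
    """Send the request and print each non-empty line of the streamed body."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            line = raw_line.decode("utf-8").strip()
            if line:
                print(line)


if __name__ == "__main__":
    # Placeholders: substitute real credentials before running.
    req = build_chat_request(
        account_id="ACCOUNT_ID",
        api_token="API_TOKEN",
        session_id="ses_12345678",
        user_prompt="What is prefix caching and why does it matter?",
    )
    stream_response(req)
```

Keeping request construction separate from sending makes it easy to reuse one session ID across every turn of an agent loop and to inspect the headers before any network call is made.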