This article discusses the emerging trend of running AI agents in cloud environments rather than locally, driven by challenges like setup complexity, resource consumption, and the need for long-running tasks. It highlights how major AI companies like OpenAI, Anthropic, and Cursor are investing in cloud agent platforms to enable persistent, scalable, and secure execution of AI agents, presenting new system design considerations for orchestrating these distributed intelligent systems.
Read original on The Pragmatic Engineer
The article observes a significant architectural shift in how AI agents are deployed and managed. Traditionally, many AI coding agents ran on local developer machines, leading to issues such as CPU overload, slow system performance, and the inability to support long-running, autonomous tasks. The industry is now moving towards cloud-hosted environments for AI agents, which offer several advantages: reduced setup overhead, parallel execution capabilities, and better suitability for persistent, long-duration operations.
Building platforms for cloud-based AI agents introduces novel system design challenges. OpenAI's acquisition of Ona (formerly Gitpod), a leader in cloud development environments (CDEs), highlights the use of CDEs as sandboxed, persistent environments for agents. This enables agents to access tools, systems, and context without being tied to a single device or active session.
Orchestration of Distributed Agents
Key engineering work involves designing and operating systems for orchestrating agents at scale. This includes defining abstractions for product teams, ensuring secure execution, and managing the lifecycle of long-running tasks. Expertise in distributed systems, cloud infrastructure, Python, and Rust is highly valued for roles building these platforms.
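To make "managing the lifecycle of long-running tasks" concrete, one common approach is an explicit state machine that the orchestrator enforces. The sketch below is illustrative only: the state names, the `TRANSITIONS` table, and the `transition` helper are assumptions for this example and are not taken from any platform named in the article.

```python
from enum import Enum


class TaskState(Enum):
    """Hypothetical lifecycle states for a long-running agent task."""
    QUEUED = "queued"
    RUNNING = "running"
    PAUSED = "paused"        # e.g. waiting on migration or human input
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed moves between states; terminal states have no outgoing edges.
TRANSITIONS = {
    TaskState.QUEUED: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.RUNNING: {
        TaskState.PAUSED,
        TaskState.SUCCEEDED,
        TaskState.FAILED,
        TaskState.CANCELLED,
    },
    TaskState.PAUSED: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.SUCCEEDED: set(),
    TaskState.FAILED: set(),
    TaskState.CANCELLED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves in one place (for example, restarting a task that already succeeded) gives product teams a small, predictable abstraction to build on, whatever the underlying scheduler looks like.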
A notable challenge from Cursor's experience is the lack of a 'complaint mechanism' for cloud agents. Unlike local agents that can surface errors to humans, long-running cloud agents require new ways to report issues. Cursor's solution of agents 'confessing' issues to an infrastructure team points to the need for robust monitoring, logging, and feedback loops within the platform architecture.
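One way to picture such a feedback loop is a structured issue report that an agent files when it hits an environment problem it cannot resolve on its own. The sketch below is a minimal illustration, not Cursor's implementation: `AgentIssue`, `IssueChannel`, and all field names are hypothetical, and an in-process queue stands in for whatever durable transport a real platform would use.

```python
import json
import queue
import time
from dataclasses import asdict, dataclass, field


@dataclass
class AgentIssue:
    """A structured complaint filed by an agent (hypothetical schema)."""
    agent_id: str
    category: str   # e.g. "missing_tool", "network", "permissions"
    detail: str
    timestamp: float = field(default_factory=time.time)


class IssueChannel:
    """In-memory stand-in for a durable queue read by an infrastructure team."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue()

    def report(self, issue: AgentIssue) -> None:
        # Serialize to JSON so the record could be shipped to any log pipeline.
        self._queue.put(json.dumps(asdict(issue)))

    def drain(self) -> list:
        # Return every pending report as a dict, oldest first.
        issues = []
        while not self._queue.empty():
            issues.append(json.loads(self._queue.get()))
        return issues


channel = IssueChannel()
channel.report(AgentIssue("agent-7", "missing_tool", "linter not installed in sandbox"))
pending = channel.drain()
```

Keeping reports structured rather than free-form makes it possible to aggregate them by category and spot environment problems that affect many agents at once.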
Furthermore, ensuring the resilience and fault tolerance of long-running agents is critical. Handling node terminations mid-execution and seamlessly migrating agent processes between nodes are complex distributed systems problems that need to be addressed in the platform's design.
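A common building block for surviving node terminations is checkpointing: the agent records its progress after each unit of work so that a replacement process can resume where the previous one stopped. The following is a minimal sketch under stated assumptions, not a description of any vendor's platform: `run_with_checkpoints`, the JSON checkpoint format, and the `should_stop` callback are hypothetical, and the checkpoint path is assumed to live on storage that the new node can also read.

```python
import json
import os
import tempfile
from typing import Callable, List


def _save_checkpoint(path: str, next_step: int) -> None:
    # Write to a temp file and rename it into place, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_step": next_step}, f)
    os.replace(tmp_path, path)


def run_with_checkpoints(
    steps: List[Callable[[], None]],
    checkpoint_path: str,
    should_stop: Callable[[], bool] = lambda: False,
) -> bool:
    """Run steps in order, recording progress so another process can resume.

    Returns True when every step has finished, False if interrupted.
    """
    next_step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            next_step = json.load(f)["next_step"]

    for index in range(next_step, len(steps)):
        if should_stop():  # e.g. a flag set by a SIGTERM handler on node drain
            return False
        steps[index]()
        _save_checkpoint(checkpoint_path, index + 1)
    return True
```

Because a step can complete just before the process dies and before its checkpoint is written, this scheme gives at-least-once execution: steps should be idempotent, or the platform needs a stronger mechanism than the one sketched here.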