Dev.to #systemdesign · April 3, 2026

Architectural Deep Dive into Claude Code's LLM Agent Loop

This article dissects the core `while(true)` loop powering Claude Code's AI coding agent, revealing its state machine architecture for managing complex interactions with large language models and tools. It highlights critical design decisions like avoiding recursion for stack overflow prevention and implementing streaming tool execution for significant performance gains, showcasing a robust approach to building interactive AI agents.


The article dives into the fundamental architecture of an AI coding agent, specifically Claude Code, focusing on the main `while(true)` loop that orchestrates its operations. This loop represents the core of the agent's interaction model: sending context to an LLM, processing responses (text and tool calls), executing tools, and feeding results back into the next iteration. This "LLM talks, program walks" paradigm is common across AI coding agents, but Claude Code's implementation in `query.ts` stands out due to its sheer scale and intricate state management.

State Machine vs. Recursion for Long Conversations

A crucial architectural decision highlighted is the shift from recursion to a state machine approach. Early versions of Claude Code used recursion, which proved to be a fatal flaw in long conversations due to stack overflow issues. The current design mitigates this by employing a `while(true)` loop with a persistent `state` object. This `state` object carries all necessary context between iterations, allowing the agent to manage conversations with hundreds of tool calls without deep call stacks. Each `continue` statement within the 1,421-line loop body signifies a state transition, enabling robust error recovery and turn management.
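The shape of that loop can be sketched in a few dozen lines. This is a minimal illustration, not the real `query.ts`: the types, the stub model, and the tool runner are all hypothetical stand-ins, kept only to show how a persistent `state` object plus `continue` replaces recursive calls.

```typescript
// Minimal sketch of the state-machine agent loop. All names here are
// hypothetical; the real loop body in query.ts runs to ~1,400 lines.
type Message = { role: "user" | "assistant" | "tool"; content: string };

interface AgentState {
  messages: Message[]; // full conversation context, carried between iterations
  turn: number;
  done: boolean;
}

// Stand-in for the LLM: requests a tool on the first turn, then finishes.
function callModel(state: AgentState): { toolCall?: string; text: string } {
  return state.turn === 0
    ? { toolCall: "read_file", text: "Let me read that file." }
    : { text: "Done: the file has 3 lines." };
}

// Stand-in tool runner; real tools would do I/O here.
function runTool(name: string): string {
  return `${name} result`;
}

function agentLoop(prompt: string): AgentState {
  const state: AgentState = {
    messages: [{ role: "user", content: prompt }],
    turn: 0,
    done: false,
  };
  while (true) {
    const reply = callModel(state);
    state.messages.push({ role: "assistant", content: reply.text });
    if (reply.toolCall) {
      // The tool result feeds the NEXT iteration. `continue` is a state
      // transition, so a conversation with hundreds of tool calls never
      // deepens the call stack the way recursion would.
      state.messages.push({ role: "tool", content: runTool(reply.toolCall) });
      state.turn++;
      continue;
    }
    state.done = true; // no tool call: the turn is complete
    return state;
  }
}
```

The key property is that everything a recursive design would keep on the stack lives in `state` instead, so the loop's memory footprint is bounded by the conversation itself, not by call depth.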

Optimizing Performance with Streaming Tool Execution

💡

Key Optimization

Streaming tool execution is a powerful technique for reducing latency in LLM-powered agents. By overlapping tool execution with LLM response generation, overall task completion time can be significantly improved, offering a better user experience without requiring faster underlying models.

One of the most significant architectural optimizations is streaming tool execution. Unlike traditional agents that wait for the LLM to generate all output before executing any tools, Claude Code's `StreamingToolExecutor` allows tools to start running as soon as their calls are identified in the streaming LLM response. This parallel execution dramatically reduces latency. For instance, an example showed a 40% speedup (from 30s to 18s for 5 tool calls) purely due to this architectural design, demonstrating how efficient scheduling can outperform raw processing speed.
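The idea can be shown with a small async sketch. Everything below is illustrative (the stream, tools, and timings are fake); the point is structural: each tool call is launched the moment it is parsed from the stream, rather than after the model finishes generating.

```typescript
// Sketch of streaming tool execution (all names hypothetical). Tool calls
// start as soon as they appear in the model's stream, overlapping tool
// latency with the model's remaining generation time.
const events: string[] = [];

async function fakeTool(name: string): Promise<string> {
  events.push(`start:${name}`);
  await new Promise((r) => setTimeout(r, 10)); // simulated tool latency
  events.push(`end:${name}`);
  return `${name} ok`;
}

// Stand-in for the model's token stream, surfacing one tool call at a time.
async function* modelStream(): AsyncGenerator<string> {
  for (const call of ["grep", "read", "lint"]) {
    events.push(`parsed:${call}`);
    yield call;
    await new Promise((r) => setTimeout(r, 5)); // model still generating
  }
}

async function streamingExecute(): Promise<string[]> {
  const pending: Promise<string>[] = [];
  for await (const call of modelStream()) {
    pending.push(fakeTool(call)); // launch immediately; do NOT await here
  }
  // Total wall-clock time approaches the max of (generation, slowest tool)
  // rather than their sum, which is where the reported speedup comes from.
  return Promise.all(pending);
}
```

A sequential executor would instead collect all calls first and `await` each in turn, paying generation time plus the sum of tool latencies.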

Context Compression Strategy

Managing the LLM's context window is paramount for long-running conversations. Claude Code employs a multi-stage, prioritized context compression strategy to ensure that relevant information fits within token limits. This reactive and proactive approach prevents API errors and maintains conversational flow.

  1. Snip Compact: Trims overly long individual messages in history.
  2. Micro Compact: Finer-grained editing based on `tool_use_id`, designed to be cache-friendly.
  3. Context Collapse: Folds inactive context regions into summaries.
  4. Auto Compact: Triggers a full, aggressive compression when total tokens approach the API threshold.
  5. Reactive Compact: An emergency, one-time compression triggered if the API returns a 413 (prompt too long) error, acting as a circuit breaker.

These mechanisms operate in priority order, attempting lightweight options first and only escalating to heavier compression if necessary. This layered approach is critical for maintaining performance and reliability under varying conversational loads and complexities.

LLM · AI Agent · System Architecture · State Machine · Streaming · Context Management · Performance Optimization · Tooling
