This article details the architectural decisions behind OpenAI's Codex, highlighting how the core engineering challenges revolved around system orchestration rather than just the AI model. It explores the agent loop, prompt and context management, and the multi-surface architecture designed to enable the agent to work across various environments efficiently.
OpenAI Codex, a cloud-based coding agent, exemplifies how the most significant engineering effort in AI products often lies in the surrounding system architecture, not solely in the AI model itself. The `codex-1` model, fine-tuned for software engineering, is just one component within a larger, intricate system that addresses challenges such as prompt assembly, conversation memory management, and multi-platform compatibility.
The core of Codex is an agent loop that processes user input, constructs a prompt, sends it to the model for inference, and receives a response. Crucially, the model's response can be a tool call (e.g., 'run this shell command'). The agent's harness executes these tool calls, appends the output to the prompt, and queries the model again. This iterative cycle, often involving dozens of steps, allows the model to perform complex tasks like reading files, running tests, editing code, and fixing linting errors. The harness manages execution, permissions, and loop termination, separating reasoning (model) from execution (harness).
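The loop described above can be sketched in a few lines. This is an illustrative skeleton, not Codex's actual implementation: `call_model` and `run_tool` are hypothetical stand-ins for model inference and the harness's sandboxed tool executor.

```python
# Minimal sketch of an agent loop: the harness executes tool calls,
# appends their output to the prompt, and re-queries the model until
# it produces a final answer or the step budget runs out.

def run_tool(name, args):
    # Hypothetical tool executor; a real harness would sandbox this
    # and enforce permissions before running anything.
    if name == "shell":
        return f"ran: {args['cmd']}"
    raise ValueError(f"unknown tool: {name}")

def call_model(prompt):
    # Stub model: request one tool call, then finish once it sees output.
    if "[tool output]" in prompt:
        return {"type": "final", "text": "done"}
    return {"type": "tool_call", "name": "shell", "args": {"cmd": "pytest"}}

def agent_loop(user_input, max_steps=20):
    prompt = user_input
    for _ in range(max_steps):
        response = call_model(prompt)
        if response["type"] == "final":
            return response["text"]              # model decided to stop
        output = run_tool(response["name"], response["args"])
        prompt += f"\n[tool output] {output}"    # append and re-query
    raise RuntimeError("agent loop hit step limit")  # harness owns termination
```

Note how reasoning (what to run next) lives entirely in `call_model`, while execution and loop control live in the harness code around it.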
Prompts in Codex are layered, incorporating environment context, project-specific instructions (`AGENTS.md`), sandbox rules, and conversation history. Every tool-call output and new conversation turn is appended to the prompt, so the cumulative data transferred over a conversation grows quadratically. To mitigate this cost while remaining stateless and meeting Zero Data Retention requirements, OpenAI implemented prompt caching. Because each prompt extends the previous one as a prefix, computation from earlier inference calls can be reused, keeping model computation close to linear even though data transfer stays quadratic. However, any alteration to the prefix invalidates the cache, making consistency vital.
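The prefix property can be illustrated with a toy cache. This is a sketch of the idea, not OpenAI's implementation; character counts stand in for tokens.

```python
# Prefix-based prompt caching: if each new prompt extends the previous
# one, only the appended suffix needs fresh computation. Any change to
# the prefix invalidates the cache and forces full reprocessing.

class PrefixCache:
    def __init__(self):
        self.cached_prompt = ""
        self.total_work = 0

    def process(self, prompt):
        if prompt.startswith(self.cached_prompt):
            new_work = len(prompt) - len(self.cached_prompt)  # suffix only
        else:
            new_work = len(prompt)  # prefix changed: cache invalidated
        self.cached_prompt = prompt
        self.total_work += new_work
        return new_work
```

With append-only prompts, `total_work` grows linearly with conversation length even though the bytes sent per call grow with the full history, which is the effect the article describes.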
Context Window Limits and Compaction
When conversation history exceeds the model's context window limit, Codex employs a compaction mechanism. This replaces the full history with a smaller, representative version, often an encrypted payload of the model's latent state. This highlights that managing context windows is a critical engineering problem for conversational AI agents.
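A minimal sketch of the trigger logic, under stated assumptions: `summarize` is a hypothetical placeholder for producing the compact representation (the article describes an encrypted payload of the model's latent state), and the limit is measured here in characters rather than tokens.

```python
# Context compaction: when accumulated history exceeds the context
# window, replace older messages with a compact stand-in while keeping
# the most recent turn verbatim.

CONTEXT_LIMIT = 100  # illustrative budget

def summarize(history):
    # Placeholder for a model-generated compact state.
    return f"[compacted: {len(history)} earlier messages]"

def maybe_compact(history):
    total = sum(len(msg) for msg in history)
    if total <= CONTEXT_LIMIT:
        return history                       # still fits; no change
    return [summarize(history[:-1]), history[-1]]
```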
To enable Codex to operate across various interfaces (CLI, VS Code, web, desktop apps, third-party IDEs) without rewriting agent logic, OpenAI developed the App Server. After an unsuccessful attempt with MCP (which lacked rich interaction patterns like streaming progress, mid-task pausing for user approval, and structured diffs), they built a custom JSON-RPC protocol. This protocol wraps the 'Codex core' (agent loop, thread management, tool execution) and supports bidirectional communication over standard I/O. Clients can send requests to the server, and the server can send requests back to the client (e.g., for user approval), allowing for flexible human oversight.
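The bidirectional pattern can be shown with JSON-RPC message shapes. The method names below are illustrative inventions, not the actual Codex protocol; the point is that requests flow in both directions over the same channel.

```python
# Sketch of a bidirectional JSON-RPC 2.0 exchange over standard I/O.
import json

def make_request(msg_id, method, params):
    return json.dumps({"jsonrpc": "2.0", "id": msg_id,
                       "method": method, "params": params})

# Client -> server: start a task.
start = make_request(1, "thread/start", {"prompt": "fix the bug"})

# Server -> client: pause mid-task to ask for approval (reverse direction,
# the pattern MCP could not express for this use case).
approval = make_request(2, "client/requestApproval",
                        {"command": "rm -rf build/"})

# Client -> server: the user's answer, as a response to the server's request.
answer = json.dumps({"jsonrpc": "2.0", "id": 2,
                     "result": {"approved": False}})
```

Because either side can originate a request, the UI layer stays thin: streaming progress, approval prompts, and structured diffs are all just messages, regardless of whether the client is a CLI, an IDE extension, or a web app.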
Lesson Learned: Evolving Abstractions
The evolution of Codex's architecture from a simple CLI, through a failed MCP integration, to the robust App Server protocol, illustrates a key system design principle: the optimal abstraction often emerges through iteration and learning from less suitable initial approaches.