This article details the architectural decisions behind OpenAI's Codex, highlighting how the core engineering challenges revolved around system orchestration rather than just the AI model. It explores the agent loop, prompt and context management, and the multi-surface architecture designed to enable the agent to work across various environments efficiently.
OpenAI Codex, a cloud-based coding agent, exemplifies how the most significant engineering effort in AI products often lies in the surrounding system architecture, not solely in the AI model itself. The `codex-1` model, fine-tuned for software engineering, is just one component within a larger, intricate system that addresses challenges such as prompt assembly, conversation memory management, and multi-platform compatibility.
The core of Codex is an agent loop that processes user input, constructs a prompt, sends it to the model for inference, and receives a response. Crucially, the model's response can be a tool call (e.g., 'run this shell command'). The agent's harness executes these tool calls, appends the output to the prompt, and queries the model again. This iterative cycle, often involving dozens of steps, allows the model to perform complex tasks like reading files, running tests, editing code, and fixing linting errors. The harness manages execution, permissions, and loop termination, separating reasoning (model) from execution (harness).
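The loop described above can be sketched in a few lines. This is an illustrative skeleton, not Codex's actual implementation: `call_model` and `run_tool` are hypothetical stand-ins for model inference and the harness's sandboxed tool executor.

```python
# Minimal sketch of an agent loop: the harness executes tool calls,
# appends their output to the prompt, and re-queries the model until
# it produces a final answer or the step budget runs out.

def run_tool(name, args):
    # Hypothetical tool executor; a real harness would sandbox this
    # and enforce permissions before running anything.
    if name == "shell":
        return f"ran: {args['cmd']}"
    raise ValueError(f"unknown tool: {name}")

def call_model(prompt):
    # Stub model: request one tool call, then finish once it sees output.
    if "[tool output]" in prompt:
        return {"type": "final", "text": "done"}
    return {"type": "tool_call", "name": "shell", "args": {"cmd": "pytest"}}

def agent_loop(user_input, max_steps=20):
    prompt = user_input
    for _ in range(max_steps):
        response = call_model(prompt)
        if response["type"] == "final":
            return response["text"]              # model decided to stop
        output = run_tool(response["name"], response["args"])
        prompt += f"\n[tool output] {output}"    # append and re-query
    raise RuntimeError("agent loop hit step limit")  # harness owns termination
```

Note how reasoning (what to run next) lives entirely in `call_model`, while execution and loop control live in the harness code around it.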
Prompts in Codex are layered, incorporating environment context, project-specific instructions (`AGENTS.md`), sandbox rules, and conversation history. Every tool-call output and new conversation turn is appended to the prompt, so the cumulative data transferred over a conversation grows quadratically. To mitigate this cost while remaining stateless and meeting Zero Data Retention requirements, OpenAI implemented prompt caching. Because each prompt extends the previous one as a prefix, computation from earlier inference calls can be reused, keeping model computation close to linear even though data transfer stays quadratic. However, any alteration to the prefix invalidates the cache, making consistency vital.
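The prefix property can be illustrated with a toy cache. This is a sketch of the idea, not OpenAI's implementation; character counts stand in for tokens.

```python
# Prefix-based prompt caching: if each new prompt extends the previous
# one, only the appended suffix needs fresh computation. Any change to
# the prefix invalidates the cache and forces full reprocessing.

class PrefixCache:
    def __init__(self):
        self.cached_prompt = ""
        self.total_work = 0

    def process(self, prompt):
        if prompt.startswith(self.cached_prompt):
            new_work = len(prompt) - len(self.cached_prompt)  # suffix only
        else:
            new_work = len(prompt)  # prefix changed: cache invalidated
        self.cached_prompt = prompt
        self.total_work += new_work
        return new_work
```

With append-only prompts, `total_work` grows linearly with conversation length even though the bytes sent per call grow with the full history, which is the effect the article describes.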
Context Window Limits and Compaction
When conversation history exceeds the model's context window limit, Codex employs a compaction mechanism. This replaces the full history with a smaller, representative version, often an encrypted payload of the model's latent state. This highlights that managing context windows is a critical engineering problem for conversational AI agents.
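A minimal sketch of the trigger logic, under stated assumptions: `summarize` is a hypothetical placeholder for producing the compact representation (the article describes an encrypted payload of the model's latent state), and the limit is measured here in characters rather than tokens.

```python
# Context compaction: when accumulated history exceeds the context
# window, replace older messages with a compact stand-in while keeping
# the most recent turn verbatim.

CONTEXT_LIMIT = 100  # illustrative budget

def summarize(history):
    # Placeholder for a model-generated compact state.
    return f"[compacted: {len(history)} earlier messages]"

def maybe_compact(history):
    total = sum(len(msg) for msg in history)
    if total <= CONTEXT_LIMIT:
        return history                       # still fits; no change
    return [summarize(history[:-1]), history[-1]]
```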
To enable Codex to operate across various interfaces (CLI, VS Code, web, desktop apps, third-party IDEs) without rewriting agent logic, OpenAI developed the App Server. After an unsuccessful attempt with MCP (which lacked rich interaction patterns like streaming progress, mid-task pausing for user approval, and structured diffs), they built a custom JSON-RPC protocol. This protocol wraps the 'Codex core' (agent loop, thread management, tool execution) and supports bidirectional communication over standard I/O. Clients can send requests to the server, and the server can send requests back to the client (e.g., for user approval), allowing for flexible human oversight.
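The bidirectional pattern can be shown with JSON-RPC message shapes. The method names below are illustrative inventions, not the actual Codex protocol; the point is that requests flow in both directions over the same channel.

```python
# Sketch of a bidirectional JSON-RPC 2.0 exchange over standard I/O.
import json

def make_request(msg_id, method, params):
    return json.dumps({"jsonrpc": "2.0", "id": msg_id,
                       "method": method, "params": params})

# Client -> server: start a task.
start = make_request(1, "thread/start", {"prompt": "fix the bug"})

# Server -> client: pause mid-task to ask for approval (reverse direction,
# the pattern MCP could not express for this use case).
approval = make_request(2, "client/requestApproval",
                        {"command": "rm -rf build/"})

# Client -> server: the user's answer, as a response to the server's request.
answer = json.dumps({"jsonrpc": "2.0", "id": 2,
                     "result": {"approved": False}})
```

Because either side can originate a request, the UI layer stays thin: streaming progress, approval prompts, and structured diffs are all just messages, regardless of whether the client is a CLI, an IDE extension, or a web app.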
Lesson Learned: Evolving Abstractions
The evolution of Codex's architecture from a simple CLI, through a failed MCP integration, to the robust App Server protocol, illustrates a key system design principle: the optimal abstraction often emerges through iteration and learning from less suitable initial approaches.