ByteByteGo·June 3, 2026

OpenAI's Data Agent Architecture for Exabyte-Scale Data Platforms

OpenAI engineered a 'vanilla' data agent to navigate its massive 1.5 exabyte data platform, addressing the challenge of finding and understanding relevant data among 90,000 datasets. The agent's reliability stems from a simple architecture backed by a robust data infrastructure and a sophisticated context assembly layer, which feeds a single LLM with carefully curated information to generate accurate SQL queries and verified answers.

AI & ML Infrastructure Databases & Storage Distributed Systems

Read original on ByteByteGo

OpenAI's data platform, storing 1.5 exabytes across 90,000 datasets, faced a significant challenge: analysts spent hours identifying and understanding the correct tables for queries, even before writing SQL. To solve this, OpenAI developed an in-house data agent, designed to be 'vanilla' in its core architecture but highly effective due to strong underlying data infrastructure and sophisticated context engineering.

Agent Architecture: Simplicity Through Context

The core of the data agent is intentionally simple: a single LLM (GPT-5.5) combined with a harness. The harness orchestrates the LLM's reasoning by providing tools, assembling relevant context, and running an iterative loop of reason-act-observe. This contrasts with more complex agent systems that might involve multiple LLMs, routers, or fine-tuning. The reliability isn't in a complex agent, but in the meticulous data acquisition and context provision.

Key Components

LLM: GPT-5.5 is used for all requests, handling SQL generation, result inspection, correction, and reasoning.
Runtime: The orchestrator that parses LLM output, dispatches tool calls, and feeds results back into the model in a loop.
Context Assembly: This is where the core engineering effort lies, building rich context for the LLM.
Tools: A small, curated set of 13 tools for company context, knowledge bases, big data systems (Airflow, Spark), and metadata services.

The Six Layers of Context Assembly

The critical factor for the agent's success is the quality of the context provided to the LLM. A bare schema is insufficient; the agent uses six layers of context, prepared offline and retrieved at runtime, to ensure accuracy:

Table usage metadata: Schema, lineage, and historical query patterns (prioritizing popular dashboards from data scientists).
Human annotations: Curated descriptions from table owners about business meaning, ownership, criticality, and caveats.
Codex enrichment: A nightly job that crawls pipeline code to infer table contents, derivation, freshness, and usage, reading 100-200 tables in 5-10 minutes each.
Institutional knowledge: Embedded documents from Slack, Google Docs, Notion, served via an access-controlled retrieval service.
Memory: Saved corrections and learnings from past conversations (global or personal).
Runtime context: Live queries to the data warehouse or other platform systems (Airflow, Spark) to fill gaps.

💡

System Design Takeaway: Context over Complexity

This case study highlights that for LLM-powered agents, the *quality and relevance of the input context* often outweighs complex multi-LLM architectures or intricate routing. Investing in robust data infrastructure and intelligent context assembly can yield more reliable and scalable results than attempting to make the LLM itself 'smarter' through convoluted prompting or model layering. The 'vanilla' agent design proves that simplicity at the agent level can be achieved if the surrounding data platform handles the complexity of context provision.

LLMData PlatformData AgentContext RetrievalAI ArchitectureSystem DesignScalabilityData Governance

Comments

Loading comments...

Architecture Design

Design this yourself

Design an enterprise-scale data platform for petabyte-scale data, incorporating an LLM-powered data agent capable of answering natural language queries by synthesizing information from 90,000+ tables. Focus on the architecture of the data agent, particularly its context assembly layer, offline indexing, and runtime retrieval mechanisms, ensuring accuracy and reliability without complex LLM routing or fine-tuning strategies. Detail how human annotations, code analysis, usage metadata, and institutional knowledge are integrated to provide rich context to the LLM.

Practice Interview

Focus: LLM-powered data agent with context assembly

Other design angles

· Design just the context assembly and retrieval service for a data agent, assuming the LLM and runtime are external components.· Design a data governance and discovery platform that uses an LLM-powered agent to help users understand data lineage, ownership, and semantic meaning across disparate data sources.· Design a modular data agent framework that allows pluggable context layers and tools, illustrating how it could be adapted for different domain-specific data challenges beyond just SQL generation.