ByteByteGo·May 18, 2026

Designing a Multi-Agent AI System for Data Warehouse Support at Grab

Grab's data engineering team built a multi-agent AI system to automate support for their Analytics Data Warehouse. This article details the architectural decisions, including decoupling the LLM 'brain' from specialized 'hand' agents, and the division into read-only investigation and human-reviewed enhancement pathways. It also covers critical production challenges like context overflow and tool bloat, along with their solutions.

Read original on ByteByteGo

Grab's Analytics Data Warehouse (ADW) team faced a common challenge: their expert engineers spent significant time answering repetitive questions about data. To scale support and free engineers for more complex work, they designed and implemented a multi-agent AI system. The system automates the investigation of data questions and semi-automates enhancements to a data infrastructure of more than 15,000 tables.

Core Architectural Principles

The design philosophy behind Grab's AI system centered on two key principles to ensure both capability and maintainability:

  • Decoupling Brain from Hands: The system separates the LLM's reasoning (the 'brain') from specialized agents and tools that interact with systems, fetch information, and run queries (the 'hands'). This allows for easier debugging by isolating issues to either reasoning or tool interaction.
  • Specialized Agents over Monolith: Instead of a single large AI, Grab opted for multiple smaller, specialized agents. This modular approach improves maintainability, allows each agent to be improved independently, and makes failures traceable to a specific agent. The trade-off is more coordination complexity and some added latency, which the team accepted in exchange for accuracy and reliability.
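
The brain/hands split can be illustrated with a minimal sketch. Everything here is hypothetical: `plan`, `HANDS`, and the tool names are invented for illustration, and the keyword rule stands in for a real LLM call. The point is only that the reasoning step and the tool step are separate functions, so a failure can be attributed to one or the other.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    """A decision made by the 'brain': which tool to run and with what input."""
    tool: str
    argument: str


def plan(question: str) -> ToolCall:
    """Stand-in for the LLM 'brain'. A real system would call a model here;
    this keyword rule is purely illustrative."""
    if "lineage" in question.lower():
        return ToolCall(tool="code_search", argument=question)
    return ToolCall(tool="run_query", argument=question)


# The 'hands': plain functions that would touch external systems (stubbed here).
def run_query(arg: str) -> str:
    return f"[query result for: {arg}]"


def code_search(arg: str) -> str:
    return f"[lineage hits for: {arg}]"


HANDS: Dict[str, Callable[[str], str]] = {
    "run_query": run_query,
    "code_search": code_search,
}


def answer(question: str, trace: List[str]) -> str:
    call = plan(question)                      # reasoning step
    trace.append(f"brain chose {call.tool}")
    result = HANDS[call.tool](call.argument)   # tool step
    trace.append(f"hand returned {len(result)} chars")
    return result
```

Because each step writes its own entry to `trace`, a wrong answer can be checked against two questions: did the brain pick the wrong tool, or did the right tool return bad output?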

Two Pathways: Investigation and Enhancement

A crucial architectural decision was segmenting the system into two distinct pathways based on risk profiles: read-only operations (investigation) and write operations (enhancement).

  • Investigation Pathway (Read-Only): Handles queries like "Why does this data look wrong?" It uses four collaborating agents: a Classifier for routing, a Data Agent for data querying, a Code Search Agent for lineage tracing, and an On-call Agent for production health monitoring. A Summarizer Agent then synthesizes findings.
  • Enhancement Pathway (Write Operations): Manages requests that modify production pipelines, such as adding columns. A single Enhancement Agent performs tasks like generating schema and code changes, but critically, every step requires human review and approval due to the higher risk of write operations.
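
A rough sketch of the two-pathway split follows. The names (`Pathway`, `classify`, `run_enhancement`) and the keyword list are assumptions made for illustration; the source does not describe how Grab's Classifier decides, and a real classifier would likely be model-based. The sketch shows the shape of the idea: route by risk, and gate every write step on a human decision.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Pathway(Enum):
    INVESTIGATION = "investigation"   # read-only
    ENHANCEMENT = "enhancement"       # modifies production pipelines


# Hypothetical keyword hints, used only to keep the example self-contained.
WRITE_HINTS = ("add column", "alter table", "drop column", "backfill")


def classify(request: str) -> Pathway:
    text = request.lower()
    if any(hint in text for hint in WRITE_HINTS):
        return Pathway.ENHANCEMENT
    return Pathway.INVESTIGATION


@dataclass
class EnhancementPlan:
    steps: List[str]
    applied: List[str] = field(default_factory=list)


def run_enhancement(plan: EnhancementPlan,
                    approve: Callable[[str], bool]) -> EnhancementPlan:
    """Apply steps in order, stopping at the first one a human rejects."""
    for step in plan.steps:
        if not approve(step):
            break
        plan.applied.append(step)
    return plan
```

In this sketch `approve` is a callback, so the same function works with a chat prompt, a pull-request review, or a test stub standing in for the human reviewer.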

System Design Insight: Risk-Driven Architecture

Separating functionalities based on their risk profile (e.g., read vs. write operations) is a fundamental system design pattern. It allows for tailored security, approval workflows, and fault tolerance mechanisms, optimizing for both efficiency and safety in different parts of a system.
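
One way to enforce the read-only side of such a split is a guard that rejects anything other than a single read query. The sketch below is an assumption, not Grab's validator: `is_read_only` and its keyword list are hypothetical, and the check is deliberately conservative. A production system would more likely rely on a real SQL parser plus database-level read-only credentials.

```python
import re

# Write and DDL keywords that should never appear in an investigation query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Conservative check: a single statement that starts with SELECT or WITH
    and contains no write/DDL keyword. It may reject some valid read queries,
    for example one with a ';' or a forbidden word inside a string literal."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:          # more than one statement
        return False
    if not re.match(r"(?i)(select|with)\b", statement):
        return False
    return FORBIDDEN.search(statement) is None
```

Erring toward false rejections fits the risk profile: a refused read query costs a retry, while an accepted write on the investigation pathway bypasses human review.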

Challenges and Solutions in Production

Real-world usage exposed several challenges that were not apparent in demos, and each prompted a specific fix:

  • Context Overflow: Long conversations and agent handoffs pushed prompts past the LLM's context window limit. Grab's solution has an orchestrator track token usage in real time, automatically summarize older messages, and prune tool outputs before handoffs.
  • Tool Bloat: Too many tools with verbose descriptions and outputs degraded performance. The team aggressively simplified tool descriptions and outputs, concluding that a few well-designed tools work better than many verbose ones.
  • Risky Code Execution: Agents with database access and code generation capabilities posed security risks. The team implemented a multi-layered defense, including input classification, SQL validation, resource limits for queries, and human review for critical operations, to protect PII and control cost.
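
The context-overflow mitigations can be sketched as two small helpers an orchestrator might call before each handoff. This is a minimal sketch under stated assumptions: `count_tokens` is a crude word count rather than a model tokenizer, `summarize` is a caller-supplied function standing in for an LLM summarization call, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace-separated words. A real system would use
    the model's own tokenizer."""
    return len(text.split())


def prune_tool_output(msg: Message, max_tokens: int) -> Message:
    """Truncate oversized tool results before handing them to another agent."""
    words = msg.content.split()
    if msg.role != "tool" or len(words) <= max_tokens:
        return msg
    return Message(msg.role, " ".join(words[:max_tokens]) + " [truncated]")


def fit_to_budget(
    history: List[Message],
    budget: int,
    keep_recent: int,
    summarize: Callable[[List[Message]], str],
) -> List[Message]:
    """If the history exceeds the budget, replace all but the most recent
    messages with a single summary message."""
    total = sum(count_tokens(m.content) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    cut = len(history) - keep_recent
    older, recent = history[:cut], history[cut:]
    summary = Message("assistant", "Summary of earlier turns: " + summarize(older))
    return [summary] + recent
```

Pruning tool output first and summarizing second keeps the most recent turns verbatim, which is usually where the user's current question lives.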
