Grab's data engineering team built a multi-agent AI system to automate support for their Analytics Data Warehouse. This article details the architectural decisions, including decoupling the LLM 'brain' from specialized 'hand' agents, and the division into read-only investigation and human-reviewed enhancement pathways. It also covers critical production challenges like context overflow and tool bloat, along with their solutions.
Read original on ByteByteGo
Grab's Analytics Data Warehouse (ADW) team faced a common challenge: its expert engineers spent significant time answering repetitive questions about data. To scale support and free engineers for more complex work, the team designed and implemented a multi-agent AI system. The system automates investigation of the data infrastructure, which includes over 15,000 tables, and semi-automates enhancements to it.
The design philosophy behind Grab's AI system centered on two key principles to ensure both capability and maintainability:
A crucial architectural decision was segmenting the system into two distinct pathways based on risk profiles: read-only operations (investigation) and write operations (enhancement).
System Design Insight: Risk-Driven Architecture
Separating functionalities based on their risk profile (e.g., read vs. write operations) is a fundamental system design pattern. It allows for tailored security, approval workflows, and fault tolerance mechanisms, optimizing for both efficiency and safety in different parts of a system.
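The split between the two pathways can be illustrated with a minimal sketch. This is not Grab's implementation: the names `Risk`, `Request`, `Outcome`, `route`, `run`, and `approve` are all hypothetical, and the approval step is reduced to a callback. The point it shows is that read-only requests run immediately, while write requests are held unless a human reviewer explicitly approves them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Risk(Enum):
    """Hypothetical risk labels matching the two pathways described above."""
    READ_ONLY = "investigation"  # e.g. looking up table metadata
    WRITE = "enhancement"        # e.g. proposing a change to a table


@dataclass
class Request:
    description: str
    risk: Risk


@dataclass
class Outcome:
    pathway: str
    executed: bool
    detail: str


def route(
    request: Request,
    run: Callable[[str], str],
    approve: Optional[Callable[[str], bool]] = None,
) -> Outcome:
    """Dispatch a request along the pathway that matches its risk profile."""
    if request.risk is Risk.READ_ONLY:
        # Investigation cannot modify anything, so it runs without review.
        return Outcome(request.risk.value, True, run(request.description))

    # Enhancement changes the warehouse, so a human must approve it first.
    # With no reviewer supplied, the safe default is to hold the request.
    if approve is None or not approve(request.description):
        return Outcome(request.risk.value, False, "held for human review")
    return Outcome(request.risk.value, True, run(request.description))


if __name__ == "__main__":
    def fake_run(description: str) -> str:
        # Stand-in for whatever agent actually does the work.
        return f"done: {description}"

    print(route(Request("describe a table", Risk.READ_ONLY), fake_run))
    print(route(Request("add a column", Risk.WRITE), fake_run))
    print(route(Request("add a column", Risk.WRITE), fake_run, approve=lambda d: True))
```

Defaulting to "held" when no reviewer is present is a deliberate choice in this sketch: a misconfigured write path then fails closed instead of silently executing changes.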
Real-world usage exposed several challenges not apparent in demos, leading to robust solutions: