InfoQ Architecture · May 20, 2026

Designing a Multi-Agent AI System for Engineering Support at Grab

Grab's Analytics Data Warehouse (ADW) team developed a multi-agent AI system to automate engineering support, aiming to reduce repetitive operational tasks and improve resolution efficiency. The system tackles internal requests like SQL debugging and data warehouse troubleshooting, freeing engineers for higher-value development and platform improvement. Key architectural decisions included separating investigation and enhancement workflows, consolidating tools, and integrating safety and context management.

Grab's Analytics Data Warehouse (ADW) team implemented a multi-agent AI system to automate engineering support for its large-scale data platform. The platform serves over 1,000 internal users and manages more than 15,000 tables, and repetitive support tasks on it were consuming significant operational effort. The goal is to shift engineering focus from reactive firefighting to proactive system building and platform improvement.

Multi-Agent Architecture Overview

The system utilizes a multi-agent architecture orchestrated by a LangGraph-based workflow engine and FastAPI services. This setup coordinates routing, tool execution, and state management across specialized agents. Incoming engineering requests are initially classified and then routed to agents responsible for specific tasks such as context retrieval, code search, or solution generation. Each agent operates with constrained responsibilities to enhance clarity and predictability of outputs.
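The classify-then-route flow described above can be sketched in plain Python. This is a minimal illustration, not Grab's implementation: it deliberately avoids LangGraph and FastAPI, and the `SupportState` shape, the keyword rules in `classify`, and the placeholder agents are all assumptions made for the example. Only the three agent roles (context retrieval, code search, solution generation) come from the article.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SupportState:
    """State passed from the classifier to the chosen agent (hypothetical shape)."""
    request: str
    category: str = "unclassified"
    trace: List[str] = field(default_factory=list)


def classify(state: SupportState) -> SupportState:
    """Keyword rules standing in for whatever classifier the real system uses."""
    text = state.request.lower()
    if any(word in text for word in ("schema", "table", "column")):
        state.category = "context_retrieval"
    elif any(word in text for word in ("pipeline", "repository", "source code")):
        state.category = "code_search"
    else:
        state.category = "solution_generation"
    state.trace.append(f"classified as {state.category}")
    return state


def make_agent(name: str) -> Callable[[SupportState], SupportState]:
    """Build a placeholder agent that only records that it ran."""
    def agent(state: SupportState) -> SupportState:
        state.trace.append(f"{name} handled request")
        return state
    return agent


# One narrowly scoped agent per category, looked up by the classifier's label.
AGENTS: Dict[str, Callable[[SupportState], SupportState]] = {
    name: make_agent(name)
    for name in ("context_retrieval", "code_search", "solution_generation")
}


def handle(request: str) -> SupportState:
    """Classify the request, then hand the shared state to exactly one agent."""
    state = classify(SupportState(request=request))
    return AGENTS[state.category](state)
```

Calling `handle("Which schema owns the orders table?")` returns a state whose `trace` records both the classification and the agent that ran. Keeping the routing table as a plain dictionary makes each agent's responsibility explicit and easy to audit.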

Workflow Separation: Investigation vs. Enhancement

Investigation Workflows: Designed for diagnostic tasks like query analysis, log retrieval, schema lookup, and issue summarization. These agents focus on understanding and diagnosing problems.
Enhancement Workflows: Focus on generating actionable outputs such as code changes, SQL fixes, and automated merge requests. These agents aim to resolve issues directly or propose solutions.

Architectural Principle: Specialization of Agents

The separation of investigation and enhancement paths is a crucial architectural decision that helped reduce complexity in agent reasoning and improved reliability in production workflows. This highlights a common pattern in system design: breaking down complex problems into more manageable, specialized components.
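One way to make this separation concrete is to give each workflow its own tool allowlist, so an investigation agent cannot reach a tool that changes anything. The sketch below is an assumption about how such a split could be encoded. The tool names are hypothetical labels derived from the task types listed above, not Grab's actual tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet


class Workflow(Enum):
    INVESTIGATION = "investigation"
    ENHANCEMENT = "enhancement"


@dataclass(frozen=True)
class WorkflowSpec:
    """What a workflow may call and whether its output changes code."""
    tools: FrozenSet[str]
    produces_changes: bool


# Hypothetical tool names; the two sets are intentionally disjoint.
SPECS: Dict[Workflow, WorkflowSpec] = {
    Workflow.INVESTIGATION: WorkflowSpec(
        tools=frozenset(
            {"analyze_query", "fetch_logs", "lookup_schema", "summarize_issue"}
        ),
        produces_changes=False,
    ),
    Workflow.ENHANCEMENT: WorkflowSpec(
        tools=frozenset({"propose_sql_fix", "edit_code", "open_merge_request"}),
        produces_changes=True,
    ),
}


def can_use(workflow: Workflow, tool: str) -> bool:
    """Return True only if the tool is on the workflow's allowlist."""
    return tool in SPECS[workflow].tools


def needs_human_review(workflow: Workflow) -> bool:
    """Any workflow that produces changes is flagged for review."""
    return SPECS[workflow].produces_changes
```

With this layout, `can_use(Workflow.INVESTIGATION, "open_merge_request")` is `False`, so a diagnostic agent has a smaller decision space and cannot produce side effects by accident.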

Key Architectural Decisions and Challenges

Beyond workflow separation, the team made several other critical design choices and addressed challenges:

Tool Ecosystem Consolidation: Initially, over 30 internal tools were exposed to agents. The team later reduced this to a smaller, curated toolset, which improved maintainability and made tool selection by agents more predictable. The change illustrates the trade-off between flexibility and control.
Safety and Governance: The system incorporates validation layers for SQL execution and mechanisms for sensitive data handling. All enhancement workflows producing code changes require human-in-the-loop review before deployment, emphasizing the importance of human oversight in AI-driven systems.
Context Management: Managing relevant state across multi-step agent reasoning within token constraints was a significant technical challenge. The solution involved structured context compression and selective retrieval strategies to enable agents to retain necessary information without exceeding operational limits.
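Two of the items above lend themselves to a short sketch: a validation layer in front of SQL execution, and trimming conversation state to fit a budget. The article does not describe how Grab implements either, so everything below is an assumption. `validate_sql` is a deliberately conservative keyword check, not a SQL parser, and `fit_context` uses plain truncation and word counts as a stand-in for structured compression and real token counting.

```python
import re
from typing import List, Tuple

# Keywords treated as unsafe for automated execution (an assumed policy).
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant)\b",
    re.IGNORECASE,
)


def validate_sql(sql: str) -> Tuple[bool, str]:
    """Accept a single read-only statement; reject everything else.

    The check is naive: it also rejects a forbidden keyword that appears
    inside a string literal, which errs on the side of caution.
    """
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False, "multiple statements are not allowed"
    if not re.match(r"(?i)^(select|with)\b", statement):
        return False, "only SELECT queries are allowed"
    match = _FORBIDDEN.search(statement)
    if match:
        return False, f"forbidden keyword: {match.group(1).lower()}"
    return True, "ok"


def word_count(text: str) -> int:
    """Approximate token cost by whitespace-separated words."""
    return len(text.split())


def compress(text: str, max_words: int) -> str:
    """Truncate to the first max_words words, marking the cut with '...'."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


def fit_context(
    messages: List[str], budget: int, keep_recent: int = 2, summary_words: int = 5
) -> List[str]:
    """Keep the newest messages verbatim and add shortened older ones while they fit.

    Recent messages are always kept, so the result can exceed the budget
    if those messages alone are larger than it.
    """
    recent = messages[-keep_recent:] if keep_recent > 0 else []
    older = messages[:-keep_recent] if keep_recent > 0 else list(messages)
    kept = list(recent)
    used = sum(word_count(m) for m in kept)
    # Walk older messages newest-first so the most relevant history survives.
    for message in reversed(older):
        short = compress(message, summary_words)
        cost = word_count(short)
        if used + cost > budget:
            break
        kept.insert(0, short)
        used += cost
    return kept
```

A production validator would more likely parse the statement and check it against table-level permissions, and a production context manager would summarize rather than truncate. The sketch only shows where such checks sit: before execution and before each agent step.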

Tags: AI agents, multi-agent system, engineering support, automation, LangGraph, FastAPI, data platform, Grab

Architecture Design

Design this yourself

Design a multi-agent AI system for automating engineering support in a large-scale data platform, similar to Grab's ADW team. Your design should include components for request classification, specialized investigation and enhancement agents, a workflow orchestration engine, a curated tool access layer with safety features, and robust context management for multi-step reasoning. Discuss architectural decisions for scalability, reliability, and human-in-the-loop governance.

Other design angles

· Design only the workflow orchestration and agent communication layer for such a system, focusing on state management and fault tolerance.
· Design a system for automated SQL debugging and schema lookup using a single, sophisticated AI agent, outlining the tools and data sources it would integrate with.
· Design the human-in-the-loop review process and safety mechanisms for an AI system that generates and proposes code changes in a production environment.