DoorDash developed a simulation and evaluation framework to test LLM-powered customer support chatbots at scale. This system enables rapid iteration and validation of changes to prompts, context, and integrations, significantly reducing hallucination rates and improving chatbot reliability before production deployment. The core innovation is using an LLM to simulate customer interactions, allowing for diverse conversation paths to be explored and evaluated automatically.
Traditional customer support automation relies on deterministic decision trees, making testing straightforward. However, LLM-powered chatbots handle natural language conversations, leading to unpredictable outcomes with even minor changes. This inherent non-determinism makes conventional testing inadequate and poses a significant challenge for validating LLM-based systems before production deployment.
Key Problem
How do you thoroughly test a chatbot that never answers the same way twice, especially when small adjustments can have wide-ranging, unpredictable effects across conversation paths?
DoorDash addressed this by building an offline experimentation framework consisting of an LLM-powered customer simulator and an automated evaluation system. This framework allows engineers to run hundreds of simulated conversations rapidly, accelerating development and experimentation cycles.
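DoorDash has not published the simulator's internals, but the core idea can be sketched: one LLM plays a customer with a given scenario, the chatbot under test responds, and the full transcript is collected for offline evaluation. In this illustrative sketch, `call_llm` is a stub standing in for real model calls, and all names and prompts are assumptions, not DoorDash's actual implementation.

```python
import random

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; a real system would hit a model API."""
    if "you are a customer" in prompt.lower():
        return random.choice([
            "My order is missing an item.",
            "Where is my dasher?",
            "DONE",  # the simulated customer signals the conversation is over
        ])
    return "I'm sorry to hear that. Let me look into your order."

def simulate_conversation(scenario: str, max_turns: int = 6) -> list[dict]:
    """Run one simulated customer/chatbot dialogue for a given scenario."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        customer_msg = call_llm(
            f"You are a customer. Scenario: {scenario}. History: {transcript}"
        )
        if customer_msg == "DONE":
            break
        transcript.append({"role": "customer", "text": customer_msg})
        bot_msg = call_llm(f"You are a support chatbot. History: {transcript}")
        transcript.append({"role": "chatbot", "text": bot_msg})
    return transcript

# Because everything runs offline against stubs or a model endpoint,
# hundreds of simulated conversations can be generated cheaply:
runs = [simulate_conversation("missing item") for _ in range(100)]
```

Because the customer side is itself an LLM, varying the scenario prompt explores diverse conversation paths automatically, which is what makes this approach scale beyond hand-written test scripts.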
The simulator and evaluation system form a continuous development loop. Engineers identify failure cases, add specific evaluation checks, and generate new simulations targeting those scenarios. Adjustments to prompts, retrieval strategies, or context handling are then validated across these simulations. Using this flywheel, DoorDash reduced hallucination rates by about 90%, in part by developing a case state layer that structures the chatbot's tool history and prevents context overload.