DoorDash developed a simulation and evaluation framework to test LLM-powered customer support chatbots at scale. This system enables rapid iteration and validation of changes to prompts, context, and integrations, significantly reducing hallucination rates and improving chatbot reliability before production deployment. The core innovation is using an LLM to simulate customer interactions, allowing for diverse conversation paths to be explored and evaluated automatically.
Traditional customer support automation relies on deterministic decision trees, making testing straightforward. However, LLM-powered chatbots handle natural language conversations, leading to unpredictable outcomes with even minor changes. This inherent non-determinism makes conventional testing inadequate and poses a significant challenge for validating LLM-based systems before production deployment.
Key Problem
How do you thoroughly test a chatbot that never answers the same way twice, especially when small adjustments can have wide-ranging, unpredictable effects across conversation paths?
DoorDash addressed this by building an offline experimentation framework consisting of an LLM-powered customer simulator and an automated evaluation system. This framework allows engineers to run hundreds of simulated conversations rapidly, accelerating development and experimentation cycles.
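DoorDash has not published the simulator's internals, but the core idea can be sketched: one LLM plays a customer with a given scenario, the chatbot under test responds, and the full transcript is collected for offline evaluation. In this illustrative sketch, `call_llm` is a stub standing in for real model calls, and all names and prompts are assumptions, not DoorDash's actual implementation.

```python
import random

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; a real system would hit a model API."""
    if "you are a customer" in prompt.lower():
        return random.choice([
            "My order is missing an item.",
            "Where is my dasher?",
            "DONE",  # the simulated customer signals the conversation is over
        ])
    return "I'm sorry to hear that. Let me look into your order."

def simulate_conversation(scenario: str, max_turns: int = 6) -> list[dict]:
    """Run one simulated customer/chatbot dialogue for a given scenario."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        customer_msg = call_llm(
            f"You are a customer. Scenario: {scenario}. History: {transcript}"
        )
        if customer_msg == "DONE":
            break
        transcript.append({"role": "customer", "text": customer_msg})
        bot_msg = call_llm(f"You are a support chatbot. History: {transcript}")
        transcript.append({"role": "chatbot", "text": bot_msg})
    return transcript

# Because everything runs offline against stubs or a model endpoint,
# hundreds of simulated conversations can be generated cheaply:
runs = [simulate_conversation("missing item") for _ in range(100)]
```

Because the customer side is itself an LLM, varying the scenario prompt explores diverse conversation paths automatically, which is what makes this approach scale beyond hand-written test scripts.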
The simulator and evaluation system form a continuous development loop. Engineers identify failure cases, add specific evaluation checks, and generate new simulations targeting those scenarios. Adjustments to prompts, retrieval strategies, or context handling are then validated across these simulations. Using this flywheel, DoorDash reduced hallucination rates by about 90%, in part by developing a case state layer that structures the chatbot's tool history and prevents context overload.