
DoorDash Builds LLM Conversation Simulator to Test Customer Support Chatbots at Scale

DoorDash has developed a simulation and evaluation flywheel to accelerate the development and testing of large language model (LLM) powered customer support chatbots. The system lets engineers run hundreds of simulated conversations within minutes, significantly speeding up experimentation cycles. Context engineering improvements validated through this framework reduced hallucination rates by roughly 90 percent before deployment.

As DoorDash noted in a LinkedIn post sharing this work:

The fundamental challenge is validating LLM-based support systems before production: How do you test a chatbot that never answers the same way twice?

Customer support automation has traditionally relied on deterministic decision trees, where users follow predefined paths based on menu selections or keywords. Such workflows allowed developers to validate changes with conventional tests. LLM-powered agents, however, handle natural conversations, meaning small adjustments to prompts, context, or backend integrations can produce unpredictable outcomes across multiple conversation paths.

To address this, DoorDash built an offline experimentation framework combining an LLM-powered customer simulator with an automated evaluation system. The simulator generates multi-turn conversations reflecting real customer interactions, using historical support transcripts to derive customer intents, conversation flows, and behavioral patterns. Backend dependencies, such as order lookups or refund workflows, are reproduced with mocked service APIs, enabling realistic operational scenarios.
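A mocked backend dependency of this kind can be sketched in a few lines. The names below (`MockOrderService`, `Scenario`) are illustrative stand-ins, not DoorDash's actual APIs:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One scripted operational situation the simulator can exercise."""
    order_id: str
    status: str            # e.g. "delivered", "missing_item"
    refund_eligible: bool

class MockOrderService:
    """Stands in for the real order-lookup API during simulation."""
    def __init__(self, scenarios):
        self._orders = {s.order_id: s for s in scenarios}

    def get_order(self, order_id: str) -> dict:
        s = self._orders.get(order_id)
        if s is None:
            return {"error": "order_not_found"}
        return {"order_id": s.order_id, "status": s.status,
                "refund_eligible": s.refund_eligible}

# A refund scenario the chatbot can exercise end to end, offline:
svc = MockOrderService([Scenario("A123", "missing_item", True)])
print(svc.get_order("A123")["refund_eligible"])  # True
```

Because the mocked service is seeded from scenario data rather than live state, the same conversation can be replayed deterministically across experiment runs.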

Simulation Workflow Overview (Source: DoorDash Blog Post)

In the simulation environment, an LLM plays the customer while the production chatbot responds as it would in a real interaction. The simulator adapts to the chatbot’s responses, handling scenarios such as clarification requests, frustration signals, or repeated issues. Alongside the simulator, an automated evaluation framework classifies outcomes against predefined policies and metrics, including compliance, hallucination rates, tone, and task completion accuracy. Simulator and evaluation together form a continuous development loop. Engineers identify failure cases, add evaluation checks, and generate additional simulations targeting those scenarios. Prompt adjustments, retrieval strategies, or context improvements are validated across hundreds of conversations before deployment.
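The loop described above can be illustrated with a minimal sketch in which plain Python callables stand in for the customer-simulator LLM, the production chatbot, and the LLM-based evaluators:

```python
def run_simulation(customer_llm, chatbot, max_turns=6):
    """Alternate customer and chatbot turns, returning the transcript.

    customer_llm returns the next customer message given the transcript
    so far, or None when the simulated customer is done.
    """
    transcript = []
    message = customer_llm(transcript)          # opening customer message
    while message is not None and len(transcript) < 2 * max_turns:
        transcript.append(("customer", message))
        transcript.append(("bot", chatbot(transcript)))
        message = customer_llm(transcript)      # simulator adapts to reply
    return transcript

def evaluate(transcript, checks):
    """Score a finished conversation against named policy checks."""
    return {name: check(transcript) for name, check in checks.items()}

# Stubbed two-turn conversation; real runs would call model endpoints:
script = iter(["Where is my order?", None])
transcript = run_simulation(lambda t: next(script),
                            lambda t: "It arrives in 10 minutes.")
results = evaluate(transcript, {
    "task_completed": lambda t: any("minutes" in m
                                    for role, m in t if role == "bot"),
})
print(results)  # {'task_completed': True}
```

In production the checks would themselves be model-backed judges covering compliance, hallucination, tone, and task completion, but the loop structure stays the same.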

The flywheel also addressed hallucinations caused by overloaded context windows. Early launches revealed that excessive raw events and logs could mislead the chatbot, producing errors such as misinterpreted fields or invalid policy suggestions. Engineers implemented a binary hallucination metric and test scenarios derived from observed failures. Iterating with the flywheel, they developed a case state layer that structures tool history for the chatbot. The simulator enabled rapid testing of multiple context configurations and prompt strategies, quickly exposing failure modes and validating improvements.
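The two ideas, a binary hallucination metric and a case state layer that condenses raw tool events before they reach the chatbot's context, can be sketched as follows. All field and function names here are hypothetical:

```python
def build_case_state(raw_events: list) -> dict:
    """Reduce a noisy tool-call history to the facts the bot may cite."""
    state = {}
    for event in raw_events:                    # later events win
        if event.get("tool") == "order_lookup":
            state["order_status"] = event["result"]["status"]
        elif event.get("tool") == "refund_check":
            state["refund_eligible"] = event["result"]["eligible"]
    return state

def is_hallucination(bot_claim: dict, case_state: dict) -> bool:
    """Binary metric: True if the bot asserts a fact absent from
    or contradicting the structured case state."""
    return any(case_state.get(k) != v for k, v in bot_claim.items())

events = [
    {"tool": "order_lookup", "result": {"status": "delivered"}},
    {"tool": "refund_check", "result": {"eligible": False}},
]
state = build_case_state(events)
# Bot claims the customer is refund-eligible; the state says otherwise.
print(is_hallucination({"refund_eligible": True}, state))  # True
```

A pass/fail metric like this is easy to aggregate across hundreds of simulated conversations, which is what makes it useful as a regression gate in the flywheel.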

Simulation-Evaluation Flywheel (Source: DoorDash Blog Post)

The DoorDash flywheel follows a structured problem-to-production workflow. Engineers begin by identifying a customer issue, often through manual review of support cases or early simulations. They then create an LLM-as-judge evaluation to detect the failure mode, calibrating it against human judgment to ensure accuracy. Once the evaluation is trusted, the simulator generates conversations representing the current system, and evaluations identify failures. Engineers analyze errors, adjust prompts, context handling, or tool outputs, and iterate until the evaluation pass rate reaches acceptable thresholds. Before deployment, guardrails such as hallucination detection, tone assessment, and issue classification are validated with the full evaluation suite, ensuring improvements hold in live traffic.
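The calibration step can be illustrated with a toy agreement check, where a keyword-based judge stands in for a real LLM-as-judge call and its verdicts are compared against human labels on a shared sample:

```python
def judge(transcript: str) -> bool:
    """Toy judge: flags a failure if the bot admits it cannot help.
    A real system would prompt an LLM with the transcript and a rubric."""
    return "cannot help" in transcript.lower()

def agreement(judge_fn, labeled_samples) -> float:
    """Fraction of samples where the judge matches the human label."""
    matches = sum(judge_fn(text) == label for text, label in labeled_samples)
    return matches / len(labeled_samples)

labeled = [
    ("Bot: I cannot help with that.", True),
    ("Bot: Your refund is on its way.", False),
    ("Bot: Sorry, I cannot help here.", True),
    ("Bot: Order arrives at 6 pm.", False),
]
score = agreement(judge, labeled)
print(score)  # 1.0 on this toy set
```

Only once agreement clears a chosen threshold on held-out human-labeled cases would the judge be trusted to score simulated conversations unsupervised.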
