
Enhancing A/B Testing at DoorDash with Multi-Armed Bandits


While experimentation is essential, traditional A/B testing can be excessively slow and expensive, according to DoorDash engineers Caixia Huang and Alex Weinstein. To address these limitations, they adopted a "multi-armed bandits" (MAB) approach to optimize their experiments.

When running experiments, organizations aim to minimize the opportunity cost, or regret, caused by serving the less effective variants to a subset of the user base. Traditional A/B testing relies on fixed traffic splits and predetermined sample sizes that remain unchanged throughout the experiment. As a result, even if a clear winner emerges early, the experiment continues until it reaches its predetermined stopping condition. To make things worse, opportunity cost compounds as the number of concurrent experiments increases, encouraging teams to run experiments sequentially to reduce regret, but at the expense of significantly slower iterations.

The multi-armed bandits approach offers a way to adaptively allocate traffic based on performance, accelerating learning while reducing waste. It does so by repeatedly selecting among multiple choices whose properties are only partially known and refining those selections as the experiment progresses and more evidence is gathered:

For our purposes, this strategy allocates experimental traffic toward better-performing variants based on ongoing feedback collected during the experiment. The core idea is that an automated MAB agent continuously selects from a pool of actions, or arms, to maximize a defined reward, while simultaneously learning from user feedback in subsequent iterations.

This strategy enables a balance between exploration, i.e., learning about all candidate options, and exploitation, i.e., prioritizing the best‑performing options as they emerge, until the experiment converges on the best option.
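As a rough illustration of this exploration/exploitation trade-off, the sketch below uses a simple epsilon-greedy policy, one of the most basic bandit strategies and not the one DoorDash describes; the variant names and conversion rates are hypothetical.

import random

# Epsilon-greedy sketch: mostly exploit the best-known variant, but keep
# a small fraction of traffic exploring the alternatives.
# Arm names and "true" conversion rates are purely illustrative.
EPSILON = 0.1
true_rates = {"variant_a": 0.10, "variant_b": 0.12}   # unknown in a real experiment
counts = {arm: 0 for arm in true_rates}
successes = {arm: 0 for arm in true_rates}

def observed_rate(arm):
    return successes[arm] / counts[arm] if counts[arm] else 0.0

for _ in range(10_000):
    if random.random() < EPSILON:
        arm = random.choice(list(true_rates))        # explore
    else:
        arm = max(true_rates, key=observed_rate)     # exploit
    counts[arm] += 1
    successes[arm] += random.random() < true_rates[arm]  # simulated user feedback

print({arm: round(observed_rate(arm), 3) for arm in true_rates})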

According to Huang and Weinstein, MAB helps reduce the cost of experimentation enough to make it possible to evaluate many distinct ideas quickly.

At the core of DoorDash's MAB approach is Thompson sampling, a Bayesian algorithm known for its strong performance and robustness to delayed feedback. In short, at each decision cycle the algorithm samples from the posterior reward distributions, i.e., the reward estimates refined with the evidence observed so far, to determine how traffic is allocated, then updates those estimates with the new feedback to prepare for the next decision cycle.
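The following minimal sketch, which is not DoorDash's implementation, shows a Thompson sampling loop for two variants with a binary reward (e.g., a conversion), using Beta posteriors; the arm names and conversion rates are hypothetical.

import random

# Minimal Thompson sampling sketch with Beta-Bernoulli posteriors.
# Each arm starts with a uniform Beta(1, 1) prior; every observed reward
# updates the posterior that the next decision cycle samples from.
posteriors = {"control": [1, 1], "treatment": [1, 1]}   # [alpha, beta]
true_rates = {"control": 0.10, "treatment": 0.12}       # unknown in practice

for _ in range(10_000):
    # Sample an expected reward from each arm's posterior and serve the best one.
    sampled = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    arm = max(sampled, key=sampled.get)
    # Simulated user feedback: 1 for a conversion, 0 otherwise.
    reward = 1 if random.random() < true_rates[arm] else 0
    posteriors[arm][0] += reward        # alpha counts successes
    posteriors[arm][1] += 1 - reward    # beta counts failures

for arm, (a, b) in posteriors.items():
    print(arm, "posterior mean:", round(a / (a + b), 3))

Over many cycles, more traffic naturally flows to the arm with the higher posterior mean, which is the adaptive allocation behavior described above.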

Adopting MAB is not without challenges, say DoorDash engineers. In particular, it makes inference on metrics not included in the reward function much harder, which in turn encourages teams to choose more complex reward metrics to capture as much insight as possible. By contrast, traditional A/B testing allows post-experiment analysis of any metric once the experiment concludes.

Moreover, because MAB adjusts allocations more aggressively, it can potentially lead to inconsistent user experiences when the same user interacts with a feature multiple times. DoorDash plans to address these limitations by adopting contextual bandits, leveraging Bayesian optimization, and implementing sticky user assignment to enhance overall user experience. 

The concept of a multi‑armed bandit comes from probability theory and machine learning. It describes a problem using a slot machine analogy: a gambler faces multiple slot machines (sometimes called “one‑armed bandits”) and must decide which machines to play, how often, in what order, and when to switch to a different machine.
