A/B Testing at Booking.com

We want customer sentiment to drive product development. Hypotheses proven by experiments are the best way to discover customer sentiment. That's how Stuart Frisby, principal designer at Booking.com, argued for extensive use of A/B testing to the OSCON attendees in Amsterdam.

A/B testing is the act of comparing different versions of a given feature to understand which one performs better. Done right, it has several prerequisites.

Everything must be tested, but it must be tested atomically. If you don't test one change at a time, you don't control your variables and it becomes impossible to get unambiguous results. Although there are many A/B testing tools on offer, Frisby believes that they are suboptimal, because they lack the context and flexibility needed to do proper and extensive testing. Frisby advocates that you should build your own or at least use tools that you can fix and adapt to your own context.

The organization must build a culture of data-driven product development, instead of relying on the ideas of experts. Hiring people with an entrepreneurial mindset will enable an "ask why" culture that pushes everyone to question what they don't understand. As the ultimate motivator, good A/B testing will prove many times that, in your context, you, your boss, or the industry experts are wrong.

Frisby described a hypothetical A/B test on the effects of changing background colors. This is not an A/B test that Frisby recommends doing in practice, as he believes that changing colors is not the right way to address user problems, but it's simple enough to convey the process. The hypothesis for this experiment would be:

The color of our Primary Call-to-Action [the "Book Now" button - Ed.] is overpowered by the presence of higher contrast elements on the web site.

The metrics that could measure the experiment result:

We'll know if this is true if more people click on a higher contrast button, and end up making a booking.

They would then publish two versions of the button: the one already in use, with a blue background, and a new one with a green background.
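
Publishing two versions implies that each visitor is consistently shown the same one. Frisby did not describe how Booking.com assigns visitors, so the following is only a minimal sketch of one common approach, hash-based bucketing. The function name `assign_variant`, the experiment key, and the user identifiers are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("blue", "green")) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hashing the experiment name together with the user id means the same
    user always sees the same variant, while different experiments split
    users independently of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# Hypothetical usage: the experiment key and user id are made up.
variant = assign_variant("user-12345", "primary-cta-color")
```

Because the assignment depends only on the inputs, no per-user state has to be stored to keep the experience stable across visits.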

Let's assume that the green button led to a lower conversion rate: 2.2%, down from 2.7%. The hypothesis was not supported, so Booking.com would keep the original button.
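
Whether a drop from 2.7% to 2.2% is a real effect or noise depends on how many visitors saw each version, which the talk did not state. As an illustration only, the sketch below applies a standard two-proportion z-test to those rates. The visitor counts are assumptions made up for the example, and this is not presented as the method Booking.com uses.

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed traffic: 10,000 visitors per variant (not a figure from the talk).
z_large, p_large = two_proportion_z_test(270, 10_000, 220, 10_000)

# Same rates with an assumed 1,000 visitors per variant.
z_small, p_small = two_proportion_z_test(27, 1_000, 22, 1_000)
```

With the larger assumed sample the difference falls below the conventional 0.05 threshold, while with the smaller one it does not, even though the observed rates are identical.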

When starting with A/B testing, the organization must watch out for common mistakes. Don't do "big shot A/B testing" by changing too many things at once. Don't do "fringe A/B testing" by focusing on just a small, even if important, part of your product, such as your landing page. Frisby also dwelled on the idea of "assumed reproducibility".

"Assumed reproducibility" is the idea that experiments made by others can be reproduced in your own setting. But context is king. What may work for others may not work for you. Frisby suggests a hierarchy of reliable data sources (from most to least reliable): your own experiment data; your opinion, because you know your own product best; someone else's opinion; someone else's experiment data, because it provides a false sense of certainty.

Frisby doesn't recommend A/B testing for all scenarios. If your web application does not have enough traffic, your results may not be meaningful. Don't do A/B testing unless you can define objective metrics that enable you to make decisions based on facts. Finally, the organization must be prepared to accept that A/B testing will contradict much of what it believes to be right, which is not as easy as it sounds.
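
To give a rough sense of what "enough traffic" can mean, the sketch below estimates the visitors needed per variant to detect a given change in conversion rate, using the usual normal-approximation formula for comparing two proportions. The 5% significance level and 80% power are conventional defaults assumed for this example, not figures from the talk, and `required_sample_size` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def required_sample_size(p_base: float, p_variant: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate visitors needed per variant for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    effect = p_base - p_variant
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Using the rates from the button example above: 2.7% versus 2.2%.
n_per_variant = required_sample_size(0.027, 0.022)
```

Under these assumptions the estimate comes out at roughly 15,000 visitors per variant, and it grows quickly as the difference to be detected gets smaller.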
