Large Scale Experimentation at Spotify
When you want to scale the number of A/B tests to do many experiments at the same time, you need to adapt your processes and platform, and it might also impact your culture, said Ben Dressler, experimentation lead at Spotify. Doing product research with controlled experiments helps you confront your ideas about how customers will use your product with reality, and check whether those ideas actually impact user behaviour.
Most of the time it's OK if users participate in several A/B tests at once, since the randomised assignment will even out any impact across your test groups. It does become a problem if you generate conflicting experiences for some users, for example serving white text in one test and a white background in another, said Dressler.
InfoQ spoke with Dressler after his talk about why companies should experiment, how you can scale A/B testing, and what you can do when people are sceptical about A/B testing.
InfoQ: Why should companies experiment?
Ben Dressler: Most companies or organisations are trying to impact certain outcomes. In a product-driven organisation like Spotify, that’s usually a set of business metrics that are based on a lot of customers performing certain actions, like buying something or continuing to use the product. And usually employees have a number of ideas of how to best achieve that. Gathering both qualitative and quantitative data on those customers is a great way of improving your understanding of what will make those key behaviours more or less likely. But without running controlled experiments you won’t be able to know if your actions, e.g. launching a feature, are actually causing those behaviours to change - or whether it’s purely a correlation and pouring more resources into that feature won’t actually pay off.
A/B testing has the reputation of purely being a tool for optimising website details, but it's fundamentally a tool to confront your ideas with reality and check if they do what you thought they would do.
InfoQ: How can you scale A/B testing?
Dressler: Scaling the number of tests you’re running depends on a few things: how fast you can turn around a test, how big of an audience you have, and how many tests per user you can run. Since the audience size is usually fixed, you want to run more tests per user and turn them around faster. The problem that usually occurs at this stage is overhead for teams, both technical and process-wise, until you’ve streamlined the process enough. If you have an app and need to hard code every change, you’ll be bottlenecked by app release cycles, and engineers and designers might need to get comfortable with the idea of shipping tests that are not fully polished. A good idea is to pick a few teams to spearhead this, and then figure out what changes to the platform and processes you need to scale it to all teams.
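One common way to avoid being bottlenecked by app release cycles is to ship variants behind remotely configured flags and assign users deterministically on the client. The sketch below is a minimal, hypothetical illustration of that pattern, not Spotify's actual implementation; the function and experiment names are made up.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministically assign a user to a variant by hashing the
    user id together with the experiment name. The same user always
    lands in the same group for a given experiment, and assignments
    across different experiments are effectively independent."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Hypothetical usage: the app asks which UI to render, no release needed
# to change the experiment's variant mix server-side.
variant = assign_variant("user-123", "nav-simplification")
```

Because the assignment is a pure function of user id and experiment name, no per-user state needs to be stored, and adding a new experiment just means introducing a new experiment name.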
Running more tests per user means that experiments can potentially collide and create a broken user experience. If one team tests changing all fonts to white - and another team changes all backgrounds to white - people who end up in both tests will not be able to use the product. There are different solutions, but it’s worth pointing out that having users participate in several tests at once is OK most of the time. Since you’re randomly assigning users to your test groups, there should be an even number of affected users in all of your test groups. A/B tests only care about differences between your groups, so if everyone is equally affected, it won’t mess with your results.
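The evening-out effect is easy to demonstrate with a small simulation, assuming independent random assignment for each experiment: within either of experiment A's groups, the share of users who are also in experiment B's treatment comes out roughly the same, so B's influence cancels when A compares its groups.

```python
import random

# Simulate two concurrent experiments with independent 50/50 assignment.
random.seed(42)
n_users = 100_000
exp_a = [random.random() < 0.5 for _ in range(n_users)]  # True = treatment in A
exp_b = [random.random() < 0.5 for _ in range(n_users)]  # True = treatment in B

def share_in_b_treatment(a_group: bool) -> float:
    """Fraction of users in the given A group who are also in B's treatment."""
    members = [b for a, b in zip(exp_a, exp_b) if a == a_group]
    return sum(members) / len(members)

share_treat = share_in_b_treatment(True)
share_ctrl = share_in_b_treatment(False)
# Both shares hover around 0.5, so experiment B affects A's groups equally.
```

With 100,000 users the two shares differ by well under a percentage point, which is why overlap only becomes a real problem when the experiences actively conflict.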
InfoQ: Can you give an example of an experiment?
Dressler: A while ago we saw some patterns in our research that made us believe we might be missing an opportunity around Spotify’s navigation. We formed an idea that by simplifying our app navigation we might be able to give new users a better idea of what they can do in Spotify, and thus increase the chance for them to stay on the platform.
Conventional wisdom would have us jump into a design sprint, do some user testing, and eventually test rolling out whatever we ended up with. While our designers did indeed lead the charge with some early explorations, we quickly jumped into a first set of A/B tests. One test looked into changing the navigation UI (and only the UI), while another test changed the information architecture (category labels and structure). Those were not at all polished experiences, and the intention was never to launch them to a bigger audience. What we were after was an indication of whether we’d be able to actually impact user behaviour with this. If a radical change would not even change click-through rates, we probably wouldn’t want to sink more resources into this idea. Results, however, suggested that we were changing the behaviour of users in some of the test groups for the better. Having built confidence this way, we continued exploring different design prototypes in small sample user testing to rule out variations and to gather a lot of contextual observations very quickly. Only then did we run the more traditional A/B optimisations to reach the refined version we ultimately released to our users.
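Checking whether a change in click-through rates between two groups is more than noise is typically done with a significance test. As a hedged illustration of the kind of check involved (the article doesn't describe Spotify's statistical tooling), here is a standard two-proportion z-test using only the standard library; the sample numbers are invented.

```python
from math import sqrt, erf

def two_proportion_z(clicks_a: int, n_a: int,
                     clicks_b: int, n_b: int) -> tuple:
    """Two-sided z-test for a difference between two click-through rates.
    Returns (z statistic, p-value)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)       # pooled CTR under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical data: 5.0% vs 6.0% CTR on 10,000 users per group.
z, p = two_proportion_z(500, 10_000, 600, 10_000)
```

A small p-value (conventionally below 0.05) is the kind of signal that would justify sinking more resources into an idea before polishing the design.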
InfoQ: What would you say to people who are sceptical of A/B testing?
Dressler: First of all, I’d say that it’s good to approach experimentation with due respect. When complex engineering and advanced statistics come together it can be easy to make mistakes. Weaknesses in your development process and data collection will be magnified when building many variations and running statistical tests. A/B testing is one of the absolute power tools of product research and requires specialised expertise and potentially changes to your process and culture. Chances are that you will introduce some friction.
That said, experiments are also uniquely powerful and hold a ton of potential beyond squeezing out a few more clicks. If you gear up appropriately and run smart experiments, you can avoid sinking a ton of resources into the wrong idea, gather vital information early on to de-risk huge projects, or enable bottom-up innovation by testing many small ideas and lifting up the ones that make an impact. And when it comes to speed, just consider this: going 200 mph isn’t all that fast if you’re running in the wrong direction.
I’d encourage everyone to try their hand at this. In centuries of dealing with faulty experiments and imperfect measurements, science has given us a few good coping mechanisms: re-run tests and see if the same outcomes can be repeated, and facilitate a community that reviews each other’s work and keeps iterating on the practices and the underlying instruments. And most importantly, be mindful that there is no such thing as absolute certainty. Making decisions based on imperfect information will always be the job of product managers, and experiments are nothing more or less than a powerful tool to help with that job.