Airbnb: Using Guardrails to Identify Changes with Negative Impact across Teams

Airbnb rolled out an internal Experiment Guardrails system to identify potentially negative impacts of changes across different teams. Whenever a proposed change fails any of the guardrails, it is escalated for further analysis by affected teams and stakeholders, explains Airbnb data scientist Tatiana Xifara.

Airbnb's Guardrails system aims to prevent a team from launching a change in production that, while improving that team's own metrics, could have a negative impact on metrics relevant to other teams at Airbnb. This scenario is typical of large organizations where each team focuses on a specific set of goals, e.g., fraud detection, customer satisfaction, overall revenue, etc., and where it might not always be evident what cross-team trade-offs a change may entail.

As it currently stands, Xifara explains, Airbnb's Guardrails system includes three guardrails that account for different statistical dimensions of the effects of a proposed change: impact, power, and statistically significant negative, shortened to stat sig negative.

Briefly, the impact guardrail ensures that the metric does not decrease, in percentage terms, by more than a given escalation threshold t. The power guardrail aims to ensure the experiment is run long enough for its estimates to be precise enough to detect a decrease of that size. This is expressed by requiring that the standard error be less than a given fraction F of t, e.g., StdErr < 0.8 * t. Finally, the stat sig negative guardrail can be used to ensure that any statistically significant negative impact on a given metric, however small, is escalated. This guardrail provides a further guarantee for especially sensitive metrics, such as revenue, where even minimal decreases could have huge effects.
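The three checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Airbnb's implementation: the names `MetricResult` and `check_guardrails` are hypothetical, the metric is assumed to be one where a decrease is bad, and statistical significance is assumed to mean a 95% confidence interval (normal approximation) that lies entirely below zero.

```python
from dataclasses import dataclass

# Assumption: two-sided 95% normal critical value; the article does not
# state which significance level Airbnb uses.
Z_95 = 1.96


@dataclass
class MetricResult:
    rel_change: float  # estimated relative change, e.g. -0.004 means -0.4%
    std_err: float     # standard error of that estimate, in the same units


def check_guardrails(result, t, F=0.8, stat_sig_negative=False):
    """Return the names of the guardrails that this metric fails.

    t is the escalation threshold as a positive fraction (0.005 = 0.5%).
    F is the fraction of t that the standard error must stay below.
    """
    failed = []
    # Impact: the metric dropped by more than the escalation threshold.
    if result.rel_change < -t:
        failed.append("impact")
    # Power: the estimate is too noisy, i.e. StdErr < F * t does not hold.
    if result.std_err >= F * t:
        failed.append("power")
    # Stat sig negative (opt-in for sensitive metrics): the whole
    # confidence interval lies below zero, however small the drop.
    if stat_sig_negative and result.rel_change + Z_95 * result.std_err < 0:
        failed.append("stat_sig_negative")
    return failed


# A small but precisely measured drop on a sensitive metric.
revenue = MetricResult(rel_change=-0.002, std_err=0.0005)
print(check_guardrails(revenue, t=0.005, stat_sig_negative=True))
```

In this example the drop of 0.2% is within the 0.5% threshold and the estimate is precise enough, so only the stat sig negative guardrail triggers an escalation.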

Airbnb's Guardrails system relies on a few constants that must be chosen up front and that determine how it behaves. Two of them are the escalation threshold t and the fraction F of t that is used by the power guardrail.

While t is easy to understand since it relates directly to the metric it is applied to, for example the page load time, F requires some additional consideration. If you choose a lower value for F, you will need to run the experiment longer so that it is exposed to a larger number of users. This could make your experiment impractical to run or negatively impact your development speed.
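The cost of a lower F can be made concrete with a rough calculation. The sketch below assumes, as a simplification not taken from the article, that the standard error of the estimate shrinks as sigma / sqrt(n), where n is the number of exposed users and sigma is a per-user noise level. The function name `min_users_for_power` and the numbers used are hypothetical.

```python
import math


def min_users_for_power(sigma, t, F):
    """Smallest n such that sigma / sqrt(n) < F * t.

    Assumes StdErr = sigma / sqrt(n), which is a simplification.
    """
    # Rearranging sigma / sqrt(n) < F * t gives n > (sigma / (F * t)) ** 2.
    return math.floor((sigma / (F * t)) ** 2) + 1


# Same metric and threshold, two candidate values of F.
n_loose = min_users_for_power(sigma=1.5, t=0.02, F=0.8)
n_strict = min_users_for_power(sigma=1.5, t=0.02, F=0.4)
print(n_loose, n_strict, n_strict / n_loose)
```

Under this assumption, halving F roughly quadruples the number of users required, which is why a stricter power guardrail translates directly into longer experiments.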

Another aspect to consider is that not all experiments have the same global coverage, which is the percentage of users assigned to an experiment. Lower-coverage experiments would thus need to run longer than higher-coverage ones to pass the same power guardrail. To make all experiments complete in a similar time, the escalation threshold t can be defined as a function of the global coverage, allowing t to be higher for smaller coverage.
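One way to picture a coverage-dependent threshold is shown below. The article does not give the function Airbnb uses, so the inverse square-root scaling here is purely an assumption, chosen because standard errors typically shrink with the square root of the sample size. The name `escalation_threshold` and the parameter `t_full` (the threshold at 100% coverage) are hypothetical.

```python
import math


def escalation_threshold(coverage, t_full):
    """Hypothetical scaling: t = t_full / sqrt(coverage).

    coverage is the fraction of users assigned to the experiment, in (0, 1].
    t_full is the threshold that would apply at full coverage.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return t_full / math.sqrt(coverage)


# The threshold loosens as coverage shrinks.
for coverage in (1.0, 0.25, 0.04):
    print(coverage, escalation_threshold(coverage, t_full=0.005))
```

With this choice, an experiment at 25% coverage gets a threshold twice as large as one at full coverage, so both can satisfy a power guardrail of the form StdErr < F * t after a similar run time.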

According to Xifara, Airbnb's Guardrails system flags about 25 experiments per month. Upon escalation and review, roughly 20% of those, i.e., 5, are eventually stopped. The key is to choose all constants so as to strike a balance between safeguarding metrics and speed of development.

It goes without saying that the Guardrails system is not restricted to any specific set of metrics and that each organization will have its own set of metrics to constantly keep an eye on. Xifara and Andersen's article includes much more information than what can be covered here, so do not miss it if you are interested in the details.
