
Expedia's Journey toward Site Resiliency: Embracing Chaos Testing in Dev and Production at QCon SF

by Daniel Bryant on Nov 19, 2017. Estimated reading time: 3 minutes

At QCon SF, Sahar Samiei and Willie Wheeler presented "Expedia's Journey Toward Site Resiliency" and discussed building a community of practice around resilience testing within Expedia. The results have generally been positive: Netflix's Chaos Monkey has been running daily in production since May 15th; resilience tests have been added to four Tier 1 service pipelines; and organisational awareness of the value of building resilient services has increased.

Samiei, senior product manager at Expedia, began the talk by stating that in 2016 Expedia was the 11th largest Internet company by revenue, at $8.77B. A "back of the envelope" calculation of revenue lost to unplanned site unavailability shows that moving from 99% availability ($87.7M potential loss) to 99.9% availability ($8.77M potential loss) makes a difference of roughly $80M:

Keeping [the Expedia site] up protects tens of millions of dollars of revenue per year.
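
The arithmetic behind that estimate can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope reasoning: it assumes revenue accrues evenly over the year, so that every unit of downtime costs the same amount, and the function and variable names are illustrative rather than anything shown in the talk.

```python
ANNUAL_REVENUE_USD = 8.77e9  # 2016 revenue figure quoted in the talk


def potential_loss(availability: float,
                   annual_revenue: float = ANNUAL_REVENUE_USD) -> float:
    """Revenue at risk per year for a given availability fraction.

    Assumption: revenue is spread evenly across the year, so the loss is
    simply the unavailable fraction multiplied by annual revenue.
    """
    return annual_revenue * (1.0 - availability)


loss_two_nines = potential_loss(0.99)     # 99% availability
loss_three_nines = potential_loss(0.999)  # 99.9% availability
difference = loss_two_nines - loss_three_nines

for label, value in [("99% availability", loss_two_nines),
                     ("99.9% availability", loss_three_nines),
                     ("difference", difference)]:
    print(f"{label}: ${value / 1e6:.2f}M")
```

The difference works out to about $78.9M, which is the "~$80M" quoted above.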

Expedia has a "test-and-learn" culture, and innovation is about constantly iterating on products and features. However, resilience is not always treated as a first-class citizen: there are often too many competing priorities, there are major misconceptions about resilience, and team autonomy can make it challenging to spread learnings and tooling effectively.

To address these issues, Wheeler, principal application engineer at Expedia, discussed how a shared learning space was created within Expedia, which facilitated the sharing of information around resilience and led to the creation of "resilience champions". Considerable effort went into collecting and presenting baseline resiliency data so that teams could track improvements.

A large organisation such as Expedia has a plethora of tooling and platforms in use, and it can be a challenge to steer adoption. Wheeler discussed how a focus on core principles was more valuable than a focus on individual tools, and shared how his team defined a "resilience engineering lifecycle":

  1. Prioritise services that will benefit from improved resilience
  2. Investigate vulnerabilities
  3. Apply resilience patterns
  4. Conduct resilience experiments in test
  5. Conduct resilience experiments in production (increasingly referred to as "chaos testing")

Services within Expedia are classified as Tier 1 (essential), Tier 2 (important) and Tier 3 (nice to have). Scorecards and reporting were used to share and highlight information around a service's resilience, such as the number of incidents and current availability. This combination of tiered service classification and scorecard data enabled the prioritisation of resilience testing in order to get the biggest return on investment.
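
As an illustration of how tiering and scorecard data might be combined into a priority order, here is a minimal sketch. The `Scorecard` fields, the service names and the ordering rule (tier first, then most incidents, then lowest availability) are all assumptions made for this example; the talk did not describe Expedia's actual scoring scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scorecard:
    service: str         # hypothetical service name
    tier: int            # 1 = essential, 2 = important, 3 = nice to have
    incidents: int       # incidents in the reporting period
    availability: float  # measured availability, e.g. 0.998


def prioritise(cards: List[Scorecard]) -> List[Scorecard]:
    """Order services for resilience work.

    Assumed rule: lower tier number first, then more incidents,
    then lower availability.
    """
    return sorted(cards, key=lambda c: (c.tier, -c.incidents, c.availability))


cards = [
    Scorecard("reviews", tier=3, incidents=9, availability=0.990),
    Scorecard("checkout", tier=1, incidents=2, availability=0.9995),
    Scorecard("search", tier=1, incidents=5, availability=0.998),
    Scorecard("recommendations", tier=2, incidents=4, availability=0.995),
]

ranked = [card.service for card in prioritise(cards)]
print(ranked)
```

With this rule a Tier 3 service never outranks a Tier 1 service, however many incidents it has had, which matches the idea of spending effort where the return is largest.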

Resilience testing in dev, test and production

Vulnerabilities were investigated with interactive experiments -- for example using the Gremlin chaos testing toolset -- and a service's resilience was defined within a maturity model: survive instance loss; survive dependency loss; survive AZ loss; survive region loss; and so forth. When the vulnerabilities were identified and understood, the team applied a series of resiliency patterns to address them:

  • Autoscaling
  • Rate limiting
  • Circuit-breaking - for example, protecting services with Netflix's Hystrix
  • Bulkheads - as popularised in Michael Nygard's book "Release It!"
  • Multi-geographic deployment - for example, multi-zone and multi-region
  • Database failover
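
To make one of these patterns concrete, the following is a minimal circuit-breaker sketch. It is not Hystrix and does not use the Hystrix API; the class name, the thresholds and the injectable `clock` parameter are assumptions chosen for the example. The idea is that after a number of consecutive failures the breaker "opens" and rejects calls immediately, then lets a single trial call through once a timeout has elapsed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each call to a dependency, for example `breaker.call(fetch_prices, hotel_id)` with a hypothetical `fetch_prices` function, and treat `CircuitOpenError` as the signal to use a fallback.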

Resilience experiments in test were added as a stage in the continuous delivery pipeline. Production experiments were conducted using Netflix's Simian Army, including Chaos Monkey. Because autonomy is a core value at Expedia, and because the resilience team wanted to champion improvements (and not simply break things), each service owner could "opt in" to a resilience testing whitelist. Each service exposed core health checks and metrics, and these were examined before, during and after each attack.
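
The opt-in whitelist and the health checks around each attack can be sketched as a small harness. Everything named here is hypothetical: the whitelist contents, the `run_experiment` function and the `FakeService` stand-in are illustrations only, not Expedia's tooling or the Simian Army API.

```python
from typing import Callable, Dict, Set

OPT_IN_WHITELIST: Set[str] = {"search", "checkout"}  # hypothetical opted-in services


def run_experiment(service: str,
                   health_check: Callable[[], bool],
                   inject_failure: Callable[[], None],
                   restore: Callable[[], None],
                   whitelist: Set[str] = OPT_IN_WHITELIST) -> Dict[str, bool]:
    """Run one attack against an opted-in service and record its health."""
    if service not in whitelist:
        raise PermissionError(f"{service} has not opted in to resilience testing")

    results = {"pre": health_check()}
    if not results["pre"]:
        return results  # never attack a service that is already unhealthy

    inject_failure()
    try:
        results["during"] = health_check()
    finally:
        restore()  # always undo the injected failure
    results["post"] = health_check()
    return results


class FakeService:
    """Stand-in service: healthy while at least two instances are running."""

    def __init__(self) -> None:
        self.instances = 3

    def healthy(self) -> bool:
        return self.instances >= 2

    def kill_instance(self) -> None:
        self.instances -= 1

    def restore(self) -> None:
        self.instances = 3


svc = FakeService()
report = run_experiment("search", svc.healthy, svc.kill_instance, svc.restore)
print(report)
```

Here the fake service survives the loss of one instance, so all three checks pass; a service that is not on the whitelist is refused before anything is injected.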

Anatomy of a resilience test

The results of resilience testing have generally been positive: Chaos Monkey has been running daily in production since May 15th; resilience tests have been added to four Tier 1 service pipelines; organisational awareness has increased; and a resilience community of practice has been established with 65+ active members. On the challenges side, engaging development teams has been a struggle due to limited team capacity, and the drive to improve Expedia's products is currently still greater than the perceived need for improved resilience.

Samiei and Wheeler concluded the talk by stating that the focus of resilience engineering at Expedia for 2018 will be automation, specifically: enabling resilience testing via a service mesh or proxy (e.g. Linkerd or Envoy); testing via service discovery; and increased observability. The primary goal is to reduce the cost of resilience engineering through automation.

The slides for Sahar Samiei and Willie Wheeler's talk "Expedia's Journey Toward Site Resiliency" (PPTX, 25MB) can be found on the QCon SF website. The video for this and all QCon SF talks will be made available over the coming months on InfoQ.
