
Building 'Failure as a Service' at Netflix without the Simian Army

by Daniel Bryant on Jun 13, 2015. Estimated reading time: 2 minutes

At QCon New York 2015, Kolton Andrus discussed Netflix’s Failure Injection Testing (FIT) platform, which allows the injection and monitoring of arbitrary failure scenarios to a targeted group of customers using the Netflix production web services. FIT allows Netflix to maintain an ‘antifragile’ programming culture, which results in the creation of systems that are resilient to failure.

Andrus, a senior software engineer at Netflix, began the talk by stating that software application failure testing should be conducted within a live production environment primarily for three reasons: this makes systems immune to failure, prevents larger outages, and allows verification of correct behaviour within a realistic production deployment. Andrus suggested that failure testing at scale within a production environment is much like hormesis:

Failure testing is a form of Hormesis - we imbibe the poison to become immune.

Andrus introduced Netflix’s Failure Injection Testing (FIT) ‘failure as a service’ platform, which has previously been written about on the Netflix Tech Blog. The traditional approach to failure testing at Netflix has leveraged the Simian Army, but the use of these applications can lead to unwanted problems propagating through to customers under exceptional circumstances. In particular, the effects of the ‘latency monkey’ have occasionally caused unintended cascading failures, and as such, Netflix developers have become cautious in its deployment.

The FIT application provides a web-based user interface (UI) that allows Netflix developers to define a specific failure scope, for example, a single customer or a cohort of customers. This limits the potential ‘blast radius’ of the failure testing. The UI also provides a ‘Halt all Failures’ button, which allows any developer to immediately stop all FIT failure testing if customers outside the intended scope are being affected.
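The idea of a scoped failure with a global stop switch can be sketched in a few lines of Python. This is a minimal illustration only, not Netflix's implementation: the `FailureScope` and `FailureRegistry` names, the scenario name, and the customer IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FailureScope:
    """Hypothetical scope: the explicit set of customer IDs a test may affect."""
    customer_ids: frozenset

    def includes(self, customer_id: str) -> bool:
        return customer_id in self.customer_ids


@dataclass
class FailureRegistry:
    """Hypothetical store of active failure scenarios plus a global halt flag."""
    scenarios: dict = field(default_factory=dict)  # scenario name -> FailureScope
    halted: bool = False

    def register(self, name: str, scope: FailureScope) -> None:
        self.scenarios[name] = scope

    def halt_all(self) -> None:
        # In the spirit of a 'Halt all Failures' button: stop everything at once.
        self.halted = True
        self.scenarios.clear()

    def active_for(self, customer_id: str) -> list:
        """Return the names of the scenarios that apply to this customer."""
        if self.halted:
            return []
        return sorted(
            name for name, scope in self.scenarios.items()
            if scope.includes(customer_id)
        )


registry = FailureRegistry()
registry.register("recommendations-down", FailureScope(frozenset({"customer-42"})))
print(registry.active_for("customer-42"))  # ['recommendations-down']
print(registry.active_for("customer-7"))   # []
registry.halt_all()
print(registry.active_for("customer-42"))  # []
```

Keeping the scope as an explicit set of customers means that anyone not listed is unaffected by default, which is one simple way to bound the blast radius.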

Netflix utilise a custom API gateway/proxy named Zuul to perform routing (and other actions) for all inbound traffic to the Netflix web services. The FIT platform supplies failure metadata to Zuul, which allows the incoming requests from the targeted failure scope (customer/cohort) to be identified and marked as candidates for failure injection. Injected failures can include adding latency to a request, returning an arbitrary HTTP status code, or throwing an error.

[Diagram: example of potential failure injection points]
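As a rough illustration of that flow in code, the sketch below tags a request at the edge and then applies any matching failure at a named injection point. It is written under stated assumptions and does not use the Zuul or FIT APIs: the `Failure` model, `tag_request`, `call_with_injection`, and the `"recommendations"` injection point are hypothetical names chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Failure:
    """Hypothetical failure description: added latency, a status code, or an error."""
    injection_point: str
    latency_seconds: float = 0.0
    status_code: Optional[int] = None
    raise_error: bool = False


class InjectedFailure(RuntimeError):
    """Raised when a scenario asks for an error to be thrown."""


def tag_request(request: dict, failures_by_customer: dict) -> dict:
    """Gateway step: attach failure metadata to requests from targeted customers."""
    tagged = dict(request)
    tagged["failures"] = failures_by_customer.get(request["customer_id"], [])
    return tagged


def call_with_injection(request, injection_point, real_call, sleep=time.sleep):
    """Injection point: apply the first matching failure, otherwise call through."""
    for failure in request.get("failures", []):
        if failure.injection_point != injection_point:
            continue
        if failure.latency_seconds > 0:
            sleep(failure.latency_seconds)  # simulate a slow dependency
        if failure.raise_error:
            raise InjectedFailure(f"injected error at {injection_point}")
        if failure.status_code is not None:
            return {"status": failure.status_code, "body": None}
    return real_call(request)


def recommendations_service(request):
    # Stand-in for a real downstream call.
    return {"status": 200, "body": ["personalised title"]}


failures = {"customer-42": [Failure("recommendations", status_code=503)]}

targeted = tag_request({"customer_id": "customer-42"}, failures)
print(call_with_injection(targeted, "recommendations", recommendations_service))
# {'status': 503, 'body': None}

other = tag_request({"customer_id": "customer-7"}, failures)
print(call_with_injection(other, "recommendations", recommendations_service))
# {'status': 200, 'body': ['personalised title']}
```

Carrying the failure metadata with the request means that downstream code only needs to check the request it is handling, so requests from customers outside the scope take the normal path.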

The FIT UI allows failures to be monitored, and also customers and devices to be traced. Andrus provided a live demonstration of the use of FIT on the production Netflix website, and injected a failure scoped to his Netflix customer account that caused only non-customised film recommendations to be shown on his account home page. After the demonstration was complete, Andrus disabled the failure injection and reloaded his account home page to show that the standard personalised film recommendations were once again visible.
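The behaviour seen in the demonstration, where the home page still rendered but with non-customised recommendations, resembles a fallback path. A minimal sketch of such a fallback follows, assuming a hypothetical `home_page_rows` function and a static list of popular titles; the article does not describe how Netflix actually implements this.

```python
def home_page_rows(customer_id, fetch_personalised, popular_titles):
    """Return personalised titles, or a generic list if the dependency fails."""
    try:
        return {"personalised": True, "titles": fetch_personalised(customer_id)}
    except Exception:
        # Degrade gracefully: the page still renders, just without personalisation.
        return {"personalised": False, "titles": list(popular_titles)}


def failing_recommendations(customer_id):
    # Stand-in for a recommendations call with an injected failure.
    raise RuntimeError("injected failure")


print(home_page_rows("customer-42", failing_recommendations, ["Title A", "Title B"]))
# {'personalised': False, 'titles': ['Title A', 'Title B']}
```

Injecting a failure for a single account, as in the demonstration, is one way to confirm that a fallback like this actually triggers in production.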

Nassim Nicholas Taleb’s notion of antifragility (the opposite of fragility) was also referenced, and it was suggested that tooling such as FIT could allow the creation of an antifragile software development process:

Aggressive failure testing creates not just robust programs, but an antifragile programming culture

Andrus concluded the talk by stating that, in his experience at Netflix, failure testing is a worthwhile investment, testing in production is sustainable, and the technique can harden systems against failure.

More information about Kolton Andrus’s “Breaking Bad at Netflix: Building Failure as a Service” talk can be found at the QCon New York 2015 website. The Netflix FIT application is not yet available as open source, but additional information can be found within a recent Netflix Tech Blog post.
