How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is more valuable than prevention; an incident occurs when there is a “perfect storm” of events, and so there is no single root cause; teams should “stop reporting on the nines”, as user happiness is more important; and there is a lot of value in learning how things go right during “normal” operation of the system.

Kitchens, a senior site reliability engineer (SRE) at Netflix, began the talk by outlining key responsibilities within the SRE role: assisting with defining service level indicators (SLIs) and service level objectives (SLOs), helping to monitor and manage error budgets, running “war rooms” during incidents, and conducting post-incident reviews. Within a complex distributed system, as found at Netflix, failure is the normal state, and the Netflix SRE team has evolved their practices and process so that although failure is important, it is “no longer interesting”.

Ryan Kitchens -- SRE role

The most important thing that can be learnt is how to build capacity into the system in order to encounter failure successfully; the ability to recover effectively is much more valuable than preventing an incident. Having said this, the Netflix team has become “pretty good at preventing incidents”, primarily through the use of chaos engineering, game days, and related practices. Quoting Charity Majors, Kitchens stated that “availability is made up”, and that “the nines don’t matter if users aren’t happy”. Success isn’t necessarily the absence of failure, and having 99.999% uptime is practically meaningless if the users are unable to use the system as they intend.

Next, the argument was made that “there is no root cause”, and instead an incident is typically caused by a “perfect storm” of multiple contributing factors. The “three pillars of fallacy” -- consisting of comprehension, understandability, and predictability -- are a useful reminder as to how we misunderstand incidents. For example, incidents cannot typically be fully comprehended, and remediation items will often contribute to further incidents. Incidents are not made up of causes; we do not “find” them, and instead we construct them, and develop our understanding by creating a narrative. And learning from the last incident will not allow you to predict the next one; complex systems are not deterministic.

There is no root cause for an incident.

Kitchens argued that we can learn more from incidents by asking “more ‘how,’ rather than ‘why’”. It is more beneficial to make sure that things go right, rather than preventing them from going wrong, and to understand this you have to know how normal work is conducted. Traditional incident investigation uses only a small portion of the total experience, and often ignores data from the time in which things go right: “typical, everyday performance goes unknown and ignored”.

A series of “contributors and enablers, mitigations, and risks” factors were presented from Netflix incident reports, many of which will be recognisable to anyone who has dealt with (or been involved in) an incident. There are often difficulties in handling an incident due to “islands of knowledge”, and these should be identified in follow-ups and documented within artifacts, as discussed in J. Paul Reed’s MSc thesis “Maps, Context, and Tribal Knowledge: On the Structure and Use of Post-Incident Analysis Artifacts in Software Development and Operations”. Creating a timeline for an incident can also help, and here John Allspaw’s work was recommended, specifically his MSc thesis, “Trade-Offs Under Pressure: Heuristics and Observations Of Teams Resolving Internet Service Outages”.

In the penultimate section of the talk, the audience was prompted to think about “how hard are people working just to keep the system healthy?” and “how should this information be passed up the chain of management within an organisation?” Kitchens mused that some organisations may not share this kind of data because “everything looks good”, but this overlooks the need to investigate what normal operations look like. You have to talk to people to discover this important information, and working with people in this context can be the “hardest problem in tech”.

Conducting interviews with engineers, and moving from “why did things go wrong?” to “how did things go right?” is a challenging but valuable exercise. Be careful not to overwhelm or unintentionally intimidate interviewees, and strive to ask effective open-ended questions. Elicit descriptions to construct “how we got here”, and try to convey what the world looked like from their perspective.

How we respond is important.

Ultimately, how an organisation responds post-incident is vitally important. Avoid using language like “we must be more careful”, “we must avoid making mistakes”, or “we must have more discipline”. Several books were recommended that provide more information on this topic, such as “Language Bias in Accident Investigation”, “Debriefing Facilitation Guide”, “The Field Guide to Understanding ‘Human Error’”, and “Pre-Accident Investigations”.

The video recording of Ryan Kitchens' presentation will be made available on InfoQ over the coming months. Additional information on the talk can be found on the QCon New York website.
