
Too Big To Fail: Lessons Learnt from Google and HealthCare.gov


At QCon New York 2015, Nori Heikkinen shared stories of failures and lessons learnt during her time working as a site reliability engineer (SRE) at Google and HealthCare.gov. The discussion of managing large-scale outages included recommendations for preparation, response, analysis and prevention. Heikkinen recommended that organisations should embrace disaster prevention by regularly practicing failure detection and recovery scenarios.

Heikkinen began the talk by referencing the confidence of the Titanic’s creators before its maiden voyage, and suggested that we should learn to plan for the bad things that will inevitably happen. After asking the audience “what do you do when the unsinkable sinks?”, Heikkinen continued by sharing stories and lessons learned from her time working as a traffic SRE at Google, and as part of the ‘tech surge’ team performing remedial work on HealthCare.gov.

The first story, “a city vanishes: a tale of two router vendors”, revolved around an incident at Google where the entire Atlanta-based ‘metro’ point-of-presence (PoP) region appeared to vanish from the Google network. Ultimately this was due to a bug in a new router being tested, which, when queried via SNMP for an unknown piece of data, would crash spectacularly. Heikkinen used this story (and the corresponding response by the Google SRE team) to demonstrate that preparedness is an important element of handling failure.

Modelling ahead of time is essential, and techniques such as load testing, capacity planning and regression analysis allow an SRE team to understand how the system should behave under specific conditions. Visualising systems and their associated data in real time allows operators to react quickly, and also to answer questions such as “what effect would X have?”.
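As a minimal illustration of the kind of modelling Heikkinen described (not an example from the talk), a simple least-squares regression over historical monitoring samples can relate request rate to resource usage and give a first answer to “what effect would X have?”. The sample data, the 80% headroom threshold and the hypothetical traffic level below are illustrative assumptions.

# Minimal capacity-modelling sketch: fit CPU utilisation against request rate
# using ordinary least squares, then project a hypothetical traffic level.
# The sample data and the 80% headroom threshold are illustrative assumptions.

samples = [  # (requests per second, CPU utilisation as a fraction)
    (1000, 0.22), (2000, 0.35), (3000, 0.48),
    (4000, 0.61), (5000, 0.74),
]

n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n

# Ordinary least squares: slope and intercept of utilisation vs. request rate.
num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
den = sum((x - mean_x) ** 2 for x, _ in samples)
slope = num / den
intercept = mean_y - slope * mean_x

def projected_utilisation(requests_per_second):
    """Predict CPU utilisation for a hypothetical request rate."""
    return intercept + slope * requests_per_second

# "What effect would doubling peak traffic have?"
hypothetical_rps = 10000
projection = projected_utilisation(hypothetical_rps)
print(f"Projected utilisation at {hypothetical_rps} req/s: {projection:.0%}")
if projection > 0.8:  # illustrative headroom threshold
    print("Projected load exceeds the 80% headroom target; add capacity first.")

In practice such a model would be fed from real monitoring data and validated with load tests before its projections are trusted.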

The second story, “Satpocalypse”, detailed the time when all of Google’s “satellite” edge web servers broke simultaneously. The satellite servers are ephemeral nodes placed at the edge of Google’s network that terminate incoming traffic in order to reduce latency for users communicating with the ‘core’ web servers. The entire fleet failed at the same time due to a bug in the server decommissioning process, which erased the storage devices of every satellite machine. However, thanks to the extensive automation that had been implemented, responsibility for terminating incoming traffic fell back onto the core web server clusters, and end-users experienced no visible issues.

Heikkinen presented the moral of this story as “learn how to panic effectively”. An Incident Command System (ICS), as used by many emergency services, is an effective way to centralise response, allow cross-agency collaboration, and scale as the incident demands. Google have modified the traditional ICS to fit their culture, creating a lightweight process that involves less bureaucracy and removes the need for an enforced organisational hierarchy (which Google does not have).

The third story, “the call is coming from inside the house”, came from Heikkinen’s time working on HealthCare.gov. During the final week of open enrollment the login systems crashed, which, due to the spaghetti architecture, effectively brought down the entire application. The ultimate cause was a contractor who, attempting to generate a report requested by his manager, had run a complex query as a superuser against the application’s primary Oracle database, slowing all other database queries. The contractor had observed the issues occurring (and the reaction), but said nothing because of the toxic culture of the working environment, fearing that speaking up would cost him his job.

People are often the ‘fifth nine’ of system availability

Heikkinen suggested that people are often the ‘fifth nine’ of system availability (a reference to the five nines of availability principle). Developing a culture of response is essential, and team members must care enough to monitor and react to situations. Work undertaken by engineers should also be connected to the bottom line, as this creates the correct incentives throughout the organisation. Creating a culture of responsibility is equally important, and blameless postmortems are highly valuable for reviewing how issues developed and how they were handled. Heikkinen recommended that operational experience should also inform future systems design.

For the final lesson, Heikkinen suggested that organisations should embrace disaster prevention by regularly practicing failure detection and recovery scenarios, for example by running ‘Disaster Recovery Testing’ (DiRT) sessions, in which synthetic failure scenarios are designed and executed.

The best way we’ve found to prevent disaster is to actively engage with it.

Google conducts DiRT sessions regularly, and real systems are broken to allow the teams in training to react to a realistic situation (the tests are halted if the breakage becomes user-facing). Also discussed was a role-playing technique named “wheel of misfortune”, where team members are selected at random, and past outages or potential ‘future weirdness’ scenarios are acted out. Heikkinen noted that she had seen several real bugs discovered using this technique.
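As a sketch of what a small, self-contained failure drill can look like (this is not Google’s internal DiRT tooling; the service name, health-check URL and recovery deadline below are assumptions), a script can inject a fault into an isolated test environment, wait for the team being drilled to detect and restore the service, and always roll the breakage back at the end.

# Minimal failure-drill sketch: stop a (hypothetical) test-environment service
# and check whether the on-call team detects and restores it within a deadline.
# The service name, health-check URL and deadline are illustrative assumptions.
import subprocess
import time
import urllib.request

SERVICE = "frontend-test"                      # hypothetical, test-only service
HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical health endpoint
RECOVERY_DEADLINE_S = 900                      # drill fails if not restored in 15 min

def service_is_healthy() -> bool:
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Inject the fault. This must only ever target an isolated test environment.
subprocess.run(["systemctl", "stop", SERVICE], check=True)
deadline = time.monotonic() + RECOVERY_DEADLINE_S

try:
    # The team being drilled must notice the outage via their own monitoring
    # and bring the service back; the script simply waits and observes.
    while time.monotonic() < deadline and not service_is_healthy():
        time.sleep(10)
    restored = service_is_healthy()
    print("Drill passed" if restored else "Drill failed: service not restored before the deadline")
finally:
    # Mirror the rule from the talk: always roll the breakage back at the end.
    subprocess.run(["systemctl", "start", SERVICE], check=True)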

Finally, Heikkinen shared Mikey Dickerson’s “hierarchy of reliability”, which captures lessons learnt from HealthCare.gov and was inspired by Maslow’s hierarchy of needs (with the most important need at the base of the pyramid).

In conclusion, Heikkinen noted that the “hierarchy of reliability” contains needs that map onto her story themes: response (the bottom two needs), analysis (postmortems), and preparation (the following two needs). The hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs must be addressed within an organisation before prevention can be examined.

Additional details of Nori Heikkinen’s QCon New York 2015 “Too Big To Fail” talk can be found on the conference website.
