What can software learn from industries like aerospace, transportation, or even retail during national disasters? This week’s podcast is with Emil Stolarsky and was recorded live after his talk on the subject at Strange Loop 2017. Interesting points from the podcast include several stories from Emil’s research, including the origin of the checklist, how Walmart pushed decision-making down to the store level during a national disaster, and where the formalised conversation structure onboard aircraft originated. The podcast mentions several resources you can turn to if you want to learn more and wraps up with some of the ways this research is affecting incident response at Shopify.
Key Takeaways
- Industries like aerospace have built up a long history of learning how to resolve incidents; much of that knowledge is applicable to software as well.
- Crew Resource Management helps teams work together and take ownership of problems they can solve, instead of relying on a command-and-control structure.
- Checklists are automation for the brain.
- Delegating authority to resolve system outages removes bottlenecks in processes that would otherwise need managerial sign-off.
- When designing an alerting system, make sure it doesn’t flood responders with irrelevant alerts and that there’s clear observability into what is going wrong.
Show Notes
What do you do for Shopify today?
- 1:45 I’m a production engineer for Shopify, which is our SRE role.
- 1:55 I work on the job infrastructure and cache team; we look after the cache setup and job queues, and make sure they meet their SLAs.
What does incident management look like at Shopify?
- 2:15 Before incident management was applied at Shopify, we had a traditional on-call engineer who would come in when paged and fix any issues.
- 2:25 Around a year ago, Kirk Russel joined from Google and introduced the NIMS system (National Incident Management System).
- 2:40 NIMS is a system developed by the US Government for managing large-scale disasters.
- 2:45 The idea is that you formalise the process for having someone run the response to an incident, and we applied that at Shopify.
- 3:00 Now, when an incident occurs, we have a ChatOps integration tool that we use to run through the incident; we also have a dedicated on-call for incident commanders.
- 3:10 If an incident is severe enough, that person will be paged and will be managing the response to the incident.
- 3:20 They will be the point of contact for incident communication, dealing with the communications team, ensuring that the right people are called, updating the status page, co-ordinating with leadership and other stakeholders.
- 3:55 When we rolled out the system, we realised that someone had always been playing that role informally; it had just never been formalised.
- 4:05 Sometimes people would feel comfortable doing it, and sometimes they wouldn’t.
- 4:10 When we rolled the system out, we trained people in the system and talked about what a good response looks like, as opposed to just focusing on the root cause of the issue.
Is Shopify a co-located or distributed team?
- 4:20 We have five offices spread across North America, with one in Europe. Our development team doesn’t work a follow-the-sun model, but it does span multiple time zones.
How does that affect incident response?
- 4:40 It hasn’t been an issue - we have multiple rotations for morning and afternoon cover.
What made you decide to look at transportation industries and compare that to software?
- 10:50 When Kirk joined, he told us how they had adopted the NIMS system from the US Government.
- 5:10 I saw the effectiveness of us adopting this system, and while watching air accident investigation reconstructions on TV, I was fascinated by how they lay out all of the pieces and do great forensic analysis to find out what went wrong.
- 5:25 At one point I realised: whenever there’s an issue, we’re trying to figure out what is wrong with a complex system.
- 5:45 Flying is incredibly safe - the NTSB (National Transportation Safety Board, responsible for investigating transport-related accidents in the US) has done over 140,000 investigations, to the point today where if you flew once a day, every day, it would take 4,000 years to be in an accident.
- 6:20 The NTSB figured out how to do this over time; we would probably get there eventually too, but we can short-circuit that by learning from what they did.
- 6:30 I started reading up on the airline industry, which led me to medicine, oil and gas, and natural disaster response.
- 6:55 Once you pull that thread, you realise how they’re all interconnected and share information, and I feel that it uncovers a massive body of knowledge that we can learn from.
- 7:05 We’re seeing incident command systems spreading across companies in our industry now - but there’s so much more.
- 7:15 That was really exciting - so I wanted to share what I had learned.
- 7:25 We’ve been lucky in the software industry - generally, when software fails it doesn’t have cataclysmic consequences.
- 7:30 Only certain services that we’ve run have fallen into the limelight when they fail.
- 7:40 Every single nuclear accident gets reported.
- 7:45 You’re legally required to report aviation incidents.
- 7:50 There’s writing on the wall that this is going to happen to us eventually.
Why are checklists so important to the aviation industry?
- 8:10 In the 1930s, the US Army Air Corps was trying to procure a new bomber.
- 8:20 Boeing had built the B-17 - it could fly twice as far and carry twice as much weight.
- 8:45 On its second flight, the aircraft crashed about 20 seconds after lift-off.
- 8:55 The pilots were experienced - one was the head test pilot for the Army, and the other was the chief test pilot for Boeing.
- 9:15 The B-17 was one of the most complex aircraft Boeing had developed at the time, and they realised that the solution wasn’t more training - it was a checklist.
- 9:25 The reason was that the aircraft was so complex that it was impossible for an individual to hold all the context and steps in their mind.
- 9:40 They created a stepped checklist - what to do before take-off, what to do during take-off, in flight, before landing, during landing, and after landing.
- 10:40 We can’t look at the history of these industries and not learn anything from them.
What is crew resource management and why is it important?
- 11:30 United Airlines 173 flew from New York to Portland via Denver in 1978.
- 11:50 As they came in for landing, they lowered the landing gear but one of the gears failed to lock.
- 12:00 They executed a go-around, and flew around trying to diagnose the problem.
- 10:50 They did that for about an hour, and as they were coming in to land, the engines ran out of fuel and the aircraft came down short of the runway.
- 12:15 During the investigation, they discovered that the aircraft had run out of fuel before landing - a fact the co-pilot and flight engineer had noted on the voice recorder.
- 12:30 The captain was so focused on trying to troubleshoot the problem that he ignored their observations.
- 12:40 In that era, the captain was the highest authority on the plane and couldn’t be overruled.
- 12:44 There were several accidents in that era that were traced back to crew members not being able to get critical information through to the captain.
- 13:14 So NASA went off and developed Crew Resource Management - a framework for how to communicate in the cockpit, both during incidents and in general.
- 13:34 When you read it, it’s basic things: address the person you’re talking to, tell them what the problem is, state why you think it’s a problem, say how you want to fix it, and wait for an acknowledgement (a rough sketch of how this structure might be templated follows this list).
- 13:49 This all sounds obvious - but when I was going back over the material, I thought back to all of the different incidents I’ve seen at Shopify, where something’s bad and people are trying to help and throwing in different solutions.
- 13:59 When we fix it and look back in the retrospective, we see that someone had called out the proper issue at the very beginning.
- 14:14 You wonder: if pilots have had to deal with this kind of problem, what could we learn from them?
- 14:24 When I was listening to pilots talk about near misses, they would invariably talk about crew resource management.
- 14:39 When pilots are training, crew resource management is drilled into their heads.
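Nothing like this appears in the podcast itself, but as a rough illustration of how that CRM-style callout structure could be templated for an incident chat channel, a minimal sketch might look like the following - the function, field names, and example values are hypothetical, not Shopify's tooling:

```python
# Hypothetical sketch of a CRM-style callout for an incident channel.
# It follows the structure described above: address someone, state the problem,
# say why it matters, propose a fix, and ask for an acknowledgement.
def crm_callout(addressee: str, problem: str, why_it_matters: str, proposal: str) -> str:
    return (
        f"@{addressee}: {problem}. "
        f"I'm concerned because {why_it_matters}. "
        f"I suggest we {proposal}. Do you agree?"
    )

if __name__ == "__main__":
    print(crm_callout(
        "incident-commander",
        "checkout error rate has been above 5% since the 14:02 deploy",
        "customers can't pay and the status page still says all systems operational",
        "lock deploys and revert the 14:02 change",
    ))
```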
What other industries or disasters did you look at?
- 15:04 Hurricane Katrina happened in 2005 - and Walmart prepared for it ahead of time.
- 15:09 They pushed down responsibility to store managers and gave them the leeway to handle the issues locally.
- 15:29 A store manager in Louisiana couldn’t get into the store, so they took a bulldozer through the wall to get at supplies, which they gave to the emergency services for free.
- 15:44 When your service is down, you don’t want a decision that might bring the service back up to be blocked waiting for authority to perform that action.
- 15:59 Another benefit is foresight and planning ahead - after Katrina, Walmart was one of the first organisations that brought supplies into the city.
- 16:19 The southern United States is known for hurricanes, and Walmart’s supply chains are designed to survive these natural disasters.
How do you think these stories affect how you do incident response at Shopify?
- 16:39 We’re looking at Crew Resource Management, and during our on-call training, we’re going to look at how we can keep the conversational structure formalised.
- 16:59 There’s a balance - we’re not an airliner - but it will be interesting to see what we can learn from those industries.
- 17:09 We have a ChatOps tool that we use for managing incidents - we’re going to be looking at integrating checklists into that.
- 17:24 This will mean that we don’t have to think of the basic things when we’re doing an incident response.
- 17:29 I mentioned in the talk that checklists are mental automation.
- 17:34 You don’t want to have to think about how to do the obvious stuff: you know you’re going to have to update the incident status page.
- 17:39 If bad code caused it, you will have to lock deploys or revert.
- 17:44 You can then focus your mental cycles on debugging the actual problem.
- 17:54 We can drop a checklist with these bullet points into the ChatOps room (a minimal sketch of what that might look like follows this list).
- 18:04 In the cockpit of an airplane, you’ll have checklists - electronic ones, with a paper backup if necessary.
- 18:09 We’ll be rolling out those changes and seeing if we can integrate them into our incident response.
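The exact ChatOps integration isn't described in the podcast, but as a loose illustration of dropping a checklist into an incident room, a minimal sketch might look like the following - the webhook URL, channel name, and checklist items are hypothetical, not Shopify's actual tooling:

```python
# Minimal sketch of posting an incident checklist into a chat room.
# The webhook URL, channel, and checklist items below are made up for illustration.
import json
import urllib.request

INCIDENT_CHECKLIST = [
    "Page the incident commander",
    "Update the public status page",
    "Lock deploys (revert if bad code is suspected)",
    "Start a timeline document for the retrospective",
    "Notify the communications team and other stakeholders",
]

def post_checklist(webhook_url: str, channel: str) -> None:
    """Post the checklist as a single message so responders can tick items off."""
    text = "Incident response checklist:\n" + "\n".join(
        f"[ ] {item}" for item in INCIDENT_CHECKLIST
    )
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)

if __name__ == "__main__":
    post_checklist("https://chat.example.com/hooks/incident", "#incident-war-room")
```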
What would you recommend for people wanting to follow up on this?
- 18:34 Sidney Dekker has written many books on postmortems, and how to identify root causes.
- 18:44 Todd Conklin is a researcher in the safety/organisational behaviour field, who has a great podcast called Pre-Accident Investigations.
- 18:54 He’ll go and interview people in different fields, and ask them how they deal with incidents and emergencies.
- 19:04 One of the interviews is with the fire chief in Antwerp, Belgium; another is on the safety of scuba diving; another is about how surgeons perform retrospectives.
What surprised you the most in your learning?
- 20:09 The effectiveness of checklists.
- 20:14 I always thought checklists were taking the thinking out of something - follow the steps, don’t question it.
- 20:24 I dismissed it initially - in fact, most people dismiss them immediately.
- 20:34 When you look into the data, they are such an invaluable tool.
- 20:44 With checklists, what you’re doing is getting rid of the obvious.
- 20:54 We’re trying to automate technology to do the boring jobs repeatedly and correctly; a checklist is the same for the mind.
What similarities have you seen in your investigations and software?
- 21:09 I was reading about alerting philosophy in the Google SRE book.
- 21:14 Reading about the Three Mile Island incident: during the incident, hundreds of alarms went off, and there was a single printer that indicated what the alerts were.
- 21:24 What happened was that so many alerts fired that the operators decided the alerting system was faulty and ignored them.
- 21:40 In the alerting philosophy chapter, the authors caution against too much alerting - you want clear observability into the system to indicate what is going wrong (a small sketch of this idea follows this list).
- 21:49 Three Mile Island had a single gauge to tell them that something was going wrong.
- 21:54 Afterwards, a bulletin was issued saying that when you are designing alerting systems, you should make sure it is clear what the key issues are.
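The chapter itself isn't quoted in the podcast, but the spirit of that advice - page a human only on a small number of user-visible symptoms, and keep everything else out of the flood - can be sketched roughly as follows; the alert names and the symptom/cause split are hypothetical examples:

```python
# Illustrative sketch of symptom-based alert routing: page only on alerts that
# describe user-visible impact, and roll the rest into a non-paging digest.
# Alert names and classifications are invented for illustration.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    symptom: bool  # True if it describes user-visible impact (errors, latency)
    message: str

def route(alerts: list[Alert]) -> tuple[list[Alert], list[Alert]]:
    """Split alerts into ones that should page a human and ones that go to a digest."""
    page = [a for a in alerts if a.symptom]
    digest = [a for a in alerts if not a.symptom]
    return page, digest

if __name__ == "__main__":
    incoming = [
        Alert("checkout_error_rate_high", True, "5xx rate above 2% for 5 minutes"),
        Alert("cache_node_restarted", False, "cache-03 restarted"),
        Alert("queue_depth_elevated", False, "job queue depth above baseline"),
    ]
    to_page, to_digest = route(incoming)
    print("Page:", [a.name for a in to_page])      # the alert worth waking someone for
    print("Digest:", [a.name for a in to_digest])  # context, not a flood of pages
```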