Whispers in the Chaos: Monitoring Weak Signals

Summary

J. Paul Reed looks at what the safety sciences have to say about humans operating complex socio-technical systems, including how resilience engineering can help, the role heuristics play in incident response, and more. All of these provide insight into ways to improve one of the most advanced and effective monitoring tools available to keep those systems running: ourselves.

Bio

J. Paul Reed has over fifteen years of experience in the trenches as a build/release engineer, working with such companies as VMware, Mozilla, Postbox, Symantec, and Salesforce. He speaks internationally on release engineering, DevOps, operational complexity, and human factors.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

We're going to be talking today about Whispers in Chaos. And you might ask yourself, chaos? What do we mean by chaos? Well, we're going to be talking about incidents. So I want to start with a bit of a survey here. Raise your hand if you're SRE or in that world and you work incidents. Keep your hands up. Raise your hands if you are a developer and you get pulled into working incidents. A lot of those, okay. And raise your hand if you see the people with their hands up working incidents and you just want to know when the site's going to be back up and you have to read those reports. Yes, so that's pretty much everyone in the room. So that's good. You're in the right place. We're going to have fun with this topic today.

So I want to start with a question, another question. How do you know an incident is occurring? Think about that for a second. How do you know an incident is occurring? Monitoring. It's not a trick question. I hear someone at the back going, "Kubernetes," no. No, it's just monitoring. I know, by the way, it's observability now. So, not monitoring. I know there are some arguments about that. We can talk about that later tonight at the social event.

But the question that we're going to be looking at today, we're going to be talking about a couple of major things. How do you know what to do when an incident is occurring? And that's a little bit more of an interesting question. How do we know what to do when that monitoring and alerting goes off? So those are the two major things we're going to be looking at; is how we, when we work in the systems, deal with the chaos of incidents, and then what we can do to get better at dealing with chaos and incidents.

So this is the obligatory "about me" slide. I'm not going to go through it. I'm going to point out two things, though. The Twitter name is up-top. So if I say something during the talk that is confusing or you want to DM me, DMs are open. You're welcome to tweet at me. The other thing I want to point out is that last bullet point, Masters of Science candidate in Human Factors and Systems Safety. I actually just finished the masters up. But the reason I call it out is because a lot of the things that we're going to be talking about come from the work that's been done in the safety sciences in the last 80, 90 years and my classmates are, you know, pilots and air traffic controllers and nuclear engineers. So a lot of this research that's been done started there but it's creeping actually, not slowly, pretty quickly into our industry because it turns out when you look at those systems, those complex systems, there's a lot in common with what they deal with.

How Do You Know What to Do When an Incident is Occurring?

So back to that question I started with. How do you know what to do when an incident is occurring? Well, one of the first parts to answering this question is really to figure out how humans react to problems. And so there are a couple of ways that our brains deal with that. They're very boringly named: System 1 and System 2. System 1 is automatic, it doesn't take a lot of effort for us to use System 1. And interestingly, there's not really a sense of voluntary control and I'll give you an example of this in a second. But it's not like we think, "Oh, I'm going to use system 1 to solve this problem." It just kind of happens.

Now again, System 2 is described as sort of effortful. We use it for complex computations. One of the most interesting things about System 2 is that it's associated with the subjective experience of agency, choice, and concentration. The point being that System 2 thinking, when we say, "I'm concentrating on a problem," that's System 2. When we think we're making a decision, and we talk about agency in a situation, that's System 2. Now, this comes from Daniel Kahneman's work. He was a psychologist, actually. And interestingly, he won the Nobel Prize in Economics in 2002 for describing these two systems. And he talks about it in the book "Thinking, Fast and Slow." This is really his career's work in this book. It's a pretty chunky book but it's worth a read if you're interested in this stuff.

So what are some examples of System 1 and System 2? Well, System 1, if I say, "Complete bread and …" most people are going to say, "Butter," right? "Mom and …" most people are just going to say, "Dad." Orienting to the source of a sudden sound. So if somebody drops a glass out at lunch, us turning our heads to find that sound is all System 1. Again involuntary: two plus two equals what? Interestingly, finding a strong move in chess is actually a System 1 part of the brain, but only if you're a chess master. And this is important. We'll come back to it. But it's very interesting that System 2, for certain types of activities, for certain types of people, actually becomes System 1.

System 2. I think the most interesting example from System 2 is focusing on a particular voice in a crowded room. So we've all had this problem where we're at a crowded room. We're talking to someone at a party. Somebody drops a glass and we might try to find where that noise came from and we aren't able to listen to what they are saying anymore because there's too much ambient noise in the room. So this is an example of, it actually takes concentration for us to understand what the other person is saying. Counting the occurrence of the letter "A" on this slide. I like filling out a tax form. I don't know how much agency there is in that. But that's an example of System 2, and of course, validating a complex logical argument.

The real important point about all of these is that these processes, when our attention is disrupted, the process that we're working on is disrupted. So it's the idea that it actually takes effortful attention to do these System 2 tasks. But as it turns out, it's more complicated than that. It always is. And so there's more research to look at. And what we're going to be looking at is John Allspaw's thesis. This is John. He was the CTO of Etsy. He was also the program chairperson, one of the program chairs for Velocity Con for a number of years. And he did research at Lund University in Sweden and this is his thesis: "Tradeoffs under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages." That's a mouthful. But what he basically did is the most extensive post-mortem ever done on a particular incident and he turned it into a master's thesis.

"The Incident"

So we're going to go through and look at some of the things that he found out and some of the things that he described as he examined, very deeply, one of his teams at Etsy working a particular outage. So let's start with the incident, description of the incident.

This is in 2014 during one of the busy holiday shopping seasons. They noticed, one of the afternoons where, you know, people are trying to buy presents for their loved ones, that the personalized home page for logged-in users was experiencing loading issues. It was slowing down. People were having trouble loading the page. So, of course, how many people do retrospectives and they do timelines. People are familiar with the timelines. This is right out of, "The Field Guide to Understanding Human Error," by Sidney Dekker. By the way, if you find this topic in general interesting and you want to learn more about it, the field guide is actually a really good book to start with. It's easy to read. There are lots of pictures. It's funny. Jessica was pointing out the stock clip art there. There's lots of stock clip art in there.

But the point that I want to point out with this is this is an example that he gives of a timeline. You'll notice there are different components of the timeline. A lot of times, when I look at post-mortem reports from organizations, the timeline is just like, the time, it's like a table, usually. It's the time and the event and that's the timeline. And Dr. Dekker talks more about a timeline, that it can actually be useful in helping us learn about a particular incident and what it looks like. So you notice there's graphing data about the system at the top. There's the tasks of various people. So what were people doing at these particular moments of time in there? And then also, what were they saying? What were they communicating with each other?

Now, what you'll notice is once you get this granularity of the timeline, you'll see these vertical lines across the timeline. I call them "Capital E" events. They're basically pivot points in the particular incident or outage or whatever you're working, that are actually relevant to learning. So you'll notice the third one, T3, that's where the task that somebody was doing ends and they say you see it and you'll see the graphs. They plateau and then they start to come down again. So the point of this is once you put all of this data together, you can start to get a richer understanding of what's going on.

So that's what John did. He had a bunch of data because Etsy uses ChatOps. They work incidents in IRC, actually, at least at this point in time. He was able to get the web logs from different services. So they have a deploy tool that they use. He was able to get the web logs to see who accessed what deploy tool or what tool when. You'll notice there's staff directory access in there. So that was actually people trying to find other people to pull into the incident, find their cell phone number, that sort of thing. And you'll notice, though, that's just for one engineer. But he was able to get those pivot points in the incident. And again, this is just for one engineer.

So incidents are often a team sport. Working incidents is often a team sport. So he put together all of the engineers, the things that they said in IRC together on one timeline and he called them, I always liked this, combined IRC utterances. I guess that's the academic term for talking on IRC. Anyway, so he put all these together and he was able to see, "Okay, who is talking to who about what, when?" and again, get richer data about this particular incident.
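As a rough illustration of that combined timeline, here is a minimal Python sketch that merges several time-ordered event streams (chat messages, tool access logs) into one. The `Event` fields, the `combined_timeline` helper, and all timestamps and log entries are hypothetical; they are not taken from the thesis data.

```python
import heapq
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds since the start of the incident (made-up values below)
    source: str       # where the record came from, e.g. "irc" or a web log
    actor: str        # who said or did it
    detail: str       # what was said or done


def combined_timeline(*streams: Iterable[Event]) -> list[Event]:
    """Merge event streams, each already sorted by time, into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


# Hypothetical sample data for two sources.
irc = [
    Event(10, "irc", "InfraEng2", "frozen shop?"),
    Event(95, "irc", "InfraEng1", "checking varnish queues"),
]
web_logs = [
    Event(40, "deploy-tool-log", "InfraEng1", "opened deploy dashboard"),
    Event(120, "staff-directory", "InfraEng2", "looked up a phone number"),
]

timeline = combined_timeline(irc, web_logs)
for event in timeline:
    print(f"{event.timestamp:>5} {event.source:<16} {event.actor:<10} {event.detail}")
```

Keeping the source and the actor on every record is what lets you later ask who was talking to whom, about what, and when.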

Now, I want to call out a couple of engineers that were involved in this incident: InfraEng1 and InfraEng2. We're going to revisit those two engineers and we'll call them Alice and Bob. But I wanted to call those out because we'll revisit a couple of interesting things with those two infrastructure engineers.

Heuristics

So Allspaw identified three heuristics that engineers used to work incidents. We're going to go through them and I want you to see if any of these heuristics resonate with your own experience if you work incidents or you've seen people work incidents or you're responsible for helping people get better at working incidents.

So the first one is what you might expect: change. Did anything change in this system? We know this system is in a good state and now it's in a bad state. So this includes commits to GitHub repos, deployments. It also includes network changes, infrastructure changes, any of that. Engineers will look and see, "Okay, we have a problem. What changed?" That's the easy, first heuristic, number one, that they go for.
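A minimal sketch of that first heuristic in Python, assuming you can collect change records from your own tooling. The `Change` type, the `recent_changes` helper, the six-hour lookback, and the sample records are all hypothetical; only the date matches the incident date mentioned later in the talk.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    when: datetime
    kind: str     # e.g. "commit", "deploy", "network"
    summary: str


def recent_changes(changes, incident_start, lookback=timedelta(hours=6)):
    """Return changes inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Hypothetical change log around a 15:00 incident start.
incident_start = datetime(2014, 12, 4, 15, 0)
changes = [
    Change(datetime(2014, 12, 3, 18, 0), "deploy", "previous evening's release"),
    Change(datetime(2014, 12, 4, 11, 30), "commit", "config tweak merged"),
    Change(datetime(2014, 12, 4, 14, 10), "network", "load balancer rule updated"),
    Change(datetime(2014, 12, 4, 15, 30), "deploy", "attempted fix after the incident began"),
]

suspects = recent_changes(changes, incident_start)
for change in suspects:
    print(change.when.isoformat(), change.kind, change.summary)
```

An empty result from a query like this is exactly the situation described next: nothing that you track as a change has changed, and yet the system is in a bad state.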

Well, what happens if nothing changed? And it turns out in this particular incident, there had been no code deployments. There had been no network or infrastructure changes. Nothing had changed and yet the system was now in a bad state. What do engineers do? Well, in talking with them and interviewing them and looking at all of the data, he found the second heuristic is to go wide. And he described it as widening the search to any potential contributors imagined. So this is the idea that, "Okay, we have a problem. We've diagnosed it and we can't directly attribute it to any change that has occurred. So now it's time to get kind of creative and imaginative. What things in the universe of possibilities about the system that I am familiar with, because I maybe built it or I operate it, could be causing this class of problem?"

And he talks about this idea of as engineers, it's sort of a breadth-first search. And then once we find something, we go do a depth-first search mode. So engineers are very broad and then as they diagnose things or look at things, then they say, "Oh, maybe it's the database," and then they'll start going down the path and narrowing their universe of possibilities to the system that they're looking at. So it's this interesting interplay between breadth to depth, back to breadth, back to depth, as they try to diagnose the problem.

Then the third heuristic that he talked through is what he called convergent searching. And this is the idea, if you look at this idea of going very breadth-first search, finding something and then going very deep, then the question becomes, "Well, how do I know to switch back to a breadth search mode for the problem space? How do I know to make that trade-off?" And this third heuristic, the convergent searching describes how that mechanism works.

And the point that he made is that the convergent searching is used to confirm or disqualify diagnoses by matching signals and symptoms. And this is really important. It's not that engineers just use this heuristic to know what is happening, they actually use it to prune the tree of possibilities. That's how they know that, "No, it's not the database. Let's go back up to the breadth level of we can set the database aside and it's working as designed," or whatever, or it's not the network, or it's not, you know, whatever it may be, message queues. So that's actually the really critical point, is that engineers use it both to find solutions and also to discard particular solutions. And it's how we make that decision to switch.
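One way to picture that pruning, as a toy Python sketch rather than the model from the thesis: each hypothesis predicts some symptoms, and you compare those against signals you have seen and signals you have checked and found absent. The `classify` function, the hypothesis names, and the symptom strings are hypothetical.

```python
def classify(predicted: set[str], present: set[str], absent: set[str]) -> str:
    """Confirm or disqualify one diagnosis by matching signals to symptoms."""
    if predicted & absent:
        # A symptom this diagnosis requires was checked and is not there.
        return "disqualified"
    if predicted <= present:
        # Every symptom this diagnosis predicts has been observed.
        return "supported"
    return "open"


# Hypothetical hypotheses and the symptoms each one would predict.
hypotheses = {
    "database": {"slow queries", "high db cpu"},
    "cdn caching": {"low cache hit rate", "high origin traffic"},
    "message queues": {"growing queue depth"},
}

present = {"low cache hit rate", "high origin traffic"}  # signals observed so far
absent = {"slow queries"}                                # checked and not found

results = {name: classify(symptoms, present, absent)
           for name, symptoms in hypotheses.items()}
print(results)
```

A "disqualified" result is the cue to set that branch aside and go back up to the breadth level; an "open" result means there is still something to go and look at.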

So what do we mean? What heuristics, when we're confirming or disconfirming, do we actually use? And he identified a couple of sub-heuristics. So we use a specific and past diagnosis of a problem that we've run into. And this can be in the system that we've worked with before or it can actually be other systems, experience from our careers. Or a general, but recent, diagnosis that comes to mind. So what does he mean by this? The first is a really painful incident memory. It's something that we remember for our entire career and we tell stories to each other and in the hallway track at conferences about these really painful incidents. Oftentimes, in the case we'll see, that's experience that we had maybe at the company we're at or maybe if we were hired to work on a particular technology. It's what makes us good in that technology, is we've got a lot of stories, painful stories, about using that particular technology. Or it's an incident that is still in your L1 cache. It's an incident that you worked with this particular system that's very recent and therefore, it's memorable, and it's front of mind, or maybe back of mind simmering in the background because it's a situation that you've seen before.

So let's talk a little bit about what the actual problem was with this particular incident. So the page load time increase was caused by CDN cache misses due to an HTTP 400 status in a particular API from a closed store referenced by a blog post in the sidebar. So let's dissect what this means. Etsy has this idea of stores and the stores can be open or closed. If the person stops selling items or if they don't pay their Etsy bill or whatever it might be, they can close a particular store. And it turned out that an employee that had written a blog post about something totally unrelated happened to have a closed store. And the API then was querying for this information to populate, I think the byline of the blog post, to populate that about the users’, the employee's, store, except it was closed.

And because that endpoint was returning a 400, the CDN was then not caching the result of the page which means every single user that hit that page during this very heavy Christmas time shopping season was getting the page rendered completely through the system, as opposed to having the CDN copy. So that was causing the particular slowdown. And you'll see on the side there, that was the blog module there was causing the problem.
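To make the mechanism concrete, here is a toy Python model of a cache in front of an origin server. It assumes the CDN stores only 200 responses; the talk does not describe the real CDN configuration, and `ToyCdn`, the origin functions, and the request count are hypothetical.

```python
class ToyCdn:
    """Toy cache that stores only successful (200) responses. An assumption."""

    def __init__(self, origin):
        self.origin = origin   # callable: path -> (status, body)
        self.cache = {}
        self.origin_hits = 0

    def get(self, path):
        if path in self.cache:
            return self.cache[path]
        self.origin_hits += 1              # cache miss: render at the origin
        status, body = self.origin(path)
        if status == 200:
            self.cache[path] = (status, body)
        return status, body


def healthy_origin(path):
    return 200, f"rendered {path}"


def failing_origin(path):
    # Stands in for a response tainted by the 400 from the closed-store lookup.
    return 400, f"rendered {path} with an error"


healthy = ToyCdn(healthy_origin)
failing = ToyCdn(failing_origin)
for _ in range(1000):
    healthy.get("/home")
    failing.get("/home")

print(healthy.origin_hits, failing.origin_hits)
```

With the healthy origin the page is rendered once and then served from cache; with the failing one, every request goes all the way through the system, which is the slowdown described above.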

So what Allspaw did to figure out how the engineers debugged this particular problem - and one thing I actually want to point out from that last slide - that's an example of there was a change in the system but it wasn't readily apparent. There wasn't a code deploy. There wasn't any infrastructure change, nothing like that. It turned out there was a change, but it wasn't readily obvious to the engineers that posting a blog post could cause the site to go down.

So he used a method called process tracing and in particular, he was looking at critical relayed observations. Those are in circles and then a stated hypothesis. So, in other words, the engineers are looking at something and they see something. So that's an observation and notably, they have to tell someone else. They have to actually talk about it in IRC or do something. And then, the square was a hypothesis. These are statements like, "I think it's the database. I think it's caching. I think …" whatever it might be.

So let's zoom in here a little bit and talk about Alice and Bob here. And this is timeline format. So left is earlier in the incident, and right is later in the incident. You'll notice that really early on in the incident, Bob says frozen shop. Is it a frozen shop? And this is something that not a lot of other people actually respond to with a yes or a no. It doesn't get a lot of, "Oh, yes. I bet it's that." But Bob persists with that throughout the debugging session, that he thinks it's a frozen shop, and eventually that's what it turned out to be, a closed or frozen shop.

Now, Alice, on the other hand, one of the first things that she looks at is varnish queuing. "I think it's our varnish queues are having problems." And then she followed up on that diagnosis and kind of chased it down. Now the question you ask yourself, and this is what Allspaw asked these engineers is, "Okay, why did you pick those particular things to search on and look into?" And what he found out is that it turns out that Bob, who had been an engineer, I think for two or three years at this point at Etsy, one of the first incidents that Bob ever worked within the first couple of months at Etsy involved a frozen shop error and it was a costly painful outage that Bob happened to remember. So Bob's tendency, as someone who knew about this particular system, looked at outages as "Oh, yes. Let's just check frozen shops. Is that the problem?" even if there wasn't necessarily anything that pointed to frozen shop. That was such an impactful incident in his career that started at Etsy or his time at Etsy that that's what he constantly went to. And by the way, it had been fruitful in the past. They had had problems continuing through his career over those past two to three years at Etsy.

When John went and asked Alice about, "Okay, why did you say varnish queuing?" There were a couple of interesting things. The first thing is the reason Alice said that, was because Etsy had been having problems with their varnish queues for the past six weeks. So in her mind, one of the things that she asked early on in all of the incidents that they were having is, "We've been having varnish queuing problems for a while. Let's check that." And really, it's a theme. We've all had that problem where maybe we switched databases or we bring in a new technology. And if there are problems with the system, we might say, "Oh, you know, we just switched to some new NoSQL database. Let's check that." This is an example of that.

One of the humorous things about when John went to interview Alice is, he sat down with Alice and he said, "Okay, I'm going to talk to you about an incident on December 4th where this happened and the page loaded slowly. Do you remember that incident?" And Alice said, "Yes, I do. That was when we had the varnish problem." This is so ingrained that in Alice's memory, even though it wasn't varnish, she thought this incident without any more information just than a date, kind of a rough time period, she thought it was related to the varnish caching. So this is sort of an example where sometimes these heuristics are really, really strong. It gets encoded in our memory in certain ways because these events are impactful or because the context that we have for this system where we might be having a problem with a certain sub component is in our memory.

Finally, he came up with a bonus heuristic, testing to fix. So I have a question for you. Raise your hand- when you do a deployment, as a rule, how many of you make sure your tests pass, so you're doing a test pass, before you do the deployment? All right, that's about 75% of the room. Okay. Now, imagine if you had an outage and you made a one-line change to a feature flag on a configuration file. You got it reviewed by someone on your team and there was pressure to get the site back up. How many of you would wait for all of the unit tests to complete before deploying? Okay, that's a lot less hands up. Why is that? That's interesting to think about. It should be the same, right?

Well, this is a heuristic. It's a heuristic that we all use and we just saw that. So John went and asked: okay, well, do engineers wait for automated tests to finish before deploying? One, on the left there, is "Never," and five, on the right, is "Always." And what's interesting is that it's a heuristic and especially, a context-sensitive one. It will shift given the situations that we're in and the pressures and trade-offs that we have to make. I'd like to point out, I like this one engineer. "I never wait for tests. Just deploy it. YOLO, all the time. Whatevz. It's cool."

So the point is we use these two systems of our brain all the time when we're working in chaotic, often incident-driven situations. And it's interesting to me the survey we just did. A lot of that heuristic is almost System 1. It's pretty easy for us to go from, "Yes. Of course, I would always test," to, "Well, maybe. I don't know. Somebody's breathing down my neck. I might do something different."

Shifting gears a little bit, how do you get better at detecting an incident is occurring? So I asked how do you detect an incident is occurring? How do you get better at detecting an incident is occurring? You monitor things better. It's still not a trick question. The person in the back is saying “Lisp”. No, no, it's just monitoring. But then let's ask ourselves, how do you get better at knowing what to do when an incident is occurring? How do we figure out how to make us better at this task that we're talking about?

Elements of Expertise

Well, let's talk about this concept of expertise. A couple of researchers named Hoffman and Klein split expertise into a number of descriptive categories and tried to define what expertise means so that then they could study it. Novices, journeymen, and experts were their three buckets of people that they were studying. And they found that experts use their knowledge base to recognize typicality in situations. So experts are able to understand when they approach a situation, "Is this typical? Are the strategies that I've used in the past going to suffice or is this way out of the realm of what I'm used to?" So the example they give, really good fire commanders when they roll up on a fire, they can sense, "Is this kind of going to play out? Is this a typical fire or is something weird going on here?" And you might see it if a chemical plant was on fire. It's not going to be your typical fire.

They have the ability to make fine-grained discriminations. So this is the ability, as they're debugging the problem or working a problem, they're able to, in real time, go, "This is not working. Let's make an adjustment here," and they know where to make it. And it's not like, "Let's shift direction entirely." It's fine-grained discriminations and then observe what happens.

This was actually one that was particularly interesting to me. They're able to use mental simulation. And what they meant was they're able to, based on their experience, understand how a situation got to this point. So they're able to say, "Okay. Well, what could have happened?" If I come up to an incident and something's going on, they're able to say, "Okay, there's probably like four ways that we got to this state we're in now." But also, they're able to figure out, what are the possible avenues going forward? What are the ways that this is going to play out? And they can simulate that in their mind and use these three things together to then make better decisions.

And they also use- though they described it as knowledge bases used to apply higher-level rules or higher-order rules. The point here, and this is a critical point, is they found that it's not that novices or journeymen do things differently. It's not that there's a different mechanism in our brains that makes novices and experts and journeymen do things differently. It's that experts have a larger knowledge base to draw from. And that's what makes them able to make better decisions because they have more experience and knowledge base to draw about making those typical, non-typical, fine-grained discriminations and changes in their patterns of behavior.

Interestingly, Hoffman and Klein said, "Experts see things that other people do not see." A person gains the ability to visualize how a situation developed and imagine how it's going to turn out and experts can see what is not there. So the example that they give, if you're kind of curious what does this mean, is a really good music tutor was tutoring, I think, a viola student, and was noticing that their notes were a little off. And that's actually pretty common. But what they noticed was that the student wasn't correcting the note on notes that were particularly long. And that was the bit of advice that they gave to the student, is that if the note is out of tune and you hear it and it's a long note, you actually need to reposition your fingers to correct that. And that's something the student wasn't doing and that's what made the tutor an expert in playing the viola.

Another example, kind of a contrived example they gave, is a new employee in a consulting firm and they're maybe working on proposals. And they might see that the last six proposals that they saw on the system, there was a bunch of frenzied activity at the end and it seems like, "Okay, what's going on? All of our proposals are kind of a train wreck." Well, somebody who's been in that system would point out to them, "You didn't see all the proposals that went totally right and went out the door and you didn't even notice they went so well.” And the reason I bring that as hypothetical up, and that's a hypothetical they gave in the research, is because there's a fun question. Does anyone post-mortem why the site is up? No, we don't, and that's an example of information that, you know, there's something that's happening there that's making it successful. What is that? Because maybe one of the incidents, one of the issues there is that the thing that makes it successful isn't there. Experts see what is not there.

So how can you promote this experience? How can you get better at building expertise? Ten thousand-hour rule. So people might have heard the 10,000-hour, 10-year rule. It was popularized by Malcolm Gladwell in "Outliers." But if you've heard some of the criticism of Malcolm Gladwell, you might say, "Okay, well, that's great. He takes scientific studies and then he tells good stories that make us all feel good and then, you know, whatever. But it's not a really good thorough reading of the topic." So I actually went back and pulled the study that he was referencing: "The Role of Deliberate Practice in the Acquisition of Expert Performance." This is a paper from 1993. And what's interesting is the researcher, after this research was popularized by Gladwell, actually wrote a paper saying, "This is what Gladwell got wrong and stop yelling at me because it's more nuanced than that." So this is a follow-up paper in 2013, "Why Expert Performance is Special and Cannot be Extrapolated from Studies of Performance in the General Population: A Response to Criticisms."

Expert Performance

So the point of this is, the original paper and then the follow-up make the point that it takes practice over 10,000 hours, and you need about 10 years of time to actually do that much practice. And they studied specific kinds of systems and that was notable. You can't take this and just apply it to any skill. We'll talk a little bit about that. They studied chess. They studied music. They studied sports, things like that that are particular types of systems. And what makes those systems special is that they're relatively stable systems. So chess, you can discover new things and people do, but the rules are pretty much the same. Same thing with sports, if you're playing a sport. Same thing with music. It's a relatively stable system.

And the reason that's important is because it has to be something that you can do deliberate effortful practice on. In fact, the research calls it deliberate practice. And you can't do that on every type of activity that you might think of. One of the examples that they give is if all of us went to a new city and we were trying to find our way in the city, at some level and at some point, we would get sufficiently good at finding our way in that city and now those skills would turn into System 1 thinking. If you live in SF, you're just going to go downtown and you don't really think, "How am I going to get there?" and you might come up with four different ways to get there and they're all roughly fine. I'm not going to spend 10 years or 10,000 hours practicing that particular skill.

So it has to be something that you can deliberately practice and something that you can keep in that System 2 domain, that you can actually concentrate on and pay attention to the movement of your fingers as you play the viola or the layout of chessboards over thousands of games so you can start to see patterns. So that's the other important part. We might call this flow in our domain. Well, it's flow in every domain. But if you've experienced that idea of flow, that's what they're talking about with deliberate practice.

So let's look at this expertise in other domains. Are people familiar with the "Miracle on the Hudson," that crash? Yes. So this was January 15th, 2009. About 135 seconds after takeoff, and you saw it right there, at just over 2,800 feet, a plane hit a flock of birds. Both engines were disabled and they had to turn back and there's a lot of chatter about should they land at various airports and, of course, if you know the story, they ended up in the Hudson. You'll see the aircraft turn there for the river there, on the Hudson River, on the right there. The captain was Chesley "Sully" Sullenberger and the first officer was Jeff Skiles, which I mention because we're going to talk about both of them and their activities. So one of the interesting things: the NTSB, of course, investigated this accident and they found a number of oddities in the captain's behavior. If the outcome had not been as good as it was (there were minor injuries, but everybody survived), those questions may have had a different tone to them. But they found a number of interesting things.

The first thing they found out was that the pilot, Sully, turned on the auxiliary power unit almost immediately after he lost both engines. Now, that's odd because turning on the APU is something like number 15 in the emergency checklist. So there's no reason he should have done it other than his thousands of hours of experience in that particular aircraft. Now, the Airbus A320 is a highly technically advanced aircraft and it uses electricity for basically everything, including moving the flight surfaces. What generates the electricity in the air is the engines. If you don't have engines, you need power somehow. That's why he turned on the APU first thing. If he had not done that, he would have lost a bunch of instruments and he actually would not have been able to move the flight surfaces to control the aircraft. So that's interesting.

He pretty immediately took control of the aircraft from his co-pilot. You might think, "Well, okay. It's an ego thing," or "He's responsible. So it's a responsibility thing." And when they asked him about it, he said, "No, it has nothing to do with that." It turns out that he at that point had about 4,700 hours of pilot time in the A320. His co-pilot? Literally his first flight in the A320. Now, the co-pilot had 15,000 hours of flight experience but had just come off the check ride for this particular type of aircraft, and Sully knew that to pass the check ride, you have to know basically the checklists by heart. So the co-pilot was able to pull the checklist out and run it because he had literally just taken that test. And so he was better equipped to do that, while the more experienced pilot, who had more time in that type of aircraft, was able to manage the weird quirks that you're only going to know when you've flown that aircraft for 4,500 or 4,800 hours.

So this is what expertise looks like in aviation. This is typicality and non-typicality. These are the fine-grained discriminations, and the inputs that he put into the aircraft, which you can actually see if you look at the flight data recorder, are all examples of expertise. Oh, and one last thing: I'm a pilot. So you're always taught, if you lose an engine, try to make it back to the airport. He actually made a decision to not land back at LaGuardia, which seemed odd at the time, and I'm sure the insurance company for the plane and the people asked him why. But it turns out, in every model simulation they've done, he would have crashed into buildings in New York or New Jersey. He would not have made an airport. And again, it turns out Sully was a glider pilot. So he knew how long he had in that particular aircraft.

So this is what expertise looks like in aviation. Of course, we all know what expertise looks like in the tech industry: 10,000 hours' worth of experience with it maybe not being DNS. So you might ask yourself, "Okay, well, technology, is that a stable system? Technology is not like chess, is it?" And what's interesting about that is that the things that we deal with in technology are stable enough-ish that we can keep learning and keep interested in our own learning to keep that in System 2. So I'll give you a great example. I'm a vi person, don't hate me, and I have enough experience with Emacs to know how to exit Emacs. And I'm sure it's the same thing with vi users. But the point is, I don't need 10,000 hours of Emacs experience to know how to exit Emacs. That's in my brain and I'm done.

So there are certain skills in technology that lend themselves to that type of thinking, like way-finding around a city: once we get to proficiency, we're not going to spend 10,000 hours learning them. There are other skills, like incident response or the nuances of network protocols or, if you're working on something like the JavaScript VM, how code actually gets translated into machine code. That's something you can spend deliberate effortful practice on for 10,000 hours, and that's how you can become an expert in our industry in whatever that particular technology is.

Transforming Experience into Expertise

So how can we transform this experience into expertise? Well, there are four ways to do it. Personal experiences: this is the opportunity to be continually challenged. The important point about this is what I said earlier. You have to be continually challenged. You can't just be doing the same thing over and over for 10,000 hours and expect it to be useful. It won't be. You have to be challenged in those 10,000 hours.

Directed experiences: this is receiving tutoring so as to be able to tutor. This is a critically important point. The tutoring is not about receiving the knowledge. It's about the experience that you need to be a tutor to other people. That's how you become an expert. It's not just the getting. It's the giving the context and the tutoring to someone else, passing it on.

Manufactured experiences: so this is training simulation. We put pilots in simulators, that kind of thing. And then vicarious experiences: these are painful memorable events that we craft into stories, again, to tell others, the hallway track that we tell in interviews or we tell people over drinks.

So in our industry, on-call is a great example of that. If you're constantly challenged in on-call, that will make you an expert on the systems that you're debugging when you're on-call. Directed experiences: training, code review, pair programming, wikis and runbooks. I want to call this out. This is why code review is super important. We've all probably gotten a bad code review and we've gotten a good one. And the point here is that that's a skill, and the whole reason that we should be training engineers and working with them to get good code reviews is so that they can give good code reviews to the people that they work with after us.

Chaos engineering and game days: there's a lot of discussion in the industry about those particular topics. They're very front of mind. That's manufactured experience. And then of course, vicarious experiences. I remember this one incident where it was DNS because it always is. I want to say one thing about the vicarious experiences. There's a great quote from Dr. Richard Cook who is a researcher in this area. And he says, "Success comes from experience. Experience comes from failure."

The Rasmussen Triangle

The second way you can build expertise is exploring discretionary spaces. So I want to talk a little bit about the Rasmussen Triangle. Rasmussen was a researcher in the safety sciences who cut his teeth on Three Mile Island where there were consequences, pretty serious ones. And so he came up with a number of models over the years that are pretty fascinating and they stand the test of time because I think he came up with this in the mid '90s.

And so let's go through what's on this triangle. So we have the boundary of economic failure up there on the right. That is, if we cross that boundary in a system, we won't do it because it'll be too expensive. Nobody will pay for it. And that lower boundary there, unacceptable workload, that's the idea that humans are lazy. I don't mean that as an insult. We are evolutionarily hardwired to get the most amount of work done for every gram of sugar that we burn in our brain. So we're always going to be finding shortcuts, workarounds, ways to make us more efficient.

And then finally, over there, we have the boundary of functionally acceptable performance and acceptable risk. That's a long-winded way of saying when we cross that boundary, bad things happen. So in this model, Rasmussen talked about something called pressure gradients, and these are things that push the system that we're in towards a particular boundary or away from a particular boundary. And so pushing away from the economic boundary, we've got the business' need for cheaper, better, faster. That's where we're pushing. We've got maximum work for least effort. So that's us working in the system. We want to be more efficient. Where are all of us in the system pushing it? Yes.

So in his model, there was kind of this idea of a dotted line as well. And then they later added this idea of calling this the discretionary space. And the point here is I work with a lot of companies that work on incidents and that sort of thing, or struggle with incidents. And they always ask me like, "How does Netflix do it? How does Amazon do it? How does Google do it? They never have any outages." And the point that I make to them is, it's not that they don't have outages or incidents, it's that they have them, and as customers, we don't see them. And the reason that we don't see them is because they get really good at exploring that discretionary space. They've found ways to figure out where that boundary is or figure out where that dotted line is. And when they get close to those boundaries, because, by the way, these things are always moving, they're able to slow down and go, "Wait, we need to be very deliberate about what's going on here and we need to slow down and not push the system over that limit where we have a boom."

So this is directly out of the Site Reliability Engineering book. I kind of like that they have this Maslow's SRE hierarchy of needs. But the point that I want to make is that monitoring, incident response, and post-mortem or root cause analysis are very low in the stack. They're something that we all need to be doing. Why? Because they're a way for all of us to gain expertise in the systems that we're operating. That's why this is important. That's what allows capacity planning and development, and product, and testing, and release and all that other stuff to work as well as it does at these organizations that we all look up to or try to figure out what they're doing.

So when Etsy looks at their post-mortem process, and that's, again, where John worked, where Allspaw worked, they ask two questions at the end of a debriefing, which is what they call it. Did at least one person learn one thing that will affect how they work in the future? And did at least half of the attendees say they would attend another debrief? That second question is super important. If those debriefs are blameful and finger-pointing and just not fun, people won't want to go to them. So that's important to them. And the other thing that they care about is that just one person has to learn one thing that they'll take back to their job and do differently on a daily basis. In other words, it's not about remediation items for them. They don't really even care about them. It's about, did one or more people learn something?

So how do you get better at knowing what to do when an incident is occurring? That's the question that I started this section with. Well, the answer is simple. Create space and experiences to facilitate the cultivation of ourselves and our team so as to improve our collective heuristics at detecting weak signals and ambiguity in the complex socio-technical systems we operate and in which we exist. Or, practice makes better. It does not make perfect. It makes better. Expertise takes time and space. So you need to create that time and space if you want expertise. That's how this works.

Finally, amid the chaos of the systems that we work in day in and day out, it's just us out here. Now, this picture was on the opening slide and some of you might have recognized it. It's from Voyager 1. It was taken in 1990. And it's often called the "Pale Blue Dot" photo, because Carl Sagan actually told NASA, "Turn Voyager around and take a picture." And he said, "Look again at the dot. That's here. That's home. That's us. All of human history, everyone you've ever known, everyone that has ever done something you ever knew about has been on that dot." So we need to be good to each other on our journey to expertise because it is just us out here amid that chaos.

So a couple last thoughts. Research, the bibliography is in the slides. I'm not obviously going to go through this. But if you want to see where all the stuff comes from, it's there. Also, if you find this area of human factors and system safety and these thoughts swirling around interesting, as applied to our industry, that is my thesis. You can ask me what the title means if you want. I'll be around later. But it's going to be published by the end of the year, for sure. And so if you're interested, you can ping me on Twitter about it. I'll send you a copy. That's all I got.

Questions and Answers

Woman 1: Just wondering if there are any lessons to be learned from medical diagnoses?

Reed: So it's interesting. There are a lot of lessons there. So a couple of my classmates, one was an orthopedic trauma surgeon and one was an OB/GYN. And so the context there is a lot different because I will come in with incidents about tech companies and say, "Oh, you know, people couldn't get their cat GIFs for 20 minutes and that cost us, you know, $15,000." And they'll be like, "Yes, I gave the wrong medication to a baby and they died," so the tenor of the conversations is different. That said, I think one of the biggest things about healthcare that we can learn from is that they have a rich tradition of running people into the ground in their residencies, where they're on call for 24 hours at a time and that sort of thing. And there's a lot of research showing that doesn't actually work.

And so I think it's worth looking at some of the practices that they're trying to change, and you'll actually see it with aviation, air traffic control, healthcare; there's a lot of shifting mentality and ideas around that, which we are still behind on. We should watch those industries for those changes so that we can pick those ideas up faster.


Recorded at:

Jan 03, 2019

J. Paul Reed
