





J. Paul Reed on Healthy Postmortems, Complex Systems and Resilience

   

1. Hello. I'm Manuel Pais and I'm here at Agile Conf 2015 with J. Paul Reed, also known as the Sober Build Engineer. Thanks for accepting our invitation, Paul. Can you briefly introduce yourself to our audience?

As you mentioned, I go by @SoberBuildEngineer online. There's a funny story about that which is always kind of interesting to explain. I was working at a startup and there were a bunch of bottles of scotch on my desk that people had given me as bribes to do builds, because my background is release engineering. So a friend of mine who was a co-worker said, "Is there such a thing as a sober build engineer?"

And then there was a Facebook article about their release engineering team, and they actually have a full bar, so I just thought it was funny, this sort of dichotomy. But that's what I go by online. My background is release engineering. I moved into doing consulting in the DevOps space, and it's kind of funny how release engineering sort of morphed into this DevOpsy thing. So I work with clients from startups to Fortune 50s, trying to do the DevOps. That's what I do.

   

2. And your talk here at Agile 2015 focused on operational retrospectives, also known as post-mortems. What does a healthy post-mortem look like to you?

I think one of the biggest things that I focus on is that there's sort of a script you can go through with the post-mortem, and that's what I talked about in the talk, the structure of it. When you talk about healthy, you really want to look at outcomes from that process. A lot of times you see organizations and teams falling into the trap of what we call blameful post-mortems, where people want to point fingers. They want to direct the blame in a certain way. Those tend to be less useful; the outcomes from that process tend to be less useful.

When we talk about what healthy looks like, it actually looks different in different organizations with different people, but the outcomes are really what you can look at, and they're focused on how they can help the organization improve whatever it's doing, which will be different things for different organizations. It looks different, but again, outcomes are what you can look at as the great equalizer.

   

3. Do you think post-mortems should have concrete objectives, like finding the root cause of an incident, or are they better as open meetings just to discuss and cover as many issues as possible?

It's funny. I've worked with a bunch of teams, coaching them on how to do productive, actionable retrospectives. It's interesting to me how often “root cause analysis” comes up. I speak to this in the talk: that's not a thing. If you look at the science, that's not a thing, but we focus on it so much. So I just find that sort of amusing. I actually think it's important to have a structure, and again, in the talk I talked a lot about constructing the timeline, walking through that, looking at the data, and then trying to figure out what the remediation is from that. And then we actually talk about what makes that remediation actionable.

A lot of teams do this. In fact, when I was younger and at a startup, we used to do retrospectives that were very free form. I used to call them popcorn retrospectives because we would literally get the team together and just eat popcorn, and somebody would do a presentation where they collected all the feedback.

What was interesting about that free form was that it was actually very useful for the team, but it didn't necessarily result in actionable feedback. So it was this kind of stress-release thing, but it didn't actually improve the next release, which was kind of interesting to me.

   

4. More like kind of a status meeting?

Honestly, it was kind of just a complaint session. The thing was, it actually was useful, and that's what you find: people are human and they need to get that stuff out. So it was useful in that regard, but it wasn't actually useful in terms of outcomes, things we could do to make the next one better.

When I work with teams, since that was a more free-form meeting, we try to put a little more structure in place so that people can still have that release of whatever bad stuff happened during the [software] release or during an outage. As a side note, that's actually interesting with the Agile community, because they're used to retrospectives from an end-of-sprint perspective, and this is not that. This is the website went down, the database went down; it's an operational retrospective. I had to point out to the audience that I'm not talking about a sprint retrospective. I would imagine those are a little more free form. But here we are actually trying to come up with remediation items that are actionable, that the team can do, that are going to make their lives better, and that's a little different than a standard sprint retrospective.

So to answer your question, it's long-winded, but I think the structure is actually useful. If you don't have that, it's easy to fall into just people blabbing.

   

5. It's interesting you mentioned sprint retrospectives. That was a question I had as well. Do you think it would be beneficial to apply a post-mortem style of retrospective during development, for example if there was a major bug, to try to sort out where it came from? And vice versa, could ops teams benefit from doing not only post-mortems but also sprint-style retrospectives?

It's an interesting question. I think the pushback you would get is that there is always this trade-off between how much time we spend investigating a situation versus doing the job that we have to do. In the talk I actually talk a lot about aviation, and I give an example with a private pilot and a small plane where the FAA basically said, "Well, the pilot screwed up." That was the outcome of the report. So it wasn't this thing that we imagine on TV where there's a whole team of people investigating. The scope of the investigation was very limited.

The thing you were asking about was in the context of, say, a big bug. I think time and again you see people talking about how much cheaper it is to fix that upstream, as early as possible. If you think of the system like an immune system, it doesn't trigger the "we need to respond to this" reaction, because you fixed the bug way up front, right? So we tend to address that with Agile practices, extreme programming, software-as-a-craft practices. That's how we address things like big bugs. We don't do a formal retrospective for those.

That makes sense, because we're trying to address it up front where it's cheaper. Now, what's interesting is when that bug gets all the way through to production and then blows up; that's where you want to understand more about the system and what allowed that to happen. But I don't think you would see that earlier in the sprint. We're at the [Agile] conference, where they're talking about how we can catch that earlier. So that totally makes sense.

Now, you were asking about operations teams. That's an interesting question. Operations teams in general tend to be very interrupt-driven. And you were talking about a sprint context, right? It's often hard for operations teams. You see them trying to figure out what a sprint would look like for them because, for instance, if the site goes down, you can't say, "Well, we'll get to that in our next sprint."

It's interesting to me because, again, my background is release engineering, so we straddled both worlds: we would be doing software projects that would fit into a sprint, but we'd also be very operational, very interrupt-driven. So I think when you see operations teams looking at this process, they are looking at it from an operational standpoint, not so much a development standpoint. If they're doing configuration management or whatever, they're not going to be looking at it from that perspective. Again, I think it's an important distinction to make. It's a totally different process.

One point I will also make, which we didn't get to in the talk because you could spend a whole day on it, and in fact there are workshops on this, is incident response: how do you respond to an incident. How you respond to the incident directly feeds into the retrospective process. That is hard to think about when you're not actually responding, when it's just your daily work in a sprint, developing some configuration management code or something like that.

   

6. I guess my question was in terms of ops teams maybe doing regular retrospectives, not within sprints but just on a regular schedule, to look at things like "what does our infrastructure look like now?" What are the pain points, or how can we improve our workflow?

Right. So one of the last slides in the talk is food for thought, and I talked a little bit about mini-mortems and pre-mortems. The concept is that if you're going to do something like a database switchover or some big platform shift, you're planning for it; it's part of the normal work you're talking about, that you might do in some sort of sprint. Getting together as a team, gaming out what it's going to look like, and going through all of the things that you might run into, that's what they call a pre-mortem, and that can actually be very useful.

What's interesting about this is the research. They've studied this with other professions, like surgeons and pilots, roles where you have to get into a certain mode. They talk about surgeons spending all this time scrubbing up, we see it in TV shows, having conversations while they scrub up and get ready to do the surgery. The research shows that the brain actually switches modes when they go through that ceremony, if you will.

So the pre-mortem is really about that: bringing that ceremony to a big project that is not an outage but is part of your daily work. Mini-mortems, which I actually didn't get to talk about, are kind of funny. They're the idea that if you are in an environment where maybe you have blameful post-mortems, where teams like to point fingers and things like that, you can still constantly improve your work in very small cycles. If there's a mistake that no one would notice, you can still talk about it with your team, or with someone on your team.

When I was looking at improvement and retrospectives with teams, there was one small team in an organization that had a culture of blame, and they wanted to improve within their team. So they were doing these mini-mortems where, say, they dropped the wrong table but they had a hot standby database and were able to redirect to it with no outage at all; they would talk about that in a less formal way [compared to a post-mortem] but still with a little bit of structure, to improve their day-to-day work.

You do see that, and it's actually fascinating to me because it's a team that wants to improve within a larger organization that doesn't necessarily value that, and they're able to do it on this very small scale and get some big wins doing it.

   

7. I'd like to go back to the issue of root cause analysis. Like you said in your talk, you think it's not a real thing. Can you explain this a bit?

A lot of times we talk about RCAs (root cause analyses). It's part of a lot of processes that you see in lots of different fields, but in IT operations you see people talking about root cause. A lot of the work that I do, and the lens I look at the world through, involves complexity, complexity theory and that sort of thing. The thing about root cause is that there's this assumption of linearity.

The example I give is this: a lot of times people will have an outage. They will do a root cause analysis. They will come up with the solution, and it is the solution for the problem that they had. They will fix it, and then two months later, six months later, a year later, they will have a very similar outage again. You'll see leaders in the organization, executives, ask in frustration, "Why? I thought you fixed that."

So the reason that root cause analysis is a less useful way to think about the problem is that it gives you this sense of security that actually doesn’t exist because of our complex systems that we work in. In fact, actually I spent more of the talk than I thought I would on complexity theory because I think it's really important to understand that we work in complex systems.

A lot of the things that we talk about relate to this. In fact, I tweeted this at Agile [conference]: somebody was talking about risk analysis. People do this analysis, root cause analysis, risk analysis, and they think that they have data that is actionable. The risk analysis was in the context of security, and so they do the analysis and they still get hacked. People at the forefront of this area are looking at the complexity part of it. They're realizing that if you do root cause, there's nothing inherently wrong with it; it's just that when we say it, we think we've solved the problem, and maybe we haven't. In fact, in a lot of cases, we might not have solved it.

That's the part where I talked about the five whys and the domino theory. They're very compelling ways to think about accidents, but they may not get us what we want, and that is the major point.

Manuel: What properties do complex systems have that make it hard to, like you said, do a risk analysis that actually...

Solves the problem? Yeah, the main one is that they're hard to map. We often think we can map them out and then analyze what we have seen. I talked a little bit about the Higgs boson and the Heisenberg uncertainty principle, the idea that you can either map where a particle is or map the vector it's on, but you can't have both. Actually, I sort of simplified the principle; it's really a trade-off. The more precisely you can locate the particle, the less you know about its momentum.
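(For reference, the trade-off he is simplifying here is usually written as the Heisenberg uncertainty relation:

```latex
\Delta x \, \Delta p \;\ge\; \frac{\hbar}{2}
```

that is, the product of the uncertainties in position and momentum has a fixed lower bound, so pinning one down more precisely forces the other to grow.)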

Those are the systems that we work in. A great example is if we have an outage and then we try to do a root cause analysis maybe a week later or two weeks later, our system could have changed very easily. We're committing code all the time. We're making operational changes. I think that part makes it interesting.

The other thing is that complex systems are emergent; they have emergent properties. So I talk about how success of the system and failure in the system are emergent properties. That's what makes them difficult: there are these properties that emerge from the interactions. A lot of times the ones that bite us are the ones where we don't even know things are interacting in that way.

We see this a lot. The great example is when the US-East-1 Availability Zone at Amazon went out and all of your favorite websites went down. That's a great example of complexity in action. It's a quintessential cloud story of what not to do, but people were moving to the cloud and they had no idea that it was going to bite them in that way. A great example of an emergent property of a complex system.

   

8. [...] That leads me to the question of how much effort is it worth investing in actually trying to prevent failures as opposed to improving how you handle them and how fast?

Manuel's full question: That's a great example because actually I had a question on how Netflix was affected by that outage as well and they're well known for investing a lot in the resiliency of their systems, with the Chaos Monkey that kills random production instances. Even for them, that came as a surprise, something they didn’t think of. So that leads me to the question of how much effort is it worth investing in actually trying to prevent failures as opposed to improving how you handle them and how fast?

That's always a good question. I think there is a lot of focus right now on how you prevent failure, this idea of defense in depth, that we defend our system against failure. I think that's really important; especially in the security space, it's super important. What's interesting about the security case (I went to a talk on security, so that's why it's front of mind for me) is that it has also recently become a big thing in DevOps. A lot of that stuff is really simple to do, like just patching things. A lot of times you'll find in your infrastructure that you've got old, buggy, known-vulnerable versions of things.

I think that's important, but I actually think Netflix is a great example. They are really good at what they do in terms of being resilient because they've invested in resilience. They basically told all of their teams that you need to have fairly decoupled, resilient systems. Instead of defense in depth, they've taken the attitude of "we know it's going to be chaotic; we know that if we move to the cloud we don't own that infrastructure, so weird stuff is going to happen." Amazon may shut down an instance because they need to reboot it or whatever. So they've invested in resilience.

I think it's one of those things where it's a path and it is a continuum, but I think the organizations - Netflix is again a great example that really hit it out of the park - have realized the reason that you invest in all of that stuff that we're talking about is resilience. It gets you to that promised land of being able to do things like Chaos Monkey. It's always great when people bring up Chaos Monkey because if you've ever talked to an audience that is not familiar with it, they're like, "It does what?" There's this look on their face sort of like...

Manuel: Scared?

Exactly! They don't understand: why would you purposely shoot your infrastructure in the head and see what happens?

I actually was working with a team and I asked them about that. They were banking related and they had a bunch of infrastructure. They said "Oh, we've got networks with active-active or active-passive. If you test this, we have backups and blah, blah, blah." I said, "Do you test them?" And they said, "Oh, yeah. We schedule outages where we'll go in and we'll shut down certain things.”

What's interesting about that is I asked, "Well, would you let me go in and just unplug cables? Would you be okay with that?" They're like, "Oh, no, no, no, no." So when teams prep their failure modes, they test certain things that they know. Telling your ops team, "We're going to turn off these certain things, so be ready for that," is like cheating on the test. And that's where you see Netflix saying, "No, we've got an automated tool that's just going to kill things. You'd better be able to deal with it."

What's funny to me about that is that at Netflix (it turns out I have a bunch of friends at Netflix, and we actually have a podcast co-host who is at Netflix), you can opt out of Chaos Monkey, but only for a certain amount of time. So if you're developing a new service that's going to go to the cloud, that's going to be in their production infrastructure, you can opt out. You can say, "Actually, don't terminate this, because it's not Netflix-resilient yet." But after some number of months in production, you can't opt out of it anymore.
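To make that idea concrete, here is a minimal sketch in Python. It is purely illustrative and hypothetical, not Netflix's actual tooling or configuration: a chaos harness that honors an opt-out only while a service is still inside a grace period after launch, and after that treats every instance as fair game.

```python
import random
from datetime import datetime, timedelta

# Hypothetical service registry; the names, fields, and the 90-day
# grace period are all made up for illustration.
GRACE_PERIOD = timedelta(days=90)

SERVICES = {
    "new-recommendations": {
        "launched": datetime(2015, 8, 1),
        "opted_out": True,           # new service, still allowed to opt out
        "instances": ["i-0a1", "i-0a2", "i-0a3"],
    },
    "playback": {
        "launched": datetime(2014, 1, 15),
        "opted_out": True,           # too old: the opt-out no longer applies
        "instances": ["i-0b1", "i-0b2"],
    },
}

def eligible_instances(now=None):
    """Return the instances the chaos tool is allowed to terminate."""
    now = now or datetime.utcnow()
    targets = []
    for name, svc in SERVICES.items():
        within_grace = now - svc["launched"] < GRACE_PERIOD
        if svc["opted_out"] and within_grace:
            continue  # opt-out honored only during the grace period
        targets.extend(svc["instances"])
    return targets

def unleash():
    """Pick one eligible instance at random and 'terminate' it."""
    targets = eligible_instances()
    if targets:
        victim = random.choice(targets)
        print(f"terminating {victim}")  # in reality: a cloud API call

if __name__ == "__main__":
    unleash()
```

The point of the sketch is just the shape of the policy: the exemption is time-boxed, so "not resilient yet" can never become a permanent state.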

From a team perspective, the way they handle it is that your team supports and runs the thing in production. So if Chaos Monkey turns it off and some big thing happens, it comes directly back to your team. What's interesting about that is that they have this sense of complexity. They're building that resilience way up front; they're thinking about it way up front. They're not thinking, "Well, we need to have the triple redundancy that we maybe test at the back end."

One great example of this that I think everybody will know is when you see big outages at data centers that have to do with power. Well, they all have the big UPS batteries and all that kind of stuff, but I've seen outages where they were testing it and it takes out a set of racks because they just don’t treat it as a resilient thing. It's almost this "We have this backup but don’t touch it" sort of weird thing. We saw this actually with [hurricane] Sandy in New York where people had generators but they just didn’t work the way they expected. They were in the basement and it flooded, stuff like that.

I think when you go into it with a sense of resilience, that's where I think companies and organizations that get it are moving. But solving it the way that we do, investing in detection and defense against failure, that's how you get there. Even if you do the resilient thing, you have to do the other thing. So it's kind of a path to that.

The one last thing I'd say about that is the funny thing: oftentimes you have to defend expending that effort on preventing failure, and that's where you have to get the organization to the realization that the reason you do it is that that's what makes you resilient. If you solve it that way, that's how you become resilient; you're not doing it just to prevent the failure. And again, the teams you see really knocking it out of the park are really good at incident response and resilience. An emergent property of that is that they have less failure, or less observable failure: their customers don't notice.

   

9. Going back to the post-mortems, who do you think should participate in them and are there any particular skills or mindset to make it work well?

We talked a little bit about this in the talk. Everybody involved should obviously participate. I think that's actually a requirement: if you don't have everyone who was involved in the incident in the room, it's just a recipe for weirdness.

I did talk about the idea that, in medium to larger organizations, you might have your own NTSB [National Transportation Safety Board], your own impartial party that facilitates the post-mortems. I think that can be very useful in helping teams get better and improve. Also, because they're not involved, they can be a little more dispassionate about what's going on, which I think can actually be very important.

When you talk about skills... I did a small digression in the talk about this term you hear a lot, blameless post-mortems. I hate that term. I absolutely hate it. I actually hate both words; those words together just don't make sense to me. I mean, they do, but I talk about why they don't. The one I want to point out here is "blameless".

Humans, through millions of years of evolution, are psychologically wired to blame. I talk about a researcher, Brené Brown at the University of Houston, who has done a bunch of research on this. She says that blame is a way for us to discharge pain and discomfort. So when we go into this process and say it's going to be blameless, it's like saying we're going to go into this process and pretend we don't have arms. We can all see that everybody has arms. We're just going to pretend, and kind of wave these hands that aren't connected to anything, because we don't have arms, right?

What I'd like to do instead is acknowledge that, as humans, blame is part of the way we deal with unpleasant situations. I work with teams, coaching them on how to get through that process: acknowledging the blame but then trying not to direct it at a person. We kind of put it to the side, and everybody is sort of, "Okay, yes, you may have messed up, but we got past that."

So you're asking about skills: I think self-awareness, at the personal level and then at the team level, is very important. When people can sense "I'm getting frustrated, I want to blame that person for that thing," that is a tool. It's useful in a lot of contexts, but it's very useful in this one because it's a very interpersonal process. It can be a very sensitive process, a very vulnerable process.

A lot of times it's actually very interesting. You'll see very blameful cultures, where people on that end of the spectrum are like, "I don't want to take any responsibility." But I've seen the other side too, and sometimes you have to get people to the middle. Sometimes they're like, "I mistyped the thing. I'm sorry. It was all my fault. I screwed up." That can actually be just as limiting as a blame culture, because it doesn't get you to realizations like the reason you typed the thing incorrectly. Yes, you did that, and maybe that's your fault if we're talking about it that way. But you were on call for 50 hours, or for three weeks straight, so you were tired. That's an organizational problem, not a people problem.

So that's one thing you have to be aware of too, where people will take the blame on themselves. You've got blameful cultures and you've got very proud cultures, right? Ops people are very proud of their jobs, so they take that stuff very seriously, and I identify with that, but a lot of times they will take the blame or the responsibility so completely that it doesn't actually get to the problem in the system.

The last thing I would point out, and it's the reason I do the coaching, is that I think it's useful to have someone who may not be involved. I mentioned this in the talk: a lot of times, if you have someone who doesn't know about the database and doesn't really know your application, they're going to ask "stupid questions". The team, steeped in all that context, often can't ask those kinds of prompting questions themselves, and those questions are what make their work lives better and, again, more resilient.

   

10. You also mentioned in the talk that there should be accountability but not blame?

Right. So it was the accountability-responsibility difference. A lot of times in English we assume those are the same thing, accountability and responsibility, and they're not. I actually didn't give this example [in the talk], but a great way to understand it is this: in the military, the enlisted people doing the work are held accountable; they need to be able to explain their behavior and what they were doing, but they may not be responsible. That responsibility may lie with their commanding officer, right? So there is this difference.

That is one thing, actually, and it's a really good point, that you'll see a lot when you're talking about retrospectives: because it's humans and human factors in complex systems, a lot of times you have to slow down in a way that we're not used to. So we start thinking about things like cognitive load. I gave the example of having one person construct the timeline forward and [another person] do it backward, and seeing what matches and what doesn't match. You have to think about those things.
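As a toy illustration of that forward/backward exercise (a hypothetical sketch in Python, not a tool from the talk), you could line up the two reconstructions and flag events that only show up in one of them; the mismatches become the prompts for the group's discussion.

```python
# Hypothetical example: two facilitators reconstruct the same incident,
# one walking forward from the first alert, one walking backward from recovery.
forward = [
    "14:02 alert fires on checkout latency",
    "14:05 on-call pages the database team",
    "14:20 failover to the replica",
    "14:40 service recovered",
]
backward = [
    "14:40 service recovered",
    "14:20 failover to the replica",
    "14:11 config change rolled back",   # only remembered on the backward pass
    "14:02 alert fires on checkout latency",
]

def compare_timelines(forward, backward):
    """Print events captured by only one reconstruction.

    Each mismatch is a question for the room: what do we actually know,
    and what are we assuming happened?
    """
    f, b = set(forward), set(backward)
    for event in sorted(f - b):
        print("only in forward pass: ", event)
    for event in sorted(b - f):
        print("only in backward pass:", event)

compare_timelines(forward, backward)
```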

Language is another thing that's really important, so we talk about accountability versus responsibility. We also talk about finding a fact versus finding a judgment, which, if you're a lawyer, is going to be very obvious to you, but if you're an ops person, that difference may be less obvious. So I walk through that with teams a lot of the time, because when the team is really fact-finding, trying to figure out what happened, I don't want them to get distracted by judgment yet. A lot of times we see people jumping into that.

   

11. You also mentioned the cognitive biases that we have and that people should be aware of when they go into a post-mortem. Can you maybe give us some examples of how those biases influence our behavior?

I think one of the biggest ones that we see all the time is outcome bias. I said earlier that success is an emergent property of a complex system, as is failure, but we only really look at the failure. So a lot of teams that have been very successful doing retrospectives or post-mortems on their failures are starting to run that process on their successes as well, which is actually kind of interesting. If they have a big database switchover and everything goes fine, they'll do a retrospective on that success too.

You see this a lot with outcome bias, where we only look at the things that exploded, or maybe exploded to a certain level; in other words, the CTO found out about it through the reporting chain, so then we do the retrospective. Organizationally, that is a bias: we only pay attention to where we failed, and that can be interesting.

I talk a lot about hindsight bias and this concept of counterfactuals. A counterfactual is when you say something like "Why didn't you see?", "You should have", "You could have", "Why didn't you?". The weird thing about that, when you think about it, is that it's a time machine. We are going back and getting all up in someone's face asking, "Why didn't you do x?" when, at that point in time, it seemed totally reasonable to be doing whatever they were doing.

We see that a lot with hindsight bias. What is so freeing about that is, and I've mentioned the time machine a couple of times, when we do retrospectives, we actually are trying to get in a time machine. But the difference is that when we do it in a retrospective, we're trying to go back and be curious, be open and reconstruct that person's reality from their vantage point. We're not trying to go back and look at whatever they were doing at that point in time as an external party and say "Why didn’t you know?", right?

Then there's the fundamental attribution bias, the DevOps bias, I wish we'd start calling it that: the idea that because someone is an operations person, they don't want change, they want their systems to be stable. You might say developers have a bias where they're like, "Oh, you never want anything to change." And I've known some ops people that are always wanting to improve their systems. They were the first people to do Puppet and Chef and Ansible. They really have the mindset that you would expect a developer to have, but because they're operations people, we have a bias. I mentioned this in the talk; there's a slide with 30 biases up there. I went to Wikipedia, and there's been so much research on how human brains just aren't that good in certain ways. So we go through those.

You were asking earlier about skills in a retrospective. Honestly, being aware of your biases and being able to call them out, or even having the conversation in your head, "maybe the thing I'm about to say is due to a bias," is super useful. It's one of the most important things. In the talk you basically have to troll the audience a little bit, because you have to show them examples where they thought they knew what the answer was and they were just biased in some way, because they weren't expecting that [result].

And that happens a lot, actually, where people come into an incident and they're not expecting an incident. They're reacting, which is fine, that's totally cool. But then when they go back, they have to incorporate the fact that "I wasn't expecting there to be an outage, so my behavior might not be what I would have expected, because I was caught unaware."

We see this a lot with pilots, nuclear plants, stuff like that, where people are reacting to the best of their ability, and you have to take that into account in any retrospective process that you do.

   

12. When we talk about outcomes of post-mortems, often you end up with a long list of recommendations and improvements. How can an organization prioritize and do an effective follow-up of those long lists?

That is probably the hardest part. Again, I do a lot of training, and in talking with people during intake, before we do the coaching around retrospectives, they may say, "Well, we're doing retrospectives now," and it's like, "Well, where does that stuff go?" I make the joke about the write-once-read-never wiki. All of those findings go there.

The biggest advice that I can give about that is that you need to treat the remediations as experiments. So you think, okay, we're going to make a change and run an experiment to see whether it actually fixes the problem. Maybe it won't, or maybe we get data from that experiment so we can run another one that may [fix the problem]. If that's the mindset you use, you'll be much more successful.

I mean, there are two things: that, and, if you're going to do an experiment, again this is complexity theory at work, so probe-sense-respond: probe the system with the experiment, sense whether it's actually solving the problem, and then respond: do we keep doing that? Do we stop doing that?
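To make that framing concrete, here is a minimal sketch in Python; the structure and field names are hypothetical illustrations of the probe-sense-respond idea, not anything prescribed in the talk.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RemediationExperiment:
    """A post-mortem remediation treated as an experiment, not a final fix."""
    hypothesis: str              # what we think will improve, and why
    probe: str                   # the change we will actually make
    sense: str                   # the signal we will watch to judge it
    review_on: date              # when the team revisits the result
    observations: list = field(default_factory=list)

    def respond(self) -> str:
        """Decide, based on observations, whether to keep, stop, or iterate."""
        if not self.observations:
            return "no data yet: keep probing"
        # In practice this is a team conversation, not an if-statement.
        return "review the data: keep it, stop it, or design the next experiment"

exp = RemediationExperiment(
    hypothesis="On-call fatigue contributed to the mistyped command",
    probe="Rotate on-call weekly instead of monthly for one quarter",
    sense="Pages per person per week, and incidents tagged 'fatigue'",
    review_on=date(2015, 12, 1),
)
exp.observations.append("Week 1: pages per person dropped from 40 to 12")
print(exp.respond())
```

The point is simply that every remediation carries a hypothesis, a signal to watch, and a date when the team revisits it, instead of disappearing into the write-once-read-never wiki.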

The other thing, in the popcorn retrospectives I was talking about earlier, [is that] we'd come out with 20 or 30 things we should do. Do you think any of those got done? Maybe two of them. Maybe three. So you do need to decide as a team on the experiment to run. We're at Agile [conference], so it's a collaborative, kind of planning-poker process where you decide what experiment you want to run. But the biggest thing is that you treat it as an experiment; everybody knows it's an experiment, and then you revisit it. A lot of the teams I see and work with doing this actually come up with things that are more useful out of the results of the experiment. They may have a failure, run the experiment, find that it fixed that problem but surfaced four more problems, and then run another experiment.

In the end, it's really about becoming resilient, [becoming] a learning organization; you have to value those things. And if you value those things, then this whole cycle of running an experiment, learning, and running another experiment is actually very natural. But I think right now we're just starting to see, again with people like Netflix that actually live this way, the benefits of doing it that way.

Manuel: If all the outcomes of a post-mortem are just technical, would that...

You've missed something if they are all technical.

Manuel: Is that kind of an anti-pattern let's say?

Yeah. We often talk about socio-technical systems, and the socio part is the important part. This is the great example, the quintessential example: somebody typed "drop table" or "rm -rf" and the outcome from the post-mortem or the retrospective is "let's automate whatever cleanup process that was." That's totally fine, that's cool. But if we stop there, that's a technical solution, and we don't ask the questions: Why did that person type that? Were they burnt out? Were they on call? Do we need a separation of systems in some way? That goes into the area of human factors.

The great example I give there is automation. We talk a lot about automation; we think automation is going to save us, and, as a release engineer who's done a lot of manual processes, I'm a fan of automation. But going back even to the mid-'90s, you see aviation talking about automation, and automation has been a huge boon there. That's why we have two pilots in the front instead of three. But a lot of accidents have been caused by humans interacting with automation in ways that the humans didn't expect, that the creators of the automation didn't expect.

So they've learned this lesson. Air France 447, which was a big accident, was at its core an automation accident, a human factors accident. And of course there were a lot of other factors at play there too. But the point is that you see a lot of push towards automation in the IT operations space, and we need to think about that. Aviation learned those lessons, well, not all of them, but a lot of them. We need to look at what they've done and the lessons they've learned. So if it's only a technical solution, you have definitely missed something.

   

13. So would you say that, if you run post-mortems that lead not only to the actual technical solution of the problem but also dive into the social and human factors that caused it, that is a pattern of a healthy DevOps culture in an organization?

Well, certainly when we talk about what DevOps looks like in a healthy context, you're going to get tons of answers. I will say this: it's certainly more comprehensive. It will certainly serve your organization better if it has those components to it. We talk a lot about outcomes; you have a better outcome if it has those different components to it.

   

14. And speaking more broadly of DevOps, what other kinds of patterns do you think those organizations typically exhibit?

I said this before, but it certainly bears repeating: organizations that value learning and resilience as concepts and that don't rely [on risk analysis]. I don't mean to bag on risk analysis, but every single time it comes up, there's always some bad story associated with it where the risk analysis failed. Organizations that don't rely on risk analysis to make decisions.

This is the dumbest thing ever, but I have a friend who likes to say it's nice to be nice. Organizations like Netflix that hit it out of the park do the right thing because it's the right thing to do. They don't say, "Well, we should do a risk analysis and figure out whether the cost-benefit trade-off is whatever." I'm sure there are aspects of that; Netflix is one of those DevOps unicorns. But the point is that they look at their system through the lens of a learning organization and resilience. Those things are very important to them.

I gave the quote about the fact that Netflix spends more in the cloud on their monitoring (they actually call it their business insight platform) than they do on shipping us all movies and TV shows. So they look at it that way, and I think when they value learning and resilience, that is certainly a healthy DevOps pattern, because DevOps is the thing that doesn't like to be defined. If that's the case and it's always changing, then having an organization that is resilient in the face of change and wants to learn is certainly, I think, a component of a healthy DevOps pattern.

Manuel: Thank you very much.

Thank you.

Sep 10, 2015
