
The Trouble with Learning in Complex Systems


Summary

Jason Hand explores the challenges of learning in complex systems and the relationship between high-stakes and low-stakes learning opportunities, as well as the costs associated with each.

Bio

Jason Hand is Senior Cloud Advocate at Microsoft. He writes, presents, and coaches on the principles & nuance of DevOps, Site Reliability Engineering, and modern incident management practices. Named "DevOps Evangelist of the Year" in 2016, he recently authored a book on the topic of Site Reliability Engineering. He is a co-host of the podcast "Community Pulse," a show about building community in tech.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Hand: One of the things I want to talk about today, or the main theme for today, is that I hope I challenge some of the thinking that you've maybe been resting in for a long time about how we build, operate, and maintain our systems. A lot of the ideas I want to share today came from the travels that I've been doing over the past six or seven months. Back in September, I joined Microsoft; previously I was at a company called VictorOps, which, if you're not familiar with them, does incident management and on-call management, similar to PagerDuty if you're familiar with that service. I joined Microsoft, and one of the very first things that I was pulled into was what they call Microsoft Ignite The Tour, which is a global tour across 17 different cities around the world.

I was pulled into this project, which was actually very interesting, very eye-opening. Over the course of the past 6 months or so, I've traveled almost 92,000 miles, which ended up being a little bit over 3 and a half times around the world, which I never would have thought I would have done in my career. It blows my mind. I've spent a total of over seven days, just in the air alone during that time.

These stats aren't that important, but what is important is that everywhere I go and everybody I talk to, I find that I'm using language like complex systems, and there's not a common understanding of what that actually means. I think - I'm guilty of this - we've fallen into this trap of using words and terminology where not everyone's on the same page about what they actually mean. When I say complex systems, I think sometimes people just accept that at face value and don't actually dig into what it means when I say complex.

Complex Systems

That's a big part of what I want to talk about today, and especially what I want to start with is what do we mean when we're talking about complex systems? Not only that but why - when we're trying to focus on learning so much, we've heard about how it's important to be a learning organization - why are we struggling to actually learn? If you were here for Ryan's [Kitchens] talk previously, he touched on a lot of things that I'm going to also try to amplify a little bit more. It's very difficult for us to really find good methods to learn about our systems, to learn new ways to improve them, and continuously build them and make them better for the world, for everybody that we're trying to serve and trying to make better.

My name is Jason Hand. That's a really old photo of me, but you can tell the hair's gotten a little longer and grayer - same outfit, actually, minus the shoes. I came to Microsoft not too long ago. I love getting on Twitter and talking with you all, so if you do the tweets, follow me, let's be friends there and we can continue the conversation. Certainly, anything that you see or hear today, feel free to share that on Twitter. I'd love for others to be able to take advantage of the great content and ideas that we're sharing here today. If you're up to that, please do.

I want to start off with this idea of complex systems. Most of us, I think, have a general understanding of what that means or we have our own personal understanding of what that means. I'd like for just a moment to sort of split the term systems up into sort of a binary description.

There are systems in which we can understand cause and effect very simply, but there are also systems in which we simply cannot understand cause and effect. If we look at systems in general and split them up into binary terms like that, it then allows us to look at them as ordered systems and un-ordered systems. Let's try to keep it really simple and we'll just build from there.

We've got ordered systems and un-ordered systems. If we look at ordered systems, we can actually start to break them down even further. A lot of these ideas are adapted from the Cynefin framework. If you're familiar with Cynefin, it's a Welsh term; a gentleman named Dave Snowden was the originator of this idea. He split it up into five different realms, but I'm just going to talk about four. I've laid it out in a little bit different way so it makes more sense to me. There are four different ways you can split up those two binary descriptions of a system.

The first one is an obvious system. I think, for me, one of the easiest ways to describe or give you an example of an obvious system is a bicycle. It's a mechanical thing; even like a mechanical watch, we can break it down, we can take it apart, and we can put it right back together exactly how it was. We know how it works, you can analyze it in the moment, as it's being put together, as it's being pulled apart. You know that the chain connects to the linkage and connects to everything else, and the wheels turn, and all these different things. That's an obvious, also sometimes known as a simple, system.

Then there's a complicated system, and that's if we were to take that bicycle and add combustion to it. We add some more things to make it a little bit more complicated. A lot of our systems, or at least sub-components of our systems, can be what we would identify as complicated. As you've probably been told or have heard several times already today, there are also a lot of things happening in our systems that we just don't have any clue about. It's these unknown unknowns. We didn't even know these things could possibly happen. You're not going to see that in an obvious or a complicated system. They're understandable; they can be pulled apart and put back together because they sort of exist in a vacuum.

If we start to look at un-ordered systems, we take our obvious and complicated and expand on that. Let's take that same motorcycle, let's put it on a road, and let's add some more stuff, especially the people. Suddenly, we're looking at complex systems. This is where it gets a little bit tricky. This is where I realized, over those 92,000 miles, that this isn't common knowledge. I feel stupid for going so long and not stopping to have this explanation with people, because I'll just take off on some weird tangent and start talking about complex systems, and they're thinking over here in the obvious and complicated area, as though you could really understand your system all the way down to the silicon.

Then there's chaotic, which, I think, if you've spent any time outside, you understand it takes the complex to the next level.

The one thing I like to point out is that these aren't technical systems we're building. They're sociotechnical systems, there are people involved, there's us. We do the designing, we do the planning, we do the building, we do the maintaining and operating, we do the incident response. This is all about us, and the people part is the hard part. The technical stuff, we can figure it out and it's changing all the time. It's the people part that gets in the way.

One of the things that you've probably heard also is that when it comes to a complex system, really understanding it, breaking it down can only happen after the fact. You can't understand how a complex system will behave or will work ahead of time. You can have theories, you can have ideas, but until something actually happens, until the system is alive and doing its thing, you can't examine it and understand truly the reality of what's going on.

Human Body

One good example I like to bring up in terms of complex systems that I think most of us can understand or at least try to draw some parallels is the human body. I think it's interesting that we can look at the human body and we can see that we have brain specialists, and heart specialists, and lung specialists, and all these different specialists in the different parts of the complex system. They know those parts of the complex system very well, but then there's also the generalists who know a little bit more.

We can look at this in another way. We've got our application specialists, we have our developers, our front-end people, our mobile people, and we've got our infrastructure folks. We've got the ones who know Kubernetes backward and forward. There are always specialists, but they don't have that broad understanding of the system, the actual system, the whole thing. I think this is where it starts to break down with folks, because a lot of times we get really focused on what we know and what we're specialized at.

We start to operate in these silos, we start to operate in a box, and we don't see it from the 10,000-foot level or 50,000-foot level, and that's the part that we all have to start looking at a little bit closer. We have to step back and look at it a little bit differently. What we're always dealing with is a lack of perfect information. We will never, ever know 100% of what's going on in complex systems, and a big reason is that they're constantly changing. We've got people involved, and now we're in a world of continuous delivery. We're always making changes to the system, let alone our users in the system doing God knows what. They're very good at surprising us with what our system can and cannot do.

Imperfect Information

It's this imperfect information that I want to go to next. How many of you are musicians, or have an instrument that you've tried to learn - maybe you've got a guitar at home? A few words on the requirements of learning to play guitar: I'm a guitarist as well, and I can tell you, and I think any of you who raised your hand can agree, that when you go to learn an instrument, when you go to play guitar, you can't just read a book. You can't just have someone show you how to play guitar, you can't just listen to the music, and then suddenly be able to do it.

It takes time; it takes actually playing a good note, followed by another good note, followed by a really sour bad note, followed by more good notes and more sour notes. It takes a little bit of good and bad, it takes success and failure. It is with this success and failure that I want to point out that you have to have both of them. They're opposite sides of the same coin, and it's an absolute requirement when it comes to learning. You have to have setbacks, you have to have failure. Otherwise, you're really not learning anything, you're only seeing half of the story.

When we start to talk about the challenges of learning - if we accept, "Yes, I get it, that makes sense, I'm drinking your Kool-Aid here," that we have to have some failure and setbacks in order to learn - what kind of challenges are we faced with? I feel like there are really two that are the biggest ones, and those are the two I want to point out today.

The first thing I want to point out is that learning isn't always going to just present itself - or maybe it does, but you just don't notice it - the opportunities aren't always evenly distributed. Sometimes we have great opportunities to learn because there are lots of problems. Sometimes, as Ryan [Kitchens] was pointing out, things are just working. Everything seems to be fine, our systems are healthy, we haven't had any incidents, we haven't had a SEV2 or worse incident in months. Does that mean everything's ok? Does that mean that we shouldn't still be finding ways to learn about what our systems are doing, what our people are doing, and how they can be improved?

I started thinking about this, and what our daily activities look like, and what the things are that we're doing at a very constant, continuous rate. I started thinking about some of the high-level things here. This is a very informal poll that I had taken in terms of how often you are doing certain actions and certain activities throughout your day. Code commits are a pretty common thing that we're doing. Configuration changes happen fairly regularly. We do some feature releases. Then, of course, there is going to be some sort of service disruption; there's going to be an incident occasionally.

If you start to think about how often we're doing these things, we're doing code commits and configuration changes pretty regularly. Feature releases are starting to catch up, where we're releasing fairly often, maybe even multiple times a day. Incidents, especially for a lot of companies who either don't have a way of identifying incidents or don't have a way to respond to incidents, are just not something that they see or deal with very often.

In fact, I just learned the other day about a startup that's about to go from around 5 million users, which is already a pretty high number, to 250 million users, and they have zero monitoring in their systems at all. I'm going to turn off my phone on that day so that I'm not worried about them too much, or so that they don't blow up my phone. If you look at the consequences for some of these things, when things maybe don't go quite as well as you expect them to: when it comes to code commits, you can back out of that, you can fix that pretty easily. DNS seems to be the big problem, and a lot of configuration changes have made the news lately, but that's still something we do pretty often. Typically, when you make a configuration change, it doesn't necessarily take your whole system down. Feature releases - sometimes big releases can cause problems if those don't go well. I remember not too long ago when all of our feature releases were on Friday - P.S., don't do that, never do that - we would sit there from Friday afternoon until Sunday night, making sure that everything went well, because we pushed out on Friday because somebody thought that was the right thing to do at the time.

Incident response is another thing. If you don't get incident response right, you're going to have a bad day, and your company's going to have a bad day. Your customers are not going to be happy; they're going to bounce, they're going to go find somebody else.

The consequences are almost the exact opposite of the frequency that we're doing these things.

When we start to look at high stakes versus low stakes, and then you add in how often we're doing these things, you see that for the low-stakes things, we do them all the time - code commits, the frequency is really high. We learn how to do these things. I can do a git push and all that stuff with my eyes closed, not even thinking about it while I'm talking to three different people. It's just a thing, it's become muscle memory. When we're talking about incident response, it's something that has high stakes.

How often do you get a chance to actually deal with those outages? How often do you know who's supposed to be doing what? What's your coordination? We have this problem where the things that have really high stakes for our business, and can have a really high cost if something doesn't go right, are also the things that don't happen very often. Essentially, the frequency is what provides the opportunity. If we're not creating that frequency, or if we're not identifying it, then we're not giving ourselves the opportunity to learn.

What Even Is An Incident?

Let's talk about an incident for a second because, clearly, this is the direction I'm headed. How many of you are on-call? I know Ryan [Kitchens] asked this question earlier, that's part of your responsibility, to be on-call. I bet you all the money in my pocket, which is actually zero at the moment, that if I was to ask each of you individually what an incident is, I bet I'd get a different answer for every one of you.

This is something else I've noticed in a lot of my travels. When I start talking about incidents, people have different ideas, and the reason is that everyone's situation is different. Everyone's systems are different, everybody's teams are different. This idea of an incident is actually somewhat subjective. People have different ideas of what an incident is.

To me, if it impacts the customer, that's definitely an incident. I would hope that most of you would agree with that. Then there's other situations where maybe it didn't impact the customer. Maybe something went wrong, didn't quite go as we thought it would, but we don't call that an incident. If we don't call it an incident, then there's probably no incident review. There's probably no retrospective on what in the world just happened. There's your opportunity that we just let go.

One big reason why some people - some teams, I should say - don't have or don't create this opportunity to learn from these incidents is that the idea of an incident is just subjective. The other one is that we also view incidents and disruptions and accidents and all these different things as bad, something we should avoid. It puts a lot of stress on us; most of this gray hair is because I've been on call most of my life, I feel like. We don't want them, they're not pleasant. We're starting to realize they're not pleasant, but they're normal. It's just part of a complex system, it's part of what we're building. We have to start looking at them as what they really are. They're normal, they're part of our work, we just didn't plan for it. As Ryan [Kitchens] mentioned, this is unplanned work, an unplanned investment.

We've been silly in how we plan out our work. We start really focusing on just the functionality of our systems, our product teams are pushing down features, but a lot of the work that we do is keeping the system up and running. It's dealing with incidents, it's dealing with near misses, things that maybe don't even really get recorded. Somehow we save the day, nobody even really realizes it or recognizes it, and certainly doesn't take the time to learn from it.

Incidents are subjective, and they're also oftentimes simply avoided, or maybe intentionally mislabeled, so that we don't have to say, "We just had 20 incidents this month. That's not good." This is one of the dangers of reporting incidents: 20 incidents in a month doesn't mean that that's bad.

I was actually having a conversation with some of my former coworkers at VictorOps while I was in Portland a few weeks ago for Monitorama. When I was at VictorOps, I facilitated all of our post-incident reviews. I was talking to one of our head, sort of SRE, people there, and I was asking, "How do the PIRs go?" That's what we call a post-incident review. He said, "Honestly, it hasn't really been the same since you've been gone. I'm running into problems where a lot of the time we get done with our retrospective, our incident review, and we don't have any action items. A lot of times we feel like it was a waste of our time." I pointed out to him that an action item isn't the point of an incident response or an incident review; it's not why we do them. It's to understand more about what's going on in the system and share more about how the system actually works, beyond just those people who had to deal with the problem in the moment.

What I tried to convey to him is, if there was even one person in that conversation during that retro that now walked away with more information about how the system actually works, then you had a good retro. You don't have to have action items, there doesn't have to be some new work. Maybe there is, and oftentimes there is, but it's not a requirement for a good retrospective. We're just here to learn.

One of the things that also continues to slip in is that we keep trying to find ways to prevent these things from happening. I understand that you want to mitigate and minimize the impact of problems, but you will never, ever be able to remove all problems from the system. The systems are continuously changing. As we've tried to get across to you in several of the talks today, these problems aren't bad; they're just part of the system. They're also part of the information that you need to collect.

I love this quote from Monitorama. Nida [Farrukah] is actually a coworker of mine, she's an SRE at Microsoft. She gave a great talk about how incidents are just full, just riddled with data that most of us just never even take the time to look at. A lot of it has to do with what's going on with the people. It has nothing to do with the technology. The components come and go, especially if you're in a cloud world, "Let's just kill that server and bring a new one up." We don't need to necessarily know what went wrong with it. We need to know how we're going to respond when things go wrong. Not if, but when.

When we start looking at incidents as bad within a complex system, it's not just counterproductive - I think we can all agree on that, or at least start to understand it - it's also dangerous. We're starting to build systems where people's lives are on the line. We're in a world now where reliability isn't just a nice thing to have. There are services out there - maybe not Netflix, I'm sorry, Dave - but there are services out there that people need. Dave's actually got a really good talk where I think he talked about an outage when Netflix went down; it was just a message that said, "Go outside and play." The reality is that we are now in a connected world, and the things that a lot of us are building, people really do rely on, and it's really dangerous for us to start thinking about problems within our systems as bad or unnatural.

The way I like to explain it is that these incidents, these things that are "going wrong within our systems," it's just our system telling us what's going on. It's just the pulse. It's feedback, it's not necessarily good or bad. It's just information, it's data within a sociotechnical system, people and tech.

Techniques to Learn

Let me see if I can get into some ideas here about how we can find different ways to learn within these complex systems, given some of the stuff I laid out - that we just avoid problems, we don't necessarily like dealing with things. We're also impatient; we don't have a lot of patience to sit down and read through a long retrospective analysis of what actually happened. We have other things to do; we have sprints and certain deadlines that we're working towards.

One of the things I would say - and thank you, Ryan [Kitchens], for echoing a lot of this; I was able to put my root cause soapbox off to the side because he covered a lot of it - one point I do want to get across to you today is that even if you are still doing root cause analysis, despite the fact that we're not really manufacturing anything linearly, it just doesn't provide enough data. I think it's good to do something, and if you are doing root cause analysis, I'm not going to shame you about it, but I think there are better ways to learn.

The problem with root cause analysis is that it's just not enough data; it's not telling us what's going on. I've had enough opinions about this over the last five years that people finally forced me to write a book about it. You can still download that; it's available through VictorOps's website. I don't even like the term postmortem, because that insinuates that we're just looking for the cause of death. We're just looking for the things that went wrong, and that's not what we're looking for at all. We want to know the things that went right; we want to know what people did to prevent it from getting worse. It's a learning review. A post-incident review is a sub-part of a learning review. There's a lot more data that we've got to get out of what is actually taking place within our systems.

To put this visually: if we were to go through the old-school method of asking five questions of why things happen, it doesn't really get us much. It feels like we're just on this search for treasure. What was that one thing that went wrong? What was that first domino that tipped and cascaded into all these other problems?

That's not actually what's going on. Actually, I don't even really care anymore; it's not interesting. What is interesting is some of the other stuff. What are our teams' understandings of how the system works, how the sociotechnical system works? What are the conversations that engineers are having with each other? What tools are they using? What theories are they forming? How much knowledge, information, expertise, and experience do some of these engineers who are showing up to deal with problems have? Where are they at with all that?

How much sleep have you had? Have you been on call for a week? Anything more than three or five days can be very detrimental to not only your health but the system itself. How about language barriers? How many of you work on a distributed team with people speaking multiple languages around the world? How many of you are using different types of equipment, different stuff that's just different? Do you ever talk about that? How much confidence do you have when you show up at an incident? Especially if you're the first responder and maybe you're a little bit more junior, you're new to the team, where's your confidence level? Do you ever talk about that in root cause analysis?

The point I'm trying to make is that these are actually much more important than why that thing broke. Those things are quasi-interesting, but this is what's more interesting, because these get to the heart of the complexity of the systems that we're building. We look at these things - I've got five areas here, and there are definitely more. You combine them all together, and now we're starting to see how the language of complexity starts to make a little bit more sense.

This is just one engineer and all the things that they're thinking, and how much rest they've had, and what their experience is. It's not just one person involved, you've got a whole team of people, and each one of them has inputs and outputs that affect each other. Then we start to see even more complexity. We like to talk about incidents in terms of a timeline and impact, and this is something I've definitely talked about a lot.

We measure things like the time to detection and the time to resolution. Here we can see our first engineer who came in with all these different squishy things, they showed up at one point in time, and they started doing some things. Then a few other engineers started coming in, and it wasn't until we get to the fifth engineer who got pulled into this incident that now we're starting to see the impact go down. Now we're starting to restore and recover from whatever problem we have, but then we had three other people join.

Historically, we would look at this as a sort of timeline. Hopefully, you're doing some sort of persistent group chat with your incident management; hopefully, you have a clear channel for people to have conversations. I think phone bridges are better than nothing, but I think they're falling out of favor too, mostly because they make it much more difficult to capture what people were saying, what they were doing, the things that were actually happening, when it's all just an audio recording. Unless you've got someone who's going to go back and transcribe exactly what people were saying, it's very difficult.

I'm a huge proponent of ChatOps and making sure that you get those conversations into their own clean channel about what's going on. That's still not really what's going on, though. The reason is that when it comes to incidents, they aren't linear; they don't really just happen in a straight line. In a complex system, there's a lot of stuff always going on at the same time. I think most of us would understand that reality isn't linear either, and that's what we're dealing with.
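As a rough illustration of that ChatOps point - getting the conversation into one clean, persistent channel so it can be reviewed later - here is a minimal sketch that posts timestamped incident updates through a generic incoming webhook. The webhook URL, payload format, and update text are assumptions for illustration, not a real integration from the talk.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical incoming-webhook URL for a dedicated incident channel.
INCIDENT_CHANNEL_WEBHOOK = "https://chat.example.com/hooks/incident-1234"

def post_update(author: str, text: str) -> None:
    """Post a timestamped incident update so the timeline is captured in one place."""
    payload = {
        "text": f"[{datetime.now(timezone.utc).isoformat()}] {author}: {text}"
    }
    req = urllib.request.Request(
        INCIDENT_CHANNEL_WEBHOOK,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example usage: responders narrate what they see and do, creating a reviewable record.
# post_update("engineer-1", "Checkout latency spiking; looking at the payment service.")
# post_update("engineer-2", "Rolled back config change abc123; watching error rates.")
```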

If we start to look at this a little bit differently, we can see the same arc, but just because engineer one came in and did one thing, they don't just disappear after that. It's moved on to the other engineers, but engineer one is still thinking, still asking questions, and watching what the other engineers are doing. It's changing their mind about what they should do next, and they're also learning. There's a lot going on; it isn't a linear story about what's happening during an incident response. We never talk about this, especially not in a root cause analysis.

Now you've got all these engineers, and they all have their own feedback loops, and we start to see the complexity really form. How do we capture this? How do we talk about this? How do we learn from every one of these engineers and what they're thinking and doing? There's a really great book that I wish I could buy 200 copies of and just give to you all, because this one really opened my eyes to a lot of stuff. Thank you, Samuel Arbesman, for this book. I've read it twice now.

This is something that jumped out to me too when you start talking about complex systems. It doesn't take many inputs and outputs, and components to start getting to really astronomical numbers. It's just simply impossible for any one human being to wrap their head around what's going on.
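To put some rough numbers on that growth, here is a small sketch; the component counts are arbitrary, and pairwise interactions and up/down combinations are only crude stand-ins for "ways the system can behave."

```python
from math import comb

# For n components, count the pairwise interactions and the number of
# possible up/down combinations - two crude proxies for system behavior space.
for n in (10, 50, 100, 300):
    pairwise = comb(n, 2)   # distinct component-to-component interactions
    states = 2 ** n         # combinations if each component is merely up or down
    print(f"{n:>4} components: {pairwise:>6} pairwise interactions, "
          f"~{states:.2e} up/down combinations")
```

Even with only a few hundred components, the number of combinations is far beyond what any one person can hold in their head, which is the point being made here.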

We're not looking for buried treasure. There isn't some line, there isn't some arbitrary, subjective number of whys we can ask so that we actually know what broke and what we should do to improve; it doesn't provide much to learn from. I think we've also heard today, what about the stuff that we didn't actually identify as an incident? Somebody came in and did one thing, we didn't call it an incident, so we never took the time to even discuss it.

If you want to build a resilient system, there's got to be continuous learning from all of that stuff. It can't just be these things that we subjectively call incidents. We're trying to improve what we call our operational knowledge and our mental models. The operational knowledge is how do we get stuff done? How does this team, how does this organization, this company, how do we get our work done?

Then the mental model is your understanding of how the system works. I think if you go back to your teams and have this really heady, deep discussion about how your systems work, there will be a lot of dissent. There are a lot of people who have different ideas about how the system actually works. That's ok, that's good; you should explore that dissent. You should find out why you thought it worked this way but she thought it worked that way. That's good information; those are things that should be explored in a good retro.

Exploration and Experimentation

Talking about exploration, these are going to be my two suggestions, my techniques that I hope you will take home. Some of this we've already covered in some of the other talks today. In terms of exploration, it's these learning reviews. We have to find better ways to learn from our systems, I do not recommend RCAs, and I don't like the term postmortem.

I don't care what you do, and I'm not going to give you a step-by-step for how to do it, because there are no best practices when it comes to this - your situation is different than mine, and it's different than theirs. I can give you some theories, and I hope you take those away from today, but you've got to find your own ways to learn about what's going on in the realities of your system.

Experimentation is the other one. When I say experimentation, I'm referring to things we commonly call chaos engineering or game days, where we're creating synthetic scenarios to understand more about our system. We're pushing it, maybe to its limits, or to its edges, or at least to the places that have been unexplored. We don't know what the system will do if we do this - and, well, we should.
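As one hedged sketch of what a small, game-day-style experiment could look like, the snippet below injects artificial latency into a stand-in dependency call and checks whether calls still finish within a latency budget the team believes should hold. The function names, delay range, and 2-second budget are all hypothetical, not anything prescribed in the talk.

```python
import random
import time

# Hypothetical steady-state belief: calls to this dependency complete in under 2 seconds.
BUDGET_SECONDS = 2.0

def flaky_dependency() -> str:
    """Stand-in for a downstream call; the injected delay simulates a degraded dependency."""
    time.sleep(random.uniform(0.1, 3.0))  # fault injection: 0.1s to 3s of added latency
    return "ok"

def measure_call() -> tuple[float, bool]:
    """Time one call and report whether it stayed within the latency budget."""
    start = time.monotonic()
    flaky_dependency()
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= BUDGET_SECONDS

if __name__ == "__main__":
    for i in range(5):
        elapsed, within_budget = measure_call()
        print(f"run {i}: {elapsed:.2f}s, within expected budget: {within_budget}")
    # The interesting part of a game day isn't this script; it's watching what the
    # people and the surrounding system do when "within expected budget" turns False.
```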

In terms of learning reviews, I'll give you a few ideas, a few things to take home. The most important thing in a learning review - especially early on, if you haven't done a lot of these and your team, and especially senior leadership, aren't familiar with this, because they do like to pop their heads in, and a lot of times they're looking for necks to wring, for people to point fingers at - is that whoever is in this learning review, especially whoever is facilitating, has to set the context that we're not here looking for answers, we're not looking for fixes, and there don't always have to be actionable takeaways. We're looking for ways to learn; we want to learn more about our system.

That means we have to be curious, we have to set aside the time and the effort to explore more about what our systems are really doing under certain circumstances. This is what is required for us to learn, we have to be curious about what's happening.

I'd like to point this out too, most of us don't understand the reality of our systems. Especially if we're thinking linearly, especially if we're just asking a limited number of questions such as five whys, we're not really getting to the reality of what's going on, especially with what people were thinking and doing.

I also think it's really important to bring in more people who weren't part of the incident response. There are other engineers who maybe weren't on call, or just weren't available, or for whatever reason weren't needed during that incident response; it doesn't mean that they shouldn't learn from what took place. It doesn't mean that they can't still provide information about what is actually taking place in our systems. They have their own areas of expertise, and I guarantee it influences the system in some way. They should all be part of, or at least invited to, the conversation, invited to the discussion about what's going on.

They should be encouraged to explore that dissent. They should be encouraged to ask stupid questions. I love it when new people show up to a retro, and they really gently raise their hand with the usual, "This might be a stupid question, but why does that do that?" and suddenly, everybody's, "Oh, my gosh, I didn't realize there were..." We think everybody knows how stuff works. We think we know how things are connected and what should happen if this thing happens.

We're always bringing in new people, and it's not the case that everybody has the same operational knowledge and same mental model. We have to explore when people don't really understand what's going on, or they have questions. There are no stupid or dumb questions.

Another thing too is our time is valuable. We have things that we're working on and I think we all understand that. This is important stuff, there's nothing more important than learning about our systems, and we have to create the space and allow for time. That might mean that there are three, four different conversations, four different meetings, perhaps, about what's going on.

One of the reasons I think it's important is that it takes time to really think about and synthesize what just took place, what happened. You never want to do these things right after an incident. People are frazzled, they're tired, they likely had to deal with this in off-hours. You want to give them time to rest, give them time to think about what happened, collect their thoughts, collect some notes, and then maybe space out the discussion over several different days, or several different meetings.

One meeting might be simply, "Let's just talk about some of the high-level themes that we saw during this incident." The next one might be a little bit more digging into the timeline of what were engineers doing, what were they thinking. Then if it seems like there are some good actionable takeaways or some things that we should implement, we found some blind spots in our observability and our monitoring, then there's going to be a separate meeting, maybe just to talk about those. What do we need to do? What do we need to schedule as far as future work to not prevent these things from happening, but let us know that something's happening a little sooner, and also maybe have a little bit better response plan?

If you're looking for a to-do list of how you should do this, I point you to what Ryan [Kitchens] was talking about earlier. I love this model from Netflix; I think it's great in terms of capturing, to a certain degree, what people are doing. It's very difficult to actually capture what's in the minds and hearts of people, but I think these are a great starting point. Thank you, Ryan [Kitchens], for that, and Netflix.

The other thing, too, besides these learning reviews, is our game days. I'm sorry, but it's just something we know we need to do. I was reminded of this today, staying at the hotel: I'm on the floor with the exercise equipment, right out my door, and I haven't done anything with any of it. I'm realizing that, yes, I probably need to be a little bit more healthy, I need to take care of myself, and we have to do the same thing with our systems. A lot of that is rehearsing and practicing.

Just like sports, you can't just show up on game day and kick butt. You have to rehearse, you have to know what's going on and what to do in certain situations. These game days have become very important, too. They incorporate the chaos engineering and all that stuff. It's not just about, let's see what happens to the system technically when we do this; it's, let's see what happens to the system sociotechnically when we do this. How do our people respond?

What we're trying to do is remove some of these blind spots that we didn't even know we had in the first place. That's the point of the game day or chaos engineering, is to understand more of the system, to create a better understanding of the realities of what's happening.

We're looking to increase our understanding of how things work and how people sort of deal with different situations in the moment. Also, of the 10 people in your team, why does each one of them have a slightly different understanding about how the system works?

I've just got one other thing I want to add or mention. Obviously, this isn't a super technical talk; I think this is more theory, more principle. I hope that I've challenged some ideas with you in terms of what can be done, what certainly needs to be done. I know this is tough, I've been in the same situation. I was in your shoes, I was sitting in these types of seats at other conferences when people were telling me these same things. I was thinking, "I guess it makes sense, but I've got to go back and talk to my senior leadership and tell them that we're doing everything wrong, and we shouldn't be doing this stuff. It feels like you're asking too much of me." It's ingrained, it's in our culture to do it this old way, these RCAs and whatnot. I get that it's hard, and it's tough. As we've probably heard, it just takes one person, one champion, to come in and start really being that leader.

If you were in Nick's [Caldwell] keynote this morning, leadership isn't coming in and implementing these new changes. It's sharing ideas, it's becoming that champion who says, "Here's what's really going on, and here's what we should be thinking about and should be doing." I know that every one of you can do that, because that's what I did. It didn't really take much effort. I just had to come in and start opening my mouth and having conversations, just like I'm having with you right now, about how we aren't necessarily operating in reality. We're operating on models that made sense back in the day, when it was all manufacturing, when it was all just an assembly-line thing.

Software isn't developed on an assembly line. Why do we think we can adopt the same old processes that were used to build cars? It doesn't work that way, it's just not the same. We've adopted a lot of things that worked for a while, but we're no longer putting software on discs that then go out in the mail. This is continuous; we're always changing things.

The point I'm trying to make is, I know it seems like a heavy lift; I'm asking a lot. I've spent the last month or so gallivanting around the western U.S. in my camper van with my partner, visiting a few different conferences. I was in Telluride, Colorado the other day for a bluegrass festival, and finally took a few days off. On Sunday, one of my favorite musicians, Bonnie Paine - who played in a band called Elephant Revival, not assuming any of you know anything about bluegrass - sang this one lyric, "I will carry on," that really jumped out at me, because I've been thinking about this talk; obviously, I had to be here this week to give it. This is a brand new talk, I haven't given it before. I was thinking, "What can I say at the end?" I've given you all these ideas, and hopefully some of it is sinking in. I hope that your takeaway isn't just that I need to go back to my teams, back to my org, and really shake things up and make it better - implement these new learning reviews, these post-incident reviews, get some game days going. It's going to be hard; you're going to get resistance.

What you can do is expose new ideas to them, just like I'm trying to do for you. Just pull back the curtain, just think about things in a little bit different way, and I guarantee you're going to see some change, you're going to see some improvements. You're going to get other people who are just as curious as you probably are now.

I hope that the mentality you can take back is that you can start having some real, honest conversations about what in the world we are doing. We've been thinking about this in a certain way that has been handed down throughout our entire careers - about how we build and operate and maintain stuff, and how we can improve things. It doesn't apply anymore, not in the complex systems that we're building. It scares me to death that we keep using these old methodologies, these old procedures, these old processes when the world is becoming so much more reliant on the stuff that we're building.

I hope you can take some of these ideas and carry them back with you and start having some good conversations.

Questions and Answers

Participant 1: Thanks for the talk. A simple question: allowing for space and time in an org that's growing, where lots is going on and the culture isn't there yet - how do you build towards allocating that time and space? That's been the thing that's been impossibly hard.

Hand: To repeat the question, how do you create that space and time to have those learning reviews, those post-incident reviews? Most of my background has been working in startups; Microsoft is the first large organization I've been in. I think, first of all, it depends on the size of the teams and the size of the org. I can tell you that at VictorOps, we took incidents very seriously. We started really rethinking what an incident is, and I think it starts with that.

Once you've identified what incidents are, then it becomes a little bit easier for you to say, when there is an incident - and you can even say, if it's a SEV2 or higher - we insist on having one of these learning reviews. If it's anything lower, then maybe it can be a casual conversation, just in a Slack or Teams channel or something like that. I think it really depends on your unique situation, but a lot of it, I believe, starts with removing that subjectivity about what an incident is. What is the right situation for us to stop what we're doing, even for 90 minutes today, and just have almost a water-cooler type of conversation?

Let's just air it out, let's just talk about what really happened, and what were you thinking. "Jason, you were the first responder. You went into Prometheus and you checked this. That made sense to you in that moment, let's explore that. Was that the right thing to do?" Because maybe there's a different place that another engineer would have looked first. It's hard to say what's right and wrong, but it's worth exploring why there are differing opinions on how to approach that problem.
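A tiny sketch of what removing that subjectivity could look like is below: encoding the team's own threshold, like the SEV2-or-higher rule mentioned above, as an explicit check. The severity scale, the cutoff, and the function name are just an example, not a recommendation from the talk.

```python
def requires_learning_review(severity: int, customer_impact: bool) -> bool:
    """Return True if our (example) policy says this incident gets a facilitated learning review.

    Lower numbers are more severe: SEV1 is worst, SEV5 is trivial.
    Anything SEV2 or higher, or anything that touched customers, gets a review;
    everything else can stay a casual conversation in the team channel.
    """
    return severity <= 2 or customer_impact

# Examples:
# requires_learning_review(severity=1, customer_impact=False)  -> True
# requires_learning_review(severity=3, customer_impact=True)   -> True
# requires_learning_review(severity=4, customer_impact=False)  -> False (chat-channel discussion)
```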

Participant 2: I was just wondering, I work at a very small company where there's only four developers, and we don't often have time to implement all the things you talked about. What would be your first step? What's the most important thing or the best way to enter that?

Hand: To repeat the question: in a small team, oftentimes, especially in startups, it's very high paced; we're just constantly working towards something we're building. There are only a few of us, and lots of times we're wearing multiple hats, so it's just as hard for us to set aside time to do these things. Do you have a product manager, someone who's sort of the head of product? That person needs to feel like they're part of the engineering team. They need to feel the same pain that comes up when things go wrong or when there are concerns within the engineering team.

If there's someone who's in charge of the product, and they're in their own ivory silo, and then you've got your engineers over here doing the real work, that's going to be a huge problem. You have to bring the product teams in because they've got to be able to feel the pain. What happens is, you realize you've got these blind spots in your monitoring, and it takes some real engineering time and effort to instrument your systems to have more monitoring data, to have more stuff to explore during a retro.

If your product team always seems to be pushing new features, you don't have time to instrument your code. As soon as that product person is a little bit closer to what the engineers are doing, and the pain that they're feeling, and the problems that they're thinking about, then suddenly the product team will say, "Yes, I feel that. Yes, we can't scale from 5 million to 250 million without monitoring." They have to be part of that conversation.

I think it's the classic siloed thing that you have to bring those people in, because the product teams, usually they're the ones who get to say what we're going to work on. If they don't understand the importance of the observability and all those other things, then it just never really happens. I say, bring in your product teams or your product person, whoever that is, and let them really feel the same pain.

 


Recorded at:

Sep 02, 2019
