Incident Management in the Age of DevOps & SRE

Summary

Damon Edwards takes a look at the techniques that high-performing operations organizations are using to finally transform how they identify, mobilize, and respond to incidents.

Bio

Damon Edwards is a Co-Founder of Rundeck Inc., the makers of Rundeck, the popular open source runbook automation. He has spent the past 19 years working with both the technology and business ends of IT Operations and is noted for being a leader in porting cutting-edge DevOps techniques to large-scale enterprise organizations.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Edwards: My name is Damon Edwards. This talk is about incident management in the age of DevOps and SRE and I'll talk about why that's important. You can follow me on Twitter @damonedwards. I'll post the slides there but also they should already be up. If not, they'll be up shortly, rundeck.com/qconsf-2019 so you don't have to take pictures of things. The slides will be there. Plus, obviously, QCon does a great job of recording and distributing them.

My first assertion is, number one, the ability to respond to and resolve incidents is the true indicator of an organization's operational capability. How good are we at operating? How many folks here work for a packaged software company or somebody that doesn't have operations? Anybody? The rest of you are all in a business that makes its money by operating software. The running service is our business. Everything else we're doing is to support that, so how good we are at operations is fundamentally how good we are at our business.

Number two, my second assertion here is that I think everybody now works in operations. If you think about it, that running service is the point of our business. Everything we do, everything that starts in development, goes into production and then boomerangs right back at you when there's a problem, so we're all part of that operational chain. Also, looking forward down the path, you see where the future is going, this heavy bias towards the "You build it, you run it" teams, which also brings us all right into operations. I think the days when we could say, "Oh, operations, that's somebody over there," are long gone. Now, operations is something that we all have a critical stake in.

What Is an Incident?

We're talking about incident management. We should probably have a definition. What's an incident? I look at an incident as an unplanned disruption impacting customers or business operations. The first part goes all the way back to the ITSM roots, the idea that an incident is something disrupting a customer: things like outages and service degradation. If you think about it, there are also business operations: things like work interruptions, delay, waiting, short-notice requests (a euphemism for somebody forgot to tell you until it was urgent). All of those things are the death by a thousand cuts that pile up. It slows your delivery, it slows your response, it results in poor decisions. All those things can slow our internal business operations, which in the end affects the customer. Think about it holistically: delay affects our customers.

If you think about an incident, we can't separate out one of these from the other because we're the same people behind the scenes that need to deal with both. When I talk about an incident, I'm talking about something traditional like an outage or service degradation or these short-notice interruptions that happen all across our business.

The format for this presentation is, I'm going to walk through the lifecycle of an incident and what we see high-performing organizations doing in these different areas. It's like a tour, or like a survey course of where things are at today and where they're going. Along the way, I'm going to mention a whole bunch of people who I've looked to, who have provided guidance and a lot of great insight on these different areas, so they'll be making an appearance at some point or another.

First of all, before we talk about that wheel, the wheel of an incident, I want to talk about the environment that we're all cooking in right now. There is a lot of context that goes into why people are doing things around incidents the way they're doing them and I think it's the flashpoint for a lot of really interesting side conversations that have been creeping into our industry for more than a decade or so.

Digital Transformation

I'm going to start with digital transformation. Don't groan, don't leave. It'll be quick, I promise. A lot of this stems from this, which is this decision at the boardroom level. I hear a lot of people have different definitions of digital transformation, probably more so than even have definitions of DevOps or Agile. If you think about distilling it down, I see that a lot of the communication from the board level down at technology organizations, what are they really after?

The first one is they want everything integrated. Gone are the days where the customer service agent would have multiple screens and multiple windows on a screen and you could look at one, look at the other, and talk to the customer on the phone and stitch together and see what's going on. Or, business line A and business line B would live in their own silo and they would only cross at the balance sheet level. Now they want everything integrated with everything at all times which allows them to do new business ideas, combine things to extract more value out of the systems that we already have.

The next one is responsive. This is not responsive like web page responsive. This is more responsive like they want the business to respond or the organization to respond to the market, to respond to customers, to respond to their failure demand or value demand. They want everything to feel a lot more responsive than this long lag time that they've been accustomed to where things take months, if not quarters, if not years, to come out the other end. They want to see it quick. They want to see it fast.

Then they want all that everywhere. Whether it's your desktop, or it's a phone, or it's an Alexa, they want all this capability to be available to the consumer, and then on top of that, always. It's got to be there all the time. The idea of change windows and taking maintenance downtime in 2019 is no longer really acceptable. All this is cascading down to the technology organization. This is the über driving force that's pushing us into making these new decisions.

In fact, the first person here, Cornelia Davis from Pivotal wrote this great book called "Cloud Native Patterns." In it she really describes what the technology feeling is, the signal receiving in the technology organization about the demands that are coming from these digital transformations. That's good to keep in mind. It's the force number one.

Cloud Native & Microservices

What that's driving us over to is these ideas of cloud-native and microservices. First of all, it's been this explosion of technologies because our old infrastructure wasn't good enough. We need things to be ephemeral, to move faster, to go with that new highly integrated, always on, highly responsive world. A good friend, John Willis and Kelsey Hightower are two people who have really helped me keep a clear thinking on really what's going on in this world and very much worth following.

On top of that, all of these technologies have ushered in this new era of microservices. Being developers, I don't have to tell you that. I'm sure you've all seen these Death Star diagrams. They're pretty cool. They list all the microservices around the wheel and then visualize all the interconnectivity between them.

Our world used to be complicated, but we could generally divide and conquer and keep things to ourselves. In the old world, we could have things segmented: the team for app A could take care of app A, and the team for business line B could take care of business line B. Now everything is highly integrated, which dramatically impacts how we go after and try to respond to these incidents.

From that, somebody spoke earlier today who I think has been a great leader in our industry around explaining the power of these microservices, why people are driving so hard into this. Even going back to DockerCon in 2014, Adrian Cockcroft talked a lot about how architecture enables speed and speed enables business advantage. He really talks about this desire to decouple. I know even that microservices Death Star starts to look like a monolith, but the idea in the past was, "We build things in a central way," and that was the most effective. We're realizing that we're really slowing down the business. We can't achieve that digital transformation dream if everything is so tightly coupled. Now we're taking those cloud-native technologies and microservices and we're splitting things up into these different value streams. We're trying to decouple the organization so people can move faster. They don't have to constantly be tied to each other.

Now we're talking about decoupling and fragmentation. Everything has gone a lot more complicated and we're purposely trying to fragment how we do the work, so that we can move faster. Now, if you're on the incident side of that problem, things become a lot more difficult.

DevOps & SRE

Moving on: these new technologies are driven by the digital transformation, and we're talking about decoupling these people so they can move faster. What's next? Now come these DevOps and SRE ideas: how do we get people to work in a high-velocity way? We've got high-velocity systems, in theory, but if we can't unlock them by changing how the people work together, then what good are they? We just have expensive hosting legacy systems 2.0.

If you want to talk about the DevOps side and the trends that are entering in, we can use Gene Kim as a good example. I think he's like the raconteur of the DevOps movement, doing a great job documenting it. Do you know the book, "The Phoenix Project"? It talked about the three ways. There was the first way, focusing on flow, then looking for these feedback loops, and then the continuous improvement along with that. It's really about feedback. There's a new book coming out, in case you haven't heard, "The Unicorn Project," which tells the same story from a different angle. It tells the same story as "The Phoenix Project," not from the leaders' point of view but from the folks in the trenches. A great book, and it's got these five ideals: locality and simplicity; focus, flow, and joy; improvement of daily work; psychological safety; customer focus.

The reality is, people took this advice and they really focused on the delivery side of things. It's, "Ok, we're in dev and it's all about this go-go-go, and then what?" We deploy. We deploy 10 times a day but people aren't talking about ops. How do we operate this thing? Deployment is not the finish line. Deployment is the first step in the rest of your life. It's like getting married. They say people should focus not on the wedding day so much but on what the rest of your life is going to be like. That's where operations comes in and historically, we've largely ignored that. It's, "That's another problem." We planted the flag, we delivered this project. It's like in the movie business, "We'll fix it in post."

I actually gave a talk about this last year at the DevOps Enterprise Summit called "The Last Mile." It was all about showing how unless you can change how you operate, all the blowback and mess that comes back on the rest of the organization stops you, prevents you from really realizing those DevOps dreams. This means that we're trying to focus on that flow and these feedback loops. We have to pull operations closer to us.

I think in this next wave, an interesting thing to arise is the notion of SRE. It really started at Google. They're the ones that wrote the first books. Ben Treynor was one of the engineering managers that said, "How do we run operations not like a classic operations organization, but using the same principles we'd use to run an engineering organization, and do it in an integrated fashion?"

From that, we've got these principles. What's interesting about the principles of SRE is that the first one is "SREs need a service-level objective with consequences." It's this idea that it's not just an SLA that you have to adhere to, but a service-level objective, and it's got consequences. This means that the business, development, and operations have gotten together and said, "This SLO is what matters to our business, and if we blow that SLO, if we blow through our error budget," the term they use, "then we have to swarm to that and all other work comes after that."

The idea that you have the power to tell a development organization or to tell a product organization, "We're not shipping new features until we fix this service," or, "We're not going to do these extra things. We're going to invest in getting this service-level objective back above this level," and it has real teeth. I know a lot of big enterprises that say it's just crazy talk to go to the business side and say, "No, we're not shipping this because we blew this agreed-upon SLO." The pushback is extreme to say the least.
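To make the error-budget idea concrete, here is a minimal sketch of the arithmetic, assuming a request-based availability SLO (good requests over total requests in a window). The function names, the 99.9% target, and the request counts are illustrative assumptions, not something from the talk or from any particular SRE tool.

```python
"""Illustrative error-budget arithmetic for a request-based availability SLO.

All names and numbers are assumptions made for this example.
"""


def error_budget(slo_target: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates over the measurement window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; negative once the budget is blown."""
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        # A 100% target (or an empty window) leaves no room for any failure.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


def should_pause_feature_work(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """The 'consequence': feature work yields to reliability once the budget is blown."""
    return budget_remaining(slo_target, total_requests, failed_requests) < 0.0


if __name__ == "__main__":
    # Hypothetical 30-day window: 99.9% target, one million requests.
    print(budget_remaining(0.999, 1_000_000, 400))
    print(should_pause_feature_work(0.999, 1_000_000, 1_500))
```

With these assumed numbers the budget is roughly 1,000 failed requests, so 400 failures leaves most of the budget intact, while 1,500 failures blows through it and the agreed consequence kicks in. Real implementations usually track this over rolling windows and alert on burn rate rather than a single total.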

That plus this "SREs have time to make tomorrow better than today." This is when we talked about the idea of toil that says it's repeated, could've been automated, it's not adding enduring value to the business. That's something called toil; and toil, we want to get as much of that out of the process as possible because we're not using our human capital to its fullest potential. We've got all these smart people and they're buried in repetitive work that the machines should be doing. How do we get them out of that work so they could spend their time doing engineering work, moving the business forward? Again, it's that idea of pushback.

Third one there is "SRE teams have the ability to regulate their workload." They can say no to things. They can say no to releases. They can say no to being overloaded. In a way that's like a bill of rights, and the idea is really about providing feedback. It's providing a backpressure. Think of a physical system. It's providing backpressure to that go-go-go delivery and it should make a self-regulating loop.

Why is this so interesting? If you think about, especially in large enterprises, how incident management traditionally happened in the past, we had this NOC, some level-one teams. They're always the aggrieved party. It was, "All the tickets go to them. Let's have them run around and try to fix them. They're probably going to escalate it back to you, but hopefully, they're going to take the brunt of the load." This new model is saying, "That's not good enough. We're not making the most out of our people. How do we set it up so there are these tight feedback loops, and when things get bad, the pressure comes back on the rest of the organization, not in a really negative way, but just as a way to provide that feedback loop, again, like Gene Kim's three ways, to help us regulate that workload?"

Very interesting stuff: it's one of the biggest changes in operations since ITIL. In 1989, I think, the first edition of the ITIL books came out. Developers have all had Agile seeping into their brains for the last 20 years. There have been books and conference speeches and it's in the tools, and whether or not you were doing Agile, the ideas were still out there. In the operations world, though, they were living in a much more traditional, different world, which we'll talk about in a second. Folks like Stephen Thorne, Tom Limoncelli, Niall Murphy, Liz Fong-Jones, excellent writers, practitioners, have done a great job of really breaking down what this is all about to its essence.

Now we're changing how we're organizing our people. What does it start to look like? If you think about it today, DevOps and SRE, you start to add these things up. This is just a selection of some of the things that we're talking about, like thinking about products, not projects, continuous delivery, shifting left. Solving operational problems early on in the lifecycle when it's in development land, not waiting until it's going and trying to get into production. Getting to production as quickly as possible, working in small batches, error limits, toil limits, all this cloud-native infrastructure. If you think about it, what's actually happening is we're starting to build these self-regulating systems. We're starting to align our organization on these horizontal value streams from an idea to where we're making money from a service, but really thinking about it, how do we build this self-regulating horizontal system? This really works across all different models.

We see the top one, the Amazon model, second one, the Netflix model, and the third one, the Google model. If you're from those companies, don't yell at me, it's the big picture. Some people drive towards cross-functional teams and those are ways of building self-regulating systems. You pack all that into the same team and that's how we have that continuous feedback. This bottom one here, we'll still have a Dev and Ops divide. Think about Google, they wrote a whole book on operations. They have a separate organization called SRE but they've created all these tools to recreate that shared responsibility model that you would get if you put everybody on the same team. You see the world going towards these value-aligned systems, aligned to the value stream, meaning all the activity we have to do to deliver this point of transaction to the customer, and it's self-regulating. In fact, it was Jon Hall at BMC who was the first person to really point out to me, "You know what's going on here, you put them together, it's a self-regulating system."

Now, let's compare that to how the world used to be. In fact, when you're going to talk to a lot of your counterparts especially in large organizations, this was the view of the world that their whole organization was built on. Yes, there's still this flow of work that everyone thinks about, but that flow of work has to go through all of these processes. It's a very vertically aligned way to think about the organization. "We're going to put the firewall folks with the firewall team, the database folks with the database team, the Windows admins, the Linux admins, so on and so forth."

That was the key way to organize an organization which would always have a process owner, and there was going to be inputs, outputs, triggers, metrics. In fact, this was the ITIL way of doing things. ITIL has their 26, I think now it's up to 34 distinct processes. Now they call them practices, but they're breaking things down like service catalog management, incident management, change management. I'm just jumping around here. Service validation and testing, all these processes, and when that goes into a big enterprise, an organization, and you put a process owner and you give it all those criteria, well, what happens? You start to get people who do what they're going to do. They're going to say, "This is my new task. I'm going to manage this process. I'm going to be the best firewall rule changing team west of the Mississippi and that's all we're really going to care about."

What ends up happening – I'll get to the change part in a second – is you get this unintentional encouragement of silos, where people start to focus inwardly on doing their part of the task, and what starts getting broken up is that flow of work. That's when we have the ticket hell, all the ticket systems. How many folks here spend a large amount of time with a request waiting in the ticket queue somewhere? This is where this comes from. Now, anything that has to think holistically like a system has to flow through, or try to make it through, these hoops. On top of that, there's this idea of a change advisory or a change control. There's some external centralized source out there that is going to tell you if something is safe. I know if you talk to the ITIL folks, the high priests of ITIL, they'll say, "It's just there for advisory. It's just there for coordination and communication," but in most organizations, it ends up being the change authority, where someone is going to tell you whether or not you have the ability to change. Whether it's intentional or not, that really encourages very much a command-and-control mindset.

If you think about that exploding world that we're entering with the explosion of the microservices, with people decoupling, moving along, and this idea that we're going to externally inspect and determine quality and safety and any of those things is really starting to be quite far-fetched. Those who study Deming, lean the [inaudible 00:20:35] system, it's one of his key 14 points. It's number three if I'm not mistaken that you can't achieve quality by external inspection. You have to build quality into the system.

Now, what's going on is, you put this together you see what's really happening is really, this DevOps and SRE self-regulating, value-aligned systems is really starting to replace this old traditional ITSM way of thinking and fundamentally because of that different orientation, one is the horizontal value align, the other is vertical function aligned, and this idea that we're going to have a central change or external change and quality advisory, they're really quite at odds with each other. In fact, it's an oil and water type thing that I think needs to be resolved at some point. Be aware that this tension is going to exist and it's going to exist for quite a while.

I don't think they're going to agree with everything I say, but two folks in this space, Charlie Betz from Forrester, one of the few analysts I actually like, and Rob England, often known as the IT Skeptic, are excellent thinkers and writers documenting this contention, this transformation that's happening. Keep this in mind. It might explain a lot of your relationships with your operations counterparts.

Acknowledge Complexity

All that going on, I think it's time to realize that with these super complicated microservices plus all that's going on on the people management side, we really drifted from complicated systems meaning we can predict them and have an idea, there's some determinism in what we do to really complex systems.

Paul Reed is a guy that writes a lot about this and talks a lot about networks and Netflix, that our world is really complex, it's not deterministic. On the development side, it's easy to think, "NGINX compiles or it doesn't," that there is a determinism to this thing we're doing on the software side, but if you think about the larger distributed system that we're building and the things that we don't control like the cloud infrastructure and the impact of that and the traffic from humans and the unusual behavior, all this piles together, meaning that we are living in a complex world. We're operating on complex systems.

What do we know about complex systems? If there's any physicist in the crowd, don't get mad at me for this, but I distill it down as: we can never have perfect information about it. Just because we know how NGINX responds to a request, once you add in all the other pieces, we're never going to be able to determine exactly how the system works. We can't break a complex system down into sub-parts and say, "I know how this works."

We can look at an engine of a car. I can know how all the parts work and I can model out how this engine is going to work. You go outside in San Francisco, there's no way you can model out and break down. You can't look at that Drumm Street over here and Market Street and use that to figure out how the rest of the traffic in the city flows. We're never really working with perfect information. We can never really control or predict what's going on.

One of the biggest eye-openers that I went through is Richard Cook, who wrote this paper back, I think, in the early '90s. It's like five or six pages and it's just point by point how complex systems fail. If you read that paper and don't think that all those things are currently at work within your system, then you have another thing coming.

Another way to put it is – this is Charity Majors, called the queen of observability now – "Distributed systems have an infinite list of almost impossible failure scenarios." This proves out to me when I talk to folks, organizations that have invested a lot in resilience engineering and trying to root out failure from their systems: I'll just say that it just gets weirder and weirder how improbable the next problem is. It's the magic bullet theory times a hundred that seems to cause problems time and time again. No matter how good you get at trying to root out failure, we just get weirder and weirder edge cases, which is a strong indicator that we're working in a complex system.

If we're trying to work towards how are we going to be managing incidents, how are we going to respond to and resolve failure, we have to keep in mind we're fundamentally working in a complex system, not just a complicated deterministic system.

Safety Science & Resilience Engineering

Now that we know that, how are we going to actually think about these things? I think we're starting to learn a lot from the safety science and resilience engineering domains. There's these experts like Sidney Dekker, Richard Cook, David Woods, who are famous in the real world. They work with airplane disasters, healthcare deaths, nuclear power plant controls, high-consequence domains, and really look at how do these systems work and what can we do to try to mitigate the failure and disaster that comes out of them. They're starting to bring this into the operations world.

Why is that? This gentleman here, John Allspaw used to be the CTO at Etsy, before that, he was actually head of operations at Flickr, one of the people that kicked off the DevOps movement by giving a conference talk with his development counterpart about how they did 10 deploys a day. This is in 2009. People were throwing up in the aisles, it was so sacrilegious. He went a little off the deep end here and he went – with Dekker and Woods and Cook – went and got a master's degree in Sweden. He went off to Sweden to get a master's in systems safety and talked about how people respond to and avoid failure and disaster in highly complex, high-consequence environments.

He talked about why is this important. Why do I have to know this stuff? He's, "If you think about it, there's this above the line, below the line metaphor," and the reality is in our work, above the line is all the stuff we do. It's the source code we see, it's the processes we go to, it's the buttons we click, it's the things we say to people. Below the line is the real work, that's our systems and we can't actually see it. We don't see the zeroes and ones going by. We can't see what's actually going on under the hood. All we see is this thin abstraction layer and our mental representation of what's going on under the covers there.

The whole point of why the systems safety resilience engineering world is coming into play is because it's all about the communication of the people. How do you keep those mental models in check? How do we make sure that we're trying to connect the people better because we can't really fix the underlying system? There's things that you can do to it, but fundamentally, it's the human interaction with it that causes the most issues, so an important idea of why this is coming into play.

Some other folks have picked up this banner and run forward with it. There was a great event, every year, it happens here in San Francisco, called Redeploy. That's really some of these world-class thinkers and academics in this space who get together and talk about all of these things, how do you bring this to our world of operating online services. I know they hate slogans so I made some bumper stickers. It's ideas, "There is no root cause." That's just a political distinction that we decided to stop and say, "It's that person's fault or it's that system's fault." It's just our political desire to have this straight line to go and say there is some blame there when in reality, there's all these contributing factors and most of them are the same thing people do day in day out to do their job, except that this one day, the right confluence of factors causes a disaster. We can't then go and blame that person and say, "It's their fault," when they were doing what probably had success 99 out of 100 times.

Likewise, there's a big disdain for the five whys because it forces us into this very linear way of thinking that, "It must be this cause." That's the problem. Instead, it's trying to take a bigger, broader view of causation. Then there's this whole idea of Safety-II. In the old world there was Safety-I, which is that we only investigate failure. Failure happens, the NTSB shows up, all the investigators show up, the journalists show up, and we dig through it and we try to find what was the problem. What we were actually not looking at is, "Why did it ever work in the first place?" It's an interesting idea around Murphy's Law which is not, "What's going to go wrong is going to go wrong." It's, "How does it ever work in the first place?"

Safety-II is about studying, "Why do things actually work?" If you think about humans and how they work and your colleagues around you, there's all kinds of little shortcuts and whether it's a mental shortcut or a physical shortcut that they take in their day-to-day job, ways they do things, rituals, ways to talk about things. Again, all of those things work on a daily basis day in and day out until some little slight thing changes and then the exact collection of actions that would normally cause a success, causes failure. If you don't look at why things work, you're going to basically have a hard time figuring out why things don't work. At least, you're going to be trying to do so with one hand behind your back.

There is this great idea that incidents are unplanned investments. It's going to happen. How many of you get called into incidents when they happen like escalations? Your time is valuable. What is that? That's an investment. Your company is now investing in this incident. What do you learn from it? The ROI is up to you.

Elevate the Human

Really where this is all going is this notion of we have to elevate the human, that this dream that AI is going to solve our problems, we are decades off from trying to manage these complex systems. If we look at what's going on in healthcare, nuclear power plants, aviation, all these things, they've spent billions of dollars of highly rigorous academic pursuit trying to get the human being out of these processes and they haven't been able to do it. The fact that we think we're suddenly going to achieve it on our end, I think is a little bit foolish versus looking at it as, "This automation, things we're building, is really to elevate the humans," more Iron Man than HAL from "2001."

It's about elevating the human, and also I think you'll see a lot of folks coming on now "We're starting to look at the humanity of this." We love it because we go to conferences and get to have a good time, but if you think about thousands and thousands or actually millions of our colleagues around the world, that life is not so rosy for them. It's a little bit rough. You're on the receiving end of the aggrieved party at all times. Burnout is very high.

I highly recommend, for your own well-being and just for the well-being of your colleagues, Dr. Christina Maslach. She's from Berkeley, previously Stanford, and one of the most famous people in the world on burnout, and she's now turning her sights on the IT industry because burnout is so high. Some of you will say, "Are we running a country club here? We have to be nice to people? What's going on?" but the reality is, what she's really giving you is a formula for high performance. If you look at things like loss of control and the feeling of being overwhelmed, losing agency over your work, there's a whole list of things that she'll highlight. If you flip them around to the inverse side, we're actually talking about how we get better performance out of our most expensive assets, which are each other.

Jayne Groll is also doing a lot of work with this, with trying to elevate the human being. I think in the operations side, we're seeing a lot more of the humanizing of what's going on, and the reason being there's 18 million IT operations professionals in the world – this comes from PagerDuty S-1 – and 22.3 million developers. That's a lot of people out there that need our help.

This is the world that we're marinating in. These are all the trends that are coming together to now look at what's going on with incident management. Finally, let's look at the wheel. If you think about an incident, it starts from zero and, at some point, there's the observe side. We're trying to figure out what's going on. Then there's the react. We have to take action, whether it's to diagnose something or to repair something. We'll often loop at those two levels, and then eventually, we have to learn from what went on.

OODA Loop

Those of you coming from the lean world, you recognize this looks a lot like an OODA loop. John Boyd is one of the most famous military tacticians, at least, probably one of the most famous American tacticians. He was a fighter pilot who really came up with this methodology, and this is actually his first drawing of it. It says that in any tactical application of strategy there are a few phases: there's observe, seeing what's going on. There's orient, making sense of what's going on. There's the decision, what am I going to do, and there's an action. He figured out that whoever can make those loops faster seems to always win at the objective. He started out talking about dogfighting for aircraft, but soon they found that this way of driving human performance really applies to all sorts of domains. I added that here. I want to think about this top part as an OODA loop. We're going to observe, we're going to orient, what's going on here, we're going to make a decision, and we have to go ahead and act.

The reason why I like this is it breaks down the different areas of what I see as incident management and then we can focus in on the different developments, what people are driving.

Observe

Let's start with the observe side, with a couple of interesting things. Monitoring, obviously, you all know about. Monitoring is really about spotting the knowns. We're always looking a bit in the rearview mirror with monitoring. We're looking for conditions that happened in the past, and that got us quite a bit down this road towards better resiliency because we know something happened in the past, it happens again, chances are we know that there's a problem here.

The new kid on the block has got a name that's been around for decades, but people focus on this new idea of observability. If monitoring gave us that we're able to spot the known, spot things that happened in the past, observability gives us the ability to interrogate the unknowns. How do we actually look at what's going on now in an unknown situation and figure out whether this is good or bad and figure out what the problem is?

Then it really brings in a few different things. Number one is logging the event. We have to have a record that there was some event, something that happened, far better structured than unstructured. Then there are the metrics, which are a data point over time. It's like speed. Are we going faster? Are we going slower? It doesn't really tell us what's going on in context. We lose all context of that event, but we can know one thing: is that data point going up or down or sideways or whatever it is over a certain period of time.

Then the third one is tracing. How do we take those events and put them together in the context of a single request? Look at what's going on in the tooling world: Honeycomb, Zipkin, those sorts of things. It's all around building out this observability side of things. Again, Charity Majors and Adrian Cole are great people to follow in this space who really have a keen eye for where this has to go.
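To make the three signals concrete, here is a minimal sketch in Python. It is illustrative only: the field names (`trace_id`, `service`, `duration_ms`) and the sample events are assumptions made up for this example, not the data model of Honeycomb, Zipkin, or any other tool. It shows the same events viewed as structured log lines, as a metric that drops the per-request context, and as a trace that regroups events by request.

```python
import json
from collections import defaultdict


def make_event(trace_id, service, name, duration_ms):
    """Build one structured event as a plain dict (hypothetical field names)."""
    return {
        "trace_id": trace_id,
        "service": service,
        "name": name,
        "duration_ms": duration_ms,
    }


# Made-up sample data: two requests, one of which touches two services.
events = [
    make_event("req-1", "web", "GET /checkout", 120),
    make_event("req-1", "payments", "charge", 95),
    make_event("req-2", "web", "GET /checkout", 30),
]

# Logging: each event is written as one structured (JSON) line.
log_lines = [json.dumps(e, sort_keys=True) for e in events]


def mean_duration(events, service):
    """Metric: one number for a service; the per-request context is lost."""
    values = [e["duration_ms"] for e in events if e["service"] == service]
    return sum(values) / len(values)


def group_by_trace(events):
    """Tracing: put events back together in the context of a single request."""
    traces = defaultdict(list)
    for e in events:
        traces[e["trace_id"]].append(e["name"])
    return dict(traces)
```

The point of the sketch is the trade-off described above: the metric is cheap to watch over time but cannot say which request was slow, while the grouped trace keeps that context.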

Also, there's another one. Our buddy, John Willis, is here if anyone wants to talk about this idea of automated governance. We've seen a collection of highly regulated industries getting together saying, "You remember that Destar, remember the speed we're moving at." Governance, this idea that a human being can attest, "Yes, this control is correct and this control is being followed and here is the evidence of it," we just can't keep up. It's like manual testing in 2019. It's just not going to keep up and get us to where we're going.

Now you see the idea of, how do we drive governance? How do we drive compliance into the automation layer or into our systems themselves so we can prove and attest that we're compliant with these controls and we can do it in an automated way? Definitely, John [Willis] is a good guy to talk to about this. IT Revolution got a bunch of those banks and folks together and did a white paper on this that you can get off of their site for free.
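As a rough illustration of what an automated attestation could look like, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the control ID, the shape of the change record, and the `has_independent_review` rule are hypothetical and are not taken from the white paper or from any specific tool. The idea is only that a control becomes a function, and the evidence is recorded by the pipeline rather than attested by hand.

```python
import hashlib
import json


def has_independent_review(change):
    """Hypothetical control: someone other than the author reviewed the change."""
    reviewers = change["reviewers"]
    return bool(reviewers) and change["author"] not in reviewers


def attest(control_id, change, check):
    """Evaluate one control and record a digest of the evidence it was run on."""
    evidence = json.dumps(change, sort_keys=True)
    return {
        "control": control_id,
        "passed": check(change),
        "evidence_sha256": hashlib.sha256(evidence.encode("utf-8")).hexdigest(),
    }


# Made-up change records for illustration.
good_change = {"id": "chg-1", "author": "alice", "reviewers": ["bob"]}
self_reviewed = {"id": "chg-2", "author": "alice", "reviewers": ["alice"]}

good_attestation = attest("CTRL-PEER-REVIEW", good_change, has_independent_review)
bad_attestation = attest("CTRL-PEER-REVIEW", self_reviewed, has_independent_review)
```

In a real pipeline the attestation would be stored somewhere tamper-evident; that part is out of scope for this sketch.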

Why this all matters is that in order to figure out what's going on and respond as quickly as possible, we have to bring these three things together and make sure that everybody has full awareness of them. This is how we keep that above-the-line behavior in sync: by being able to distribute these three aspects of visibility, the monitoring, the observability, and the governance. I think a lot of organizations have the monitoring, observability is getting there, but the governance side is always a missing piece.

Orient and Decide

Moving on from the visibility side, we want to climb up to incident command, which is the mobilization, the coordination, the communication. How do we get good at that? A lot of this, which we've seen come into operations in the last decade or so, is taken directly from incidents in the real world. FEMA runs this; they're the keepers of it, though it started much before this: the Incident Command System, which is a series of definitions and processes and practices all around how you manage the response to some type of major incident. It's written for forest fires and hurricanes but it's being applied to our industry.

Actually it was Jesse Robbins, one of the founders of Chef, who really was the first person at Amazon to run game days, and they called him the Master of Disaster. He was purposefully trying to break things in a systemic way so they could apply these practices and see how they actually work. Brent Chapman, I think, is another person who's been doing this and really has come to the forefront. How do you apply these incident command principles to how we mobilize, coordinate, and communicate? It might seem heavy-weight at first, but you see that as organizations get better at it, it naturally becomes a way that they talk and they think: we have to respond to these things in a structured way.

Ernest Mueller is also a great person to follow who did a lot of this in the DevOps space, translating these incident command ideas. I'll give a shout-out to PagerDuty, who's done a pretty good job. They've taken all of their incident response docs, based on the incident command system, and turned them into an open-source project. They accept pull requests. Again, this is a low-tech side of the conversation but it's very important to understand how this applies, especially as now we're all being asked to participate in this. It's no longer some other operations organization that's going to run this.

Also along these lines is what's going on with operations itself. We see this split that's happening. Andrew Shafer made this T-shirt. John Willis, Andrew and I did the first DevOps Days in the U.S. here in Mountain View in 2010 and this was the T-shirt. It was "Ops who are devs, who like devs to be ops, who do ops like they're devs, who do dev like they're ops, always should be," and if you were a teenager in the '90s, you know the rest of the lyric, but the idea is, first we're going to be blurring these roles. The past nine years have been about how we blur the lines between devs and ops.

Now I think we're going a step further and saying, "Operations is starting to split itself," and you see different organizations like the folks at Disney, or Shaun Norris, who was at Standard Chartered. It's a big global bank, I think 60 different regulated markets, 90,000 employees. Disney is obviously Disney. You see them drive this the same way: take what was the operations organization and divide it into two parts. There's platform engineering, which really just looks like a product organization. It's largely a development organization building operational tools, building operational platforms, and then everybody else goes into this SRE bucket. That's generally our expert operators. They're starting to be distributed. We're starting to see this shift where the walls of operations were being blurred before by responsibility. Now, they're outright disappearing because we've got this growing centralized platform engineering organization, and this ephemeral, distributed group, call them what you want, these folks call them SREs, who are expert operators being distributed into the organization.

That also leads to a new view on escalations. In the past, escalations were a good thing, "We're getting it to the expert." Do people here enjoy being escalated to? It's an interruption. It turns out, "This is a terrible idea." Not only that, but it's also just slowing down the response. If we have to escalate off to people, our incidents are going to be longer and now we're inflicting all this death by a thousand cuts. We're interrupting the rest of the organization, which just adds a bunch of delay everywhere else. We're finally starting to realize this is a bad idea.

Jody Mulkey at Ticketmaster, this is going back like four or five years, had this epiphany. They had this old model of working in which their NOC, or their TOC, as they called it, was basically what they call the escalators, because all they do is look at the lights and call somebody, and their major web incidents took like 40-something minutes. When the Yankees can't print playoff tickets for 40 minutes, that's CNN news. That's not just TechCrunch news.

The question was, what was a lot of this time being attributed to? A lot of it, they found, was because it was stuck in the escalations; because you have to escalate up to different people. What they did is they had this idea of support at the edge, which was, let's take all the capabilities that we need to diagnose and resolve these problems, or a large chunk of them, and push them down closest to the problem. How do we empower those teams closer to the problem to go ahead and take action? Then on top of that, if they can't take action, how do we empower them with the right diagnostic tools to be able to figure out who to actually escalate to so we cut that chain down?

Their story was remarkable. I think in 18 months, they went from 40-something minutes down to four minutes for major web incidents, all because it's the same problems over and over again. How do you just empower those people closest to it to actually be operators and take action? The best part was it cut their escalations in half. Imagine being interrupted half as much as you are now. It was a huge hit and they kept getting better and better at it.

John Hall at BMC has come back into this again. He likes to bring in this idea of swarming, which really comes from the customer service world: instead of linear escalations, how do we take more of a swarming approach to bring all the people we need to bear on these problems? It's very interesting how the swarms work together, and it does show that you can solve a lot more problems without having those dangerous escalation chains.

React

Time to take action. The two actions we have to break it down into are diagnosing (health checks, exploratory actions) and restoring (restarts, repair actions, roll-backs, clearing caches, all the known fixes for known problems that are out there). I think really what's important to note here is the return of runbooks. If you lived in enterprise land at all, you know runbooks: they used to be mostly manual, a lot of Wikis. Not that long ago, they weren't really talked about because the world was going to be Chef and Puppet and Ansible, and operations was going to go away. Now thanks to the SRE movement, runbooks are back, except now it's not about how we make better Wikis, it's about how we automate those procedures so we can give them to somebody else. In Jody's model at Ticketmaster, how do we give out that access so it's given to the right place in the organization where they can go and take action?

Runbook automation is the safe self-service access and the expert knowledge that you need to take action, something that my colleagues at Rundeck and I work a lot on. The idea is that moving the bits is the easy part. It's the expert knowledge that is hard to spread around. It's, "Is the restart automated?" "Sure. It's automated. We'll just let the developers do their own restarts in production." It's, "They've got to know how to quiet Nagios and they've got to know how to talk to F5 and pull it out of a load balancer pool. Then they've got to know how to check these five things, and they have to then run the restart script but check these other six things to make sure it worked. Wait a minute. Before they run the restart script, they've got to know the right command arguments, and they've got to edit these variables files." It's, "Ok. Now we get it. We can't hand that knowledge off. It's either going to be weeks of training or it's going to be months of somebody trying to script all of that out." Moving the bits is the easy part, spreading the knowledge around is the hard part.

It's got to be self-service, because you have to empower those closest to the action like I just mentioned, and it's got to be safe. By safe, I mean two things. One is yes, from the security and compliance perspective, we have to make sure we're only giving access to certain named procedures, not giving people random SSH access and sudo privileges and a script and wishing them luck. We're making sure that from a security perspective we're de-risking it, but also from the perspective of the person taking action. How do we put the guardrails around them so they know that the right error handling is in place or that commands are idempotent? You have to de-risk it so even if the person has expert knowledge in some other domain, they're being guided to the smart and safe options and they're not going to potentially cause more problems.
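Here is a minimal Python sketch of those two kinds of safety. It is not Rundeck's API or data model; the `FakeService` class, the `PROCEDURES` and `GRANTS` tables, and the user names are all hypothetical, made up to illustrate the idea. Access is granted to a named procedure rather than to a shell, and the procedure itself is idempotent, so running it twice does no extra harm.

```python
from dataclasses import dataclass


@dataclass
class FakeService:
    """Stand-in for a real service, used only for this illustration."""
    name: str
    running: bool = False
    start_count: int = 0


def ensure_running(service):
    """Idempotent procedure: starting an already-running service is a no-op."""
    if service.running:
        return "already running"
    service.running = True
    service.start_count += 1
    return "started"


# Named procedures are the only thing a user can invoke (no raw shell access).
PROCEDURES = {"ensure_running": ensure_running}

# Hypothetical grants: which users may run which named procedure.
GRANTS = {"ensure_running": {"alice"}}


def run_procedure(user, procedure, service):
    """Guardrail: check the procedure name and the grant before acting."""
    if procedure not in PROCEDURES:
        raise KeyError(f"unknown procedure: {procedure}")
    if user not in GRANTS.get(procedure, set()):
        raise PermissionError(f"{user} may not run {procedure}")
    return PROCEDURES[procedure](service)


web = FakeService("web")
first = run_procedure("alice", "ensure_running", web)
second = run_procedure("alice", "ensure_running", web)
```

Calling the procedure a second time reports that the service is already running instead of starting it again, which is the kind of guardrail that lets someone outside the owning team run it with confidence.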

What's going on now is you've got these alerts, these tickets, we've got the incident command system, people are getting all riled up to do it and they've got three options. One is deciphering this Wiki which is, "Is this correct? I'm not really sure what this person was trying to say. Wait a minute. Look at this date. When did they write this?" That's option one. Option two is, I'm dumpster diving into our shared directory to look for the right script we used last time, "But wait a minute. Did they tell me it was -i not -e," or, "There's a new version and that script is in a different directory." There's that problem.

What most likely happens is the escalations. We're just pushing a disruption back into the organization. From a runbook automation perspective, we're saying, "How do we define the workflow that allows us to call all the APIs and scripts and tools that we need, so we can push those safe, smart options to the people closer to the problem? They can solve the problem and not have to wake you all up." It stops these incident chains: "Yes, there's a problem with this particular service," and then the sysadmin or SRE shows up and says, "This is an application problem. I've got to call the developer," and then they say, "I know it's a data problem. Let me call the DBA," and the DBA finally shows up and goes, "Wait a minute, this is a network problem. There's a firewall issue here."

We've got these meandering chains of escalations and what all this is doing is interrupting our other work. We want to get to a point where if it's an unknown, then we can quickly diagnose the problem, take the best ideas out of people's heads. How would you check this service? How do you check that part of the service? Turn them into automation that our L1 folks, whoever that might be, can use to respond. Either they're going to see a known problem that they know a solution to or they know exactly who to escalate to. For known issues, it takes it down from minutes, potentially hours, down to minutes or seconds.

For all of you, I think on the development side of the house, it stops those escalations, the "I need", the "Can you," the "Help me with this," and each of those instances there's a little bit of waiting that's being injected into the universe and then big interruptions will be landing on your head. Then when it comes time for you to go and do something, now you're waiting in somebody else's queue and it just keeps getting worse and worse.

If you can take that expert knowledge, take these procedures, basically bottle them up and let other people help themselves, you're getting rid of all those instances of waiting, all those instances of interruptions, and it solves these difficult security and compliance problems. Before it was, "I could fix it but I can't get to it because we've got customer data." I've seen this work in very highly regulated environments. If you're handing out access to a named procedure, the compliance people actually like it. You can run it all through an SDLC, which means we can do code reviews and decide whether or not this is good to do.

Folks at Capital One talk a lot about this. They've created their own internal runbooks as a service. The whole idea is to make like a router that says, "Hey, this thing, let's run the diagnostics against this instance, whoever gets triggered," and to have two decision points. One is, either know it's a known problem and I'm going to fire off this fix and see what happens, or I'm going to know who I'm going to escalate to. They just spoke at the DevOps Enterprise Summit. There's also a lightning talk at DevOpsDays Austin. It was fantastic.
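The router idea with its two decision points can be sketched in a few lines of Python. This is not Capital One's implementation; the check names, the `known_fixes` and `owners` tables, and the team names are hypothetical, chosen only to show the shape of the decision: run the diagnostics, then either fire a known fix or name exactly who to escalate to.

```python
def run_diagnostics(checks, target):
    """Run every check against the target and return the names that failed."""
    return [name for name, check in checks.items() if not check(target)]


def route(failed, known_fixes, owners, default_owner="on-call lead"):
    """Two decision points: fire a known fix, or know who to escalate to."""
    for name in failed:
        if name in known_fixes:
            return ("fix", known_fixes[name])
    for name in failed:
        if name in owners:
            return ("escalate", owners[name])
    if failed:
        return ("escalate", default_owner)
    return ("healthy", None)


# Hypothetical checks, fixes, and owners for illustration only.
checks = {
    "disk_ok": lambda t: t["disk_free_pct"] > 10,
    "db_reachable": lambda t: t["db_reachable"],
}
known_fixes = {"disk_ok": "clear_cache"}
owners = {"db_reachable": "dba-on-call"}

full_disk = {"disk_free_pct": 5, "db_reachable": True}
db_down = {"disk_free_pct": 50, "db_reachable": False}
healthy = {"disk_free_pct": 50, "db_reachable": True}

decision_full_disk = route(run_diagnostics(checks, full_disk), known_fixes, owners)
decision_db_down = route(run_diagnostics(checks, db_down), known_fixes, owners)
decision_healthy = route(run_diagnostics(checks, healthy), known_fixes, owners)
```

Preferring a known fix over an escalation is a design choice in this sketch; an organization could just as reasonably escalate first for certain classes of failure.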

Learn

The last piece here is this learning part. I'll just leave this one piece from John Allspaw here. He talks about how a mistake we all make is that we want to get to the action items. We say, "We'll talk about some things beforehand but where are the action items? What did I really get out of this?" when the reality is, if you think about what you want out of it, the journey is where you do the learning. Understanding what happened, being able to tell those stories amongst each other, getting together and figuring out all the contributing factors, is what really drives learning in the organization. Again, incidents are unplanned investments. The ROI is up to us. Failure is a fact of life. What are we going to get out of it?

To recap, don't forget the environment that we're all marinating in now that it affects all of our lives and decisions we make and how we talk to each other, as well as follow along this pattern, break it down, and good luck out there.


Recorded at:

Jan 27, 2020

Damon Edwards
