Preparing for the Unexpected

Summary

Samuel Parkinson talks about how the Financial Times is using incident workshops to prepare for the unexpected and make incident management a more consistent process by sharing the group’s wide range of operational knowledge and architectural insights.

Bio

Samuel Parkinson is a Principal Engineer at the Financial Times, supporting the development of FT.com and the mobile apps. At the FT, he has recently supported the Operations & Reliability group with their rebuild of the company-wide monitoring platform and is doing his best to convince people that joining the on-call team is definitely a good idea.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Parkinson: Imagine for a moment, you're a firefighter. You don't just jump into the deep end and start fighting fires out in production. How do you prepare yourself to face those fires? This picture sums it up perfectly. You work within a controlled environment to learn how to deal with situations that are as close to production as possible. I'm going to be talking about how we can apply just this method to incident management within technology.

I wanted to start with a story, a story about an incident at the "Financial Times." Imagine, what's the worst incident that you've been involved with, at work or outside of work? I'm going to run through a few of ours and see, do you think that's the worst one that we've encountered? Was it that time at the FT where we accidentally DoS'ed a third-party vendor that we integrate with by hammering their purging service? That was pretty bad. We got a really sudden Slack message about that and had to fix it out of hours. That wasn't our worst one. Could it be that time that we really wanted people to accept those cookie messages, because GDPR had come in? On every page, we showed that message, and you could not dismiss it. That was a terrible user experience. It wasn't our worst incident at all. How about that time that Boris actually broke our live blogs? It was so popular, and it was running on slightly flaky infrastructure at the time. We couldn't actually publish that breaking news as it was happening. That's awful from the journalist's perspective and from a user's perspective, but definitely not the worst one. How about that time where we just lost our domain? Thoughts? Do you think that's the worst one? That was definitely the worst one.

You might have noticed there's a bit of a naming scheme. I learned through writing this talk that we've actually been naming all of our incidents since about 2016, after the way that Friends names its episodes. I only learnt about this last week. I just love it, and we're really consistent there.

I have a flatmate, Andrew, and he wants to read some FT. I sent him my password. Probably not allowed, but I'd done it. He tries to log into the app on the 21st of January, so not very long ago. He tells me it doesn't work. He sends me the error message. That was not a really nice error message, certainly. I'm like, "That's probably a network thing. Just try again, would you?" He's like, "No. This is really broken." I check Slack on my phone. I haven't been called at this point. I see the following message from Alice, my peer. I want to point out two things that made me realize this was really bad. First, Alice has sworn, and she never swears. I'm like, this is really bad. Second, she didn't bother capitalizing my name. There's no time for that. The ft.com zone was missing. What does that look like? Something like this. This is not good. If you tried ft.com, that wasn't resolving either. App.ft.com, where we do our app, that was not working either. It turns out we actually have about 5100 subdomains on ft.com. None of them were working. This is dire. We have tools all over the company relying on this domain, internal tools, public facing websites; it impacted the whole company. Our journalists were impacted. Our users were impacted. Our engineers were impacted.

For three years, I've logged into this dashboard every week, probably, to make changes, and I've always seen ft.com listed there. This is the first time ever that it was just not there. That is not what you want to see, really not what you want to see. We had never prepared for such an incident, how could we? You'd never expect something at our vendor to just disappear, or a change that we make to cause such a serious situation. It's a classic data loss situation as well. It's not a database, but it's data we've lost. Just like any other classic data loss situation, we didn't have a backup that was actually recent, or always tested to make sure it restores properly. Some people have run scripts in the past that have downloaded our information, but there were going to be a lot of changes since October, that was for sure. Fortunately, it turned out after about an hour of communicating with our vendor that they actually had a backup. It was only partial though. It turned out the things that were missing were the most critical. We use DNS for some of our subdomains for load balancing, and that's slightly more complex. It meant ft.com wasn't working, and a lot of the editorial tools as well that we need to use to fail over. About 10 people worked on this incident. We actually had 30 people online who had joined, and they were there just to help out if need be. Most were not called, like me. They still volunteered their time and were willing to support restoring the service.

A little insight on what it looked like. This is the view from our ft.com homepage. We ended up having 4 hours and 30 minutes of outage. It was really weird because of DNS. The first hour was a total outage. Then after that, we started to restore stuff automatically and by hand. Eventually, we got the service working again. During this incident, there was a total lack of panic. It was so calm. I joined an hour in. People had already started understanding what caused the problem and focused on the resolution. It was a slick operation and we recovered. It took restoring from that backup, and manual entry, literally remembering what we've actually got in our heads and not written down, to be able to get back to where we needed to be, to be able to go to sleep. After that first hour when we understood the problem, the whole time we were focused on the recovery. Not once did we talk about who made the change or what caused it. It wasn't going to help. It was a wonderful example of the discipline that we want to see during our incidents. Those 20 people that weren't actually involved in recovering were just there to learn. That was the biggest thing that I understood, the most impressive thing. They wanted to be there so that if they were needed to help out, they could. Also, they wanted to see how it was working, how we were recovering, what we were doing, and whether there was anything they were going to need to pick up the next day. We celebrated just past 12:00, when everything was good enough for us to go to sleep and come in and fix the rest the next day. We were supplied with cake for the rest of the week, and we also had time in lieu. That's a picture of where we are today with incident management at the FT. I think it's really positive.

Background

I work at the "Financial times." I'm Sam. I'm a Principal Engineer at FT. I work on the group that builds ft.com, our flagship product, and our Android and iOS apps. I joined about three years ago, and I started in the infrastructure teams. Since then, I've morphed slowly back into software engineering.

What I want to talk about is how we do on-call, and how we structure ourselves within the FT, because it's a large company; there are 1,000 people in our office just in London. A bit about what we actually care about from the perspective of incidents. I'm going to talk about the challenges that we faced in 2019 in my own group, and why I'm talking here today. Then I'm going to talk about how we addressed those challenges and made our out-of-hours team sustainable. I'm going to talk about whether they were even worth it. The spoiler is, yes, that's why I'm here. Then some things you can take back to your own teams, and hopefully motivate them to get involved with incident management.

How do we do on-call? I'll start quickly with a little overview of how we structure ourselves at the FT. We split ourselves into what we call groups. I work within customer products. We build customer facing products. We are 45 engineers and counting. We've gone down that microservices route. We run about 180 systems. It looks something like this. I don't have that Monzo diagram with all the connections or anything. I do have a scroll bar that's really small. That's the best I've got. We split ourselves into 9 teams, which works out to about 20 systems each, I think. It's a lot to maintain. A lot to be aware of. I just want you to keep that in mind throughout this talk: when you're coming onto an incident and you're supporting 180 systems, are you ever going to have the full in-depth knowledge of all of those to be able to support them? These are the teams that we split into: content innovation, content discovery, apps, ads, privacy GDPR, accounts, acquire, platforms, U.S. growth.

I want to talk about our Ops Cops team a little bit. We have an Ops Cops team that's made up of one permanent tech lead, a role we try to rotate every six months, and two engineers who come in once a week from each of the other teams. They are our primary incident management team during the week, in-hours. At the "Financial Times," across the company, we also have an operations team who help monitor our entire technical estate, 24/7. If you've been to my colleague Luke's talk earlier, you might have seen a hint of this dashboard already. There is a lot of stuff. It's fascinating what we've built to be able to do this. From my perspective, our systems in customer products are just a drop in the pond. That leads us to talking about how we actually look after our things.

We've got this mentality of you build it, you run it. There's no way we can lean on operations entirely to look after our stuff. We support our systems out-of-hours as well. This is our approach to DevOps right across the company at the "Financial Times." One way of describing it that I like is to say that our engineers wear many hats: QA, incident management, engineering, and infrastructure as well. We expect our engineers to be able to put our systems out into production. Today, we're going to put on our incident management hat. How do we support out-of-hours? It's a volunteer setup. People say, "I want to join this out-of-hours team." We don't have shifts. We don't do that week-on, week-off schedule that is typical for an out-of-hours team. What it ends up looking like is our operations team spot a problem out-of-hours, and they call down the rota to see who's available. It's hit and miss. We end up with a lot of voicemails, or a lot of missed calls that we have to call back on. It does mean we could all be unavailable, in the cinema. We've been lucky so far. It works out really well. It's very flexible. It allows us to work the way we work.

What else do we care about? We've started talking about business capabilities at the FT. This is a new approach to how we monitor. It's putting the user first. We're thinking in terms of, can a user sign in? Is that working? It's a really great way for us to think about the impact from the user's perspective when something goes wrong, because we're actually monitoring that business capability. I was asked to think about, what is an incident at the FT? This is a really tricky one to answer, actually. From my perspective within customer products, it's very much about, is there an end user impact? Is there something we need to fix quickly to ensure our users can have a good experience? Actually, if you go around our company as a whole and ask each group, you might get a very different answer. We're starting to consolidate around this business capabilities idea. Customer products, my group, has two really important business capabilities. The first one: our users should always be able to read the news. If we can't do that, we're not selling our product very well. The other one is, a journalist must be able to publish the news. If we have breaking news, but the journalists can't get it to our users, we're still not in a good place. If either of these goes wrong, we declare an incident. Our users will see something like this. Hopefully, our engineers will be able to see a graph like this. Not always. That's the context.
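To make that business-capability idea concrete, here is a minimal sketch of what a synthetic check for the "a user can read the news" capability could look like, assuming a simple HTTP probe. The URL, timeout, and what happens on failure are illustrative assumptions, not the FT's actual monitoring setup.

```typescript
// Minimal sketch of a business-capability check: "a user can read the news".
// The URL, timeout, and alerting behaviour are illustrative assumptions,
// not the FT's actual monitoring setup.

const CAPABILITY = "read-the-news";
const PROBE_URL = "https://www.ft.com/"; // hypothetical probe target
const TIMEOUT_MS = 5_000;

async function checkCapability(): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS);
  try {
    // The capability is "up" if the page responds successfully within the timeout.
    const res = await fetch(PROBE_URL, { signal: controller.signal });
    return res.ok;
  } catch {
    // Network errors and timeouts both count as the capability being down.
    return false;
  } finally {
    clearTimeout(timer);
  }
}

async function main(): Promise<void> {
  const healthy = await checkCapability();
  if (healthy) {
    console.log(`${CAPABILITY}: OK`);
  } else {
    // In a real setup this is where you would page the on-call rota or open an incident.
    console.error(`${CAPABILITY}: FAILING, raise an incident`);
    process.exitCode = 1;
  }
}

main();
```

The point of checking the capability rather than individual systems is that the alert describes the user impact directly, which is the first thing the talk suggests communicating during an incident.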

Incident Management Challenges

Let's dig into the challenges that we faced in 2019 around incident management. To start, we weren't immediately productive on-call, from my perspective. It might be late in the night, and of course it's going to take time to actually just wake up. We had an engineering mindset in an operations situation. Typical, I think, across this industry. What I mean is we jumped straight into the code. This comes from the fact that within my group, we are made up entirely of engineers. We don't have specialisms in different areas. We don't have site reliability engineers. We don't have specialisms for DevOps. It means we jump into what we're comfortable with. I like this quote from a member of the operations team: "I always start with the impact and comms." The engineers here, we jump in at the tech, the total inverse of what they do. The other part was that our incident management process wasn't actually second nature to us. It was very organic. Sometimes you'd make an incident channel, sometimes we'd start a Hangout. Sometimes we would remember to do communication, and say, here's what's happening, here's the impact. Often, that was hit and miss. We didn't actually have many incidents in the first half of 2019. I counted, it was 10 in total. Hardly any of those were out-of-hours, which doesn't leave much room for practicing and being prepared (foreshadowing), or getting people involved with these incidents. Most crucially, going back to how we run our out-of-hours team, we were down to five people. It's really likely that we could all have been unavailable at the same time. We set out to make our out-of-hours team sustainable.

How did we do that? We started out by asking our engineers, what's going on? We surveyed them about helping out during an incident. It turned out there were actually a lot of people who were already on the fence. We simply hadn't asked people, do you want to join? This made our lives a little bit easier. There were three people who were like, "We can do that." There were seven people who were on the fence, seven people we needed to convince to get us back to that sustainable position with our team. We wanted to understand a bit of why they were on the fence, and what we could do. I got quotes like this, around having that in-depth knowledge and domain understanding of 180 systems. We needed to think about how we could convince these people who felt they needed that in-depth knowledge. Actually, you might not need it. We set out in 2019 to convince people to join our team. We did this by running incident workshops. I've heard some other phrases for these: disaster recovery workshops, or tabletop exercises. They allow us to prepare our engineers for incidents away from the stress of production. Following that, we also ended up writing a generic runbook for our microservices, because it became clear through the incident workshops that not everyone was aware that we can treat a lot of our 180 systems the same, because they're homogeneous. It allowed us to document all the things that you can do to our systems in one place, so that you can apply it to, hopefully, most of those systems. We set out in the last six months of 2019 to address all of this.

Building Your Incident Workshop

Let's have a look at what an incident workshop looks like, and how you can go and build your own. Firstly, don't panic. Set aside a couple of hours to write the workshop. It shouldn't take more than that. A lot of this is going to come back to making sure that it's sustainable for you to be able to run them. Start by having a read of your old incidents. They're an absolute treasure trove of information. There's a great opportunity to learn for yourself. Today, I'm going to follow along with an incident workshop we wrote based on when our RSS feeds became popular again. We have APIs at the FT to sell our headlines, and people still really like RSS feeds. This ended up degrading our website and causing a bad service for our users. Use the first page of the incident workshop to set the scene. You're providing background, but not that much information. It might look something like this. It's going to be a page full of alerts. That's the first thing that engineers really get to know when there's an incident going on: something's alerting, usually. In this case, we're looking at our Heroku dashboard and the alerts in there. Actually, we're kind, we provided a graph in this case. Most of the time it's Slack, Pingdom alerts, health checks going wrong. You then want to follow it with several pages of graphs and information, not too much, with plenty of space so people can write things down. Here, we've got Fastly, our dashboard at the bottom, and then a few graphs of our systems from Grafana.

Each page progresses the incident. If you picked one that has an incident timeline, that's great, you just follow along with that. This is the story you're following. Here, we're going past the graphs. We've got a bit more of an understanding. During the actual incident, we went into Splunk and searched for the IP addresses that were causing this, so we provide that information in the workshop. Make sure that there are opportunities during the workshop to ask about the tools and systems that people might be using to resolve the incident. Include the dead ends, because that's what actually happens in production. You don't want to narrow down on one root cause; you actually want to recreate the fact that you do hit dead ends that weren't the right cause of the problem.

One of my favorites was, there was an incident going on, this RSS incident, and I was told during the incident, as we were resolving it, about a bunch of emails and push notifications that went out at the same time. They lined up so perfectly with the graphs. You're like, "That is the cause." They had nothing to do with the incident. Then wrap it up in a summary page. This is where you're running through with the people in the workshop what actually happened in the incident. They want to know what happened. What did you actually do in production, regardless of what we worked through in the workshop itself? What caused the incident, do we know? We might not know. My main takeaway here is to keep it minimal. I went into this and there was a lot of over-engineering going on. We were thinking about making interactive graphs and dashboards, preparing datasets, and making it very hands-on, a build-your-own-adventure. That's not the way to approach this. The best thing to do is keep it minimal, lightweight, just paper. That way it encourages a lot of discussion. It gives a lot of freedom for the conversations to open up, and for people to share what they know with everybody. You end up with something like this. Key points here: print it out on A3, and give one stack per team. Bring a bunch of pens along, because people are going to write a lot of really interesting things on these notes.

Running Your Incident Workshop

Now you've got your material. What's it like to run your own incident workshop? You pretend to be the incident lead, or incident commander. You want to start by splitting out into small teams of three, four, five people. No more. You're huddling. You want to get a really collaborative situation going on. It might look something like this. This was our first workshop that we ran. This one, actually, is where I learned to use one stack of paper per team. If you hand out too much, then people silo off even within those small groups. Get people huddled around one page. Introduce the session and the format. It's a bit different, so you need to explain what's going on, what they can expect, what you're going to be asking them, and what they should be thinking about. First, hand out that starting page of background information and get them talking. Give teams about 10 minutes to discuss it. You're really going to have to work hard as the incident lead here to get people to open up and start talking about the tools they know and the information about systems they understand. Pose open questions; don't drive the direction of this, but get people discussing. Get people to write down their thoughts. There is so much value in the notes that people make during these sessions to look back over, so just make sure they write them down.

This is a wonderful one, where we ended up at the end exploring what the hell miss, hit, pass, and synthetic even meant, because a lot of us didn't actually understand them. It gives us time to reflect on what people know and don't know. Bring the teams together after 10 minutes. You want to ask them the following questions. Keep in mind that engineering-first mindset that we had; we're trying to break that apart and get people thinking about how we really want engineers to address an incident. Start asking, can they take any action? Are you leading? Can you do anything right now with the information that you have? Could you fail over the website? Would that fix things enough for you to go to sleep and fix it in the morning? Could you turn a flag off? Can you roll back a change? Do you not have enough information? Probably the alerts aren't enough to take an action. What more information do you need? What tools are you going to dig into? You can look in Splunk, you can look in Grafana. You can look for your exceptions somewhere in your logs. Finally, the one that's really critical for engineers: what are you communicating? You need to explain to the rest of the business what's going on. How are you doing that, and to whom? At this point, people actually get really excited. It's your job as the moderator to moderate that conversation. At the FT, there's a mascot that our parent company has. We just pass that around, one person talking at a time, to make sure it's fair, because there's so much information that can spew out at this point. Then, hand out the next bit of information. It might be what they've asked for, it might not, but just keep driving it forwards. You want to repeat the handouts and questions until the incident is over.

The most important thing here is to leave plenty of time for questions at the end. You've got all that material with the notes of what people are thinking about. Spend 10, 15 minutes just digging into what people didn't know. What were the surprises? In this one, we ended up digging into the dashboards. People weren't aware of what our important dashboards were, and what our important metrics were. We ended up spending 10 minutes talking about all the most important things in that area. It was great. Then, 10 people left all on the same page.

Documenting a Generic Microservices Runbook

After the first two workshops, it became clear that on that question about actions, what could people do, we weren't all on the same page. We started looking into how we could address that. What could we do across our 45 engineers to make it clear that you don't have to fix the code first, that there are other things you can do to resolve an incident quickly? We ended up documenting this generic runbook for 180 systems, one place where we list all the actions that you can take. We ended up calling it the ft.com incident tool belt. We also need to build one for the app; that's coming soon. It looks something like this. We talk about what it's like to be an incident lead, like the Slack command that you have to use to start an incident at the FT, and the various different actions that you could take, like failing over the FT or turning a flag off.

Each action has a bit of prep. What can you do before an incident happens, or worst case, maybe during an incident? What do you need to do to be able to use this action? There's a lot of configuration, like downloading the right CLI tool, and we document the usage. Then also, a bit of context about when you could use this action. We list our previous incidents, like when there's been a problem with content in the U.S.: that's a great situation in which to fail over. It looks something like that. We ended up running training sessions on each one. This complemented our incident workshops, which were very lightweight and hands-off. With the generic runbook, when we ran through the actions in practice, we did them properly. We actually applied them in production, because what we wanted was for people to come away feeling confident that they could do this, and to minimize that time, that consideration of should I or should I not fail over the website. We wanted people to come away from this and go, "I know how to do it. I'm not fearful of that. Let's go for it."
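As a rough sketch of how a tool belt entry like that could be captured, here is one way to model an action with its prep, usage, and when-to-use context. The field names and the example failover entry are illustrative assumptions, not the FT's actual runbook format.

```typescript
// Illustrative model of a generic runbook ("tool belt") entry.
// The field names and the example content are assumptions for this sketch,
// not the FT's actual document structure.

interface RunbookAction {
  name: string;
  prep: string[];       // things to sort out before (or, worst case, during) an incident
  usage: string[];      // the steps to actually take the action
  whenToUse: string[];  // context and past incidents where this action has helped
}

const failOverWebsite: RunbookAction = {
  name: "Fail over the website to the secondary region",
  prep: [
    "Install the right CLI tool and confirm you can authenticate with it",
    "Check you have the permissions you need before you need them",
  ],
  usage: [
    "Run the failover command for the affected service",
    "Watch the dashboards to confirm traffic has moved",
    "Post what you did and the current impact in the incident channel",
  ],
  whenToUse: [
    "A single region or provider is degraded, e.g. a problem with content in the U.S.",
    "Fixing forward would take far longer than failing over",
  ],
};

// Printing the entry is a stand-in for rendering it into the runbook document.
console.log(JSON.stringify(failOverWebsite, null, 2));
```

Keeping every action in the same shape is what makes the runbook generic: whichever of the 180 systems is misbehaving, the reader is looking for the same prep, usage, and when-to-use sections.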

Results and General Takeaways

Did it work? Did the documentation help? We ran six incident workshops. We developed three incidents into workshop material. We also ran six toolbox workshops as well as [inaudible 00:29:14]. We asked before and after each incident workshop, how likely were you to help out during an incident as an engineer? It looked something like this. Before, it was so-so. People were interested. It goes back to those quotes around confidence. Immediately after, people were really interested. It was like, "I might not know everything about these systems," and what we did not teach was 180 systems' worth of in-depth knowledge, but they were still willing to help out. Favorite quote: "Solving out-of-hours incidents is probably not as scary as most of us think." The workshops expanded everyone's mental model of our systems. We went from this place where we were fixing things in the code, because that's what we understood as engineers. That was what we could fix. This was our comfort zone. Out of these workshops, we started being aware of these other third parties, like Fastly, that we rely on critically to serve the FT. They are important, and we might need to do something and look at the metrics in there to be able to fix an incident. We got better at the process. We just understood it. It became that second nature that we were looking for. A quote from an engineer here, on the flip side: "Coming out of these incident workshops, I've learnt to focus initially on the comms and the experience, and less on finding that technical root cause and fixing the code." We don't want to wait 15 minutes to deploy a system to get the website working again if we can fail over in 2. We started learning from our old incidents. This is a wonderful exercise, going back over what's already happened. Often, it means we go over the old actions and we realize we still haven't actually done anything.

Lessons Learned From the Incident Workshops

Our RSS incident meant that across the FT, our user experience was degraded. There was a crawler interested in finding out about the latest news, and they were crawling so much that it was causing issues with reading articles and searching across the website. At the time, we did nothing. We actually did nothing in the incident. It sorted itself out. Not great, but it was all right. Through the incident workshop, someone actually suggested scaling down the system as an action. Could we scale down the system that was serving the RSS feeds, to hopefully provide a better user experience for our customers and cut off that degradation? It meant that we lost a page of the website, but on the whole it meant we were better off for our users. It was the perfect answer. We started getting quotes like, "I'm keen to work towards being on the out-of-hours incident team." Most importantly, engineers actually did join the team when we asked them. We're now up to 11. We've still got more to go, so we're going to continue running these. Many of these people were dialed into that DNS incident. They were comfortable knowing that, "I can join. I might be able to help out. If not, I can still learn." That was a wonderful thing to see.
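As a hedged sketch of that "scale it down" action: assuming the RSS feed ran as a Heroku app (the workshop material earlier showed Heroku dashboards), scaling its web dynos to zero could be scripted something like this. The app name is hypothetical.

```typescript
// Sketch of the "scale the noisy system down" action suggested in the workshop.
// Assumes the RSS feed runs as a Heroku app; the app name here is hypothetical.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function scaleDownRssFeed(appName: string, dynos: number): Promise<void> {
  // `heroku ps:scale` adjusts how many web dynos the app is running.
  const { stdout } = await run("heroku", ["ps:scale", `web=${dynos}`, "--app", appName]);
  console.log(stdout.trim());
}

// Scaling to zero sacrifices the RSS page but protects the rest of the site.
scaleDownRssFeed("ft-rss-feed", 0).catch((err) => {
  console.error("Failed to scale down:", err);
  process.exitCode = 1;
});
```

The trade-off is exactly the one described above: you deliberately lose one page of the website in exchange for a better experience everywhere else.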

We're going to continue promoting joining that out-of-hours team through these workshops. We're going to start evangelizing this across our company, to learn about these silos of information and knowledge that we have in different teams, and to try and break them apart.

Key Takeaways

Practicing incident management can prepare you for that unexpected situation. You might not have that in-depth knowledge, but you don't need it, as long as you get a clear understanding of those quick actions that you can take. Confidence doesn't have to come from in-depth knowledge. If you understand and increase your mental model, you'll be able to figure out ways of solving incidents quicker.

Resources

For this talk, I read a lot; there's some really interesting stuff going on around learning from your incidents. Here are the examples. I encourage you to have a look. They are real ones from the FT. They should provide a good, solid starting point for you all.

Questions and Answers

Moderator: Given that you created some of these workshops, you must have personally learned a lot about the incident that you were creating it around. Could you talk about that, Sam?

Parkinson: The first thing I learned is we really don't follow up on our actions from incident reports. That is just the thing. Generally, my favorite one was learning this whole new idea of scaling a system down to provide a better service for the rest of the website. Going back to that, I just hadn't ever considered it. Working with all these other people, everybody with great ideas and an understanding of different systems, that's what I was getting the most out of these workshops.

Moderator: You said you learned a lot, and you're trying to get more people to do these incident workshops? Are you incentivizing, leveling up as a part of running the workshops?

Parkinson: At the FT we have engineering competencies; it's a path to progression. It isn't part of that at the moment. I think you could feed a lot of it in. A lot of this goes back to communication and being able to work with others. That's a large part of what we look for in development. Yes, in a way.

Participant 1: Given all of your efforts around the workshops, was there anyone who still didn't want to engage, still didn't feel comfortable? How did that go?

Parkinson: We had a few people, because we were asking after every workshop, what did you get out of this? Are you in a place where we can follow up and say, would you like to join the team? A lot of people joined so that when they were on the Ops Cops team, they were more confident in being able to fix issues at the time. There will always be people who are just not interested and not motivated by it. That's ok. Even if they're still coming along to the workshop, they might have good ideas. I think that's worth being aware of; this isn't something to pressure people into, or commit people to joining our out-of-hours team. This definitely needs to be voluntary and something that motivates people.

Participant 2: You said the key word there, which I wanted to follow up on, what was the main motivation for people to join the out-of-hours support team?

Parkinson: We didn't actually ask that specifically. I would hazard a guess that it's different for everybody. I can talk about myself, though. I see each incident as a puzzle. It's so rewarding to work with others to solve that problem. It's just a massive endorphin hit, I think. Everyone will be different. I'd be interested to learn more around that, and why people are motivated to actually help out during incidents.

Participant 3: You've said that you don't really need to have all that in-depth knowledge for all the systems. Did you have an incident where you really needed that in-depth information and there was no one on the call who had that information?

Parkinson: I think the way the incidents tend to work, not needing to have that in-depth knowledge is about getting people on board with incidents, as an entry point. Otherwise, there's this expectation that you have to prepare a lot to be able to join and be comfortable doing that. Like the DNS one: one of the reasons why that DNS outage was such a success was because the 10 people helping out all knew each other, all knew these systems and the connections, and had those mental models that really support advancing things. I think that is critical to being able to solve something quickly. There's a balance there about having the right people. This is a great approach to getting people who are new to incidents involved.

Participant 4: I was wondering with staff turnover and staff churn, I don't know if you have this issue, but how do you pick the right time to train people and do these workshops? If we have a new starter, I don't want to do this for one person. I want to do it for a bunch of people. Do you have a magic formula?

Parkinson: Really, in these workshops, we tend to have 8 to 10 people in each one. A lot of the reason why we started this in the first place was people swapping teams or leaving the company, actually. That's how we got down to five people. We had just stopped asking, are you interested in joining? It totally depends. I don't think I have an answer to that one. That is a case of looking at your company and seeing, but running this with three people is perfectly acceptable. It's so lightweight that you should be able to apply it and get value out of it. Equally, running it with people that are already involved with incidents, there's a lot of value in that too. Make sure you've got that nice mix of people involved.

Participant 5: You showed that in the workshops that you're doing, you had graphs and examples of what actually happened during the incident. Was that information that you already had, or was that something that you had to go and collect after incidents?

Parkinson: A lot of that comes from our incident reports. That's what you lean on. If you aren't already getting screenshots of the graphs and what the website looked like at the time, and putting together that timeline of what people did during an incident, now is the time to start doing that, because it means you can produce these things. It's so valuable to be able to look back over. Yes, we do.

Participant 6: Several of the talks talked about building trust and credibility of the teams in your organization. Have you yet or are you going to publicize what you're doing to wider business groups to build that credibility for your team?

Parkinson: I think we need to. A lot of what I've done and talked about here is just within our group, people we know, people we work with day-to-day. The reality of an incident is that we're actually working across the company; resolving the big outage was a big part of that. I'd like to explore that more, but we're not currently doing it. There are some more general workshops that we do within the company to learn about how to prepare for an incident, and those are cross-cutting across disciplines and across teams. It could be applied in that area.

Participant 7: Say we have a team with multiple vendors, a couple of full-time employees, and a couple of contractors. There's a varying age range as well. With these dynamics, how does this model work? How can we motivate people to come to our out-of-hours team?

Parkinson: It goes back to that earlier question about what motivates people to join an incident, and whether they're willing. Within our company, we would never force somebody to join. The contractors one is interesting actually, with IR35 coming out. It's a no-go for us. It's all permanent staff that are getting involved in learning about how to spot incidents. I think the best advice I have is to ask. Find out what people are thinking in your team and whether they're interested or not, and what it would take for them to go, "I would be up for it." There are probably more maybes rather than noes than you think. You've got to ask to find out. That would be my advice.

Simona: What's your SLA when you intervene in an incident?

Parkinson: We don't really have SLAs at the moment, actually. We're still learning a lot about how we get on with incidents. I think one way of describing it is that there's a lot of goodwill that goes into us resolving incidents at the moment. We're not as professional about this yet, I think, compared to Netflix, but we're also not as big as Netflix. We also have the luxury of building and serving our own products. The membership team, and the content and metadata team, they actually have contracts with third parties that they get paid for. Those are the teams that do have SLAs, and money involved. We're a bit more decoupled from that, which gives us a bit of luxury. It's something that we're working towards having.

Participant 8: In the beginning, you said that everybody tries to fix the situation, but nobody looks at who did what and what happened before this. Wouldn't it be simpler to fix the problem if you knew the reason or the root cause?

Parkinson: That goes into the blameless culture part of incidents, which I think is quite important. The DNS incident is a great example. In that first hour, which I wasn't part of, we actually understood the changes that did happen that might have led to the incident. We still don't concretely know why we lost our domain. Once it was clear from the recent changes, we switched direction totally and became disciplined about focusing on recovery. We understood the situation. We couldn't do anything about that. We knew we needed to recover and restore our data. That was the focus for the whole team. There is an element of having a nice list of changes that have happened recently, to have that context available. It might be better, for example, to fail over the website than roll back the change, depending on how quickly each one is. For us, failing over the website takes a couple of minutes, but rolling back a change, that might be 20 minutes. That's not a great user experience. We lean more towards avoiding changes and using other actions to resolve incidents if we can. That's the understanding that I have.

Participant 9: You said you don't have a rota. How do you make sure that calls get reasonably fairly balanced across the people in the team?

Parkinson: We just maintain a spreadsheet. We work with our operations team, that Sara runs here, to make sure that it's distributed. They helpfully record every time that they called out to somebody and whether or not they've actually answered. We work together to make sure that that's up to date. That just helps balance it. Also, just having a bigger team helps as well for that. A simple approach.

Participant 5: Would you say you have any advice for a greenfield project going live for the first time, where we don't have examples of any incidents that have happened before? How can you prepare people to be on support for that?

Parkinson: Going back to the big incident at the beginning, that was unusual, not something we would ever have thought about happening. We'd not even thought about backing up that data with the help of our vendor. You could use our workshops, for example. A lot of people are starting to talk about incidents that happened internally. You could lean on those a bit, I would suspect, because a lot of incidents are the same. There are a lot of similarities. You just need to apply them to the tools that you're currently using within your team and the actions that are available to your team to help resolve an incident. Combining those two, I think, might get you there.

Moderator: It reminds me of the concept of a pre-mortem. If this were to fail, how would it fail? There's a lot of cool resources online for that.

 


Recorded at:

Sep 10, 2020
