
Better Resilience Adoption through UX



Randall Koutnik goes over three case studies where teams achieved success (and a few that didn't!) by focusing on the human element of engineering tooling. In each one, he looks at a specific UX technique that the team employed to put their company on a path to resilience.


Randall Koutnik has worked everywhere from tiny startups to Netflix, and has taught introductory programming at a bootcamp. He wrote a book on RxJS.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Koutnik: This is the better resilience adoption through user experience talk, where we're going to talk about better resilience adoption through UX. First, I want to talk to you about something that's near and dear to my heart: ice cream. This here is an It's-It. It's an oversized ice cream sandwich that, because we're Americans, we've dipped in chocolate. It's native to the San Francisco Bay Area. Speaking of the San Francisco Bay Area, I used to work at a small little company called Netflix. At Netflix, we had the typical Bay Area kitchen full of snacks and all that, and there were about two granola bars and one leaf of seaweed so they could claim, "We've got healthy snacks." The rest actually tasted good. We had a whole bunch of It's-Its in the freezer. A co-worker of mine named Mike decided to take not one but two of these things, throw them into a blender, and pour whole milk over the top, because at this point, why not? Blend it all together into what he called an afternoon snack. What he didn't get was an afternoon snack. What he in fact got was this: an angry blender and some increasingly soggy cookies. All he had was a blinking red LED at the base of the blender, and that was it. Netflix, as you may know, only hires what we call world-class engineering talent. Mike indeed was world-class engineering talent. He knew the one thing every engineer knows: when you encounter a technical problem, you turn it off and on again. He unplugged the blender and he plugged it back in. For good measure, he pushed every single button on the front. He did not get a milkshake. He continued with that blinking red LED.

As we all know, software is a team sport. There were others of us in the kitchen at the time, and we saw Mike struggling with this blender. One by one, we all came over to help. One by one, we all unplugged the blender, plugged it back in again, and pushed all the buttons on the front. Same result. We put our collective heads, and years of engineering talent, together, and decided to Google the error message. What we didn't find was Blender Overflow. That would have been nice. What we did find was the answer: the lid of the blender was put on improperly and needed to be rotated one quarter turn, so that the arrows on the lid aligned. Thus enlightened, our eyes rose from the phone to see that indeed the lid was misaligned. Not only that, we saw on the wall behind the blender this sign. There we go. It's a nice picture. You can see arrows pointing to other arrows. There's a sentence clearly establishing the problem and how to solve it. Not only that, as you can see there, it's right next to the blender, where you'd expect it. The first lesson you can take from this talk is that engineers don't read documentation.

You didn't need me to come all the way over to London to tell you that. At this point, you've probably got two questions in your mind: who is this guy with the weird Viking helmet, and what does any of this have to do with resilience engineering? I can answer both of those questions. First one: my name is Randall Koutnik. As Nora pointed out, my career resume can be described politely as interesting, and more accurately like this. Observant audience members will notice the first half of my career did not go well. I worked at four separate startups, all of which were catastrophic failures in their own way. I gained a lot of experience, ended up at Netflix, and moved to Slack. I'm now back at startups because, you know, at this point, I've got to make it. This has got to be the one. Along the way, I picked up a bunch of skills in UI engineering, but also in building tools for engineers, which turned out to be a weird and very useful skill set.

What Is Resilience Engineering?

We had a second question, and that second question is, what the heck is resilience engineering? Has anyone heard of resilience engineering before? If you haven't raised your hands, I have questions about why you showed up to this talk. Resilience engineering, at least to me, is building systems that remain functional despite unexpected errors. We're engineers. We build stuff, and hopefully this stuff remains functional. The important part is "despite unexpected errors." In software, there is the weird. We've all experienced the weird before. Raise your hand if this has struck fear into your heart. We know the weird. We know the weird happens. We need to build systems that work with that, systems that still work despite DNS or some random router in Eastern Europe dying. I think there's a little bit of a misconception here, because when I said "systems," I'm willing to bet you all thought about this. We don't rack servers anymore. It's unfortunate. That was really cool back when we did. We have the cloud. We take our software, and we deploy it to the cloud, and that's our system. There's more to it than that. There's our software. There's the language runtime. There's the operating system. There's AWS. There's all the third-party people, who are down on the third floor and would love to sell you services. That's a really complex system. There's an enormous number of things going on there.

That's not the whole system, because we've forgotten the most important part: the people. All of you. Imagine if, tomorrow morning, every engineer in your company woke up and moved out to the woods. They wanted to build their own log cabin and live off the land. They didn't want to write software anymore. How long would your site stay up? An hour? A day? You think you can make it to a week? You're probably wrong, but good on you for being optimistic. When we have outages, when we have problems, we love to blame the humans. That's the only time we think of the humans in the system: "It's human error. It's all those idiots around the blender who didn't read the documentation. It's their fault. We've found the issue." In the heat of the moment, the blender didn't work. In fact, we now have an entire career path specifically around being the human part of the system that keeps the system working and resilient. It's called Site Reliability Engineering. You can have a whole career just doing that.

The System Is Resilient Because of the Humans

Stop thinking about human error and start thinking about it this way: the system is resilient because of the humans in the loop, not despite them. We don't need to replace the humans with AIOps or something like that. We need to augment the humans. How does this change how we think about systems? This is the old way: we've got goals down in the corner, different things. We've got the programs, and our databases, and our tests, and our release engineering suite, and all of that. That's one way of thinking about this. We can add in the humans. For the humans, this is where all the work's getting done: all of the coding, and the monitoring, and the thinking, and the whiteboarding, and the endless meetings, which we may not categorize as work, but we still do them. In the middle, there's this big bold line. This is what's called the line of representation. It's there because we can't just look at a computer and tell what's going on. That would be great. If only I could stare at my laptop, see all the 1s and 0s coming down like The Matrix, and just understand everything that's happening. That isn't true. Instead, we've got monitoring. We've got logs. We've got code. We've got the blinking lights, the AWS console, all the things that tell us what's happening in the system. These may not be accurate, as we've all seen with AWS's status page. These may not show us the whole picture. Because of that, everyone has a different idea of how the system works. You may work in a different part of the company and only know your part of all the 1,500 microservices you run. Or you might exclusively be on the front-end and only know how the front-end works.

Real World Scenarios - Netflix's Atlas

What I do, and what I'd like to share with you today, is how to study that line in between: look at how people interact with the systems, how they learn about the systems, and where the confusion happens; where people misunderstand things, or don't, and why. Let's start talking about real-world scenarios. I used to work at Netflix. It's a QCon talk; I need to mention either Netflix or the "Accelerate" book. At Netflix, I worked on the Atlas team. Atlas was a massive metrics database. What does that mean? It means that we had 7 bajillion servers at Netflix. We used to run AWS entirely out of servers, because an infinitely scalable cloud was not enough. That sounds cool, but it wasn't in the moment. Having that many servers isn't very useful if you don't know what's happening on them. Atlas, as a tool, looked at those servers and kept track of all the system metrics, and the JVM, because you need to keep track of the JVM, and all the business logic, and the sidecars, and all the other microservices, and put it all together into this really big database. Not only that, you could query all this information in the database and do really cool things like: here's my metric, now let's look at the week-over-week numbers, take the second derivative of that, calculate a rolling count over the previous hour, and then alert on this and that. It was pretty impressive. In order to do any of that, you needed to write Atlas Stack Language, which looks something like this. Got it? Easy. You just start off with the alerttest cluster and requestsPerSecond. Then you sum it, and then you duplicate it, and :des it. I worked on that team, and I can't tell you what this query does.
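The slide isn't reproduced in the transcript, but an Atlas Stack Language query of the kind described, reverse-Polish and stack-based, selecting the alerttest cluster's requestsPerSecond, summing, duplicating, and applying double exponential smoothing (:des), looks roughly like this. This is a sketch reconstructed from the open-source Atlas documentation, not the exact query from the talk:

```
nf.cluster,alerttest,:eq,
name,requestsPerSecond,:eq,
:and,
:sum,
:dup,
:des-fast
```

Each comma-separated token either pushes a value onto a stack or applies an operator to what's already there, which is exactly why these queries are so hard to read back.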

What Are The Humans Doing?

From this, I'd like to introduce you to a question that has driven pretty much all of my work over the past couple of years: what are the humans doing? There's a whole bunch of vendors right here, go visit a sponsor booth, who can tell you what the computers are doing. They can give you graphs, and logs, and all sorts of fancy things. What I want to know is what the humans are doing, because that's the important part of the system. I joined this team that had this complicated database. I went around and asked, what do you think of Atlas? Some people were like, "I love Atlas. I can do the double exponential this and that, then calculate." I'm like, "We don't need to talk anymore. I barely passed calculus." I went and talked to the rest of the humans, and I got a different response: "Atlas? I love it. It's huge. It's powerful. You can do all sorts of cool things. I just don't use it." There were a lot of people who either didn't have alerts at all, or used a third-party service for all of their monitoring and alerting despite Netflix providing that service internally. That was because they didn't want to do the double exponential calculus. They just wanted to know how much memory their service was taking up, and they had to write one of these queries to figure that out.

Alert Wizard

Instead, we decided to do something different. We decided to make an alert wizard. Remember software wizards in the glory days of Windows 98, you just pop up and just click next, and then you've got the latest and greatest in viruses. We wanted something that simple but without the virus at the end. Instead of writing this long, complicated query, and you weren't quite sure how it worked, we just wanted a box and a page that said, "Alert me when memory is over 80%." I could have just said, we need to build that. I'm going to go into a corner, and hack away, and then chuck it over the wall. Then everything's going to be great. I'm totally going to get promoted for that. In doing that, I would have committed a sin that many software engineers commit. I wouldn't have built something useful.
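The wizard's whole job, in essence, is to translate one friendly sentence into the stack-language query the engineer would otherwise have written by hand. A minimal sketch of that translation; the function name and the exact query syntax here are illustrative, not Atlas's real API:

```python
def build_alert_query(cluster: str, metric: str, threshold: float) -> str:
    """Translate 'Alert me when <metric> is over <threshold>%' into a
    stack-language-style alert expression (illustrative syntax only)."""
    return (
        f"nf.cluster,{cluster},:eq,"
        f"name,{metric},:eq,:and,"
        f":avg,{threshold},:gt"
    )

# "Alert me when memory is over 80%" for a hypothetical service
query = build_alert_query("myservice", "memoryUsedPercent", 80)
```

The point of the wizard is that the user only ever sees the sentence; the generated expression stays below the line.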

Instead, I grabbed a conference room, spent about 30 minutes playing with the whiteboard, and came up with something that looked like this. This is not going to win me any design awards. That's why I'm not presenting at design conferences. Once I'd finished, I said, that looks about right. Then I went, and this is absolutely critical, and took a coffee break. You can take a tea break too, don't worry; that's not critical to the process. I went into the kitchen and waited for my next victim. By and by, an engineer came along, and I said, "Do you want to help me with something?" It turns out he did. We both grabbed coffee and walked into the conference room. I said, "Can you just start at the upper left corner and go through, and try to set up some alerts with this new product I'm trying to build?" Click. That's not a sound you want to hear when demoing a product. "What do you mean?" He said, "I'm on the security team and we have a separate AWS account. Where do I set that?" "Right here," and I pulled out the whiteboard marker. Thirty seconds later, you could set your AWS account. The fastest refactor I've ever done. There we go. He got it. He was able to finish the flow. If I hadn't talked to him, I wouldn't have known about the separate security AWS account. I would have built it and I would have released it. He would have gone through, gotten to that step, and said, "I'll get to this later." How many of you have a browser full of tabs you are definitely going to get to later? It's ok, you can admit it. That would have meant this guy from security wouldn't have gotten alerts. The product would have failed. I was able to figure that out just by playing around on a whiteboard. I repeated this process a couple of times. Over that afternoon, I learned so much about how people used, or didn't use, the alerting system, and what they needed. I took all the lessons from that, turned them into a UI, and completely faked the entire backend. I didn't write a line of Java. We had something.

Then we started running regular user tests. Every Friday, we'd pull a couple of people in, call up the latest version of the alert wizard, put a laptop in front of them, and say, "Set up some alerts." Then we'd shut up. Probably the most terrifying thing you can do as a software engineer is to put your product in front of someone and say nothing. We didn't explain it to them, because we couldn't sit behind literally everyone who would use our product and explain it. We learned some very helpful things. One of the smartest software engineers I know got almost all the way through the process. He had set everything up. He was right at the last step, and he sat there for a minute and a half, which is an eternity when you're sitting there in silence, until finally he told me, "I can't find the save button." This wasn't his fault. He's a brilliant engineer. The problem was that I'd put the save button in the corner, while all of the other buttons in the application were in the center. This was my fault: quick fix. If I hadn't done that test, the product definitely would have gone into the "later" bucket, and he wouldn't have set up alerts. This is one of the things I think is absolutely critical for building any tool, and it's especially true if you're building internal resilience tooling, because your users are all right there. Just take a coffee break. The first lesson is to talk to the humans before you start coding. I really strongly recommend doing some user interview tests. We just did them weekly: Friday afternoon, grab a beer, try our new product. Netflix, as you probably know, is a fairly operationally mature company. There's a lot of smart engineers doing a lot of smart engineering. What happens when that's not the case? Now you know where I got the helmet.


Norse was a startup. It was the fourth tire fire on my resume, and the final one, thankfully. At Norse, we did something called cybersecurity threat intelligence, which is a really fancy way of saying we let the bad guys attack us and took notes. We'd sell the notes to whoever wanted to hear about what the latest attack was, or which attacks were so old they'd become retro and were cool again. Probably the most forward-facing part of our product was the threat map. Has anyone seen something like this? A couple of hands. What we showed there was live attacks. I wish it were still up to show you, but I don't have that anymore. You could see all these lasers pinging around as mass port scans from college students in China and all that went by. The problem was that the code base was awful. It was originally developed by a contractor who just threw it together and handed it to us. Then it jumped from team to team as people would add a little bit of a feature and go, "Your problem now," which is of course the best way to develop a code base. In fact, there was a snippet of JavaScript in there that would refresh the page every five minutes, because otherwise it would freeze and become unusable over time. One of the themes of this talk is that memory leaks are bad. This finally landed at my feet because I had just joined, and the company made the mistake of deciding I could totally take that on as a tech lead. I had two people reporting to me. Our job was to take this trash fire of a code base, turn it into something good, and add a whole bunch of features, in two months. Honestly, we did. That was some of the best engineering I've done in my life. You're not here to hear about that, though. What you want to hear about is what happened when it came time to deploy this.

Red Flags in Deployment

There were a few red flags. Let's go through them. The first one we've already covered: it was a massive change. We had refactored this application from top to bottom. It was so much better now. Literally every part of it was different. Secondly, we were on non-traditional infrastructure. This was, unfortunately, 2016. The company thought the cloud was a hipster fad that would blow over eventually, so we were entirely on-premises, bare metal. Doubling down on that, we thought Linux wasn't cool enough, so we ran BSD instead. How many infrastructure engineers out there do you think are experts in BSD? It's a very small number. To make matters worse, we didn't have an infrastructure expert handling our infrastructure. We had me. The person in charge of doing all the infrastructure work was a UI engineer by trade. How many UI engineers out there do you think are experts in BSD? I can tell you, I am not one of them. Following up on that, we were using new and unproven technology, at least new to us. It was 2016, and the new, cool, cutting-edge thing was Ansible. This was the first time we'd ever deployed with Ansible. Previously, deploys were just whatever the team happened to do: "Jimmy knows what he's doing. Go talk to him." This was the first time we'd tried to put structure around deploys. Next up, there was no possibility of a rollback. As you probably know, a rollback is when you go back to a known good state. We had no idea what state the servers were in; so many people had SSH'ed in and installed this package or that package just to get things running. We weren't even sure the code running on the servers was the code we had in Git. We had to fix forward. Finally, this was on a Friday afternoon. I know those of you who are extremely online and have active Twitter accounts know the argument that we should deploy on Fridays: we should be confident enough in our deployment process that it will catch problems and won't wake us up on a Friday night. Though I think, with the five other things on this list, even Charity Majors would probably admit a Friday afternoon deploy was a bad idea here.

With all of these things on the table, you might ask, why on earth did you press that button? Why did you go forward with this? There was one mitigating factor that I think made up for some, if not all, of this: this was the last day we'd be paid. Two days before, we'd had an all-hands that said, "We ran out of money. Don't worry, the check is in the mail. The investors are totally on board." They were not on board. With a deploy, you always say, worst case scenario, we'll lose our jobs. That had already happened, so why not? Let's take a look at those red flags again. It wasn't a complete catastrophe, but why? Take a look at these six items. I want you all to think: which of these caused the catastrophe?

Participants: All of them.


Koutnik: You're all wrong. It was CloudFlare. Apparently, this was a trick question; I didn't list CloudFlare as an option. As it turns out, a while back, the CTO of Norse had put CloudFlare in front of the site and didn't bother to tell anyone. We deployed, CloudFlare happily cached what it had been told to cache, which was the old site, the new site collided with the old site, and everything fell apart. We made the news. That's always impressive. Like, "Mom, I'm a real engineer." There's the map, too. Indeed, Norse Corp. was imploding and everything was falling apart.

Lessons Learned From Deployment

What can we learn from this? You have this above-the-line and below-the-line thinking. Below the line, you have all of the software and all of the computers running. Above the line, you have all of the people. Everyone has a different conception of how the software below the line works. Had Norse had money and had we all come in on Monday to work, we probably would have had a postmortem. When I say postmortem, someone's like, "I'll do the postmortem." They stand up in front of the meeting, and they go: the incident started at this time, and the incident ended at this time, and involved these services. I know some hipster startups have nap rooms; the rest of us just have postmortem meetings. Instead of just listing off a static timeline of everything that happened, look at postmortems as a way of asking: what surprised you? What was interesting? What was different? Where did our conception of how the system worked differ from reality? Because if we had known what we were doing, it wouldn't have gone down. I hesitate to call what we were doing at Norse release engineering, because I think that besmirches every release engineer out there. What does this look like when it goes right? What does this above-the-line thinking about the humans in the system look like when you apply it to release engineering at a company that has customers?

Slack Releases

For that, we need to go to my previous job at Slack. Slack does IRC, but in the cloud. I think it's really cool. They're going to take off someday. Before I joined, releases were really slow. Nobody likes slow releases. Nobody likes watching a progress bar slowly creep across the screen while you sit there thinking, there are so many things I could be doing right now that are better. The release engineering team had sat down and done a lot of work to make releases fast, and they were. It was pretty good. You'd get your code merged into master, push a button, and 60 seconds later your code would be running on every server at Slack. That's a lot of servers, which is really effective. Sixty seconds later, your code, for better or worse, would be running everywhere. Let's go back to that question I asked: what are the humans doing? As it turns out, not much. What people were doing was clicking the button, and as it deployed, they'd go start doing something else. They had OKRs to work on, and KPIs, and TLAs, and SLOs, and all sorts of management speak for "continue working." They didn't have time to babysit deploy after deploy. In fact, one story I heard was about someone who was out at lunch, pulled out their phone, pressed deploy, put their phone away, and continued eating lunch as all of Slack burned down behind them. A bunch of you are suddenly realizing why Slack was so flaky. Once again, "Mom, I made the news."

The Deploy Commander

Something had to change. I joined the release engineering team at Slack a week before this article was posted. Out of the frying pan, into the fire. Something needed to change. We could have just come up with a policy that said, everyone, only deploy good code; don't deploy bugs. That would not have worked. Instead, we needed to think about this above-the-line, below-the-line picture. If you look above the line for the deploy process at Slack, where are the humans? They aren't there. They're off doing other things. We needed someone in the loop to handle the weird. We came up with a new role called the deploy commander. A lot of companies have a role like this: someone who sits there for four hours and just babysits deploys. It is not a fancy, great job. Honestly, to the credit of the humans at Slack, a lot of people volunteered for this, volunteered to sit there and watch graphs for four hours straight. This worked. This was good. We were able to have someone in the loop as we deployed. We didn't deploy to 100% immediately anymore, either. We were able to say, let's deploy a little bit and have a human watch. When things get weird, they can go, "Jane was working on that feature. Let me ping Jane and make sure that she knows what's happening here." Then, ok, that's normal, we can move forward. Or maybe it's not normal, and I hit the big red button. This worked, and things improved a lot. Customer service actually reached out and said, "Thank you. For the first time, we've had no tickets in the morning, and we're able to finally relax." We had finally put a human in the loop.
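The shape of that process, deploy a little, let a human look at the graphs, then widen the rollout or stop, can be sketched as a simple staged rollout gate. The stage percentages and names here are invented for illustration, not Slack's actual tooling:

```python
STAGES = [1, 10, 50, 100]  # percent of the fleet receiving the new build

def next_stage(current_pct: int, commander_approves: bool) -> int:
    """Advance the rollout one stage if the deploy commander signs off;
    otherwise hit the big red button and drop back to 0%."""
    if not commander_approves:
        return 0  # roll back / halt the deploy
    later = [s for s in STAGES if s > current_pct]
    return later[0] if later else current_pct  # already fully rolled out
```

The crucial part isn't the percentages; it's that a human watches at each stage and decides whether what they're seeing is normal before the rollout widens.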

Things weren't perfect, though. Deploying a product like Slack involves a lot of moving parts. We wrote up this huge document detailing every single thing you needed to do every single time you deployed. It looked something like this. Let's all read that for a quick second. What was your favorite part? Mine's the fourth page. What happened was that some diligent humans signed up for the deploy commander rotation and read the whole thing. Good for them. They went through, deploy commanded for four hours, and then went back to their regular day job. Then, remember I said a lot of people signed up for this, 2, 3, 4 weeks later, they'd come back. Can a software system change a lot in four weeks? No, nobody makes any changes in a whole month. Of course, there were a whole bunch of changes during those four weeks. They'd come back, and they wouldn't read the document again: "I read that last time." They'd go through, and things did not work out.

We had one outage with search where new messages weren't being indexed. You could search for messages from before the bug, but you couldn't find anything from after it. The deploy commander went and tested search: they searched for an old message, which they found, and said, "This looks good to me. Let's push it out." It went out, and then everyone on Twitter got mad, as usual. This time they were mad at us. They had tested search. Had they tested search exactly according to the document? No, and the document could have been better. Instead, what we decided is that maybe humans aren't great at repetitive, boring tasks. Is there anything we have that's really good at repetitive, boring tasks? Computers. We needed to take all of this and put it into a computer. What we ended up with was just checklists. Checklists are awesome. NASA does checklists. You should do checklists. What that meant was the deploy commander's job was no longer to read and nearly memorize a huge document. Their job was to handle the weird. They could sit down and see what had changed over the past four weeks; it would be a checkbox, so they'd know about it. They could go through and say, I tested everything according to the latest standards. Teams would come to us and say, we want to change the standards. Great; next deploy, the changes go out. This meant no one went, "I was supposed to test something two steps ago. Do we need to undo it or redo it? Is anyone mad at me?" Instead, the humans handled the weird, and computers handled everything else. There we have above-the-line, below-the-line thinking. I really hope that you can take this back to your company and build more resilient systems by making sure that the humans are involved and know what's happening below the line.
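The idea, let the computer hold the steps so the human can hold the weird, can be sketched as a checklist generated fresh for each deploy. The item texts and class design here are invented for illustration, not Slack's real system:

```python
from dataclasses import dataclass, field

@dataclass
class Checklist:
    """A per-deploy checklist: teams register checks, the deploy
    commander ticks them off, and promotion is blocked until all done."""
    items: dict = field(default_factory=dict)  # description -> done?

    def register(self, description: str) -> None:
        self.items[description] = False

    def complete(self, description: str) -> None:
        self.items[description] = True

    def remaining(self) -> list:
        return [d for d, done in self.items.items() if not done]

    def ready_to_promote(self) -> bool:
        return not self.remaining()

checklist = Checklist()
# The search outage above is exactly why an item would be worded this way:
checklist.register("Search: send a NEW message, then search for it")
checklist.register("Login: sign in with a test account")
checklist.complete("Login: sign in with a test account")
# ready_to_promote() stays False until the search check is also ticked
```

Because the checklist is regenerated per deploy, a commander returning after four weeks sees every new check automatically instead of relying on a month-old memory of a huge document.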

I'd like to end with this picture of my blender at home. You'll notice there's a notch in the lid. There's only one way to put the lid on this blender. I don't need a sign on my wall telling me how to use my own blender. When you're building software systems, remember to build the notch, and above all, ask: what are the humans doing?

Questions and Answers

Participant 1: I work in a team that's got quite a few big components, and we do weekly releases. We have release commanders who go through the release docs. I can definitely see the value of checklists; I've done that in jobs gone past, and it's really good. How do you mitigate when something goes wrong? We have an FAQ section. Could that also be moved into a more human-friendly form for tackling problems, instead of panicking and putting a message out on Slack?

Koutnik: Sending messages on Slack, you can panic and do that. I think one of the key qualities of a good deploy commander for us was just knowing people and knowing when to reach out. I think humans are great. We're really good at identifying problems and jumping in to fix them. You see these AIOps products, and really, AI is just a list of things to handle situations you already knew about. That's not what we're tackling. That's not resilience engineering; resilience is about unexpected errors. Being able to clearly say, these are the changes that are going out, and these are the people who are responsible for those changes, is super helpful. I know who to contact to ask, is this normal? Because I might see this huge spike, and someone will say, "That happens every time."

Moderator: Checklists are good for training purposes, but after a while, you probably don't even need the checklist anymore. Relying on it too much, step by step, is when you get into problems; what matters is trusting the human to go a step beyond the checklist.

Participant 2: How much better do you think you do now if you went back to some of these car wreck stories you told us about and were faced with the same situation again?

Koutnik: To answer that, I'm going to talk about video games. I used to play Overwatch a lot. It's a 6v6 first-person shooter. I was not very good at it. I'd ask my friends who were really good at it; I'd show them, here's this situation, how should I have handled it? All of a sudden, three people were coming at me and I didn't have any backup. They'd say, let me rewind the video a bit. A minute ago, you went that way; you were never going to be able to handle the situation over here. You need to understand all the things that led up to that situation. There was a whole bunch of decisions I had made throughout the entire game that led up to it. You could never say, "This one thing is the issue." I think one of the changes I would have made is never joining those companies in the first place. To be honest, there were a couple of giant red flags. It was early in my career, and I hadn't learned to identify them. I explicitly give this advice out now: if you're new to the industry, do not join a small startup. There is no mentorship available. Not that startups are bad; they just don't have time to provide that mentorship, and you will grow slower. I grew slower because of that. I wasn't able to handle all of the things that went wrong. Now that I'm further on in my career, what would I have done differently at those companies? Honestly, one of the biggest problems there was people, because computers are easy. There were a lot of situations that I handled wrong, because I thought I was right, and I thought being right was all that was needed. I didn't know how to build consensus. I didn't know how to talk to people and meet their needs, rather than just going, my way is the right way.

Participant 3: Obviously, everyone is a user. From this perspective, it felt more like Dev experience and Ops experience, because those were internal users. What I didn't get the answer to, and the original question that brought me to this talk, was: I was hoping to see some techniques for how to conceal disasters from the public using user experience.

Koutnik: Using user experience, I think there are a lot of wonderful things in that area. What I was hoping to do is start a conversation about how to make things usable so that people in your company build more resilient products, because we've all worked in feature factories where it's just, JIRA, go. Making things easier can help increase adoption. As far as people using your product, of course, when things die, you can just have a blank page, everything's dead. There are a lot of great tools you can use. I think, honestly, Slack is a great example in this area of dealing with there being no connection, a slow connection, or a flaky connection. The Slack client can figure out each situation and will respond differently. If you have the Slack client open and you don't have any Wi-Fi, and for whatever reason your internet is dead, it will tell you that. It says, "We're trying to reconnect, there's a problem." You can still read messages thanks to caching, so it's still as useful as it can be. The same with a flaky connection: there are fallback systems where, if the WebSocket doesn't work, let's just try polling, and just keep trying through that flaky connection on a cell phone, or whatever. If you want to maybe not hide disasters, because I'm not a huge fan of, everything's fine here, nothing to see here, move along: you can have that progressive degradation of your product. It's not either up or down; you can have issues and anticipate those issues in your product. Because, believe me, Wi-Fi is more like Li-Fi sometimes.
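The degradation pattern described here, try the best transport first, fall back to a cheaper one, and stay readable from cache when fully offline, can be sketched roughly as follows. This is a minimal illustration, not Slack's actual client code; the `DegradingClient` class and its transport callables are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DegradingClient:
    """Illustrative client that degrades through an ordered list of transports."""
    # Each transport is a callable returning True if it connected.
    transports: List[Callable[[], bool]]
    cached_messages: List[str] = field(default_factory=list)
    status: str = "offline"

    def connect(self) -> str:
        # Try transports in preference order (e.g. websocket, then polling).
        for attempt in self.transports:
            if attempt():
                self.status = attempt.__name__
                return self.status
        # Every transport failed: tell the user instead of showing a blank page.
        self.status = "offline"
        return self.status

    def read(self) -> List[str]:
        # Cached messages stay readable even while offline.
        return self.cached_messages
```

The key design point is that "offline" is an explicit, first-class state the UI can explain, rather than an error the user has to infer from a dead screen.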

Participant 4: We use Slack for release management; whenever something goes wrong, we notify people through Slack, to the point that one or two years ago, when Slack was down, we stopped the release process. You worked at Slack; if Slack didn't work, what did you use to communicate with other people?

Koutnik: A fun fact about Slack's release process: the first servers to get new code are the servers that run Slack's own internal Slack. Any problems are our problems first, and hopefully we notice them before all of you do. If that goes down, we already have a Zoom call going for deploys. It's always on; we don't need to start a Zoom call, it's just there. If you have code going out, you should be on that Zoom call, so everyone involved is immediately present. For really big, bad outages, Google Hangouts, although that's not as good as Slack.

Moderator: Laura McGuire is talking later today in this track about the cost of coordination during incidents, because a lot of times when your primary communication mechanism is down, you actually have no idea where to talk, which elongates the incident.

How many of you ask this question in your incident reviews or your postmortems? What were the humans doing? What were the engineers at the company doing? Does anyone feel like they do that well in their postmortems? Not a lot of hands. It's a big part of the system, as Randall pointed out.

Recorded at: Oct 06, 2020