
Graceful Degradation as a Feature



Lorne Kligerman talks about graceful degradation as an engineering goal which can be confidently tested with Chaos Engineering. By purposely causing failure of one service at a time in a controlled environment, one can safely observe the effect on the end user, whether that’s on a laptop browser, a mobile app, or the result of an API call.


Lorne Kligerman currently leads the product team at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. He last worked at Google Cloud as a Product Manager on App Engine, empowering developers to build applications on a fully managed and resilient platform.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Kligerman: I work as the director of products at Gremlin, and I am really excited to talk about chaos engineering and building these features into the product, into your mindset, into the workflow from the beginning. I'll start off with a little bit of a story. A couple of winters ago I was out snowboarding in California. I went to Lake Tahoe with my wife and some good friends. We were at the top of the hill, really excited to get there. There was 8 to 10 feet of powder, a big storm. It was everyone's dream to ski down these hills with that powder, see what it was going to be like.

We get to the top of the hill, we've got some good photo bombers in the back there, everyone's really amped up. We have different ability levels, and I'm feeling a little bit ambitious. It's the first run of the day, let's see what we can do. I say, "I'm going to go through some trees. I'm going to see what I can do. I think I'll be ok. I'll take the more adventurous route." Other members of our group weren't feeling that yet, so they took the green, even the blue runs. We said, "Great. We'll meet at the bottom and we'll tell everyone about how it went. We'll work up to a more adventurous run together."

I head off on my own and I start making my way down. Everything's going well. Of course, halfway down through the trees, I catch an edge and I get stuck. I get buried in snow. I don't know if anyone's been buried in snow before; this was my first time. It's pretty interesting - you put your arms down and the snow goes right up to your shoulders. You're kind of stuck. You start to panic a little bit and then you say, "You know what? I'm going to survive. I can breathe, I'm good to go." I pull my phone out and I can see a bar or two of service. I send a message to the group to say, "I'm good, I'm all right. I'll be down in about 10 minutes. I've just got to unbury myself." After I hit send, I put my phone back in my pocket and keep going.

I get to the bottom of the hill and I go to the lift where I'm supposed to be. I see a lot of people with panic on their faces. I see one of my friends going to talk to the lift operator. I see my wife looking for a ski patroller, and I'm all excited because this was actually pretty fun - and everyone looks at me like they're really mad, really upset. I'm, "Well, what's going on? I let you know, I told you I was ok. I was going to get to the bottom. Everything's great." I pull my phone out of my pocket and of course I get this message: "message not sent, touch to retry." That's not very helpful, is it? Thanks for letting me know. It probably didn't let me know right away. When did it let me know? I'm not really sure. I thought there was a bit of reception up there, and when I looked at my phone at this point there was a decent amount of reception at the bottom, yet that message was still just sitting there, ready to go. After talking everyone back up, convincing them everything was still ok and that we should do it again, back up we go.

We Expect Technology to Just Work

We really expect technology to just work, and why wouldn't we? However, we do live in a world where things inevitably go wrong. Technology is ingrained in all aspects of our lives. Whether you're an engineer working on an API or relying on an API, whether that's an internal API, an external API, whether you're using enterprise products to get your job done, or whether you're using a consumer app on your phone, on your watch to listen to music, to talk to your friends, to talk to your family, there are all these different aspects of technology that are in our day-to-day lives and we expect it to just work.

In the eCommerce space, we're quite familiar with this - quite familiar with things not working. There are days when things go wrong more often, and that typically happens when you most need the technology to work, especially in eCommerce. These things are happening more and more often, and not only is it frustrating, it's costing a lot of money - hundreds, thousands, hundreds of thousands, millions of dollars.

There are a few headlines here, but I'm sure we've all heard of the recent register outage at Target. I think it was about a two-hour outage and it cost the company about $50 million. It's not just unfortunate for the company, but for all these people trying to purchase what they needed - whether that's household supplies, pet food, some sort of medication, or that TV that was on sale that they really wanted to get before it doubles in price again. The frustration is starting to bubble over.

That's getting frustrating. There's a lot of friction in play, but it's not necessarily interfering with how we're getting by - though that can start to happen as well. Think about how finance is moving into the cloud, into technology; every company is really becoming a technology-first company, which is awesome. I think it's amazing, and I think that's where we need to go in order to build great applications that we rely on for the general public, for our business, and the whole workflow in between. But how would you feel if you got a letter in the mail telling you that your house had been foreclosed on, when you made all your payments on time? That's probably a little bit startling. What if you can't get money out of the bank to pay someone back, or to buy an item, or to go to that restaurant that only accepts cash, and now you're stuck? You've got to run to an ATM, and it still doesn't work. All these sorts of things. This is only going to get worse, not necessarily better.

What happens when we start relying on services with the government, like healthcare, like retirement? What happens when the DMV is entirely hosted online so you don't have to go to that place, which we all love? What happens when that starts to break down and we can't renew our driver's license? All these things are affecting our day-to-day lives more and more. Something even more pertinent today is that we're hearing about airline incidents. We're hearing about ticketing counters going down, check-ins not being available. Then something even more significant with the 737 Max issue; it's becoming very serious. Malfunctioning software - where we haven't thought about how it's going to fail and what we can do when it does fail - is going to start to become a tragedy.

When lives are on the line, it's more than an inconvenience. It's that much more important to make sure our systems will continue to work. I don't think I have to keep banging on this. It's somewhat obvious now that technology is fragile; it's going to break. This is inherent in our complex systems and how we're building them. But when it does break, we really shouldn't have to notice. We shouldn't have to know right away. Maybe if some catastrophic failure does happen - the response doesn't come back, the data is never coming back - we should tell our users about it. But how often is the failure really catastrophic? How often can we not handle it gracefully and give the customer a good experience?

We really need to plan ahead to keep our users happy. We can't do this after the fact. We can't do it once we've built the feature, built the product, and gotten it released. We have an incident, we have an outage, we see all these errors piling up, and we figure out what happened. We fix the root cause, we push a fix, everything's good to go, and then you say, "We could do a lot better with our user experience. We could do better with the workflow. We could do better with the API errors that are coming back." Then you've got to go back and fix that and design that, and talk to your product people and your design people and keep working on it, which is important. But if we can get ahead of this, then we will be in a much better place.

There are several studies showing that 75% or more of apps on mobile phones are used only once and then never again. That's across consumer, enterprise, everything really. You don't have very long to gain that trust from your customer, from your user. As soon as you burn that trust, you're gone. They're going to go look for an alternative to your product.

Why Are Failures So Common?

Why are these failures so common? Why are they happening so frequently? I think we've been gaining a lot of really good insight at the conference around different types of architecture, the amount of complexity, the different ways that that failure could occur. I want to touch on a few of them that I've seen that have been very relevant to me and my day-to-day.

A year and a half ago, I went on this big bike trip in Europe. I try to cycle a bunch; it helps to keep me fit and also happy when I'm sitting at my computer all day. I trained for this trip for about three or four months, something I'd never done before, to go bike in the French Alps. I took a flight through London Heathrow to Geneva. At the carousel, off comes my duffle bag, but no bike box. All these other people get their bikes; I have no bike. This is a problem. I've got to go on a bike trip, and there are two more days until it starts.

I stand there, and I go talk to the nice lady who is helping me try to track my bike and my box down. After a while I say, "What can we do about this?" I peer around at her terminal, and this is what she's working with. She's typing these three-letter codes. I'm not sure if anyone's familiar with the backend systems at airlines; I know only enough from watching these people try to figure things out, but it's pretty antiquated. It's pretty legacy. These monolithic applications are under the hood, and on top of them are these flashy UIs, these nice apps we all have on our phones that tell us when to board the plane and when our bag should be picked up. Not that it could be picked up, but that it should be picked up.

I said, "What can we do? This is pretty urgent." She said, "I can tell Heathrow that there's a bike missing, we should send it." I'm, "Everybody at Heathrow?" She said, "Yes, I'll tell the whole Heathrow ground crew to look for a bike box." I said, "Great, then we'll know? We'll know that something's coming back." She said, "No, there's nothing in the software that lets them do that. There's no way for them to message us." Well, that's also great, but not great. There's no solution to this problem; it's a very big legacy system. I've talked to people who work on this stuff, and they say it would really be like changing the engine mid-flight to swap in a new system. It's very hard, and we've come a long way from where the technology was before to where it is now.

Testing is very important. We all know we can write automated unit tests, integration tests, end-to-end tests. There are really good applications these days to help us test our UI that can automatically click on things and make sure we have the right user flow. But this is really testing what we tell it to test. It's telling us that if we click a button, it will do what we expect it to do; that our integrations, our third-party APIs, are responding as we expect them to. It's testing those base cases and some of the edge cases that we've thought of, but not necessarily everything, because that's somewhat impossible. It's not feasible to know everything that's going to happen. Failure testing is something that helps you get there. It helps you understand what is going to break when a real failure actually occurs.
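To make that distinction concrete, here is a minimal Python sketch of the same function covered by a functional test and by a failure test. The recommendations client, function name, and fallback list are hypothetical, not from the talk - the point is only the shape of the two tests.

```python
import socket
from unittest import mock


def fetch_recommendations(client):
    """Return recommended items, or a safe default if the dependency fails."""
    try:
        return client.get_recommendations()
    except (socket.timeout, ConnectionError):
        # Graceful degradation: a static fallback instead of an error page.
        return ["popular-item-1", "popular-item-2"]


# Functional test: the happy path we usually cover.
happy_client = mock.Mock()
happy_client.get_recommendations.return_value = ["tailored-item"]
assert fetch_recommendations(happy_client) == ["tailored-item"]

# Failure test: force the dependency to time out and check that the
# user still gets something useful back.
sad_client = mock.Mock()
sad_client.get_recommendations.side_effect = socket.timeout()
assert fetch_recommendations(sad_client) == ["popular-item-1", "popular-item-2"]
```

The failure test is the half that usually goes missing: it asserts on behavior under breakage, not just on the happy path.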

Of course, there's the complexity of the scale that we're working with. I think there was a really good talk the other day about microservices, and a few different talks about what best practices there are today and how to implement them. That's not something I need to convince people of; that's how things are done today. Things are also moving in more complex directions with containerization, orchestration platforms like Kubernetes, and now service meshes coming in. Everything is changing so rapidly that the complexity of everything talking to each other is just growing at an enormous rate. All these systems are very fragile, and while there might be one little error case here and there, it all propagates. It all builds on itself and becomes the state we see in all these incidents at a very high rate, and we're having to deal with them on an hour-by-hour, day-by-day, week-by-week basis.

Our co-founders at Gremlin worked at Amazon and Netflix. We like to call these the microservice death balls. These are from about seven years ago; they are real. They show a tiny portion of what those architectures actually look like today - today's versions wouldn't be an interesting thing to look at, they would just be a blob on the screen. I worked previously at Google on the cloud product, specifically on App Engine, and it was the same idea. We relied on hundreds of dependencies within the company. You might not work at a massive company - I'm at a smaller company now and I quite enjoy the speed we can work at - but your system still looks a lot more like this than like a monolithic environment from years past.

How are we still able to delight with all this complexity that is also bound to fail? One more perspective - zoom even farther back and the internet is really just one of these systems. It's massively large, and it's super important that everything connects well and there isn't a breakage halfway through. These are really no longer the days when a failure is just an inconvenience. It's now really critical to our daily lives that all of this works as designed.

What Can We Do about It?

What can we do about it? How can we get better? This really comes down to designing for failure - the graceful degradation of your systems. If we don't start at the beginning, thinking of this as a culture and a practice, we'll never be in a good place. If we're just dealing with the outage and figuring out why - and we've already heard talks today about outages and incidents, how to deal with them, how best to handle them, how to learn from them - that's going to continue and it's probably going to get worse. How do we get better?

A couple weeks ago I was visiting a couple of friends in Denver, Colorado. I have this great perk in my job where I can work remotely, so I took a long weekend, worked on the Friday, and loaded up my laptop to run a team meeting, working as the product manager at Gremlin. I loaded the meeting up, everything seemed fine, the internet connection was great. I start the meeting and I've got 10 other faces on the screen. About five minutes in, the inevitable happens. Everyone's face starts to freeze, people's voices start to get choppy. I get all these pings on Slack saying, "We can't hear you. We can't see you. What's going on? Maybe you should do this. Maybe you should do that." I start panicking. I mute to see if there's static on the other end. What's going on? I unmute again.

Quick click - that "hide your video" button. Everyone's looking for that button right away. Does that help? Does it not help? Someone else pings me and says, "I think your machine is really overheating. You should probably close all your tabs." Who knows what the problem is? You're panicking. I'm getting in the way of my team doing good work. I'm getting in the way of this meeting. And what happens next? I get a pop-up asking me how I'm doing. I'm not very happy, and a thumbs up doesn't quite capture a not-very-happy situation.

What could we have done better here? How could the design and the implementation have been done in a better way? Why is it my responsibility as a user to turn my video off? Why shouldn't it just happen? I don't think there's even a way to turn other people's video off. What is the key feature that needs to work for this meeting to take place? It's really around voices. I need to hear what people are saying and they need to hear me. Why can't we just degrade into that state and not interrupt what everybody else is doing? It's this kind of mindset that will help us build better applications.

Designing for Failure

What is our key user story, that feature that is most important to the user? This is not just the product manager's job. It's not just a designer's job, it's not just an engineer's job. It's really everybody's job. I can define a feature as a product manager, as a user story. A designer will design a cool way to interact with it in the UI. An engineer is going to implement it, and it will involve API calls. It'll involve timeouts, it'll involve all these sorts of things, all these tests. It's really everyone's job to think about at what point that key feature needs to work.

This is a question that every person, in their role, should ask: if everything were to fail, what's the single most important thing that needs to work? Do I want to buy that airline ticket? Do I need to enter my frequent flyer number? Do I need to select my seat? The same goes for the video call in your productivity app. You know what needs to work - and what is ok if it doesn't work? A lot is actually ok if it doesn't work, if you think about it, if you walk through your use cases.

While I was in Colorado I lifted up my phone and saw that the game was about to start and said, "Great." However, we had made a really nice reservation at a restaurant that we had to get to, and we were going to go and everything was great. Halfway through, I didn't know what was happening with the score. Pulled up my phone and this is all I get. I'm, "Hold on a second. That really good story you were telling me, just hold it because I need to know. It's the third quarter, what's going to happen?" This is again, a good example where there was some connectivity and not the best. I had opened my phone probably five times in the last hour, and there was a good signal. I wanted to know what happened. Some of this could be happening in the background.

Also, spinning forever is not great. Loading screens really are not graceful. What does a loading screen tell you? What's it for? It's there to tell you that something is coming. When is it coming? Loading bars, they're a little bit better. I could argue that they're also not great. That bar sometimes moves really slowly, really fast, really slowly. That really also doesn't make us very happy. Then sometimes it gets to the end and nothing happens or it crashes or the inevitable it starts again. Is that very helpful? Not really.

Inject Failure

Once we've thought about how to design for failure, we need to test it. We need to make sure that our design is actually implemented and in place. How do we do that? We really need to see that that failure has happened. This is done by injecting that failure, by breaking things on purpose. This gets us into the concept of chaos engineering, which really are these thoughtfully planned out experiments designed to reveal the weaknesses in your system. Big companies like Google, Amazon, Netflix, and many others have been doing this for a long time. A lot of other engineers at different companies, small companies, medium size, are really starting to practice this. They’re starting to learn about it and put it into place.

We have a long way to go. We have a lot of work to do. It's not just about causing the failure to see what happens. It's about thinking about what to expect before it happens, and what did I design to happen when it does take place? There's more than just running that experiment, running that attack. There's a lot of pre-work to think about.

Getting into a little bit more detail, injecting the failure is something to really do one service at a time. Start with that initial test, a pre-production environment, one server, one instance, one service, one dependency. Make sure that critical functionality is still in place. You can still buy that ticket, I can still send that text message, I can still select my seat, I can still listen to that music, the stream of that one song is still in place. Then start breaking something else and systematically go through the different services in your system.
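That systematic loop - break one service, check the critical path, restore, move on - can be sketched in a few lines of Python. The service names, the `break_service` context manager, and the `critical_path_ok` check are hypothetical stand-ins for real chaos tooling and real health checks.

```python
from contextlib import contextmanager

SERVICES = ["auth", "recommendations", "payments", "search"]

broken = set()  # which services we are currently failing on purpose


@contextmanager
def break_service(name):
    """Simulate breaking one service, e.g. by black-holing its traffic."""
    broken.add(name)
    try:
        yield
    finally:
        broken.discard(name)  # always restore, even if the check fails


def critical_path_ok():
    # The one thing that must keep working: buying a ticket.
    # In this toy model, only "payments" is truly critical to that path.
    return "payments" not in broken


results = {}
for service in SERVICES:
    with break_service(service):
        results[service] = critical_path_ok()

print(results)
```

A `False` entry tells you which single-service failure takes down the critical functionality - that's the dependency to guard with a fallback before running the same experiment in production.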

There are a lot of failure modes that you can inject, that you can use to test with. These are just a few that I find quite interesting and quite good to start with. First, errors - errors sound simple. What happens when you get a 400 back? A 401? A 500, a 503? You can make up your own error codes. There are some that are supposed to be standard, but are they actually? What happens to your application when you get a certain error code back? Is it as expected? You can also black hole the traffic - just cut it. What happens when that request goes out and nothing comes back? Am I ready for that? Not just waiting for an error response, but nothing happening at all, when I don't get an ack back after I send that request.
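As a sketch of what "as expected" might mean, here is a tiny Python mapping from injected outcomes - a few status codes and a black-holed request - to what the user sees. The messages and the `BlackHole` exception are illustrative only, not from any real API.

```python
class BlackHole(Exception):
    """The request went out and nothing ever came back."""


def handle_response(status=None, error=None):
    """Map a raw outcome to what the user should see."""
    if error is not None:
        # Black hole: no error code, no ack, nothing at all.
        return "We'll retry that for you shortly."
    if status == 401:
        return "Please sign in again."
    if status is not None and status >= 500:
        return "Something went wrong on our side."
    if status is not None and status >= 400:
        return "We couldn't process that request."
    return "OK"


# Walk through the failure modes mentioned above.
assert handle_response(status=200) == "OK"
assert handle_response(status=401) == "Please sign in again."
assert handle_response(status=503) == "Something went wrong on our side."
assert handle_response(error=BlackHole()) == "We'll retry that for you shortly."
```

The experiment is then simple: inject each outcome and verify the user-facing message matches the one you designed, rather than a raw stack trace or an infinite spinner.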

Then latency. Networks are inherently unreliable. If you don't agree, we can have a good conversation, but networks will always break. They're only in your control until the point where that request leaves your network, leaves your application, and then anything is possible. It's also about how much latency is ok, because there always will be some. You can start with 100 milliseconds, then 500. Get to a second, two seconds, five seconds. For your application - depending on whether it needs a real-time response, like a video call, versus sending a text message or purchasing a ticket and getting confirmation back - maybe a second is ok. You need to decide for yourself what is sufficient and what is not, and report that back to the user. If it's ok to wait five minutes to tell somebody that their seat has been selected, then you can simply say, "We'll let you know soon." You don't even have to let the user know there was an error. Their expectation is really only what you tell them, so being thoughtful about what you're describing is really important.
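One way to encode that decision is a latency budget with a graceful fallback. This Python sketch injects delay into a hypothetical seat-selection call and, when the budget is blown, sets an expectation instead of showing an error - the function names, delays, and budget value are all made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

BUDGET_S = 0.3  # how much latency we decided is acceptable for this feature


def select_seat(injected_delay_s):
    time.sleep(injected_delay_s)  # stand-in for a slow downstream call
    return "Seat 14C confirmed"


def select_seat_with_budget(injected_delay_s):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(select_seat, injected_delay_s)
        try:
            return future.result(timeout=BUDGET_S)
        except TimeoutError:
            # Over budget: don't surface an error, set an expectation.
            return "We'll confirm your seat soon."


assert select_seat_with_budget(0.05) == "Seat 14C confirmed"
assert select_seat_with_budget(0.6) == "We'll confirm your seat soon."
```

Ratcheting `injected_delay_s` up through 100 ms, 500 ms, one second and beyond is exactly the latency experiment described above: at each step you check whether the user still gets a thoughtful message rather than a spinner.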

Degrade Gracefully

This is all about gracefully degrading. What is the experience you want your user to have so that they get the best experience, with little friction? Another example - text messaging. Whichever of the 15 messaging platforms out there you use, they're all different, they're all frustrating. They all kind of don't work - I guess that's why there are so many. I find that Google Hangouts gives me the most trouble these days. If there's not perfect connectivity, that message doesn't go through. There's a lot of friction there, and it's not something we need to put on our users.

There are a lot of dependencies. We know that, and there are always going to be more. Authentication, the data and where it's coming from, where's the static content, where's the dynamic content, where are we storing things? Where are we caching them? What are the different features? When a dependency fails, do we really need to tell the user? How about two? How about four? I worked on App Engine at Google for quite some time, and of course we were reliant on all the internal Google services. There were a lot of instances where one small dependency would fail - not all of Google, not even the Google auth system, but some small storage system. Through no fault of ours, that system would go down, we'd see failures, we'd see errors, and our customers were unhappy. That's something we need to think about and be conscious of.
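A sketch of that triage - which dependency failures degrade silently and which warrant telling the user - might look like the following. The dependency names, page fields, and banner text are hypothetical.

```python
def render_page(deps_up):
    """Assemble a page from whichever dependencies are healthy."""
    page = {"static_content": True}            # always serveable
    page["auth"] = deps_up.get("auth", False)  # critical: user must be told

    if deps_up.get("recommendations", False):
        page["rail"] = "personalized"
    else:
        page["rail"] = "general"               # degrade silently to a default

    if not page["auth"]:
        page["banner"] = "Sign-in is temporarily unavailable."
    return page


# One small recommendation dependency down: no banner, just a generic
# rail - the user is none the wiser.
quiet = render_page({"auth": True, "recommendations": False})
assert quiet["rail"] == "general" and "banner" not in quiet

# Auth down: this one the user does need to know about.
loud = render_page({"auth": False, "recommendations": True})
assert loud["banner"] == "Sign-in is temporarily unavailable."
```

The design work is in deciding, per dependency, which branch it belongs in - and the chaos experiment is breaking each one to confirm the page really takes that branch.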

Let's look at a couple of actual examples. I'm going to pick on my favorite sports website from Canada, called TSN. Let's look at what the website looks like. It'll refresh in just a second; you can see it load. This is on a fast connection, I will say. If you were paying attention, that wasn't the best. You could see the banners show up slowly, and the top bar. I think the ad was the last thing to show, and it moved everything down, which is frustrating as well. I don't know if any of you are like me, but when I'm using my phone, I would say half the time when my finger comes to that screen to press a button, whatever I was trying to press moves away. It happens constantly because things are loading, things are dynamic. Dynamic is great, we're not blocking, but when I don't get to press the button I want and I go into a whole other workflow, that's quite frustrating.

I'm going to add some latency here. In the Chrome dev tools you can actually add latency to any web page you're looking at. I'm adding two full seconds of latency here and refreshing. If you watch what happens, we see a couple of spinners. The content comes in, something else may happen. Some more bars come in, the ads come in, and we still see some white space down there under the main header. This is two seconds of latency, and it's providing a pretty degraded experience. I was actually looking at this on the airplane on the way here from San Francisco and it was even worse than this. I thought it was broken, so I just kept hitting refresh. Even after working in software for many years, of course my first instinct is to refresh, turn it off, turn it back on, even though I know what's under the hood. There are unreliable networking issues going on and I'm not sure where we're at.

Let's look at blocking one of the requests. We're going to block one request that provides one of those little video boxes. You can again do this in the Chrome dev tools; it's very small at the bottom. You can see that there's a spinner. The video that I was blocking - we lost it. The whole screen got pushed down. I'll scroll down a little bit and there we are. We just got an infinite spinner. I've blocked just one request out of probably hundreds. If you take a look at your dev tools these days and watch the network requests going on when you load any website, I'd say there are hundreds of requests happening in the background, all concurrent. Some of them block, some of them don't. The spinner only stopped because my recording stopped - the file would be too big otherwise - but that goes on forever. As a user, I never know if that's coming or if it's not. The other things on the site look to be working, but that isn't, and I'm none the wiser as to whether I'm ever going to get to see that video.

One final example. I'm going to block a more serious request, a jQuery request whose response provides a lot of content. We do a refresh here. You'll see that the page starts to load and it just stops pretty quickly. That's where we're left; that's all we're going to see. Our drop-down on the top left there doesn't look very good, it's pretty broken. The whole site is pretty broken. The only content we really get is that main headline, and that's about it. Yes, it's unlikely that that request will fail on a consistent basis. But what if it does? Unlikely is not never. We do have some content here, we just don't have the rest of it. So what content is important when one major request starts to fail? These are important things to think about, because you will have outages, you'll have incidents that cause some of these issues.

Here's another dependency failure that keeps getting referenced over the years - and not to pick on Amazon, but it is a good example. A bunch of years ago, I think in 2017, AWS S3 went offline. I forget for how long - somebody can probably tell me to the minute, because you were probably working on a system that had to deal with it. It really shines a light on what data we're putting in S3 - we could have a big conversation about that, and about where there is a backup. What's really more interesting to me is that Amazon relied on the content in that storage environment for their status page.

The status page that told you whether S3 was up or down was itself down, because S3 was not available. That's not great either. They had to go to a different platform - at least we have other tools out there, social media. They went to Twitter to let people know that their system was down and that their status page was down. A status page for a status page - it starts to get kind of meta, and it really lets us see where these failures can come into play. I think they eventually put a toast bar at the top, but that was about it. It wasn't worthwhile fixing their status page while they fixed S3 - which says something about priorities and where the functionality really lies.

Delight Your Users

I keep talking about delighting your users, giving them a positive user experience, reducing the friction they have to go through to use your application and accomplish the tasks they've set out to do. There are a few examples of how this is done right, how it's done well. I'm a big fan these days of WhatsApp, especially in comparison to other chat platforms that I've used. I've been traveling on a sailboat recently, about 80 or 100 miles off the coast. I had very poor reception, but I was still able to have a video call. That's pretty unheard of with other platforms - not going to name any names.

Netflix is quite good at this too. A lot of people I work with now have worked at Netflix previously in their careers. In the Netflix system, if the recommendation service - the curated lists of movies or TV shows that they think you would like to see - is not available, you can still see some general content. You can still see the show you've been bingeing for the last week when you need to see that last episode. You'll still see that show on the list, and you'll still be able to watch it. You're none the wiser that some of these other systems are failing under the hood; you just don't see it. There's also no reason to tell the user all the time. Do they need to know? It doesn't need to get into their mind that there's an issue and that, by the way, we're taking care of it - still go do what you're doing. It just doesn't need to be top of mind.

We've been working with a customer recently who knows that their system is going to fail when there's a certain amount of traffic. I won't ask for a show of hands, but if you think about it, there's probably a certain part of your infrastructure, a certain part of your stack, that you're more uncomfortable about. There's probably a piece that will fall over at a certain point, at a certain amount of traffic. If you become extremely successful, which is a great thing, your technology may not be able to scale with that usage. This customer knows that and is getting ahead of it by treating different groups of users in different ways.

Their number one priority is to onboard new users as easily as possible. They want that flow to be great, to be delightful. For the users already on their platform, it's ok if there's a little bit of latency, a few seconds' delay on certain actions. They're doing that consciously. It's not ideal that not all your users get the best experience, but if you can consciously treat one experience a little differently than another, at least you're getting ahead of the game and preparing for it.

Positive Business Impact

Moving from the world of engineering into product management myself, I've seen that this isn't only about user frustration; there's a real business impact to all of this. When we launch new features and products, it's an exciting day. We've defined what we want to do, we think it'll have a big impact, and we get it out the door after many hours of engineering, testing, and iterating on user feedback, having defined what the success metrics should be. I would say it's common practice to define what we're going to look for: how is it going to go, how many users are we going to get, what's our goal, what's our strategy around releasing this new thing?

What is not done as well, I've noticed in practice, is identifying how that product launch actually landed. Were those success metrics achieved? Were you successful? If not, why not? Almost more important than whether you hit your goals is understanding why you didn't hit them. Did people just not like what you built? Typically that's not the case. Typically there was friction involved: what you wanted your user to accomplish was not possible. You should be able to see that in your metrics, your monitoring, and your analysis. Were there more errors in a certain case than you expected? The 30% or 40% of users who try to sign up and don't complete that flow? Sometimes it's usability, but more often than not, something got in the way. What was it, and how can we make it better?

As we think about this, it's super important to get ahead of the game, to design for that failure early, in order to maintain our release velocity. I've heard it numerous times in the track today and in other meetings and talks: when do we talk about doing this testing? When do we get started? When do we make time and space for this? If you're doing things after the fact, the amount of time you spend is going to be far higher than the amount of time you'd spend being proactive up front.

That's always a very hard conversation to have. It's hard to say, "Don't worry, we're going to save you 100 hours later by doing this for 10 hours now." As an example, the industry has really come around to designing features in a better way: making sure everybody agrees, writing a technical design, signing off on that technical design. We do a lot of planning, but we're not necessarily planning for all of the failure cases that can cost a lot of time and cause outages at a later date. If we do, we're going to spend less time in war rooms, not our favorite place to be, and of course, we're going to deliver that positive user experience.

Graceful Degradation as a Feature

To summarize, there's a lot involved here, but hopefully this workflow is something you can take back with you and think about as you implement what you're working on today. You're just about to launch something: don't stop the launch, don't necessarily block it, but think about what failure case could happen. How might something get in the way? Will it stop my users from using the feature entirely? Because a lot of times things get in the way in a bigger way than you expect. Designing for that failure, injecting that failure to see what happens, and making sure your system degrades gracefully so it can still delight your users is really what this is all about.
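The "inject the failure, verify graceful degradation" loop can be as small as a unit test that forces a dependency to fail and asserts the feature still returns a usable, observably degraded response. Everything below (the exception, the feature, the dependency) is a hypothetical sketch of that workflow:

```python
class DependencyDown(Exception):
    """Stands in for any dependency failure we want to inject."""
    pass

def feature(fetch_enrichment):
    """Return a response, degrading if the enrichment dependency fails."""
    base = {"status": "ok", "items": ["core result"]}
    try:
        base["extras"] = fetch_enrichment()
    except DependencyDown:
        base["extras"] = []      # degrade: omit the enrichment
        base["degraded"] = True  # make the degraded mode observable
    return base

def chaos_fetch():
    # The injected failure: simulate the dependency being down.
    raise DependencyDown("enrichment service unreachable")

resp = feature(chaos_fetch)
# The user still gets the core result, and monitoring can see the
# degraded mode, instead of the whole feature erroring out.
```

Running an experiment like this before launch is exactly the "inject that failure to see what happens" step: you learn whether the feature falls over or degrades, while no real users are watching.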

That's the main point that I wanted to convey, what I'm pretty passionate about, especially working at Gremlin. We're hosting a conference later in September called Chaos Con. We're going to have more content, we're going to have practitioners come talk to us, tell us how they're doing this, how it's working, how it's not working, and how it could be better. We'd be super excited to have you there and to have more people there.

Questions and Answers

Participant 1: As engineers, we spend the majority of our energy getting the happy path to work to get the new feature out the door. What's the best way to instill the resiliency mindset in engineers, especially more junior engineers?

Kligerman: To repeat a little bit, how do we get that mindset into our engineers as we get features out the door, to think about those failure cases? To me it's really about relating to them. I tried to show you some examples today of things that get in the way; yes, these were some consumer apps, and a little bit from the enterprise world, but this relates to us in our day-to-day. I ask our engineers, "When was the last time you used a piece of technology and it really frustrated you?" They can probably tell me in a second. Then we go back to the application or the feature that we're building and we talk about it.

We say, "What happens if the network degrades really quickly? What happens if the internet is cut and comes back in about 30 seconds?" Making it somewhat personal and getting them excited to release a feature they're proud of is a big part of it too. Engineers, product people, design people, anyone involved needs to be proud of what they're getting out the door. You probably won't be very proud if you ship it and next week you're sitting in a war room trying to fix it. And if coming up with such a use case does surface a problem and delays a launch by a few days or a week or two, that's also ok. Create the safe space to say, "We've come up with something. It's not just a bug, it's not a failing unit test, it's not an edge case we missed; it's a pretty big failure scenario, and it's going to give us a bad reputation and a bad user experience." Being ok with that is really important.

Participant 2: At the very beginning of the presentation you used a cell phone not being able to send out a message as an example. What would you say would be a better user response? What I'm trying to get at is, let's say you're on a mountain and you can't send it out. Or, to apply it to another scenario: if something fails, should it just hard-fail? What's your take on this?

Kligerman: I have a few different takes on it. The phone knows a lot of things. Your phone's been on for a long time; you probably don't reboot it very often, like me, and then everything starts to fall apart even more. Your phone knows where it is geographically. It knows where the service is, it knows how strong the signal is. What I would do is look at the signal strength for the last X number of minutes to understand the state of where I am and my surroundings. From that, you can determine whether this is a situation where you should hard-stop. If I haven't had signal in an hour or two, then it's probably ok to tell the user "this wasn't sent" right away; you're not within range of civilization, whether that's cell towers or Wi-Fi.

At the point in time when I hit that button, I did see a little bit of a bar, and then I saw a spinner, and I was starting to get cold in the snow, so I put my phone back in my pocket. It's about understanding the scenario of where you are and what the phone could do, and putting some logic in there, not just saying, "I got an error back, couldn't send it, can't send it," and that's it, with no context. You could let the user know, "Hey, by the way, you haven't had signal for two hours, do you want to try to send this when possible?"

Maybe give the user an option, as opposed to just saying, "no, it's never going to happen." Really thinking about the options that you can give to the user and the state that that feature should be in when you have great service, zero service, and maybe something in the middle. You don't need to deal with a million different cases, but there's a sliding scale here. It's not just working or broken, and that's where I think technology is today; it works or it doesn't, and that's not great.
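That sliding scale between "working" and "broken" can be sketched as a small decision based on recent signal history. The threshold and sampling window here are invented for illustration; a real phone would use its radio APIs and tuned values:

```python
def decide_send_action(signal_samples, no_signal_threshold=0.05):
    """Decide what to do with an unsent message.

    signal_samples: recent signal strengths in [0, 1], oldest first
    (an assumed representation; the threshold is also illustrative).
    """
    if not signal_samples or max(signal_samples) <= no_signal_threshold:
        # No usable signal for the whole window: tell the user right
        # away, with context, rather than spinning forever.
        return "fail_fast"
    # There was at least a bit of signal recently: queue the message
    # and offer to send it when a connection comes back.
    return "queue_and_retry"

# Two hours of dead air vs. a flicker of one bar a minute ago:
on_remote_mountain = decide_send_action([0.0, 0.0, 0.01])
near_the_lodge = decide_send_action([0.0, 0.2, 0.0])
```

The output is a third state between "sent" and "error": the message is neither silently dropped nor stuck on a spinner, and the user gets an option that matches their actual surroundings.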

Participant 3: Chaos engineering is obviously good; it's the right thing to do. As engineers, you want to do this. I'm sold; a lot of people are sold. But a lot of times, practically, it isn't done. Why? It's the cost; it doesn't come for free. People rationalize, "Ok, this is going to fail once every X months, and when it does, I'll just bounce it." How do you measure the benefit here and make the case that it's worth investing the time?

Kligerman: I think we talk about that every day with our customers and our prospective customers: how do we measure the effect of chaos engineering? The first part of your question was about the value of starting to run these experiments and attacks. It's true, some failure cases are quite rare. You can call them black swan events: once in 10 years, once in 5 years, maybe once a year. If you spend 10 hours in a war room on something that happens once a year, maybe that's ok. But what if you have 50 of these once-a-year events, and what is the root cause?

You start going down this analysis path, and more often than not you find a root cause behind several different types of failure. If you were to get started and run a few of these attacks on one, two, three, five of your dependencies, you'd start to notice pretty obvious fixes and changes you could make to render your system more resilient. It's a common question, but what we've observed, and what I've observed, is that a lot of the failures come from somewhat obvious things that you discover when you start to run the basic types of chaos engineering attacks. Getting into the weeds is something you can do a little later. To measure it, again, it's really about the downstream effect. What is the number one metric that's important to you? If you're an eCommerce business, it's likely dollars: how many dollars are you losing? If you're Netflix, how many videos cannot be watched, and how often are they interrupted? It comes down to your user base; you're going to lose customers.

It's about comparing that, the number and volume of outages that caused those losses, against the attacks you run and the issues you discover that would have prevented those outages. Sometimes they're a little hard to correlate, but as you get into the practice and run things more often, you start to see those numbers go down. You see the dollars lost go down, and the hours spent in war-room meetings go down as well. We've started talking to our customers about scanning their calendars to see how often people are in these war rooms, working on incidents and outages, doing postmortems and learnings. That's a lot of expensive time. You've got to identify the key metrics for your business, whether it's employee time spent or dollars lost on your product, watch them, and see them get better.
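As a rough illustration of the cost math being described, here is a back-of-the-envelope sketch. Every number and parameter name in it is a placeholder, not customer data:

```python
def outage_cost(war_room_hours, engineers, hourly_rate,
                revenue_lost_per_hour, outage_hours):
    """Crude total cost of one outage: people time plus lost revenue."""
    people_cost = war_room_hours * engineers * hourly_rate
    revenue_cost = outage_hours * revenue_lost_per_hour
    return people_cost + revenue_cost

# Illustrative: a 2-hour outage, then 8 engineers in a 10-hour war room.
cost = outage_cost(
    war_room_hours=10,
    engineers=8,
    hourly_rate=150,
    revenue_lost_per_hour=5_000,
    outage_hours=2,
)
```

Even with made-up inputs, tracking a number like this per incident is what lets you compare hours spent on proactive experiments against the war-room and revenue hours they prevent.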




Recorded at:

Oct 10, 2019