
Building Reliability One Step at a Time



Ana Margarita Medina shares how she has been using Chaos Engineering and how it can be used to discover our systems' weak points, learn from incidents, and improve monitoring and observability.


Ana Margarita Medina is currently working as a Senior Chaos Engineer at Gremlin. Before Gremlin, she worked at various-sized companies including Google, Uber, SFEFCU, and a Miami-based startup. Ana is an internationally recognized speaker and has spoken at AWS re:Invent, KubeCon, DockerCon, DevOpsDays, AllDayDevOps, Write/Speak/Code, and many others.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.


Medina: I'm going to be talking about building reliability, one step at a time. Prior to getting there, let's go ahead and ask ourselves, why do we prioritize reliability work? We know that reliability really matters, because downtime is expensive. It also starts with creating a culture of reliability. We want to foster a safe and inclusive space where employees know that failure is ok, and that we learn from these failure moments. We prioritize this work because customers matter, and their trust is very important.

How do we go ahead and build reliability? Is this something that we can just wake up and do overnight? No. Is this something where we can just build a new tool and have reliability built in a month? No. Reliability is a journey. It's made of many steps. We all want to achieve reliability really fast, but that leads to many more roadblocks, and maybe some really expensive lessons learned from failure along the way. We have to start by setting the tone for today's talk. Today's systems are far too complex for one engineer or one team to fully understand them. They are too complex for us to understand, and for us to know how to build them more reliably, without learning from failure.

We also have to understand that this complexity only keeps increasing. Our systems are not only growing and serving more customers, but they are becoming much more distributed along the way. As they become more distributed, incidents happen more frequently, and sometimes they're larger or even more severe. It also translates into the cost of downtime becoming much more expensive. We can't forget that failure is unavoidable, and things will go wrong. When these failures happen, are we taking a moment to learn from them? Are we asking ourselves and our teams, what happened? Are we asking ourselves, what do we need to do to overcome these failures? Are we being proactive? Are we doing the work so that we're not regressing into the past failures that we just saw? We can't forget that building reliability is a journey. We must take it one step at a time. What do I actually mean by this? What I mean is that we have to take this with the same concept of crawling, walking, and running. We're all going to start out somewhere. It might be with a certain number of nines, or it might just be with some building blocks that we've learned in the industry.


My name is Ana Margarita Medina. I am a Senior Chaos Engineer at Gremlin. I've been here the last three years building more reliable systems by proactively using chaos engineering. Prior to being here, I did Site Reliability Engineering at Uber, where I helped build the chaos engineering platform and build cloud infrastructure as well. I've also worked at small startups, a credit union, and Google. One of the other things that I really care about is diversity, inclusion, and equity. I am a proud Latina. I was born and raised in Costa Rica, and my parents are from Nicaragua.

Site Reliability Engineering

We're going to be talking a little bit more about Site Reliability Engineering, also known as SRE. Up here, I have three of the most common Site Reliability Engineering books. For all of you who are not familiar with this, Site Reliability Engineering is an implementation of DevOps that focuses on reliability. This field keeps growing, and it was coined and pioneered by the folks at Google. They've created the Site Reliability Engineering book, often called the SRE Bible, and they also have the workbook. I also recommend the Seeking SRE book that David has put together, on how to implement these practices at scale. Every company is going to implement Site Reliability Engineering differently, and that's ok.

The Service Reliability Hierarchy

With the Site Reliability Engineering fundamentals and learnings in the first book, we got the service reliability hierarchy. These are the elements that go into making a service reliable. We'll start at the bottom with the most basic and move to the more advanced things after. Monitoring and observability is the most basic portion of this hierarchy. This work gets strengthened as more monitoring and observability gets put in and more tuning is done. This work also gets better when you define service level objectives and service level agreements. Once we understand how available and reliable our systems need to be, we can move on to incident response. Here, the focus is creating an on-call rotation, making sure that there's alerting set up, and that we also have some escalation processes in place.

Then, you want to make sure that you're learning from these incidents, learning from failure. This is where you can leverage things like blameless post-mortems to do your post-incident analysis. First, you want to take a look into the causes and triggers for these failures: what actually happened that made the system or service fail? Then you want to make sure to look at any of the bugs that could have contributed to it, and what the system failure affected. You also want to make sure that you're looking at solutions and recommendations, so that a similar error doesn't occur in the future. Now that you're learning from failure, you want to make things better. This is where doubling down on efforts and testing your release procedures is really important. You can leverage canary deployments, feature flags, and chaos engineering in this portion. You want to focus on minimizing the risk of failure and making sure that you do things incrementally as you mature. Then you can move on to capacity planning. Capacity planning is where teams work together to make sure their services have enough spare capacity to handle any likely increases in workload, and enough buffer capacity to absorb the normal workload between spikes and planning iterations.

Now we're getting close to development. Most of us just want to build a lot of shiny things, especially in product teams. We want to make sure that we've invested enough time in all the other layers of this hierarchy, so that when we actually do build the features that our customers want, and those really eye-catching products, we're actually standing out from the rest. As we prepare to reach the top level of the reliability hierarchy, this is where implementing things like error budgets really comes in clutch. The error budget provides a clear, objective metric that determines how unreliable a service is allowed to be within a single month, quarter, or year. This is where teams make sure to innovate and experiment. An example of an error budget: if the SLO is three nines, 99.9%, then the error budget for that service is 0.1%. As you establish that reliability at the product level, it makes it easier to do reliable product launches at scale, to give users the best possible experience from the day of launch, also known as day zero.
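As a hedged illustration of that arithmetic (the helper names here are mine, not from the talk), converting an SLO into an error budget and an allowed-downtime figure is only a few lines:

```python
# Illustrative helpers (names are hypothetical): turn an SLO percentage into
# an error budget and the downtime it allows over a 30-day month.

def error_budget(slo_percent: float) -> float:
    """Error budget as a fraction, e.g. an SLO of 99.9 leaves 0.001."""
    return (100.0 - slo_percent) / 100.0

def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes a service may be down in the period without blowing the budget."""
    return error_budget(slo_percent) * days * 24 * 60

# Three nines (99.9%) leaves a 0.1% budget: roughly 43.2 minutes per 30-day month.
print(round(allowed_downtime_minutes(99.9), 1))  # → 43.2
```

Teams then track how much of that budget an incident consumed, which is what makes "how unreliable are we allowed to be" an objective conversation.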

We also can't forget that reliability is measured from the customer's perspective. When we do this, we have to make sure to actually understand what our customers are doing with the products. What is it that makes a product reliable for them? How are our customers accessing the product? Where are they doing it? How are they using it? When all the components come together, is it a reliable critical path for them? We can't forget, building reliability is a journey.

Measuring Reliability

When it comes to measuring reliability, our industry likes calculating reliability based on the nines of availability. What that means is that if an organization or a service is up for two nines, that is, available 99% of the time, they're allowed to be down for 3.65 days a year. That translates into 87.6 hours. At three nines, you see a downtime per year of 8.76 hours. For four nines, you're allowed to be down 52.56 minutes a year. With five nines, it's just 5.26 minutes of downtime per year.
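The figures above can be reproduced with a small sketch (the function name is illustrative, and a 365-day year is assumed):

```python
# Downtime allowed per 365-day year for a given number of nines.

HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(nines: int) -> float:
    """e.g. 2 nines -> 99% available -> 87.6 hours of downtime a year."""
    unavailability = 10 ** -nines  # 2 -> 0.01, 3 -> 0.001, ...
    return unavailability * HOURS_PER_YEAR

for n in range(2, 6):
    print(f"{n} nines: {downtime_hours_per_year(n):.4f} hours down per year")
```

Each additional nine divides the allowed downtime by ten, which is why the jump from 8.76 hours (three nines) to 52.56 minutes (four nines) feels so much harder to earn.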

Two Nines

When it comes to those two nines, what does that world look like? This is the beginning. This might just be one person keeping a service or a company reliable. They might not have built out a site reliability engineering team yet. The customers might think that the company is not reliable, that things might be really broken. Customer trust is starting to get weaker as they're having these frequent outages. From here, how do you move from two nines of availability to three nines? Maybe there was logging, but no monitoring; you want to make sure to get started there. Maybe there was some testing, but it wasn't that advanced; you want to make sure to be preparing your teams for failure. You want to ask if there are backups in place, and make sure to start testing for those failures as well.

Three Nines

In the world of three nines, this is where most software companies are. There's now more than one person working on reliability, and there might actually be a site reliability engineering team, or various teams that are on call. When outages occur, they are much more complicated, and sometimes it's latency that causes a lot of cascading failures. We also see that the rate of change, innovation, and the many things going on in organizations are contributing factors to the complexity of the outages. How do you go from three nines to four nines? It starts by doing the work to improve the signal-to-noise ratio by cleaning up those alerts and monitors. You want to make sure to also do things like canary deployments and really be testing in the real world. You also want to make sure that you're sharing the learnings from incidents across the company. Let every engineer, product owner, or designer learn where these failures are happening. You're also starting to build more redundancy. You want to have redundancy at the regional layer at this point. Also, go ahead and test it, and make sure that that redundancy is documented properly. You also want to make sure that your teams are doing failure injection and experimentation, things like chaos engineering. This also allows you to prepare and harden your critical services by running fire drills and game days. You also want to make sure those service level objectives and service level agreements are in place. You also want to start tackling automation. You want to make sure that you're preventing those past outages and that you're not drifting into failure or any regression.

Top Performers

I work at Gremlin. We recently surveyed 500 folks in technology companies, asking questions about reliability, their development processes, incident response, and such. From that survey, 74% of the companies with availability over three nines were actually doing failure testing with chaos engineering. Almost half of them, 46%, were doing chaos engineering on a regular basis. Consistent failure injection goes a long way toward getting ahead of problems, which allows teams to save money and improve their reliability.

Four Nines: What the World Looks Like

What does four nines look like? Some companies can hit these four nines, but failure is still going to happen. Customers see it, but this is happening less frequently. There's now better training in place. There are much more resilient systems. Things don't fail as often, so many teams may actually be out of practice. When things do fail, the failures are often very complicated, and they involve coordination across many teams to triage. Teams are regularly following the incident management process that was built, and the lessons are constantly being shared across the organization. We're at four nines; how can we actually try to achieve those five nines? This is going to be a lot of work, but you want to make sure to run organization- or company-wide experiments. This is where you're experimenting on a whole system as opposed to one piece. This means that you also want to continue building the catalog of automated tests and validation built in to stay reliable. You also want to tackle building resilience and reliability into services from the start. You can't forget to shift failure testing earlier into the development process as well.

How Do We Get There?

How do we go ahead and get to more nines? Let's run through some of the building blocks that we have available to us. Don't forget that no matter where you are, it's about leveling up to get more reliable, to get the availability number that you want, to get that customer trust that you need. Start out with monitoring and observability. If you don't have any monitoring, you want to start by putting basic monitoring in place. You then want to fine-tune that monitoring and those thresholds, and set out those service level objectives and SLAs. You also want to make sure to have full visibility. If you can, you want to implement things like anomaly detection, which allows you to understand when a metric is behaving very differently than it has in the past, taking current settings into account. On the testing and automation side, you want to move from individual and ad hoc testing, then graduate to being able to automate part of this testing. Then, you want to have things like canary deployments and automated deployments. In this portion, you also want to get comfortable with failure injection, and things like chaos engineering.
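As a sketch of the anomaly-detection idea (a plain z-score test; the three-sigma threshold and the sample metric are my assumptions, not something the talk prescribes):

```python
from statistics import mean, stdev

def is_anomalous(history, sample, z_threshold=3.0):
    """Flag `sample` if it sits more than z_threshold std devs from the recent mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > z_threshold

# Hypothetical recent latency readings in milliseconds.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 100]
print(is_anomalous(latency_ms, 250))  # → True: a 250 ms spike stands out
print(is_anomalous(latency_ms, 101))  # → False: within normal variation
```

Production systems use far more sophisticated models (seasonality, trend, per-deploy baselines), but the core idea is the same: compare the current value against learned past behavior instead of a fixed threshold.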

The next portion is incident management. If you don't have incident management, go ahead and get started with setting up a basic on-call rotation and having some processes outlined with your team. As you grow that, make sure to get comfortable with blameless post-mortems. Really try to understand that post-incident analysis and learn from failure. As you grow in that practice, you want your engineers to get a chance to detect these failures faster and mitigate them quicker, so you can practice those response skills by leveraging things like game days and fire drills. In the phase of redundancy, if you're starting out with that single zone, single region, and single cloud, go ahead and move on to multiple availability zones. Then, move on to multiple regions and multiple clouds, and really make sure to test a lot of this redundancy.
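A basic on-call rotation can start as something this simple (a round-robin over a placeholder roster; real schedulers such as PagerDuty or Opsgenie add overrides, handoffs, and escalation policies):

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol"]  # hypothetical roster

def on_call_for(day: date, roster=ENGINEERS) -> str:
    """Weekly round-robin: the ISO week number picks the engineer on call."""
    week = day.isocalendar()[1]
    return roster[week % len(roster)]

print(on_call_for(date(2021, 8, 29)))  # whoever owns that ISO week
```

Even a toy schedule like this makes ownership explicit, which is the prerequisite for the alerting and escalation processes described above.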

I also have to call out that building reliability is going to take consistency; it's going to require a lot of practice and a lot of time. We have to first start with embracing and accepting failure. Failure is going to happen, so let's go ahead and analyze it and talk about it in a blameless way. When you have those failures, there are tons of lessons; make sure to take a moment to learn from them. As you get consistent and comfortable with failure, experiment with it and embrace it. Great learning comes from there. Make sure to make changes with the learnings from failure as well. Don't forget to practice. Consistency comes with practice. Make sure to fine-tune your tools and processes, and then go ahead and practice them over again with your team. Don't forget to schedule those things to happen weekly or monthly. Once you have the reliability you need in your systems, move on to thinking about reliability in the design process of new features and products to get ahead of this work. Don't forget, building reliability is a journey, and we're all in this together.

The Chaos Engineering Community

As you embark on your reliability journey, I would like to invite you to join the chaos engineering community. We have over 7000 folks that are just getting started in the reliability space. A lot of them are getting started with chaos engineering. It's taken a while for a lot of them to get started. From the community, I would also like to give you a thank-you gift, like these chaos engineering community stickers; if you're interested in collecting your stickers, head on over to the community site.

Questions and Answers

Betts: I loved talking about the crawl, walk, run through your reliability journey.

I want to start with one thing you talked a lot about: nines of uptime, which I think is the most common reliability metric people are aware of. One of the criticisms I've heard about it is that it can hide the true cost of downtime, and you hinted at this. You talked about trying to equate it to customer trust. So, are there ways that you've seen to either quantify the business value that you get from reliability, or do you just quantify the business cost of decreased reliability?

Medina: Over the years, I've spent time focusing on the business cost of reliability. That has been what the industry has adopted at all the places that I've worked in, and with the customers that I've gotten a chance to spend time with. There's also the introduction of the business level objective, where they're actually able to trace down how every single customer interaction is affecting that service level objective that we define in Site Reliability Engineering as well. Sometimes there's a disconnect between what the conference talks talk about and what we actually see in the industry. So far, what I have seen folks focusing on the most is that cost of downtime. We do always reiterate that it doesn't matter if you have those high availability nines if your customers are not happy, if your customers are not coming back and really loving your product, and you're not a company that's customer obsessed as well. Sometimes there's a little bit of disconnect with that, where we always want to point that out: it doesn't matter if you have those nines if your customers are going to be trashing your brand and not recommending it to other folks. That might be a really big disconnect, and you really want to spend some time understanding that. When I worked at Uber, and we were building better observability into when things go wrong, this is when we were still trying to decouple things and the microservices were growing, we always joked around that we could have a Twitter dashboard to let us know when Uber is down, as customers come on and start complaining. It also depends a little bit on how you can get more in touch with your customers too, and really listen in.

Betts: Could you share some techniques you have for failure injection?

Medina: I've touched on the fact that I've been working in chaos engineering for a few years. That definitely has been one of the most proactive ways to go about injecting failure, because there are different steps that you can take. We hear about Chaos Monkey, where it was really just shutting down an EC2 instance and asking, what happens if our host dies? That also touches on the crawling, walking, running, where you can start out with something so simple: you manually go into your cloud provider and shut something off, or try to understand what happens if you have a little bit more latency between one service and the other. As you get comfortable with that concept, it also really comes down to your organization being comfortable with failure. You can start looking into tooling to do things like chaos engineering, and there are some open source options out there depending on what type of environment you're in.
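A crawl-stage version of that latency experiment can be as small as a wrapper around a service call (everything here is illustrative; production tooling adds blast-radius controls and a halt button):

```python
import random
import time

def with_latency(call, delay_s=0.2, rate=0.3, rng=random.random):
    """Wrap `call` so a `rate` fraction of invocations is delayed by `delay_s`."""
    def wrapped(*args, **kwargs):
        if rng() < rate:
            time.sleep(delay_s)  # simulate a slow hop between services
        return call(*args, **kwargs)
    return wrapped

# Hypothetical downstream call; inject latency on every request while watching
# dashboards, timeouts, and retries upstream.
def get_user(user_id):
    return {"id": user_id}

slow_get_user = with_latency(get_user, delay_s=0.05, rate=1.0)
print(slow_get_user(42))  # same result as get_user(42), just slower
```

The point of the experiment is not the wrapper itself but the question it forces: do callers time out gracefully, retry sensibly, or cascade into failure?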

I work at a chaos engineering vendor, Gremlin. We have a SaaS platform. Now we're also seeing this space really start evolving. Just earlier today, AWS launched their failure injection service, so that's cloud providers now starting to do that. Earlier this month, we also had Azure announce that they're working on Chaos Studio. We're also starting to see just a lot more capabilities that you can have, when it comes to saying, "I want to do failure injection, but how can I do it in a safe manner, and in a way that I can also just stop these experiments and not cause more harm? I'm really going back to revealing that weakness."

Betts: I like the idea that you can start with manual; you don't have to go to the fully instrumented, automated solution right away. Also, hearing that all the cloud providers are starting to offer this as a SaaS solution does make it easier to go to that next step, when you're going from crawling to walking.

I do want to talk about your reliability hierarchy that you had. How do you watch that evolve as you go through the reliability journey? Does each layer basically get wider as you add features to that, and then that allows you to build more above it? I might be stretching your metaphor too much. In my head, that's how I saw it, like I'm adding more down here that means we can do more up.

Medina: That hierarchy is not something that I created; it comes from the Site Reliability Engineering book. It lays out the elements that we need to make a service reliable. It starts out like this: you can't understand how to make a service more reliable if you don't understand how it's behaving, what it's actually doing when there are customer interactions, and what it brings to the business. If you're not able to have that insight, you can't move on to understanding when it fails, when those incidents get caused. You'll see that the blocks at the bottom are going to take a lot more time. They require more engineering effort, or they require more of a culture change, too. It really does get easier as we go up, but that doesn't mean that all the work that you did down there goes to waste, because you sometimes wouldn't get to the next level if you hadn't spent six months with your SRE team making sure that every critical service has the proper monitoring and observability piece, the right thresholds, the proper alerts, and runbooks. It really does come down to that: you need to crawl, and do some of the work that we don't necessarily love doing. That's really going to move us to the next level of, we're in this together, we're building reliability one step at a time as a team or organization.

Betts: Who do you see as responsible for improving system reliability? The easy answer is it's everyone's, but someone needs to make it their job. Is it just SREs? Is it something that architects and platform teams can do that help build the foundation so that the developers can take care of more of their business problem, but also focus on reliability themselves?

Medina: Yes, it's a little bit of both. With this type of question, it depends very much on the size of your organization. When you're such a large organization that you have so many engineers, you might have the ability to have a couple of those engineers focus on reliability. That could either be an SRE team, or it could just be a reliability operations team. They can really think about, what does this organization need to be reliable? Try to understand where the most critical points in your organization are. Try to understand, what were the largest outages that you had? Really take a moment to learn from those failures, learn from incidents. There are also times when there's only one person doing that, and that is a big burden to carry. It's sometimes really trying to nail down, what is the work that we need to do first? When we're looking at teams that have only one person, sometimes it's like, go ahead and read the SRE book, but understand that half of it is not going to apply to you because you're an organization of one. Yes, you're going to be embedded in who knows how many teams helping engineers, so sometimes the work that comes with SRE is really advocating for best practices, or practices that are going to make an organization reliable, and then figuring out how you build documentation that lives on and can actually scale. It shouldn't be, "Ana, I really need to understand how to build this to deploy to production because we're launching a new feature in a month." It's more, why haven't I known about this, so we can do capacity planning, so that I can get all your other upstream and downstream dependencies?

Sometimes it's really just trying to align leadership and engineering leadership specifically to think about the building blocks to be reliable, and then work backwards. Some of it is, you might have a lot of non-SREs, non-Ops that are just developers working on features. It's them carrying that pager, understanding the pain of not building reliable systems, and then having to build your own runbooks. Them having to really practice how to make their systems more reliable. Then that's when we're starting to really see that shift of more people feeling the pain, caring about this. Then the organization really embracing failure and being like, "All of us suffer failure, whether you're on call or not. As an organization, we are going to do better." It's really building that culture where no matter what we're doing, if we stumble, we're going to get back up and learn, and really make it the best experience for our customers.

Betts: I like that idea of how you build culture. You cannot impose a culture on someone. You don't want to say that everyone has to experience the pain of being on-call and seeing the site down to get it. Sometimes, that's what it takes to say, I have now taken my own personal actions to make sure my stuff is more reliable, because that helps everybody. That then becomes the culture where everyone owns reliability.

You mentioned that using several cloud providers is one way to achieve four or five nines. Are companies actually doing multi-cloud? Are you seeing that?

Medina: I feel like I've been talking about multi-cloud for a few years. At Uber, we tried to do a hybrid cloud model that was vendor agnostic, and that didn't really take off very far. I got a chance to speak at a multi-cloud conference put together by Cockroach Labs, and really having those conversations where it was only a handful of companies really embracing multi-cloud. Then we've seen a lot of changes over in the multi-cloud industry the last few years where we see products like Anthos come out that really do allow for you to be multi-cloud in just a few clicks. It really does depend on what type of multi-cloud we really do mean when we say multi-cloud. Is it just running our database on three different cloud providers, having a Kubernetes cluster that is distributed across the clouds?

I think the concept of multi-cloud has changed a lot over the last three years. We are seeing more companies be multi-cloud, but not necessarily all of them have all their critical services be multi-cloud. They might only have one or two, or they might just have that database that is one of the most critical pieces that they need to keep reliable, and where the state needs to live. I do think that it is that pattern where you have just a component on several cloud providers, and not a full, all my critical services are on Google and we can do that failover to Amazon, and if anything goes wrong, we go to Azure. That best-case scenario that we think of as multi-cloud, I don't really know of an organization that has done that. I could be wrong. One company comes to mind, but I'm not sure that they did go through with onboarding their entire critical services to cloud like that.

Betts: It's a journey. You kept saying it's a journey. It's, find the little piece that's most important that needs to be the most reliable, and maybe that's where you target and say, this part of our system. Making a whole system fully multi-cloud is very challenging, and you have to ponder if it's worth the cost.

Medina: I also touched upon it with the five nines of the ideal future, where, yes, we could be multi-cloud, but it's so expensive that you have to ask if it's really worth it. Sometimes those trade-offs might not really make sense, because you now also have a larger engineering team and a larger cloud footprint. Now that we're really having those conversations around sustainability too, the carbon footprint of being a company on multi-cloud might also cause some concerns.

Betts: Someone asked if you could share your State of Chaos Engineering Report. Is there any way you could get that out to our attendees, or anyone individually?

Medina: I want to say InfoQ may have had it, the State of Chaos Engineering Report.

Betts: Is there a way people can reach out to you, is it Twitter, is it LinkedIn? What would you prefer?

Medina: The best place to contact me is Twitter. Twitter is my go-to communication channel for a lot of this work. My LinkedIn does get a little bombarded between recruiters and InMail. If you need to get a hold of me, head on over to Twitter. You can also reach me by email; if you all send me an email, I can send the actual State of Chaos Engineering Report, or help if you have any issues with the site. I'll post the link in the chat if anyone is interested in the State of Chaos Engineering Report as well. It has some of the key findings around the things that we've learned doing this work for the last five years.

Betts: How much of this do you see that's going to vary from company to company? What are some common practices that apply almost universally other than just, it's a journey, take a small step first.

Medina: I think 100% it's going to vary company by company. No company has really done reliability with the exact same process. That's totally ok, because no one's really going to have that one model. I would say that there are two recurring factors. The first is really having those baseline reliability things done: having observability and monitoring into your system, how is your system behaving? Trying to understand, what are the most severe incidents, and the cost of the downtime that is going on? Once you do that incident work, make sure that you're really doing that post-incident analysis, and writing down the work that you need to do, but also doing it. It does end up being a big chunk of work. As you ask me what some of those things are that I see as common across the industry, those definitely have been part of it.

The other one that I'll always continue preaching is practice. You're not going to have a team that understands the monitoring and observability if they're not really seeing it in action when there's an incident, when there are actually customers on a high-traffic event. Really make sure that you go through that same thing with understanding how to be on-call, how to be psychologically safe, run through the runbook, and escalate to the proper employees of a company when an incident is going on. That really ties in to that post-incident analysis: after you've written down all the work that you need to do to not regress into past failures, go ahead and recreate those outage conditions. That is where failure injection comes in. You want to make sure that if those conditions were to happen again, your company is able to stay reliable against them.




Recorded at: Aug 29, 2021