
Cultivating Production Excellence


Summary

Liz Fong-Jones talks about several practices core to production excellence: giving everyone a stake in production, collaborating to ensure observability, measuring with Service Level Objectives, and prioritizing improvements using risk analysis.

Bio

Liz Fong-Jones is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Connect, see, and speak with like-minded people. Join us to accelerate your learning, be better informed, and drive innovation.

Transcript

Fong-Jones: I'm Liz Fong-Jones. I'm a Developer Advocate and Site Reliability Engineer at Honeycomb. I want to share with you some of the lessons I've learned around how to develop and operate complex systems, sustainably and reliably.

Production is Increasingly Complex

We live in an environment that is increasingly complex, where we're struggling to deal with so many different applications and microservices. They're all running and trying to deploy at the same time, and we need to operate them as both operators and developers of software. We're adding complexity all the time, which means that even if we think we can keep up today, our techniques aren't going to let us keep up tomorrow. It's difficult to understand and tame all this complexity, and it just gets harder every single day because of the compounding challenge of keeping all of that state in our heads.

What does uptime mean in a complex, distributed service? Once upon a time, in a system that I used to operate 15 years ago, you could measure the uptime of your service in terms of the number of servers that were healthy or not healthy. If the servers were healthy, then the service was fully running. If the service was not healthy, it was because one or two servers were down. However, we don't live in that world today. Instead, our services run across many hundreds of different machines or containers. Often, we don't discover problems until users actually complain to us. Ideally, we'd like to be a lot more proactive than that. The challenge is that we don't just have to run these services; we also have to develop their features, think about scaling them, and be proactive rather than just reactive. There's just not enough time in the day.

Our Heroes Are Exhausted

You get to a state where you're just feeling exhausted, because there's too much going on, you can't keep up with it, and you aren't sleeping all through the night. We need to figure out a better strategy than heroism. Heroism may have gotten us through the past 15 years, but it's not going to get us through the next 5 to 10 years. We need different strategies, and that's what this talk is about. How do we make sure that our systems are sustainable and reliable to operate in production?

Don't Buy DevOps

One thing that I know for sure is that we should not buy DevOps as a solution to this. We should not buy our way out of this problem; no amount of tooling is going to magically solve it. Let's take the example of what happens when you do buy all of that tooling that you think is going to help you. It actually doesn't, and here's why. Let's suppose that you adopt continuous integration and continuous delivery because you want to ship stuff faster. Shipping faster means you ship shit faster. What about infrastructure as code? It used to be that one stray command might take down one server; now one stray command can take down your entire AWS or GCP environment.

Let's talk about Kubernetes. If you adopt Kubernetes without understanding why or how you intend to use it, then you're just setting yourself up for a lot of excess complexity without gain. One of the cardinal sins I see often is that people adopt the idea of production ownership first: let's just throw all these teams into PagerDuty and they'll magically self-correct. That's lovely and wonderful until you realize that those of us who have operated services for a long time have developed a lot of coping strategies your developers may not have. It just doesn't make sense to throw people onto a service that they're not going to be able to operate without burning out. When you throw people on call without preparation, they get woken up at 3 a.m. by noisy alerts, night after night. It makes those engineers grumpy. They can't sleep well, and eventually they turn off the pagers because they just don't want to deal with it anymore, or they quit and leave your company.

Walls of Meaningless Dashboards

Even if you do have a developer who's motivated to solve these problems, you have this battle-station problem: walls of meaningless dashboards, each with 20 graphs on it, and you're left asking, which line wiggles at the same time as that other line? Canned dashboards are just not good preparation tools for solving your incidents. While you're trying to figure out which line wiggles at the same time as another, your customers are suffering, which hurts your company's time to detect and resolve incidents, and with it your customer satisfaction and revenue.

Deploys Are Unpredictable

It's 3 a.m., and you got paged an hour ago. You pick up the phone and call your tech lead, and everyone winds up bugging that tech lead week in, week out, even when that tech lead is not on-call. That leaves you in a situation where people are not really able to get rest and sustainably improve the system. You patch the system. It gets to be 8 a.m., and you've gotten 4 hours of sleep. You try to fix the problem for good, and you discover that your deploys are unreliable, that you haven't actually been able to push out code for the past couple of days because you've got a broken build somewhere in the last 100 changes that you pushed. You're batching up too much into one change, and therefore your tests are flaky and you're not able to actually push code to production, including critical fixes for the issues that woke you up at 3 a.m. It's so frustrating.

Operational Overload

This is what we call a state of operational overload: you're getting paged all the time, and there's no time to do projects. Even if you do find an hour or two for project work, you don't have a concrete plan for how to get out of this, how to improve the state of your service and make it more reliable and operable into the future. This leads to burnout. A team like this feels like it's barely holding on to the surface of its service by its fingernails, in danger of falling off at any moment. It's not a healthy environment for a team.

What are we missing? What's the thing that can make this better? There's one thing I implore you to do: think about the people. Think about the human factors. Telling this story got my own heart rate racing, because I've been there before. It really sucks. Breathe with me, take a drink of water, it'll all be ok. Focus on the human factors. Your tools are not magical. Your tools can help you automate things you already know how to do. They can nudge you and remind you to do the things that your culture has decided are important. Your tools are not going to fix a broken culture. Tools cannot fix a culture of blame. Tools cannot fix a culture where people don't get the priority and time allocation to do things correctly.

Invest In People, Culture, and Process

Instead of adopting a tools-first approach, we really need to adopt a socio-technical approach, where we invest in our people, our culture, and our process first, and the tools come along for the ride as part of the overall picture rather than driving it. This is what I call production excellence. It's the art of making our systems not just more reliable, but also friendlier to the people who operate them. It focuses on the overall socio-technical picture of the humans and the computer systems working together. You don't get there by accident; you have to plan. You have to measure along the way to figure out: what are the milestones? How can we make sure that we're continuously making progress? If things do go wrong, where are we going to detect warning signs? How can we improve things? It also means that we have to involve everyone. This can't be an engineering-only project. It has to also involve teams like customer success, sales, and finance. These are all important stakeholders that too often are left out of engineering-only efforts. This means that we need the right psychological safety: people need to feel safe and confident to speak up, ask questions, and contribute where they have something valuable to add to the conversation.

How Do We Get Started?

What are the four elements of production excellence? First of all, we need to know when our systems are too broken. Secondly, we need to be able to debug our systems when they are too broken. Thirdly, we need to be able to collaborate across teams in order to resolve these incidents. Then fourth, and finally, we need to resolve the overall complexity. We need to eliminate unnecessary complexity and create a feedback loop so that we can solve the operability issues within our systems after we've detected problems.

Our Systems Are Always Failing

Why did I say that we needed to know when our systems are too broken? Why did I not say that we need to know when our systems are broken at all? The answer is that our systems are always broken in some microscopic fashion. I have a lawn in my backyard, and I don't care that the lawn has a single brown leaf in it. What I care about is: is the lawn overall green enough? Is it soft enough? Can I send my family out to play? That's what matters. It doesn't matter that every single blade of grass in the lawn is green. Why does it matter that one server is slow or down, if the system serving queries has automatically detected the failure and routed around it? We cannot alert anymore on one server being at 90% CPU utilization or 90% disk utilization. This doesn't work anymore. Instead, we need to measure the quality of the service that we're delivering to users, and whether it's meeting their expectations or not. We need to measure: are our systems too broken?

We Need Service Level Indicators

This is a concept from Site Reliability Engineering called the service level indicator. It enables us to create a common language between us, our product management stakeholders, and our customers, to understand what the required level of service is and how we measure that it's actually being delivered to customers. In order to understand this, we need to understand our critical user journeys. What do customers care about? Can they visit the homepage of my e-commerce app? Can they add an item to their shopping cart? Can they check out? What are the important properties of that? Things like: what's the user? What's the item they're trying to add? Which microservices did they use? What's the request ID? Did they get served a 500? These are all important properties. Above all else, what we're trying to do with these critical user journeys is categorize them as good or bad. We're trying to determine which events are satisfactory to the customer and which ones we could improve upon.

Are User Journeys Grumpy?

One way of doing this is to ask our product managers what the criteria for success are, or to ask our user experience researchers. If we genuinely don't know, we can artificially slow down the system and see when it starts to feel sluggish to us as people who practice dogfooding. Then we'll understand our thresholds. Is 200 milliseconds fast enough? Is 500 milliseconds fast enough? What counts as success? Is it just an HTTP 200, and what if the body was empty? Does that meet our criteria? You need to figure out the criteria that your systems can use in an automated fashion to decide what a sufficiently reliable response is. After that, though, our picture is not entirely done, because we need to know not just whether any given event is good or bad, but also the total population of eligible events: real customer traffic, not load-test or health-check traffic, not bot traffic, but real customer traffic.
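
As a rough illustration of the kind of automated "good event" check described above, here is a minimal Python sketch. The event fields, the 500-millisecond threshold, and the non-empty-body rule are all illustrative assumptions; the real criteria should come from your product managers, UX research, and dogfooding.

```python
# A hypothetical "good event" predicate. The field names and thresholds are
# assumptions for illustration, not a prescribed standard.

LATENCY_THRESHOLD_MS = 500  # assumed: "fast enough" for this user journey

def is_good(event: dict) -> bool:
    """Return True if this event met the user's expectations."""
    served_ok = event["status"] == 200           # an HTTP 200...
    not_empty = event["body_bytes"] > 0          # ...with a real payload, not an empty body
    fast_enough = event["duration_ms"] <= LATENCY_THRESHOLD_MS
    return served_ok and not_empty and fast_enough

# A 200 that took 720 ms still counts as a bad event for this SLI.
print(is_good({"status": 200, "duration_ms": 720, "body_bytes": 1024}))  # False
```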

Availability: Good/Eligible Events

Then we compute our availability: the percentage of good events divided by eligible events. That is the availability number that we're trying to track and measure over the long term, and to set a target service level objective over. An SLO needs to have a window, like 30 days or 90 days, and a target percentage. Why not one day or two days? Because if I were 100% down yesterday and said I'm 100% up today, you wouldn't believe me. Let's take an example SLO: 99.9% of events must be good over the past 30 days, where an event is good if it was served with an HTTP 200 in less than 100 milliseconds. That is a complete service level objective definition based on the service level indicator we defined earlier.
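
To make the arithmetic concrete, here is a small sketch of the availability calculation and the 30-day SLO check, using made-up event counts. It assumes the good and eligible counts have already been tallied from real customer traffic only.

```python
# Availability = good events / eligible events, checked against a 99.9% / 30-day SLO.
# The counts below are made up for illustration.

good_events = 9_985_000
eligible_events = 10_000_000

availability = good_events / eligible_events   # 0.9985
slo_target = 0.999                             # 99.9% over a 30-day window

print(f"Availability over the window: {availability:.4%}")
print("SLO met" if availability >= slo_target else "SLO missed")
# 99.85% < 99.9%, so this 30-day SLO is missed.
```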

Why not 100%? A good SLO barely keeps your users happy. If you set your SLO too high, you're wasting money and you're wasting execution velocity. It doesn't make sense to spend billions of dollars to eliminate 1 femtosecond of downtime if you're operating an e-commerce site. It just doesn't. You can quantify the revenue loss from 1 femtosecond, and it's far less than the cost of delivering a reliability that's only 1 femtosecond down every year. Even if cost were not an object, the practices you need in order to have extremely high numbers of nines do not align with the reality that we have to deliver and ship software on a regular cadence rather than freezing it. The Space Shuttle? Ok, that I expect to have a very high degree of reliability, to take a long time to develop, and to be very expensive. Your e-commerce site is not life-critical in the same way that a space shuttle is life-critical.

Drive Alerting with SLOs

What can we do with our SLOs? We can do a few things. First of all, we can drive alerting with them. We can decide whether or not something is important and worth waking someone up for based off of whether it impacts the user's experience and the service level objective that we measure the user's experience with. In order to numerically quantify this, we need the error budget, which is one minus our reliability target. If we target 99.99% reliability, that means that 1 in 10,000 requests are allowed to fail. If I have 10 million requests per day, that means that I'm only allowed to have 1000 requests fail per day. That means that I can compute how long until I run out of that error budget. If I'm going to exhaust my error budget for the month, in the next couple of hours, that's an emergency. I need to wake someone up regardless of what time it is. If I'm going to run out of error budget in the next few days, it can wait until the next business day. I don't have to wake someone up on the weekend or in the middle of the night. Error budgets are especially great at catching brownouts where you may not necessarily have 100% downtime, where your probers may not be firing. Yet, if you're measuring the quality of user experience, you can detect brownouts and resolve them before users become too unhappy with you.
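
Here is a back-of-the-envelope sketch of that error-budget arithmetic, using the same numbers from the talk (a 99.99% target and 10 million requests per day) plus a hypothetical current failure rate, to show how "how long until the budget runs out" drives the page-now versus wait-until-morning decision.

```python
# Error budget = 1 - SLO target; burn it down against a hypothetical failure rate.

slo_target = 0.9999
requests_per_day = 10_000_000
window_days = 30

error_budget_fraction = 1 - slo_target                      # 1 in 10,000 requests may fail
budget_per_day = error_budget_fraction * requests_per_day   # ~1,000 failures per day
budget_per_window = budget_per_day * window_days            # ~30,000 failures per 30 days

# Suppose we are currently failing 2,500 requests per hour (a made-up burn rate).
current_failures_per_hour = 2_500
hours_until_exhausted = budget_per_window / current_failures_per_hour

print(f"Error budget exhausted in {hours_until_exhausted:.1f} hours")
# ~12 hours: that is an emergency, so page someone now. If the same math gave
# several days of headroom, it could wait for the next business day.
```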

Data-Driven Business Decisions

The second thing that we can do with SLOs is to make data-driven business decisions. If we have plenty of excess error budget left over that we haven't spent, we can do A/B tests. We can use feature flags to run experiments to determine whether or not we should take a certain product direction. We can use it to increase our development velocity. Conversely, if we're having reliability problems, and we've blown through our error budget in the past couple of months, then we know that we need to improve our reliability in order to meet our error budget for future months. That may mean that we need to slow down on feature delivery and focus on re-architecting your systems.

Perfect SLO > Good SLO >>> No SLO

You don't have to have a perfect SLO, you just have to have a good enough SLO. What you aren't measuring is going to potentially hurt you and your customers. Measure what you can today, even if it's just your load balancer logs. You don't have to adopt complex distributed tracing right off the bat; just measure the signals that you have today. Over time, you can iterate to meet user needs. Sometimes that means increasing the fidelity of what you're monitoring; sometimes it means raising your SLO. Sometimes it means decreasing the amount of reliability that you expect, because customers actually favor velocity over reliability. Listen to your customers. Only alert on what matters. Get a peaceful night's sleep, and you'll have much happier developers.
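
As one hedged example of "measure what you can today," here is a sketch that derives a crude SLI straight from load balancer access logs. The log layout (status code and request time in particular field positions) is an assumption; adapt the parsing to whatever your load balancer actually emits.

```python
# Compute a rough SLI (fraction of good requests) from load balancer logs.
# Field positions are assumptions for illustration; adjust to your log format.

def sli_from_lb_logs(path: str, latency_threshold_s: float = 0.5) -> float:
    good = total = 0
    with open(path) as f:
        for line in f:
            fields = line.split()
            status = int(fields[8])          # assumed: 9th field is the HTTP status
            duration_s = float(fields[9])    # assumed: 10th field is the request time
            total += 1
            if status < 500 and duration_s <= latency_threshold_s:
                good += 1
    return good / total if total else 1.0

# Example usage (hypothetical path):
# print(f"{sli_from_lb_logs('/var/log/lb/access.log'):.4%} of requests were good")
```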

SLIs and SLOs Are Only Half the Picture

SLIs and SLOs are only half the picture. We also need to talk about once we've detected a problem, how do we actually resolve it? Our outages are never exactly identical, so we need to focus on how we resolve incidents that we've never seen before. You cannot predict how a failure is going to happen in advance, or else you wouldn't have written the bug in the first place. That means that we have to be able to debug novel cases in production, not in staging. We can't wait weeks to debug something. We need to be able to understand things as they're happening in production.

The number one place that I see people spending time when they're on-call is forming and testing hypotheses. Did the system fail because this build ID is bad? Did the system fail because this set of users is located in the People's Republic of China? If you can't test and verify these hypotheses, it's going to take a lot longer to debug your incidents. This means that we have to be able to dive into our data to ask new questions in order to understand what's happening inside of our system.

Our Services Must Be Observable

All this is to say that our services must be observable. We have to be able to understand, from the signals that they're emitting, what's happening in the internals of the system, without pushing new code. In order to do that, we need those events in context that I talked about earlier, from our critical user journeys. We have to be able to understand and explain: why are some of them erroring? Why are some of them slower than others? Do we have the relevant high-cardinality dimensions, things like user ID and user agent? These are all important properties that we need in order to test hypotheses against any dimension, not just the finite dimensions that we commonly measure in monitoring systems. Even better yet, why do we have to do this investigation while it's broken? Why can't we automatically mitigate and then investigate using the telemetry signals in the morning?
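
As a sketch of what "events in context" can look like in code, here is a hypothetical request handler instrumented with the OpenTelemetry Python API, attaching the kinds of high-cardinality dimensions mentioned above to a single wide span. The handler, attribute names, and request/item objects are illustrative, and a configured OpenTelemetry SDK and exporter are assumed to exist elsewhere.

```python
# Illustrative instrumentation: one wide event (span) per request, carrying
# high-cardinality context so hypotheses can be tested against any dimension.

from opentelemetry import trace

tracer = trace.get_tracer("shop.cart")  # hypothetical service/component name

def handle_add_to_cart(request, item):
    with tracer.start_as_current_span("add_to_cart") as span:
        # High-cardinality dimensions: fine on traces/events, ruinous as metric labels.
        span.set_attribute("user.id", request.user_id)
        span.set_attribute("http.user_agent", request.user_agent)
        span.set_attribute("cart.item_id", item.id)
        span.set_attribute("request.id", request.id)
        span.set_attribute("build.id", request.headers.get("x-build-id", "unknown"))
        # ... do the actual work, then record the outcome on the same span ...
        span.set_attribute("http.status_code", 200)
```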

Observability is not just break/fix, though. Observability, this ability to ask novel questions of our system, also helps us develop quality code. It helps us release on a predictable cadence, understand what our users are actually doing and which features they're using, and see which areas of our system are especially scary, where we should invest more in decomplexifying.

Observability Isn't Just the Data

Observability is also not the data. It's not logs, metrics, and traces. What it is, is the overall capability of our socio-technical system. Can we understand how our systems are functioning by writing the right amount of telemetry in an easy fashion? Can we instrument easily? Can we emit the data to a place that is comprehensive, that is fast, that is reliable, and that is cost effective? Can we query it to answer our questions? That's what matters. Logs, metrics, and traces are varying ways to manifest that data, but they're not the observability capability as a whole.

SLOs and Observability Go Together

SLOs and observability go together, because SLOs tell you when things are too broken, and observability helps you debug when things are broken. You also need the ability to collaborate among your teams. We cannot operate our services on heroism. That just doesn't work anymore. We have to collaborate across multiple individuals and multiple teams in order to understand our complex microservices. We have to work not just within our engineering teams, but with ops teams, customer success, and our back-end data center providers. We have to train together. We have to practice chaos engineering. We have to practice resilience engineering. We have to do game days. We have to make sure that people have those working relationships and trust built up in advance of 3 a.m. when the pager really does go off. Include everyone. Practice these things and improve upon your culture of psychological safety. Make sure that responsibilities are fairly allocated among your team, especially in this time of COVID. Can we lean on our team? Can we not rely on individual heroism? Can we make sure that people can take breaks for parenting? Can people observe their religious holidays? The team is the unit of delivery, not the individual, so let's not have people feel pressured as individuals to over-deliver when they're super stressed out.

We Learn Better When We Document

Document the right amount. Make sure that people understand: what is the service for? Why is it important? How can I mitigate problems quickly? What services does it talk to? That's the bare minimum that you have to document. Share that knowledge out. Make sure that people are not shamed for saying, "I don't understand that." Instead of "I can't believe you didn't know that," say, "Thank you for asking the question. Let's work on documenting that together."

Make sure that your teams are using the same platforms, tools, and terminology. Use common standards like OpenTelemetry to implement your observability strategy. That way, you'll really be able to get on the same page and not have conflicting sources of data. Reward that curiosity and teamwork. Thank people for their contributions. That really matters when you don't have that physical, in-person connection. Above all else, we have to learn from the past in order to reward our future selves. In the words of my friend Tanya Reilly, "Leave cookies, not traps, for your future self." Our outages do not exactly repeat, but they definitely rhyme, so there are things that we can learn from past outages to improve our future performance.

Risk Analysis Helps Us Plan

We need to perform risk analysis to really maximize the impact of the engineering changes that we make. We need to quantify our risk by the frequency and impact. How often does this happen? Does it happen once a year or once a month? How many people does it impact? Does it impact 100% of our users, 5% of our users? How long does it take to detect? How long does it take to repair? Does it take 2 hours, 5 days? By combining all these numbers, we figure out which risks are the most significant. What's going to have the greatest impact in terms of things that we can do to reduce our downtime for users?

We ultimately want to address the risks that threaten the SLO. If a MySQL server goes down once every 2 months and takes down 100% of users for 2 hours, that seriously impairs our ability to deliver four nines of reliability, versus something that only impacts 2% of users when it fails, which doesn't impact our SLO quite as much. We don't need to categorically eliminate these failures; we just have to decrease how severe they are. Having a service level objective agreed upon with your stakeholders enables you to make the business case to fix the issues that might cause repeated outages. It allows you to prioritize the right amount of work to make sure your service is reliable enough, without getting into the weeds. Don't waste time chrome-polishing. Don't adopt Kubernetes unless you need it. Instead, really focus on what matters.
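
Here is a rough sketch of that prioritization: expected "bad minutes" per year for each risk (frequency times impacted fraction times duration) compared against an annual four-nines error budget. The two risks and their numbers echo the example above but are otherwise made up.

```python
# Quantify risks as expected user-impacting minutes per year and compare them
# to the annual error budget for a 99.99% SLO. Numbers are illustrative.

minutes_per_year = 365 * 24 * 60
error_budget_minutes = (1 - 0.9999) * minutes_per_year   # ~52.6 bad minutes allowed per year

risks = {
    # name: (incidents per year, fraction of users hit, minutes to detect + repair)
    "MySQL primary fails": (6, 1.00, 120),   # every 2 months, all users, 2 hours
    "Cache node fails":    (12, 0.02, 30),   # monthly, 2% of users, 30 minutes
}

for name, (per_year, fraction, duration_min) in risks.items():
    expected_bad_minutes = per_year * fraction * duration_min
    print(f"{name}: ~{expected_bad_minutes:.0f} bad minutes/year "
          f"({expected_bad_minutes / error_budget_minutes:.1f}x the annual budget)")

# The MySQL risk alone burns ~720 bad minutes a year, roughly 14x the budget,
# so it is the one worth engineering effort; the cache risk barely registers.
```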

Lack of Observability and Collaboration Is Systemic Risk

Two risks that you won't necessarily see called out as individual line items are lack of observability and lack of collaboration. If you do not have the ability to understand quickly what's happening within your systems, that's going to add minutes, hours, or even days to every outage that you have. Improving your reliability can take the form of improving your observability. If you decrease the time every outage takes to solve by hours or days, that will have a dramatic impact upon your compliance with your service level objectives.

If you improve your ability to collaborate across teams, you lower the boundaries between teams so that people can quickly draw on what their co-workers have learned, quickly point out when there's a problem, and escalate issues easily. Again, that's going to decrease the duration of your incidents from hours to minutes. You don't have to be a hero in order to have a successful team. You do have to have the right tools, yes, but above all else, you have to season your alphabet soup with a culture of production excellence.

Production Excellence Brings Teams Closer Together

Bring your teams closer together. Measure what matters by measuring those customer journeys and service level objectives. Debug with the power of observability. Collaborate across your teams by fostering those bonds. Prioritize closing that feedback loop and fixing the unnecessary complexity and fixing the repeated issues that come up over and over again.

 


Recorded at:

Jan 02, 2021
