
Cultivating Production Excellence - Taming Complex Distributed Systems



Liz Fong-Jones talks about several practices core to production excellence: giving everyone a stake in production, collaborating to ensure observability, measuring with Service Level Objectives, and prioritizing improvements using risk analysis.


Liz Fong-Jones is a Staff Site Reliability Engineer.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Fong-Jones: Thank you so much for being here. I know that this is a late slot, so I'm going to be as punctual as I can be and as economical with my words. I'm Liz Fong-Jones and I am indeed a Developer Advocate and a Site Reliability Engineer. I work at Honeycomb. But this talk is about lessons that I learned throughout my time as a site reliability engineer at Google, working across about 10 teams and working across many of our customers of Google Cloud. And things that I saw that made for really great experiences for the people that were on call looking after systems, and things I thought were kind of problems. So we're going to go through a fictional case study today that explains how you can get from a team that's dysfunctional to a team that's working well. And of course, these slides are illustrated because an illustrated slide is worth 1,000 words. So make sure to give the artists props as well in doing this.

QCorp ‘bought’ DevOps - Fictional Case Study

A hypothetical corporation called QCorp bought DevOps. They went to a vendor who said, "Yes, we will sell you DevOps in a box." And unfortunately, QCorp is finding that the results are not what they anticipated. Because they said, “Give us all of the alphabet soup. I want to order the alphabet soup.

Give me Kubernetes, because I want to orchestrate my containers. Wait a second, what is a container? Well, you know, we'll have to figure that out.” Or “Let's go ahead and deploy continuous testing. Let's do continuous integration and deployment. Let's set up PagerDuty. Let's put all of our engineers on call. Hey, I heard that this full production ownership thing is a great thing. Let's put all of the engineers on call and let's go ahead and standardize everything and put it into code so that we have infrastructure as code. Great, right? Like we solved our problem.”

Well, not so fast I would argue. This isn't working quite as we expected. For one, when we turned on the vendor monitoring solutions, we got walls, and walls, and walls of dashboards. The vendor sold us on, "Hey, we can monitor all of your existing infrastructure and you don't have to write a single line of code. We'll just do it for you." But now I have a new problem. Which dashboard do I look at? How do I tell if something is wrong? And out of all of these dashboards that are provided to me, and the dashboards that everyone who came before me thought were useful, how do I know which one to look at?

But it gets even worse because now I've put my engineers on call and they're really, really grumpy because I haven't done anything to address the noisy alerting problem. And my engineers are getting constantly woken up and they're getting really unhappy both at the system, at me, and at each other.

And incidents are taking a long time to fix because of the fact that engineers don't have the appropriate visibility into what's going on in their systems. And while they're sitting there trying to understand what's going wrong, users are sitting there wondering “Why isn't the app working?” And they're getting even more unhappy.

The other problem that happens is that my senior sysadmins are the experts on how the system works. And everyone on the team winds up escalating issues to them when they can't figure it out. And the system admins just say, "Oh, go run this command". They're not actually explaining it to people, so people just have to keep coming back over, and over, and over again for more and more advice. And the experts are thinking, “I don't have time to document this and write it down. I'm just going to give you my answer and then hopefully get back to work”. But unfortunately, they never get back to work.

There's also the problem that our deploys are unpredictable. What we thought we were getting when we said that we were going to do continuous integration wound up not catching the problems that we were expecting that it would catch for us. All our builds were green, and we pushed out the code. But because of interactions between our microservices and our monolith, we wound up having the code not work in production. Cue breakage, cue unhappy customers, cue rollbacks, and cue developers who are not happy to see their production code not actually running in production. And the team is constantly fighting fires still. All this tooling has, in fact, made their lives worse. They have a new set of things to learn, a new set of things to adapt, and yet they're still getting paged all the time: “Production's on fire, users are unhappy”. And they feel like they're juggling giant flaming balls. That's not a good state for a team to feel like they're in.

And even when people do find a spare hour that they're not being interrupted, they don't know what to work on. They don't know how can we get ourselves out of this mess? They're lost in the woods without a map because all of these tool vendors have come in but none of them have offered any kind of advice on the strategy on where the team is going and how they should approach things.

So the team feels like they're holding on to the edge of a cliff. They're struggling to hold on. And it's true, they're drowning in operational overload, and no amount of tooling is going to solve this problem. So I ask you, this is kind of a rhetorical question, what should QCorp do next? Okay, my clicker has stopped working for some reason. So what should QCorp do next?

I think the problem here that we have to recognize is that QCorp has adopted a tools-first approach instead of a people-first approach. They've forgotten that people are the ones who operate systems and that we're operating systems for people. We don't operate services just because we think it would be funny. We don't operate services in order to cause pain to each other. That's not the right approach. So we have to really think and look back at the broader picture, the people. Our tools can't perform magic for us. Our tools can make our lives easier if we have a plan, if we know what we're going to do, the tools can help us achieve that faster. But if we don't have the right culture, if we don't know what we want the tools to do for us, then the tools are going to behave more like the brooms in the movie Fantasia, they just multiply and grow out of control and create worse problems for you to mop up afterwards.

So what we need to do - and this is taking a leaf out of Randy Shoup’s playbook, he said upstairs a few minutes ago that culture trumps everything, that culture trumps technology. And that culture, bad culture can in fact, even overwhelm the best-intentioned people, the best-intentioned processes. So we have to look at all of these things but especially the culture and the people. And that's how we're going to get ourselves out of this situation.

Production Excellence

And this is where I think I want to introduce the word production excellence. I think that production excellence is both a set of skills that's learnable, that anyone can work on, but especially that any organization can focus on improving. And that it's a method to obtain better systems that are more sustainable, that are appropriately reliable and scalable, but also that don't burn the humans out. And that's what we need to be working towards, even before we think about choosing a single tool.

So our goal here is to make our systems much more reliable and friendly. It's not okay to feed the machines, the gears of the machines, with the blood of human beings. You only have so many human beings and if you have vampire machines, they're eventually going to cause all of your people to quit. And then your system won't work at all. So that's not a good place to be. But you don't get production excellence by accident. You have to plan for it. You have to be really intentional. You need to figure out where are we? And take an honest look, don't lie to yourself about it, right? Don't say, "Oh, we're going to score five on all these metrics, my team is perfectly safe. We are using all of the code in the appropriate ways." No, you have to say, "You know what, I recognize I'm not measuring the things I should. Let's start off by taking that first step, rather than jumping to running."

We have to develop that roadmap and think about where are we going in order to accomplish the production excellence state of having services that are sustainable and reliable. We have to also think not just about short term results but instead about long term results. We have to act on what matters in the long term, not necessarily in the short term. Again, Randy said in his talk a few minutes ago that, if you run into this cycle where you feel you don't have enough time to do the things right, so you shortcut them and generate technical debt, and then you're even more out of time, we're trying to get you out of that. But in order to get you out of that, you have to have air cover. You have to have people that are willing to stick up for taking the time to do things right in order to foster the long term health of the team.

And you can't just approach this by yourself, even if you're an executive. But most of us in this room aren't executives, right? You can't do this by yourself. Changing the culture of a team is an intentional effort that involves everyone on the team. And not just everyone on the immediate team, it also has to involve the people that are adjacent to your team. Most of us in the room here are engineers but when was the last time that you invited marketing to your meetings? When was the last time that you invited sales to your meetings? Your executives, finance? What about your customer support team or in the cases that you're not already trying to support your infrastructure itself, do you meet with the people operating your service? Cultivating production excellence has to involve all of these stakeholders or else you're going to have a really segmented effort that doesn't actually deliver what people need to have their jobs be sustainable.

So we also have to think about on each team, how do we make people more successful? How can we build them up? And part of that involves thinking about how do we increase people's confidence in their ability to touch production, in their ability to investigate things? And this involves, as Andrea said in her talk yesterday about psychological safety, this involves people feeling like they're in that mode of thinking where they can take their time and think rather than instinctively reacting with fight-or-flight reactions. And you have to make sure that people feel safe to ask questions, and that people feel safe to try things out and potentially fail. If you remember Sarah's [Wells] keynote from yesterday, Sarah was telling the story of the product developers being too afraid to touch production and restart the MySQL database for 20 minutes because they didn't know what was going to happen. An outage was continuing that whole time, right? We have to be safe to take risks in order to have production excellence.

How do we get started here? What's the first thing that I should do? Well, I'm going to tell you today the four things that you need to do in order to get started. They're not necessarily in any particular priority order but this just seemed to be the logical way to introduce them to you. The reason that we're here operating our services is that we need to know when our services are too broken. And when they are too broken, can we figure out what's gone wrong with them? And can we work with other people? Remember, these are microservices, they span multiple teams generally. Can we work together and debug together when things are too broken?

And once we've resolved the immediate crisis, not just are we patching things and restoring things to a shoddy state, but instead are we able to eliminate the complexity? Are we able to get things to a more mature state in the longer term such that people can have a smoother experience, that people can not spend their time fixing the same bugs or patching them over, and over again?

Our systems, I would argue, are always failing. When you look at your lawn, do you look and say that individual blade of grass is brown, therefore, I need to throw out the whole lawn? Do you try and make sure every single blade of grass in your lawn is green? No, you don't do that. We're much more interested in the aggregate state of affairs than we are in any one particular small failure that doesn't impact the end user experience. If your kids can play on it, if your dog can play on it, it's probably a perfectly fine lawn.

Service Level Indicators and Service Level Objectives

That means that we need to take a different approach and instead of trying to make everything perfect 100% of the time, we want to measure that concept I talked about earlier of are things too broken? So that's where Service Level Indicators come into play. Service Level Indicators and Service Level Objectives, their close cousin, are how we measure whether our quality of service is good enough, aka not too broken. So we have to think in terms of every event that happens in our system in its context to evaluate are things in a good state or not?

And we have to have not just humans looking at every single user transaction to figure out did it make the users happy? But instead, we need to be able to teach machines to evaluate this for us, to tell us is this good enough or not? So one way of doing this is to figure out what makes users grumpy? What would cause a user to be unhappy? What would damage your company's bottom line? This is where collaborating with people can help. Ask your product manager. Your product manager probably has a pretty good idea of what users will and won't tolerate. What really delights users? Those are questions that a product manager is definitely well-equipped to answer. So one possible approach to this if you have a request-response service is to look at did the request return an error or was it a success? Did it return quickly enough?

And as long as both of those things are true, then that is probably a good event that satisfied the user on the other end. Or maybe this is a batch request. I was particularly fascinated by Colin's talk earlier in this track, where he talked about the fact that we don't only have request-response services, we have to be able to understand and examine the reliability of batch processes. For instance, are we sending out people's credit card bills on time? Are we, in fact, deleting users' data when we're supposed to? This is super important because of GDPR, right? So we need to look at things like is this row of data fresher than 24 hours? If so, that's a good row of data because we know that we've done what we are supposed to do every single day. And if not, then that row of data gets marked as bad, and it counts as an error that we may need to investigate.
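As a minimal sketch of the good/bad sorting described above, the two checks might look like this in Python. The `Request` type, the function names, and the 300 ms latency cutoff are assumptions for illustration; only the 24-hour freshness rule comes from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed threshold for "did it return quickly enough?"
LATENCY_THRESHOLD_MS = 300.0
# From the talk: a row is good if it is fresher than 24 hours.
FRESHNESS_WINDOW = timedelta(hours=24)


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one request-response event."""
    status_code: int
    latency_ms: float


def is_good_request(request: Request) -> bool:
    """Good means the request succeeded AND it returned quickly enough."""
    succeeded = request.status_code == 200
    fast_enough = request.latency_ms < LATENCY_THRESHOLD_MS
    return succeeded and fast_enough


def is_fresh_row(last_updated: datetime, now: datetime) -> bool:
    """Good means the batch job touched this row within the freshness window."""
    return now - last_updated < FRESHNESS_WINDOW
```

Either function gives a machine a yes/no answer per event, which is what lets the counting happen without a human looking at every transaction.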

So the bottom line for both request-response based and for batch jobs is that we have to think about what threshold buckets these events? What are the discrete events that we're tracking and what lets us sort them into good or bad? Once we have that, that enables us to do some pretty nice things. For instance, we can also think about what's the denominator, right? Like how many total eligible events were there? There might be a condition where, for instance, if you get attacked by a botnet and you send it a flood of 403s, do those requests count as good? Do they count as bad? I argue they're not even eligible. They're not in scope. This is not user traffic. So you need to exclude the events that you don't think matter to your end users from your calculations.

Once you have this, then we can compute the availability. You can compute the percentage of good events divided by that number of eligible events. And we can set a target on it. We can set a target service level objective that aggregates all of this data over a period of time. So we need the window, for instance, a month, two months, three months, and the target percentage that we're trying to reach. Why do I say a month to three months and not say a day? The reason is that if you reset that target every single day, you're ignoring the fact that users and customers have memories and you can't say I was 100% down yesterday, but I'm 100% available today. Therefore, I'm going to ignore your complaints. People care about your quality of service measured over months or years.
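To make the arithmetic concrete, here is a small sketch of availability as good events divided by eligible events. The `Event` type and its two flags are hypothetical; how an event gets marked ineligible (for example, botnet traffic answered with 403s) is left to whatever classification you already have.

```python
from typing import Iterable, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical event record, already sorted into good/bad."""
    good: bool      # passed the good/bad threshold
    eligible: bool  # False for traffic users don't care about, e.g. botnet requests


def availability(events: Iterable[Event]) -> Optional[float]:
    """Return good / eligible, or None when there were no eligible events."""
    good = 0
    eligible = 0
    for event in events:
        if not event.eligible:
            continue  # out of scope: counts as neither good nor bad
        eligible += 1
        if event.good:
            good += 1
    if eligible == 0:
        return None
    return good / eligible
```

Note that the ineligible events never reach the denominator, so a flood of attack traffic cannot make the number look better or worse.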

So for instance, I might pick a service level objective that says that 99.9% of my events will be good measured over the past 30 days, using the definition of good from earlier: they have to be HTTP 200s served in less than 300 milliseconds of latency. So that's one example service level objective. But yours may or may not look like that. The important thing about what your SLOs should look like is that a good SLO barely keeps users happy. It should be good enough that if a user encounters an error, instead of saying, "Oh, my goodness. Google's always down," they say, "Oh, well, I guess I'm going to press refresh". If someone says "Meh" when they see an error, that's probably about the right threshold to set your availability target at.
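The example objective above (99.9% of events good over the past 30 days, where good means HTTP 200 in under 300 milliseconds) can be sketched as a check over a rolling window. The `Slo` and `Request` types and the `meets_slo` function are hypothetical names, and treating an empty window as "met" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Slo:
    target: float = 0.999                  # 99.9% of events must be good
    window: timedelta = timedelta(days=30)  # measured over the past 30 days
    max_latency_ms: float = 300.0           # "served in less than 300 milliseconds"


@dataclass(frozen=True)
class Request:
    timestamp: datetime
    status_code: int
    latency_ms: float


def meets_slo(requests: Iterable[Request], slo: Slo, now: datetime) -> bool:
    """Check the good-event ratio over the window that ends at `now`."""
    window_start = now - slo.window
    in_window = [r for r in requests if window_start <= r.timestamp <= now]
    if not in_window:
        return True  # assumption: no traffic means nothing was violated
    good = sum(
        1 for r in in_window
        if r.status_code == 200 and r.latency_ms < slo.max_latency_ms
    )
    return good / len(in_window) >= slo.target
```

Because the window is 30 days rather than one day, yesterday's outage still counts against you today, which matches how users remember it.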

How Can You Use Service Level Objectives?

And you can use SLOs for two things. First of all, you can use SLOs to drive alerting but secondly, which I'll get to later, you can use SLOs to drive your business decision-making.

Let's talk about the first case of how we drive our alerting with SLOs and how we clean up this problem of everything alerting us all the time. Now that we've decided what good means and what makes users happy, instead of looking at our individual hard disks filling up on an individual machine, we can look at the trend in our service level indicators. We can ask questions like are things running in a bad direction? Is my service on fire and requiring my immediate intervention? And we can quantify that using math. So the way that we do that is that we set our error budget to be one minus our service level objective. If my service level objective says 99.9% of events must succeed, and I have a million events in the past month, that means that mathematically, a thousand events out of my million are allowed to be bad. And I can use that as a budget. I can be thoughtful about how I spend it and if it looks like I'm going to run out of budget or run out of money, kind of using a more mundane analogy, then I might want to do something about it.
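The budget math above is short enough to write down directly. This is only a sketch; the function names are made up for illustration.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of events allowed to be bad: (1 - SLO) * total events."""
    # round() absorbs floating-point noise such as 1000.0000000000009.
    return round((1.0 - slo_target) * total_events)


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> int:
    """Budget left after the bad events seen so far; negative means overspent."""
    return error_budget(slo_target, total_events) - bad_events
```

With the numbers from the talk, `error_budget(0.999, 1_000_000)` gives 1000 allowed bad events, and after 400 bad events `budget_remaining(0.999, 1_000_000, 400)` gives 600 left to spend.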

And I can compute things like how long will it be until I run out? Is the money flying out of my bank account, or is it just like a slow drip? And based on that I can decide, I don't have to make everything an emergency. If it's urgent and I'm going to run out within hours, then maybe I need to wake someone up and they need to fix the problem right away. But if I'm not going to run out of error budget for days, maybe it can wait in line and someone can look at it during the next working day. And this, as I said earlier, lets us make data-driven business decisions. And this is really cool to see this featured in Sarah's talk, featuring Charity, my boss, that this lets us make a trade-off. This lets us set a budget of how much error are we allowed to have and spend it either on increased velocity or in the natural errors that happen inside of our systems.
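One way to sketch the "flying out of my bank account or a slow drip" decision is to project when the budget runs out at the current rate of bad events. The function names and the 4-hour and 72-hour cutoffs are assumptions; the talk only distinguishes "within hours" from "days".

```python
def hours_until_exhausted(remaining_budget: float, bad_events_per_hour: float) -> float:
    """Project how long the remaining budget lasts at the current burn rate."""
    if bad_events_per_hour <= 0:
        return float("inf")  # nothing is burning, the budget never runs out
    return max(remaining_budget, 0.0) / bad_events_per_hour


def alert_action(hours_left: float,
                 page_within_hours: float = 4.0,
                 ticket_within_hours: float = 72.0) -> str:
    """Map time-to-exhaustion onto an urgency level (thresholds are assumed)."""
    if hours_left <= page_within_hours:
        return "page"    # wake someone up, this needs fixing right away
    if hours_left <= ticket_within_hours:
        return "ticket"  # can wait in line for the next working day
    return "none"
```

For example, 600 remaining events burning at 300 per hour leaves 2 hours, which pages someone; the same budget burning at 10 per hour leaves 60 hours, which becomes a ticket.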

For instance, I can ask questions now and get definitive answers. Is it safe to push this risky experiment? Well, maybe if you use a flag flip, so you can revert it fast enough and you're doing it on a small enough percentage of users. We can figure out what the maximum fallout would be and say you know what, that's tolerable within our budget, we've got plenty left to spend. Or we can ask questions like, do I need to stand up a second or third region in my cloud provider? Do I need to go multi-region right now? Well, maybe the answer is yes, if you're bleeding error budget, and it's because of your underlying provider. But maybe the answer is no, if you have enough reliability for now.
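The "maximum fallout" question for a risky experiment can be sketched as a worst-case estimate compared against the remaining budget. All names here are hypothetical, and capping a single experiment at a quarter of the remaining budget is an assumed policy, not something the talk prescribes.

```python
def max_fallout(events_per_minute: float,
                exposed_fraction: float,
                minutes_to_revert: float) -> float:
    """Worst case: every exposed event fails until the flag is flipped back."""
    return events_per_minute * exposed_fraction * minutes_to_revert


def safe_to_launch(worst_case_bad_events: float,
                   remaining_budget: float,
                   max_share_of_budget: float = 0.25) -> bool:
    """Allow the experiment if its worst case fits in an assumed share of the budget."""
    return worst_case_bad_events <= remaining_budget * max_share_of_budget
```

At 1,000 events per minute, exposing 1% of users with a 10-minute revert time risks about 100 bad events. With 600 events of budget left that fits under the assumed 25% cap; with 200 left it does not.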

I'm talking about this abstract SLO concept and some of you may be saying, "Okay, let's try to get it right." But I'd argue here that you can't act on what you don't measure. So start by measuring something, anything. If you have a load balancer, take the logs from that, use that to drive your service level objective. You don't have to get super fancy, you don't have to come right out of the gate and expect to be like the "Financial Times" with synthetic user monitoring and all of these awesome things that they built.

Just start with the basics and then iterate over time. You'll get most of the benefits early on. And if you see problems, well, then you can, in fact, iterate from there. You can, in fact, say I had an outage that was not caught by my SLO. Let's change the monitoring of how we measure the SLI. Or you might even have situations in which it's not the measurement that's wrong but instead, it's the overarching nature of your service level objective that's not meeting user needs. For instance, if users expect to see four nines and your SLO is three nines, you probably have a lot of rearchitecting you need to do in order to make them happy. And you can, in fact, creep that service level objective upwards over the span of multiple months, rather than alerting yourself every time you do not meet four nines. Or if you have a significant outage and users don't complain, maybe that's a sign that your service level objective is set too high and you can relax your SLO.

Settling on an SLO is never a one-and-done process. You have to iterate on it to meet your users’ needs. So check in maybe every three months, maybe every six months, with your product managers and with the business and figure out what's the right thing to make our users happy. And this, when you implement it, means that your product developers will finally sleep through the night because they're no longer being woken up by random alerts that the hard disk is full, or one server is seeing an anomalously high error rate. But SLIs and SLOs, I would say are only half of the picture because SLIs and SLOs will tell you when something's wrong, but they won't tell you how to fix it or how to mitigate it.

Complex Systems Failures

So you need to think about how do we understand complex systems failures? And the one thing that I've realized is that outages in complex systems are never the same. They may rhyme but they're not identical and you can't take the same thing that you used to solve the outage last week and apply it to today. Every outage is unique.

That means that we also can't predict what is going to fail. We may have some good ideas but those good ideas will only take us so far. It doesn't make sense to focus on the tiny bugs with a magnifying glass in front of us and then realize that there's a giant bug behind me. Oh, no. So we have to instead think about the more general cases. How do we make our systems easier to understand without pre-supposing how they're going to fail? If you have N services that are all talking to each other, that's N squared possible failure modes. You can't possibly instrument each of these N squared cases. I would argue don't bother trying to do that. So we have to make sure that our engineers are empowered to debug novel cases in production. In production is super important.

Staging is a useful tool sometimes, for instance, for trying out new things that you are never intending to deploy. But all of our users’ pain comes from our actual production environments. So if you can't understand what's going on with the user in production, you're going to have a lot harder time reproducing that bug in your staging environment. This means that we need to collect a lot of data about the internal state of our system that our systems are reporting back to us, and we also have to feel like we can explore that data without being constrained by what we originally thought we were going to need to ask from that data.

And all of this boils down to the statement that our services have to be observable. And I am going to use Colin's correct definition of observability, which is that it's information that our system provides to us in order to help us understand the inner state. The system is not observable if you have to push new code out to production or refactor your database schema to answer a question that you have about the inner workings of your service.

Can we explain events in context? We're collecting these individual events due to our service level indicators, but can we actually understand the full execution path? Do we understand why that individual RPC returned failure? Where in the 10 deep stack did it go wrong? And can we explain what's different between the ones that failed and the ones that succeeded? And this is something that I borrowed from Ben's talk yesterday as well, which is that we have to think about explaining the variance, explaining the differences in order to form hypotheses about what has gone wrong so we can go fix it. But even better than fixing it right now, maybe we need to think about how we mitigate our impacts.

Going to the talk from Edith earlier today about why do we do feature flagging, the answer to why we do feature flagging and rollbacks is that feature flagging and rollbacks let us mitigate the impact right away. Roll back whatever is problematic and then debug later. A well-designed automated system that has an appropriate reliability and observability, it will fail, it will correct itself, and someone the next business day can look at the telemetry and figure out what went wrong. You no longer even have to wake people up at night in a system that is appropriately designed to be self-healing.

So SLOs and observability work really well together. You need one in order to tell you when things are wrong and you need the other to debug. But I think just focusing on those two isn't enough to achieve production excellence. Here's why: they don't give you collaboration. They don't give you the changes to your culture to support people in working together and developing those skills, in working together to analyze incidents.

Debugging doesn't happen alone. As I said at the start of the talk, when you work in a microservice environment, when you work in a full ownership environment, your team has dependencies upon other services. And your team has other services, other teams depending upon it. So even if your outage is so small that only one person has to be debugging your own service, you have to work with other people to solve it. And we also have to think about all the various roles of the people that are involved in collaborating to debug an outage. The customer support team needs to know what's going on. The customer support team might even, if you have the right tools, be able to look at production themselves and solve or understand the customer's problem without even escalating it.

Debugging is for everyone. And the way that we get there is by encouraging people to cultivate the skill of debugging and giving them access to these tools instead of withholding them and hoarding them to ourselves. Debugging works a lot better when we all work together and put our brains together, right? We get better ideas about what may possibly be going wrong and what we can do to fix it.

And it's also an interpersonal skill. If you have teams that are not getting along with each other, do you think that they're going to have an easy time during an outage working together to solve it? Probably not. If you have finger-pointing, it makes solving outages a lot harder. People don't admit, "Hey, my service had a hiccup". We have to be able to work together both when things are calm as well as under pressure. And this goes back to Andrea's point again about psychological safety, and about type one versus type two thinking and also about growth versus fixed mindsets, thinking that we can learn the skills that we need to learn rather than assuming that everyone has a fixed set of skills.

We also have to think about the sustainability of our operations and not burn people out. Many of us here work in the EU. There is the working time directive, and it says that you have to structure your business so you are not relying on waking up people at all hours, that there are limits to how much you can put people on call. Even if you were to pay them extra, the EU doesn't let you put them on call past a certain amount. And it turns out that giving people flexibility with regard to their involvement in production is a really powerful tool because what that enables you to do is that enables you as a busy manager to not have on call during the day.

How often have you been a manager and had a meeting interrupted by a page? It sucks, right? Like you're having a one-on-one and then you get paged. That's a really awful experience for you and the person you're having a one-on-one with. So trade places, take the nighttime on call away from the person who has a small child, take the nighttime on call for yourself and give your daytime on calls to the person with a small child. Flexibility lets us do a better job of not burning out.

And it's not just on call, there's ticket load of responding to non-urgent events. There's doing customer support work of responding to customer queries. There's a lot of things that are production involvement but not necessarily on call. I think it's important for everyone to be involved in production to a degree but I don't think it has to look like on call. And not all on call has to look like 7 days a week, 168 hours straight at a time. At the end of the day, if you don't sleep, you can't be creative and creativity is essential to engineering. So we also have to think about documentation as part of our collaborative process, that we have to leave ourselves cookies, not traps. That's a quote from Tanya Reilly, I encourage you to look at her talks.

When you write down the right amount and keep it organized, it can help your team. But don't brain dump. Don't brain dump things that are going to become out of date. If you commit to writing down documentation, commit to maintaining it as well. And that will enable you to fix your hero culture. If people are sharing knowledge, that means that you no longer have that one person who's acting as the bottleneck. You no longer have that one person burning themselves out because they keep on getting peer bonuses and praise whenever they take one for the team.

So share that knowledge, fix your hero culture, and really don't over-laud people for solving the problem themselves, or shame people when they have to ask for help and escalate. And that goes to the point of rewarding curiosity and teamwork. We have to make sure that people want to ask questions and understand how the service works. We have to have the growth mindset, and we have to practice. We have to give people opportunities to exercise their curiosity to work together with others in terms of game days or wheels of misfortune in order to make sure that people are exercising these skills before it's a real incident.

And we also have to make sure that we're not repeating the same problems over and over again. We have to learn from the past and reward our future selves. So outages don't repeat, as I said earlier, but they definitely rhyme. We can identify the common patterns and see what we can do to mitigate those failure modes.

Risk Analysis

And risk analysis, a real structured risk analysis, done with type two thinking rather than type one thinking, done by deliberately thinking about it rather than letting our gut guide us, lets us really decide what's important to fix right now. For instance, if you have a bridge and maybe the deck of the bridge is letting cars fall through. You know, maybe an earthquake is going to come in 30 years that might cause the columns to collapse but not right now. Maybe you should think about what is urgent to fix right now and what can I plan to fix later if it becomes important enough?

We have to quantify our risks by frequency and impact. How often do they happen? And sometimes we don't have a lot of control over that. So that's sometimes known as the time between failures. Well, we have to also think about what percentage of users do they affect? Because sometimes you can control it. If you only push code to 1% of your servers at a time, that means that potentially you're only impacting 1% of users at a time rather than 100%. You have control over blast radius. You also have some degree of control over how long does it take to detect failures, and how long does it take in order to restore service? Instead of deep diving to try to understand what's going on in the moment, can we do a quick mitigation like draining a data center? Maybe that's the right thing to do to reduce the impact and severity of an outage.
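As a rough illustration of combining those factors, here is a minimal sketch in Python. It is not a method given in the talk: the `Risk` fields, the example numbers, and the choice to simply multiply frequency, duration, and blast radius together are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One failure mode, described by the factors discussed above (hypothetical model)."""
    name: str
    incidents_per_year: float   # frequency, the inverse of time between failures
    minutes_to_detect: float    # how long until someone notices
    minutes_to_restore: float   # how long until service is back
    fraction_of_users: float    # blast radius, from 0.0 to 1.0

    def bad_minutes_per_year(self) -> float:
        """Expected user-weighted minutes of outage per year."""
        duration = self.minutes_to_detect + self.minutes_to_restore
        return self.incidents_per_year * duration * self.fraction_of_users


# Same bug, same detection and recovery time; only the blast radius differs.
full_push = Risk("bad push to every server", 4, 10, 50, 1.0)
canary_push = Risk("bad push to 1% of servers", 4, 10, 50, 0.01)

for risk in sorted([full_push, canary_push],
                   key=Risk.bad_minutes_per_year, reverse=True):
    print(f"{risk.name}: {risk.bad_minutes_per_year():.1f} bad minutes/year")
```

With these made-up numbers, shrinking the blast radius to 1% cuts the expected cost by a factor of 100 without changing how often the failure happens or how quickly it is fixed.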

Once we figure out which risks are the most significant, that's the point at which we can prioritize our engineering work. We can figure out what do we need to work on today, and what can we put off until later? And fundamentally, the SLO here is, again, useful because we can address the primary risks that threaten the SLO. If you multiply that out and you say, you know what, I'm going to get 500 bad events out of every single month when my error budget is 1,000 bad events, that sounds like something that's worth fixing because if that happens more than twice in a month, you're going to have a problem, right? So that lets us prioritize where we make our improvements. And that also gives you really excellent data for the business case to fix them. Because if we agreed that the right threshold for a service level objective was 99.9%, or say 1,000 bad events per month, then that lets us go back and say this is going to significantly increase the risk of us exceeding that threshold. We need to stop working on new features in order to fix this bug.
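The arithmetic in that example can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the figure of 1,000,000 events per month is an assumption implied by pairing a 99.9% objective with a budget of 1,000 bad events.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of bad events the SLO tolerates over the measurement window."""
    return round((1 - slo_target) * total_events)


def occurrences_within_budget(budget: int, bad_events_per_occurrence: int) -> int:
    """How many times a risk can occur before the budget is used up."""
    return budget // bad_events_per_occurrence


# Assumed traffic: 1,000,000 events per month at a 99.9% objective.
budget = error_budget(0.999, 1_000_000)

# A risk that burns 500 bad events each time it happens.
allowed = occurrences_within_budget(budget, 500)

print(f"monthly error budget: {budget} bad events")
print(f"occurrences before the budget is exhausted: {allowed}")
```

Under those assumptions the budget is 1,000 bad events, so two occurrences of the 500-event risk use it up completely and a third one breaks the SLO.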

But you need to actually complete the work. It doesn't do anyone any good if you write a beautiful post mortem, and you put the action items on a board somewhere, and they never get dealt with. So actually make sure that you do the work. But I think that we also have to think about the broader cultural risks. We have to think about the broader picture of what can contribute to us exceeding our error budget. And I think that a lack of observability is fundamentally a systemic risk. If you can't understand what's happening inside of your systems, that's going to magnify your time to recover. It's going to dramatically magnify your time to recover. So we have to think about, as our systems become more complex, how do we tame that complexity by making them more observable. And if not, you're going to have a situation of operational overload and you're going to have a situation of unhappy customers.

I also think that lack of collaboration is a systemic risk. If your engineers cannot work together with each other, and with support, and with every single other department of your organization, outages are going to last longer, or people are not going to find out about outages, which lengthens that time to detect. If you have a lack of collaboration between your support team and your engineering team, because every time the support team reports an issue to engineering, they get told, "Oh, that's a problem between keyboard and chair. Why did you even bother us with this silly thing," they're going to not report a real incident.


Collaboration really matters and does have an impact on your bottom line and the happiness of your users. So I argue that in addition to buying the alphabet soup, you also need to season the alphabet soup with production excellence. But together, we can bring our teams closer together and tame the complexity of our distributed systems by using production excellence, by measuring “Are we achieving a high enough reliability?” Debugging, when that reliability is at risk. Collaborating together, both on debugging and on identifying problems, and figuring out, what are we going to fix? What is the right thing to fix that's in the critical path rather than wasting our time on things that are not going to meaningfully impact the happiness of the humans running the system or the happiness of the humans using the system? That's all that I have. Thank you.

Questions and Answers

Fong-Jones: There's one question back there. Also, while he's running the microphone over, Charity and I will both be right in front of the podium after this talk. So if you want some stickers, if you want to chat about this, I know that asking public questions is not everyone's jam. Please don't be shy and talk to us privately afterwards.

Participant 1: You talked in particular, you noted at one point saying about disks filling up. The system we're building at the moment we're taking the approach that you suggest using SLIs and SLOs to base our alerting. But I have to say that disks filling up terrifies me because the SLOs are a trailing indicator of the problem. We don't have a leading indicator of the disks filling up and our whole system exploding. Is that a problem or can you base SLOs on leading indicators of problems like that?

Fong-Jones: Yes, I think that there are two answers to this. First of all, if you have a predictable leading indicator, for instance, if your disks are filling up at a rate that will lead to exhaustion within a certain amount of time, which you know is always correlated with your system crashing, sure. Go ahead and set up an alert on it in the short term. But keep a close eye on the operational load that generates. We had a rule when I was working at a large cloud company that you are not allowed to have more than two incidents per shift, because you could really only investigate and fix thoroughly two incidents per shift.

So if that starts eating into your budget and causing too many people to get woken up all the time, maybe fix it and come up with a better solution. Which leads to my part two, which is maybe your systems should be resilient to disks filling up. Maybe if a disk fills up, you just restart that individual VM rather than allow data to accumulate on the disks that is no longer being backed up or stored because we're out of room, right? It's always easier to clean up an old replica and start a new replica than it is to try to clean up a disk that's been filled up.
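For the short-term alert on a predictable leading indicator, one possible shape is a simple linear extrapolation of disk usage. This is a sketch under stated assumptions, not something described in the answer: the function names, the two-sample linear model, and the 12-hour threshold are all hypothetical.

```python
from typing import Optional


def hours_until_full(used_then_gb: float, used_now_gb: float,
                     hours_elapsed: float, capacity_gb: float) -> Optional[float]:
    """Linearly extrapolate two usage samples to estimate time to exhaustion.

    Returns None when usage is flat or shrinking, since there is nothing to predict.
    """
    growth_per_hour = (used_now_gb - used_then_gb) / hours_elapsed
    if growth_per_hour <= 0:
        return None
    return (capacity_gb - used_now_gb) / growth_per_hour


def should_alert(hours_left: Optional[float], threshold_hours: float = 12.0) -> bool:
    """Alert only when exhaustion is predicted inside the (assumed) threshold."""
    return hours_left is not None and hours_left < threshold_hours


# 400 GB six hours ago, 460 GB now, on a 500 GB disk.
remaining = hours_until_full(400, 460, 6, 500)
print(remaining, should_alert(remaining))
```

A real alert would likely smooth over more than two samples, and as noted above, the number of pages it generates is worth watching.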

One question down here in the front. And I think there will be time for one more question after that. And you can definitely find Charity or me in front afterwards.

Participant 2: First of all, thanks a lot. Great talk. Speaking about production excellence and speaking about collaboration, does it require pair programming or it's optional?

Fong-Jones: I do not think it specifically requires pair programming. What it does require is that during an outage, people feel comfortable working with each other. For instance, looking over each other's shoulders, sending graphs back and forth, like bouncing ideas off of each other. Whether you do pair programming, whether you do code reviews, those are all good software development practices, but I don't think they have a direct bearing on operations necessarily.

Well, we'll be down there, and I so appreciate that so many of you turned out for a culture talk in the last slot of the day, so thank you again.




Recorded at:

Mar 28, 2019