
Embrace Complexity; Tighten Your Feedback Loops


Summary

Fred Hebert discusses various small approaches and patterns that influence how teams deal with reliability, and highlights some of the key interactions and behaviors.

Bio

Fred Hebert is a staff SRE at Honeycomb.io, caring for SLOs and error budgets, on-call health, alert hygiene, incident response, and operational readiness. He's a published technical author who loves distributed systems and systems engineering, and has a strong interest in resilience engineering and human factors.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Hebert: I'm a staff SRE at Honeycomb. My presentation is nominally about dealing with complexity and embracing it, and tightening our feedback loops, because those are the lessons I most often try to apply at work. That's the descriptive title for the talk, and it follows the guidelines for the conference. If I really had to follow my own heart and my own policies about what my talk would be, it would look a bit more like this: "This is all going to hell anyway, and all we can do is influence how long it's going to take." Things are going to be challenging. We're going to be veering towards the edge of chaos at all times. Any improvement we make to that pattern is going to be taken right back to drive us back there. You get more bandwidth, you transfer bigger files; you make things simpler, then you make more things until they're hard to understand again. Any improvement brings you right back to the chaos you had before. The best I hope for is the ability to influence that and stay away from the edge. Ultimately, I don't think there's enough control or ability to fully deal with it. Instead of just asking people to accept this as a given, I want to take a bit of time and go over why I think this is the way it is.

Moving off the Map (Ruthanne Huising)

What is probably my favorite paper ever is called "Moving off the Map" by Ruthanne Huising. It's an ethnographic study that she did with multiple corporations that were going through transformations, process changes, reorganizations, that kind of deal. What they did is mapping. The mapping would essentially be asking the question, what is it that we do here? For each of the teams, each of the people working, you ask them: what are their inputs? What are their outputs? What do we do when there's a problem? How long does it take? Whom do we deal with? What are the tools that they use? You put that on a wall with big bits of string to understand what is going on in the organization. This process is driven by people who are thought to be the experts within the organization at what it does. They are the people who know and understand things the best. They are asked to help build a map of what it is the organization does before the change is made.

She went through five of these organizations, and they each built something like that. The experts looking at the maps they built had these reactions. One of them said it was like the sun rose for the first time ever. They had seen all the components in isolation, and they understood what each meant, but never how they interacted together. Another said the problem is that none of this is designed. Everything was piles of adaptations that people built to solve the local problems they had, traded off against the overall organization. It was not designed. At some point, one of the managers walked the CEO through the map. The CEO just looked over it, sat at the table, put his head on the table and said, this is more fucked up than I imagined. What happened is that the map revealed to the CEO that the operation of the organization was outside of his control. In fact, the sense that he had any control at all was imaginary; it didn't really exist.

What was really interesting is what happened to the people who helped with the projects. Afterwards, a bunch of them got promotions. They were the people doing the communications, the training, the process design, and dealing with the costs and savings. When she interviewed them, what they said was, we had a big project to put on our resume, and we were always in contact with higher-ups, which really helped us climb the ladder. Then there was a second group of people, the people who more or less politely fucked off: they mostly moved to the periphery. If they were in the central production units, they moved to the edge of them. Some of them left the organization, became consultants, and came back to work there in that capacity. Some of them just flat out left the industry. She started looking at them, and they were the people who had collected the data and created the maps. The question was, what is it that makes the people who understand the organization best move away from the core and into the periphery? She identified two reasons. Either they mentioned that they now felt they better understood how to influence change and make things work, and that was not in the core. Or they felt alienated from the organization and their impact, and they just wanted to get out. They didn't like the experience anymore.

She attributes that to something she calls a fatal insight: the organizations that we have, the institutions, the roles, the routines, the procedures, are emergent properties. They are not a given that people follow, even though a lot of people follow them. They are maintained by the actions of people every day. The choices we make reinforce and create that structure. That's something sociologists have known for a while: these structures are not independent, they're not stable, they're constantly changing. Because these rules turned out not to be as hard and given as they had thought, some people felt more empowered: they could just go and change things directly, a bit more grassroots. The ones who had worked on the structure and really believed in it tended to be those who were alienated and wanted to leave.

Nominal vs. Emergent Structure

Leaving the paper behind, this brings me to a concept like that. Do we have anybody who has worked in a flat organization? It's a bit like the middle drawing, the one where you have one manager at the top, and then everyone is supposed to be equal. In practice, this isn't how it works. There is still a structure there, based on trust and based on authority, but with a bit less accountability. Some people have more knowledge, have more influence, and they can make or break a project. That brings me to splitting it into these two parts here. The nominal organization is a hierarchy. That structure is usually there to provide transparency, accountability, and, in multiple ways, constraints. You want to make sure that people work on the things they're supposed to be working on, and that they generally work in the proper direction. It's not everyone inventing what to do and spreading out. In practice, that structure is never alone in there. There's the one on the right, which is people who might like each other, hate each other, influence each other, have some amount of respect. Sometimes it's going to be based on knowledge. You take someone who worked on a legacy component, you move them somewhere else, they still own that component somewhere and have influence on that team. Sometimes part of that network is no longer in the organization because people left or retired, but it exists and it is there.

It isn't that one structure is really better than the other. They both exist in superposition at any point in time. If you run a project that only works on the nominal structure and you try to effect change, it's likely to fail, because it's going to miss a lot of the messy interactions that you have within the organization. If you only work on the emergent one, the trust structure, the knowledge structure, and you forget the rest, then the strictness of the hierarchy is likely to break you down. One of the weird ways they interact is that, as I mentioned, the nominal structure is often constraining you, letting you know what you can or can't do, while the emergent structure, the trust that you have, decides how these rules are enforced, and on whom. Some people are given more leeway because they have more trust in the organization. Again, I'm not arguing in favor of one or the other. They just need to coexist, and you need to be aware of both.

This is also related to something called the gap between work-as-imagined and work-as-done. Work-as-imagined is the way you conceptualize the work in your head: I believe engineers work this way or that way; I'm going to give them the following tools and ways to do it. That's how you prescribe the work as it is supposed to happen. Work-as-done is mostly invisible or impossible to fully describe. It is how you take these rules, these prescriptions, and the real situation, and bridge the gap between the real world and what you're supposed to do. Even if you try to describe it, you're not able to describe the entire thing. You're not fully aware of the things you do, but it has a real practical aspect. In some cases, you're going to disclose how you do your work, what happens, what the results were. You're going to conceal or keep some things to yourself, either because you are unaware of them or just because you don't want people to know about them, and that feeds back into it.

Just as a fun example: in January this year, I ran a poll on Mastodon with my followers. It's a super scientific poll. I asked 120 people: if you're working in software and you have to fill in timesheets about which projects and customers you work for, and you ran out of time but still had to complete the project, what did you do? Multiple choice answers. Thirteen percent of the people said, I work for free: I stop the work, or I just don't track it. Good for them. Thirteen percent said, I put the time in unrelated projects with more buffer, which in some cases, I believe, is fraud. Thirty-five percent said, I enter the time in the same project regardless. Comments were things like, it's not my problem the budget is wrong; it's going to take the time it takes. Fifty-eight percent of the people just said, my time tracking was always fake and lies. If you make decisions based on people's timesheets, you have that cycle in here: you are being told it takes this much time, it actually takes a different amount of time, and a different amount again gets reported; that feedback loop keeps reinforcing the way the work is imagined.

Breaking that cycle is not necessarily easy. I think it's being forced by pressures and goal conflicts. Whenever we make decisions at work, we are balancing a lot of really different constraints, and they're all important. The amount of workload we have to do. What's the staffing we have? How urgent is the task? Is it risky to do, for me, for the business, for people, for the environment? What's the amount of trust I have with people? What's the budget that's available? All of these things are part of every decision we make each day. What kind of work are we going to do? How much time are we going to allocate to it? Sometimes they succeed, and sometimes they fail. The outcome is not necessarily known ahead of time, but the way we manage all these pressures is a bunch of micro-decisions every day, and they build up over time. It's one of the many ways you could decide to describe your culture as a business or as a team: how you manage these tradeoffs.

Because this is all very local, it means it's really easy to influence, which means it's a great way to start a counterculture. You see people make decisions on your team, and you go, can we look into that? Is it cheaper to maybe buy more servers and spend your time on this instead? You can slowly create that counterculture in a team to do things differently. Sometimes that's going to be great, because you're doing skunkworks and you're working around a big bureaucracy that you want to bypass. Sometimes it's going to be terrible, because you don't have the proper values for the organization and you're just pushing your toy projects into production, or something. It's not that one is necessarily great or not, but it's one of the places where I believe we can apply pressure or influence that, over time, bubbles through the whole organization.

How to Embrace Complexity - Negotiating Tradeoffs

The rest of my talk is really going to focus a bit more on that: how do we embrace the complexity? I've divided it into three sections. The first one is negotiating the tradeoffs, these goal conflicts; how can we help with that? It has a perspective that's a bit more DevOps or SRE, because that's my experience. Then there's a section about aligning it back with the organization and the feedback loops. The third one is just having a good system view. We're going to kick it off with negotiating the tradeoffs. This is the first lesson I learned, talking at a conference a long time ago with the SRE manager for one of these sites where you upload images, put them on boards, and they show you ads and stuff like that: you don't deliver more than you're asked to. This SRE manager was telling me, we are having terrible problems in our organization. The website is going down constantly. We're burning people out. The on-call rotation keeps getting shorter. We're not able to hire anymore, because we have a reputation for having terrible schedules. It's just not looking great; what can I do to help with that? He mentioned there were perverse incentives at play. Every time the site went down, they made more money, because the ads kept being shown but the images wouldn't display, so there was no bandwidth cost associated with them. They're going, there are perverse incentives, but everyone is happy and users seem to be returning, so what the hell? I just asked the question: is it possible that you're trying to deliver this out of professional pride, because you're an engineer and you want things to be good and working well, and you've adopted that as your ethos? Is it possible that you're shipping more than anybody in the organization is asking for, and you're currently fighting both the users and the organization by shipping something nobody asked you to do? It's the equivalent of someone wanting a shed and you build them a nuclear bunker. Why don't you use the bunker? That's not what they want. They just want to put a rake in there. It's one of the lessons that stuck with me: you have to calibrate and actually deliver what is asked of you. In some cases, you want to do it and you're asked to do it, but you don't necessarily have the equipment or the capacity to do it properly.

This one is a story that comes from us at Honeycomb. We've decided that a good on-call rotation is five to eight people. With fewer than five people, you start burning people out. With more than eight people, you're on call so infrequently that you feel out of practice every time, and that's very stressful. Five to eight is like good exercise: you do it frequently enough that you're always in good shape, not so much that you hurt yourself, and not so little that you become rusty. The problem is that most of our product teams have enough engineers to maybe fill a rotation of three to four people. It's impossible for us to have a rotation where we have all the knowledge required about the components, with in-depth runbooks and an understanding of all the things. If we want a five-to-eight-person rotation, the approach we took is to just merge teams together on the rotation, which means half the people don't understand half the things they are on call for. That sounds like a really terrible idea. The approach we wanted was really to preserve that capacity. What we did as a tradeoff is gear the entire response towards dealing with the unknown. When we build or ship a new feature, rather than having a long soaking time of trying to understand all the problems that could happen with it and having a whole procedure, the guideline we have as an approach is: I want feature flags to isolate it, to turn it off for a single customer, or to turn it off for everyone. If there is a problem outside hours and nobody is there to help, we just shut the thing down. That's the tradeoff we take. Maybe the beta period is a bit longer. Maybe the early access is a bit longer, or maybe it's going to be a partial outage. Everyone who's on call has that high-level pattern of: this thing is weird, it's threatening the overall system, we're going to kill that part, preserve the rest, go to bed, and when the people who know what to do are up in the morning, they're going to help us fix it. We've used it multiple times, and the pattern is now propagating within the organization.
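
As a rough sketch of that kill-switch pattern (not Honeycomb's actual implementation; the flag store here is a made-up in-memory stand-in for whatever feature-flag backend a team uses), the check might look something like this:

```go
package main

import "fmt"

// flagStore is a hypothetical in-memory stand-in for a feature-flag
// backend; the talk doesn't name a specific tool.
type flagStore struct {
	globallyOff map[string]bool            // feature -> killed for everyone
	customerOff map[string]map[string]bool // feature -> customers it's off for
}

// enabled implements the pattern from the talk: one switch shuts the
// feature off for everyone, another shuts it off for a single customer,
// so on-call can kill the weird part and preserve the rest.
func (f *flagStore) enabled(feature, customerID string) bool {
	if f.globallyOff[feature] {
		return false
	}
	return !f.customerOff[feature][customerID]
}

func main() {
	flags := &flagStore{
		globallyOff: map[string]bool{},
		customerOff: map[string]map[string]bool{
			"new-query-path": {"cust-42": true},
		},
	}
	fmt.Println(flags.enabled("new-query-path", "cust-42")) // false: off for this customer
	fmt.Println(flags.enabled("new-query-path", "cust-7"))  // true: still on for others

	// Outside business hours, on-call kills it for everyone and goes to bed.
	flags.globallyOff["new-query-path"] = true
	fmt.Println(flags.enabled("new-query-path", "cust-7")) // false
}
```

The design point is less the code than the contract: every new feature ships behind a check like this, so "shut the thing down and preserve the rest" is always an available move for someone who doesn't know the component.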

During business hours, the thing that might happen is: yes, this is a big incident, and we need the expert who wrote this thing to help us with it. They're not on the rotation right now; we're going to bring them in. There is an understanding within the organization that in order for people to be on call less often, they cover a wider scope, but there are going to be more interruptions. The roadmap is understood to be more flexible, or possible to turn upside down, as a consequence of these decisions. By having that very frank conversation about what it is we want to do and what we can actually do, we end up having this tradeoff negotiation about capacity planning, iterative development, training, onboarding, staffing, testing approaches, operations, the roadmap, and feature delivery. It's held above water, and people can participate and understand how it is made, rather than it being made implicitly and tacitly in different ways by all kinds of people.

To have these tricky discussions and decisions, you have to be able to bring them up. That's a requirement for psychological safety. There is no replacement for it. You have to be able to mention what is a problem and have it heard. One of my favorite stories about that comes from a previous company I worked at, where there was a long sequence of incidents with common themes, 30 incidents or so in a year. We did a cross-reference analysis between them, and found out that most of them had as a stated reason: there are not enough tests, or testing was not good enough. The director of engineering went down the path of saying, we should hire people to train our employees to test better; they're asking for more tests. The next time an incident came up with that reason, I said I want to run an experiment. I went across departments, sat with the team, and asked just one question: when we write tests, there is a point between zero tests, [inaudible 00:17:34], and writing a formal proof with a mathematician, where somewhere in the middle we decide, this is enough for me to ship my code, and the reviewer looks at it and says, "Yes, that's ok. That's enough tests."

What guided our decisions? What were we looking for when we did it? An engineer took me aside and said, we knew the code was broken; it's just that we get yelled at less for shipping broken code than for shipping late. There's no amount of training in the world that's going to fix that problem. I went back to the director of engineering and reported the facts anonymously: they ship broken code because that's better than shipping late. The reaction was, that's ridiculous. They should know that they can pull the andon cord, press the big red button, stop the assembly line or something, and make it work. I went back to the team and told them that, and it was just, that's not how it works. It's just not how it works here. That's not how they did it. They had nobody senior enough on the team to be able to call the shots on that. You don't get that until you're able to have these discussions and surface that stuff. That's one of the patterns where there's a big difference between the imagined state of the organization and how it actually works. That disclosure requires psychological safety.

How to Embrace Complexity - Keep Feedback Loops Tight

We can make the tradeoff negotiations a bit easier and simpler by talking about them openly, the same way as in that story about testing. If you never change the culture you have, if you don't realign it, you're not necessarily going to get better results. That requires us to move that information around and make it available to people who act in good faith. Metrics are there to serve you, not you to serve them. It's the same thing that's been discussed plenty of times: a metric that becomes a target becomes gamed. In my case, I like to say that metrics are good for confirming a hypothesis you have, but they're generally garbage at helping you come up with a good hypothesis. They tell you something happened, but they're a very lossy, compressed signal about what is going on in your organization. In my mind, one of the only good reactions to a metric is to ask the question, what is going on in there, and have a deep dive. If you pick a metric, there's always going to be loss in the data. There's something you care about: it might be user happiness or satisfaction. There's a limit to how many surveys asking whether you'd recommend us to your friends you can send in a month. So there's a thing you end up measuring instead, because it's easier. You go, uptime is a really easy thing for us to measure in comparison, and we suppose that a site that is up longer is good for user satisfaction, because if the site is down all the time, users are not happy. The way you measure that is going to be messy itself. You could measure the end-to-end path from the end user's device, and then you find out that half your problems are people overseas, and you don't control their networks. So you end up cutting it off at your own service. What happens is that you have these weird influences, and some customers have more data than others. It's never the right shape: your metric is a worse stand-in for the thing you actually cared about, and their happiness ends up almost entirely disconnected from what you measure. Really, the thing you want to do is have a deep dive when these things happen, to understand what is going on. Use a metric to direct your attention, not as a way to direct your ambitions.

Which brings me to the next one: a useful indicator renders itself useless. It's related to the previous point. There are simple cases: if you figure out that people who drive at night have more accidents, and you stop them from driving at night, then measuring how many hours are driven at night is no longer useful. That's a simple one. One that we see a bit more in software is the existence of bottlenecks. You're usually going to use a bottleneck to drive autoscaling or something like that. My favorite one at work comes from disk usage on our main query engine. It used to be that disk usage was the thing that forced us to scale horizontally all the time. It became a cost target, because the hardware we ran every month was problematic. At some point an engineer got pissed off and optimized it away: offloading to S3 became something like 10 times faster, and disk was no longer a bottleneck. Funnily enough, I think this is one of the most common contributors we have to incidents. We fix something, and now that we no longer have a bottleneck, we also no longer have a good signal about when to scale and how. The moment we fix that bottleneck, we are in unknown territory. We are only going to find where the next bottleneck and problem is by encountering it in production, often in an incident. Really, the takeaway there is to be ready to change your metrics all the time. Whether they are high-level or low-level ones, they are going to become targets, whether you want it or not. Because if a metric is a good predictor of something, you're going to act on it. These actions have consequences, people are going to work on them, and that's going to ruin the metric, and you'll need to find new ones. That's ok.
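
As a toy illustration of that dynamic (invented names and thresholds, not Honeycomb's real scaling logic), a bottleneck-driven signal often boils down to a rule like this:

```go
package main

import "fmt"

// shouldScaleOut is the kind of rule a disk bottleneck gives you for
// free: while disk fills predictably, crossing a threshold is a good
// signal to add capacity. The 80% threshold here is invented.
func shouldScaleOut(diskUsedFraction float64) bool {
	return diskUsedFraction > 0.80
}

func main() {
	fmt.Println(shouldScaleOut(0.85)) // true: disk is the bottleneck, scale out

	// Once an optimization offloads data to S3, disk usage sits near
	// zero forever. The rule still compiles and still "works", but it
	// never fires again: the indicator is useless, and the next
	// bottleneck (CPU? memory? network?) has no rule at all until an
	// incident in production reveals it.
	fmt.Println(shouldScaleOut(0.05)) // false, from now on
}
```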

Worth pointing out: people will do what they believe is useful. This works in two ways. If you tell people that they can no longer connect to a production host, but they believe they're going to need to do it, they're going to put a backdoor in there, or they're going to put in an admin panel with a password or something. It's a backdoor with nice clothes. It's going to be there because they think they're going to need it. In other cases, you're going to ask them to do things they think are useless, and they're going to half-ass it. They are going to give you action items and incident reviews that are worth nothing, because you're asking for them, and then they can move on and do something else. What this means is that writing a procedure means nothing if people are not actually going to follow it, or will just report that they followed it. The flip side is that if you're able to find a practice that people find useful and interesting, you don't even have to encode it; it's going to spread through the organization and be adopted by people. If you're able to have a procedure that people love and follow, more power to you, but it's not strictly necessary. A related concept for me is that if you have action items from your incident reviews, and you find that they go into a backlog to die forever, a common reflex is: people are not following up. I prefer to see it as: people think this is not useful, and we need to review how we do the incident review and what they get out of the action items. Because if the action items were useful, people would do them and schedule them. There is a value issue in that one. People are not going to take something that's super useful and empowering to them and just do nothing with it because they feel like disobeying. That's a signal that the process is wrong, more than that the people are wrong.

Related to that: a team you don't trust is just random people. The best way to tighten your feedback loop is to act on it as close to the source as possible. Generally, that's going to be the people with boots on the ground; in the case of software, your engineers, who see things happen really early. You can only trust them and cut out the middleman, which is yourself at a higher level making the judgment call, if there is clear alignment on goals and priorities: they need the autonomy and an understanding of what the broader objectives of the organization as a whole are. That's the difference between trust-and-verify, which is, I trust you to make the same decisions I would make, and actual trust, which is, you're going to make your decisions, sometimes they're going to be wrong, but then we're going to talk about it. The feedback has to go both ways. You have to trust them to make the decisions, but the people have to trust you with the information about them, and trust that they're not going to be punished for trying to do their best.

One of the great patterns you have there: has anyone worked in big corporations where consultants come in? The first thing these consultants do is go around and ask, what's the stupidest thing they have you do here? People have complaints that they've been gathering for years, that nobody would listen to, or that they thought were not worth mentioning. They give them to the consultant. The consultant brings them up, and things change. That's the easiest job for a consultant starting out: they just ask for all the problems that people already know exist, and make them visible. If you have these feedback loops in place, it's super crucial to keep them around. If you don't have them, the consultant is an example of going to the periphery to help the system change. If people are not allowed to be wrong, it means they can't do anything new, because doing something new implies there's a chance for mistakes. If people can't be allowed mistakes, they're just going to do the same thing they were doing that is known to work. That's not going to cause a lot of adaptation to take place.

How to Embrace Complexity - Interactions Over Components

Interactions over components. It's hard to change the outcomes without changing the pressures that foster them. I don't know if this one is going to land well in New York City; it works really great in the suburbs. It comes from when I used to try to get rid of all the weeds in my lawn, the dandelions and everything, and I sucked at it. The only thing I had was a shitty lawn full of holes. Until someone explained to me: the problem is not that you have dandelions, it's that your soil is poor, it's dry, it's not healthy, and nothing but dandelions is going to grow in there. Even if you keep pulling the weeds all the time, that's the only thing you're going to get. This is one of the risks you have when you try to just stamp out errors or root causes all the time. You look at the failures, and you try to remove the failures, but you are not necessarily fostering soil in which better decisions get made. The idea is to really look for the positive behaviors you want, and find ways to reinforce them and make them more likely to happen. In terms of reinforcement, there's a big risk sometimes in using carrots and sticks.

When you have good behaviors and you try to encourage them with a carrot and a stick, you are not helping any of these things; you're adding variables to the decision-making. If you're getting a bonus for a good call and a penalty for a bad call, you're adding variables in there. Generally, what's going to happen with a carrot or a stick is that you're going to make the same decision you did before, because you made it to the best of your knowledge, and then you're going to report something different, because now you can get the bonus or avoid the penalty for the thing you would have been doing anyway. Carrots and sticks, in my mind, are generally not a great way to facilitate and foster good decisions. If you want to do it, you have to get back to that psychological safety, really discuss what the challenges are, align on what you think the good decisions would be, and be open to that discussion. Adding incentives doesn't help a whole lot. If you're fostering these good behaviors at a low level, it's really rare that they're going to run afoul of the organization. If you go to the airport bookstore and look at the business management section, all the books are about how to get people to do what you want. If you have a good initiative and you just bring it up, you're usually going to get validation from management extremely rapidly.

In terms of fostering behaviors, really give them room to grow. One of the practices we've put in place at Honeycomb, which I think is one of my biggest successes, is that every week I've put a one-hour slot in people's schedules where we discuss on-call and operations. That sounds like absolute garbage, just more meetings. But in some of these we discuss things like: is it ok to be visibly angry and swear during an incident? What do you do when you don't understand what is going on? How do you deal with the feeling that you're burning out? Do we think code freezes are a good idea or a terrible idea, and when do they apply? What ends up happening is that week to week we have these discussions, and sometimes it's, how do you operate this new service? We share that experience across multiple departments. That ended up being a fixture of what operations are at the company. When we wanted to do incident reviews at some point, we realized the hardest part of an incident review is not the analysis, it's scheduling the meeting. You have six teams, they're all in different time zones, and they all have different schedules; it's impossible. We had that slot on the calendar, and we just said, an incident review is operational: if you have a big incident review to run, put it in that time slot, we're giving it to you. The observation is that over a year, we went from incident reviews that had 5 to 6 people, usually from 1 or 2 teams, to incident reviews with a crowd of 40 to 60 people, and we're about 200 in the organization, I think. It's not that we planned ahead of time to have that; we made room for it. When we saw a need, we leveraged the tools we had to give it more room, and we got good behaviors out of that.

Finally, an indicator is most useful when it is acted on. These are error budgets from SLOs. Usually, we define an SLO: what's the success rate or the error rate we want to have? Invariably, when we pick OKRs, people are going to propose, what if meeting our SLOs is the OKR? My answer to that is, I don't care if we meet the budget or not. I care that we meet them, but I don't necessarily give a shit about having them as an objective. The thing I care about is that if the budget goes down, we have a serious discussion about what we want to do to change that. It's very possible that the budget on the top left was two incidents, and they're considered one-offs. We had issues with the deployment, it was a weird outage, we think we've fixed it; it's pedestrian as an incident. We go, "Fine. On-call." We reset the budget, we move on, we do something else. If it's the one on the top right, that error budget speaks to a gradual degradation over the course of months. Is it possibly new features that are more demanding? Maybe we just got new customers who use the product in a way it's not really able to deal with. Maybe we reached a tipping point in scaling, and we need to do emergency work. It's possible it doesn't fit within the on-call scope. Rather than getting paged every two or three days because the budget is emptying itself, the thing I want us to have, and that we have, is a discussion between the various departments. Product talks to engineering, and we have customer success and support in there. We make the call about whether we need to actually shuffle the roadmap to address this; whether we let it burn, because there's an improvement coming in maybe two weeks, and we just stop alerting; or whether we finally relax the SLO for good, because customers have not been complaining, it's not actually a problem, and there's no use being paged about it; we just need to calibrate better. Then, if we change these targets, how do we communicate that to customers? Having that wide, organization-level discussion is what I actually care about with SLOs. They're a good signal; they're never a target.
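
For anyone who hasn't worked with error budgets, the arithmetic behind those charts is small. A sketch with made-up numbers (the SLO target, window, and counts are illustrative, not Honeycomb's):

```go
package main

import "fmt"

// budgetRemaining returns the fraction of an SLO's error budget left
// in a window. With a 99.9% target, 0.1% of all events may fail before
// the budget is empty.
func budgetRemaining(target float64, total, failed int) float64 {
	allowedFailures := (1 - target) * float64(total)
	return 1 - float64(failed)/allowedFailures
}

func main() {
	// 10M events over a 30-day window at a 99.9% SLO: 10,000 failures allowed.
	// With 6,000 failures so far, 40% of the budget is left.
	fmt.Printf("%.0f%% of budget left\n", 100*budgetRemaining(0.999, 10_000_000, 6_000))
}
```

The number itself is only there to trigger the cross-department discussion described above; it is a signal to investigate, not a score to maximize.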

The two SLOs at the bottom are interesting ones. One is showing something like 10% of the budget left: we're burning almost the entire thing. The other one has 94% of the budget left; it's possible that one is running at five nines right now. I think the budget on the left is better, because we're seeing normal error rates every day, and it will tell us about a deviation slightly better. It's very possible that customers are ok with both of them. We don't really know until we ask the question. The way I calibrate SLOs at work is I go talk to support: is there any issue happening right now, are you hearing complaints that we don't know about? Ninety-nine percent of the time it's a success, because they just tell us, no, every time there's something, you're already looking into it. Recently we had a case where a customer had volume low enough that they never made a dent in the SLO, and performance was absolutely terrible for them. At that point, even though the SLO is not catching it, we know the signal is no longer valid. The complaints are coming through, the problem is the calibration, and we need to change that. Really, the knee-jerk automated reactions are not useful. All of these things together, and a few more, point to us trying to be the feedback loop. The SRE team, the way we've built it, tries to sit at the side of the organization. It crosses the silos and doesn't necessarily belong to anyone. We talk to management and to various engineers. We embed ourselves on multiple teams. The thing we try to do is have these relationships in place to help facilitate escalation: when we have a big incident that doesn't fit anything anymore, we bring people into the fold and help with that. Or in some cases, we just have the information at the ready: four or five projects are shipping at the same time, and if they ship in the wrong order, we're going to have a problem. All of that information that comes in from operations and incidents, the stuff that hits production, is essentially the feedback we want to have, because those are the consequences of decisions made a long time before: how we train, how we set the roadmap, how we schedule stuff, and so on. These are all contributors to incidents. We have a unique view at the edge of operations to bring that back and help transform the kinds of practices we have. I think this is how we are able to take a bit longer before it all goes to hell.




Recorded at:

Apr 16, 2024
