The Endgame of SRE



Amy Tobey discusses sociotechnical thinking, exploring ways SREs can impact reliability at scale.


Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she is senior principal engineer leading Applied Resilience Engineering at Equinix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Tobey: My name is Amy Tobey. I'm a Senior Principal Engineer at Equinix. We're here to talk about the endgame of SRE, which is what happens after you've done all of the things that SRE is supposed to do. We've built the best infrastructure that we can build, and yet, our teams still have reliability issues that they need to figure out. To do that, we need to go explore the sociotechnical space. To enable that exploration, we're going to go into the town of Endgame. We're going to do what you always do at the beginning of a new game, or in the beginning of a new journey through DevOps and SRE, we're going to go talk to the people on the ground and find out what they're feeling.

The Root Cause of Outages, in a Perfect Environment

Right away, we talk to hacker Dave, he's in SRE. He set up the perfect Kubernetes environment for his people. Somehow, they still have outages, and this baffles him. He wants me to figure out why and I don't really do root cause. We'll just go around and talk to people to understand what's going on. Right away, we meet Agi Lecoach, and she is somebody who works with a lot of software teams in this organization. She's a great source of information about what's going on here. Immediately, she starts telling us about the leadership team, who hang out at a place called Mahogany Row. We're going to talk about a lot of sociotechnical factors. That's really what this is about: how all these things come together. One of the first ones that I've noticed is that a lot of people in IC (individual contributor) positions look at leadership as if they should know all things. Really, they're just people. There's a team called Inferno. They deliver, but they have quality problems all the way through the process. We're going to meet Team Marathon. Their velocity is smooth as can be. They make it look easy. They go home at night. I want to learn more about what works for them. Then there's Team Disco. They work really hard, but their velocity is always going up and down. It's hard to predict when they're going to deliver something.

Another tip I like to give software engineers and SREs, and really anybody who's trying to do this DevOps thing is, go make a friend in sales. We've made a friend in sales, Glenn Gary. He's telling us right away, he's having trouble selling the product, because when he demos it, it crashes. That's like the worst case, because if we can't sell the product, we can't pay our bills. He says Team Inferno, they talk a big game. They never shut up about it. Their customers are stuck with us, so they haven't really figured out their quality issues. That's an interesting point. Team Marathon, it looks like again, that they're just taking it easy. I've heard this about a different product, from an actual sales guy, and this is like the best case: when the whole product and engineering machine works together, the sales folks will complain. They'll be like, "I'm basically just an order taker here. I don't have to work to sell the product, people are coming to me being like, just help me buy it." That's like the best case. Then again, Team Disco, they always deliver, but it's never on time. Their releases are always dodgy. They're always busy. That's starting to tell me something. I think I smell what's going on there. Then, Glenn tells us about Mahogany Row and how they're always saying that everything is fine. This is the first concept I want to drop on you all, which is what I call Instagram leadership. This is when you've got a leader and you get all the highlights, the best pictures of their organization, but you don't really hear about the stuff that's happening on the other side of the camera. I start to get curious when that happens, and ask, what are they not talking about? We can't help them as leaders if we don't know what they're struggling with.

We meet Frau Barella. She is a UI engineer, or UX engineer, and is very proud of the style unification effort that her team went through. What she's pointing out to us is that sometimes, like with quality, with reliability, with security, it seems like we have all of the elements. All the pieces are there, but it's not coming together to create a great user experience. Again, this brings us back to the sociotechnical space. This is the story we're going to go through with these teams: that messy liminal space that we call sociotechnical, between the products that we're building, the technical bits, and the social space where all these humans are trying to do stuff, is where these efforts tend to go sideways. With that, let's go meet Team Disco. Right away, we meet Team Disco's manager, Managear Greg. He comes right out of the gate telling me he's doing a thing that we call the crap umbrella. A lot of managers, especially young managers, think that this is a way to help their team, in that they make sure that all the bad news goes around them, so that they don't get hit with the worst of what's coming from the rest of the organization. It's a noble goal, but great people need context to do their best work. That's something I'm probably going to have a further conversation with him about, and do some coaching around. We know where this is coming from because when Managear Greg goes to his director, she tells him, "Don't come to me with problems, come to me with solutions." This is a useful phrase for really great managers, but a lot of middle-ish directors and things will say this. What they're saying is, they don't know what to do. They're looking for that answer. The idea is, as a leader, your job is to help your people succeed. It's ok to ask for a plan, because sometimes your people will be stuck. They don't know that they know the answer, and you lead them through it. Often, it's also a way to bypass responsibility. I think that might be what's going on here.

We meet Polly Math who is all over the place. She's working on kernels. She's doing incidents. She's fixing tests in kernels. She's really doing a lot of things and probably a lot of glue work. This work is probably not being recognized. It might just be a matter of tracking it better. A lot of times, it's more about getting out into the sociotechnical space and making people realize that all these little things that people just take care of are what make our business function. Because it doesn't show up on the books, and it's not revenue facing a lot of times, it doesn't get the credit it deserves. Here's another opportunity for us to engage with this team and the teams around them and help them improve their communications. Next, we meet Isabella, who tells us, we're so far behind, and they keep giving us more work. This brings up the law of stretched systems. The law of stretched systems says, if you add capacity to a system, it's going to get consumed, basically. What we see happening on this team is that they're never really getting their heads above water. Every time they find some capacity, it just gets hoovered up, and everybody is still complaining.

Next, we meet the tech lead, who says we love our work. Sounds like the team likes each other. Again, we know they're having trouble with their velocity. They ran an experiment. They said, let's split the work up, and as individuals, we can go faster as individuals than as a team. Let's split it up, so we can go faster. Great idea, except as Polly Math predicted, business is just going to keep asking for more if you do that. Now your team is disconnected and people aren't up to date on each other's projects so it's harder to communicate within the team. Also, you haven't really gotten out of the hole. I don't think this technique is really going to get them where they want to go. Probably give them advice to get back into that team sharing mode, if only to prevent burnout from people being too responsible for a single product. Cody is the senior engineer on the team. He brings up probably the most common, bar none, complaint that I've heard from various engineering teams, product teams. It's usually like one saying one side or the other, is, they never give us enough detail in the product specs for us to do our job and to build the right stuff.

Product people often say the opposite. They're like, we give them everything and they complain about it and we don't know what they actually need because it's in the technical space. There's always opportunity here when you hear this old song to get involved and start figuring out where those communication breakdowns really are. Because usually once you get these folks together in a more constructive setup, sometimes you need a designer in the room, sometimes an SRE sitting in the room helps. It provides the opportunity for them to start communicating instead of just taking shots at each other over the fence.

This does happen, where product teams tend to want to protect the customer. They own the customer relationship. They're very proud of it. This often could create barriers where engineers aren't able to get face time with customers, and it's harder for them to build empathy and build really great products when that happens. Then, finally, for Team Disco, we meet Devo Pestorius. She tells us another story I've heard before, which is, we have all this great CI/CD stuff, we're doing DevOps. Why are we so slow when all of our vendors told us if we implement this stuff that we will be great at DevOps? Once again, the answer is, it's not your tooling. It's the sociotechnical system you exist within that creates the environment in which your team is performing poorly, so let's go out and adventure more into that environment, and see if there is opportunity to change it and make it better for them.

Raising the Bar Leads to Burnout

Given what we learned about Team Disco, they're a classic case of a churning team. Probably the most common thing I see with software teams is that they're churning. They're having a really hard time making progress, because something keeps pulling the rug out from under them, or their priorities are changing too fast so they never really get things over the finish line. We go meet the Bobs. It's what everybody calls him. Here's why: he's doing something that infuriates me, and I've seen this in the real world. It's that he thought he could get more from the team by raising the bar. A lot of leaders and product managers, and so on, feel this way. They're like, if I put pressure on the team, I will get more. The reality is, and most of us who are software developers and SREs, and so on, know this, that when that happens, it doesn't necessarily get you more, it burns people out. When you see this, it's a good time to bring people back to reality and remind them that velocity points are for your team, and they have no relevance outside of it. If I had my way, product managers would never be able to see story points. I think it's deleterious for them to see them, because points are there for engineers to get better at estimating over time, not necessarily for people outside of the team to make predictions about how much capacity they can squeeze out of it. It's like, whose line is it anyway? The points are made up.

We meet Directrix Tina. She is the director that says, come to me with solutions, not problems, but she doesn't want to talk about that right now, because what she sees is an opportunity. She sees an SRE that knows how to code. She's got teams that are behind. What if I add another person to the team? Maybe, Amy, can you go and write code and help us get this over the finish line? You'd be an asset to the team. A lot of us, especially as you get into staff, principal, SRE roles where you're not directly in the line of revenue, there's a temptation there to do this. Nobody succeeds without a support system. We're the healers. We're the back end of making things happen. I'm going to politely refuse again, for Directrix Tina. Maybe have a talk with her about this whole helping up Managear Greg. She recognized that Disco is struggling, but she really just wants me to talk to Inferno.

The Telephone Game

Another pattern that all of us have probably seen is the telephone game. Pandora goes out and talks to a prospect, and shows them this new dark mode demo to get them excited. They want that before they're going to sign a deal. Then that goes up and down the leadership tree. It's gotten to the CEO. He recognized, we got a lead in front of us, we got to get that. He's like, what if we could theme the whole site? It's CSS, it's so easy. Like this guy knows. He's got the power, so he can make it happen. He went to his guy over in Team Inferno, and they started work on theming the whole site. I can only imagine how this is going to go. Getting back to the telephone game I mentioned, we've already gone from dark mode to theming the whole site, and now we're talking about a color wheel. I don't know where that came from. You need a color wheel to theme the site. I don't think you need it for dark mode. This guy is just trying to do his best. He's trying to get that promotion, and so he wants to help. The intent is great, I suppose. The impact is that now we have these multiple ideas in our organization about what we're building. They're going to conflict and they're going to slow things down.

Pantone Values

We meet the Mistress of Scrum. She's forming a tiger team to get ahead of this problem, and to implement the new Pantone values. We've gone all the way to Pantone values now, and we've covered the telephone game thing enough. What happens a lot of times, especially in command-and-control organizations, is that communications tend to go up and down the chain of command. This is slow. Because of the telephone game, often you have information loss along the way. We know about information loss in analog systems: if your Ethernet cord is too long or frayed, you might drop packets. The same thing happens in human networks, and it causes all kinds of problems that leave us sitting there scratching our heads about, how did they come up with Pantone values? The other thing is the tiger team. It seems like a great idea. We'll go get our best people from all the teams and put them together into a new team so we can knock out this new feature. Except that you completely screwed all those other teams by taking their best people away from them, and those people probably had a role to play in that team. I'm not a huge fan of tiger teams, unless there's like a really existential business thing happening. Even then, we have to keep things moving, we have expectations to keep delivering. That often doesn't get taken into account properly, I think, when tiger teams are formed.


Then, I can't go by without making a pass at water-scrum-fall, which we see a lot in our field. It's inescapable in a lot of ways. These are organizations that misunderstand what scrum is, or agile is, and then they also want to keep holding on to waterfall. What we end up with is they plan all of the sprints out for months, which is completely missing the point of agile. They still have sprints, but they're scheduled ahead of time and the plan never really comes together. We lose so much time and momentum in that. We don't get the chances for the feedback to go back into making the plan continuously better. It's one of those things that I look for, that tells me where to start talking to people and intervening in the sociotechnical realm. Finally, from Mahogany Row, we meet Desi Goner, she's a designer. She was sent up here from Team Inferno to figure out the critical user journeys. As we were talking about, without a strong product process in place, if you go ask 7 leaders what we're supposed to build, you might get 8, 9, or 19.2 answers. That's also a pattern I look for: when we just have these diverging opinions colliding with each other and preventing great communications.

Creating Room for Downtime

We were asked to go take a look at Team Inferno. Let's go see what's up with them. Immediately, I'm getting a vibe here, not just the red, but let's go talk to this guy, he looks very boss like, and indeed, his name is The Boss. It used to be a joke I used to do when I was younger, and then I was working for a manager I really liked. I thought it would be funny to call him boss every once in a while. He stopped me and said, "I don't like it when you call me that. I don't like being a boss, I'm your leader. I'm trying to get you to understand where we're going instead of telling you everything you're supposed to do. I think that's what bosses are and I don't want that vibe." That was a good lesson for me. The other thing going on here with the Boss is right away he's coming at me complaining about the millennials. That's right off the bat a problem. What he's really complaining about is boundaries. There's a little bit of a generational divide there. It's fuzzy. I'm not going to make any claims about it. What we see happening with the quiet quitting debate or the quiet firing debate is people, those of us who are the workers, are setting boundaries. We're saying, I need to have downtime to do my best work to continue doing this job for decades. I can't just work 24/7. The Boss says he turned out ok through all that hustle, but I'm looking through the armor and I'm seeing red eyes. I'm seeing that maybe he didn't turn out ok, after all. To just put a cherry on top of this guy's problems, and I've seen a bit of this in the real world too, people, especially an authoritarian style leader, will jump to being like, "If you guys write an exploit, I'm going to fire you, so don't write any exploits." I don't really understand this line of reasoning, but I try to empathize with these people so I can understand them and fix what they do. The idea is that a threat will keep somebody from writing a vulnerability. Those of us who write code that goes on the internet know that's bullshit. We know that it's your state of mind, your ability to focus, your level of rest that determines the quality of the code that you produce on a regular basis over time. Lots of interventions already needed here. My recommendation would probably be to get a new manager for this team. Let's see what else is going on before we come up with recommendations.

This guy, his name is Tenex. He clearly needs to work on his skincare routine. We've all probably worked with this person. I've been this person, earlier in my career where I liked working a lot. I liked what I was working on. I just never understood why people didn't want to do what I did. It wasn't till I got older and went through some burnout that I really started to understand. This is probably something worth addressing, because the standard he's setting is unreachable for the rest of the team. That's going to create problems. It could create problems with morale. If he's working that much, I've got questions about the quality of his code. The other thing that's happening is we're bypassing the planning process. Any business that's bigger than a few developers, we've really got to have some process to make sure that we can get into throughput mode, and crank out all that code that builds the business that we're supposed to be building. When we keep churning, we end up like Team Disco, where we're not actually making a lot of progress on each thing that we're trying to build, and they all end up disjoint and low quality. Yes, planning is for nerds, but I like being a nerd and I respect nerds.

The Lead Dev is frustrated. When you have somebody who's out doing 100-hour weeks and producing low quality code, and maybe he doesn't have respect for his coworkers, somebody's usually doing that cleanup job. It's usually somebody from a different socioeconomic background who ends up with that work. She's fed up with this. This team is looking at attrition, because who wants to work in an environment where you have to constantly chase this guy around and smell his lack of showers, when there's a wide world of developer shops out there to go work at. When this happens, each time we have attrition, it's easy to go like, Amy moved on. It's not just that person going away, and their capacity to produce. It's also all the knowledge and implicit knowledge they carry about the systems that we've built, that they take with them, and that we no longer have access to. The people with our implicit knowledge are the unit of adaptive capacity. This is how we are able to get through turmoil and changes in the world around us. There's a concern here with attrition. We'll report that and move on. Then, there's this guy. It's interesting when you're trying to do the sociotechnical work, trying to engage with teams and help them get better. There's usually some folks like this. I find them baffling. I have longed to understand the mindset that they can work in, where they're in a mercenary mindset. They're like, "I'm here. I get paid. Not my circus, not my monkeys, just, whatever man, I just do it." I can't do that. I'm like, I got to fix this. I got to fix that. These people are kind of inert. I want to try to help them and they will benefit if we make the team healthier. They're probably not going to help us get there.

Shelly the Intern is also frustrated. She's on a team that's toxic. She's unpaid. That's a whole problem of its own, in that unpaid internships are only really available to people who can afford to not work. She's mentioning that the Boss brought in a new developer metrics tool, and these are very popular right now, to work on the quality problems. Immediately, he berated Shelly for having reviews that took a while. My feeling is that an intern's code reviews, their PRs, should spend a little bit more time in scrutiny, so that the review gets the opportunity to teach them. As an adaptive response to this, she got together with Leah Dev, and they decided just to stamp all their commits and not spend time on actually reviewing them. They'll go off and do discussions on their own outside of the system, just so that the Boss doesn't harp on mean time to review. This is where these metrics tools, especially when used naively, can take your organization extremely sideways. The intent there from software managers is, yes, they got a lot of complexity to manage, they got to understand the health of their team and where it's at, and it's hard. There's so much to look at, and so they reach for these tools. Then immediately metricism kicks in, and the target becomes the goal instead of the outcomes for your customers. Sometimes you can't fight these things, but you can get involved and talk about these unintended consequences of the tools and the numbers. Often, that does have an impact.

Learned Helplessness

Finally, for Team Inferno, we talk to Kirito. He's saying like, "I came from another job. I started here, it's just weird. I'm here, these people are all crazy. I feel powerless." It brings up a common situation, especially in toxic teams, or in larger teams that have toxic elements, is you get this effect I call learned helplessness. What that is, is when you go talk to a team, and they're like, "We have this terrible stats tool. We have to stamp PRs, because if it goes over an hour, we all get in trouble. There's nothing we can do about it." That's learned helplessness. Over time, humans, the thing we do best, is adapt. Adaptation isn't good or bad, or evil or good. It's just what it is. We will adapt to bad situations, and we will adapt to future situations. In learned helplessness, we adapt to that helplessness and it just gets under our skin. When you run into a team that has that, it really does take a leadership intervention to start to break out of that mode of learned helplessness. SREs often have a role in doing that in that we can provide tools, and access, advocacy with the leadership team. Breaking that is usually the first thing that needs to happen to get the team into a healthier place. Because if people don't feel autonomous to go and make their environment better, how can we expect them to produce great products? I don't think it's really reasonable. Once again, we're looking at some attrition. It's pretty common for this to happen on a toxic team, where maybe it slides under the radar for a really long time. Because what's happening is people join, they stay a year, almost exactly, and then they bail. Just the way things work out if you've been through enough interviews you know, there's almost always that chronological reading of your resume. They go, "You were only at that job for six months. Why were you only there for six months?" Most people will try to stay at a job for a year, just to get past that hump, that perception. I think that's what Kirito is doing. He doesn't seem happy. He does see some hope. There is a team that seems like it might be better for him, but they almost never have openings, probably because they don't have high attrition.

A Westrum Generative Culture

Let's go see what Team Marathon is doing. Team Marathon, immediately when we come into it, it's just a very different vibe, especially coming from Inferno. It feels cozy. People seem relaxed. We'll talk to their manager, Nyaanager Evie. What I love right away is she's just open about incidents. I've met teams where they're shy about them. They're like, we don't want to talk about our incidents, because they feel bad. In a really healthy team, I feel, especially in internet live services and stuff, what we want is for them to be really curious about incidents, to talk about them, to learn from them. Because it's one of the most powerful places we have to go and learn about the things we didn't already know about our systems. Already getting a great vibe from Nyaanager Evie, and because of the incident and the drain it put on the team, she told them to take off for the day. It's really nice. When we talked to Mahogany Row, it was pretty clear the leadership team maybe is, let's say, still growing into their roles as leaders. What's happening here is we have a team that's fairly healthy with these other unhealthy teams around them. Then there's Mahogany Row, which is a caricature of all the worst leadership things I could think of. I see these teams all the time in large companies, small companies, where there's that one team that's pretty good, even when there's other teams around that are trash fires. What usually is going on there is that the weight of the organization is putting pressure on that manager. They're the ones holding it back, maybe sometimes with that crap umbrella I mentioned earlier.

I get the feeling that Nyaanager Evie is a little bit more straightforward with her team. There's a real burnout possibility here. There's pressure: Nyaanager Evie wants to grow her career, but when she goes to the Mahogany Row folks and says, I want to grow into senior manager, or director, or whatever, she gets told, "You got to push your team harder, they're making it look too easy." To which I would respond, "We deliver on time. We deliver with high quality, and our customers are happy." That's what's most important, more than how many features we put out. We all know this, but for some reason, we have to say it again and again. In effect, what we see going on with this team is a Westrum generative culture. It is a framework for looking at cultures. It talks about a pathological mode, a bureaucratic mode, and then a generative mode. Generative is that fun space in a team. Think of that team you were on that was like the best, where you were having fun. You were cranking out code. You were delivering features, and it felt good. If you start to look at that through this generative framework, I think a lot of things will start to look very familiar. Boba Jacobian has pointed it out to us, and I think we're all on the hook to go learn about it.

Incidents and Soak Time

Next up, we're going to go meet the Seventh Daughter of Nine. She's really into incidents. She's going to do the incident review. She's only got the gist of it, but she's going to wait till next week to go and interview the team and find out what really happened. This is more of an incident thing. It comes up any time we have a conflict or an adrenaline moment, or the team running really hard: you want to give people some soak time. We talk about this a lot in incidents, but it's applicable across our domains. We give somebody a bunch of information, and some of us are really good at processing that and can immediately work with it. Most folks need a day or two to really internalize it and start to understand, and be able to come up with a story, a narrative to tell us. This is a good sign that she understands the people around her and that they need some processing time. She also mentions that this guy over here, Xela, we're going to talk to him next, was bragging about error budgets, that they used error budget alerts to find out about the incident. They're comparing it to the incident timeline. This is an interesting thing we can do when we have SLOs and great metrics: when we put together our incident timeline, we should be able to line up the dips in our traffic with the timeline. That's cool that they're collaborating and doing that work, while they're waiting for the team to get their thoughts together.
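Editor's illustration: lining up an incident timeline with SLO dips can be sketched in a few lines. This is a minimal sketch, not something shown in the talk. The `Sample`, `Event`, `find_dips`, and `align` names, the per-minute good/total request counts, and the five-minute slack are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    """One metrics sample (hypothetical shape): request counts at a timestamp."""
    ts: datetime
    good: int
    total: int


@dataclass
class Event:
    """One entry from the incident timeline (hypothetical shape)."""
    ts: datetime
    note: str


def find_dips(samples, slo_target):
    """Return (start, end) spans where the success ratio stayed below the target.

    Samples are assumed to be sorted by timestamp.
    """
    dips = []
    start = None
    last_bad = None
    for s in samples:
        ratio = s.good / s.total if s.total else 1.0
        if ratio < slo_target:
            if start is None:
                start = s.ts
            last_bad = s.ts
        elif start is not None:
            dips.append((start, last_bad))
            start = None
    if start is not None:
        dips.append((start, last_bad))
    return dips


def align(events, dips, slack=timedelta(minutes=5)):
    """Pair each timeline event with the dip it falls in or near, else None."""
    paired = []
    for e in events:
        match = None
        for start, end in dips:
            if start - slack <= e.ts <= end + slack:
                match = (start, end)
                break
        paired.append((e, match))
    return paired
```

Events that pair with no dip are worth a second look during the review: either the timeline entry is mistimed, or the SLO is not measuring what customers felt at that moment.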

Error Budget-Based Alerting

Hidaslo Xela, he's a modest guy. Usually, he'll tell you he's the most modest person you know, but he recently set up these SLOs and set up error budget alerting for the team. They're feeling pretty good about how those error budget alerts fired. They found out about the problem well before the customers did. It seems like something pretty elementary to a lot of SREs that we should have really great alerting. The opportunity going forward is that with SLOs becoming more commonplace in our industry, with error budget-based alerting, we can start to clean up a lot of the noise in our alerts and focus on things that impact our customers and not the things that impact our computers. The stuff that impacts the computers matters. But at 4:00 in the morning, when I've been up maybe playing games all night, I stayed up too late, and I get woken up by the pager and it's some stupid disk space alert, and my customers are fine, do I really need to be getting out of bed for that? This is one of the problems that error budget-based alerting tends to solve. Going back to Westrum culture, he's exhibiting something that we like to see in a healthy team, which is, when they discover a new technique that helps them succeed, they're eager to share it with the organization. In a generative culture, the organization would be eager to have it. I don't know that this organization is ready for what Hidaslo Xela has to sell. I think he's on the right track on how to help these teams improve. I'm probably going to buddy up with him. Be like, let's go try to implement SLOs for these other teams. I think especially for Team Disco and that churn situation, the error budget will give them a tool to start to push back on the schedule, and say, our error budget is empty, we got to go spend some time on quality to get this service back on the road.
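Editor's illustration: error budget-based alerting is commonly implemented as a burn-rate check, paging only when the budget is being spent fast enough to matter to customers. The sketch below is not from the talk. The `Window` type and the function names are hypothetical, and the 14.4 threshold with a long and a short window is an assumed default borrowed from common multi-window burn-rate practice, so tune it for your own SLO period.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    good: int
    total: int

    def error_rate(self):
        # No traffic means no observed errors.
        return 0.0 if self.total == 0 else 1 - self.good / self.total


def burn_rate(window, slo_target):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    budget = 1 - slo_target  # allowed error fraction, e.g. 0.001 for 99.9%
    return window.error_rate() / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn fast: the long one shows the problem
    is significant, the short one shows it is still happening."""
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


def budget_remaining(period, slo_target):
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    allowed = (1 - slo_target) * period.total
    if allowed == 0:
        return 0.0
    failed = period.total - period.good
    return 1 - failed / allowed
```

A full disk with healthy request success produces a burn rate of zero, so nobody gets paged at 4:00 in the morning for it. A `budget_remaining` value at or below zero is the "our error budget is empty" signal a team can bring to a scheduling conversation.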

Learning from Incidents, and Minimizing Burnout

We meet Doctor McFire, who says, on-call is a rush, you're typing away and next you're incident commander. We have an on-call who went right from the SLO alert to the traces, immediately figured out that she screwed up and wrote a roll forward PR. What's interesting about this is like when we have a really fast feedback cycle between, we cut the code, we merge the PR and it gets automatically deployed, and we immediately get alerted on it, all of that knowledge is still fresh in her mind from the code that she wrote. Doing a fix for it is super easy, because she doesn't have to sit down and debug and figure out, so what did I do two weeks ago? It's all just right there. That's a really good sign. It's a reference to a friend of ours. Dr. Laura Maguire wrote a paper about incident coordination. It's very good.

Then the last character we'll meet is Courage. He's going to talk to us about something I like to see in individuals, which is, they still have some gas in the tank after work. When we run a team all the way to their capacity, we're running people towards burnout. What we want to do is pull that back a little bit so they have a little capacity left at the end of the day. Because if you run out until you're down at the bottom of the barrel, you're not going to be writing your best code. You're going to have more exploits, more reliability issues, and unhappy customers. This is a really good sign that he's got this zeal to go write Rust. "It's a full moon, the sky is clear, and I love coding."


We visited three software teams. We visited one that is churning, and not really making progress. Another one that's a toxic waste dump of ideas, and things that we don't want to do. We met one team that's functioning really well. In each one, there are interventions we can find when we start to understand the sociotechnical factors, and explain them to the people around us so that we can make these connections, and start to make our teams healthier. I think we will get better code out the other side.



Recorded at:

Jul 27, 2023