


Service Ownership @Slack


Transcript

I'm Holly Allen. Like Jessica said, I've been in software development for 18 years. And one of the great things about software development is you get to take it into all different kinds of domains. I've worked in pharmaceuticals, publishing, government, and entertainment. And now I'm writing just normal business software for Slack. So I've done a lot of different things with computers, and I've learned a lot about businesses and how organizations work.

But I started my career in mechanical engineering. I absolutely loved studying mechanical engineering: thermodynamics, materials science, statics, fabrication. After college, my first job was with a team designing a brand new kind of engine. We were in the design phase, so I drew a lot of mechanical drawings like this one. I sent them out for fabrication, either to our own in-house machine lab or to external fabricators. Then you wait, you get your part, you build up the new version of the engine, and you test it. You analyze your results, redesign the engine, and send out more parts for fabrication. And you do that over and over. I worked there for two and a half years, and the whole time it was a really slow cycle of testing and learning, each cycle months long. And at the end, we didn't even have a product, or even a field-testable prototype. We just had a series of test units that looked a lot like this one, and we were still in the lab; we weren't even in the field.

At the same time, I was writing software to interface with sensors during testing and to do data analysis. And I was also on the team that was writing the fuel and air flow control algorithms. And that software had really progressed much faster than the engine itself. I found it really satisfying to make that software better every day. And it turns out it was a lot more satisfying than trying to build an engine for years on end.

So I made the switch to programming full time, and I haven't looked back. And it's not because I don't love making things, because I really do. And it isn't because I wanted to throw away four years of very expensive specialized study, because I definitely didn't. It was because the mechanical engineering work I was doing was just way too slow. It was slow to learn, slow to get anywhere. Any given test might give you 10 or 20 new ideas, but it would take forever to actually try any of those out and learn anything. So again, we didn't even know if we had product-market fit, just test units in the lab. Whereas with software development, everything felt so fast. You could write some software, test it immediately, and then fix it right away, because of course it wasn't working. With good tests and user testing, you could actually have a lot of confidence that you were building the right thing every single day.

I didn't have the right words for it yet, but what I was feeling was the difference between a fast and a slow cycle around this loop. Design a part, or some code, and experiment, or some sort of business process, and then try it out and measure your results. Learn and try to do it better next time. And the faster your trip around this loop, the better.

Toyota Production System

I'd been exposed to these principles in college when studying the history of manufacturing. The Toyota production system developed after World War II revolutionized car manufacturing, and was the precursor to something called Lean Manufacturing.

Traditional car manufacturing relies on costly stockpiles of parts and finished cars in order to deliver cars in a timely fashion. That's because the link between fabricating or ordering parts, assembling a car, and delivering it was really lossy. They didn't know how long those things would take, so they built up inventory to compensate. They had to make predictions about how many parts they'd need, and then pay to store them all someplace. The Toyota production system designed all of that out, to eliminate waste. An order from a Toyota dealer triggered the production of a new car at the plant. If there weren't orders from the dealers, the plant wouldn't make a new car. So the dealer is pulling a car from the plant to fill an empty spot on their lot, a spot they have because they just sold a car. The plant starts to build a car using parts. When they use up a part, that triggers the fabrication of new parts. Empty spots in the system pull new inventory through.

They track parts with physical cards called Kanban. Those Kanban cards stay with the parts wherever they are throughout the plant. When a part is used up, the card is placed on a physical Kanban board. A card without a part represents an empty slot in the system that needs to be filled. The cards start on the left side of the board and move right as the fabrication steps are completed, until finally, at the end, they're attached to the new part inventory and go back into the plant. The colors red, yellow, and green indicate priority.

So you can control how much inventory you have in the system just by how many of these Kanban cards you have for any given part. A decade or more passed between my tour of a factory, where I saw physical Kanban cards attached to physical inventory, and the first time I used a Kanban board to develop software on an agile team. I remember feeling like my two worlds were colliding as I mapped everything I had learned about lean manufacturing onto software development. Because Kanban for software was so similar to manufacturing, but I'd never seen it before. The cards represent work to do, and the cards still flow from left to right as tasks are completed. And the idea is still to eliminate waste in the form of long waits or unpredictable cycle time for getting any particular piece of work done.
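The pull mechanics described above can be sketched in a few lines of Python. This is purely an illustrative toy model, not anything Toyota or Slack actually runs: the fixed number of Kanban cards is the only inventory control, and consuming a part frees a card, which is the signal to fabricate a replacement.

```python
# Toy model of a Kanban pull system: the card count caps inventory.
# Consuming a part puts its card back on the board, and a card on
# the board is an order to fabricate one replacement part.

class KanbanStation:
    def __init__(self, num_cards):
        self.num_cards = num_cards      # fixed number of physical cards
        self.inventory = num_cards      # start fully stocked
        self.pending_fabrication = 0    # cards waiting on the board

    def consume_part(self):
        """Downstream uses a part; its card goes back on the board."""
        if self.inventory == 0:
            raise RuntimeError("starved: downstream has to wait")
        self.inventory -= 1
        self.pending_fabrication += 1   # free card = pull signal

    def fabricate(self):
        """Fabrication fills one empty slot, reattaching the card."""
        if self.pending_fabrication > 0:
            self.pending_fabrication -= 1
            self.inventory += 1

station = KanbanStation(num_cards=3)
station.consume_part()
station.consume_part()
station.fabricate()
station.fabricate()
station.fabricate()  # no free card left, so nothing extra is built
assert station.inventory == 3  # inventory can never exceed the card count
```

The point of the model is the last line: however fast fabrication runs, inventory is bounded by the number of cards, which is exactly how the card count controls inventory in the plant.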

Kaizen Continuous Improvement

Besides reducing waste, one of the most important concepts in the Toyota production system is Kaizen, which is sometimes translated as continuous improvement. For example, line workers are empowered to change their own physical workstations at any time if they think it's going to improve their workflow and make things more efficient. And work teams can try new ways of organizing the entire plant to make it more efficient, without involving management. Everyone at the factory is empowered to design, measure, and learn in their own work. So empowerment is really key to Kaizen, because otherwise, possible improvements would have to go through layers of approval, or it would be coming from the top down without involving the people who are actually doing the work. So the Toyota production system and Kaizen, and Lean Manufacturing, all look like the lean thinking we talk about here in software. Design, measure, learn. Go around that cycle to continuously improve.

Now, like most of you, I've gone through plenty of agile transformation projects, where you take the agile training and you become a scrum master and you attend thousands upon thousands of daily stand-ups, and you groom your backlogs, even though they're hundreds of tickets long. And I remember, back in the early days of joining an agile shop, having a lot of lively debates about the merits of T-shirt sizing versus the Fibonacci sequence for tickets, or "Should we include bugs in the number of tickets we do in any given sprint? Is that pure or not?" But over time, I realized that almost all of these agile teams seemed to be doing okay, yet we were putting in a lot of effort and not really seeing the benefits.

So organizing the work feels great. Staying in sync with my coworkers, thanks to daily stand-ups, that feels right. And writing code every day was still awesome. Knowing that my code was compiling and passing tests and giving me output every day, that was still really exhilarating. But that's obviously not enough. You need to ship to the users and learn in production, and all that execution has to actually add up to something. Too many agile teams I've been on feel like a racehorse on a treadmill, going really fast nowhere. And so we told ourselves we were doing well. Our metrics were good: we had short lead times, we had accurate sprint estimates, our ship dates didn't slip, and things like that. But we weren't actually delivering any value out into the world, and we didn't know if we were building the right thing. So again, this isn't about the difference between Scrum and Kanban and Lean. Those all pretty much add up to the same thing if you don't get some of these fundamentals right.

And in my personal experience, there are two things that differentiate teams that are delivering the right things fast from teams that aren't. The first is executive dedication to learning. If your highest leaders aren't committed to creating a learning, adaptive organization, one that's fearless in the face of change, then nothing any individual team does is going to succeed on those terms in that environment. It will just get squashed down and told to "just keep succeeding in this little box we've made for you."

And the second thing you need is high trust teams, because a high trust team can really dig into what's working and not working. They can be honest with themselves and ask the hard questions about, are we doing anything right? Should we question our fundamental assumptions? Can we make radical changes to make it better? High trust teams can really push themselves to do better, over and over and over again. So, high trust teams, in the right kind of learning environment, can really go through the cycle very quickly to tremendous benefit. But too often, teams aren't really willing to measure and try something new. It's much more comfortable to avoid conflict and not talk about the bigger issues.

Change is hard, especially when you don't know if that change is going to even work, so, many teams that I was on languished, and didn't ask the hard questions and weren't really learning. But if you don't learn and change, then you're not going to get anywhere.

Slack

Let's talk about Slack. Slack launched in February 2014. That was when it became publicly available and anyone could sign up for a team. Slack got big really fast. At first, our active user count was pretty small. We were supporting teams of 10 to 200 people, roughly the size of Slack itself. Today, we've got 13 million weekly active users, and we're supporting teams that have hundreds of thousands of users in them, which is much bigger than Slack itself. And because Slack is the center of work when you're using it, users keep Slack open all day on their laptops, with active sessions going for 10 or more hours at a time. At the same time, we scaled our cloud presence from about 10 servers at the beginning, as far as I can tell, to over 15,000 servers today, mostly in AWS, and we're in 25 cloud data centers around the world.

Meanwhile, Slack as a company grew from 8 to over 1,200 employees today, and we've got nine offices worldwide. So that is a ton of growth over the course of about four and a half, five years. But the great thing is that Slack lives and breathes this lean thinking and continuous improvement. The executive dedication to learning is really high, because Slack itself is a massive pivot. It used to be a gaming company, and when the game failed to make enough money, the execs said, "Okay, we have this collaboration tool that we wrote for ourselves; it was making writing the game a lot easier. Let's see if maybe we could turn that into something that the market wants." And it turned out that they could.

So from the very beginning, shipping code fast was a priority. And if you saw the keynote this morning, all of this will look very familiar. We have continuous deployment, so that any developer can push their code to production in minutes. We built an experimentation system, so that developers can craft experiments to test new features or possible UI changes, try them with a tiny slice of the production users, and see if people understand them or like them. And instead of releasing major new features without testing them at all with users, we're testing everything as we go and leveraging our user research department to put some real analysis behind it and make sure we're heading in the right direction.
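A common way to build the kind of experiment system described here is deterministic hash-based bucketing (the talk doesn't say what Slack uses internally, so treat this as a generic sketch): hashing the user ID together with the experiment name gives each user a stable bucket per experiment, so a tiny slice of production users sees the new behavior consistently across sessions.

```python
import hashlib

def in_experiment(user_id: str, experiment: str, rollout_pct: float) -> bool:
    """Deterministically assign a user to an experiment bucket.

    Hashing user_id together with the experiment name gives each
    experiment an independent, stable 0-99 bucket per user, so you
    can expose a feature to just a small slice of production users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# The same user always gets the same answer for the same experiment.
assert in_experiment("U123", "new_sidebar", 5.0) == in_experiment("U123", "new_sidebar", 5.0)
# At 100% everyone is in; at 0% nobody is.
assert in_experiment("U123", "new_sidebar", 100.0)
assert not in_experiment("U123", "new_sidebar", 0.0)
```

Because assignment is a pure function of the IDs, no per-user state has to be stored, and ramping an experiment from 1% to 5% only moves users into the experiment, never out of it.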

Who Should Be Responsible for the Management, Monitoring and Operation of a Production Application?

So things are looking great. We're learning, we're growing. High trust, high risk tolerance. But eventually there was at least one thing that didn't scale, and that was a central operations team. So if the question is who should be responsible for the management, monitoring and operation of a production application, there is no single right answer.

A central operations team is one answer, and it was Slack's answer for years. If you saw the keynote this morning, that's what they showed: dev teams separated from the operations team that was running things in production. So we had one team of humans to provision our cloud instances, write the Chef and Terraform, take the pages, and manage the incidents. Dividing the labor between product dev teams and operations teams lets each set of engineers really focus on their specialty area. You've got devs focusing on moving the product forward and helping it scale, and ops engineers focusing on the cloud infrastructure, deployment efficiency, and monitoring. This model works for a lot of companies, and it worked for Slack for a long time. In the early days, most Slack engineers could fix almost anything that was wrong in production. And as Slack grew, ops engineers generally knew who they could contact to get some extra help.

Ops is Getting the Pages

So ops is getting the pages. Ops has all the permissions on AWS and Chef and Terraform. And devs show up when they're contacted, or even just when they see something's wrong. And the system works basically fine. One problem, though, over time, was growth. Product development grew much faster than the operations team. To a certain extent you expect this, but it got a little out of hand. At one point we had at least 20 product developers for every operations engineer. So the operations team literally didn't scale.

How Can Operations Reliably Reach the Developers When There's a Problem?

So then we had a new question: how can operations reliably reach a developer when there's a problem? The answer changed over time, but gradually developer on-call rotations were created. This was slow at first. Only the most ultra-senior developers were in these rotations, and they had to be able to cover almost any part of the infrastructure, whether the problem was in main message flow, notifications, or outgoing email. And they were only for escalations from ops. So ops still tried to solve the problem first, and if they couldn't, then they paged.

So ops gets the first page. They can't fix it, they page the dev on call, and if the dev on call can't fix it, they know who to call. They know the engineer who knows this system. "Let me just look up their number." So devs and ops are feeling a little bit different about all this change. Some devs had never been on call before in their entire careers, and they were not particularly confident in their ability to handle things when a page came and the stakes were really high. Ops was feeling a little more confident, because now they knew there was a PagerDuty escalation they could use to get a developer reliably, instead of hoping that someone would show up, or that the people whose numbers they remembered weren't on vacation. So ops is still getting the first pages, but your ultra-senior devs are on call when needed. Machine pages and customer experience escalations are coming to ops.

So again, Slack is a high-trust learning environment. After a while, as Slack keeps growing and there are more engineers and features and teams, we have to ask ourselves this question again: how can operations reliably reach developers when there's a problem? Because too often, operations would reach out and get a developer who didn't actually know the system that was having a problem. It was just one big dev rotation.

So in fall 2017, most of product development went on call. Seven new pager rotations were created overnight, covering specific parts of Slack's product and infrastructure. The change management here was not great. Ideally, people are included in changes to their work. Remember, with continuous improvement, empowerment is key, and when a big top-down change gets thrown on you, you feel really disempowered. In fact, some devs found out that they were going to go on call because they got a calendar invite for their on-call rotation coming up in a couple of weeks.

So again, ops is feeling pretty good. We've got these targeted rotations, so that if I need a front-end engineer or a search engineer, I'll be able to get them. And devs were initially surprised, but, you know, high trust, so they're willing to help. And because a couple of these rotations had a huge number of people in them, like all the PHP developers, most developers were only on call two or three times a year. Being on call is like anything else: you learn by doing it. For those dozens and then hundreds of developers, it probably felt like a relief to only have the pager a couple of times a year, but if you don't do something very often, you don't get any better at it. It's scary the first couple of times carrying the pager. You don't have any confidence that you're going to do the right thing, or that you're going to have your computer with you in the right way. And you feel all this pressure that you're the one responding in this incident, and the stakes are really high. So ops is still getting the first pages.

Now we've got all the senior devs on call, and we've got seven dev on-call rotations instead of just one. At this point, because of our continuous delivery system, we've got 100 or more production deployments a day, and that continuous deployment system is letting devs push out code within minutes if they want to. So you're an ops engineer, and the pager goes off. You have to figure out what changed, and hopefully fix it yourself, but if not, figure out which of the seven rotations you might be able to page for help.

What Changed?

So luckily, we'd written a couple of tools to make this a little easier for ourselves. All of our stage and production deployments posted into a specific channel called Deploys. It was just Deploy Chicken till the cows come home, every single deploy, with links off to the PRs, all searchable. So it was like, "Okay, it's the middle of the day, something just changed, let me go see what just got deployed. Let me go see if somebody did some late-night work." And we had a bunch of alerts that maybe weren't paging people, but were interesting, and devs thought they might be useful, so they were being piped into channels. Again, you could search on this. You could say, "Okay, I'm having trouble, I'm seeing problems on this particular graph, so let me just search through Slack for that." And up pop those alerts, there in those alerting channels. So you can triangulate and figure out which team to page, narrow down the symptoms, and page the dev.
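Posting every deploy into a channel like this is straightforward to replicate with a Slack incoming webhook. A minimal sketch, where the webhook URL and message format are hypothetical examples rather than Slack's internal tooling:

```python
import json
import urllib.request

# Hypothetical incoming-webhook URL for a #deploys channel.
DEPLOYS_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def format_deploy_message(author: str, sha: str, pr_url: str) -> dict:
    """Build a deploy announcement that links back to the PR,
    keeping the channel history searchable during incidents."""
    return {"text": f"Deployed `{sha[:8]}` by {author} ({pr_url})"}

def announce_deploy(author: str, sha: str, pr_url: str) -> None:
    """POST the message to the incoming webhook for the deploys channel."""
    payload = json.dumps(format_deploy_message(author, sha, pr_url)).encode()
    req = urllib.request.Request(
        DEPLOYS_WEBHOOK,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# The payload shape, without making a network call:
msg = format_deploy_message("holly", "a1b2c3d4e5f6", "https://github.com/org/repo/pull/42")
assert "a1b2c3d4" in msg["text"]
```

The value isn't the posting itself; it's that every deploy, with its PR link, lands in one searchable place that a responder can scan when the pager goes off.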

And again, even with seven rotations, chances are you're going to get a dev who's like, "I don't actually understand this part of the code. I work on threads, and this thing seems to be about setting your do-not-disturb setting. But I know JavaScript, so I'll dig in." Individual devs sometimes felt like failures. Meanwhile, the operators are vastly outnumbered by the engineers, and don't really have the time to deeply understand everything that's going on in Slack. So they're like, "Well, I've got these machine alerts here and I can show you some host metrics, but maybe you could just fix the problem."

Human Routers

So ops has to keep a really detailed system understanding in their heads, and know at least which engineers work on different parts of the system, so that they can be human routers during an incident and say, "Well, let's just call this guy. He seems to know how this stuff works, and we're 20 minutes into this now and we really need to fix it."

Now, we're a learning organization, and one of the best ways to learn is in a postmortem, the post-incident meeting. A good postmortem meeting is a chance for everyone to learn even more about all the complexities and nuances of the system, how different subsystems interact, and how they behave at scale in production. But we were starting to get a little overwhelmed. Because the postmortems were being run by the ops engineers, who were the first responders and to whom the job fell at that time, they often didn't have time to do the kind of in-depth preparation that leads to a really healthy postmortem. For the really big events, definitely. But for the medium-size and smaller stuff, the effort often wasn't going in.

And so postmortems became a little perfunctory. You know, go through the timeline. "What could we have done better? Fix that thing, great, put it on the backlog." Never get to it. Action items. "Okay, yes, we put in an alert for that, great." A bunch of JIRA tickets. So it was fine, and the people who were involved in the incident would show up, but we didn't have a lot of extra people showing up to learn from these postmortems.

Can We Catch Problems Earlier?

So we're always asking ourselves, "How can we catch problems earlier?" How can we catch a problem before it becomes really big? How can we catch it before a customer notices it? So we're always iterating on alerting and testing. This is a good one: automated error filtering and pattern detection. You don't have to get that algorithm super great. You just have to get it good enough that it's providing you some information, so people don't have to sift through every single log looking for stuff. You get a good-enough algorithm to do it for you, and then surface the results in a channel for you to talk about and see what's going wrong.
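A "good enough" version of that pattern detection can be as simple as normalizing away the variable parts of each log line and counting what's left. This is a rough sketch of the general technique, not Slack's actual implementation:

```python
import re
from collections import Counter

def normalize(line: str) -> str:
    """Collapse the variable bits (hex ids, numbers) so that log
    lines produced by the same code path look identical."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def top_error_patterns(lines, n=3):
    """Group error lines by normalized pattern and surface the most
    frequent ones, e.g. into a channel, so nobody has to sift
    through every single log line by hand."""
    counts = Counter(normalize(line) for line in lines)
    return counts.most_common(n)

logs = [
    "timeout talking to shard 12 after 5000ms",
    "timeout talking to shard 47 after 5000ms",
    "timeout talking to shard 3 after 4500ms",
    "user 881 not found",
]
patterns = top_error_patterns(logs)
assert patterns[0] == ("timeout talking to shard <NUM> after <NUM>ms", 3)
```

The regexes don't have to be clever; as long as the same underlying error collapses to the same string most of the time, the counts are informative enough to surface in a channel.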

This is one of my favorite integrations: custom unfurling. Either in an incident or just when you're talking about what's going on in the system and how to make it better, you're always linking off to your own dashboards, or your Logstash, or your data warehouse. Normally you have to click on that link, go into a browser, open it up, figure out what's going on, come back, and comment on it. We created some custom unfurls, so that for the things we care about most, even though they're internal systems, Slack will actually open them up inline. This is something you could do too. It's super handy in the incidents I've been in to have graphs like this one, and in calmer analysis, things like this data warehouse query, so you don't have to jump off to the data warehouse to check out the work. You can pretty much see what the person is trying to tell you just from the little snippet that gets put into Slack.
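Custom unfurls like these are built on Slack's public API: an app subscribes to the `link_shared` event for its registered domains and responds by calling the `chat.unfurl` Web API method with a map from each URL to the preview to render. A sketch of building that map, where the dashboard domain and preview fields are hypothetical:

```python
def build_unfurls(event: dict) -> dict:
    """For each internal dashboard link in a `link_shared` event,
    build the preview Slack should render inline. A real app would
    pass this map to the `chat.unfurl` Web API method along with
    the event's channel and message_ts."""
    unfurls = {}
    for link in event["links"]:
        url = link["url"]
        # Hypothetical internal dashboard domain registered with the app.
        if "dashboards.internal.example.com" in url:
            unfurls[url] = {
                "title": "Latency dashboard",
                "text": "p99 latency, last 6 hours",
            }
    return unfurls

event = {"links": [{"url": "https://dashboards.internal.example.com/api-latency"}]}
unfurls = build_unfurls(event)
assert "https://dashboards.internal.example.com/api-latency" in unfurls
```

A production version would fetch a rendered snapshot of the graph or a preview of the query results to put in the unfurl body, which is what makes the inline snippet useful during an incident.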

So all this custom work helps us be more efficient. It helps us get by and adapt, even though ops is vastly outnumbered by the product engineers. Things aren't really getting better at the rate they need to from a human perspective, but we're still able to get by, we're still able to fix things pretty fast, we're still able to keep working. We're investing in the tech, we're detecting and remediating faster, but we're still human routers.

How Can Slack Ensure That Developers Know When There’s a Problem?

Fall 2017, about a year ago: operations got new leadership, which resulted in a big reorg and a big mission change. They renamed the group from Operations to Service Engineering. That doesn't sound like a big deal, but we're thinking about making T-shirts, because a year later people are still calling us ops, and it's like, "We're not ops. There is no ops." So we asked ourselves a different question this time: "How can Slack ensure that developers know when there's a problem?" Not "How can operators reach developers?" but how can we make sure the developers know there's a problem? We decided that a centralized operations team wasn't the answer to this, and that service ownership was an idea we wanted to start embracing. Service ownership is the idea that the development team that writes the code is the best group to own and operate the code, right down to getting the machine pages and doing the incident response. Of course, this is a radical departure from Slack's past.

All of this kicked off a really intense cycle of learning and change in the org, which is not particularly comfortable. The department doubled in size in about six or eight months, so there were lots of new faces, including mine, talking about all these new ideas. But again, Slack is a really high-trust environment. So everyone's sort of like, "You're making me really uncomfortable here, but let's talk about it." Let's talk about what outcomes we want. And we started collaborating to test these new ideas.

So again, service ownership is this idea that service engineering is going to focus on tools and being specialists, offering guidance and expertise, and slowly pushing responsibility back towards the dev teams. I say "back," but I really mean for the first time. So now we've got this split, where developers are still doing features, but they're increasingly responsible for reliability and performance, and we tossed postmortems over there while we were at it. And service engineering is providing the cloud platform that you build your services on, observability tools, service discovery, and definitions of best practice for how to do all of this stuff, because we've been doing it for a long time.

How to Empower Development Teams to Improve Service Reliability?

So in February of this year, I joined Slack. There was already a lot of change going on. And the question we were asking ourselves at that moment was: if we're going to push all this new responsibility over to the development teams, how do we empower those teams to really own it and improve their service reliability? Because without a centralized operations group, how do you provide that support, when each of your service engineering teams is actually producing its own products?

So one of the things we did is define what service health and operational maturity mean. And just to be clear, this is all very much still in flux at Slack right now; it's something we're actively working on. But here are some of the ways we define it. A healthy service with good operational maturity has at least one alerting health metric. If you've got an API, maybe that's latency. If you've got a job worker, then maybe it's throughput. You have to think about the software you're writing: what does it do? What does it do for the user? The user might be another subsystem, but ultimately, what is it trying to accomplish, and how would you decide whether it's doing that well? And alerting is in there because we want the metric to actually go somewhere a human will see it: either a channel or, if necessary, a pager.

Some teams don't have very good metrics; they don't have a lot of great data coming out anywhere. So our observability team is there to help and consult, to help you get your metrics into our Prometheus cluster. Then you can work with the alerting team to think about what kind of thresholds you might want: if latency goes up over a certain amount for a certain period of time, or if your system throughput drops, again for a certain period of time, how would you tune that? They work alongside those dev teams as they twiddle the knobs, so that the teams can do it themselves in the future, while they've got experts sitting next to them. That's the theory.
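The "over a threshold for a certain period of time" logic the alerting team helps tune is the classic sustained-threshold alert; in a Prometheus setup it would normally live in an alerting rule with a `for:` duration rather than in application code. A simplified Python sketch of the idea:

```python
def should_page(samples, threshold, sustain_seconds):
    """Fire only when the metric has stayed over the threshold for
    the whole sustain period, so a single latency spike doesn't
    page anyone at 4:00 in the morning.

    `samples` is a list of (unix_timestamp, value), oldest first.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= sustain_seconds:
                return True
        else:
            breach_start = None  # any healthy sample resets the clock
    return False

# A 10-second spike doesn't page; five minutes over threshold does.
spike = [(0, 900), (10, 120), (20, 110)]
sustained = [(t, 600) for t in range(0, 301, 60)]
assert not should_page(spike, threshold=500, sustain_seconds=300)
assert should_page(sustained, threshold=500, sustain_seconds=300)
```

Tuning comes down to the two knobs in this sketch: the threshold controls how bad is bad, and the sustain window controls how long it has to stay bad before a human gets involved.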

Another way we define service health and maturity is that the team should at least be on-call ready. Even if they're not on call right now, they need to be thinking about what it means to be ready. The way we define this is that the bare minimum number of people you need for a pager rotation is four, but ideally you've got at least six, or maybe more, maybe 10, for a good sustainable on-call rotation. Then it depends on the team and what they do, but it might be a 24/7 rotation, or it might be just during the workday. For example, if you're on a build-and-release team, maybe you mostly have to cover it when most devs are in the building, with a ripcord like "page the manager" if something's wrong on the weekend. It really depends on the service. But we want teams to actually have that conversation and think about what kind of service they provide to the rest of the org through this on-call rotation.

Here's another one: runbooks. Does anyone know what runbooks are? Runbooks are your standard actions and your troubleshooting steps. We put all those runbooks in a central location; we store them as Markdown files in our GitHub repositories. All the runbooks are in one place, so you can find everything really easily. If you've got everything checked out on your laptop, for example, you can just search through it with grep and find the runbook you're looking for. And those runbooks have to be up to date and usable by, well, "any engineer" is a little flexible, but any engineer who has a reasonable understanding of that system should be able to follow them. They should be really clear and step-by-step, because in an incident, you don't want to have to think about what's not specified in the runbook. You need to just be able to do it, because maybe it's 4:00 in the morning.

Those paging alerts should ideally link to those runbooks, to make responding to a page much easier. The other thing we do is practice incident response, because again, some teams have never been on call before. Our very first baby step into this is an incident lunch every two or three weeks. It's a low-stakes event, because we're obviously not messing with production, and it's not about computers. But it's kind of high stakes; I mean, you want to eat lunch. We get a group of people together, we talk about what an incident commander is versus the people who are responding, we break people up into teams, and we have them use the incident response process we use at Slack to arrange how to get lunch. Some of our folks have created little cards that are a bit like random production events that could happen: "Oh no, it's somebody's birthday. You have to get a cake." Or, "Uh-oh, Slack is down. You can't use Slack to coordinate about lunch." It's open to anyone in the company, so anyone can come and practice and learn a little about what responding in an incident might look like, in terms of the weird words we use or the way you have to communicate a little differently.

So what about those high stakes teams? You know, the ones where we really want to invest in some dedicated support, because most likely they just have such a pivotal spot in the running of Slack that you want to make sure that you really get it right. You don't want to just provide a little bit of consulting on the side. So we decided to create an SRE team. Site reliability means a lot of different things to a lot of different people, but at Slack, site reliability engineering means DevOps generalists who have really high emotional intelligence and really good mentoring capability. Because we embed those skilled practitioners into other teams as ambassadors of this way of working. Their goal is to go into those teams, become members of the teams, and increase the operational maturity of those teams.

So, SREs embedded into dev teams, with the goal, again, of operational maturity, but overall, of raising service reliability. The great thing is that this was a totally grassroots effort. The SREs themselves decided that this was the best path forward. So then management just kind of swarmed around and provided support and cleared some roadblocks, but we're not the owners of the overall vision for success. SREs still report back into service engineering, but they're also members of that dev team. They attend the team meetings, sit with the teams in the office, and they deliver work against those team goals. That was the theory, but at first they had a little problem.

How Do We Lower Operations Burden on the SREs?

There was still a lot of operational work to do. Now it was SREs getting those first pages, sometimes dozens a week. And with Slack growing all the time still, that meant a growing number of servers and subsystems, and still over 100 production deployments a day. So we learned pretty fast that embedding SRE wasn't going to work for as long as they were still the first ones getting paged. It was just too much work. So we asked ourselves, "How do we lower the operational burden on the SREs, so they can contribute in this embedded model?"

So the SREs came up with a plan. They wanted to categorize and reroute all of these existing low-level machine pages to the right teams, so that those teams could act on them. And with all those pages going to the dev teams, rather than this nonexistent centralized operations team, then SRE would have the time and energy to dedicate to their embedded teams. So again, the devs are generally on board, they're like, "Okay, we don't know what these look like, but, sure. Will you train us? That would be great. And maybe you could put some guardrails in place so we don't accidentally take Slack down because we hit the wrong button?" They kind of saw the work that SREs were doing, and they were like, "We don't really know what you're doing over there. So, you know, protect us from ourselves." And so meanwhile, SREs are categorizing the alerts and saying, "these will go over to the Calls team, and these will go over to the Flannel team, and these will go over to the web app team." And they start to plan some really intensive trainings in things like AWS and Chef and Terraform for the devs. And it's like, "Okay, we're going to sit them down for three days, and we're just going to work through all these examples. We're going to give them all the permissions, and at the end they're going to be great."
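The categorize-and-reroute step can be pictured as a simple routing table from alert name to owning team's pager rotation. The prefixes and rotation names below are made up for illustration, not Slack's real alert names:

```python
# Hypothetical mapping from alert-name prefixes to owning teams' pager rotations.
ROUTES = {
    "calls_": "calls-oncall",
    "flannel_": "flannel-oncall",
    "webapp_": "webapp-oncall",
}

# Anything not yet categorized still pages SRE as a safety net.
DEFAULT_ROUTE = "sre-oncall"

def route_alert(alert_name: str) -> str:
    """Pick the pager rotation that should receive this alert."""
    for prefix, team in ROUTES.items():
        if alert_name.startswith(prefix):
            return team
    return DEFAULT_ROUTE
```

The interesting property is the default route: every alert is still delivered somewhere while categorization is in progress, so the migration can happen incrementally.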

And nothing changed. Because the SREs were trying to make progress on all this training planning and guardrails, but, again, they just had so much work to do that it was pretty slow going. And the devs, meanwhile, were starting to work on some of their service health checks, and sending them to channels, and tuning and tweaking, and trying to get them just right. Because you don't want to get paged if there's nothing wrong, so let's get them right. So everyone was working really hard, and everyone was really highly aligned, but we're aiming for perfection, and we're going nowhere.

SREs spent hours every week looking at the previous week's pages. Low CPU, out of memory, out of disk; everything was at the host level. And these same alerts had been paging ops for years, and they were a major component of the entire alerting strategy. And so there was a lot of fear about turning them off. It's like, "Well, how will we know there's even a problem if we turn these off? I mean, we don't have anything else." Ops had spent a long time looking down at the hosts and felt like that was really the only way to do it. So, finally we got to a point where we were comfortable showing it to the teams, and it's like, "Okay, well, we're not ready to train you quite yet, we're not fully categorized. But here are some examples of some things you might be paged on. What do you think, dev team?"

And the dev teams were pretty unanimous. They looked at those and they were like, "Those alerts are completely useless. What are you even getting those for?" And it was sort of like coming out of a fog. And we realized that there was a completely different path forward. No one should care about host health. If you're in the cloud and a host is bad, you just kill it and provision a new one. And you focus on automating that, because it doesn't always work, depending on how the service has been set up. But you work on that, you work with the teams to ensure that all of that can work well, and you just kill those host-level alerts.
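That "kill it and provision a new one" loop might look something like the sketch below. The `terminate` and `provision` hooks are stand-ins for whatever your cloud provider's API offers (they're injected here so the logic stays testable); this is my own sketch of the idea, not Slack's actual remediation code:

```python
from typing import Callable

def remediate_unhealthy_hosts(
    hosts: list[dict],
    terminate: Callable[[str], None],
    provision: Callable[[], str],
) -> list[str]:
    """Instead of paging a human about a sick host, terminate it
    and bring up a replacement. Returns the IDs of the new hosts."""
    replacements = []
    for host in hosts:
        if not host["healthy"]:
            terminate(host["id"])
            replacements.append(provision())
    return replacements
```

The point of the talk's argument is that once this loop exists, the host-level page that used to wake someone up becomes an automated event, and the alerting can move up to service-level signals instead.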

And honestly, we still waited, because we wanted to do the automation first, you see. We wanted to get that right before we turned off these alerts, because what if something happened? We knew that automating everything and setting everything up just right was possible. And so again, we were aiming for perfection. The breakthrough really came back in early September. Straight from the top of engineering leadership, there was a push for reliability and fast incident response, and that was the most important thing. And I don't know if you've ever been in a company where there's that kind of moment of clarity, where everyone kind of comes together and feels really harmonized, and there's super intense alignment throughout this, you know, 1,200-person organization. We had that moment. And I was not going to waste that, so we swallowed our fears and took the plunge.

One afternoon, we took off all those old low-level host alerts. And then we walked over to the desks of those dev teams that had been working on those nice finely tuned health alerts, and said, "Pick one or two of those and go on call for those today." And they did. Because they were like, "Okay, yes, we're doing it." They'd been trying to perfect these for months, so it wasn't a complete surprise. But it's not like we had some sort of big plan. We just decided, that day. And nothing broke. And everything was fine. Well, maybe one or two things broke. Mostly everything was fine. So again, the change management for this was not perfect. There was no comms plan, we didn't ask the managers first, we just walked up to people and did it.

But this time, all the people affected had been working towards something like this for months. And we'd been talking about it for a long time. And they had tuned these alerts themselves, so they really understood them and why they might be a little bit noisy in the wrong ways. And they were empowered to continue to improve those alerts and their own on-call process, because they more fully owned those systems now. And since that day, the SREs have had more time to help their teams. And I've seen those teams continue to iterate and improve all of this alerting and monitoring and automation. Some of the provisioning is getting better. They're also working to decouple some services from each other to reduce cascading failures. And one team's moved away from a pet approach to their services towards a more automated approach, so they're not hand-naming their servers anymore, for example. So they continue to dramatically improve the resilience of their services.

How Do We Test Our Understanding of How Slack Will Fail?

Okay, one more story, real fast. New question. How do we test our understanding of how Slack will fail? We're not ready for chaos engineering yet. So we created a program called Disasterpiece Theater, where we brought groups of engineers together to intentionally bring parts of Slack down. And the goal was learning: validating our understanding of how the alerting and the automation would work to make that a seamless event, so that developers wouldn't notice a thing. This has been a really fantastic design-measure-learn loop. The scenarios are always really carefully planned and pre-tested in non-prod environments. The right groups of people are brought together, experts and hopefully as many non-experts and learners as possible, and you test your hypothesis, and you reflect, and you document. And you re-hypothesize. And learning is the goal.
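One way to keep that plan-test-reflect loop honest is to make each exercise an explicit record with a hypothesis and a pre-production gate. The dataclass below is my own sketch of such a record, not Slack's actual Disasterpiece Theater tooling:

```python
from dataclasses import dataclass, field

@dataclass
class Exercise:
    scenario: str                  # e.g. "kill one replica of the search service"
    hypothesis: str                # what we expect users (not) to notice
    tested_in_staging: bool = False
    observations: list[str] = field(default_factory=list)

    def ready_for_prod(self) -> bool:
        # Scenarios are carefully planned and pre-tested before prod.
        return self.tested_in_staging and bool(self.hypothesis)
```

Writing the hypothesis down before the exercise is what turns "we broke something and watched" into "we learned whether our mental model was right."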

The whole goal is, did we learn anything in here today? Did we learn that the user experience was a little bit different than we expected? Or, "Hey, that worked just like we hoped. But the metrics were a little bit weird." Or, "Oh, I didn't know you were looking at that graph. Let me go over there and look at that and maybe add it to my dashboard too." It also provides an opportunity to practice incident response, again, in a very low stakes environment. We can actually get better at that part of it. I'm hoping to add more rigor around the incident response practice part of Disasterpiece Theater.

So we're not done yet. We still have a ways to go. Here's just rapid fire, a couple of questions we're still asking ourselves. How do we ensure the teams are being alerted, instead of skill sets? We've got these giant pager rotations of all the front end engineers, for example. And that's just not going to work long term. But we need to include these folks in the solution to this, not just do it to them.

How do we make postmortems a place for learning again? There is a lot to do here, and I can't talk about everything because I'm running out of time. But we know a couple of things we need to do. We just need to find the organizational will to do it. One of them is training: training postmortem facilitators to run the meeting with a learning bent in mind.

And then finally, how do we make sure a capable incident commander is available for all incidents? This is a tricky one, because this still falls mostly on service engineering. But that's not scaling, like I mentioned before. So how do we incentivize and train people up so that we've got incident commanders all throughout Slack? So that if a problem is happening over in some dev team, they have an incident commander right there with them, who can run that incident until it becomes something really big that needs a scaled response.

So there's a lot to do. Slack is a high-trust learning organization. We can talk about what's not working, and we can suggest radical changes to make things better. And I know we'll keep making progress because of that. One last thought. The Toyota Production System has a whole philosophy behind it. The entire management philosophy is designed around reducing waste. Low inventory levels are a really visible outcome of that work. But everything else about the way that Toyota works and does its fabrication has to be designed and empowered throughout the organization in order to actually make that happen. And a lot of other car manufacturers saw those low inventory levels and were like, "Well, that's great. We should do that." And they drove down inventory levels. But it didn't work. Their supply chains just busted, and they didn't actually get any of the good results that Toyota did, because they didn't understand the philosophy.

So imitating another company without understanding the underlying concepts doesn't work. What works for Toyota, or Slack, or me, or you, won't necessarily work for anybody else. Just like following the scrum process perfectly, to the T, doesn't actually get you any good results. So, one way to think about it is to copy the questions and not the answers. You need to know what you're trying to accomplish, and be willing to learn and try over and over. So I'm not telling you that scrum doesn't work; I'm just saying it didn't work for me. And I'm not telling you that a centralized operations team doesn't work; it just doesn't work for Slack anymore.

So think about the questions you want to ask and not copying any given answer. So I want you to believe that change is possible in your organization today. No matter how entrenched and very, you know, "this place will never change" it feels. You don't have to be ready. You don't have to have the perfect plan to make things change. As long as you have the support of leadership and are yourself empowered to make change, and if you commit yourself to continuous improvement, then progress is completely inevitable. So, go around this loop as rapidly as possible. Design thoughtfully, measure ruthlessly and learn faster.


Recorded at: Jan 04, 2019
