
InfoQ Roundtable: Embracing Production: Make Yourself at Home


Summary

The panelists discuss operating distributed systems in production, how they embrace production, and ways to make it easier for others to onboard and keep the system up and running.

Bio

Shelby Spees, Michelle Brush, Kolton Andrus, Haley Tucker, Glen Mailer, Luke Demi, Aysylu Greenberg (moderator)

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Greenberg: Happy to have you join the panel, "Embracing Production: Make Yourself at Home." It's been a pretty interesting topic. In general, it's something that I've focused on throughout my career. I'm very happy to welcome our panelists: Shelby Spees, Developer Advocate at Honeycomb. Michelle Brush, who is an Engineering Manager for SWE-SRE at Google. Kolton Andrus, who is co-founder and CEO of Gremlin. Haley Tucker, Senior Software Engineer, resilience team at Netflix. Glen Mailer, Senior Staff Engineer at CircleCI. Luke Demi, Software Engineer at Clubhouse.

First Experiences When Onboarding to Carry a Pager

Could you please describe your first experiences when you're onboarding to carry a pager? What worked well? What didn't? Anything that is noteworthy.

Brush: I think the thing that I had to learn the hard way when I first started taking the pager for production, and I've seen a number of the folks that I work with struggling as well, is this fear that you'll have this outage and it'll be so unique that you won't know what to do, it won't match your training. The thing to get used to when you're first going on call is that you will not know the answer immediately, and that's ok. The people that look like they know the answers, they've been on call for a lot of things, and they've tried a lot of things. They don't know what to do, either, they just know what to try. The thing you need to have your mentors, your coaches, your manager, your team training kind of help you understand is, how do you observe, and then form a hypothesis? Then, what can you try to test that hypothesis, and just do that over again?

Greenberg: How about you Kolton, what were your experiences like?

Andrus: I was thinking back about this recently. Before I joined Amazon, I worked for a couple of startups. That was really my first on-call. I don't think there was a PagerDuty. There was some system that alerted us. The on-call training was, you've been here a while, so we need you this weekend. Somebody is on vacation. Hope you're ready. Sure enough, I got paged, and we had a handful of hosts. I was SSHing into those hosts. It was a lot like Michelle said, what are the things to try? As a more junior engineer my list of things to try was shorter. I went through a couple of logs, a couple of this, a couple of that. I believe the first two or three times I had to get my VP or my manager on the call to help me debug it. It's this hard balance of you want to fix things, you want to do a good job, you want to help out, but it's scary because you're really trying to learn on the fly. What the nice thing was, as I reflect on it, getting those folks involved didn't feel like a failure. They had a hard time. They struggled. I was able to contribute and point out some interesting details that helped us resolve the issues. We all collaborated and commiserated in the morning, a little tired and a little bit groggy from the incident.

Greenberg: Haley, could you please describe your experiences? I know that you've been focused a lot on the dev side and the Ops side. We'd love to hear from you.

Tucker: If I go all the way back to when I first started carrying a pager, I think the roles that I was in were more like consulting roles, where I had a client and we were working with them on their billing system. When I would get paged, it's like, yes, it's important. It's critical to the business. We've got to fix this thing. A lot of it was, ok, I can go fix it, and then I can regenerate all their bills. It wasn't like you had a bunch of customers that were complaining on Twitter and people breathing down your neck. It was like I had some time and space to figure it out and breathe and to reprocess. Then when I joined Netflix, that was probably the biggest change for me: just how quickly it went from, there's this little thing here, to like, all these systems are falling over now, what happened?

I think joining the incident channel for the first time, the hardest thing for me to get used to was, there's all these people asking you all these questions, and all these people sharing, could it be this? It's very overwhelming. You're like, what do I look at first? How do I communicate? Because I'm trying to absorb and process and come up with something useful, but then there's also all these signals coming to you. To reiterate some things that Michelle and Kolton mentioned, you just have to be ok with that. You have to over-communicate, say, "I'll get to that in a second. Here are some things I'm looking at." Make sure that people know what's going on and can help you and contribute. It's going to take a whole group of people to figure it out, and that has to be ok.

Greenberg: Luke, do you want to share your experiences?

Demi: Sure, like everyone else, I think I started on-calls at companies where it didn't matter as much. It had to do with just being more junior. For me, I think what I really enjoyed about being on-call is partially that feeling of ownership and the satisfaction of being responsible for things. Also, I think, to Kolton's point, is this camaraderie and this adrenaline. This feeling of like being in the trenches. I just remember really being proud to be on-call and excited to be on-call. I remember studying up to understand certain systems and just viewing it as like a rite of passage. For me, those early on-call periods were very sweet compared to the later on-call periods, where you are responsible for something that people are going to bad mouth you on Twitter for, for being down.

Greenberg: Shelby, would love to hear from you now.

Spees: I think my first on-call experience was interesting, because a year or so before I joined the company, they reconfigured the paging system so that all the Ops pages went to the DevOps team, and everything else went to the individual platform engineers who owned each of their three main services. It was a really small team. It's a very small company. My team would basically only get paged if AWS was down. This was something that, if I had stayed longer, I probably would have gone in and made it a little bit more balanced. What ended up happening is our handful of platform engineers, who basically were the ones who built the system, would get paged for the things in their silos and their little islands of knowledge, and that knowledge didn't really transfer. I would try and hop on if there was a daytime incident and help debug. It was just so outside my knowledge and wheelhouse, and it was very frustrating. I wanted to help more. I think probably I could have done more to learn about it, and pull teeth to get that information out of people. It was very much the dev side and the Ops side, and the couple of times that something was our fault, it was catastrophically bad, but it was almost never our fault on the DevOps side. It was a really interesting experience, and I definitely learned a lot just eavesdropping on incident calls. I feel like the knowledge transfer that you'd like to have in incident response wasn't exactly happening.

Greenberg: Glen, I know your experiences from us talking before, are somewhat similar to what others have mentioned on this panel. Please share your experiences.

Mailer: This was quite a while ago now, so I'm trying to remember it all. I remember one of the architects of the system used to run this session where he basically picked up a pen and drew on a whiteboard and explained to you how production works. At the end of that it was like, "Here are your keys to production, don't break anything." Then I remember the on-call rota, it was actually you got a phone. There was a phone which was the on-call phone that we would literally physically hand to the person who was on-call next week. We were in charge of the application software. The company at the time was a gambling company. That gambling risk metaphor I think is very useful. The other thing that was very interesting is we had very predictable traffic spikes, and it was very spiky. On a Saturday, we would get double the traffic of any other day. Then on two or three particular Saturdays of the year, we would get double the traffic of any other Saturday. Generally, with any scaling issue, you knew when it was going to happen. Then anything else was usually some third-party issue or a self-inflicted "we just shipped a bug" scenario.

I definitely remember the issues we ran into being, like others have said, not super critical. One thing was, we knew how much money came into the system at any given time. Depending on when something broke and how big it was, you could very quickly assess, how important is this? Then that would set the tone for the rest of that incident. I think that's something that has really stuck with me: really understanding, what is the customer impact of this incident? Then from there, how do I triage it? How do I treat it? Triage is very much the word, it's like, ok, what's important? How do I turn things back on? Do I need to understand the problem right now, or do I just need to get the money pipe flowing again?

Greenberg: That resonates with me as well.

How the On-Call Experience Helped With the Development of the System

All of you touched upon how that drove action on the development side of things and understanding where the gaps were, where the holes in the system were. All of us have worked on sizeable systems, or are still working on very large scale distributed systems. Let's talk about the development side of things and how being on-call changed that, contributed to that, especially because through the DevOps movement, and all of us talking about DevOps, a lot of the focus has been on the Ops side. Let's focus on the dev side now. How did your on-call experience help with the development of the systems?

Mailer: The company that I was just describing earlier, effectively, they'd bought some third-party software off-the-shelf and hosted it. They had a very Opsy Ops team who would treat the software as a black box and just run it. I was brought in as part of this growing development side. We were very much developers learning to become operators. The DevOps movement to me has always been about developers learning to become operators and follow their code through to production. I think the follow-the-code-to-production bit to me has always been the key. It's like, I've made something but I haven't realized that value. It's still inventory in the lean sense until it's actually used by customers. I think when you start from that, that's when you start to get into looking at logs, you start to get into metrics. You get annoyed that StatsD is not actually telling you what you want to know. Basically, it's all about that analytics, that understanding: how does this actually behave when it interacts with real people? That's been the key to me.

Greenberg: Haley, does that resonate with your experiences?

Tucker: Yes, it does. One of the practices which I've been a huge advocate of is like feature flags. When I first started at Netflix, my team owned a library, and that library got pulled into four different microservices. I think coming at it from that perspective, you realize that if I make a mistake in this library and it causes a problem, it's not even my service that I'm rolling back, I have to go to other teams and get them to roll things back. Something that we leveraged very heavily there was this idea, if I push this out and it causes problems, how do I mitigate it? How do I do that in a way that doesn't require a whole bunch of people to have to get involved? I think that still applies to microservices, even if you're pushing out your own service, and you have full control over that, the ability to push out all of the rest of the features and just turn the one off that's causing the problem is a huge value. You can also pull that all the way into your unit tests. As you're adding the feature, you add unit tests around that flag to make sure that feature flag works and actually will turn off the feature as expected. You can pull as much of that left in the development cycle as possible, so that once you get into production, you're fully prepared and you know what to do in the case of an issue.
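As a rough illustration of the pattern Tucker describes, here is a minimal sketch of gating a new code path behind a flag and unit-testing that flipping the flag off really does disable the feature. The in-memory flag store, flag name, and functions below are hypothetical stand-ins, not Netflix's actual tooling.

```python
# Minimal sketch: gate a new code path behind a flag and unit-test that
# turning the flag off disables the feature. All names here are illustrative.
import unittest


class FlagStore:
    """In-memory stand-in for a real feature-flag service."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, value: bool) -> None:
        self._flags[name] = value


def recommend(user_id: str, flags: FlagStore) -> str:
    # The new ranking path ships behind a flag; the old path stays the default.
    if flags.is_enabled("new-ranking-v2"):
        return f"personalized-v2:{user_id}"
    return f"personalized-v1:{user_id}"


class FeatureFlagTest(unittest.TestCase):
    def test_flag_off_uses_old_path(self):
        flags = FlagStore({"new-ranking-v2": False})
        self.assertTrue(recommend("u1", flags).startswith("personalized-v1"))

    def test_flag_can_be_turned_off_without_a_rollback(self):
        flags = FlagStore({"new-ranking-v2": True})
        self.assertTrue(recommend("u1", flags).startswith("personalized-v2"))
        # The mitigation path: flip the flag off, no redeploy or rollback needed.
        flags.set("new-ranking-v2", False)
        self.assertTrue(recommend("u1", flags).startswith("personalized-v1"))


if __name__ == "__main__":
    unittest.main()
```

The second test is the point Tucker makes: the off switch itself is exercised before the code ever reaches production.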

Greenberg: I know that with Netflix embracing chaos engineering, having a direct feedback loop into development has been very beneficial.

Kolton, you've been very involved with chaos engineering, and actually your company focuses on providing that to other startups. Please share with us what your experiences have been and how that helped you as a developer.

Andrus: It's interesting, those early on-call experiences really sparked my interest in reliability as a whole. When I went and joined Amazon, I was offered many teams I could join, and the team that was in charge of making sure the website didn't go down sounded daunting but important, and something that I could do that would have a real impact on the business and on customers. I was thinking about how that's impacted, really it shaped my whole career, working in Amazon and Netflix, both focused on reliability, starting a company focused on reliability, trying to build tooling to make that easy for folks.

I want to tell the other side of Haley's story. I was on the API team. We ran all of the code that all of the teams wrote. We'd call it the fat client problem. There was a lot of logic in those clients, and we were responsible for operating that logic, but we did not write that logic. That was the birth of Hystrix and Circuit Breakers at Netflix, credit to Ben Christensen. This was actually before I arrived, but it was the reason I joined this team at Netflix: someone else's code could cause a problem. Let's wrap it in a Circuit Breaker so, first of all, if it does have a problem, can we gracefully degrade? My favorite team to pick on at Netflix is recommendations. If I can't tell you which personalized movies you care about, let me just show you a cached top 100 list so you can find the movie you want and watch it. Circuit Breakers are also powerful because they allow us to have timeouts and resource constraints so that our work queue doesn't grow unbounded and result in cascading failures. It allows us to isolate that, just like a bulkhead on a ship. I think that pattern to me is one that resonates as a good, solid engineering practice that in my early days I never would have thought about.
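Hystrix itself is a Java library; as a hedged, language-agnostic sketch of the pattern Andrus describes (count failures, trip open, degrade to a cached fallback), something like the following could work. The class, thresholds, and the cached top-100 fallback are illustrative only, not Hystrix's actual API.

```python
# Illustrative circuit breaker with a graceful-degradation fallback.
# Thresholds, names, and the fallback are placeholders, not Hystrix's API.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            # While open, skip the failing dependency entirely and degrade.
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()
            # Half-open: let one call probe whether the dependency recovered.
            self.opened_at = None
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker open
            return fallback()


def personalized_recommendations():
    raise TimeoutError("recommendations service is slow or down")


def cached_top_100():
    # Degraded but still useful: an unpersonalized, pre-computed list.
    return [f"top-movie-{i}" for i in range(1, 101)]


breaker = CircuitBreaker()
print(breaker.call(personalized_recommendations, cached_top_100)[:3])
```

In a real client the breaker would also enforce timeouts and bounded concurrency, which is the bulkhead half of the pattern Andrus mentions.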

Greenberg: That resonates a lot in terms of really defensively protecting not only your system from everything upstream, but also making sure that the impact going downstream isn't as bad as it might be when systems are running wild in production.

Michelle, would you like to share your experiences? I know you've had vast experiences in multiple large scale companies, I would love to hear from you.

Brush: One of the things that I think I realized when dealing with a large distributed system and being on-call for it as a dev was how much a better production experience relied on me making earlier design decisions in a way that would facilitate it. Two things come to mind immediately. The first is the idea of a generic mitigation, and I think Haley touched on this a bit with feature flags. The idea is that if you have something you can do that's safe and fast, you have more things you can try, but you have to build the system to allow those things; for example, operations need to be idempotent. You need to not have all these complicated nuances and configurations to be able to perform the operation. You have to think about, how do I design a system that supports those types of things that can be tried over again with minimal side effects? The second thing I learned pretty quickly is there's a lot of scaffolding development that has to be done if you want to do a zero downtime migration. Am I planning for that upfront and saying that part of the acceptance criteria of the system we're building is that we have the things in place to guarantee zero downtime?
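A minimal sketch of why Brush's generic mitigations want idempotent operations: an absolute "set capacity to N" can be retried safely in the middle of a messy incident, while a relative "add N instances" compounds on every retry. The in-memory cluster object is a hypothetical stand-in for a real orchestration API.

```python
# Idempotent vs. non-idempotent mitigation, with hypothetical names.
class Cluster:
    def __init__(self, instances: int):
        self.instances = instances

    def add_instances(self, count: int) -> None:
        # Not idempotent: retrying during a flaky incident keeps growing the fleet.
        self.instances += count

    def set_capacity(self, target: int) -> None:
        # Idempotent: running this once or five times leaves the same end state.
        self.instances = target


cluster = Cluster(instances=8)
for _ in range(3):  # the on-caller retries because earlier attempts timed out
    cluster.set_capacity(10)
assert cluster.instances == 10  # same result no matter how many retries
```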

Greenberg: It sounds like, in your experience, designing for reliability from the beginning is in general a good idea and really helps down the road. As you mentioned, a lot of those things are not obvious. Then when you learn them and start designing with them in mind, it influences and informs the design.

Let's hear from you Shelby, all the wisdom you've gained.

Spees: I think one of the things that I saw a lot on that first team was armchair debugging, where you start getting errors, or people start complaining about latency or something, and you sit in your chair and you're like, maybe it's this. You have a hypothesis, and then you go search your logs and see if it supports that. If it's wrong, then you have to come up with a whole new hypothesis, and then just go and search your logs for that. It's this very long process. For the cases where you just released a change and everything's breaking, you just revert that change. That's pretty quick. For the things where it's some intersection of the stars aligned in a very certain way, and something that we changed two months ago is now having weird side effects, it's a lot harder to debug. If you've been on-call for a long time, and you're used to relying on your metrics and monitoring tools and your logging tools and stuff, you learn that that particular squiggle means it's Redis. If you haven't been doing it for decades, which I haven't, and you haven't seen all the big failure scenarios for that system, it's almost like, how would I even possibly guess?

I was actually a Honeycomb user before I joined the Honeycomb team. I remember being like, there's another way we could do this. That's when I learned about instrumentation, and why we would want to do something besides adding StatsD timers around different chunks of our code, or changing the log level. There's a different way. There's possibly a better way, I would argue. That's where connecting the dots from writing your code, and then going through and observing it all the way through its lifecycle in production, becomes really important. Thinking about, what is this change going to look like? How do I know it's working when I release it? How do I know it's not working? And adding instrumentation for that. I got to see what happens when you don't have that, which I think a lot of people experience.
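As a hedged sketch of the difference Spees points at, and assuming nothing about Honeycomb's actual SDK: instead of sprinkling StatsD-style timers around chunks of code, you can emit one wide, structured event per request that carries the context you would want when debugging later. The field names and the print-to-stdout sink below are illustrative placeholders.

```python
# One wide, structured event per request, instead of scattered timers.
# Field names and the stdout "backend" are placeholders for illustration.
import json
import time
import uuid


def handle_request(user_id: str, endpoint: str) -> None:
    event = {
        "request_id": str(uuid.uuid4()),
        "endpoint": endpoint,
        "user_id": user_id,  # high-cardinality context is the point
        "feature_flags": {"new-ranking-v2": True},
        "build_id": "abc123",  # ties behavior back to the change that shipped it
    }
    start = time.monotonic()
    try:
        # ... the real request handling would happen here ...
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event))  # stand-in for sending to an observability backend


handle_request("u1", "/recommendations")
```

Because each event carries the flag state and build ID, "how do I know this change is working?" becomes a query over those fields rather than another round of log searching.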

Greenberg: The theme of this specific InfoQ Live is observability and planning for it from early on. As you're developing, make sure that you add that instrumentation so that this is possible, so that you can see it playing out, and hopefully that simplifies your life when something goes wrong.

Luke, we'd love to hear from you and the couple companies that you've worked at that had huge distributed systems.

Demi: I think it's interesting, Michelle brought up how being on-call changes the way you think about new systems. I think there's also this perspective that being on-call is really the only lever you have when you have bad, broken systems. Being on-call is such a misnomer for me, because really, we're talking about when systems go down. It's not just the experience of being on-call, it's the experience of things breaking. When things break, and when you're on-call, and you experience that, that's your lever to help change culture within an organization so that you are building the right systems for the future, so that you are doing things like adding better observability. You are using feature flags, and not relying on 10-minute deployments to fix things. You're making idempotent changes, better runbooks. On-call is the lever to improve those things. Through on-call and things like code yellows and code reds, that's what drives change in culture, but also the reliability of an organization from the dev side.

Helping New On-Callers Onboard

Greenberg: I would love to hear from panelists on the second part of the title of the panel, which is, make yourself at home. How do we help on-callers, especially those who are onboarding for the first time or onboarding on a new system? How do we help them onboard so that it's not as intimidating and scary as it might have been for some of us when we started out?

Demi: The first thing when somebody is joining the on-call rotation is that you go and you scramble and update the runbook. I think that's the rite of passage. Most runbooks are built this way. They're built in that everybody updates them once every three months when somebody new joins the on-call rotation and needs to get updated, or adding like the oh shit levers and those type of things to the runbook so people know where they are and those type of things. For me, I think when you're trying to make people comfortable coming into on-call, I think the most important thing is how can we explain this system in plain English and not make it just a series of, squiggle here, change this there. Because, really, what's happened in the past dictates what's on a runbook, and it dictates what's on monitoring dashboards and those type of things. The reality of a system doesn't really change, or it does change, but you can ramp somebody up on what a system is: what are its inputs, what are its outputs. Give somebody that feeling, in plain English, of, what is this thing? Why does it sometimes behave and break in these ways? I think the best favor you can do for somebody who's coming on board, is to really give them that plain English walkthrough. For systems where I'm onboarding a lot of people, sometimes what really helps is just a long talk in very plain and candid English about what this thing is and how it operates, and how you should wrap your brain around what this system does. For me that's been the most effective. Runbooks are also helpful, especially when knowing what levers to push, but the biggest, most impactful thing for new people is just understanding the corners and the edges of a system.

Greenberg: That makes a lot of sense. Somebody doesn't have the mental model of what the architecture is, how do you expect them to understand what's going on in the heat of the moment?

Michelle, you and I had a very interesting discussion in terms of how to make sure that people are comfortable being on-call, and minimizing the stress and the fear around it. We'd love to hear from you here.

Brush: One of the tools I love is having the main on-caller, that's the experienced one that's been on the team for a bit, pull the new on-caller along with them on outages, and stress that the experienced person has to think out loud. Every hypothesis they have, everything they're trying to do, they need to express that verbally. That helps the person onboarding learn how they think about the problem, not just, how does the system look? Not just, where are all the dashboards and where are all the playbooks? Like, how do you think about production? The next thing that I really want to do whenever we're onboarding someone on-call is make them feel empowered to own the decision. I think too often we end up in the space where people want to look to the TL or the manager or the architect or someone more experienced. I worked for a company once where everyone looked to the executives, who had the least amount of information to make a decision in an outage. They were looking to someone with more authority than them to make the decision about what to do next, or to give them permission on what to do next.

One of the big things that we try to work for is making the on-caller feel like they are the one that owns the decision. It's ok. It'll be safe. We trust that they made the best decision at the time given the information they had, but we want them to own the decision so they'll feel more comfortable and be able to operate without panicking and saying, "I need my manager," or, "I need the principal engineer to come into this." That's a big part of that thinking out loud is that experienced person showing that they own the decision, and they can make it on their own. Then if they need to pull in other people they do, but they ultimately feel that sense of ownership.

Greenberg: With embracing blameless culture, I'm sure this will resonate with a lot of us: we all learned the most from our mistakes, not the things that we did well, because those mistakes stuck with us. At some point we also felt safe making mistakes, and still do.

Kolton, any parting thoughts, wisdom to share about onboarding others?

Andrus: I'll echo Michelle's points. The question of who has the responsibility was one of the daunting things. Being a call leader at Amazon, you're elected, there are 10 to 14 of them, and you have the authority and the responsibility to do whatever it takes to resolve that outage as quickly as possible. That is scary. That was a little bit before the era of blameless postmortems. I saw some people get called out for questionable judgment when obviously you're acting on incomplete information. One of my soapboxes: I love to ask live audiences, how many of you have participated in a fire drill? Pretty much everybody. There's a reason for this. We want people to react calmly and safely when a fire breaks out. We don't want people to panic. We don't want people to inadvertently make things worse. One of my pet peeves is I think our industry can do a lot better about training folks to be on-call. We can all make the joke about, here's your pager, here's this out-of-date runbook, figure it out. Really, that's a sad joke. We need to invest the time to give people time to prepare. Shadowing a primary is 100% something I would recommend. Mock exercises. Given the chaos engineering background and angle, create a real but small incident that you have control of and let your team practice all of the steps. That involves, did they get paged? Did the escalation work correctly? Do they have access to the right system? Do they know where the logs are stored? Do they know which dashboard to look at? Do they know where the runbook is? Do they know who to ask for help if things go wrong? Just letting people practice that during the day, after the coffee has kicked in, instead of having to figure it out at 2:00 in the morning by yourself in the middle of the night, would make a world of difference. That's my plea and request to the industry as a whole.

Greenberg: It definitely resonates. From my conversations, I think a lot of us will agree with you: we had that whole thrown-into-the-deep-end, have fun, go figure it out. Then if you can't, we'll talk about it later. We shouldn't have to do this to other people who are onboarding now.

Shelby, would love to hear your thoughts on this.

Spees: I love everything Kolton said. Comparing it to fire drills is such a useful comparison, because yes, everyone's experienced that. I think software as an industry has a lot we can learn from safety critical industries, and especially since we're embedded in so many parts of people's lives, a lot of us are arguably in a safety critical part of the industry. I think learning from how EMT response works and firefighters and things like a light gun, ISS, how they do their training for incident response. We can absolutely learn from that. I think the resilience engineering community especially has been promoting that. Another thing I want to note is, as the industry grows, more people are coming into tech through more novel routes: more people are coming in through boot camps, or are self-taught. We've had self-taught people for a long time, but now they're learning from online courses and such. We set goals to improve diversity, both through the pipeline and through retention. Everyone has different experience with being in that decision-making position. Of course, we want to empower people to make decisions in the moment, and that's part of why training is so important and the shadowing of senior on-call people is really important.

At least from my experience, I think there's also the stereotype threat of being the only person in the room who looks like you, and also being responsible if you make a bad decision. If you don't have the psychological safety there, then you're going to freeze up. That's how something that could have been a small mistake can become a big mistake with ripple effects, and make an incident a much bigger problem than it needed to be. I think the more we do to create psychological safety on our teams and promote equity and diversity in our organizations, the better we're going to be able to make people effective at incident response across the board.

Greenberg: Yes, just making sure that everybody does feel like their voice counts, and that they're ok. They're safe in expressing it.

Spees: They're not going to be disproportionately punished if they make a mistake in the moment.

Should Engineers be Paid Over and Above Their Salary for Being On-Call?

Greenberg: Glen, we'd love to hear from you. Also, Sean asked, should engineers be paid over and above their salary for being on-call?

Mailer: That one is interesting, because you can flip that question around to say, should somebody who's unable to be on-call be paid less? Say you just had a baby, should you take a pay cut now because you have to step off the on-call rota? In principle, yes, probably, I think people should be paid more to be on-call. If for life reasons you need to step out of the rota, I don't think you should take the cut. There are also some really interesting incentives I've seen play out, where if you get paid a bonus when you are called out, that can lead to some really bad incentives around fixing problems. Generally, I've seen that play out in less healthy engineering cultures.

Making Sure Onboarding On-Call Isn't so Scary

Greenberg: Back to helping others make sure that onboarding on-call isn't so scary.

Mailer: One thing that I thought was interesting: Shelby described the drills, like looking into how EMTs do things. I think that framing is likely to be a lot more effective than the fire drill framing, because I think practicing is good, but if you compare it to a fire drill, everyone gets up and walks out the door and does exactly the same thing every time. I think incidents are very much not like that; they're dynamic, evolving situations. Because the boring ones, we can just get a computer to fix for us. I really wanted to echo what Michelle said around thinking out loud. At some point in the past, I got into this habit of thinking out loud in text. If someone says, should we start a Zoom? I'm like, can we not? Because as soon as you start a Zoom, as soon as you start talking out loud, all this context gets lost. Even if you record a Zoom, who wants to watch a 3-hour Zoom of a horrible incident? I really like to write down my thought process in the same way that I would vocalize it. It helps share with anyone else involved in the incident. It also means people can learn from that incident asynchronously by following that thought process and going, "I see what happened there." I've seen that be really effective.

Greenberg: That also reminds me of things that I learned from Michelle earlier, which is externalizing your process. If you want to teach people how you did things, how you know about this, just externalize it, which is, just explain your thought patterns. Explain how you reached those conclusions so that next time people have a good framework to follow that might help them as well.

Brush: I'll add to what Glen said about getting it in writing. That helps a lot when you do incident analysis later, and you're either doing a postmortem or some retrospective, like having all the thoughts written down helps you understand where monitoring led you astray, or where the signal that you got looked like this other outage, but it wasn't. You can go try to identify those for better observability.

Compensating On-Call Engineers

On the point of pay, I have worked in both worlds. I've worked in a world where you didn't get paid for being on-call, even if you were up at 3 a.m. working on an outage for 4 hours. I've lived in a world where you did get compensated for your on-call time; it was not full overtime, but you got something per hour for the time that you were on-call. I think that being paid is always better. No one's going to argue with that. It does help with really bad days, where you're having a really bad outage and you're like, at the end of this, I'm going to get some compensation, so I can get through this. I think where it's hard is when you try to fan out, you try to pull in more people and they're not getting paid. Then that creates this unfairness. If the organization is going to pay for on-call, they need to do it in a fair, equitable way.

Andrus: The pay one's an interesting one, because just look at the market: SRE wasn't a title that existed 10 years ago, and Ops work was the work that no one wanted to do, and they had to incentivize folks to do it. Now we've flipped it. It's like me playing Magic in the '90s. Now it's cool, when in the '90s it definitely wasn't. Now SREs are cool, and they're in demand, and that skill set is in demand and therefore it is paid more. You see a bit of balancing at a macro level; that doesn't change. Glen makes a totally interesting and valid point about flipping it around. It's been interesting for me to observe that change over the past few years.

Demi: I think it's worth pointing out, there is a psychological impact of being on-call. It's more than just more work. It's a psychological, you go to bed, you don't know if you're going to be woken up. It's like having an infant. It's worse, I would say. When it comes to, should you be paid more? The answer is yes. It's simple to me. If you are on-call and you're dealing with the psychological impact, and you're on the front lines, you should be paid more. How that works, I don't know. That's not my sphere of influence, but I just think when you factor in the psychological impact, it takes years off your life, I feel. It's a whole topic. It's, how do you avoid being on call when you join a new job? How do you hide from these responsibilities? I personally would love to not be on-call as well. There's nothing I'd love more. This is to answer Glen's question, I would take a significant pay cut to not be on call. If I knew I could just have somebody else to deal with all the problems that I threw over the fence during my work day, I'd take a pay cut.

Mailer: That's blasphemy in the DevOps world, how you ended that, Luke. I was on board until you got to that point.

Spees: It's such an important thing for developers to be on call for their own code. It's another thing entirely if you're on call for some random team's code and you never experience the effects of your own changes, but to be on call for your own code is such an important growth experience. I think that's a lot of what many of you have been saying about just the learning experience that changes how you think about production. I think it's also on engineering managers to make it so that, as much as possible, your on-call experience isn't taking years off of your life. There's only so much we can do. The reason why EMTs get paid so much and firefighters get paid so much is that there's risk there. It's also true for people who are on call for safety critical software systems, or just anything that could wake you up in the middle of the night, really. Historically, the support team or the NOC is some divorced team that has to deal with your shit. They're the ones who are in another time zone, so hopefully they're getting paged less overnight.

How to Gain Experience Supporting Complex Production Systems Outside of Work, For Engineers Early in Their Career

Greenberg: Ellie asked, for people early in their engineering career, do you have any recommendations for how they can gain experience supporting more complex production systems outside of work? She has struggled to gain this experience in her side projects because of the scale and cost. I think that resonates with the question of how to make people comfortable. How can they gain this experience? We'd love to hear from you, Haley.

Tucker: I have some recommendations; I don't know that they're necessarily outside of work. I might need to noodle on that a bit. Something that I found useful when I was early at Netflix and at previous companies is that everybody has that low volume of errors that people know are there but that aren't critical enough to dig into. I find it useful to put new people on those, and I did this myself: let me figure out if I can trace this error back to something, because we don't have a runbook for that. It's going to require you to figure out, how do all these pieces connect? Can I even make that connection? What questions do I need to ask people in order to figure out how this is connected to this over here? I think there's a lot of value in pointing people at some non-critical problems that require a lot of investigation, tying things together and trying to figure out how to do that. That's one idea.

The other, and this is a 100% plug for chaos engineering, is that being able to inject failure into a system in a controlled manner, see what happens as a result, and limit the blast radius of any fallout from that is hugely valuable, because you learn so much. You learn how all of your design decisions that were well intentioned and meant to mitigate a problem are actually not doing what you intended them to do. There are open source tools and platforms dedicated to this that you can play around with. I would recommend that.
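As a hedged sketch of what "controlled manner" and "limited blast radius" can look like in code, assuming a hypothetical in-process injector rather than any particular chaos engineering tool: the failure is scoped to a small fraction of requests and the experiment has an automatic abort condition.

```python
# Hypothetical latency-injection experiment with a bounded blast radius
# and an automatic abort; names and thresholds are illustrative only.
class LatencyInjector:
    def __init__(self, target_fraction=0.01, added_latency_ms=500, error_budget=0.02):
        self.target_fraction = target_fraction  # only ~1% of requests in scope
        self.added_latency_ms = added_latency_ms
        self.error_budget = error_budget        # abort threshold on real errors
        self.enabled = True

    def should_inject(self, request_id: str) -> bool:
        # Hash-based sampling keeps the experiment to a small, bounded slice.
        return self.enabled and (hash(request_id) % 100) < int(self.target_fraction * 100)

    def abort_if_unhealthy(self, observed_error_rate: float) -> None:
        # Stop condition: halt the experiment if real users start hurting.
        if observed_error_rate > self.error_budget:
            self.enabled = False


injector = LatencyInjector()
for i in range(5):
    request_id = f"req-{i}"
    if injector.should_inject(request_id):
        print(f"{request_id}: +{injector.added_latency_ms}ms injected latency")
    else:
        print(f"{request_id}: handled normally")

injector.abort_if_unhealthy(observed_error_rate=0.05)  # above budget, so stop
print("experiment still enabled:", injector.enabled)
```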

Experiences with Having an Incident Coordinator for Major Incidents to Do Stakeholder Comms

Greenberg: What are your experiences with having an incident coordinator for major incidents to do stakeholder comms when engineers are busy fixing major incidents?

Mailer: The incident process we have at CircleCI actually has two coordination roles. We have an incident coordinator who is responsible for keeping track of where things are overall, and who's working on which bits, and also for pulling in any additional people. Generally, for a small incident, the first responder starts with all the roles. Then if the incident blossoms, you start to hand off roles. One of the earliest roles to hand off is: ok, here is a coordinator who is not involved in trying to actively fix anything; they're just keeping an eye on what's going on and making sure people are making forward progress. Then we have an additional coordination role, which is our communications coordinator, who is specifically responsible for taking what's going on, translating that for the outside world, and keeping them up to date. We find that it's a big improvement in customer sentiment, and a reduction in tickets, if every 20 minutes or so we're providing useful information on our public status page, and not just saying that it's an ongoing story. Having someone whose sole responsibility is to keep track of that timing, of, when did we last update? What state are we in? What can we say? Can we say more? Is this a bit vague? We find that really effective.

Andrus: One of the opportunities I've had in my career was to be a call leader at Amazon for several years and see how they did it. When I joined Netflix, I brought some of the learnings there and got to see how Netflix did it. I think there are a few key points if you end up in that role and you're in charge. I've had the one-role and the two-role setups; I think there's value in having people to help coordinate, take notes, and communicate. I think it's good to have one decision maker, one person that doesn't need to know everything, but needs to make that call. Some people will not feel comfortable making a questionable call in those circumstances. You need someone that can fill that role, that can weigh the evidence and make a call knowing that they might be wrong.

I think the status updates are critical. When I'm running a call and people are joining, for the first 5 minutes I'm telling people every 30 seconds what we know, what the signal is, where we're at. Then I'll start to space it out more. Reminding people what they should be doing, tasking them to go look at their services, their dashboards, their logs. We're not aware of what's broken yet, so don't assume it's not you, assume it is you. Go look at the things and rule yourself out. If people can rule themselves out, excuse them from the call. I've been on very long calls where it was very obvious that it was us and the load balancer team and the networking team that needed to be on, and the top five critical applications were not needed. As soon as we had high confidence in that, excuse those folks so they can go back to sleep, so they can go back to work, so that they're not being randomized. Those are my call leader tips and tricks.

Mailer: I think one trick was taking that verbal update and then putting it in Slack or whatever chat tool you use, so you've got the update written down. Something we've also used sometimes is to start a Google Doc or some editable doc that you can keep, and basically have the summary at the top while the conversation scrolls, so anyone joining can see, what do we know right now for sure, in one place. That's really effective.

How Learning from EMTs and Firefighters Responses Might Work

Greenberg: Shelby, I'm curious about something you were talking about, the EMTs and firefighters and learning from their responses. Anything that you can share specifically in terms of how that structure might work? I'm not very familiar with it beyond what roles they serve.

Spees: At Honeycomb we're really lucky. My manager used to work at PagerDuty, and he actually was able to give us the PagerDuty incident response training, which is, I think, available open source. My understanding is that it was built based on the history of EMT and all of that stuff, that sort of highly safety critical incident response. I recommend taking a look at that, just reading through the documentation. It explains a lot about the role of an incident commander, the reason why you have a separate incident commander and comms role, the reason why the incident commander doesn't do any investigating. You assign people tasks, and you want them to come back within a couple minutes, or you set timers around things. It's very specific. It's something that you can drill. You can drill, like Kolton was saying, with a very small incident or even a fake incident, just to get people the muscle memory and the comfort level of making calls in the moment, so that when you're paged at 2:00 in the morning, you don't have to remember what you were supposed to do.

Our team is still very small, and oftentimes we'll assign someone as incident commander and we won't follow the PagerDuty protocol perfectly at Honeycomb, but just having a little bit more structure makes things go a lot smoother. We've taken to having incident reviews for even the smallest incidents. Any time something requires some manual intervention, it's worth reviewing and discussing. That's the other half of it: you can learn a lot from something that's just a little bit of friction. It doesn't have to be the world ending in production, because you can find and fix things when there's a little bit of friction, and that makes it smoother when you have a world-ending Black Swan incident. All of that continuous improvement stuff is really important.

Greenberg: Just having that structure should also alleviate some of the stress where you don't have to do everything. You can do some of the things and then learn from them and maybe some roles you're more comfortable with.

Key Takeaways

I would love to hear from each panelist key takeaways, maybe a one- to three-sentence summary of the things that you think are most important, so that the attendees can walk away with the key insights.

Spees: What this is making me want to do is shadow our own on-call teams, which is something I've wanted to do for a while in my role right now. It's not one of my main responsibilities. There's always so much to learn. I think a lot of what Michelle was saying really resonated with me about the systems thinking and upfront design thinking. That's something I'll definitely take with me going forward.

Andrus: Probably not surprisingly, my advice is to be proactive. Go out, go read every postmortem or incident review at your company or online so that you get a context of the things that could happen. Gremlin has a free version. If you're interested in chaos engineering, go play with it, go cause failure in a safe, controlled way, validate your hypothesis and understand what happens. Observability is great, but that's read-only access to your system. Write access and seeing how things change when things go wrong is critical. Use that to run a fire drill or whatever you want to call it, to help give people an opportunity to practice and prepare so they're not figuring things out at 2:00 in the morning.

Brush: I think my biggest takeaway around this whole space is just how everyone from all these different companies, myself included, had a general sense of: we don't feel like we know what we're doing, but we know what to try. That helps anyone understand: don't feel like an imposter. Don't feel like you're doing the wrong thing. You're doing what you can, given the information you have. The second thing is, I hadn't thought about what Shelby said around needing a lot more additional safety when you're in an underrepresented group. It was funny, because I should have realized that, given my participation in the tech industry. That's something for me to take back and think about more: when we talk about psychological safety, are we being fully inclusive and understanding what that means for everyone on the team?

Mailer: One thing that's struck me is that the title of this panel is, "Embracing Production: Making Yourself at Home." We've mostly talked about bad stuff breaking. What I've wanted to reinforce is that production is not just about when things break, production is also about when things go well, and the positive changes we make out there. I think by getting used to following your successful changes out to production and understanding how they behave and making sure that the impact of your changes is what you expect, that's how you can build the muscle and the preparation so that when something does go wrong, you're used to interacting with production directly. You're not only looking at it when things are on fire.

 


 

Recorded at: Jul 24, 2021
