Production Readiness: Fighting Fires or Building Better Systems?

Summary

Laura Nolan discusses why we don’t have a fire code for software, and what Production Readiness Reviews can and cannot achieve in terms of reliability.

Bio

Laura Nolan's background is in site reliability engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Nolan: In this talk, we're going to talk about why it's important to have a process to give space and time to teams to think about the reliability needs of their systems, and what needs to be done to productionize them. We're also going to talk about why production readiness should focus on team engagement, rather than being a strict set of rules, or even strict set of processes that a team has to go through. We're also going to talk about what to do when your services are not sufficiently productionized. How to get ahead when you end up in a situation where your team is very reactive, and just fighting fires all the time. This talk is really how I think about production readiness. The first section is a little bit philosophical, but we will get very practical at the end.

Background

I'm a senior staff engineer. I have contributed to various books. I write a regular column about SRE at USENIX's ;login: magazine. I write occasional incident analysis on the Slack engineering blog. I spent a bunch of the pandemic getting an MA degree in ethics. This is an actual picture of me here to the right, attempting to write an MA thesis in the third lockdown.

Tanya Reilly's 2018 Talk, 'The History of Fire Escapes'

I wanted to start off by recalling a talk, this is from about three years ago, and it's by Tanya Reilly, who is a principal engineer at Squarespace. Tanya's talk is amazing. It's a really eloquent, funny, and well-illustrated argument for proactive software reliability work. I agree with that completely. We should be working proactively on making our systems more reliable, because otherwise, you can end up in a very bad place indeed, circling the drain. This talk is definitely not a critique of Tanya's talk, it's maybe a response or some corollaries. There are limits to what we can do in terms of proactive reliability, and we can do quite a lot. Our systems will never be perfect, and they will never be entirely incident free. Buildings are not software. There are some important differences between the two things. My most important message here is that for software reliability, I do not think that strict rules and regulations serve us well. I don't think that we're going to wind up with legal rules around software reliability. I do think that in a lot of organizations that are newer to SRE and new to the idea of doing engineering work on reliability, you do end up sometimes with organization level strict rules. I think that those can be really counterproductive.

Regulation in Software

Why don't we have a fire code for software? Actually, if we think about it, we have quite a bit of regulation in software. We do. Chances are you've had to deal with SOX, or GDPR, or maybe CCPA. Certainly, there's a plethora of different privacy and data protection rules coming in. Maybe you've had to deal with something like FedRAMP certification. Maybe you've worked in a safety critical domain, such as aviation software, or automation software for cars, or medical devices. There are other domains where there are certain rules about how you build software. Primarily, we see regulation, and this is all domains, not just software, typically in areas where you have some risk to human safety, and you have a conflict of interest where we think that organizations might cut corners in harmful ways. Or maybe there's a risk of financial fraud, like with SOX. In most of the cases where we don't have one of these factors like fraud or safety, we tend to see self-regulation. That's where software reliability sits, except for the safety critical cases. We see it in cases where there's issues around fraud or privacy.

Regulations can also act as a repository for good practices, a way to pass on information, often very hard-earned information about what works. We don't want people to have to reinvent the wheel. We don't want to make silly mistakes again and again. This I think is the way that Tanya is mostly thinking about a fire code. I think it's the more productive way to think about software, because we often find things in our systems that are not what they could be. You have an incident, and you say, why didn't I have load shedding here? Why does this bottleneck that we didn't know about exist? Why do I have this gap in my monitoring? This, I think, is the level where the analogy between fire regulations and software reliability seems to work. There are still some problems with it.

Regulations Evolve

These are Georgian windows. I live in Dublin, in Ireland, and the city is full of Georgian buildings. The history of these buildings and the architecture, and the regulations is quite interesting. You've probably heard of the Great Fire of London in 1666. After the great fire, London brought in new fire regulations, which said that you had to build in brick or stone, that you couldn't build your buildings in wood anymore. The fabric of the buildings was inherently more fire resistant, and fire was less likely to spread over large areas. Originally, windows were flush with walls. See the leftmost picture here, we have this wood-framed sash window, and that is completely flush with the bricks. That was the first style of building that you got between 1666 and 1709. In 1709, you see a switch over to this middle window. The regulations have changed, in London at least, and window frames now have to be recessed into the wall. You're keeping that wood away from the surface of the wall, so if there's a fire outside, it is less likely to catch your window on fire and spread into your house.

Then in 1774, the UK parliament passes a nationwide building act. At this point, the whole window frame is supposed to be covered in brick. This is the third picture all the way over to the right. You can see there, it's recessed back, and a lot of the wooden frame is covered in brick, you can just see the window bars and the glass now. This again, makes it less likely for your window to catch on fire. This is 108 years of evolution. What have we done? We have covered a wooden window frame with brick.

We've been building large web scale distributed systems for about 25 years, but they are quite a bit more complicated than sash windows. Software systems, they're not static in the way that a building mostly is. They have this dynamic equilibrium. They're harder to inspect. You can't physically go look at your software system. The way that you inspect it, you have to build monitoring and tracing, and think about it, and interrogate your system directly in that way. Loads and other things in the environment, they vary hugely in software. Think about the Facebook outage that we had at the start of October 2021. As a result of that outage, some of the DNS services started to see 30 times their normal load. This was because every Facebook SDK, and every application or web page that included Facebook, was repeatedly requesting facebook.com via DNS. Software systems, their environment, they change constantly. The built environment changes rarely and slowly.

Not all buildings are the same. The same is true for distributed systems, you have simple or complex ones, standardized versus custom. A Ruby on Rails app or a Lambda is not the same as your custom large scale cloud infrastructure. Your systems can be stateless, or data intensive, with different SLOs, different users and use cases. Are your users internal or external, things like that? There's a huge amount of context. For those 25 years that we've been building these systems, we haven't been building with the same technology: different programming languages, frameworks, web standards, cloud, mobile, all these things. As an industry, we make mistakes, and we get better. A simple example of that. Around 12 years ago, I was working for an eCommerce company, and the first ever iteration of the mobile app came out. There was an obvious problem with it, in retrospect, which was that it had code that would every day update the user's catalogue of all the products that were available. Every app in the world did it at the same time every day. We wouldn't do that now. That would be very unlikely to pass code review at almost any sizable organization, I think. We have this shift towards new, better practices. The same happens in building safety, but normally not multiple times in one career. The pace of change is huge.
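One common way to avoid that kind of synchronized daily spike is to give each client its own stable offset within the day. The following is a minimal sketch, not the fix the company in the story used: the `device_id` string, the function name, and the 24-hour spread window are all assumptions made for illustration.

```python
import hashlib

SECONDS_PER_DAY = 24 * 60 * 60


def daily_refresh_offset(device_id: str, window_seconds: int = SECONDS_PER_DAY) -> int:
    """Return a stable per-device offset, in seconds after midnight, for the daily refresh.

    Hashing the (hypothetical) device ID spreads clients roughly uniformly
    across the window, so they do not all request the catalogue at once.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    for device in ("device-a", "device-b", "device-c"):
        offset = daily_refresh_offset(device)
        print(f"{device}: refresh at {offset // 3600:02d}:{(offset % 3600) // 60:02d}")
```

A hash is used here rather than a fresh random number so that the offset stays the same across app restarts. A random offset chosen once and stored on first launch would likely work just as well.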

Downsides of Rules and Regulations

There are downsides of rules and regulations. Adhering to rules and regulations, it's time consuming. It will reduce your speed and agility somewhat. You'll have to have more process, and more compliance, and more checks. The other thing about regulations, particularly in quite complex and poorly defined areas like software, they'll be one-size-fits-all. They're not going to be specific to your system and architecture. I would very much love to see more detailed productionization guides come out for open source software and cloud software. We don't really have a lot of that. Particularly for your homegrown stuff, you don't have it. Regulations for reliability are easy things like, you should load test each component. Even that, it sounds reasonable, but perhaps you have got components where throughput is inherently limited. Maybe your time could be better spent elsewhere, where maybe you have things that are embarrassingly parallel. We should think about things as guidelines and not as rules. Strict rules move decision making further away from a specific context. This is particularly not great, where we have to make so many tradeoffs, and deal with so much ambiguity. We should not be making our prioritization decisions ahead of time, and further away from the specific context.

Production Readiness Reviews (PRRs)

With that preamble aside, on to Production Readiness Reviews. From the Google SRE book, a PRR is a process that identifies the reliability needs of a service based on specific details. The two key phrases here are identifies and specific details. What we're doing is we're engaging with a service with the specifics, to figure out what it needs. We haven't figured out what it needs ahead of time. The whole process is about figuring that out. It's not about pre-judging it. Typically, SREs, or another owning team would do a Production Readiness Review, at the start of an engagement with a service. In the classic Google SRE model, an SRE team works with the developer team. The SRE team is going to run the service and hopefully they've been involved at least at some level with the design and the building of it, and now it's ready to go into production to be launched. You're going to do the Production Readiness Review. After that, the service is going to be SRE managed and SRE owned.

Evaluating Reliability Needs

The things that we typically think about are everything on this list. Don't think of this as a checklist, think of this as a list of areas to consider. That's how it worked on the teams that I saw that did this well. They didn't see this as a hard and fast list of things that they had to do in order. It was, here is what to think about. Nothing here is too surprising, I don't think: SLOs, monitoring, observability, alerting, runbooks, and so forth. Training for on-callers is really important. I think, in a lot of teams, a lot of organizations, we miss this out, but the resilience of any system comes from having people around who understand it, and can deal with problems. Who can even spot whether a problem is significant or not. If people are not well versed in the system with the sharp edges, its particular quirks, it's very hard to do that. Training for on-callers on how the system works, how it's supposed to manage failure, what are some of the things that can happen? What are the processes that you can do to recover things? All of this is super important and should not be overlooked.

Change management. How do I roll out configurations? How do I roll out new binaries? How do I know if it's working? Do I have canarying? Robustness, failure testing. Have I tested how the system actually reacts to the failures that it's meant to be able to deal with? Do I have automation for things like doing backup and restores, or any other recovery operation that I might need in production? Or even things like rotating secrets, stuff like that. How will the system scale? How will it manage extra load? Do I need to reshard, things like this? One other thing that's often a little bit overlooked as well, is tooling. Some teams that I was on have taken this very seriously. The idea here is that in software operations, there are typically a bunch of ways of accomplishing any given task. It's very nice for a team if you don't have three different services that do rollouts in three different ways. It's very nice to have a fair degree of standardization here. Sometimes it makes sense for things to be different, but the default should be standardization.

How Long Does a PRR Take?

How long should this take? How long is a ball of string, or a piece of string? It varies wildly. Partly this is because people think about PRRs in different ways. I've seen people do PRRs that just consisted of filling in a template document over a couple of hours. I've seen PRRs that took two or three people, two or three quarters. Part of this is to do with scale and the complexity of the service, and what condition it's in. Does it already have good monitoring, automation, runbooks, and so forth, or do you need to build all of this? For my part, I think that for any complex or critical service, I think a PRR should take at least one person a quarter. This really depends on who is driving it.

Consulting vs. Ownership

Broadly speaking, in SRE, there are two major models that we see. We see the consultant model and the ownership model. The consultant model is where an organization has a small, centralized team of SREs, and they are production experts. They consult with other teams to try and help them make their services more reliable. They probably work on tooling and do things like that as well. Maybe they write production best practices, and all this stuff. Then there's the ownership model, where you have a team that's either fully SREs, or maybe it's a team with some embedded SREs in it. They're going to own that system long term, and run it permanently. Consultants are transactional. They want to take a ring, bring it to the volcano, and throw it in. Then they're pretty much done with the ring. Gollum cares about that ring, long-term. Maybe it's not good for him. We're going to get onto that when we talk about what to do when your system's always on fire. Gollum cares long-term. Gollum has skin in the game. When done right, that ownership model can be very good. You care a lot about that system, you're going to be experts in managing it. You should have a lot of context on the domain.

With consultants, the failure mode tends to be that you'll get an engagement and you'll end up with things like generic monitoring. You'll have the four golden signals type monitoring for your system. You won't necessarily have monitoring that's thought about the specific aspects of your system, how you will know whether it's working as intended or not. That really does matter, and not just for monitoring, for every aspect of the production readiness.

PRR Antipattern: Shallowness

Shallowness is a PRR antipattern. You may get this with consultants. Because a lot of us want to say, I've been doing this for years. I know about making systems reliable. I can help other teams do that. We make a PRR template, give it to other teams. They fill it in. We review it, and everything is great. That's shallow. PRRs should be about depth. A big part of a PRR is just giving the team that's going to be owning that system, time and space to spend with it, to think about the failure modes, and what it needs, and what it's missing. You can't do another team's PRR for them. You just can't. For years, I was an avid scuba diver. When we do scuba diving rescue training, it always says, you can't swim for another person. They have to swim for themselves. You can just help them. You can deload them a bit. You can't do another team's PRR for them. You can consult with the team and lend your expertise. That should look like sitting with the team regularly over an extended period of time, while they engage with their system. Help them ask the right questions. Help them go deep, rather than rushing through, being shallow, and throwing the ring in the volcano. We don't want that.

PRR Antipattern: PRR Law

The second antipattern goes back to the start of this talk about the difference between regulations and guidelines. This is the payoff for that whole section. We should not build a PRR template and then make that a hard and fast rule, like a set of hoops that a team has to jump through to launch their service. Because this does happen in some organizations. I've talked to a bunch of people that are in this situation where decision making has been effectively taken away from them. They've just been given this set of boxes that they have to check. You get shallowness. You get this phenomenon where teams are given a bunch of not necessarily useful work that they have to do, that may be taking time away from other more important things. The list of things that we do in the PRR, it's not a set of rules or hard and fast things that you have to do. It's a set of things to consider and think about. Sometimes you consider and say, no, that's not applicable to my system.

Software Readiness - The Shared Space

Here's another building safety metaphor. This is a shared space. A shared space is an urban design approach. It minimizes the segregation between modes of road users. You have pedestrians, and you have cyclists, and you have cars, and they have this space that is shared between them that doesn't have a lot of road markings, or signs, or lights, or all these things. The idea here is that you remove all of the external cues that tell a driver what to do. You make that driver feel uncertain. You make them have to negotiate with other road users to use the space. Why is this like software readiness? Software production is complicated. Software systems are non-standard. Resources are constrained. There's always too much to do. There are always environmental changes, and demands on software change. Because of all this, I think that we need to leave those decisions to the software team, the team closest to the system, because they have the most context.

The last thing that we want to do is distract that team, or give that team a sense of false safety, by giving them a set of lines that they are meant to color within. I prefer to make it a set of guidelines, or a set of areas to think about, because it makes that team more like the driver in this photo. It makes them have to negotiate their way through this process, to go slowly, to take their time, to do it right. Shared spaces are a little bit unfriendly to blind or partially sighted people. There are some evolutions of the concept that help them by providing pedestrian-only spaces on the side.

PRR Antipattern: Forgot the Humans

This brings me on to our third PRR antipattern. This is one of the biggest antipatterns: forgetting the humans. Human beings are the things that make your systems resilient. When your system breaks at 3 a.m., it will be human beings who get up, and we figure out what's wrong, and what to do about it. There's no getting away from that. We can make our systems robust, and we can do everything that we possibly can to prepare our systems. Pretty much every system I've ever seen still occasionally needs a little bit of hand holding. It's the human beings that do that, because we're the ones with the ability to jump out of the system. What's really important in a PRR, in many ways, is that it's a time for humans to have a focused engagement with the system to learn about it. To learn about its sharp edges, its failure modes. If you try and rush through it, inevitably, you forget about this really important human aspect of PRRs.

What PRRs Can Achieve

Having said all that, PRRs can achieve a lot. The upside of PRRs is that they make you consider the basics. They make sure that you have monitoring, alerting. They make sure that you have runbooks. They make sure that you have a way to roll the software out. That's really great. A PRR includes knowledge transfer. That's easy to skip over if you're in a rush. It gives you that time and space. It makes you think about anticipated failure modes. PRRs are a structured way to do all these things.

What PRRs Cannot Achieve

PRRs can't compensate for issues with the design. A PRR is typically done quite late in the lifecycle, like close enough to launch. If your basic design has got problems, this is going to be an expensive time to fix them. Just because you're going to do a PRR before launch doesn't mean that you should skip design reviews. PRRs cannot ensure a perfect reliability because software systems are complex and surprises will always happen. PRRs don't solve the problems of scarce resources. They do not magically create extra time to do all the things that you would like to do. You're probably still going to have to prioritize which things are the most important to fix before launch. Some things might go in the backlog. Some things, you might decide not to do. At least you've thought about them.

Ops Mode Antipattern

Moving on to the last section, what to do when your team has ended up in a stage where you're just reactive, when you just firefight all the time. This is an antipattern that we call Ops mode. You've got manual work of some sort, scaling with your systems. You're spending a lot of time on tickets or incidents, alerts, reactive work like that, and you're not making any progress on engineering goals.

Ops Mode 'Smells'

Smells are things to look out for. If your team is in Ops mode, or if you're trying to help a team that's in Ops mode, look for knowledge gaps. A lot of the reactivity can result from people not having a good understanding of their systems, maybe because they didn't do a great PRR. Because people don't have knowledge of the underlying things that are causing some of the problems, they may not feel confident to make longer term fixes or automate things. Knowledge gaps are really important. You can look for specific services that are causing disproportionate work. A lot of teams that are SREs, or other kinds of software operations teams, run multiple services. Sometimes quite small services can cause really disproportionate amounts of work just because maybe they were launched without the care that a fully-fledged user facing service would get. Sometimes you can find internal tools that cause huge amounts of pain. This is where a PRR style process, even if the service is already launched, taking that time to do that engagement and look for the gaps can help there.

Then another smell is a lot of non-actionable alerts noise. Again, this feeds into the knowledge gaps thing. A lot of the time, if not everyone on the team feels confident about their understanding of the system and how it works, you can end up with a superstition about deleting alerts. That means that alerts will tend to proliferate and never get deleted, or thresholds never get tweaked. That's overall bad. Being paged, or dealing with even non-paging type alerts is quite stressful, tiring, and draining, in general. Nobody ever gets this perfect, there are always non-actionable alerts. The important thing is that they shouldn't be overwhelming. It's really important to address those.
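One way to make alert noise visible is to count, per alert, how often a firing actually required someone to act. The sketch below assumes a hypothetical record format (an alert name plus an `actionable` flag filled in by the on-caller) and an arbitrary 50% noise threshold. None of these names or values come from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class AlertEvent(NamedTuple):
    name: str         # alert rule name (hypothetical)
    actionable: bool  # did the on-caller need to do anything?


def noisy_alerts(events: Iterable[AlertEvent], max_noise_ratio: float = 0.5) -> Dict[str, float]:
    """Return alerts whose share of non-actionable firings exceeds max_noise_ratio."""
    fired: Dict[str, int] = defaultdict(int)
    noise: Dict[str, int] = defaultdict(int)
    for event in events:
        fired[event.name] += 1
        if not event.actionable:
            noise[event.name] += 1
    return {
        name: noise[name] / count
        for name, count in fired.items()
        if noise[name] / count > max_noise_ratio
    }


# Example data: one alert that is mostly noise, one that is always actionable.
events = [
    AlertEvent("disk_80_percent", False),
    AlertEvent("disk_80_percent", False),
    AlertEvent("disk_80_percent", True),
    AlertEvent("checkout_errors", True),
]
print(noisy_alerts(events))
```

A report like this is only a prompt for discussion. Whether a flagged alert should be deleted, retuned, or kept is still a decision for the team that understands the system.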

Then there's problems that recur more than once or a couple of times without a plan to address them. In the real world, we all have a lot of things to do. If something happens twice in a year, and it costs a couple of hours of someone's time here and there, that's ok, probably. If something is happening every other week, and soaking up a few hours of the on-caller person's time, and there's no plan to address that in the long term, then you're paying an unacceptable tax on your team's time. That is starting to take away from your ability to do engineering. If things are happening, and costing time, repeatedly, there should be a plan to address that. The same is true of repetitive manual work of any kind. If there is some process that needs to be done to onboard new teams, or something that needs to be done periodically to increase capacity, it is well worth trying to automate that as much as possible.
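The arithmetic behind that judgment is easy to make explicit. The figures below are illustrative assumptions (two hours per occurrence for the rare problem, three hours for the one that happens every other week), chosen only to match the rough shape of the examples above.

```python
def annual_toil_hours(occurrences_per_year: float, hours_per_occurrence: float) -> float:
    """Hours per year spent on one recurring problem or manual task."""
    return occurrences_per_year * hours_per_occurrence


# Assumed figures, for illustration only.
rare = annual_toil_hours(occurrences_per_year=2, hours_per_occurrence=2)
fortnightly = annual_toil_hours(occurrences_per_year=26, hours_per_occurrence=3)

print(f"twice a year:     {rare:.0f} hours/year")
print(f"every other week: {fortnightly:.0f} hours/year")
```

With these assumed numbers, the rare problem costs 4 hours a year, while the fortnightly one costs 78 hours, close to two 40-hour working weeks of on-call time. That is the kind of recurring cost that justifies a plan.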

There's also another issue, which is sometimes you can have services that you get tickets or complaints or messages about, and they don't have SLOs. That can be not a great situation for a team, because what happens here is, you find yourself, if you don't have an SLO, your SLO is de facto perfection, which you want to avoid. It's always worth, if you find that situation, trying to declare an SLO, and that will give you clarity with your users about what they can expect. Ops mode, this is where consultant SREs are really great, because if you can get an experienced outside SRE with fresh eyes, they will be able to see these or smell these smells easier than somebody on the team, because you're embedded in it. It becomes very hard to see this once you've been around it for some time. Outside eyes can be very helpful here.
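On the SLO point above: declaring a target also makes the implied allowance for failure concrete. This is a minimal sketch, assuming an availability-style SLO measured over a 30-day window. The 99.9% target is an example value, not one from the talk.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO permits over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


print(f"99.9% over 30 days: {allowed_downtime_minutes(0.999):.1f} minutes")
print(f"100%  over 30 days: {allowed_downtime_minutes(1.0):.1f} minutes")
```

A target of 100% leaves no allowance at all, which is what a service without a declared SLO is implicitly being held to.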

Managing a Short-term Crisis: Code Yellow

Another team on fire phenomenon is a short term crisis. This is a team that's normally doing ok, head above water, and then something specific, serious, urgent happens. Something that comes up and something that surprises you, typically. Something that doesn't have a quick or obvious fix. One way to deal with this is a thing called a code yellow. You have to get your management support, you declare a code yellow. Make a plan to address the issue. Typically, in many cases, this is a multi-team problem, so you can pull in people from other teams as necessary. People who are working a code yellow, then you typically drop all of your routine work and you work just on the code yellow, you delegate or defer. Then the code yellow team should communicate regularly and track execution. This shouldn't go on for more than a few weeks. This should be a short term thing, a response to a surprise, typically. Not something that's planned for, and not something that's regular.

Ownership Can Be Unsustainable

Then there's our last situation, which is a team with an unsustainable load. You've looked at it, and you've had an expert come in to see if there's a way that you can reduce your operations load. You have problems that aren't code yellow type problems. They're not a short term reliability issue with a relatively clearly defined solution. Sometimes you can just have a team that is spread between too many services, and there's too much complexity. You can't automate that away. You can't make things uniform. This is a common pattern that you see when an organization was fairly small, and then got big. You might have had one operations team that was centralized. Then, over time, just more things get piled on until it's unsustainable. The ways out of this require management engagement. You may need to give away some responsibilities to developer teams. Maybe they have to manage some types of tickets or some alerts. You might need to get headcount and split your team so that each new team looks after a subset of the services. You'll need management support to figure out how to proceed.

You do need to act. The most on fire teams that I have ever seen in my career are the ones where a team got overloaded. Then they just really hung on. People started to burn out and leave. Things just got worse. Now you have services that are on fire with a huge ops load, and you're trying to staff your team with new people without a lot of context, without a lot of history with these services. This is not a good place to be. You can pull out of it, but it's very hard. What you typically need to do is get a couple of senior, knowledgeable people who know these services to do a lot of mentoring. It's not something that can be done overnight, it takes months to years.

TL;DR

PRRs are a way to do your due diligence when you're onboarding a new service. They can catch a lot of the basic things that will otherwise get you. Their great strength is that they are protected time and space to engage with the specifics of the service. Depth is more important than speed. Don't rush your PRRs. Definitely don't think of a PRR as a hard and fast set of rules or a set of boxes to be ticked. Think about it in the context of your system, what makes sense. Teams can still get overloaded even with PRRs, with Ops work, with a short-term reliability crisis, or just with the sheer complexity of the growth of their services. It's really important to address that before a team begins to burn out, because that is the hardest situation to get out of.

Questions and Answers

Sombra: Is there any point that you want to make for the talk and the video, any reiteration?

Nolan: There's a couple of themes in this talk. The first is this idea of, should we think about reliability with a by-the-numbers approach? Can we make a checklist that covers everything that you need to do? My extended argument has been that you can't do that in a systematic way that works for all services. There was a very interesting discussion here about the difference between this and security. I do very much agree with the person who brought that up. I think there's two things here, when it comes to coming up with checklists and prompts like this. We can say things like, for our data we probably should have a way of restoring data in the case of data loss. It probably doesn't make sense to say it has to be backups, and the backups have to be tested, because it could be a restore process that is based on regenerating some of the data. Whereas for data protection, in many cases, we can be more prescriptive about what needs to be done. Like I say, as well, in the case of the GDPR, there's a lot of generalities, if you actually read the regulations. Only the other week, it turned out that people have gone to court against one of the biggest companies that provides the cookie click-through buttons, saying that it doesn't comply with GDPR.

As a conference organizer, we've had multiple companies that have proposed talks about how they do their GDPR compliance. Then those talks have been pulled at the last minute, because those companies have then decided that they don't want to talk about that. GDPR is a really interesting case where you have these things that seem like a set of regulations that people could describe how to comply with.

Sombra: How you apply the control.

Nolan: Very few companies are confident enough to actually talk publicly about how they do that, and to stand behind that. Writing a set of regulations that's both specific enough to be useful, while also being broadly applicable is very difficult. GDPR, in that sense, is not really a success.

Sombra: In your talk, you talk about the benefits and the challenges of PRRs. Through my life and my many years, I have seen many incantations of them, all of them failed in their own magical way. This question is really pedestrian. Do you have any tooling that you like?

Nolan: For Production Readiness Reviews, no. The best tooling I've ever seen has just been a long Google Doc, in all honesty.

Sombra: That has been challenging, because the longer the Google Doc gets, the more heuristics and the more mistakes you find along the way. We're experimenting with tooling that just does scorecards on systems, so we can bake the production readiness criteria into something that will then give you a grade. I'll tell you, at some point, how it goes, or in which way it magically fails. At this point, how well the Google Doc works seems to depend on the size of the company.

Nolan: This is the thing: you need to think about the Google Doc not as a giant template that everyone in the company needs to follow. It needs to be specific to each service. Each service gets its own Google Doc for its PRR. The specific details and the specific context of that service should be addressed there.

Sombra: You mentioned that the team itself is the one that discusses the PRR. Do you have somebody else, like elders in the company, defining the document that a team would have to subscribe to?

Nolan: This is where I've very strongly argued that the team themselves should be the ones who define what their criteria are for production readiness, because they know that software best. Nobody knows the software better than the team who have built it and the team that are running it, assuming that those are different teams. Those are the teams that know where the bodies are buried.

Those are the teams that are going to be getting paged when that service goes down. It should be down to that team to prioritize and to figure out where the weak spots are.

If you have a case where a team isn't strong on production knowledge, now you're in an interesting situation. This is where the consultant model could be very useful. In that situation, the best thing to do is to take one or two engineers who are strong on reliability and that kind of engineering work, and embed them with that team. If that's a really critical service that cannot fail, that cannot lose data, I think the best thing is to embed engineers and have them work with that team for a quarter or maybe two quarters, or maybe longer, depending on the size, the complexity, and the criticality of the service. You can't do a shallow engagement where somebody just runs through and says, how is your monitoring? I've spent a lot of time reviewing systems and monitoring, and I've done some fairly detailed stuff, but I can't sit down with an arbitrary team and figure out all of the weird wrinkles in their system in an hour, or even a day, or even a week. You have to spend some time with it. You have to care about the context of that specific system.

Sombra: It serves as an educational aspect to the team too, so the team itself learns how to get stronger.

 

Recorded at:

Apr 14, 2022
