Architecting for Resilience Panel

Summary

Nora Jones, Dan Lorenc, and Varun Talwar discuss what architecting for resiliency means, sharing ready-to-use examples, and ideas that can be employed in other contexts.

Bio

Nora Jones is the founder and CEO of Jeli. Dan Lorenc is a staff software engineer and the lead for Google’s Open Source Security Team (GOSST). Varun Talwar is the co-founder and CEO of Tetrate.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Rettori: I'll ask each of you to do a quick introduction and say why you are passionate about the subject.

Lorenc: My name is Dan Lorenc. I'm a software engineer at Google on our open source security team. I've been doing open source, particularly in the CI/CD area for a while now. I've been frustrated seeing this whole subject treated way less seriously than application runtimes. Finally, supply chain attacks have been happening, they're all over the news lately, and everybody's starting to take it seriously. Security in general isn't something you can just assume you're going to do a perfect job at. Resiliency is important from the start. Stuff's going to happen. Bad things are going to happen, and if you don't plan for it, it's going to make your life even worse when it does. That's why I love this topic.

Talwar: This has been going on for me since the gRPC days of 2014. I think a lot of the stuff that needs to be written in applications is just getting more difficult for devs to maintain. There's so much cross-cutting logic that each one has to develop. Language frameworks are one way to do it. Proxies are another way to do it. Somebody has to make sure that those things are getting built in a way that makes my network and runtime more resilient. I actually believe in this term, the smart application-aware network, as the thing that should happen in the long term. I don't think we are there. We are far from it. All this stuff is building up towards that. I'm even more convinced, after having left Google and starting the company, that people are really struggling with it. It's really hard for them to architect and build resiliency in a way that cuts across all applications. I'm super kicked about building stuff in the open. Ever since getting exposed to open source in 2014, I haven't gone back. I think there's no other way to build. I just want to continue doing that.

Jones: I'm Nora. I'm the CEO and founder of Jeli, which is an incident analysis platform. I've spent most of my career as a software engineer, building internal tooling, building chaos engineering tooling, building tooling for SREs. I've focused so much of that on reliability and resiliency, honestly, but it's frustrated me. Dan, I love hearing about your frustrations. I've also had frustrations in my space, like just thinking that tooling is going to solve all of our problems, and focusing only on the technical is going to solve all of our problems. When, really, if you rope people into it, and the human systems and the collaboration aspects, you're going to get a lot more ROI out of things. That's why I'm passionate about it. I'm honestly passionate about understanding failure, and helping organizations grow and improve.

Relation between People and Tech, in Incident Management

Rettori: I'm qualifying resiliency from various aspects. Resiliency before the software is shipped, that's where Dan is. Runtime resiliency, that's where Varun is. Post-runtime resiliency, and that's where Nora is. That's I think the view that I wanted to give, that while Varun is important, Dan and Nora are also important. What happens before you ship, or while you're shipping, and what happens after an incident, and how that informs all the decisions? That is key.

We often look at the resiliency of the systems and not of the people involved in those systems. Talk a little bit about your experience in dealing with and looking at the people process, rather than only the technology process, in incident management.

Jones: I thought you brought up a really good point. I think a lot of the time, the software industry forgets that when we say system, people are part of the system too. The people are the ones enabling the technology to work. The people are the ones creating the technology. Yet, we don't really get taught a lot about this stuff as we're getting trained. As software engineers, we're not taught about organizational psychology or dynamics, we're taught about code. It makes sense that our minds go, how did this mess up in the code, versus, how did we mess up from a collaboration perspective? How did we mess up from a team dynamic perspective? I think some of the secret sauce here is that, when you understand those collaboration aspects, and why it made sense for someone to do what they did in the code, you enable the technology to actually improve better. You come up with better action items for your systems. You understand the technical specifics a little bit better.

Gaining Confidence in Supply Chain Security

Rettori: I'm still a bit scared, coming out of the talk, that I don't think there will ever be a solution to supply chain security or resiliency, like really concerned. Is there a way to actually have some confidence in this, apart from just testing, or there's unfortunately no magic bullet and let's go find some other jobs?

Lorenc: I can't tell you how confident you should be. I'm scared. I can tell you about that. I think that's where resiliency really comes into play here. It's too late to fix this. The supply chain is already leaking. We've got to figure out how to recover from it, to get back to something we're a little bit more confident in. There's never a magic bullet.

The Influence between Incident Management and Supply Chain Resiliency

Rettori: What are you seeing? You're the runtime person, but I know you've been involved across the board. What influence do you see between incident management and supply chain resiliency? Can the application itself be a little bit better, so that it doesn't have to worry about, or be so heavily impacted by, a potential supply chain attack, or something like that?

Talwar: When you were setting this panel up as like, pre, during, and post, that was going to be my comment, which is, they should all talk to each other as well.

Jones: Exactly. That's part of the problem. It's like they end up being such separate things, rather than being like a continuous feedback loop.

Talwar: I'm definitely seeing the pull on both sides, actually. When an organization is starting out new to the technology, there's obviously a strong pull towards the pre: how do we connect with our supply chain and CI/CD, and what gets deployed there? Where is the source of truth for resiliency configuration? Is it in my Git? Is it in a separate system? Where should I change what? How should it change? A lot of the challenges are around setting up those organizational processes: who changes what, where, and how does that get approved? Ultimately, then it gets to Nora's world, which is, if things do go wrong, who's accountable? How do I recover? Who's alerted? How quickly? There are simple things to nail down, such as: every service should have a team that owns it. It should have an owner. It's a very simple concept. You will be surprised how rarely it is implemented. I've heard stories of, this went down, and we went tracking it down, and we ended up at a service asking, who wrote this? This dude left three years ago.

Jones: That happens all the time. I've seen it at every organization I've been at, this giant service catalog that lists all the teams or all the people that are supposed to own each service, and then it's someone's job to maintain that catalog. What's really interesting to me is what we think that catalog is in theory versus what it is in practice. Who is actually responding and fixing these things? How does that map to our mental model? I don't think we expose that delta enough: ideally, you're supposed to be owning this, yet Varun keeps getting pulled into this because he was on this team five years ago, and he is still the only one that knows everything here.

Lorenc: The shadow org chart you're describing. The actual org structure versus who actually does things. This plays into Conway's Law a little bit, like don't ship your org chart. I think it actually gets a bad rap. I think a lot of times people ship something and then make their org chart look like what they shipped instead of the other way. It's hard to tell after the fact. Either way, it's about just getting teams that are actually responsible and have ownership and autonomy to make the changes they need.
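One way to expose the delta between the catalog and the "shadow org chart" is to compare the listed owner of each service with the teams that actually responded to its incidents. The following is a minimal sketch, not something described by the panel: the data shapes (`catalog` as a service-to-team mapping, `incidents` as service and responder-team pairs) and the function name `ownership_delta` are assumptions made for illustration.

```python
from collections import Counter


def ownership_delta(catalog, incidents):
    """Return, per service, how often teams other than the listed owner responded.

    catalog:   dict mapping service name -> owning team (hypothetical shape)
    incidents: iterable of (service, responder_team) pairs (hypothetical shape)
    """
    responders_by_service = {}
    for service, team in incidents:
        responders_by_service.setdefault(service, Counter())[team] += 1

    delta = {}
    for service, responders in responders_by_service.items():
        owner = catalog.get(service)  # None if the service is not in the catalog
        outsiders = {team: n for team, n in responders.items() if team != owner}
        if outsiders:
            delta[service] = outsiders
    return delta


if __name__ == "__main__":
    catalog = {"checkout": "payments", "search": "discovery"}
    incidents = [
        ("checkout", "payments"),
        ("checkout", "platform"),
        ("checkout", "platform"),
        ("search", "discovery"),
    ]
    print(ownership_delta(catalog, incidents))
```

A service that shows up here with more outside responders than owner responses is a candidate for the kind of conversation about real ownership that the panel describes.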

Encouraging Active Ownership

Talwar: What can you do to encourage "Active ownership?"

One way that works is shame. In one place, we actually had what we called a shame board, as a way to basically encourage people to stake ownership. A shame board works, but if you want to be a little bit more healthy about it, and you don't want to go that far, then it can be more like a game board, or you can build some dynamic into it. I was talking to one of the companies that does cost management, and they basically made claiming ownership part of the process of onboarding any given service to that cost management system. Now you've built in an incentive structure: I don't want to be the one caught burning money, and I want to make sure that my cost is managed and I keep getting resources, so now I'm going to do it. Building the incentive structure in such a way is another way.

Software Portfolio Management

Rettori: I think that touches on portfolio management. I know software portfolio management is an old industry term, but apparently we forgot or stopped doing it for a good 10, 15 years. It was a thing 15 years ago, and then we just ignored it. Where are the people involved? Where are the key contacts for all things? Then, when we need them, it's the management, or we have no idea who they are. Do you feel like not having tight control of the catalog is the blocker to success in this, or what are your thoughts here?

Jones: I've actually seen the opposite effect from what Varun describes when it's a shame mechanism. I've seen more success from actively highlighting who is taking ownership of some of these things, too. In one organization, we were having a lot of issues with Consul, so many incidents with Consul. This organization grew so quickly, and Consul was originally implemented when there were five engineers there. Then, all of a sudden, there were like 1000, and it never got ownership, even when it got to 1000. It became wildly mixed use, used in totally unintended ways. It didn't have any ownership. What happened was, out of the five engineers that started at the company, there was only one that was still there, and he just put himself on call for everything. He had notifications on his Slack messages if Consul was mentioned in an incident channel, because he was the only one that knew how it worked. By digging into this and discovering that, we were able to make a case for headcount for an actual Consul team.

I think leadership has to get involved in this too and understand where some of their bottlenecks are. I'm sure so many people have someone in your organization that feels the need to be on call for everything, and that creates an interesting dynamic too. I would actually sometimes actively discourage someone from being on call for everything and recognize when you're pulling in people to the incident that weren't on call. Because that indicates that they are always the hero that we rely upon here, and you're actually limiting this active ownership model, and unintentionally training people that that's ok.

Lorenc: Some of the chaos engineering stuff that you've done so much work on, helps you. It helps you find these single points of failure. If you have a system that's been running for years with nobody managing it, that's cool, until it does fail. By simulating the failures earlier at a controlled time, you can find, nobody was there to bring this thing back up. That's a problem.

Jones: When I was at Netflix, we had someone on call for this very old streaming system. While we were still supporting it, they were the only person in the entire company that understood how this worked. I was like, what happens when he goes on vacation? How are we going to do at this particular point? Those are good chaos experiments to run too. It's like, how do we do without being able to ask this person questions?

Maturity of Self-Healing Systems

Rettori: How much trust can we put on self-services and systems versus human involvement in this? Self-healing will give benefits, but how mature are we for self-healing systems today? How mature do we believe self-healing systems are today?

Talwar: I don't think we are very mature. I wish we were. We don't have enough of a focus on that, first of all on testing all the failure scenarios. It's all about shipping. Ultimately, everything boils down to incentive structure, even in the previous question. If you reward shipping, you will get shipping. Imagine you had a dashboard showing that this application, this team, has handled all of these failure scenarios, tested X, and has these healing capabilities, and you were to rank, order, and reward on that. You would see a change in behavior. That is the people aspect and the reality aspect of it.

Changing gears to the technology aspect, from my own perspective of runtime, there's a lot of power in these newer proxies and so on. People still don't really understand simple things like, how do I set up my circuit breakers? It sounds simple, but it's actually not that simple to set that up and operate it, and have that knowledge be well known. I think there's a long way to go.
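To make the circuit breaker idea concrete, here is a minimal in-process sketch in Python. It is not how a proxy is configured and is not taken from the talk: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and their default values are all assumptions chosen for illustration. The breaker opens after a number of consecutive failures, rejects calls while open, and lets one trial call through once the timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Even in this small form, the operational questions Talwar raises are visible: someone has to choose the threshold and the timeout, and someone has to know what callers should do when a call is rejected.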

Jones: The question here is, how much trust can we put in self-healing systems rather than human involvement? The systems need human involvement in order to be self-healing. I'm wondering, the amount of trust we can put in them is the amount of trust we're putting in the humans that are implementing the self-healing mechanism. How do they know how to make this self-healing? What data are they going off of here? It depends on how much time and space you're giving people to understand what it needs, in order to be self-healing, or if they're just literally adding in self-healing mechanisms, like after every bad thing that happens. I was at an organization a long time ago that after literally every single incident, there would be a new QA test added, and we couldn't even open the Excel spreadsheet after a while. It was such a shallow way of looking at things. I worry about that sometimes with the self-healing stuff, too.

Lorenc: I think we should also give ourselves some credit here. Self-healing has come a long way. We've just started building more complex things on top of it so we don't always notice. When is the last time you had to SSH into a server and restart Apache because it crashed? We've got these for loops to restart things. We can restart VMs automatically when things go bad. We've come a long way. I think we don't give ourselves enough credit for it in a lot of cases. There's still a long way to go.

Striking a Healthy Balance with the Reward Structure

Rettori: If we touch a little bit more on the point that you said in the reward structure. I want to expand that to the reward structure in resiliency. If we're talking about purely like a reward structure for resiliency, then we will not change anything ever. If resiliency was the sole measure, like don't touch it. That can't be the sole measure. Maybe some thoughts from you on that healthy balance versus rewarding new capabilities, rewarding resiliency as well. You could say that they somewhat go against each other. Have we gone too far in break early, break often?

Jones: It's debt that you're trading off. I heard someone make an analogy once: some debt is good debt. If I take out a mortgage on my house, that's good debt. Bad debt would be, I didn't spend time building the floor of this house before I started living in it, or I didn't invest in the plumbing before I started living in it. I think the problem a lot of the time is that we don't know some of the tradeoffs that we're making. If we know the tradeoffs that we're making, and we're actively communicating them between engineering and the business, that's when it's ok to make them. The trouble is when we haven't invested time, or the business hasn't let us invest time, to understand some of those tradeoffs, and we are not actually making a stable floor, and someone's going to fall through it.

Lorenc: I like the way you framed that. The concept of error budgets is a good way of actually quantifying that and getting people on the same page. As a business, you decide how much downtime you're allowed. Nobody's going to have nine nines. That's just impossible. Maybe two nines is too few. Pick the right number. Then if you're under your error budget, you can be riskier, being conscious about it. It's like the difference between careful leverage in investments and racking up credit card bills without paying attention.

Jones: Exactly. It's the paying attention. It's ok to take debt if you're paying attention to the debt that you're taking.
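The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability target measured over a 30-day window; the function names, the window length, and the sample numbers are illustrative assumptions, not figures from the panel.

```python
def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Downtime permitted by an availability target over the period, in minutes."""
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = allowed_downtime_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    for label, slo in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
        print(f"{label}: {allowed_downtime_minutes(slo):.1f} minutes per 30 days")
    # A team at three nines that has already had 10 minutes of downtime this window:
    print(f"budget remaining: {budget_remaining(0.999, 10.0):.0%}")
```

Two nines allows roughly 432 minutes of downtime in 30 days, while three nines allows about 43. While the remaining fraction is positive, the team can consciously take on riskier changes; once it goes negative, the budget is spent.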

Talwar: One thing I've seen in practice is building it into sprint allocations: having allocations for debt removal, officially, not swept under the rug and not left uncounted. Otherwise tech debt always goes there, and you never get to it. Being really formal about it: we are allocating this much, this is why, and this is what would happen if we don't. I think that awareness helps.

How to Deal with a System that's Functioning Well

Rettori: Don't touch something that is not broken. It's very interesting. It creates a big issue when it does break. It tells us that we should always be able to ship whatever we have running at any time, because we don't know what's going to break, especially if there's a supply chain attack on it. Actually touching things that are not broken is probably a good practice there. What is broken? What does broken mean? Is an old library that has lost support broken? Maybe it is. What are your thoughts?

Lorenc: It's the whole concept of chaos engineering just in general, taking something down to make sure somebody puts it back. It could be an automated system, or somebody might get paged, or something like that. That applies to supply chain too. A simple concept people use here is what we call a build horizon. There's just some policy that stuff can't be running for more than six months continuously, depending on the system. Somebody has to come along and rebuild it, and redeploy it within that time. If nobody's done it, and nobody's stepped up, the thing gets taken down and keeps escalating until somebody eventually notices. That gives you a bunch of benefits. It shows you can rebuild everything. It gives you a fixed timeline for how long you think something could possibly be lasting. The thing is not broken. It's still there. But your organization has decided that it is broken, as an organization, to let things run forever with no one noticing, because that's debt you're accruing without even knowing about it.
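A build horizon policy like the one Lorenc describes can be sketched as a simple age check over running deployments. Everything specific here is an assumption for illustration: the `Deployment` record, the 180-day horizon standing in for "six months", the 30-day warning window, and the three status labels. A real system would read build timestamps from its own deployment metadata and wire the result into paging or takedown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    name: str
    built_at: datetime  # when the running artifact was last built


def classify(deployment, now, horizon=timedelta(days=180), warn_before=timedelta(days=30)):
    """Return 'expired', 'warn', or 'ok' depending on the age of the build."""
    age = now - deployment.built_at
    if age > horizon:
        return "expired"  # past the horizon: escalate or take down
    if age > horizon - warn_before:
        return "warn"     # approaching the horizon: ask the owner to rebuild
    return "ok"


def audit(deployments, now):
    """Map each deployment name to its build horizon status."""
    return {d.name: classify(d, now) for d in deployments}


if __name__ == "__main__":
    now = datetime(2022, 2, 4)
    running = [
        Deployment("fresh", datetime(2022, 1, 1)),
        Deployment("aging", datetime(2021, 9, 1)),
        Deployment("stale", datetime(2021, 6, 1)),
    ]
    print(audit(running, now))
```

The warning stage is one possible way to implement the escalation he mentions; the important property is that nothing stays in the "ok" state indefinitely without someone rebuilding it.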

Jones: I'd also like to add on to that, the idea of don't touch something that's not broken. You have to agree to that from the product perspective, too. If you decide the product is done, we're not adding any new features onto it, we're not doing anything else with it, then yes, maybe you don't need to touch it. Maybe it can be self-sustaining for the most part. If we're adding new stuff on top of it, or if we're adding new stuff every six months, it is eventually going to break. You're atrophying that muscle of working with that system too, to the point where when you do eventually have to touch it, is the last person that touched the system even at the organization anymore? Does anyone know how this works? Chaos engineering can help us practice that muscle and build that muscle. I have seen way too many incidents right after a code freeze, because people are not used to deploying anymore. They're coming back from vacation. They haven't practiced this thing in a while that used to be so second nature for them. While we did stop the incidents during the code freeze, we had an uptick of them the week after.

Talwar: I like Dan's thing of basically having some policy of, you have to come and touch it. Usually, products are not stopping. Usually, there's a bigger backlog than what you can actually deliver. You're always on it. It's just like in security. The most secure system is one where you don't give access to anybody. Just lock it down. Put a lock on it. That's the most secure thing. You can't really have that. Even in security, I have often said, security through agility, as opposed to saying, it's done. New bugs, new CVEs, and so on are always coming, and the sooner you are actually on them, the more secure you are, rather than the other way round. I think the problem is just not having enough focus, incentive structures, and reward structures for reliability. That is something that, as an industry, we need to work on.

Migrating to Microservices from a Monolith with no Ownership

Rettori: We have a question on how to lead the migration phase to microservices when you're migrating from a monolith that has no owner or has tons of them. I think that's something you might touch on an almost everyday basis, Varun, given where you are. What are your thoughts on this?

Talwar: There are two things. First, someone needs to have the intent to migrate. Typically, that person, if not the owner, is somewhat associated with the system and has some motivation to do it. Typically, that person will lead you to who should do it. Once you have that, then it's just about the approach. Once you have the person who is the most motivated to do it, that is usually the answer. The rest is all a technology approach. It's a very common problem, actually. There is so much brownfield and so many monoliths sitting around with no ownership. The funny thing is, the majority of the business is actually running through them. Most of the core stuff, which is bread-winning and money-generating, is actually going through that stuff, not through the new thing written yesterday. It is somewhat scary, but that's the reality.

Best Practice of Delivering New Features into a Stable Service

Rettori: A question that touches a little bit on the CI or CD part here, or even supply chain, or the build systems. Best practice to deliver a new feature into a stable service, is canary strategy an option? Is there such a thing as a best strategy?

Lorenc: Yes, best practice for delivering a feature into a stable service? There are a bunch of best practices. It depends on the service. Canarying works great if you're actually looking at the logs and metrics. A lot of people jump to something like canarying because they heard it's great and it's what all the big companies do. Then they don't bother actually looking at the performance metrics, or they might not even have enough sample size to get a statistically significant result in a canary. If you do that math, you need quite a few requests to really know if latency has changed, or if the error rate has changed, especially when you take into account warm-up times, since your app might not be performing as well right away anyway as caches get filled. If you've done that math, and you've actually instrumented and have observability everywhere, then canarying can be great, it gives you more confidence. A lot of times when you do that math, you're going to have to wait a week at 3% traffic to actually get enough data to know if your canary worked. At that point, just push it and see what happens. It really depends on your scale.

Jones: It also depends on exactly what you're looking for with delivery of that new feature. Are you worried that it's really going to break, and that's why you're rolling it out to just that many people? Or, are you just trying to see how people react to it? It's a product management question as well as an engineering one.

Talwar: Canary is super powerful, but most people don't do it right, primarily because they are not clear on why. Why am I releasing it in a given way? First, know that deployment is different from release. Second, know why you want to release that feature in that given way. Third, have all the observability signals, and be able to actually compare. Once you have that, canary is a great way.
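The "math" Lorenc refers to for canaries can be approximated with a standard two-proportion sample size calculation. The sketch below is one common textbook formulation, not necessarily the one any panelist uses. The significance level, power, error rates, and traffic numbers are all assumptions for illustration, and the calculation ignores warm-up effects.

```python
import math
from statistics import NormalDist


def requests_needed(base_rate, canary_rate, alpha=0.05, power=0.8):
    """Approximate requests per group needed to detect base_rate vs canary_rate.

    Uses the normal-approximation sample size formula for comparing two
    proportions with a two-sided test at significance level alpha.
    """
    if base_rate == canary_rate:
        raise ValueError("rates must differ to compute a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (base_rate + canary_rate) / 2
    pooled = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled = z_beta * math.sqrt(
        base_rate * (1 - base_rate) + canary_rate * (1 - canary_rate)
    )
    return math.ceil((pooled + unpooled) ** 2 / (base_rate - canary_rate) ** 2)


def days_to_collect(requests, total_rps, canary_fraction):
    """Days the canary must run to see `requests` requests at the given traffic share."""
    canary_requests_per_day = total_rps * canary_fraction * 86400
    return requests / canary_requests_per_day


if __name__ == "__main__":
    # Hypothetical service: 1% baseline error rate, and we want to catch a rise to 1.5%.
    n = requests_needed(0.01, 0.015)
    print(f"requests needed in the canary: {n}")
    # Hypothetical low-traffic service: 0.5 requests/second overall, 3% sent to the canary.
    print(f"days at 3% traffic: {days_to_collect(n, 0.5, 0.03):.1f}")
```

Smaller differences need many more requests, which is why a low-traffic service at a small canary percentage can take days to produce a meaningful signal.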

Getting Buy-In

Rettori: First good step that is easy to get buy-in. I like the perspective of getting buy-in, or to start from bad incident reports towards something that is actually valuable.

Jones: I would start with a non-emotional incident, like something that was not that big of a deal. I honestly flew under the radar doing it for a little while, without getting buy-in, just so I had a couple examples first. I would pick like a pretty easy case, that's not emotional, that didn't hit the news, that didn't lose a lot of revenue, and start there. Even just little internal marketing things, like naming the calendar invite something different. If you use the word postmortem internally, I wouldn't use that. I would just say, let's do a quick learning review, or let's have a quick chat about this, and ease people into this new mechanism or new way of doing things. I think after you have a few examples and show the ROI, like look how much collaboration we got. Look how many people were learning things. Look at the feedback we got. Look how it affected future design reviews. Showing some of those things will help actually make that momentum swell.

Value Add to Jira, and Incident Reporting

Rettori: We have a team of people who hate logging even more Jiras, how do I turn incident reports from my direct messaging to something more fleshed out? I think expanding, what do we do with incident reports, because more Jiras, please, no? Something like that.

Jones: It's sad. You log into Jira, and you're like, I have to fill out this ticket again. You're just honestly checking a box, because you're in charge of doing this now. You got assigned the postmortem, and you're like, here's what I know I have to do, I have to fill out all these fields in this Jira. You're just trying to be done. You're not actually taking the time to learn anything. You're just trying to be done so that it gets filed and you've done your job. To flip the script, this is where you really do need to get management involved: let's maybe not do a Jira for this particular one. Let's try something completely different here. Starting again with a low-risk incident can help you gain momentum there, and change the script from ticket filing to, here's what we learned, here's how this happened. I can't look at a Jira ticket and understand exactly how the incident unfolded, so you're missing so much of the value. I would just maybe attach reports to them. Focus on different metrics. Maybe change the fields of your Jira inputs too.

Lorenc: You can wait, too. You don't need to file a ticket for every single problem and every single incident. It sounds like the problem isn't the filing itself, it's that the tickets are not getting fixed and they're getting ignored. You can always wait till you've got ten incident reports, find some themes, and then open some tickets. Like, this one appeared in seven of our incidents, look at how much more of an impact fixing this one would have compared to these nine others. You can get more evidence and use that to help make the tickets feel more meaningful.
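Finding themes across a batch of incident reports, as Lorenc suggests, can start as a simple count of how many reports mention each contributing factor. This is only a sketch under assumptions: the report identifiers, the theme labels, and the idea that each report has already been tagged by a person are all hypothetical.

```python
from collections import Counter


def recurring_themes(reports, min_count=2):
    """Return (theme, number_of_reports) pairs for themes seen in at least min_count reports.

    reports: dict mapping a report id -> list of theme labels (hypothetical shape).
    Each theme is counted at most once per report.
    """
    counts = Counter(theme for themes in reports.values() for theme in set(themes))
    return [(theme, n) for theme, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    reports = {
        "INC-1": ["missing alert", "unowned service"],
        "INC-2": ["unowned service"],
        "INC-3": ["unowned service", "missing alert", "stale runbook"],
    }
    for theme, n in recurring_themes(reports):
        print(f"{theme}: appeared in {n} of {len(reports)} reports")
```

A ticket opened from this kind of tally carries its own evidence, for example "appeared in 3 of 3 reports", which is the argument he makes for waiting before filing.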

Talwar: A few things that Google got right about retros: it's more a culture thing than anything else, and it's about being very public about them. The retros were very widely distributed. The learnings from all of that retro feedback were super widely distributed, and really talked about. There's a bunch of debate around how you make a blameless retro versus a blame-full thing. There's a bunch of cultural aspects there. I know people skew one way or the other. Google believed in the blameless one, but I have mixed feelings about that. I think people having accountability is good. This goes back to culture: you do retros, have them well written, and distribute them very widely. You build that into culture and good things happen.

Jones: I think what you said touches on a point that is widely misunderstood throughout the industry. Blameless doesn't mean no accountability, it means it's ok to name names. I see people take the word blameless and think, this means I can't name anybody's name, and I have to be really nice to everyone. It's like, no, that's not what that means. It means that it's psychologically safe to say, "Varun, could you tell us a little bit about what was going through your head when you did this?" We can put in the report, Varun deployed a change without Varun being afraid that he's going to get fired. I think Jira tickets actually really miss the mark, too, because you can't see who's reading them. With Google Docs, you can see who's viewing them. You can comment on them. You can interact with them. Jira tickets don't allow for that collaboration, and so you're missing a lot of those key pieces of the benefits from incident review. They're written to be filed, rather than written to be read. It's a cultural thing. It needs to get disseminated. It needs to get read. It needs to get tracked that it's being read.

Talwar: G Docs, we used to have a lot of comments back and forth on those incidents.

Jones: Yes, and that's good.

Talwar: That's good. Because, frankly, that commenting thread that goes on, on the side of that incident report on the Google Doc is actually way more valuable than whatever the technical fix is.

Jones: Exactly. Those things don't get captured and celebrated. You keep harping on culture, like, is it part of your culture to actually capture and celebrate and measure those things? Like, look how many people spoke or participated during this incident review?

Lorenc: Sharing widely I think is a big part of that. If you just write the incident report with whoever was involved, file 50 Jira tickets, and call it a day, you miss some of the biggest benefits, which come from the product, the report itself that you produced. If you send that out to the whole company, it's scary, because you have the whole blameless aspect, and you're telling the whole company, I made this mistake. But if 100 people read that, the lessons they take away are more valuable to the overall company than just the Jira tickets themselves.

Jones: Even if it just gets written and then filed, and no one actually changes anything, if they all read it, you get much more of a benefit.

Recorded at:

Feb 04, 2022
