Managing Tech Debt with Glenn Engstrand

Jun 13, 2022

Technical debt, the “interest” you have to pay when working with software, can become overwhelming if not regularly dealt with. In this episode, Glenn Engstrand discusses a structured approach to managing tech debt in a microservices architecture. By taking a proactive, long-term approach, all stakeholders are able to talk about, plan for, and safely reduce technical debt.

Key Takeaways

Not all tech debt is bad, and if you don’t have some, then you probably aren’t a leader. But, taking on too much can become overwhelming.
There are two big categories of tech debt: First, technology, or a specific version of a technology, that becomes unsupported or insecure. Second, the big ball of mud, where the cruft of a system makes small changes very difficult, and usually requires some refactoring or rearchitecting.
Getting all stakeholders, including product managers up to CEOs, to agree to invest in managing tech debt takes an ability to sell people on the idea. Usually the more quantifiable items, such as deprecated versions, are easy to agree on, which can build trust, and make people more open to discussing the more subjective challenges like rearchitecting.
Communities of practice can provide guidance for specific categories of tech debt, such as when to upgrade a framework or underlying technology, and the risks associated with not upgrading. This also makes team members more engaged because they are part of the solution.
Technical debt needs to be continually reevaluated. As time progresses, some risks get higher, as does the effort required to mitigate the risks.

Transcript

Introductions [00:21]

Thomas Betts: Hello, and welcome to another episode of the InfoQ Podcast. I'm Thomas Betts, co-host of the podcast, lead editor for architecture and design at InfoQ, and an application architect at Blackbaud. Today I'm joined by Glenn Engstrand. Glenn is a software architect at Optum Digital, formerly Rally Health. His focus is working with engineers in order to deliver scalable, cloud native, server-side, 12-factor compliant application architectures. Glenn has spoken at many conferences, including QCon. He specializes in breaking monolithic applications into microservices and in deep integration with realtime communications infrastructure. Glenn, welcome to the InfoQ Podcast.

Glenn Engstrand: Thanks Thomas. I'm happy to be here. Thanks for inviting me.

Thomas Betts: So you spoke at QCon Plus recently about how Optum Digital proactively is managing their technical debt. I wanted to talk to you today about that program, why it was created, the benefits you've seen and how our listeners can think about something similar at their companies. Sound good?

Glenn Engstrand: Absolutely. I think it's a very important topic and I think that this is the first place I've worked at, where they have a really sound systemic approach to it.

Thomas Betts: Well, let's start with that. So I wanted to get into talking about tech debt, but let's set the context. So what is Optum Digital? Give us a quick overview of the company. How big is it? How many engineers do you have? What do they do? What types of apps and services do you provide? That kind of stuff.

Glenn Engstrand: Optum Digital is a part of United Health Group, which is mostly about health insurance. They actually are a conglomeration of various companies that United Health Group has acquired over the years, including Rally Health. Rally Health was sort of a health and wellness startup that focused on health activity, coaching, just basically changing lifestyle so that you could avoid costly or invasive medical procedures later. And so Optum Digital, how many employees? Wow, a lot is about all I could say. I think it's in the thousands. Rally Health was around 700 at time of acquisition and we had almost a thousand microservices. Still do.

Thomas Betts: Gotcha. Obviously people are familiar with microservices sprawl becoming a thing. So part of the focus of your talk was managing tech debt for microservices. You have all of these, how do you keep them from getting out of control and how do you keep up with specific problems? Does that kind of describe what you're talking about?

Glenn Engstrand: I think like most companies, they had a monolith and they started splitting it up into microservices, I think back in like 2017, 2016, something like that. They really doubled down on cloud native, which meant the cost of spinning up a new service was like nothing. It was just a little YAML and, "Oh, look, now you have a new microservice." You know, we run in Kubernetes and all that, and so we have a lot of microservices. So the issue with that is, anytime you have say like an old piece of technology or an old version of a piece of technology, you don't have to change it once, you have to change it hundreds of times, possibly more. So that's why they really had to get in front of managing tech debt because of the multiplier, the force multiplier of it.

Thomas Betts: Right. Right. So there's benefits to microservices, but there's all those challenges that come with them, and those are the trade-offs.

Glenn Engstrand: Exactly.

Two flavors of technical debt [03:23]

Thomas Betts: I'm going to step back and do a few definitions just to make sure we are all on the same page and the listeners are all caught up. So technical debt was a metaphor originally coined by Ward Cunningham a few decades ago, if I'm correct. It was to describe the extra "interest" that we have to pay when working with software and like any metaphor, it's helpful, but it's imperfect. For example, I heard recently a clarification, we're not trying to pay down or eliminate all tech debt, but it's helpful to just keep in mind what is the cost of living with that debt. In some cases that overhead becomes too high and we have to look at things to say, "Okay, now we need to make a plan to reduce it." It's almost like refinancing, if you're going to stretch the debt metaphor. You don't refinance your loan every month, but at some point you're like, "Hey, it makes sense to take the time, pay an extra cost and it reduces our ongoing monthly payments." So how does that metaphor fit into what your experience was? Anything you want to add to that? And what was helpful for you thinking about the problem in those terms?

Glenn Engstrand: You painted the picture quite well. Even in monetary debt, it's good to have some, right? I mean, it's probably from a financial perspective better to have a house where you're building equity rather than live in an apartment until you could afford to just pay cash for a house. So a mortgage is not necessarily a bad form of debt and we all use credit cards, right? It's just, maybe it's a good idea to pay it off at the end of the month, rather than let it rack up for six months or a year until you've maxed it out and so it's the same thing with tech debt.

Of course, you're going to have some debt. If you don't, you're definitely not a leader in the marketplace. You are definitely near the last in terms of the marketplace, whatever market you're getting into, but to rack it up and then just never pay it down until it's just this outrageously high cost and you're spending all your time dealing with the complexities of the debt you've accumulated and practically no time actually implementing new features. So you're right, it's not like you need to have zero tech debt, that's not the goal, you know? Just manageable tech debt. It's just reach agreement as to how much of what you pay goes to interest, what you're comfortable with, and how much of course goes to feature development in the terms of tech debt or buying stuff in the terms of credit cards or whatnot.

Thomas Betts: You said you started with a monolith and then you got into microservices. Monoliths are one of those great examples of really big tech debt. It doesn't seem monumental until the monolith... It's not that the monolith is bad, it's usually the ball of mud and the spaghetti code, and it's too hard to manage. And the cost of making a small change means I have to go in there and get all the different teams to buy in and write more tests, or how do I know if I'm not going to break something? And that's the argument for going to microservices, I'm going to trade off the ease of having all my code in one place with having this distributed system and I don't have to have those burdens anymore.

Glenn Engstrand: That is a great example, by the way, of I guess you might call it architectural tech debt. And it doesn't stop there after you split your monolith up into microservices, because then what happens? More features come in, oh, let's just jam it on this endpoint on this service, maybe without a lot of rhyme or reason just because it's expeditious or our team owns that service. Then after a while those microservices become monoliths because they get that big ball of mud code smell all over again and then it's time to split it up. So that splitting up the monolith isn't just a one time debt reduction thing, given enough time, you'll do it over and over again.

Thomas Betts: That wasn't specifically the focus of your talk, if I recall. It was not the architectural tech debt, but can you give us some of the examples of what you were actually trying to resolve and how did you measure tech debt so you can reduce it?

Glenn Engstrand: I'd say for the purposes of that talk, there were two flavors of tech debt. There were, I'd say, tech radar style flavor, right? Where you decided to commit to a particular technology or a particular version of a technology. You know, that was a great choice at the time, but then years go by and now it's time to either switch out to another more relevant technology or just a new version that may have backwards breaking changes or a major version and that has cost to it. And so that's one form of tech debt.

Then the other form of tech debt is kind of what we were talking about, just reducing the big ball of mud. Maybe that's splitting up a large service into smaller services, maybe that's refactoring because we had years of copy and paste style changes and now it's just the code is crufty and it's time to revise it or reorganize it to be relevant to the current set of requirements or what's required of it.

Both forms of tech debt were addressed, however, usually what happens is the tech radar stuff is easier to sell to product, "Hey, this version is no longer supported. If we have a security violation, we'll get sued and there's no ramifications." You know, product people, executives, they understand that, "Sued? I don't want to get sued. Okay, what do we have to do?" Right? But then you sort of get them buying into that and then you say the same people who told you, we need to get off that bad tech and you respect their opinion is also telling you, we now need to refactor this service or split this large service up into smaller services or add some resiliency here or whatever the issue is. So it's both flavors, it's just, one is a little easier to sell but if you bundle them together in a plan, then you'll be able to enjoy the benefits of bundling, right? You'll be able to bring the other one along with it.

Selling the idea of managing tech debt [08:45]

Thomas Betts: That goes to one of my catch phrases that every problem is fundamentally a communications problem, and so you could tell the story easily with one in words that the people you are trying to communicate with, they understood. But when you get into the things that sound much more technical, eyes glaze over, they cover their ears like that doesn't sound important. You'll just deal with it, it's not that big a deal, but they do carry the same amount of weight. How did you get to a point where you could talk about them in the same way to that same audience?

Glenn Engstrand: You're right, it's all about communications, right? And when you're communicating with the intent of changing behavior of the person or people you're communicating to, that's called a sales job, you're selling. And a classic sales technique is get the person agreeing with you on things they can understand and we've got that agreement momentum going, and now you introduce some things that may be a little harder to understand for them, a bit of a stretch, but they're already kind of in the habit of agreeing with you. Hmm, might be just easy to go, "You know, that Thomas, he knows what he's talking about. We're just going to go with it." So that's kind of really what it is. I mean, it is still a different flavor of tech debt, but how you go about it, the remedy, it's somewhat different. But how you measure it, is different too.

It's all subjective by the way. But it's easier to measure risk when you have an old version of Log4j, right? Because they can say, "What are the consequences?" You're owned, ransomware, your site goes down. That sounds expensive, doesn't it? But what are the consequences of a huge monolith? High anxiety releases leading to people going, "Let's just not do the feature after all," leading to getting behind in the marketplace. Those are opportunity costs, it's hard to measure opportunity costs, you know? What could we have done had we been able to have a faster feature velocity? That's hard to measure. How much does it cost when our site is down per minute? That's usually easier to measure. So it is different in terms of measuring the risk, but that doesn't make it any less important.

Subjectively measuring aging, crufty software [10:46]

Thomas Betts: I remember when I was studying engineering in school, we learned about the idea of creep. That you have these sudden loads that come in, like it's easy to write the equation to say, "Is this going to break if something hits us with this much force?" The creep was, this is just under the constant load. And if you put a book on a shelf and eventually that shelf starts to sag and sometimes it's going to break, how do you calculate that because it doesn't look like it's going to be a problem for a long time?

Sometimes it will be that catastrophic failure and sometimes it's just like, "Well, you know what? We can't put anything else on the shelf because it's sagging too much." And you can see that, but you don't see it coming. It's very slow and that's why it's called creep. I feel like we talk about scope creep as features getting in, but you still have that tech debt creep almost for the second case you're describing: that the software is crufty and it doesn't call it itself out as one specific problem, it's just that sense of dread. So did you find a way to quantify the sense of dread? Was it all about risk management and saying, "Okay, well it's subjective, but this is a good guess of what we think the impact will be in the cost of that"?

Glenn Engstrand: That's exactly right. It is very subjective and you make a good point. I can't remember who originally coined this term, but it was a little bit at a time, then all at once. It's a little bit, then all at once. You hit some event horizon where then you have the catastrophic situation. And as it's accumulating, the cruft in this case or the creep in what you're talking about, as it's accumulating, it's kind of hard to really measure, right? Some things you can measure; it could be McCabe's Cyclomatic Complexity, right? There are automation tools to measure that. You can measure lines of code, right? Average profile lines of code. There's a lot of argument against that, but it's something to measure. You can measure performance, right? You can have load test automation and measure whatever: average transaction time, 95th percentile transaction time. So there are things that over time, those things can creep up and you can monitor that.

You still don't know exactly when the event horizon hits, right? When will it get to a point where it just sort of collapses under its own load? That's hard to do. And so that's why you're right, the risk score is subjective. At least in the case of tech radar style risk, most people even... Product people can go, "Okay, I get why you had a high risk score, that sounds terrible." But it's a little harder to associate a risk score with cruft. We just happen to know, the engineers know that this service is very hard to maintain, it's not easy, we break more stuff than we fix every time we look at it or make a change and so what exactly is the score for that? It's a little harder to do, but it's not impossible.

One of the things we do measure is cycle time per service and cycle time is just the amount of time... we usually look at it from like a ticketing perspective, the amount of time between when the ticket is open to when it's closed. Right? So if a service tends to have longer cycle time than the other services, and it's not because there are fewer team members, right? So compensating for differences in how many people are on the team, if a service has higher cycle time, that might be an indicator that cruft is accumulating and at least it gives you a number that product can look at and go, "Yeah, okay, that's objective, kind of." Then they can honor that and then put it on the roadmap to remedy, to break it up or whatever refactoring is required.
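
As an illustration of the cycle-time comparison Glenn describes, here is a minimal Python sketch. The ticket records, team sizes, the normalization (average days per ticket multiplied by team size) and the 1.5x-median threshold are all assumptions made up for this example; the episode does not say how Optum Digital actually compensates for team size or decides what counts as a long cycle time.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical ticket records: (service, date opened, date closed).
TICKETS = [
    ("billing", "2022-03-01", "2022-03-11"),
    ("billing", "2022-03-05", "2022-03-19"),
    ("profile", "2022-03-02", "2022-03-05"),
    ("profile", "2022-03-07", "2022-03-09"),
    ("rewards", "2022-03-03", "2022-03-07"),
]

# Hypothetical number of engineers working on each service.
TEAM_SIZE = {"billing": 4, "profile": 4, "rewards": 2}


def days_open(opened, closed):
    """Whole days between a ticket being opened and closed."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(closed, fmt) - datetime.strptime(opened, fmt)).days


def cycle_time_per_service(tickets):
    """Average cycle time in days for each service."""
    durations = {}
    for service, opened, closed in tickets:
        durations.setdefault(service, []).append(days_open(opened, closed))
    return {service: mean(days) for service, days in durations.items()}


def flag_slow_services(tickets, team_size, factor=1.5):
    """Services whose team-size-adjusted cycle time exceeds factor x the median.

    The adjustment is a crude assumption: a larger team is expected to
    close tickets faster, so cycle time is scaled by head count.
    """
    adjusted = {
        service: days * team_size[service]
        for service, days in cycle_time_per_service(tickets).items()
    }
    threshold = factor * median(adjusted.values())
    return sorted(s for s, value in adjusted.items() if value > threshold)


if __name__ == "__main__":
    print(cycle_time_per_service(TICKETS))
    print(flag_slow_services(TICKETS, TEAM_SIZE))
```

A flagged service is only a prompt for a conversation about cruft, in the spirit of the "objective, kind of" number mentioned above; it is not proof that a service needs to be broken up.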

Thomas Betts: Did that lead to any push to say we need more observability in our tooling and in our services to say, we can't tell how things are going. You mentioned load tests at one point, but sometimes it's not even running your own specific load test, it's just watching and saying, "Oh, we've noticed that responses are getting slower, and it's getting close to a point where we think it's a problem." Did you add that observability or is it something you already had in your microservices?

Glenn Engstrand: Yes, we do have a lot of that kind of stuff. We do... What's that called? Distributed tracing technology, we use a lot of that. That also gives you timestamps with it and we have a lot of orchestration, so we needed that distributed tracing for that capability. And yeah, we've integrated with APM and synthetics style monitoring technology that affords us that ability so we don't have to write a lot of stuff, we can just sort of leverage theirs. And then you're right, we set up dashboards for the production stuff. You know, we have load test automation too, we're very CICD focused and a part of that GitHub Action or Jenkins Pipeline, not only is it the unit tests and also a suite of automation tests, but also a short 10 minute load test.
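
The average and 95th percentile transaction times mentioned earlier can be computed from load test samples with a few lines of Python. This is a sketch under stated assumptions: the function names, the nearest-rank percentile method and the 10% regression tolerance are illustrative choices, not a description of the pipeline Glenn's teams run.

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples less than or equal to it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Average and 95th percentile latency for one load test run."""
    return {"avg_ms": mean(latencies_ms), "p95_ms": percentile(latencies_ms, 95)}


def regressed(baseline, current, tolerance=0.10):
    """True if the current p95 is more than `tolerance` above the baseline p95."""
    return current["p95_ms"] > baseline["p95_ms"] * (1 + tolerance)


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds from two short load test runs.
    last_release = summarize(list(range(1, 21)))
    this_build = summarize([ms + 5 for ms in range(1, 21)])
    print(last_release, this_build, regressed(last_release, this_build))
```

Comparing each build against a stored baseline is one way to notice the slow creep discussed in this section before it turns into an outage.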

The Technology Capability Plan (TCP) [15:18]

Thomas Betts: So you've mentioned a few times the idea of creating all these scores, where does all that data go? I know the answer a little bit from your presentation, it was the TCP and I only remember the acronym because it's not like TCP gets used for anything else in our industry. What was the TCP? What was the data you gave to them? And what was that group's purpose?

Glenn Engstrand: So it used to be called the Rally; of course now it's just the Technology Capability Plan. It captures everything we just talked about. It captures tech landscape type stuff, it captures cruft in specific services type stuff. Then all that gets identified by what we call... We have communities of practice, right? Just engineers who like to... Birds of a feather, they like to geek it up. We say to them, "Thanks for doing that. You're so into databases, while you're at it, why don't you also sort of decide through consensus what is the versions of databases, for example? Which databases and what versions are we going to support?" That kind of thing. And then sort of senior engineering management takes that list and says, "Okay, thank you for providing that list. We're going to be the ones who provide the risk score, right?" So in the end they get to kind of control the prioritization of it.

You know, it's subjective, some unsupported version of MongoDB or Postgres or something like that will get a high score and the score will get higher as it gets older and older. Then, maybe something that's not as important like what JSON parser you use or something like that might get a... Well, there's not a lot of risk to that, frankly, so a low score. And so those scores are collected. You know, we have a lot of automation, a lot of repo scanning type stuff automation that pulls all that up and you can see your report at any time, we have a web app. You can see your report of what your application is doing, broken down by service and then it's rolled up to the totals.

And then you're asked to, "Hey, Glenn, why is it that your risk score is so high and it hasn't gone down over the past three months? Hey, you need to be spending some time on that." So it's something that they could be kind of held accountable to. And yes, we also maintain metrics on performance and latency and McCabe's Cyclomatic Complexity stuff, all that kind of stuff, which is less. I mean, boy, let's not start saying Cyclomatic Complexity is important because you get a lot of argument about that. It's not a perfect metric by any stretch of the imagination.
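
To make the roll-up concrete, here is a small Python sketch of scores collected per service and totalled per application, plus a check for a score that has not gone down over three months. The findings, the score values and the function names are hypothetical; the episode does not describe the data model behind the TCP web app.

```python
from collections import defaultdict

# Hypothetical risk scores assigned by senior engineering management,
# keyed by (technology, status).
RISK = {
    ("postgres", "unsupported"): 8,
    ("mongodb", "unsupported"): 8,
    ("json-parser", "deprecated"): 1,
}

# Hypothetical repo-scan findings: (application, service, technology, status).
FINDINGS = [
    ("shop", "orders", "postgres", "unsupported"),
    ("shop", "orders", "json-parser", "deprecated"),
    ("shop", "catalog", "json-parser", "deprecated"),
    ("coach", "plans", "mongodb", "unsupported"),
]


def roll_up(findings, risk):
    """Return (score per service, score per application)."""
    per_service = defaultdict(int)
    per_app = defaultdict(int)
    for app, service, tech, status in findings:
        score = risk.get((tech, status), 0)  # unknown findings carry no risk
        per_service[(app, service)] += score
        per_app[app] += score
    return dict(per_service), dict(per_app)


def not_improving(monthly_totals, months=3):
    """True if the latest total is no lower than it was `months` months ago.

    monthly_totals is ordered oldest first.
    """
    if len(monthly_totals) <= months:
        return False  # not enough history to judge
    return monthly_totals[-1] >= monthly_totals[-1 - months]


if __name__ == "__main__":
    services, apps = roll_up(FINDINGS, RISK)
    print(services)
    print(apps)
    print(not_improving([10, 10, 10, 10]))
```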

Thomas Betts: Yeah, people like hard numbers because they can point at them but that doesn't mean they're useful.

Glenn Engstrand: Exactly.

Updating risk scores over time [17:49]

Thomas Betts: I want to go back to, you said something about the score will go up as something old gets older. So how often is this reevaluated? Is this a people come in and meet once a month, every quarter and discuss it? Or is it something that happens automatically?

Glenn Engstrand: Well, kind of a little bit of both. They update it every quarter. So when we lay out the TCP, we don't say, "This is good. This is good. This is good. This is bad." Right? We say, "Well, this is good this quarter and the next, but then maybe next year it's bad." Right? So you're not just innocent until we decide you're guilty, we give you a little runway. We say, "Look, you're going to be bad in six months. You're going to be bad in two years, whatever it is." And so as a technology and version changes status, right? Because it's not just good and bad, there's degrees of bad. And as it changes status, that gives us an excuse to up the score. So you know automatically your score's going to go up if you still have it and it went from a "oh, well, this isn't preferred anymore" to a "now it's unacceptable" status. Then you know you're going to take a hit for every service you're still on for that.
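
The runway idea, where a technology's status is published ahead of time and the score rises when the status changes, can be sketched in Python as follows. The technology name, the dates, the status labels beyond those quoted above and the score values are invented for illustration and are not taken from the actual TCP.

```python
from datetime import date

# Hypothetical score for each status; "degrees of bad" map to rising scores.
STATUS_SCORE = {"preferred": 0, "not preferred": 2, "unacceptable": 8}

# Hypothetical published schedule: for each technology version, the date each
# status takes effect, in ascending date order.
SCHEDULE = {
    "exampledb 1.x": [
        (date(2021, 1, 1), "preferred"),
        (date(2022, 7, 1), "not preferred"),
        (date(2023, 1, 1), "unacceptable"),
    ],
}


def status_on(tech, day, schedule=SCHEDULE):
    """Status of a technology version on a given day."""
    current = None
    for starts, status in schedule[tech]:
        if starts <= day:
            current = status  # later entries override earlier ones
    if current is None:
        raise ValueError(f"{tech} has no status before {day}")
    return current


def score_on(tech, day, service_count):
    """Total score on a given day: you take a hit for every service still on it."""
    return STATUS_SCORE[status_on(tech, day)] * service_count


if __name__ == "__main__":
    for day in (date(2022, 6, 13), date(2022, 9, 1), date(2023, 2, 1)):
        print(day, status_on("exampledb 1.x", day), score_on("exampledb 1.x", day, 12))
```

Because the schedule is known in advance, a team can see today what its score will be in six months if it does nothing, which is the "no bad surprises" property discussed later in the episode.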

Thomas Betts: Yeah, and I think if you look ahead a quarter and say, okay, sometime in the next quarter, you've got to meet this thing or the score you're judged by and we can see it, you're going to come back and say, "Well, why haven't you done that?" In every place I've worked, someone always finds a way to say, "Hey, we're going to take care of it in some other way. It will get resolved or they get an exemption, but you have to justify it." And so it's the best decision you can make at the time but having to revisit that discussion on a regular basis, at some point it's worth it to just remove it as a problem, right? And people will get motivated like, okay, we're going to take a week to just update all these things or however long it takes.

Glenn Engstrand: Or decommission a service, right? If a service has a lot of old tech in it and it's also crufty, maybe it's just time to break it up and the new services won't have the old tech in it. But when you do, that service will go away, the old bad tech, and therefore it won't be counted in your score anymore and so you'll have gotten a decrease, which is an improvement in the score.

What's beautiful about it is there's really no bad surprises, right? The only way there's a surprise is if you weren't paying attention for a long time, that's the only way you get a surprise. It's not like we lobbed something over the wall at you. You know, obviously the Log4j thing that happened was lobbed over the wall, but that never even hit our TCP. That was, "Okay, clean it up. Here's the list of services that have that dependency. Go." And it was all hands on deck. Product wasn't like, "No, you've got to deliver products." It was, "All right, no more features until that's done."

Thomas Betts: Well, and I think going back to our early discussion of what is tech debt, that wasn't tech debt. You didn't start using Log4j assuming that, oh, this is going to be a problem. That was a different issue.

Glenn Engstrand: Good point. It wasn't like that was a five year old version of Log4j. It was the latest version of Log4j, so good point.

Thomas Betts: You don't want to fault the teams like, "Why didn't the tech debt catch it?" They didn't know. You know, you can't know everything.

Getting engineering and product teams aligned [20:43]

Thomas Betts: When you get to, okay, we're going to make a decision to decommission this service and build up something new, engineers like to do that kind of stuff. Product managers are less likely to do that. Does this whole process help get everybody on board? And the product manager is saying, "Okay, I understand we're not going to deliver any new features for a month while we do this transition." That's been a hard argument in a lot of places I've worked, does this help because it's written out and you see the numbers and everybody's been talking about it for a while?

Glenn Engstrand: Yeah, because when you see it written out this way, the decommissioning of a service and maybe replacing it with other services, it's presented to them as the cheapest way to solve the problems. Oh, well, we could not decommission it, in which case we need to upgrade all this bad tech in there and make all those changes, and that usually takes longer because it is a monolith, right? It's a large service so that usually takes longer to do than just going ahead and breaking it up at that point. So it's kind of sold to them as, "Oh, let me get where we need to get quicker."

Thomas Betts: And then does it also help you do some of the regular ongoing maintenance, getting you out of that problem of the little bit at a time and then a lot? Do they see it's like, "Well, we're going to devote a little bit more time to these specific things"? I've never been a fan of we're going to add 10% overhead to take care of tech debt because it's just this nebulous thing and no one knows what they're doing. Does this help quantify and say, "We're going to take 10% of our time and dedicate it to these specific things because we can now measure that we're going to get these scores lower"?

Glenn Engstrand: Yeah, remember product is not disenfranchised. You know, they're right there with the engineering managers deciding what's going to happen on the roadmap. Right? So if they're like, "Okay, we're willing to let you have more, let engineering investments take up more time on the roadmap here, because we're going to need more time when it comes to crunch mode because we have a big client who needs all this work done." There's back and forth, right? It's not like they feel like they're out of control about it. And they're willing to give up a little in order to get a little.

Thomas Betts: Does that reduce, I don't know how to say this, the blame game? Engineers like to blame product and product likes to blame engineers like, "Oh, they didn't let us do this." And "Oh, why is there so much cruft in that?" You know, it kind of avoids having the secrets because engineers and architects will make the technology decisions that they need to make to get their job done and sometimes these things get buried, but if it gets called out that we're going to use Mongo. And it's fine right now, but a year from now, it's like, well, we haven't had a chance to upgrade it, now we need to talk about upgrading that and it's already been there.

Glenn Engstrand: Exactly. We have this document, we've got buy-in at all levels. So it sort of legitimizes and kind of makes it normal that yeah, you're going to make decisions today in order to get to market quicker. Good. However, some of those decisions are shortcuts, you're going to have to pay for it. And now we're back to the original monetary debt metaphor and how relevant it is in tech debt. Right? You could have just waited and bought whatever it is you want to buy like a model of something that's expensive and you put it on your credit card because you don't actually have cash for it. You could just wait until you do, but then you didn't get the benefit of the model. But then when you put it on the credit card, you don't go, "Well, I don't have to worry about that for 10 years." No, you go, "Well, that's no problem because I'll be able to pay it off in a month and therefore that's a small interest payment, I'm willing to accept that."

Thomas Betts: It's classic opportunity cost, right?

Glenn Engstrand: Exactly.

Thomas Betts: Like you can either do this now, but try and acknowledge all this stuff. I've had a lot of discussions about architecture decision records and one of the important things to do is put the date on it. It's like this is the decision made on this date. All of our decisions are made at a point in time and they're based on the best information we have available at the time. And you have to accept like here's what we considered, here are the trade offs, but this is the decision we made because this is what was going to get us out to market right now.

Glenn Engstrand: Exactly. We do ADRs too. Definitely specify the why. Definitely set up the context because that's what's going to happen, "Why did they go with that?" You know? Five years from now, they're going to ask that, right? And if you've got that, we keep our ADRs in the same repo as the code that the ADR is about. You can easily just, "Well, here it is, right here, the local directory. That's why they did it." "Okay, all right." You know, back in that timeframe, that probably was the best decision they could have made at that time and that's why it's not a blame game. I do not want that to happen.

From a corporate culture perspective, it's not a, "Well, if you developers made the right decision in the first place, we wouldn't have to rework it." No, that's not what we're talking about at all. We're talking about, "This is why we made that decision. At the time, it made sense. Over time, of course, perhaps the true costs of it come out and now it doesn't seem like such a great idea. But remember you got into the market sooner, so how many sales would you have not made if we delayed that? Right?" It's still hard to measure opportunity costs, but it's just easier to reason about it if in your culture it's open and okay to talk about it.

Benefits of the TCP process [25:39]

Thomas Betts: This is one of those things that you spend money on and it's kind of hard to measure the fact that something didn't happen. You mentioned the tech radar, that if you don't upgrade this and a hack occurs or data leaks or whatever it is, some bad thing like, oh, we don't want bad things so we can kind of measure that. It's always been more difficult in our industry to measure something that didn't happen because we took the preventative action. But what are the benefits that you have been able to see whether they're tangible or intangible? You mentioned something about the corporate culture improving, what else has been a benefit of this process?

Glenn Engstrand: The fact that it is driven by communities of practice makes engineers feel like they're in control. And it's very stressful when you're being asked to do all this stuff, it's hard, perhaps long hours, and you have no say so whatsoever. That's not a good place in terms of engineer morale. So it's nice to have these communities and then executives pay attention to what the communities say and honor that. Even product has to begrudgingly honor that and they do; they do a good job of it only because the execs say, "I told you to." And so it makes the engineers feel good. You know, everyone wants to feel heard, everyone wants to feel like their opinion counts and it makes them step up to the plate. Because it's a community thing, you don't have a lot of, "Well, I just want to start using Clojure because I want that on my resume, so I'm going to rewrite the next service in Clojure." Right?

You know, you don't get that because it's not a single decision. It's a community, a group of engineers. You have to convince them first that they need to rewrite it in Clojure and that may be a little bit harder, especially that particular choice. But you get the point, people take it seriously, they want the communities to produce good decisions. And because they're truly at the decision making table, they want to honor the process, right?

So it's sort of like the old school definition of architect versus the new definition. In the old school, the architects were in ivory towers, they made all these decisions, they told the developers what to do, that's it, developers hated that. Now, the developers are making those architectural style decisions under the advice perhaps of higher abstracted architects, but still they feel like they're making the decisions and they are making the decisions and they just feel better about it. So you just have a healthier corporate culture in general.

Thomas Betts: Yeah, it gets out of the, "I'm going to just complain about this. Why do we have to do whatever or why can't we fix whatever?" We have a process that everyone gets involved in your specific area. Does that also lead to, you said job satisfaction, but also career enhancement? Somebody gets the opportunity to talk to the people who have been making the decisions and helping them and maybe decide, "Hey, that might be something I want to do more."

Glenn Engstrand: That's exactly right. In some shops, the only difference between a senior or a principal engineer and a junior engineer is the number of tickets you work on, and more pay of course. But in this, you have kind of a higher level of responsibility, right? You might be setting technology or policy or engineering best practices in your area and working at that higher level of abstraction as well. So it definitely feels like you are rising up in whatever your career is. You are definitely rising up and it's more than just more work for more pay.

Alternatives to the TCP [28:50]

Thomas Betts: The last thing I wanted to cycle back to is what were the alternatives? The TCP sounds like a good process, it works for you. What else had you considered, or was there anything else that had been tried and hadn't worked and you had to go away from?

Glenn Engstrand: The only other systemic approach to managing tech debt like this is in the Google SRE book. I first ran across that back when I was working at Adobe I think, so it was a while back, five, six years, something like that. It sounded very exciting at the time. But first of all, Google of course is a very effective technology company. They have a lot of smart people in there, but just because Google does it one way it doesn't mean it's a good fit for everybody. Right?

So that particular approach, I feel like it's very combative. It's still very "you're innocent until you're guilty, now you're guilty, now you're in trouble." So it just paves the way for finger pointing and political maneuvering and it's kind of inflexible, right?

Whereas the TCP is more like, "Let me tell you about how things are going to go in the future." Yes, you can get in trouble, but only if you ignore this for a long enough period of time. So I just feel like that leads to less political posturing, it leads to people just bringing their best selves to work and not having to worry about any kind of CYA politics they have to do and stuff like that.

Thomas Betts: And what was the crux of the Google recommendation? Is that SLOs and error budgets? Is it that idea?

Glenn Engstrand: So you get these service level objectives, which could be performance related, it could be availability, how much downtime do you take and stuff like that per whatever the time period is: day, week, month, that kind of thing. And if you run out of budget, if you have more violations of your service level objective than what you are budgeted for, then you can't release any more features until the next time window: day, week, month, that kind of thing. So in theory, what that's supposed to do is let product managers go, "Well, we better let them pay down some tech debt, because otherwise we're going to violate our SLOs and no more feature development. Oh, I want feature development." And then when it happens, you're still allowed to do releases that pay down tech debt, you're still allowed to do that. So at a minimum, that's when you get to pay down your tech debt: when you're in violation and you can no longer deliver features.
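
The error budget rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual policy or tooling: the `ErrorBudget` class, the `may_release` function, the `"feature"` and `"tech_debt"` labels, and the request-count model of an availability SLO are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Availability error budget for one time window (hypothetical model)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served so far in the window
    failed_requests: int   # requests that violated the SLO in the window

    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self) -> float:
        # Negative once there are more violations than were budgeted for.
        return self.allowed_failures() - self.failed_requests

    def exhausted(self) -> bool:
        return self.remaining() < 0


def may_release(budget: ErrorBudget, change_kind: str) -> bool:
    """Freeze feature releases once the budget is spent.

    Tech debt releases stay allowed even when the budget is exhausted.
    """
    if change_kind == "tech_debt":
        return True
    return not budget.exhausted()


# Assumed numbers: a 99.9% SLO over 100,000 requests budgets about 100 failures.
over_budget = ErrorBudget(slo_target=0.999, total_requests=100_000, failed_requests=150)
within_budget = ErrorBudget(slo_target=0.999, total_requests=100_000, failed_requests=20)
```

With these assumed numbers, `may_release(over_budget, "feature")` is false until the next window begins, while `may_release(over_budget, "tech_debt")` is still true, which is the "you're still allowed to do that" case in the description.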

Thomas Betts: To stretch our metaphor probably too far at this point, that's the you're waiting for the debt collector to show up and bang on your door saying, "You haven't paid this." It's like, "Well, I got away with it for three months, no one made me pay my bill and now you're really mad."

Glenn Engstrand: Exactly. I'd rather educate people on how to responsibly manage debt rather than wait for the collections industry to put them on the right path.

Thomas Betts: In a lot of cases, like you said, we can see some of these things coming. It may not have been something we identified on day one when we made that decision, but we now evaluate on a regular basis and see that, oh, we have some of the stuff that we implemented a year or two years or five years ago. We need to have a plan to address that and, except for Log4j, it doesn't have to be today.

Glenn Engstrand: A lot of the decisions you make, especially if there's a public facing commitment date, you make the decision. You know, yeah, this is a good decision now, but you're right, I know eventually the goodness of the decision will expire. Heck, anytime you have a dependency on something that's versioned, of course the day will come when that version is old and no longer supported. I mean, it might be years, but still those things rack up over time.

Outro [32:12]

Thomas Betts: Well, I think that's about all the time we have for today. If listeners want to watch Glenn's presentation from QCon Plus, that recording is now available on InfoQ and we'll have a link in the show notes. You'll also be able to find links to some of the other articles that Glenn has written for InfoQ. So Glenn, thank you again for joining me today.

Glenn Engstrand: Thank you, Thomas. As always, it's great working with you.

Thomas Betts: Thank you for listening and subscribing to the show and I hope you'll join us again for another episode of the InfoQ Podcast.

About the Author

Glenn Engstrand

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.