Managing Tech Debt in a Microservice Architecture


Summary

Glenn Engstrand describes how Optum Digital engineering devised a method for reliably and predictably paying down tech debt for hundreds of microservices.

Bio

Glenn Engstrand is a software architect in the part of Optum Digital that was Rally Health. Glenn's focus is working with engineers in order to deliver scalable, server side, 12 factor compliant application architectures. Glenn was a breakout speaker at Adobe's internal Advertising Cloud developer's conference in 2018 and 2017 and at the 2012 Lucene Revolution conference in Boston.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Engstrand: My name is Glenn Engstrand. I am a software architect for the engagement portfolio at both werally.com, and myuhc.com. Prior to joining, I was a software architect for the user facing part of the DSP at Adobe Ad Cloud.

What is Technical Debt?

How many times has this happened to you? The product manager describes the next feature that they want added to the product. Developers give a high estimate for the time it takes to implement. The product manager asks why it is going to take so long. The developers talk about having to deal with the implications of making changes to lots of hard to understand code, or having to work around bugs in old libraries or frameworks. Then the developers ask for time to address these problems that make developing new features so slow, and the product manager declines, referring to the big backlog of desired features that need to be implemented first. If left unchecked long enough, this vicious cycle can lead to the inability to compete successfully in the marketplace, or even complete catastrophic collapse of complex software systems. Welcome to the world of tech debt.

There are really two flavors of technical debt. There's this concept in software development that reflects the implied cost of additional rework caused by choosing an easy solution now, instead of using a better approach that would take longer. There's also the concept of refining the tech stack, in order for the architecture to best satisfy the ever changing requirements of the enterprise. Technical debt could be compared to monetary debt. What does that mean? With monetary debt, some of your money has to go to paying down interest instead of buying stuff. The larger the debt, the more of your money gets diverted to the interest payments. With technical debt, some of your engineering time has to go to dealing with accidental complexity, instead of implementing desired features. The more accidental complexity, the longer it takes to implement new features. If you think that paying down technical debt, while still delivering a competitive feature velocity for a single service is hard, then try doing that for dozens or even hundreds of microservices. Not only is it harder to manage technical debt in a microservice architecture, the risks of not paying down technical debt get more serious and grow faster.

Every software company eventually has to deal with tech debt. The longer that self-limiting technical decisions are in effect, the more engineering effort that it takes to develop new features or fix bugs. I've worked in many companies in my career. From that experience, what I have learned is that the usual developer response to this problem is to depend on charismatic and strongly committed champions, and then become cynical as a response to feelings of helplessness once those individuals leave the organization.

Who is Optum Digital?

This presentation covers how Optum Digital, formerly Rally Health, learned to address these problems of tech debt in a systemic and non-confrontational way. This solution addresses both kinds of tech debt. Founded in 2010, Audax Health rebranded as Rally Health, and was acquired by UnitedHealth Group in 2017. Rally Health has recently joined the Optum Digital family.

Let's briefly cover some terminology before going any further. Also known as a software product line, a portfolio is a collection of products that in combination serve a specific need. Optum Digital is in the healthcare space. Our portfolios center on how to adopt healthier lifestyles, or how to use your medical benefits more effectively. Multiple teams get assigned to each product. Typically, those teams are aligned with a type of software client and a team for backend services too. There are also teams for more platform oriented services that function across multiple portfolios. Each team most likely is responsible for multiple software repositories. The buildable artifact for a repository could be a deployable web service, a web based single page application, a native mobile application to be installed via the specific mobile vendor's app store, or some collection of support libraries, test automation, DevOps scripts, data science assets, or developer tooling.

Our CTO originally blogged about the importance of engineering investments back in 2018, which was about two years after they originally broke up their monolith into microservices. Today, over 700 engineers develop hundreds of microservices in 4 portfolios. They treat tech debt very seriously. It's not just something to commiserate about around the figurative water cooler. The risk of tech debt spiraling out of control is very real for them. I joined Rally Health in May of 2020. Part of my responsibilities lies in evangelizing, facilitating, and other forms of soft leverage to make sure that these engineering investments get the resources that they need. This presentation means more to me than just a simple academic exercise.

What Is the TCP?

This company has a lot of dedicated and smart engineers, which most probably explains how they were able to come up with what they call the technology capability plan. I find the TCP to be a truly innovative community approach to managing tech debt. I've not seen anything like it anywhere else. That's why I'm excited about it and want to share what we have learned with you. Here is the stated purpose of the TCP. It is used by and for engineering to signal intent to both engineering and product, by collecting, organizing, and communicating the ever-changing requirements in the technology landscape for the purposes of architecting for longevity and adaptivity. In the next four slides of this presentation, I will show you how to foster the engineering communities that create the TCP. You will learn how to motivate those communities to craft domain specific plans for paying down tech debt. We will cover the specific format and purpose of these plans. We will then focus on how to calculate the risk for each area of tech debt, and use that for setting plan priorities. Finally, you will learn how to take those prioritized plans and use them to negotiate with product managers in order to get engineering time to pay down tech debt in a positive, practical, and relatively stress-free way.

Engineering Communities

Engineering communities fall outside of the typical organizational hierarchy of portfolio, product, and team. These communities of practice tend to attract birds of a feather engineers who are passionate about the same technology. They meet on a regular basis, which is typically monthly. They usually have a dedicated wiki topic, chat channel, and email list for ongoing conversations and resource sharing. The initial attraction is usually the opportunity to geek it up about their favorite technology. Depending on the strategic importance of the community, the membership can range from totally open to by invitation only. If membership is curated, then focus more on culture add over culture fit. The key to the effectiveness and authenticity of the TCP is that policy is mostly set bottom-up through these engineering communities. Each community has a single digit number of officers who are responsible for delivering the part of the plan that falls within their domain. Membership in each community comes exclusively from engineers in all parts of the organization. Representation is important, not only in terms of portfolio, but also in terms of diversity.

The Rally engineering part of Optum Digital has 15 engineering communities including databases, backend services, web SPAs, native mobile applications, continuous integration pipelines, containerization, architecture, DevOps, SecOps, and quality assurance. The monthly community meeting is videotaped and stored in a shared folder. A meeting summary is sent to the engineering wide mailing list. Each community's part of the TCP is a living document that is kept up to date based on meeting activity. Once a quarter, these documents are collected and published internally as the TCP.

Plans for Paying Down Tech Debt

Each community's plan is composed of a table and about a page of explanatory narrative. The x-axis of the table represents time. Each column represents either a quarter or a year of time. The entire table is for the next three years. The y-axis of the table is technology. It lists the PADU, preferred, acceptable, discouraged, or unacceptable versions of the relevant programming language, framework, library, or platform as a service. There are limits on the size of the table, so each community has to be strategic and focus on the technologies whose out of date versions pose the most significant risk.

What is in each time reference cell of this table is one of the following statuses. The plan status indicates that you should be planning for this upgrade. Deprecate means that you can no longer adopt the out of compliance version of the technology in new pull requests. The migrate status indicates that each affected team should be actively migrating to the appropriate version. The use status means that this version of the technology should be what you are using at this time period. The remove status means that the version of the technology could be made unavailable at any time in that time period. If you are still using it, then you're out of luck. The explanatory narrative sets the context in which each technology should be used and the impact or consequences for every team that does not follow the plan.
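To make the table format concrete, here is a minimal sketch of one community's plan as a data structure. It is not Optum Digital's actual format: the row labels, period labels, and helper names are hypothetical, and treating migrate and remove as also blocking new adoption is an assumption that goes slightly beyond what the talk states for deprecate.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PLAN = "plan"            # start planning for this upgrade
    DEPRECATE = "deprecate"  # no new adoption in pull requests
    MIGRATE = "migrate"      # affected teams should be actively migrating
    USE = "use"              # the version to be on in this period
    REMOVE = "remove"        # may become unavailable at any time


@dataclass(frozen=True)
class PlanRow:
    technology: str  # hypothetical row label, e.g. "Java 8"
    statuses: dict   # period label (quarter or year) -> Status


# Hypothetical rows; the versions and periods are invented for illustration.
PLAN = [
    PlanRow("Java 8", {"2022-Q3": Status.DEPRECATE,
                       "2022-Q4": Status.MIGRATE,
                       "2023": Status.REMOVE}),
    PlanRow("Java 17", {"2022-Q3": Status.PLAN,
                        "2022-Q4": Status.USE,
                        "2023": Status.USE}),
]


def status_of(plan, technology, period):
    """Return the status of a technology in a period, or None if not listed."""
    for row in plan:
        if row.technology == technology:
            return row.statuses.get(period)
    return None


def blocks_new_adoption(status):
    """Assumption: deprecate, migrate, and remove all rule out new adoption."""
    return status in (Status.DEPRECATE, Status.MIGRATE, Status.REMOVE)
```

A pull request check could call `status_of` for the current period and refuse newly introduced dependencies for which `blocks_new_adoption` is true.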

What are the risks to not paying down the tech debt? Basically, the narrative makes the case for why following this plan is a good idea. These community driven plans help the organization manage the aspect of tech debt that deals with using out of date, non-secure, or unsupported versions of technology. What about the other aspects of tech debt? Each portfolio also gets to submit a plan for inclusion in the TCP. Portfolio-driven plans signal the intent to pay down the other forms of tech debt, such as refactoring large code bases, or splitting up monolithic services into many smaller services. In addition to the community and portfolio sections of the TCP, there's also a visionary introduction and a section calling out the product areas that currently have the most risk.

What is Technological Risk?

Engineering communities and principal portfolio engineers come up with plans for paying down tech debt. This results in a big list of engineering investments to make. We all work from limited resources, so we can't do it all at once. How can you tell which parts of each plan to do and in what order? How do you prioritize the list? Product management doesn't know how to do that, because engineering investments don't come from them. The answer comes from an understanding of the relative risks of not following each part of the plan. This is captured as a numerical risk score. The higher the score, the greater the risk, and the higher the priority. Technologies with the use status always have a score of zero. The non-zero risk score can increase for each quarter that a technology has a migrate, deprecate, or remove status. While each plan comes from the bottom up, the risk score comes from the top down. What goes into the plan comes from community participating engineers, and how that list is prioritized comes from executive engineering management.
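The talk does not give the actual scoring formula, so the following is only a sketch of the behavior described above: zero for the use status, and a score that grows with each quarter spent in deprecate, migrate, or remove. The per-quarter weights and the `base_priority` multiplier are hypothetical stand-ins for whatever executive engineering management assigns.

```python
# Hypothetical per-quarter weights; the real values are set top-down
# and are not published in the talk.
QUARTERLY_WEIGHT = {"deprecate": 1, "migrate": 2, "remove": 4}


def risk_score(status_by_quarter, base_priority=1):
    """Accumulate risk for one technology version over consecutive quarters.

    status_by_quarter: statuses in chronological order, ending at the
        current quarter.
    base_priority: hypothetical multiplier assigned by leadership;
        0 means leadership sees no risk in staying on this version.
    """
    if not status_by_quarter or status_by_quarter[-1] == "use":
        return 0  # technologies with the use status always score zero
    accrued = sum(QUARTERLY_WEIGHT.get(s, 0) for s in status_by_quarter)
    return base_priority * accrued
```

With these invented weights, a version that has gone through deprecate, migrate, and remove scores 7, while one that leadership rates at `base_priority=0` stays at zero however long it lingers.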

Optum Digital uses lots of metrics in order to steer the ship, figuratively speaking. These metrics are collected into what is known as a balanced scorecard, which is a strategy performance management tool that originally came from Harvard Business School. For that purpose, each technology from the various plans is rolled up by product. The per product risk score is the sum of the risk scores for that product. A product gets penalized by a risk score if even a single one of its repos is still using or dependent on a technology with the deprecate, migrate, or remove status. That risk score is counted only once, even if several repos are out of compliance. The median of these per product risk score aggregates is what gets recorded in the balanced scorecard. It makes sense to use automated static code analysis on your repositories in determining technology dependencies. You will also need good support for CI/CD, DevOps, and GitOps in order to easily and reliably calculate this metric. We also calculate the TCP risk score in a different way in order to help focus the teams for each product. In this metric, each technology from the plans is rolled up by repo. The per repo risk score is the sum of the risk scores for that repo. The aggregate risk scores for the repos for a product are summed up as the aggregate risk score for the product itself. In this manner, we can track risk burndown for each product, or make product comparisons on the basis of TCP risk.
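The two rollups differ only in whether a risky technology is counted once per product or once per repo. Here is a minimal sketch of both, assuming the per-technology scores and the repo dependency sets are already known (for example from static analysis). The product names, repo names, and numbers are all invented.

```python
from statistics import median

# Hypothetical risk score per out-of-compliance technology.
TECH_RISK = {"java8": 7, "spring4": 3, "mongo3": 5}

# Hypothetical products -> repos -> technologies each repo depends on.
PRODUCTS = {
    "engage": {"web-api": {"java8", "spring4"}, "worker": {"java8"}},
    "rewards": {"ledger": {"mongo3"}},
    "profile": {"profile-svc": set()},
}


def scorecard_product_score(repos):
    """Each risky technology counts once per product, however many repos use it."""
    techs = set().union(*repos.values())
    return sum(TECH_RISK.get(t, 0) for t in techs)


def scorecard_metric(products):
    """Median of the per-product scores, as recorded on the balanced scorecard."""
    return median(scorecard_product_score(r) for r in products.values())


def repo_score(techs):
    """Sum of the risk scores for the technologies one repo depends on."""
    return sum(TECH_RISK.get(t, 0) for t in techs)


def team_focus_product_score(repos):
    """Sum of per-repo scores, so a technology counts once per repo using it."""
    return sum(repo_score(techs) for techs in repos.values())
```

In this made-up data the "engage" product scores 10 on the scorecard rollup but 17 on the per-repo rollup, because `java8` appears in two repos. That difference is what makes the second metric useful for tracking burndown: migrating one repo lowers it even while the product as a whole is still out of compliance.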

Getting Engineering Investments on the Roadmap

Now that we have a prioritized plan for paying down tech debt, forged by engineers and blessed by leadership, how do we get that plan funded and on the roadmap? First, let's review what typically happens without a TCP when the engineering manager sits down with the product manager to lay out the development schedule for the next six sprints. The product manager says we need these three features. Engineering manager says, let's delay one feature in order to pay down some tech debt. Product manager replies, when the executives ask why we lost a million dollar sale, I will have to tell them, because you wanted to pay down some tech debt. Engineering manager says, maybe later then. In a non-TCP environment, it is just the engineering manager versus the product manager. The product manager can always invoke sales in order to get their way.

Let's rewind and replay that same negotiation, this time with the TCP in place. The product manager, fresh out of a meeting with the executives about the importance of the TCP and how we have to lower our TCP risk score in the balanced scorecard, sits down with the engineering manager to lay out the schedule for the next six sprints. The product manager says we need these three features. Engineering manager says, if it were up to me, I would give you all three of those features. Unfortunately, all of our engineers and their executives have identified this super risky tech debt that needs to be paid down this quarter. You would have already been aware of this had you been tracking the TCP for the past year. Do you see how the dynamic changed? Because the TCP is an authentic enterprise-wide consensus on tech debt and its risks, the engineering manager can use that collective bargaining power when it comes time to get engineering investments on the roadmap, without any yelling or threats.

In the highly unlikely event that the product manager continues to be inflexible with regards to permitting engineering investments on the roadmap, the chain of escalation will eventually reach the executive level. Remember that risk score being a part of the balanced scorecard? For executives, that balanced scorecard is their dashboard. It is how they see what direction that the company is going in. Getting that metric on the dashboard makes tech debt very real for them, which increases the chances that decisions will go in the favor of engineering investments like paying down tech debt.

Alternatives to the TCP

The only other systemic approach to managing tech debt that I know of is documented in the Google Site Reliability Engineering book. Here's a quick take on that approach and why I believe that the TCP is better. First, you gain consensus on a list of SLOs or service level objectives. Every time your system falls out of one of those SLOs, you count that as an error. You have an agreed upon number of acceptable errors per time window. This is known as your error budget. You're no longer permitted to release any more features if your system has exceeded its error budget until the next time window. In order to avoid this situation, product managers are supposed to be more willing to divert engineering resources for paying down tech debt. Here is why I find the Google SRE approach to be highly destabilizing. The cause and effect between features and sales appears to be more real to most product managers than the cause and effect between tech debt and outages. There is a presumption that tech debt reduction releases will always make the system more stable. While that is hoped to be true in the long run, it is not guaranteed to be true in the short term.
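The error budget mechanism as summarized above can be sketched in a few lines. This follows the talk's simplified description (a count of SLO misses per window that gates feature releases) and is not taken from the SRE book or any real tool; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_errors: int   # agreed number of SLO misses per time window
    errors_seen: int = 0  # misses counted so far in the current window

    def record_slo_miss(self):
        """Count one occasion where the system fell outside an SLO."""
        self.errors_seen += 1

    def remaining(self):
        """Misses still available in this window, never below zero."""
        return max(self.allowed_errors - self.errors_seen, 0)

    def feature_releases_allowed(self):
        """Feature releases are frozen once the budget has been exceeded."""
        return self.errors_seen <= self.allowed_errors

    def start_new_window(self):
        """The freeze lifts when the next time window begins."""
        self.errors_seen = 0
```

Note how the gate flips abruptly from allowed to frozen on a single event, which is the unpredictability the talk objects to when it comes to planning engineering investment time.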

Since error budgets promote short term thinking, this will actually end up discouraging product managers from approving such engineering investments. It is hard to predict when error budgets get exceeded and therefore hard to plan for when to schedule engineering investment time. This approach tends to pit product managers against engineering in some antagonistic relationship. The rigidity of the approach makes it riskier to adopt, therefore harder to reach executive approval. Finally, this approach tends to politicize the paying down of tech debt, where product managers attempt to game the system by convincing executives not to count certain outages against their error budgets, or to renegotiate SLOs or error budgets in order to delay the consequences of starving out engineering investments. In keeping with the monetary debt metaphor, it gives them the option to just print more money. The TCP approach is more focused on reaching authentic consensus between product and engineering. TCP driven development is more predictable with regards to the roadmap, so there are fewer stressful surprises to all involved parties.

Summary

Will a technology capability plan solve all of your engineering problems? Of course not. Will you still have tech debt? Absolutely. Will you still need to take shortcuts in order to deliver features in customer-driven timelines? I'm sure that you will. The TCP is not intended to hinder or restrict engineering and product from doing what they do best, which is making and releasing software. What the TCP does is signal to both engineers and product managers that there are additional costs to taking necessary shortcuts, and that those costs cannot be ignored indefinitely. With the TCP, you don't wait until outages are racking up or collectors are calling, metaphorically speaking, to start paying down that tech debt. No process, policy, technology, or tool can ever serve as a sensible substitute for quality engineering. The TCP documents the consensus among engineers on what is the riskiest tech debt, and when it is reasonable to pay it down. In order for the TCP to be respected, its plans must be relevant, accurate, compelling, and believable. That can happen only if its contributors are seasoned and mature professionals with strong engineering skills and integrity. I think that this quote from our TCP document sums it up. "Architecting for longevity and adaptivity requires a deep understanding of both today's realities and tomorrow's possibilities. It requires an appreciation for the technology and market forces driving it. It requires a long term commitment to focused, and incremental progress."

I leave you now with a quote from that 2018 engineering blog that I mentioned earlier. There are some people who love to bake. Then there are others who cook because they don't like to follow a predefined recipe. They like to mix and match ingredients, taste the food, smell the aroma, and sometimes throw it all away and start anew. Fast growing companies do more cooking than baking. There is a level of uncertainty that comes with the business of changing quickly. At Rally, we've been fortunate to be part of a fast growing organization that builds technology, which helps people get better and lead healthier lives. We've done it by making product and engineering tradeoffs, knowingly, and sometimes without knowing.

Questions and Answers

Richardson: There's a few questions around the TCP, or a lot of the details of the TCP are private, beyond what you've talked about. Whereas people want to start putting it in place today, what do they do? Maybe you could just describe generically, if necessary, the types of engineering communities which are exposed to these sections of the TCP.

Engstrand: It's going to be very centered around the organization's specific tech stack or tech stacks. Typically, you're going to have some frontend, some web communities, some mobile communities. You're going to have a backend community. You're probably going to have a database community. You're probably going to have a SecOps community. You're probably going to have a DevOps community, that kind of stuff.

Richardson: Just categories of the technologies, the frameworks. I suppose that could even be like, somewhere on there is probably which version of Java we should be using?

Engstrand: Exactly. I'm assuming the version of Java would be handled by your backend community. If you have a frontend Java app, then that wouldn't be the same thing. For Android, you might have a different version of Java, if you're running on Android. They would handle that version, either mobile or your Android community, depending upon how you frame it. Whether you frame it as separate iOS and Android, or whether you just say mobile. It depends.

Richardson: It's, in a sense, maybe the categories. It's like, here's all the frameworks and technologies we use, let's cluster them together. Have a community around that, who makes decisions around, fundamentally, version management.

Engstrand: You can make it an organic approach. Frame it more like birds of a feather, let's get together. Who's interested? Volunteer. What do you want to talk about? It'll tend to cluster. It'll tend to be everyone who cares about MongoDB, or Cassandra, or whatever relational database you like to use, will want to just naturally cluster together. That'll be a community of practice, whatever that is. Then from that, you can decide, ok, so you're responsible for what version of Mongo we're going to run, or what have you.

Richardson: For cross-functional tech debt, common framework used across teams, who ends up owning it?

Engstrand: The thing that comes to mind in terms of cross function, would be DevOps, or SecOps, or containerization, maybe all things Kubernetes. They would get their own community. There's some crossover. Like my build tool also makes a Docker image, what do we do now? That kind of thing. Sometimes there's a little bit of a crossover, but for the most part, the lines are pretty clearly drawn. If there is a crossover situation, like this Fabric8 Maven plugin wants to use a certain version of the Kubernetes API and the DevOps folks think it should be a different version, then, ok, that could be a situation for some discussion. For the most part, it partitions pretty well.

Richardson: When do you say that this really does happen organically? Is volunteerism sufficient, or do you have to get sponsorship and time allocation?

Engstrand: It is a little bit of both, isn't it? Hopefully, you have an engineering culture such that it's ok to geek it up. It isn't always, why aren't you coding? What ticket are you working on right this very minute? If there's a little wiggle room that allows engineers to socialize over their favorite technologies, then they're going to naturally want to create communities of practice to share notes. Have you seen this particular code wizard or whatnot? You need to organize each community, like elect officers. Then the officer once a quarter needs to make sure, what version of Java are we supporting right now? What version of Spring works with that? Whatever internal libraries we've written, is it compatible with what we're saying we want you to move towards? There's some time that needs to be spent doing that. Then that's the part that gets collected into the TCP, if you do a quarterly release of the TCP. Maybe that's not the most pleasant part. The adults in the room have to spend the day each quarter saying, this is what we're saying is the versions we support.

Richardson: There's a comment about this is perhaps an extension of, and a formalization of communities of practice.

Engstrand: No doubt about it. That is what I'm talking about is you're using the wisdom of the crowds, but in this case, the crowds are the communities of practice.

Someone had asked earlier, how do you generate this risk score? The practice for generating the risk score, that's the curating process. A community of practice may decide that a certain unit testing framework is to be discouraged. Senior management doesn't see any risk in using the old test framework, so that might get a score of zero. Meaning, that may be discouraged, but that has the lowest priority. Upgrade everything else before you upgrade this. That's how you bubble up the wisdom of the crowds, and yet you still have a curating, prioritizing thing.

Richardson: Certainly, the emphasis in your thought seems to be on the category of tech debt, which is out of date dependencies. Is that a fair assumption?

Engstrand: You're right. I do focus on that a lot. There's also a portfolio based part of the TCP that also documents for the purposes of getting it on the roadmap for product, these are the areas of our products that are crufty. A monolith that needs to be split up. We have a lot of copy and paste over here, we need to refactor this code. That's how they get it in the TCP as well. That's a little less of a thing where someone can assign a risk score to it, or the risk score is just whatever the engineers say it is. Like an old version of Cassandra that has security vulnerabilities is no longer supported, it's easy to reach consensus on that.

Richardson: Yes, it's very black and white, probably the notion of out of date, and action if there's explicit security vulnerabilities. Then you can go, definitely, we got to do something about that. You take a step back and tech debt is a broader category of the code is a mess. There are code level smells, or even architectural smells. In the previous talk by Selina, Airbnb, "We outgrew our monolith. We knew we needed to migrate to services." Those are very important yet harder forms of tech debt to score, in a sense, and then convince the business to embrace the elimination of.

Engstrand: That's why we bundle it together in the same document. Once we've already got them agreeing to, these engineers know what they're talking about. Yes, this makes a lot of sense. You're right, that's an old version of Java. Then we bundle right in there. The same engineers whose opinions you value, also say, this monolith needs to be split up. People treat it seriously. This is not just somebody's opinion. This is a consensus thing, even for breaking up a monolith. It's not going to be one engineer's opinion, but it might just be that product, or that team's opinion, to break up the monolith. Still, it's a consensus. It's treated a little more seriously than just, I'm frustrated with building this or something like that. By the time you're breaking up a monolith, many groups are ready for it. Product usually has a lot of release anxiety over releasing monoliths, and so by the time you're breaking up a monolith, product is probably on board with it anyway.

Richardson: Yes, maybe. Looking at the clients I've worked with, on the one end of the spectrum, it's driven by engineering, because they recognize there are problems, but perhaps the business is less aware. Whereas some clients, it's like, the business has actually been driving the refactoring. For example, one client I worked with, the monolith was buggy as heck, because it was too big, couldn't be tested effectively. It was costing sales. In the same way that features are appealing because they lead to revenue, if you can tie tech debt to loss of revenue, which, in a sense is like security vulnerabilities. That's a loss of revenue, ultimately.

Engstrand: A TCP, what you don't want is, "Everything's fine. We can't release anymore." You don't want that. What you want is, if we don't get this thing handled in the next year, we're going to be in a bad situation. Let's fix it before there's a loss of revenue. Hopefully, understanding if we don't fix it, you wait long enough, there will be a loss of revenue. Hopefully, we can have that conversation. We don't have to get to the point where you're already hemorrhaging before you start applying the Band-Aid.

Richardson: It's like, if you can look at the trajectory, and anticipate that, yes, a year from now, we will have seriously outgrown our architecture, for example. We need to be factoring into that.

Engstrand: That's part of what the TCP does. It's not perfect. Optum Digital is not perfect. We've got warts. We've got problems.

Richardson: You're not perfect?

Engstrand: No. Believe it or not. No, it's hard to believe, but it's true. Ask, what about this? What about this old crumbling monolith you have that we're all afraid of? It's on the roadmap. I've already got it approved. We should have it in Q3. You've got something to say. You've got a story for it, not just, "I don't know." We're all in denial about it.

Richardson: It's like, I'm a developer. I've created a mess that I need to clean up. It basically goes without saying that every organization has cruft to varying degrees.

Engstrand: It's the classic thing. It's not like tech debt is bad. If you have zero tech debt, that means you're probably pretty behind in the marketplace, because you didn't take any shortcuts. You did everything the long, slow, perfect way. You're probably not the leader in the marketplace. You're going to have some tech debt, but that doesn't mean you don't clean it up. You eventually clean it up.

Richardson: Do you have a final word of wisdom, you want to share?

Engstrand: It is all about the community. That's the most important part. Making tech debt a community effort helps in terms of your corporate culture, and it makes engineers feel responsible, like they're not just out of control, constantly, just whatever it takes to push the next feature out. Hopefully, it does it in a way that doesn't antagonize product.

 


Recorded at:

May 05, 2022
