
Managing Technical Debt in a Microservice Architecture


Key Takeaways

  • Technical debt is the set of decisions made during software development that reduce teams' capacity to build features bringing value. If left unchecked, it can lead to the inability to compete successfully in the marketplace or the complete collapse of complex software systems.
  • Making technical debt a community effort helps corporate culture, making engineers feel responsible.
  • Building a Technology Capability Plan systematises the management of technical debt and makes what must be fixed and when clear to all stakeholders, giving negotiating leverage to engineering and promoting long-term thinking.
  • Capturing technological risk as a numeric score in a balanced scorecard makes it visible to management.
  • Alternatives to the TCP, like SLOs with error budgets, promote short-term thinking, make it hard to plan work to fix the technical debt and tend to create antagonistic relationships.

At QCon Plus, Glenn Engstrand spoke about a methodology to facilitate technical debt management. Most people involved in software development have struggled to get product or project managers to agree to let them spend time fixing their project’s technical debt. The method used by Engstrand at Optum Digital (formerly Rally Health) allows managing those prioritisation problems in a systemic and non-confrontational way.

What is technical debt?

Broadly speaking, technical debt is the set of decisions taken during software development that reduce teams’ capacity to build features that bring value.

The following exchange should be familiar. The product manager describes the next feature they want added to the product. Developers give an estimate for the implementation time, which is seen as too long. The developers explain that they have to deal with the implications of changing lots of hard-to-understand code or work around bugs in old libraries or frameworks. Then the developers ask for time to address these problems, and the product manager declines, referring to the big backlog of desired features that need to be implemented first.

If left unchecked long enough, this vicious cycle can lead to the inability to compete successfully in the marketplace or even the complete catastrophic collapse of complex software systems.

Technical debt comes in two flavours: one relates to choosing an easy or fast solution instead of the best solution; the other is linked to technical stack obsolescence or inadequacy. Both flavours cause engineering time to be spent dealing with accidental complexity instead of building value or fixing bugs.

Paying down technical debt while still delivering features at a competitive velocity can be difficult, and it only gets worse as system architectures get larger. Managing technical debt for dozens or hundreds of microservices is much more complicated than for a single service, and the risks associated with not paying it down grow faster.

Every software company gets to a point where dealing with technical debt becomes inevitable.

At Optum Digital, a portfolio (also known as a software product line) is a collection of products that, in combination, serve a specific need. Multiple teams get assigned to each product, typically aligned with a software client or backend service. There are also teams for more platform-oriented services that function across several portfolios. Each team is most likely responsible for several software repositories. There are more than 700 engineers developing hundreds of microservices. They take technical debt very seriously because the risks of it getting out of control are very real.

Our CTO originally blogged about the importance of engineering investments in 2018, about two years after the company initially broke up its monolith into microservices.

The Technology Capability Plan

The company’s engineers came up with a Technology Capability Plan (TCP) to tackle the technical debt problem.

The TCP is a community-based method to create plans for paying down technical debt. It is used by and for engineering to signal intent to both engineering and product. It does so by collecting, organising, and communicating the ever-changing requirements in the technology landscape that matter when architecting for longevity and adaptivity. In other words, it can be used to point out when the company will be in a bad situation if specific measures are not taken in time.

Engineering communities are encouraged to create plans for paying down technical debt, following a specific format. A risk score is then calculated for each area of technical debt, and priorities are set based on that risk score. The prioritised plans are used to negotiate engineering time for paying back technical debt with product managers in a positive and practical way.

Engineering Communities

Engineering communities are formed transversally to the organisation; in other words, they are not associated with a specific team or product. Often, engineers are drawn to these communities of practice because of their passion for working with the same technologies, which means that communities can be open and grow organically.

However, communities that are considered of strategic value can be made invite-only. If membership is curated, the focus should be on culture add over culture fit, and the selection should also ensure representation and diversity.

They usually have dedicated communication resources (e.g., wiki topic, chat channel, and email list) for ongoing conversations and resource sharing.

Policies are set bottom-up through these engineering communities, which is the key to the effectiveness and authenticity of the TCP. 

The monthly community meeting is recorded and shared, and its summary is sent to all engineers. Each community's part of the TCP is a living document kept up to date based on meeting activity. Once a quarter, these documents are collected and published internally as the TCP.

Building Plans for Paying Down Technical Debt

Each community’s plan contains about one page of an explanatory narrative and a table representing the desired technological evolution over time.

Each table row is dedicated to one technology (a programming language, framework, library, or platform as a service) and its preferred, acceptable, discouraged, or unacceptable (PADU) versions. The rows are naturally particular to the organisation’s tech stack(s).

Each column represents one period of time (e.g., a quarter or a year). The entire table is meant to represent the next three years.

Each table cell contains the technology’s version lifecycle status during the column’s timeframe: plan, deprecate, migrate, use, or remove. 

The plan status indicates the need to plan for an upgrade. Deprecate means that teams can no longer adopt the technology version. The migrate status suggests that each team should actively migrate to the appropriate version. The use status means that this technology version should be what you are using. The remove status means that the technology could be made unavailable at any time in that period.
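As a rough illustration of the table just described, one row per technology version and one status per period could be modelled as below. This is only a sketch: the technology name, versions, quarters, and the helper names `status_for` and `may_adopt` are hypothetical and do not come from the actual TCP. Treating migrate and remove as also blocking new adoption is an assumption that goes slightly beyond what the article states for deprecate.

```python
from enum import Enum


class Status(Enum):
    """The five lifecycle statuses a table cell can hold."""
    PLAN = "plan"
    DEPRECATE = "deprecate"
    MIGRATE = "migrate"
    USE = "use"
    REMOVE = "remove"


# Hypothetical plan rows: (technology, version) -> {period: status}.
# All names and periods are made up for illustration.
PLAN_TABLE = {
    ("example-framework", "2.x"): {
        "2024-Q1": Status.DEPRECATE,
        "2024-Q2": Status.MIGRATE,
        "2024-Q3": Status.REMOVE,
    },
    ("example-framework", "3.x"): {
        "2024-Q1": Status.USE,
        "2024-Q2": Status.USE,
        "2024-Q3": Status.PLAN,
    },
}


def status_for(technology, version, period):
    """Look up the lifecycle status of a technology version in a period."""
    return PLAN_TABLE[(technology, version)][period]


def may_adopt(technology, version, period):
    """Return False once a version is deprecated or further along.

    Assumption: migrate and remove block new adoption just like deprecate.
    """
    blocked = {Status.DEPRECATE, Status.MIGRATE, Status.REMOVE}
    return status_for(technology, version, period) not in blocked
```

A team could call `may_adopt("example-framework", "2.x", "2024-Q1")` before starting a new service to check whether a version is still open for adoption.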

The explanatory narrative sets the context in which each technology should be used and the impact or consequences for every team that does not follow the plan. These community-driven plans help the organisation manage the aspect of tech debt that deals with using outdated, non-secure, or unsupported technology versions.

Each portfolio also gets to submit a plan for inclusion in the TCP. Portfolio-driven plans signal the intent to pay down the other forms of tech debt, such as refactoring large codebases or splitting up monolithic services into many smaller services.

In addition to the community and portfolio sections of the TCP, there's also an introduction laying out the plan’s vision and a section calling out the product areas that currently have the most risk.

What is Technological Risk? 

Engineering communities and principal portfolio engineers develop plans for paying down technical debt, which results in an extensive list of engineering investments to make. Knowing that resources are limited, how do you prioritise the list? Product management doesn't know how to do that because engineering investments don't come from them. The answer comes from understanding the relative risk of not following each part of the plan.

This risk is captured as a numerical risk score. The higher the score, the greater the risk, and thus the higher the priority. Technologies with the use status always have a score of zero. The non-zero risk score can increase for each period in which a technology version has the migrate, deprecate, or remove status.
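The scoring rule can be sketched as a simple lookup from status to score. The article does not publish the actual values, so the weights below are hypothetical, as is the choice to score the plan status at zero; only the zero score for use comes from the description above.

```python
# Hypothetical weights: only "use" == 0 is stated in the article.
# The other values, including "plan" == 0, are assumptions for illustration.
RISK_WEIGHT = {
    "use": 0,
    "plan": 0,
    "deprecate": 1,
    "migrate": 2,
    "remove": 3,
}


def risk_score(status):
    """Return the assumed risk score for a single lifecycle status."""
    return RISK_WEIGHT[status]


def risk_by_period(row):
    """Map each period of a plan row to its risk score.

    `row` is a dict of {period: status} for one technology version.
    """
    return {period: risk_score(status) for period, status in row.items()}
```

With these assumed weights, a version that moves from use through deprecate and migrate to remove sees its score rise from one period to the next, which is what pushes it up the priority list.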

While each plan comes from the bottom up, the risk score comes from the top down. The plan comes from the engineers participating in the communities, and how that list is prioritised comes from executive engineering management.

Optum Digital’s metrics are collected into what is known as a balanced scorecard, which is a strategy performance management tool that originally came from Harvard Business School.

For that purpose, each technology from the various plans is rolled up by product. The per-product risk score is the sum of the risk scores for that product. A product is penalised by a technology's risk score if even a single one of its repositories still uses or depends on a technology version with the deprecate, migrate, or remove status. That risk score is counted only once, even if several repositories are out of compliance. The median of these per-product risk scores is recorded in the balanced scorecard.
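A minimal sketch of this roll-up, assuming the dependencies of each repository are already known (for example from static analysis). The function names, product and repository names, and risk values are all hypothetical.

```python
from statistics import median


def product_risk_score(repos, tech_risk):
    """Sum the risk of every risky technology used anywhere in the product.

    `repos` maps repository name -> set of technology versions it depends on.
    A technology is counted once, no matter how many repositories use it.
    """
    risky = set()
    for technologies in repos.values():
        risky |= {t for t in technologies if tech_risk.get(t, 0) > 0}
    return sum(tech_risk[t] for t in risky)


def scorecard_metric(products, tech_risk):
    """Median of the per-product risk scores, as recorded in the scorecard."""
    return median(product_risk_score(repos, tech_risk)
                  for repos in products.values())


# Hypothetical data for illustration only.
tech_risk = {"framework-2.x": 3, "lang-1.8": 2, "framework-3.x": 0}
products = {
    "product-a": {
        "repo-1": {"framework-2.x"},
        "repo-2": {"framework-2.x", "lang-1.8"},
    },
    "product-b": {"repo-3": {"framework-3.x"}},
    "product-c": {"repo-4": {"lang-1.8"}},
}

print(product_risk_score(products["product-a"], tech_risk))  # 5, not 8
print(scorecard_metric(products, tech_risk))                 # 2
```

In this example `framework-2.x` appears in two repositories of `product-a` but contributes its score only once, so the product scores 3 + 2 = 5.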

It makes sense to use automated static code analysis on your repositories in determining technology dependencies. You will also need good support for CI/CD, DevOps, and GitOps to quickly and reliably calculate this metric.

We also calculate the TCP risk score in a second way to help focus the teams for each product. In this metric, each technology from the plans is rolled up by repository. The per-repository risk score is the sum of the risk scores for that repository.

The aggregate risk scores for the repositories for a product are summed up as the aggregate risk score for the product itself. In this manner, we can track risk burndown for each product or make product comparatives based on TCP risk. 
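The per-repository variant can be sketched as follows. Unlike the scorecard metric, a risky technology here counts once for every repository that depends on it. Names and risk values are again hypothetical.

```python
def repository_risk_score(technologies, tech_risk):
    """Sum the risk scores of all technology versions one repository uses."""
    return sum(tech_risk.get(t, 0) for t in technologies)


def product_aggregate(repos, tech_risk):
    """Return per-repository scores and their sum for one product.

    `repos` maps repository name -> set of technology versions it depends on.
    """
    per_repo = {
        name: repository_risk_score(technologies, tech_risk)
        for name, technologies in repos.items()
    }
    return per_repo, sum(per_repo.values())


# Hypothetical data for illustration only.
tech_risk = {"framework-2.x": 3, "lang-1.8": 2}
repos = {
    "repo-1": {"framework-2.x"},
    "repo-2": {"framework-2.x", "lang-1.8"},
}

per_repo, total = product_aggregate(repos, tech_risk)
print(per_repo)  # {'repo-1': 3, 'repo-2': 5}
print(total)     # 8
```

Because every non-compliant repository adds to the total, the number goes down each time a team migrates a single repository, which is what makes it usable for tracking burndown.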

Getting Engineering Investments on the Roadmap

Now that we have a prioritised plan for paying down tech debt forged by engineers and blessed by leadership, how do we get that plan funded and on the roadmap?

First, let's review what typically happens without a TCP when the engineering manager sits down with the product manager to lay out the development schedule for the next sprints: in a non-TCP environment, it is just the engineering manager versus the product manager. The product manager can always invoke sales to get their way. 

Let's revisit that same negotiation, this time with the TCP in place. The product manager, fresh out of a meeting with the executives about the importance of the TCP and how we have to lower our TCP risk score in the balanced scorecard, sits down with the engineering manager to lay out the schedule for the next sprints. The product manager asks for three features. The engineering manager says: “If it were up to me, I would give you all three of those features. Unfortunately, all of our engineers and their executives have identified this super risky technical debt that needs to be paid down this quarter. You would have already been aware of this had you been tracking the TCP for the past year.”

Do you see how the dynamic changed? Because the TCP is an authentic enterprise-wide consensus on tech debt and its risks, the engineering manager can use that collective bargaining power when it comes time to get engineering investments on the roadmap without any yelling or threats. 

In the unlikely event that the product manager continues to be inflexible about permitting engineering investments on the roadmap, the chain of escalation will eventually reach the executive level. Remember that the risk score is part of the balanced scorecard? For executives, that balanced scorecard is their dashboard. It is how they see what direction the company is going in. Getting that metric on the dashboard makes tech debt very real for them, which increases the chances that they choose engineering investments like paying down tech debt.

Alternatives to the TCP 

The only other systemic approach to managing tech debt that I know of is documented in the Google Site Reliability Engineering book.

Here's a quick take on that approach and why I believe the TCP is better.

First, you gain consensus on a list of SLOs or service level objectives. Every time your system falls out of one of those SLOs, you count that as an error. You have an agreed-upon number of acceptable errors per time window, known as your error budget. You're no longer permitted to release any more features if your system has exceeded its error budget until the next time window.
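The release gate described above can be sketched as a small counter. This follows the simplified description in this article (a count of SLO violations per time window), not any particular tool; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Counts SLO violations in the current window against an agreed budget."""
    allowed_errors: int       # agreed-upon number of acceptable errors
    observed_errors: int = 0  # violations seen in the current window

    def record_violation(self):
        """Call whenever the system falls out of one of its SLOs."""
        self.observed_errors += 1

    def releases_allowed(self):
        """Feature releases stop once the budget has been exceeded."""
        return self.observed_errors <= self.allowed_errors

    def start_new_window(self):
        """Reset the count at the start of the next time window."""
        self.observed_errors = 0
```

For example, with `ErrorBudget(allowed_errors=2)`, a third recorded violation makes `releases_allowed()` return `False` until `start_new_window()` is called.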

To avoid this situation, product managers are supposed to be more willing to divert engineering resources for paying down tech debt.

Here is why I find the Google SRE approach to be highly destabilising. The cause and effect between features and sales appears to be more authentic to most product managers than the cause and effect between technical debt and outages. There is a presumption that tech debt reduction releases will always make the system more stable. While that is hoped to be true in the long run, it is not guaranteed to be true in the short term.

Since error budgets promote short-term thinking, this will discourage product managers from approving such engineering investments. It is hard to predict when error budgets will be exceeded and, therefore, to plan when to schedule engineering investment time.

This approach tends to pit product managers against engineering in an antagonistic relationship. The rigidity of the method makes it riskier to adopt and, therefore, harder to get approved by executives. Finally, this approach tends to politicise the paying down of tech debt: product managers attempt to game the system by convincing executives not to count certain outages against their error budgets, or by renegotiating SLOs or error budgets to delay the consequences of starving out engineering investments.

The TCP approach focuses on reaching an authentic consensus between product and engineering. TCP-driven development is more predictable regarding the roadmap, so all involved parties have fewer stressful surprises. 

Summary

Will a technology capability plan solve all of your engineering problems? Of course not.

Will you still have tech debt? Absolutely.

Will you still need to take shortcuts to deliver features in customer-driven timelines? I'm sure that you will.

The TCP is not intended to hinder or restrict engineering and product from doing what they do best: making and releasing software. The TCP signals to both engineers and product managers that there are additional costs to taking necessary shortcuts and that they cannot ignore those costs indefinitely.

With the TCP, you don't wait until outages are racking up to start paying down that tech debt. No process, policy, technology, or tool can ever serve as a sensible substitute for Quality Engineering.

The TCP documents the consensus among engineers on what the riskiest tech debt is and when it is reasonable to pay it down. For the TCP to be respected, its plans must be relevant, accurate, compelling, and believable. That can happen only if its contributors are seasoned and mature professionals with strong engineering skills and integrity.

I think that this quote from our TCP document sums it up:

Architecting for longevity and adaptivity requires a deep understanding of both today's realities and tomorrow's possibilities. It requires an appreciation for the technology and market forces driving it. It requires a long term commitment to focused and incremental progress.
