Continuous product health can be achieved by regularly prioritizing the highest impact technical debt items and knocking those off systemically. Yvette Pasqua, CTO of Meetup, recommended continuously iterating on how you tackle technical debt to drive more and more impactful results. She suggested going for maximum impact items first and communicating the impact of paying down technical debt.
Pasqua spoke about tackling technical debt at scale at Craft 2017. InfoQ is covering the conference with Q&As, summaries and articles.
InfoQ interviewed Pasqua after her talk about quantifying technical debt, improving the health of products, what they learned while addressing technical debt, the tools Meetup uses for measuring and reducing technical debt, and how to establish a culture that supports product health.
InfoQ: Can you quantify technical debt?
Yvette Pasqua: In my research and experience, I actually recommend against spending much time trying to quantify technical debt. If you’re serious about doing a thorough job, it will take a lot of time, and often your efforts will not end up being accurate in the end because there are many different types of technical debt and most of them start off as undefined and qualitative items.
In my talk I referenced a great article, Can Technical Debt Be Quantified, written by Matt Holford, CTO of DoSomething.org, where he writes about quantifying technical debt and that "technical debt is not always quantifiable." In fast moving tech companies, I don’t find much value in trying to quantify technical debt. Instead, I think effort should go towards regularly prioritizing the highest impact technical debt items and knocking those off systemically as part of your organization’s approach to tackling technical debt over the long run. So, my recommendation is to prioritize impact over quantifying, jump right in, and move as fast as you can on the highest priorities.
InfoQ: At Craft you talked about continuous product health. Can you elaborate on what you mean by this?
Pasqua: We needed to get the whole company aligned and excited about tackling technical and product debt because it was such a big initiative. We knew to be successful, we couldn’t keep it just an engineering initiative and we needed to align everyone at the company about what technical debt is, why it’s important to pay down, and what our plan was for making big impact. So, we branded it something that was easy, clear, and positive for the whole company to talk about and remember: Continuous Product Health. Our short definition of it is: "Continuous attention to whole codebase and not just product features." Product Health is also something that is positive and clear that everyone wants. It was important to talk about it in this way, especially for people outside engineering, so they could envision all the time we were spending on Continuous Product Health as something that was helping the whole company reach our product strategy and company goals faster.
InfoQ: How did you improve the health of your products?
Pasqua: Some of the highest impact technical debt items that we worked on tackling have been:
- We moved from operating our own RabbitMQ servers in our bare metal data centers to using AWS SQS. We didn’t want to spend our engineering time on operating, maintaining, and scaling messaging solutions ourselves because we wanted to focus our time on iterating Meetup product towards our mission. We took a look at various managed solutions and decided we’d see the big engineering productivity, improved system simplicity, and improved reliability gains we wanted in moving to AWS SQS. So far, we’ve been very happy with the decision.
- We wanted to improve the unit test coverage of our very large monolithic codebase that historically had very poor coverage. We focused on writing tests for all new code or any code an engineer updated as part of their work. This would help us get to Continuous Deployment of our monolith faster because we wanted to be able to rely on the bottom of the testing pyramid as much as possible. We also focused not only on total test coverage but also on getting as close as possible to 100% coverage for anything we were modifying, which we treated as "active code". We set test coverage goals and used Coveralls to measure coverage and get the metrics in front of engineers regularly, and we saw coverage increase dramatically.
- One of the biggest areas of technical debt that we paid off over the past year was moving from our bare metal data centers to the cloud. There’s a lot to talk about here but one of the things that guided our approach was to not do a "lift and shift" and instead really try to use managed services wherever we could to get the most benefit from day 1 on the cloud. We also took an approach to migrate as many parts of the system over as stateless "cattle" and build infrastructure as code so that we were happily expecting parts of the system to be able to go down without state mattering and to be able to spin up or auto scale new instances. We reduced the number of "kittens" in our system to only a few that required stateful manual intervention.
I wrote about our migration in my blog post Moving Meetup to the cloud.
InfoQ: What did you learn while addressing technical debt?
Pasqua: Some of the key things we learned during our big initiative to tackle technical debt:
- Continuously iterate on how you’re tackling technical debt to make a bigger and bigger impact. Just like how you’re constantly iterating product towards product market fit, you need to continuously iterate how you’re tackling technical debt to drive more and more impactful results. At first you might not entirely succeed, but keep working as fast as you can, measure impact, fail and learn fast, and then apply and keep iterating.
- You’ll likely need to work extra hard on communicating, especially outside of engineering, the impact of paying down technical debt. You need to paint a clear picture of what that impact is to the product strategy and company goals and constantly tie it back to those things. Otherwise, people just will not understand why so much time should be spent on paying down debt.
- Go for maximum impact items first and don’t fall into the common trap of working on a lot of low effort and low impact items. Just like iterating on your product, unless you make a conscious effort to prioritize high impact items on your roadmap, teams and individuals will often gravitate to a lot of low impact and low effort items that are often easier to plan, spec, execute, and complete. As an engineering leadership team, we learned we needed to be active in strategically prioritizing the product health items that are worked on so we could guide teams towards fewer, high impact items rather than many low impact ones.
InfoQ: Which tools do you use for measuring and reducing technical debt?
Pasqua: We’re measuring reduction in technical debt by looking at the three guiding design principles we established for the systems and engineering culture we know we need to reach our company goals. They are designing for: system simplicity, platform reliability, and speed. We measure each of them via a variety of metrics such as:
- How many PRs was our team launching to production per day? We set a 2017 goal of increasing that by over 30%. That is the main way we’re measuring engineering speed.
- Speed: we measure things within our engineering tooling and pipeline that we’ve identified as bottlenecks that hold our engineers back from shipping product faster and are constantly setting higher and higher goals for ourselves. For example we measure and have projects to improve: build time, automation test duration, deployment time, and time to onboard engineers onto new frameworks.
- Reliability: we measure uptime and currently have a five 9s goal. We also measure the number of pages overall and of particular parts of the system using PagerDuty and have a goal to reduce those by about 30%.
- Simplicity: this is the hardest for us to measure. We think that one big positive effect of simplicity is fewer unintended consequences of software we’re shipping. So, one metric that’s applicable here is we measure the time engineers spend per week resolving major or critical regressions.
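To make the goals above concrete, here is a minimal Python sketch of the arithmetic behind them. It is not Meetup's tooling: the function names and the baseline figures (20 PRs per day, 100 pages) are hypothetical, chosen only to illustrate what a "five 9s" uptime goal and a 30% change would mean in numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def allowed_downtime_minutes(nines: int, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime budget for an availability target of `nines` nines.

    Five nines means 99.999% availability, so the allowed
    unavailability is 10^-5 of the period.
    """
    unavailability = 10 ** (-nines)
    return period_minutes * unavailability


def target_after_change(baseline: float, percent_change: float) -> float:
    """Target value after a percentage change (negative means a reduction)."""
    return baseline * (1 + percent_change / 100)


if __name__ == "__main__":
    # Hypothetical baselines; the interview states only the goals.
    budget = allowed_downtime_minutes(5)
    print(f"Five 9s downtime budget: {budget:.2f} minutes per year")
    print(f"PRs/day goal from 20/day at +30%: {target_after_change(20, 30):.1f}")
    print(f"Page goal from 100 pages at -30%: {target_after_change(100, -30):.1f}")
```

A five 9s goal leaves a budget of roughly five and a quarter minutes of downtime per year, which shows how demanding that reliability target is.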
InfoQ: What did you do to establish a culture that supports product health?
Pasqua: There were some big things we learned and did over the past year and a half that have really contributed to an engineering and company culture change towards continuously addressing product health that honestly, we’re still in the middle of changing. The biggest one was kicking it off by aligning everyone on what product health was, why it was important, and how we’d start tackling it. That included making sure we clearly and in non-technical terms communicated the "what" and "why" behind the projects so that everyone was aligned with the goal of the projects, what product/company goal it helped us achieve, and what engineering culture item (simplicity, reliability, speed) it had impact on. That last part is important because one thing we started doing was looking at how each potential item we worked on measured up as far as positively impacting simplicity, reliability, and speed and we prioritized items that had the highest impact on those three. It was really important to align the actual work engineers are doing with the culture change we wanted so that everyone was working towards the same goals.
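The prioritization described here, rating each candidate item on simplicity, reliability, and speed and picking the highest-impact ones, could be sketched as follows. This is only an illustration under stated assumptions: the 0 to 5 scale, the equal weighting, the `DebtItem` and `prioritize` names, and the example items are all hypothetical and do not come from Meetup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    """A candidate product health item with hypothetical 0-5 impact scores."""
    name: str
    simplicity: int
    reliability: int
    speed: int

    @property
    def impact(self) -> int:
        # Assumption: the three principles are weighted equally.
        return self.simplicity + self.reliability + self.speed


def prioritize(items: list[DebtItem], top_n: int = 2) -> list[DebtItem]:
    """Return the `top_n` items with the highest combined impact.

    The goal is fewer, high-impact items rather than many low-impact ones.
    """
    return sorted(items, key=lambda item: item.impact, reverse=True)[:top_n]


if __name__ == "__main__":
    # Illustrative candidates with made-up scores.
    backlog = [
        DebtItem("candidate-a", simplicity=1, reliability=1, speed=2),
        DebtItem("candidate-b", simplicity=4, reliability=5, speed=3),
        DebtItem("candidate-c", simplicity=3, reliability=2, speed=5),
    ]
    for item in prioritize(backlog):
        print(item.name, item.impact)
```

A real process would likely also weigh effort and company goals; the point of the sketch is only that an explicit score per principle makes the ranking discussable outside engineering.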