Technical Debt and Team Morale when Maintaining a Large System
Thomas Bradford talked about his experience maintaining a monolithic Java-based system with zero test coverage and significant technical debt at Agile Testing Days 2015. InfoQ interviewed him about the problems they had maintaining the system, the technical debt that had built up, why they decided to take a different approach, and how they improved team morale.
InfoQ: Can you elaborate on the problems that you had maintaining the large Java system? What was the biggest one?
Bradford: I came to the party late. I was hired as VP of Engineering at the beginning of last year, tasked with helping the developers work past the quality problems that had been plaguing them for nearly ten years. Specifically, they simply couldn’t modify the system or fix bugs without introducing even more.
InfoQ: Can you describe the "technical debt" that you had?
Bradford: The current incarnation of the system had been built in a very short period of time, and as a monolith. As a result, there was little to no test coverage. Additionally, the code was a mess, riddled with duplication and incredibly long methods. It was a maintenance nightmare.
Complicating this, little to no effort was being expended on getting things under control. The bugs were handed to working students to address, rather than to the teams that produced them, and the teams were being pushed to "go fast," supplying the students with bugs for the rest of their lives.
InfoQ: What made you decide to start doing things differently?
Bradford: During my talks with the existing technology leads, they told me that they had essentially rewritten their product two times, each time with similar results. So my recommendation to do things differently was not only strategic, but also philosophical.
Strategically, I knew we were never going to pay down the debt by maintaining this particular Java monolith, and we would simply drown in it, forever, eventually finding ourselves at a point where we’d be incapable of releasing anything, either because we bricked the system or because all of the developers quit in disgust.
Philosophically, I wanted the developers to detach themselves from the mastery they had built up over the previous years and try to approach solving problems in completely different ways, as novices.
InfoQ: Which steps did you take to increase test coverage of the system?
Bradford: Simply put, we didn’t.
There was an initial effort to get the system under control via test automation. Before my arrival, the company had contracted out the creation of a Selenium test suite to replace the manual test matrix that had been used up to that point. We also attempted to build up unit test coverage, but mainly to get particularly nasty parts of the system under control. Neither of these would be the ultimate solution, though.
The real solution was a radical architectural refactoring: taking the monolith, decoupling it into standalone services, and doing so against a completely different stack.
InfoQ: How did you deal with team morale?
Bradford: Team morale was already in the gutter when I arrived. I’m not sure it could have gotten any worse. The developers were pretty much beaten down, afraid to change their software, which is one of the worst feelings you can experience as a software engineer.
Things are getting better though. Having been given trust and autonomy, the devs are overcoming their past fears and are able to explore their creative sides, while at the same time producing software of a much higher quality level than they had in the past.
InfoQ: Do you have any suggestions for dealing with large technical debt and with developers who are mostly unhappy about the situation they are in?
Bradford: Technical debt is something that can be controlled, and must be controlled early. The idea that "we’ll clean it up later" might make your product people happy, but the reality of the situation is that "later means never." If you find yourself in a situation like the one we faced, where the internal architecture of our monolith was simply not amenable to effectively paying down technical debt, then make the case that the company will be paying for it in either case, and that they will have to decide whether or not a short-term reduction in feature output justifies averting the long-term failure of the organization. Most companies want to stay in business.
As professional software engineers, we are responsible for the quality of our work. External actors may apply pressure for us to "go faster" or to compromise on quality, but we’re not the only ones who have to live with the repercussions of those decisions. Eventually these decisions will come back to bite an organization in its ass. It’s our job to make the case for quality, to raise the warning flags, and to insist on doing it "right", even if the greater organization cannot visualize the long-term value in doing so.