Understanding Large Codebases with Software Evolution
Adam Tornhill, author of Your Code as a Crime Scene, will speak about how software evolution can be used to understand large codebases at the GOTO Amsterdam 2016 Conference.
InfoQ interviewed him about software evolution, how mining social information from code can increase the understanding of large codebases, how to create a geographical profile of code, and the benefits that these techniques can bring.
InfoQ: Can you explain what you mean by "software evolution"?
Tornhill: Software evolution is about understanding how our codebases grow. It turns out that codebases evolve according to certain patterns. Some of them are good, while some patterns indicate serious maintainability problems. My approach is to analyze the history of our codebase. Once we embrace the past, we get a lot of useful information. It’s information that lets us uncover productivity bottlenecks, parts that are hard to maintain, and even parts at risk for defects in our codebases.
InfoQ: Do you have examples of how to mine social information from code?
Tornhill: Sure, there’s so much information we can extract from the evolution of the code. For example, we can build knowledge maps over our codebases. A knowledge map shows which programmer has worked the most in the different parts of the code. It’s information that I use to simplify communication, ensure I get the right people into a design discussion, or just reason about the knowledge distribution in our system.
From there, we can do all kinds of interesting analyses on the data. My favorite analysis is to identify code that suffers from excess parallel development. That is, code that’s being modified by multiple programmers all the time. Such code is at risk for defects, is likely to be a coordination bottleneck, and hints at a design problem since code that changes frequently does so for a reason.
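As a rough illustration of the two analyses described above, the following Python sketch builds a simple knowledge map and counts distinct authors per file. It is not Tornhill's tooling: the function names, the (author, file) input format, and the sample data are all assumptions made for this example. In practice the pairs might be parsed from version-control history, for instance from the output of `git log --name-only`.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: one (author, file) pair per file touched in a commit.
CHANGES = [
    ("alice", "core/engine.py"),
    ("alice", "core/engine.py"),
    ("bob", "core/engine.py"),
    ("carol", "core/engine.py"),
    ("bob", "ui/view.py"),
    ("bob", "ui/view.py"),
]


def knowledge_map(changes):
    """Map each file to the author with the most recorded changes to it."""
    per_file = defaultdict(Counter)
    for author, path in changes:
        per_file[path][author] += 1
    return {path: counts.most_common(1)[0][0] for path, counts in per_file.items()}


def author_counts(changes):
    """Map each file to the number of distinct authors who changed it."""
    authors = defaultdict(set)
    for author, path in changes:
        authors[path].add(author)
    return {path: len(names) for path, names in authors.items()}


if __name__ == "__main__":
    print(knowledge_map(CHANGES))
    print(author_counts(CHANGES))
```

A file with many distinct authors, such as `core/engine.py` in the sample data, would be a candidate for a closer look at parallel development. Counting commits is only one possible proxy for knowledge; weighting by lines changed or by recency would be an equally reasonable choice.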
InfoQ: How does social mining increase the understanding of large codebases?
Tornhill: I analyze a lot of codebases as part of my day job at Empear. And I often find that organizational problems are mistaken for technical issues. The main reason is that social information is invisible in the code itself. That leads us to focus on solving the wrong problem. Let’s look at some specific examples here.
Different organizations tend to experience several common problems. For example, I’ve seen a lot of cases with tricky merges of different feature branches, struggles to predict release quality, and complaints about code that is hard to understand. However, it often turns out that the real problem is social; there’s a misalignment between how the system is designed and how you actually work with the code. Focusing on a technical solution, such as better merge or diff tools, will only help you relieve the symptoms. Instead, the first step towards real improvement is to measure and understand the true cause behind the problem.
InfoQ: At QCon London 2015, you gave the talk "Treat Your Code As a Crime Scene" in which you explained how to create a geographical profile of code. Can you briefly describe how this works?
Tornhill: Most discussions around code quality tend to center on code complexity. However, complexity is only a problem when you need to deal with it. And if you look at data from how our codebases grow, you’ll see that our development efforts tend to be focused on relatively few modules; most of our code is rarely, if ever, touched. That means we’d like to prioritize improvements to the parts of the code where we work the most. This is actually a hard problem.
When I got into forensics, I realized that crime investigators face similar open-ended, large-scale problems to ours. The techniques I presented in "Treat Your Code as a Crime Scene" are based on identifying patterns in the geographical distribution of crime scenes. From there, I apply the same basic principle to code. I use the history of the codebase to identify the parts of the code where we work the most and combine that with a basic measure of code complexity. I call the overlap between these two dimensions a "hotspot". A hotspot represents complicated code that we also have to work with often. A hotspot analysis is a great tool to prioritize improvements/refactorings and still be pretty sure that we get a real effect back.
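The hotspot idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the analysis from the talk: it uses lines of code as the "basic measure of code complexity", multiplies that by the number of changes to get a single score, and the function name `hotspots`, the `top` parameter, and the sample data are all hypothetical.

```python
from collections import Counter


def hotspots(changed_paths, complexity, top=3):
    """Rank files by change frequency multiplied by a complexity proxy.

    changed_paths: one entry per recorded change to a file.
    complexity: mapping from file path to a complexity measure
                (assumed here to be lines of code).
    """
    revisions = Counter(changed_paths)
    scores = {
        path: revisions[path] * complexity[path]
        for path in revisions
        if path in complexity  # skip files that no longer exist
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical data: change history and lines of code per file.
CHANGED = ["a.py"] * 12 + ["b.py"] * 1 + ["c.py"] * 9
LINES_OF_CODE = {"a.py": 800, "b.py": 1500, "c.py": 200}

if __name__ == "__main__":
    for path, score in hotspots(CHANGED, LINES_OF_CODE):
        print(path, score)
```

In the sample data, `b.py` is the largest file but ranks last because it is rarely touched, which mirrors the point that complexity only matters where the work actually happens. Multiplying the two dimensions is just one simple way to combine them.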
InfoQ: Which benefits can be gained from techniques like mining social information and geographical profiling?
Tornhill: There are several big wins here. The most obvious is that information like this helps us focus our improvements where they are needed the most. Another benefit is that we can support our decisions with data. And that’s a key point where I think the software industry lags behind many other disciplines.
Finally, the data we’re able to gain from evolutionary approaches is information that we just cannot get from the code itself. For example, using the history of our code, we’re able to uncover expensive change patterns, evaluate them against our architectural principles, and see how well they align with the way we’re organized. There’s just no concept of time in the static structure of our code. When we embrace the past, we add that missing dimension and are able to reason more efficiently about our ways of working. It’s an exciting field with a lot of promise.
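One kind of change pattern that only shows up in history is files that repeatedly change in the same commit. The interview does not name a specific technique, so the following Python sketch is an assumption about what such an analysis could look like; the function name `co_changes`, the `min_shared` threshold, and the sample commits are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_changes(commits, min_shared=2):
    """Count how often each pair of files changed in the same commit.

    commits: iterable of lists, each holding the files touched by one commit.
    Returns only the pairs that changed together at least min_shared times.
    """
    pairs = Counter()
    for files in commits:
        # Sort so each pair has one canonical order; set() drops duplicates.
        for pair in combinations(sorted(set(files)), 2):
            pairs[pair] += 1
    return {pair: count for pair, count in pairs.items() if count >= min_shared}


# Hypothetical history: each inner list is the set of files in one commit.
COMMITS = [
    ["billing/invoice.py", "reports/export.py"],
    ["billing/invoice.py", "reports/export.py", "README.md"],
    ["ui/view.py"],
    ["billing/invoice.py", "reports/export.py"],
]

if __name__ == "__main__":
    print(co_changes(COMMITS))
```

Pairs that change together across module or team boundaries, like the billing and reports files in the sample data, could then be compared against the intended architecture and team structure.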
GOTO Amsterdam 2016 will be held June 14-15. It is a practitioner-driven enterprise software-development conference designed for team leads, architects, and project management. InfoQ will cover the conference with Q&As, summaries, and articles.