
In the Search of Code Quality



Key Takeaways

  • The software development process is very hard to understand due to its complexity. 
  • The complexity leads to many ungrounded beliefs and intuitions of obscure origin. 
  • Recent research into the development process challenges many commonly held beliefs. 
  • Some results of the research are unintuitive, revealing unexpected forces in the process. 
  • In software development non-technical factors often trump the technical ones.

Recently I encountered research on the correlation between the programming language used in a project and code quality. I was intrigued because the results were contrary to what I would have expected. On the one hand, the study could be flawed; on the other hand, many established practices and beliefs in software development are of obscure origin. We adopt them because "everybody" is doing them, or they are considered best practices, or they are preached by evangelists (the very name should be a warning sign). Do they really work, or are they urban legends? What if we look at the hard data? I checked a couple of other papers, and in all cases the results held surprises.

Taking into account how important software systems are in our economy, it is surprising how scarce scientific research on the development process is. One of the reasons could be that software development is very expensive and usually owned by companies that are not eager to let researchers in, which makes experiments on real projects impractical. Recently, public code repositories like GitHub and GitLab have changed this situation by providing easily accessible data, and more and more researchers are trying to dig into it.

One of the first studies based on data from public repositories - titled A large ecosystem study to understand the effect of programming languages on code quality - was published in 2016. It tried to validate the belief - almost ubiquitously taken for granted - that some programming languages produce higher quality code than others. The researchers looked for a correlation between the programming language and the number and type of defects. An analysis of bug-related commits in 729 GitHub projects developed in 17 languages indeed showed the expected correlation. Notably, languages like TypeScript, Clojure, Haskell, Ruby, and Scala were less error-prone than C, C++, Objective-C, JavaScript, PHP, and Python.
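Studies of this kind typically identify bug-related commits by searching commit messages for defect-related keywords. A minimal sketch of such a classifier - with a hypothetical keyword list, not the one used in the paper - could look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// bugKeywords is a hypothetical list; studies of this kind typically
// classify a commit as bug-related when its message contains words
// like these.
var bugKeywords = []string{"bug", "fix", "error", "issue", "defect", "fault"}

// isBugFix reports whether a commit message looks like a bug fix.
func isBugFix(message string) bool {
	msg := strings.ToLower(message)
	for _, kw := range bugKeywords {
		if strings.Contains(msg, kw) {
			return true
		}
	}
	return false
}

func main() {
	commits := []string{
		"Fix null pointer dereference in parser",
		"Add dark mode to settings page",
		"Resolve issue #42: race on shutdown",
	}
	for _, c := range commits {
		fmt.Printf("%-45s bug-related: %v\n", c, isBugFix(c))
	}
}
```

Such keyword matching is obviously noisy - a commit "fixing a typo in the docs" counts as a bug fix - which is one reason the methodology of these studies deserves scrutiny.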

In general, functional and statically typed languages were less error-prone than dynamically typed, scripting, or procedural languages. Interestingly, defect types correlated more strongly with language than the number of defects did. Overall, the results were not surprising, confirming what the majority of the community believed to be true. The study gained popularity and was extensively cited. There is one caveat: the results were statistical, and one must be careful when interpreting statistical results. Statistical significance does not always entail practical significance and, as the authors rightfully warn, correlation is not causation. The results of the study do not imply (although many readers have interpreted them that way) that if you switch from C to Haskell you will have fewer bugs in the code. Still, the paper at least provided data-backed arguments.

But that’s not the end of the story. Since replication is one of the cornerstones of the scientific method, a team of researchers tried to replicate the study from 2016. The result, after correcting some methodological shortcomings found in the original paper, was published in 2019 as On the Impact of Programming Languages on Code Quality: A Reproduction Study.

The replication was far from successful: most of the claims from the original paper were not reproduced. Although some correlations were still statistically significant, they were not significant from a practical point of view. In other words, if we look at the data, it seems to be of marginal importance which programming language we choose, at least as far as the number of bugs is concerned. Not convinced? Let’s look at another paper.

A paper from 2019, Understanding Real-World Concurrency Bugs in Go, focused on concurrency bugs in projects developed in Go, a modern programming language created at Google. Go was specifically designed to make concurrent programming easier and less error-prone. Although Go advocates message passing as the less error-prone style of concurrency, it provides mechanisms for both message passing and shared memory synchronization, which makes it a natural choice for comparing the two approaches. The researchers analyzed concurrency bugs found in six popular open source Go projects, including Docker, Kubernetes, and gRPC. The results bewildered even the authors:

"Surprisingly, our study shows that it is as easy to make concurrency bugs with message passing as with shared memory, sometimes even more."

Although the studies we have seen so far suggest that advances in programming languages have little bearing on code defects, there may be another explanation.

Let’s take a look at yet another study - the classic Munich taxi-cab experiment conducted in the early 1980s. Although it concerns road safety rather than IT, the researchers encountered similarly unintuitive results. In the 1980s, German car manufacturers began to install the first anti-lock braking systems (ABS) in cars. As ABS makes a car more stable during braking, it is natural to expect that it improves safety on the road. The researchers wanted to find out by how much. They cooperated with a taxi company that planned to install ABS in part of its fleet. 3,000 taxi cars were selected, and ABS was installed in a randomly chosen half of them. The researchers observed the cars for three years and then compared accident rates in the groups with and without ABS. The result was surprising, to say the least: there was practically no difference; if anything, the cars with ABS were slightly more likely to be involved in an accident.

As with the research on bug rates and concurrency bugs in Go, in theory there should have been a difference, but the data showed otherwise. In the ABS experiment, the investigators had collected additional data. Firstly, the cars were equipped with a kind of black box collecting information such as speed and acceleration. Secondly, observers were assigned to the drivers to take notes on their behavior on the road. The picture emerging from the data was clear: with ABS installed, the drivers changed their behavior on the road. Noticing that they now had better control of the car and a shorter stopping distance, the drivers started to drive faster and more dangerously, taking sharper turns and tailgating.

The explanation of this phenomenon is based on the psychological concept of target risk: people behave so that the overall risk they experience - the target risk - stays at a constant level. When circumstances change, people adapt their behavior to keep the level of risk constant. Installing ABS in the cars lowered the risk of driving, so the drivers, to compensate for this change, began to drive more aggressively. Similar risk compensation has been found in other areas as well: children take more physical risks when playing sports with protective gear, medicine bottles with childproof lids make parents more careless with medicines, and better ripcords on parachutes are pulled later.

Let’s come back to the studies on code quality. What were the researchers analyzing? Commits to the code repository. When does a developer commit the code? When they are sure enough that the code quality is acceptable - in other words, when the risk of committing buggy code is at a reasonable level. What happens when the developer switches to a language that is less error-prone? They quickly notice that they can now write fewer tests, spend less time reviewing the code, and skip some quality checks while maintaining the same risk of committing low quality code. Like the drivers with ABS, they adapt their behavior to the new situation so that the target risk is the same as before. Every developer has an inner standard of code quality and a target risk of committing code below this standard. Note that the target risk and the standard vary among developers, but the studies suggest that on average they are the same across developers of different languages.

A natural question is: what about other established techniques to improve code quality? I looked for papers on two of them, pair programming and code review. Do they work as commonly preached? Well, yes and no - it turns out that the situation is a bit more complicated. In both cases there are several studies examining the effectiveness of the approach.

Let’s look at a meta-analysis of experiments on pair programming, The effectiveness of pair programming: A meta-analysis. Does it improve code quality? "The analysis shows a small significant positive overall effect of pair programming on quality". A small positive effect sounds a bit disappointing, but that’s not the end of the story.

"A more detailed examination of the evidence suggests that pair programming is faster than solo programming when programming task complexity is low and yields code solutions of higher quality when task complexity is high. The higher quality for complex tasks comes at a price of considerably greater effort, while the reduced completion time for the simpler tasks comes at a price of noticeably lower quality."

In the case of code review, the results of the studies were usually more consistent, but the main benefits are not where I would expect them - in the area of early defect detection. As the authors of a study of code review practices at Microsoft - Expectations, Outcomes, and Challenges of Modern Code Review - conclude:

"Our study reveals that while finding defects remains the main motivation for review, reviews are less about defects than expected and instead provide additional benefits such as knowledge transfer, increased team awareness, and creation of alternative solutions to problems."

A natural question is why there is such a discrepancy between the results of scientific research and common beliefs in our community. One of the reasons may be the divide between academia and practitioners, which makes it difficult for the results of studies to reach developers - but that’s only half of the story.

In the mid 1980s Fred Brooks published the famous paper "No Silver Bullet - Essence and Accidents of Software Engineering". In the introduction he compares a software project to a werewolf:

"The familiar software project has something of this character (at least as seen by the non-technical manager), usually innocent and straightforward, but capable of becoming a monster of missed schedules, blown budgets, and flawed products. So we hear desperate cries for a silver bullet, something to make software costs drop as rapidly as computer hardware costs do."

He argues that there are no silver bullets in software development due to its very nature: it is an inherently complex endeavour. In the 1980s most software ran on a single machine with a single one-core processor, the Internet was in its infancy, smartphones were in the distant future, and nobody had heard about virtualization or the cloud. Brooks was writing mainly about technical complexity; now we are more aware of the complexity of the social, psychological, and business processes involved in software development.

This complexity has also increased substantially since Brooks’ publication. Development teams are larger, often distributed and multicultural, and software systems are much more closely entangled with the business and social tissue. Despite all the progress, software development is still extremely complex, sometimes on the verge of chaos. We must face constantly changing requirements, rising technical complexity, and confusing nonlinear feedback loops created by entangled technical, business, and social forces. The natural wiring of our brains is quite poor at figuring out what is going on in such an environment. It is not surprising that the IT community is plagued with hypes, myths, and religious wars. We desperately want to make sense of all this stuff, so our brains do what they are really good at: finding patterns.

Sometimes they are too good at it, and we see canals on the surface of Mars, faces in random dots, and rules in a roulette wheel. Once we start to believe in something, we literally become addicted to it: each confirmation of our belief gives us a shot of dopamine. We start to protect our beliefs, closing ourselves in echo chambers and choosing conferences, books, and media that confirm our cherished views. With time, the beliefs solidify into a dogma that hardly anyone dares to challenge.

Even with the scientific method, which allows us to tackle complexity and our biases in a more rational way, it can be very hard to predict the result of an action in a complex process like software development. We change the programming language to a better one and code quality does not change; we introduce pair programming or code review to improve code quality and we get lower quality, or benefits in unexpected areas. But there is also a bright side to the complexity: we can find unexpected leverage points. If we want to improve code quality, instead of looking for technical solutions like a new programming language or better tools, we can focus on improving the development culture, raising the quality standards, or making committing bugs more risky.

Looking from this perspective can shed light on some unobvious opportunities. For example, if a team introduces code reviews, the code produced by a developer becomes more visible to other members of the team, which raises the risk of committing poor quality code. Hence code review should raise the quality of committed code not only through reviewers finding bugs or standard violations (which is what the studies quoted above were looking for), but also by deterring developers from committing bugs in the first place. In other words, to raise the quality of the code it should be enough to convince the developers that their code is being reviewed - even if nobody is actually doing it.

The moral of these studies is also that technological factors cannot be separated from psychological and cultural ones. As in many other areas, data-based research shows that the world does not function the way we believe it does. To check how far our beliefs correspond with reality, we don’t always have to wait for researchers to conduct long-term studies. Some time ago we had an emotional dispute on some topic, with many arguments from both sides. After about half an hour someone said: let’s check it on the Internet. We sorted out the disagreement in 30 seconds. Scientific thinking and a dose of scepticism are not reserved for scientists. Sometimes a quick check on the Internet is enough; sometimes we need to collect and analyze data, but in many cases it is not rocket science. How to introduce more rationality into software development practices, though, is a broad topic - maybe worth another article.

About the Author

Jacek Sokulski has 20+ years of experience in software development. He currently works at DOT Systems as a software architect. His interests span from distributed systems and software architecture to complex systems, AI, psychology, and philosophy. He has a PhD in Mathematics and a postgraduate diploma in Psychology.
