
How to Use Your Existing Software Development Process Data to Find More Bugs in Less Time

Key Takeaways

  • Test intelligence analyses use data from software development (version history, tickets, test coverage, etc.) to improve the efficiency and effectiveness of test suites and processes.
  • New and modified code contains more bugs than unchanged code. Use test-gap analysis to reveal untested changes in critical functionality.
  • Executing all tests sometimes takes too long. Test impact analysis selects those test cases that run through code that changed since the last test run. Execute these tests to find new bugs quickly.
  • Automated test selection techniques optimize acceptance test suites and outperform manual selection by experts.
  • To find which areas of your code base contained the most bugs in the past, perform bug history analysis. It can reveal root causes of bugs in the development process. 

Historically-grown test suites test too much and too little at the same time

As software systems typically grow in features from release to release, so do their test suites. This causes slower test execution times. For manual testing, this means more effort for testers, and thus directly leads to more costs. For automated testing, this means longer wait times for developers until they receive test results. We see many automated test suites that grow from minutes over hours to days or even weeks of execution time, especially when hardware is involved. This is painfully slow and indirectly leads to more costs, since it is more difficult to fix something that you broke two weeks ago than something you broke an hour ago, with so much having happened in between.

Ironically, such expensive test suites are often not even good at finding bugs. On the one hand, there are often parts of the software under test that they do not test at all. On the other hand, they often contain a lot of redundancy in the sense that other parts are tested by very many tests. Bugs in these areas then cause hundreds or thousands of tests to fail. These test suites are thus neither effective (because they do not test some areas) nor efficient (since they contain redundant tests).

Of course, this is not a new observation. Most of the teams we work with have long since abandoned running the whole test suite on every change, or even on every new release version of their software. Instead, they either execute their whole test suites only every couple of weeks (which reveals bugs late and makes them more expensive to fix than necessary) or they only execute a subset of all tests (which misses many bugs that the other existing tests could find).

This article presents better solutions that employ data from the system under test and the tests themselves to optimize testing efforts. This allows teams to find more bugs (by making sure that bug-dense areas are tested) in less time (by reducing the executions of tests that are very unlikely to detect bugs). 

Analyzing development process data helps to optimize testing

If a test suite is inefficient and ineffective, the consequences are obvious to the development and test teams: test efforts are high, but nevertheless, too many bugs slip into production undetected. 

However, since in large organizations nobody has complete information, there are typically different - often conflicting - opinions on how to fix this problem (or whose fault it is). Opinions are hard to validate or refute based on partial information, and if people focus on what bolsters their opinion, instead of on the big picture, we often see teams (or teams of teams) that struggle for a long time without improving. 

For example, we have sometimes seen testers blame developers for breaking too much existing functionality when implementing new features. In response, the testers allocated more effort to regression testing. At the same time, developers blamed testers for finding bugs in new features too slowly. However, as testers shift more effort to regression tests, bugs in new features are found even later. And since developers learn about bugs in new features late, their fixes often arrive only after regression testing is complete. If such a fix introduces a bug in a different location, testers have no chance to catch it with regression tests.

Ironically, this dynamic supports both teams’ viewpoints, increasing both teams’ confidence that their point of view is correct, while at the same time making the problem worse.

These teams must stop arguing about general categories - like how much regression testing is necessary in principle - and instead look into their data to answer which tests are necessary for a specific change right now. Software repositories like the version control system, the issue tracker, or the continuous integration system contain a trove of data about your software that helps optimize testing activities based on data, not opinions.

In essence, we can analyze any repository that collects data during software development to answer specific questions about our testing process.

Where were the most bugs in the past? What can we learn from them?

The version history and the issue tracker contain information about where bugs were fixed in the past. This information can be extracted and used to compute the defect density of different components. 
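As an illustration, the following minimal sketch (in Python, assuming a git repository and that bug-fix commits reference ticket IDs in their commit messages; the "BUG-1234" pattern is hypothetical and would need to match your issue tracker's conventions) counts how often each file was part of a bug-fixing commit. Dividing those counts by file size yields a fix density per line of code.

    # Minimal sketch, assuming a git repository and that bug-fix commits can be
    # recognized by ticket references in their commit messages (the "BUG-1234"
    # pattern below is hypothetical; adapt it to your issue tracker).
    import re
    import subprocess
    from collections import Counter

    FIX_PATTERN = re.compile(r"\bBUG-\d+\b")  # hypothetical ticket ID format

    def fix_counts_per_file(repo_path="."):
        """Count how often each file was touched by a bug-fixing commit."""
        log = subprocess.run(
            ["git", "log", "--name-only", "--pretty=format:--%s"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        ).stdout

        fixes = Counter()
        in_fix_commit = False
        for line in log.splitlines():
            if line.startswith("--"):        # commit subject line (our format marker)
                in_fix_commit = bool(FIX_PATTERN.search(line))
            elif line and in_fix_commit:     # a file changed by that commit
                fixes[line] += 1
        return fixes

    if __name__ == "__main__":
        # Dividing these counts by file length (LoC) gives a fix density per line.
        for path, count in fix_counts_per_file().most_common(10):
            print(f"{count:4d}  {path}")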

In one system, this revealed one component whose fix density per line of code was one order of magnitude higher than the average fix density in the system. This is illustrated in the upper (blue) treemap above. Each rectangle represents a file, and its area corresponds to the size of the file in LoC. The deeper the shade of blue, the more often the file was part of a bug-fixing commit.

In the center of the treemap, there is a cluster of files of which most are a much deeper shade of blue than the rest of the treemap.

The lower treemap depicts the coverage of automated tests. White means uncovered, and shades of green show increasing test coverage (darker green meaning more coverage). It is striking that the component in the center, which contains a high number of historic bugs, has almost no coverage of automated tests.

A discussion with the teams revealed a systematic flaw in the test process for this component: while the developers had written unit tests for all other components, this component lacked the test framework to easily write unit tests. Developers had written a ticket to improve the test framework. Until its implementation, they systematically skipped writing unit tests for this component. Since the impact of bugs was unknown to the team, the ticket remained dormant in the backlog. 

However, once the above analysis revealed the impact of bugs, the ticket was quickly implemented and missing unit tests were written. After that, the number of new defects in this component was not higher than in other components.

Where are untested changes (test gaps)?

Test gaps are areas of new or changed code that have not been tested. Teams typically try to test new and modified code especially carefully, since we know from intuition (and empirical research) that they contain more defects than code areas that did not change. 

Test gap analysis combines two data sources to reveal test gaps: the version control system and code coverage information. 

First, we compute all changes between two software versions (for example, the last release and the scheduled next release) from the version control system.
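A minimal sketch of this step, assuming the two versions are available as git revisions (e.g. release tags); production-grade test-gap analysis works at the level of individual methods, while this sketch stays at file level:

    # Minimal sketch: list files that are new or modified between two git revisions.
    import subprocess

    def changed_files(old_rev, new_rev, repo_path="."):
        """Map each added/modified/renamed file to 'new' or 'modified'."""
        diff = subprocess.run(
            ["git", "diff", "--name-status", old_rev, new_rev],
            cwd=repo_path, capture_output=True, text=True, check=True,
        ).stdout
        changed = {}
        for line in diff.splitlines():
            parts = line.split("\t")
            status, path = parts[0], parts[-1]   # last entry handles renames
            if status.startswith(("A", "M", "R")):
                changed[path] = "new" if status.startswith("A") else "modified"
        return changed

    # Example: changed_files("release-1.0", "HEAD")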

This treemap shows a business information system of approx. 1.5 MLoC. Thirty developers had worked for six months to prepare the next release. Each white rectangle depicts a component, and each black-lined rectangle represents a code function. The area of components and functions corresponds to their size in LoC. Code in gray rectangles did not change since the last release. Red rectangles are new code, orange rectangles modified code. The treemap shows which areas changed comparatively little (e.g. the left half) and which changed a lot (e.g. the components on the right side).

Second, we collect all test coverage data. This is a fully automatable collection process, both for automated and manual testing. More specifically, we employ code coverage profiling to capture test coverage information for all testing activities that take place. While different programming languages and sometimes even different compilers can require different profilers, they are in general available for all well-known programming languages. 
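For a Python code base, for example, this could be done with the coverage.py profiler. The following minimal sketch records line coverage for an automated test run and exports it as JSON; manual test sessions can be profiled analogously by wrapping the application's entry point. The file name coverage.json is just an assumption carried through the later sketches.

    # Minimal sketch: profile an automated test run with coverage.py and
    # export per-file line coverage as JSON for later analysis.
    import coverage
    import pytest

    cov = coverage.Coverage()
    cov.start()
    pytest.main(["tests/"])                    # run the automated test suite
    cov.stop()
    cov.save()
    cov.json_report(outfile="coverage.json")   # per-file executed line numbers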

This treemap shows test coverage for the same system. It combines coverage of automated testing (in this case unit tests and integration tests) and manual testing (a group of five testers who worked for a month to execute manual system-level regression tests). Gray rectangles are functions that were not executed during testing, green rectangles are functions that were executed. 

Finally, we combine this information to find those changes that were not tested by any test stage to reveal the so-called test gaps.
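A minimal sketch of this combination, building on the two sketches above (file-level only, and assuming the coverage report uses the same relative paths as the version control system):

    # Minimal sketch: changed files that were not executed by any test stage.
    import json

    def test_gaps(changed, coverage_json="coverage.json"):
        """Return the changed files that no test executed, i.e. the test gaps."""
        with open(coverage_json) as f:
            report = json.load(f)
        covered = {path for path, data in report["files"].items()
                   if data["summary"]["covered_lines"] > 0}
        return {path: kind for path, kind in changed.items() if path not in covered}

    # Example:
    # gaps = test_gaps(changed_files("release-1.0", "HEAD"))
    # for path, kind in sorted(gaps.items()):
    #     print(f"TEST GAP ({kind}): {path}")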

In this treemap, we do not care much for code that did not change. It is thus depicted in gray (independent of whether or not it was executed during testing). New and modified code is depicted in colors: if it was executed during testing, it’s in green. If not, then it’s depicted in red for new code and orange for modified code. 

In this example (which was taken on the day before the planned release date) we see that several components (comprising tens of thousands of lines of code) were not executed during testing at all. 

Test gap analysis allows teams to make a deliberate decision on whether they want to ship those test gaps (i.e. new or modified code that was not tested) into production. There can be situations where this is not a problem (e.g. if the untested feature is not used yet), but often it is better to do additional testing of critical functionality.

In the example above, the team decided not to release, since the untested functionality was critical. Instead, the release was postponed by three weeks and most of the test gaps were closed by thousands of additional (manual and automated) test case executions, allowing them to catch (and fix) critical bugs.

Which tests are most valuable right now?

If we analyze code changes and test coverage continuously, we can automatically compute which code was changed since the last test suite execution. This allows us to specifically select those tests that execute these code regions. Running these impacted tests reveals new bugs much quicker than re-running all tests (since tests that do not execute any of the changes cannot find new bugs that were introduced by these changes).
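A minimal sketch of this selection, assuming per-test coverage has been recorded as a mapping from test name to the set of code entities (here simply file paths) that the test executes:

    # Minimal sketch: select the tests that run through code changed since the last test run.
    def impacted_tests(per_test_coverage, changed_files):
        """per_test_coverage: test name -> set of files it executes."""
        changed = set(changed_files)
        return [test for test, covered in per_test_coverage.items()
                if covered & changed]

    # Example with hypothetical data:
    # per_test_coverage = {
    #     "test_login":   {"auth/session.py", "auth/token.py"},
    #     "test_reports": {"reporting/export.py"},
    # }
    # impacted_tests(per_test_coverage, {"auth/token.py"})  # -> ["test_login"]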

This test impact analysis speeds up feedback times for developers. In our empirical analyses, we have measured that it finds 80% of the bugs (that running the entire test suite reveals) in 1% of the time (that it takes to run the entire test suite), or 90% of the bugs in 2% of the time (more details in this chapter on change driven testing). 

This scenario applies, for example, to test execution during continuous integration.

Which tests are most valuable in general? 

Some test executions are themselves an expensive resource. For example, some of our customers have test suites that they run on expensive hardware-in-the-loop setups. Each test run comprises tens of thousands of individual tests, integrates software components from different teams, and takes weeks to execute. These runs are crucial, however, since the software cannot be released without them.

A real problem for such big, expensive test runs are “mass defects”: single defects in such a central location that they cause hundreds or even thousands of individual test cases to fail. If a system version under test contains a mass defect, the entire test run is ruined, since further defects are hard to find among the thousands of test failures. The test teams therefore have to make sure that the system under test contains no mass defect before they start the big, expensive test run.

To keep mass defects out of the big test run, the team uses an acceptance test suite (sometimes called a smoke test suite) that a software version has to pass before it is allowed to enter the big, expensive test run. A well-assembled acceptance test suite executes a small subset of all tests that has a high likelihood of finding defects that cause many other tests to fail.

We can select an optimal acceptance test suite (in the sense that it covers the most code in the least amount of time) from the existing set of all tests based on test-case-specific code coverage information. For this, we have found that so-called “greedy” optimization algorithms work well: they start with an empty set. Then they add the test that covers the most lines of code per second of test execution. Then they keep adding the test cases that, per second of test execution, cover the most lines not yet covered by the previously selected tests. They repeat this selection process until the time budget for the acceptance test suite is used up. In our research, we found that the acceptance test suites that we compute this way find 80% of the bugs (that the entire test suite can detect) in 6% of the time (that it takes to execute the entire test suite).
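A minimal sketch of this greedy selection, assuming that per-test coverage profiling has produced, for each test, its duration in seconds and the set of code lines it covers:

    # Minimal sketch of the greedy selection described above.
    def greedy_acceptance_suite(tests, time_budget_seconds):
        """tests: test name -> (duration_seconds, set of covered lines)."""
        selected, covered, time_used = [], set(), 0.0
        remaining = dict(tests)
        while remaining:
            # Pick the test that adds the most not-yet-covered lines per second
            # of execution time and still fits into the remaining budget.
            best, best_gain = None, 0.0
            for name, (duration, lines) in remaining.items():
                if time_used + duration > time_budget_seconds:
                    continue
                gain = len(lines - covered) / max(duration, 1e-6)
                if gain > best_gain:
                    best, best_gain = name, gain
            if best is None:       # nothing fits the budget or adds new coverage
                break
            duration, lines = remaining.pop(best)
            selected.append(best)
            covered |= lines
            time_used += duration
        return selected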

In one project, we compared this approach with an acceptance test suite that had been manually assembled by test experts. On the historic test execution data of the previous two years, the automatically optimized acceptance test suite found twice as many bugs as the suite manually assembled by the experts.

This is not as good as test impact analysis (which only requires 1% of the time to find 80% of the bugs), but can be applied when less information is available (we don’t need to know all code changes since the last measured test execution).

How to start with test intelligence analyses in your own project?

Test intelligence analyses can provide data-driven answers to all kinds of questions. It can thus be tempting to simply play around with them to see what they reveal about your system.

However, it is more effective to start with a specific problem that is present in the system you are testing. This makes change management more likely to succeed, since it is easier to convince co-workers and managers to solve a concrete problem than to just play around with new tools.

In our experience, these problems are a good starting point for thinking about test intelligence:

  1. Do too many defects slip through testing into production? Often the root cause is test gaps (i.e. new or modified code areas that have not been tested). Test gap analysis helps to find and address them before release.
  2. Does the execution of the entire test suite take too long? Test impact analysis can identify the 1% of test cases that find 80% of new bugs, and this shortens feedback cycles substantially.

Once test intelligence analyses are in place, it is easy to use them to answer other questions, too. Teams thus rarely employ only one analysis. Attacking a substantial problem, however, justifies the effort of their initial introduction.

Community comments

  • Extracting data

    by David Keaveny,

    I'm curious to know what tool you used to extract the heatmaps; this sort of approach looks really helpful for tracking down the areas of the codebase that are the most responsible for the largest number of defects.

  • Re: Extracting data

    by Elmar Juergens,

    Hi David,

    thanks for the question!

    I created the treemaps (as they are called in the literature, although "heatmap" is probably a more intuitive term) using Teamscale.

    Teamscale is the tool sold by the company that I co-founded, so sorry for the plug. You can get a free trial version (that you can use for 3 months) at teamscale.com.

    The idea of treemaps is not from us, though. I personally ran into it 25 years ago through a tool that visualizes hard disk usage, but it is probably even older. We adapted it to visualize software quality data 17 years ago in the research group where I did my PhD.

    However, in practice the challenge is getting clean data, not the visualization itself. So the lion's share of the effort behind Teamscale (or such tools in general) goes into the data (getting it and cleaning it); the treemap visualization is the comparatively easy part.

    Hope this helps,
    Elmar
