Key Takeaways
- Avoid reducing the size of CI regression test suites, particularly at the integration and end-to-end test levels, because test set reduction can make subtle, high-impact bugs invisible by shrinking the result sample size.
- Shifting focus from individual test failures to a stochastic (i.e., probabilistic) approach based on time-series trend analysis makes it possible to effectively manage even very large regression test sets.
- Adopting this stochastic approach gives you the best chance of catching the often subtle signals of the regressions uncovered by your tests over successive test runs.
- You can also leverage redundancies in your CI regression test suite using multicontext pattern matching to quickly spot regressions with high confidence, even in a single test run.
- Improve CI lab speed, feedback times, and capacity through architectural measures like parallelization, continuous reporting, mocking, and hardware-in-the-loop testing, rather than by cutting the average number of regression tests executed with each build.
The False Promise of Reducing CI Regression Test Suites
Should you reduce the number of unit and regression tests you regularly run in your CI for the sake of speed and fast feedback? The benefits of large-scale test suite reduction and prioritization have been debated for years now. You can readily find build servers, as well as companies offering consultancy services, that propose a variety of strategies to significantly reduce the average number of tests run with each build. The primary purported benefits are faster turnaround for developers waiting for test results and improved CI lab capacity use.
As a test architect in a DevOps team, I find the idea alluring in principle but often disappointing in practice. When it comes to unit tests, you can effectively prioritize a smaller set to run against a commit based on, for example, dependency tracking. But for most normal-sized companies, if unit test turnaround time is your major issue, then there is probably something deeply wrong with how you write unit tests, write implementation code, or isolate dependencies.
When you move to higher-level integration and end-to-end tests, you no longer have a straightforward and reliable way to reduce the test set. If you use, for example, lexical similarity to weed out apparently redundant end-to-end tests, you will catch the obvious bugs faster; but if you shrink your test result sample size by strategically omitting many tests on most builds, you risk rendering the weak signal from subtle bugs invisible. This risk matters because not all bugs are equal in impact: the subtle ones are the most likely to escape into released software, and therefore the most likely to cause the greatest loss of time, money, and reputation.
Dealing Effectively with Large CI Test Suites
You should continually optimize your CI regression test suite to remove old or redundant tests. However, a large suite of relevant tests with a high degree of both code coverage and functional coverage is an asset, not a disadvantage. I will argue that two of the major problems with regularly running large regression test sets, slow feedback and CI lab capacity overload, can be overcome by improving your test architecture.
I will discuss appropriate measures for improving your test architecture after outlining an approach for dealing effectively with the obvious remaining problem: finding out what is important to focus on in a sea of results from a large regression test suite. I will describe a stochastic approach that actually relies on some degree of redundancy in your CI regression test set. This approach does not guarantee you will catch every bug every time, but it gives you the best chance of not missing the subtle signatures of the bugs uncovered by your CI regression test suite runs.
Taming the "Pesticide Paradox"
Ideally our regression tests would always speak loudly, telling us exactly and clearly where the issues lie. However, that clarity is often thwarted by something called the "pesticide paradox". This term refers to the dilemma that, as tests succeed at keeping out bugs, the power of your existing test suite to find new bugs diminishes, because most of the defects it could catch have already been found or prevented. This reality leaves behind a greater share of episodic, timing-related, less deterministic bugs that only appear under specific conditions, such as heavy network traffic, high CPU load, elevated temperatures, and other environmental factors in the CI lab. These subtle bugs are often the most dangerous, given the difficulty of finding them and their consequently higher chances of escaping into released software. These are the bugs that, as we will see, require longitudinal, time-series analysis of test results to detect.
The subtlest (and often worst) bugs are episodic, that is to say, not occurring consistently, and timing-related, often involving race conditions. If you have a means of analyzing low-level patterns of failure, you can catch these subtle bugs in spite of the "pesticide paradox". Such bugs are most often caught by higher-level integration and end-to-end tests, which are affected by a multitude of often random factors and therefore run much less deterministically than unit tests. The key to exploiting the potential that lies in that variability is to stop looking at static test results just from last night and instead shift focus to time series of test results, to see how trends develop over time.
A Stochastic Solution: Trend Analysis
The system I have built for this approach looks at the thirty-day trend for each test each time it receives a new result from our end-to-end tests. It uses a functional-programming-inspired Python framework that assigns a textual qualifier identifying each type of significant trend (e.g., flaky, stabilize_failing, regression, etc.). In addition to this textual label, it also generates a thirty-day red-green trend band representing the pattern of passes and failures for each test. I then created a special view on my team’s dashboard that displays both elements for every test that has a problem trend:

Figure 1: Dashboard "trouble" view.
In this visualization, our "trouble view", we filter the results down to the tests whose trends point toward potential regressions. After bringing this view into our dashboard, my team learned that it is not worth investigating every failure from the previous night’s test run. We are much more effective at finding regressions when we focus on the tests with bad trends that show up in our trouble view.
The reason for this visualization is that if you investigate all your failing tests in the order in which you encounter them, you will find things that you believe need to be fixed and improved. Once you start on a task, you will want to complete it. That is the psychology of task work. But the problem is that without a systematic way of judging the relative importance of tasks related to managing your CI regression test set, which the above solution provides, you will invariably focus much of your efforts on less impactful tasks, to the detriment of the most important ones. This solution, which highlights what is important now, gives you and your DevOps team the power to effectively handle very large CI regression test sets.
This approach differs from stochastic reduction techniques that attempt to make test sets more manageable by filtering out apparent redundancies. We maintain the full test set, but stochastically reduce the focus area to the failures most likely to be either regressions or test infrastructure problems acutely needing to be fixed.
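To make the idea concrete, here is a minimal sketch, in Python, of how a trend classifier along these lines might work. The label names, window split, and thresholds are illustrative assumptions for this article, not the actual values used by my framework:

```python
def classify_trend(results):
    """Classify a chronological window of pass/fail results (True = pass).

    A minimal sketch: a production framework would weigh many more signals.
    """
    if not results:
        return "no_data"

    def fail_rate(window):
        return window.count(False) / len(window) if window else 0.0

    half = len(results) // 2
    old_f, new_f = fail_rate(results[:half]), fail_rate(results[half:])
    # Fraction of consecutive runs where the outcome flipped.
    flips = sum(1 for a, b in zip(results, results[1:]) if a != b)
    flip_ratio = flips / max(len(results) - 1, 1)

    if new_f >= 0.8 and old_f < 0.3:
        return "regression"    # mostly green has turned solidly red
    if flip_ratio > 0.4:
        return "flaky"         # frequent pass/fail alternation
    if new_f < old_f - 0.3:
        return "stabilizing"   # failure rate clearly dropping
    return "stable"
```

The point of the sketch is the shape of the computation: each new result triggers a re-classification of the whole window, so a label like "regression" emerges from the trend rather than from any single failing run.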
The Power of Non-Gating Tests
Higher level tests, especially end-to-end tests, are naturally flaky, because they are impacted by many outside factors. We have therefore found that it is better to make such tests non-gating, so that they do not break the build when they fail. The trend-tracking approach I have outlined is sufficiently fine-tuned for identifying regressions that we do not need the hard stop of failing the build to keep regressions out of our software releases. The problems stay sufficiently "in our face" in our trouble view visualization so that we get many chances to see and address them. A strong argument for using this tool is that it shifts the focus from making the tests pass to gathering information about the software’s correctness. This shift even affects how we write tests, favoring tests optimized to capture regressions over tests simply optimized to pass.
Learning from Escaped Bugs
We are a large organization, have broad and deep test coverage, and test primarily on our "main" branch that is always under active development. So we catch real issues almost every day. A regression can look like the following, found today as I write this, where low-level flakiness in end-to-end tests covering one device type in our product line is suddenly replaced by the unambiguous signature of a regression:

Figure 2: Regression signature.
As you can see in this view, regressions are unmissable. Since implementing this approach, we have identified only a single instance in which a subtle bug developed a bad trend, was flagged in the "trouble view", was examined by multiple people who reviewed screen captures and log files, and still escaped as a customer bug. But the key advantage of the outlined approach, even in the rare instances when it fails, is that we capture the full record of the evolution of "trouble" statuses, so that what was known on each day is fully auditable in post-mortems. You can see that in the following report for the issue that escaped into released software:

Figure 3: Postmortem analysis.
This approach provides all the elements needed to learn from bugs that escape and thereby improve our processes and tooling to ensure such lapses do not repeat.
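The auditable record behind such post-mortems can be as simple as an append-only log of each test's daily status. The following sketch is hypothetical (our real system persists these records in the dashboard's backing store), but it shows the core idea of being able to reconstruct what was known on any given day:

```python
import datetime

class TroubleLog:
    """Append-only record of each test's daily trend status (illustrative)."""

    def __init__(self):
        self._entries = []  # only ever appended to, never rewritten

    def record(self, test_name, status, day=None):
        """Store one day's status for a test (e.g. "flaky", "regression")."""
        self._entries.append({
            "day": (day or datetime.date.today()).isoformat(),
            "test": test_name,
            "status": status,
        })

    def as_of(self, day):
        """Reconstruct what was known on `day`: the latest status per test."""
        known = {}
        for entry in self._entries:
            if entry["day"] <= day.isoformat():
                known[entry["test"]] = entry["status"]
        return known
```

Because entries are never rewritten, a post-mortem can replay `as_of` for each day of the escape window and see exactly which trends had already been flagged.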
Leveraging Redundancy: Multicontext Pattern Matching
I have been focusing on catching troubling patterns developing in individual tests over time. If, however, you are testing the same functionality in differing contexts, from varying angles, or via different products using the same shared code, then an interesting thing happens: You can find patterns that point to recurring problems and potential regressions in one night’s test run, rather than having to wait several nights for a trend to develop.
We can use paired visualizations for this situation. We can, for example, compare a tag cloud of the top-level functional domain tags from last night’s failing tests with the same tag cloud from the night before. The relative size of each tag in the cloud represents the number of test failures in that functional domain:

Figure 4: Multi-context pattern matching.
Something significant has clearly happened with Zoom-related functionality in the products’ nightly builds: the @zoom tag in the upper cloud, from the most recent night’s test run, has grown relative to its size in the lower cloud from the previous night. We can then click on the @zoom tag to filter the results on our dashboard and immediately access the list of corresponding failing test cases, to see whether this is a regression or just a test infrastructure issue needing immediate correction.
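The comparison behind the paired tag clouds can be sketched in a few lines. In this sketch, each failing test contributes one functional-domain tag, and a tag is flagged when its failure count grows sharply between two nights; the thresholds are illustrative, not the values we actually use:

```python
from collections import Counter

def flag_growing_tags(last_night, night_before, min_failures=3, growth=2.0):
    """Flag functional-domain tags whose failure count grew sharply.

    Each argument is a list with one tag (e.g. "@zoom") per failing test.
    Thresholds are arbitrary example values.
    """
    recent, previous = Counter(last_night), Counter(night_before)
    flagged = []
    for tag, count in recent.items():
        # Require both an absolute floor and a relative jump, so a single
        # flaky test cannot trigger an alert on its own.
        if count >= min_failures and count >= growth * max(previous[tag], 1):
            flagged.append(tag)
    return sorted(flagged)
```

The absolute floor is what makes redundancy pay off: only a domain covered by several overlapping tests can accumulate enough correlated failures to be flagged.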
The key to finding such patterns is having sufficient redundancy in your test set. If every functional domain were tested to the absolute minimum, with zero overlap, then just one flaky test would skew the results and our team would waste time investigating false positives. The pattern of test failures in one domain repeating with a sufficient amplitude, over, for example, several different product lines using the same shared code, gives much higher confidence that the pattern of failures stems from a real regression or test infrastructure issue.
As with the timeline-based "trouble view" approach, the paired visualization approach scales extremely well, which helps our team deal with the cognitive load of managing a large test set. Now all that remains is to address the other difficulties with large CI regression test sets: as I previously mentioned, slow turnaround times for developers waiting for results and heavy consumption of CI lab capacity.
Architecting for Speed, Scale and Effective Identification of Regressions
You can mitigate slow turnaround times by working on test parallelization and ensuring that you report results continually, rather than at the end of your test runs. Even if you have a lot of end-to-end tests running on real embedded targets, you can still provide fast turnaround times by building a sufficiently large lab and structuring your test jobs to run massively parallel on many different embedded devices. When you do this, ensure that each job reports its test results continually. We post our test results to Elasticsearch, so we can watch significant trends develop on our dashboards nearly in real time.
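The shape of such continuous reporting can be sketched as follows. The field names, the job identifier, and the `post` callable are illustrative stand-ins; in our lab, the equivalent of `post` sends each document to Elasticsearch the moment the test finishes, rather than batching results at the end of the run:

```python
import datetime
import json

def make_result_doc(test_name, passed, job_id):
    """Build one result document, ready to post the moment a test finishes."""
    return {
        "test": test_name,
        "status": "pass" if passed else "fail",
        "job": job_id,  # illustrative identifier for the parallel job
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

def run_and_report(tests, runner, post):
    """Run tests one by one, posting each result immediately, not in a batch."""
    for name in tests:
        doc = make_result_doc(name, runner(name), job_id="job-42")
        post(json.dumps(doc))  # e.g. an HTTP POST to the results index
```

Because every parallel job streams documents this way, the dashboard's trend views start updating while the nightly run is still in progress.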
Even if your team does not work with embedded software, mocking away heavy dependencies like databases, file systems, and sensor data will speed up tests and thereby help with CI lab capacity. I have seen this clearly when working with tests that fed data from many different physical sensors into a complex recognition algorithm. It would have been madness to test all the corner cases handled by the algorithm on the actual embedded system; it was far simpler, more reliable, and, above all, faster, to use a unit test framework and mock all of the different possible inputs from the sensors in order to just test how the algorithm made its decision in each case.
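The sketch below illustrates that pattern with two mocked, hypothetical sensor inputs feeding a stand-in decision function, so each corner case of the decision logic can be exercised in milliseconds without any physical hardware; all names here are invented for the example:

```python
from unittest.mock import Mock

def detect_obstacle(lidar, camera):
    """Stand-in for a real sensor-fusion decision; names are invented."""
    return lidar.distance_m() < 2.0 and camera.object_confidence() > 0.6

def test_corner_cases():
    lidar, camera = Mock(), Mock()
    # Corner case: object is close, but the camera just misses its threshold.
    lidar.distance_m.return_value = 1.5
    camera.object_confidence.return_value = 0.59
    assert detect_obstacle(lidar, camera) is False
    # Clear detection: both mocked sensors agree.
    camera.object_confidence.return_value = 0.95
    assert detect_obstacle(lidar, camera) is True
```

Each corner case is just another pair of `return_value` assignments, so sweeping the algorithm's whole input space costs seconds of CI time instead of hours on real hardware.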
If you are testing embedded solutions on large machines, of which space and expense constraints allow only a few in your lab, you still have a good option for parallelization. You can use a smart hardware-in-the-loop (HIL) approach and do the bulk of your testing on smaller subcomponents of your system housed in racks in your test lab. Then you only need the big machines for a much shorter suite of your highest-level integration tests.
Wrapping Up
Because I take an analytical and strategic approach to my work, I initially found the proposition of reducing the number of tests run with each build enticing. With unit tests, where both tests and software behavior should be overwhelmingly deterministic, I had no problem with the idea. However, if unit test suites are implemented correctly and they test small units with few or no dependencies, they will run so fast that the gains from optimizing some of them away on each build are extremely limited. My doubts grew even more when trying to apply this approach to higher-level integration and end-to-end tests.
When you get there, you are dealing with multiple complex layers of the system under test that will hide the lion’s share of subtle bugs. As I have argued in this article, the latent patterns that can identify such bugs will often only emerge over time, and may only be visible to those with specifically targeted tooling to spot them. Shrinking your test result sample size by strategically omitting many complex tests on most builds will render the weak signal from the subtle bugs invisible.
The stochastic approach I have outlined gives you much better chances of catching the bugs that pose the highest risk of escaping into released software. That is important because, although they may be subtle, when multiplied across thousands of customers these bugs can have a huge negative impact. Moreover, subtle bugs often generate extra frustration among customers, because support teams very often dismiss reports of them as non-reproducible.
The approach I have presented remedies that problem by catching a significantly higher percentage of subtle bugs before they escape into released software. It also makes managing large CI regression test sets viable for DevOps teams by continually highlighting the most important test failures to focus on and investigate.
In the end, the choice is yours. You can focus on getting through your test set as fast as possible, if that is your priority. Alternatively, you can invest a reasonable amount of time, given the optimizations outlined here, in an approach that gives you the best chance of preventing the defects your tests expose from making their way into your released software.