
Lessons Learned in Performance Testing


Key Takeaways

  • Performance testing is a hard discipline to get right, and it’s worth spending extra time on it to avoid basic recurring problems.
  • Always distinguish between latency and throughput tests as they are fundamentally two different aspects of the system.
  • For latency tests, fix the throughput to eliminate variance in your measurements.
  • Do not throw away performance information by over-aggregating; record as much detail as possible.
  • Performance regressions can largely be prevented; it just takes more attention and automation.

For projects like Hazelcast, an open-source in-memory computing platform, performance is everything. Because of that, we pay close attention to performance, and over many years we have developed considerable expertise in this area.

In this article, I would like to describe a few common problems that we see frequently and share tips on how to make your performance testing routine better.

Recurring problems in performance testing

To begin with, conducting tests in an unrealistic environment is the most common recurring problem we see. If we’re interested in the performance of a system that will be deployed across multiple powerful machines, we cannot draw reasonable conclusions about that setup from a performance test run on a local laptop. Then, when the system is deployed in production, the results are completely different. As funny as it sounds, people do this more often than one would think. In my experience, the main reasons are laziness (it’s easy to start locally, while setting up more machines takes more time), lack of awareness that this could be a problem, and sometimes, unfortunately, simply a lack of resources.

Another recurring problem is testing an unrealistic scenario, for example with too little load, too few parallel threads, or even a completely different ratio of operations. In general, you should test the most realistic scenario possible. That is why we always start by collecting information about the test: the use case, the purpose, the environment, and the scenario, and then we make the test resemble it as closely as possible.

The next one on the list is not distinguishing between throughput and latency tests, and, even more often, not realizing which of the two we’re actually interested in. Let me elaborate. In my experience, customers tend to say that what they really care about is the throughput of the application, so they want to stress the system as much as possible and base their decision on those results. However, when you probe deeper, you find out that in 99 percent of cases their load is between X and Y operations per second, never more. In that case, it makes more sense to base the decision on a test that drives the system at somewhere between X and Y operations per second and watches the latency, making it a latency test rather than a throughput one.

To end the shortlist, the last point is looking only at aggregated values such as averages or medians. These aggregated results hide a lot of information that is essential to making a good decision. More detailed examples of what could be reported can be found in the "Reporting and comparing performance results" section.

The difference between latency and throughput testing

To remind ourselves, throughput is the number of operations completed per unit of time (a typical example is operations per second). Latency, also known as response time, is the time from the start of an operation’s execution to receiving the answer.

These two basic metrics of system performance are connected to each other. In a non-parallel system, latency is the inverse of throughput and vice versa. This is very intuitive: if I do 10 operations per second, one operation takes (on average) 1/10 of a second. If I do more operations in one second, each single operation has to take less time.

However, this intuition easily breaks in a parallel system. As an example, consider adding another request-handling thread to a web server. You’re not shortening the time of a single operation, so latency stays (at best) the same; the throughput, however, can double.
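A handy way to formalize this relationship is Little's law: in steady state, throughput equals concurrency divided by mean latency. The toy calculation below (our own illustration, not from Hazelcast's tooling) shows why adding a thread can double throughput while leaving latency untouched:

```java
// Little's law in steady state: throughput = concurrency / latency.
public class LittlesLaw {
    static double throughputOpsPerSec(int concurrentRequests, double latencySeconds) {
        return concurrentRequests / latencySeconds;
    }

    public static void main(String[] args) {
        // One thread, 100 ms per operation: 10 ops/s.
        System.out.println(throughputOpsPerSec(1, 0.1));
        // Two threads, latency unchanged at 100 ms: throughput doubles to 20 ops/s.
        System.out.println(throughputOpsPerSec(2, 0.1));
    }
}
```

The inverse relationship from the non-parallel case is just the special case of concurrency 1.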

From the example above, it’s clear that throughput and latency are two essentially different metrics of a system, so we have to test them separately. To be more specific (and this is the most common issue we see in performance testing), for latency testing we always need to fix the throughput, for example: "Now let’s do latency testing at 100K operations per second." If we don’t, the throughput will vary from run to run, making the latencies incomparable.

Lesson learned: if doing latency testing, fix the throughput variable to really compare apples to apples.
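One common way to fix the throughput is to schedule each operation's intended start time from the test start, independent of how long earlier operations took. The sketch below is a minimal pacer of our own devising (the class and method names are assumptions, not Hazelcast's actual load generator):

```java
import java.util.concurrent.TimeUnit;

// Minimal fixed-rate pacer: computes constant-rate start times so a latency
// test always drives the system at the same target throughput.
public class FixedRatePacer {
    final long intervalNanos;
    final long startNanos;

    FixedRatePacer(long targetOpsPerSec, long startNanos) {
        this.intervalNanos = TimeUnit.SECONDS.toNanos(1) / targetOpsPerSec;
        this.startNanos = startNanos;
    }

    // Intended start time of the i-th operation, independent of how long
    // previous operations actually took.
    long scheduledStart(long opIndex) {
        return startNanos + opIndex * intervalNanos;
    }

    // Nanoseconds to wait before issuing the i-th operation (0 if late).
    long nanosUntil(long opIndex, long nowNanos) {
        return Math.max(0, scheduledStart(opIndex) - nowNanos);
    }
}
```

Measuring latency from `scheduledStart` rather than from the actual (possibly delayed) send time also guards against the coordinated-omission problem that Gil Tene has described.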

Reporting and comparing performance results

Another area is actually how (and what) to record and report during a performance test and how to analyze the results. I would describe it as two parts: don’t throw away the information and connect the dots.

For the first part, I very often see performance benchmark reports showing just average operations per second, or a mean, or maybe a few specific latency percentiles. Sure, that makes the benchmarks look simple and easy to publish. However, if you’re really interested in the performance, you don’t do that. You want to see "all" the data.

For throughput tests, a common practice is to count all the operations made and divide by the total time. That effectively throws information away. If you want to see the whole picture, show the course of the throughput over time.
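Recording a throughput timeline is cheap: bucket each completed operation by the second in which it finished. The sketch below is our own minimal illustration of the idea (not Hazelcast's actual instrumentation):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Records completed operations into per-second buckets so throughput can be
// plotted over time instead of being collapsed into one average number.
public class ThroughputTimeline {
    final long startMillis;
    final AtomicLongArray perSecond;

    ThroughputTimeline(long startMillis, int testDurationSeconds) {
        this.startMillis = startMillis;
        this.perSecond = new AtomicLongArray(testDurationSeconds);
    }

    // Call from worker threads whenever an operation completes.
    void recordOp(long nowMillis) {
        int bucket = (int) ((nowMillis - startMillis) / 1000);
        if (bucket >= 0 && bucket < perSecond.length()) {
            perSecond.incrementAndGet(bucket);
        }
    }

    long opsInSecond(int second) {
        return perSecond.get(second);
    }
}
```

Plotting `opsInSecond` for every second of the run produces exactly the kind of chart discussed next, where instability is visible that a single average would hide.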

Even from such a simple chart, you can easily spot problems. Based only on the aggregated "throughput was x operations per second on average" number, you might quickly conclude that the green line is the best of the group. However, the chart looks odd: even though the green line’s throughput is the highest, it is much less stable than the others. In other words, the green line is very "shaky", not a nice straight line. In this particular example, we discovered an issue within the thread scheduling mechanism.

In summary, without this chart showing the development of throughput during the test, it would be impossible to notice this from the aggregated result. We would have been happy that the green line’s number was the highest and gone for a celebratory beer, missing an important performance issue.

For latency tests, another common mistake is to show just an average latency or, in slightly better cases, a few percentiles. In the ideal case, you create a histogram, effectively showing all the percentiles and giving you the full picture. Better still, you can watch the percentiles develop over time.

To provide an example for latency tests, the chart below shows how the 50th percentile is evolving through the progress of the test.

From the chart, we can clearly see that the latency is increasing over the course of the test. If we published only an average, or even a complete histogram over the whole run, we would miss this trend. In this particular example, the system was experiencing a memory leak.

For the record, Gil Tene’s HdrHistogram is an awesome tool for generating percentile distributions and latency charts in general.
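To make the idea concrete, here is a deliberately naive per-interval percentile reporter using only the JDK. In practice you would use HdrHistogram, which does this with bounded memory and configurable precision; this sketch (our own, with nearest-rank percentiles) only illustrates reporting full distributions per interval instead of one average:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Naive percentile reporter for a single measurement interval.
public class IntervalPercentiles {
    final List<Long> latenciesMicros = new ArrayList<>();

    void record(long latencyMicros) {
        latenciesMicros.add(latencyMicros);
    }

    // Nearest-rank percentile: 50.0 gives the median, 99.9 the tail.
    long valueAtPercentile(double percentile) {
        List<Long> sorted = new ArrayList<>(latenciesMicros);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }
}
```

Resetting such a recorder every interval and plotting, say, the 50th and 99.9th percentiles per interval yields exactly the "percentiles over time" chart described above.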

The second part is to look at all the charts at the same time. If you see a problem in one chart, confirm it in the others. If there’s a performance problem, it’s usually visible in multiple places, each piece of the puzzle supporting or refuting the hypothesis.

We can reuse the memory leak example. With a memory leak in a Java program, we should see memory usage climb. We then expect throughput to go down, since more time is spent on garbage collection rather than on useful work. Relatedly, latency will probably go up for the same reason (as you can see in the example). We might also look at the application threads’ CPU time: they will spend less time on the CPU as garbage collection takes more and more of it. The more aspects you have, the better the explanation you get and the easier you can find the solution to the problem.

Finding performance bottlenecks

Isolating the root cause of performance issues is quite often a long and painful process relying on the knowledge of the code itself, experience and sometimes even luck. However, there are some, let’s say, pieces of advice that we try to follow.

The first one would be to investigate step by step, one step at a time. When tuning performance, a very common practice is: "Hey, we know this option is usually good, this garbage collector setting also usually helps, this switch gave us better numbers before, let’s try all of it." This approach leads to overlapping effects, and in the end you’re not sure which change actually helped. Therefore, we always make one step, one switch, one code change at a time, and then test. This way you get a full understanding of what’s going on in your system, which is key.

From a practical point of view, we make use of every possible tool that we have. This is especially true for collecting data. The more information you have, the better you understand the system’s behavior and the quicker you find the pain points. Thus, we collect everything that we possibly can: system stats (CPU, memory, disk I/O, context switches), network stats (especially important for our distributed software), garbage collection logs, profiling data from Java Flight Recorder (JFR), and anything else that comes to mind. We have also implemented proprietary diagnostics in our products that report performance information in a more fine-grained way, coming directly from the internals: operation time sampling, sizes and timing of internal thread pools, statistics about the internal pipelines and buffers, etc.

As an example, the above chart compares three different implementations and shows the number of system context switches per second. As a result, we see that the green line’s implementation was doing fewer context switches, which could indicate that the system will be doing more useful work. This was definitely confirmed by looking at the throughput chart, where the green line’s implementation had higher throughput.

Preventing performance regression

Running the performance tests automatically and regularly (e.g. daily) is an absolute necessity. You need to have the data as soon as possible after a regression is introduced. When we spot a regression, we can isolate it to the few commits from that day that could have caused it, drastically reducing the time needed to fix it.
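An automated nightly pipeline also needs an automated check: compare each new result against a baseline built from recent runs and flag significant drops. The sketch below is our own illustration of one simple policy (the 5% tolerance and mean-of-recent-runs baseline are assumptions, not a Hazelcast standard):

```java
// Flags a performance regression when a new result drops more than a given
// tolerance below the baseline (here: the mean of recent nightly runs).
public class RegressionCheck {
    static double baseline(double[] recentRuns) {
        double sum = 0;
        for (double r : recentRuns) {
            sum += r;
        }
        return sum / recentRuns.length;
    }

    // true if newOpsPerSec is more than `tolerance` (e.g. 0.05 = 5%)
    // below the baseline of recent runs.
    static boolean isRegression(double[] recentRuns, double newOpsPerSec, double tolerance) {
        return newOpsPerSec < baseline(recentRuns) * (1.0 - tolerance);
    }
}
```

Real setups often use more robust baselines (medians, or statistical tests over several runs) to avoid false alarms from normal run-to-run noise.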

One part is running the tests; the second part is storing and analyzing the results. At Hazelcast, we use PerfRepo, an open-source web application designed for storing and analyzing performance test results. It updates its charts with every new result, so it’s very easy to spot a regression: you see in the chart that the line just went down. I actively developed the project during my time at Red Hat; nowadays development has slowed down a bit due to lack of time, but it is still completely usable.

There are obviously some limitations. You cannot test every single aspect of your software; the combination matrix is basically infinite. We try to select "the most important" configurations, but that is very subjective and subject to discussion.


Performance testing is a hard discipline to get right, and many things can go wrong. The key is to pay attention to the details, understand the behavior, and avoid just producing fancy numbers. This, of course, requires much more time. At Hazelcast, we’re aware of this and we’re willing to pay the price. What about you?

About the Author

Jiří Holuša is a devoted open source software engineer who loves his work. Having started his career at Red Hat, he’s currently a quality engineering team lead at Hazelcast, an open source in-memory computing platform company. Digging deep, never giving up on a problem until it’s solved, and enjoying every step of the way is how Holuša works. Besides that, he loves basically any sport and, as a true Czech, never refuses a pleasant conversation over a pint of beer. If you want to see more of Holuša talking about performance testing, you can watch his presentation Performance Testing Done Right from TestCon Europe 2019. You can contact him via Twitter @jholusa.
