Key Takeaways
- Garbage collection logs contain key performance indicators.
- You will need a lot of data to perform a proper GC analysis; good tooling is essential.
- Many things besides GC can pause an application.
- Whatever pauses your application will pause the garbage collector.
- Analytics are essential for exposing what's causing performance regressions.
I recently received the kind of garbage collection (GC) log you rarely see, containing a GC pause that weighed in at a full 23.785173 seconds. While running into a long GC pause isn't rare, this particular pause wasn't your garden-variety long pause. Unlike most logs that come my way, this one contained enough detail for me to determine that something outside of the JVM was responsible for the pause, if indeed it actually was a pause.
I shared the log with Dr. Heinz Kabutz, who commented "is this running on a 386?" While his comment was intended as a joke, it did touch on a real problem: for all the literature on garbage collection, very little of it covers more than the most basic analytic techniques.
This is unfortunate, because analytics can give us deeper insight into what has gone wrong and what we should be doing to ensure it doesn't happen again. Let's dig deeper into this GC log to see how it told me that the GC wasn't responsible for this long "GC pause".
In this case the log is unusually small, containing 22 GC events expressed in 678 log lines. Analytics rarely produce meaningful results on such a small log. In contrast, a more typical log that came in that same day contained 14,110 GC events buried in 959,962 lines. Beyond the problem of processing that volume of data, there are approximately 60 different JVM flags that affect the format of the GC log, and each flag combination does so in an irregular way. If that weren't enough, the record formats can and do change regularly in arbitrary ways.
Also, prior to Java 9, GC logging was not thread-safe, meaning the logs could be corrupted when you used a concurrent collector. Add this all together and you end up having to perform an analysis on logs that are neither human nor machine readable.
My motivation to author jClarity's Censum GC log analysis tool was fueled by my desire to experiment with analytics, and my need to be able to extract useful information from GC logs from as many different sources as practical.
The minimal set of recommended flags was used in this case:
-Xloggc:gc.log
-XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
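For Java 9 and later these flags have been replaced by unified logging. A roughly equivalent configuration is shown below; treat it as an approximation rather than an exact mapping, since the resulting log format differs:
-Xlog:gc*,safepoint:file=gc.log:time,uptime,level,tags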
The easiest way to perform the analysis is to load the log into Censum. The first view presented (Figure 1) is an overview of all the analytics that have been run. The first thing to note is that two of the analytics, marked with red x's, have failed. If we look at these analytics we'll see that Censum has already diagnosed the condition. We will run through the specifics of the failed analytics by hand in a minute, but first let's take a look at Censum's summary view as well as the "Heap after GC" chart.
Figure 1. Analytics summary
Both these views can help bring context to the failed analytics. The summary page (Figure 2) reveals that the parallel collector is in use. There were 21 young generational collections and one full collection. The other metrics in this view are unremarkable or well within accepted tolerances. For example, the application throughput of 99.3% is well above the desired 95% threshold, and the allocation rate of 1165 KBytes/sec is well below the magic 1 GByte/sec threshold. Breaking the former threshold indicates that the GC is sub-optimally configured (which should trigger what we call a "technical tuning" of the garbage collector, but that is the subject of another article), whereas breaking the latter is an indicator that your application is memory inefficient, which should prompt the identification and breakup of allocation hotspots, again a topic for another article.
Figure 2. Censum Summary View
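As a rough illustration of where the throughput number comes from (the figures here are invented, not taken from this log): if the JVM has been running for 3,600 seconds and the "Total time for which application threads were stopped" records emitted by -XX:+PrintGCApplicationStoppedTime sum to 25 seconds, then throughput is (3600 - 25) / 3600 ≈ 99.3%. The allocation rate is estimated in a similar spirit, by dividing the amount of young generation space consumed between collections by the elapsed time between them.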
The "Heap after GC" chart in this case reveals that the heap was expanded as part of the full GC. All of the other collections look ordinary and heap occupancy looks stable so far. Moreover the total heap size is fluctuating between 700 and 800 megabytes. The views of tenured have it sized at a fairly steady 520 megabytes. This implies that young generation is between 200 and 300 megabytes in size. By today's standards these memory pool sizes are exceptionally small, which prompted Heinz's 386 joke. There is nothing in this log that jumps out at you and says "23 second pause". Let's return to the the failed analytics to see why they failed and whether that reason could account for the long pause.
The first of our two failed analytics is a simple pause time analysis. It failed because it detected the long pause. Given the overall low overhead (expressed in this analytic as “Time Paused”), this long pause appears to be an outlier that needs further investigation.
One of the bits of information that is emitted when PrintGCDetails is turned on is a summary of CPU utilization of the GC threads. These summaries contain the amount of user time, kernel time and real time accumulated by the GC threads during the collection. It is this data that is used by the "High Kernel times" analytic. Figure 3 captures this view for this GC log.
Figure 3. High kernel times
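To make the discussion concrete, a young generational collection record produced by the parallel collector with the flags above looks roughly like the following; the values are illustrative and are not taken from the log discussed here. The CPU summary is the "[Times: ...]" fragment at the end of the record.
2.869: [GC (Allocation Failure) [PSYoungGen: 262144K->19682K(305664K)] 262144K->19690K(1005056K), 0.0171483 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]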
The high kernel times analytic applies a number of heuristics to the CPU summaries that can identify cases where the values are out of alignment with expected norms. Such analytics make it possible for developers, at any level of proficiency in examining GC logs, to see problems they might otherwise miss. Some of the difficulties people run into when looking at charts can simply be a result of how the data was plotted. Even when properly plotted, the result can contain visual distortions that confuse or hide different conditions. Even without these distortions, we humans suffer from biases, and we tend to see patterns where none exist. Using heuristic algorithms can help overcome these subjective deductions. Getting back to our example, by looking at the user and real time measurements we can estimate the level of parallelism in the GC cycle. We can also determine when the GC threads accumulated more kernel time than would be considered normal.
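To illustrate the kind of heuristic involved, here is a minimal sketch in Java that parses a "[Times: ...]" summary and derives two of the signals discussed above: an estimate of GC parallelism and the share of CPU time spent in the kernel. This is not Censum's implementation, and the 10% kernel-time threshold is an arbitrary value chosen for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CpuSummaryCheck {

    // Matches the CPU summary appended to each collection when PrintGCDetails is on,
    // e.g. [Times: user=0.08 sys=0.02, real=0.02 secs]
    private static final Pattern TIMES =
            Pattern.compile("\\[Times: user=([\\d.]+) sys=([\\d.]+), real=([\\d.]+) secs\\]");

    public static void main(String[] args) {
        String line = "[Times: user=0.08 sys=0.02, real=23.79 secs]"; // illustrative values
        Matcher m = TIMES.matcher(line);
        if (!m.find()) return;

        double user = Double.parseDouble(m.group(1));
        double sys = Double.parseDouble(m.group(2));
        double real = Double.parseDouble(m.group(3));

        // Parallelism estimate: CPU seconds burned per wall-clock second of the collection.
        // Values well below 1.0 mean the GC threads were mostly not running.
        double parallelism = real > 0 ? (user + sys) / real : 0.0;

        // Share of CPU time spent in the kernel; GC threads should accumulate very little.
        double kernelShare = (user + sys) > 0 ? sys / (user + sys) : 0.0;

        System.out.printf("parallelism=%.3f, kernel share=%.0f%%%n", parallelism, kernelShare * 100);
        if (kernelShare > 0.10) {
            System.out.println("warning: unusually high kernel time for a GC thread");
        }
        if (parallelism < 1.0) {
            System.out.println("warning: GC threads were off-CPU for much of the pause");
        }
    }
}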
Each thread maintains a set of performance counters that allow it to track different metrics, including user and kernel times. The garbage collector harvests these counters from each of the GC threads and prints the summary at the end of the GC log record. It is up to the operating system to maintain these performance counters. It does so by sampling: at each sample point it determines which mode the thread is running in and increments either the user or the kernel counter accordingly. Normally GC threads run in user mode, meaning that most of the counts should be attributed to that mode. However, GC threads do make system calls, and when that happens the thread undergoes a mode switch and runs in kernel mode. While the thread is running in this protected mode, all samples will be attributed to kernel time. Normally these system calls are the result of the collector requesting a page of memory or writing HotSpot perf data. But since requesting a new page should be quick, and the amount of perf data written to disk is minuscule, GC threads should not accumulate any measurable amount of kernel time. If, however, these operations take longer than expected for any reason, the thread counters will be credited with additional kernel time.
For example, a commonly seen case happens when several JVMs are running on the same hardware and each logs a tremendous amount of data. In this case the I/O channel can become backed up to such a degree that it stalls the GC thread executing the write system call. Any CPU samples collected during this stall will be attributed to kernel time. GC threads can also be tagged with kernel time when the kernel is managing pages. Again, this shouldn't cause the GC threads to accumulate kernel time, but it will if the system itself is short on memory. Any of these problems will interfere with the collector, but they will also interfere with your application's overall performance. A simple inspection of kernel times can help us determine if any of this is interfering with our collector's ability to get the job done in the shortest amount of time.
In our case Censum reported only one collection in which the kernel time was high. It also found that the collector appeared to run single-threaded 18 times. Given that there were 22 collections in total, this is a strong indication that the collector is not responsible for the long pause.
Looking at the CPU Summary view (Figure 4) we can quickly spot the long pause in the upper left-hand corner. The collection happens in the middle of a cluster of events, which makes it difficult to see everything clearly. Nonetheless it is quite evident that the real time is much longer than the user or kernel time, and this is odd.
Figure 4. CPU Summary
When we zoom in (Figure 5) we can see that we have approximately 0.08 seconds of user time and 0.02 seconds of kernel time. The big question is: how does 0.08 + 0.02 = 0.10 seconds of user and kernel time translate into almost 23 seconds of real time, even assuming that the 0.08 seconds of user time was accumulated by a single thread? (If more threads were involved then the discrepancy is even greater, as the 0.08 seconds would be divided between those threads.) Normally multiple threads will accumulate more user time than real time, so this situation is completely topsy-turvy. Clearly something happened in the system that prevented the GC threads from running for roughly 22 seconds. In other words, something outside of the JVM was responsible for this pause, in this case operating system maintenance, which means the GC log pause time report was a red herring.
Figure 5. CPU Summary Zoomed In
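To put a number on just how odd this is: the GC threads were credited with about 0.08 + 0.02 = 0.10 seconds of CPU time against roughly 23.8 seconds of wall-clock time, a duty cycle of about 0.10 / 23.8 ≈ 0.4%. In other words, the GC threads were on a CPU for well under one percent of the reported pause; for the rest of it they were waiting to run.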
Conclusion
In the words of Lawrence Krauss, "there are two good states to be in, confused or wrong, because these states represent an opportunity to learn". In this case we were wrong about GC being responsible for the long pause, and only when we took a critical look at the data in the GC log were we freed from our prejudice, our ingrained tendency to blame the garbage collector for this long pause. When we blame the collector, the most common reaction is to search the web for a solution, only to end up setting random, obscure JVM flags in the hope that one of them will solve an unknown problem. By performing an analysis we discovered that we were wrong to blame the collector, and being wrong naturally leads us to ask questions, such as: if the garbage collector wasn't responsible for the pause, then what was?
The evidence presented here suggests that the GC threads were not active for the vast majority of the pause. If the pause had been due to background I/O, the GC threads, held up in the OS, should have accumulated an inordinate amount of kernel time, but they didn't. This all suggests that the GC threads were swapped out and, incredibly, not rescheduled for more than 22 seconds! If our app wasn't paused by the garbage collector, then the only possibility is that the JVM was paused by the operating system, even if that doesn't seem to make any sense.
The fact is, operating systems sometimes do need to perform maintenance, and when that happens, just as with GC, the OS may need to pause everything else. Like the GC pauses coming from a well-tuned collector, OS pauses are designed to occur infrequently and to be so brief that they are hardly noticed. But some pauses can last for quite a long time. For example, it is not unheard of for a garbage collector to have to wait several seconds for a page allocation request to be honored. In these cases what is commonly happening is that the system is moving data around in memory in order to get enough contiguous free RAM to satisfy the request.
Though it's not so likely in this case, there are times when we see the clock being adjusted by the Network Time Protocol (NTP) in the middle of a collection. We generally only recognize this condition when the clock is rolled backwards and the log reports that the collection ended before it started. As mentioned, this is unlikely in this case because clock values are generally adjusted by milliseconds at a time, not by tens of seconds.
In all cases the most important realization is that GC analysis can lead us past the “blame the collector” game, opening us up to making some amazing discoveries that can help improve the performance and stability of our applications.
About the Author
Kirk Pepperdine is the author of the original Java Performance Tuning Workshop, in which he lectures on garbage collection analytics and many other Java performance topics. He also cofounded jClarity, a company that develops machine-learning-assisted performance diagnostic tooling designed to assist in the diagnosis of complex performance regressions.