jClarity Releases Censum 3.0

Censum, the Java garbage collection analysis tool by jClarity, has reached version 3.0. The main new features include the ability to analyse Safepoint logs, new graphs showing the behaviour of the G1 garbage collector, and a set of analytics that highlight when an application is forcing too much OS activity.

The focus on G1 makes sense given that G1 will become the default garbage collector for server configurations in Java 9, while Safepoint log analysis can help users go beyond garbage collection tuning. A safepoint is a mechanism by which the JVM completely stops an application's threads so that it can perform certain maintenance tasks. Stop The World (STW) garbage collections, such as a full garbage collection, are among the main reasons to force a safepoint, which is why garbage collection tuning attracts so much attention. However, as users progressively learn how to avoid GC-induced pauses, they need to start considering the other causes of safepoints.

Despite these advances, performance analysis and tuning remains a rather obscure topic for most users. For this reason, InfoQ reached out to Kirk Pepperdine and Martijn Verburg, CTO and CEO of jClarity, to learn more about performance tips and myths, and how Censum helps tell the two apart.

InfoQ: Part of the problem with performance tuning is that there aren't many resources one can turn to for help. Is this why you guys decided to create Censum, to provide such a resource?

Kirk Pepperdine: Actually, work on Censum even predates jClarity. It was motivated by my frustration at the sorry state of tooling, given how important memory management is to application performance. There were some discussions at Sun about building a framework to parse the logs and create some charts, but it turned out that it was only going to produce (x, y) data points. I found this insufficient and concluded that the only way I was going to get what I needed was to build it myself. So, initially, Censum was just a tool built by me for me. I never really considered doing anything else with it until I ran into Ben and Martijn.

InfoQ: Is this also the reason why you created the distribution list Friends of jClarity?

Martijn Verburg: Absolutely. Friends of jClarity was started to create a community where performance tuning experts from around the world could collaborate in a neutral space in order to help out engineers who have performance tuning issues. It's a deliberately friendly and open list where "No question is stupid" and where there is a philosophy of providing expert advice backed by empirical evidence.

Performance tuning is a dark art for most people and there is a lot of folklore and poor/incorrect information on blog posts and sites such as Stack Overflow. Friends of jClarity has a number of leading performance tuning experts and engineers who actually work on core Java/JVM technologies. The answers and advice given on the list tend to be far more accurate, correct and up to date with the latest advances than elsewhere.

InfoQ: Who can or should join Friends of jClarity?

Martijn Verburg: Anyone can join by visiting the list and clicking on the Apply to join group button. If anyone has any issues then they can email our support team at: support AT jclarity DOT com.

InfoQ: For those new to performance tuning, it could be easy to assume that making faster apps is all about CPU cycles; however, a lot of what Censum looks at relates to memory management. What other misconceptions do you think are common among new performance tuners?

Kirk Pepperdine: Performance tuning is all about understanding what resources you have and which of those resources are under- or over-utilised. Once you have visibility into your core resources (CPU, memory, disk, network, JVM threads and so forth), as well as some timing measurements around the inputs and outputs to those resources, you can quickly start to identify where the bottlenecks are and how to reduce the strength of that dependency.
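As a starting point for that kind of visibility from inside a JVM, the sketch below reads a few of those resources through the standard `java.lang.management` API. It is a minimal illustration, not how Censum gathers its data, and the `ResourceSnapshot` class is a hypothetical name. Disk and network activity are not exposed by these interfaces, so they would still have to come from operating-system tools.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

public class ResourceSnapshot {
    // A point-in-time reading of a few core resources (hypothetical helper type).
    static final class Snapshot {
        final int processors;
        final double systemLoadAverage; // -1.0 where the platform does not report it
        final long heapUsedBytes;
        final long heapMaxBytes;        // -1 if no maximum is defined
        final int liveThreads;

        Snapshot(int processors, double systemLoadAverage, long heapUsedBytes,
                 long heapMaxBytes, int liveThreads) {
            this.processors = processors;
            this.systemLoadAverage = systemLoadAverage;
            this.heapUsedBytes = heapUsedBytes;
            this.heapMaxBytes = heapMaxBytes;
            this.liveThreads = liveThreads;
        }

        @Override
        public String toString() {
            return String.format("cpus=%d load=%.2f heapUsed=%dMiB heapMax=%dMiB threads=%d",
                processors, systemLoadAverage, heapUsedBytes >> 20, heapMaxBytes >> 20, liveThreads);
        }
    }

    // Reads CPU count, load average, heap usage and live thread count from the platform MXBeans.
    static Snapshot take() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return new Snapshot(os.getAvailableProcessors(), os.getSystemLoadAverage(),
            heap.getUsed(), heap.getMax(), threads.getThreadCount());
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample once per second; timings of the application's own inputs and
        // outputs would have to be recorded separately.
        for (int i = 0; i < 5; i++) {
            System.out.println(take());
            Thread.sleep(1000);
        }
    }
}
```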

That application performance is CPU bound is a common misconception. It is in fact pretty unusual for any application, be it Java or C#, to be CPU bound. We've seen more than a few customers spend large amounts of money on bigger and better CPUs only to see no change in performance.

InfoQ: There is always a lot of talk about performance, which people address in different ways: some try to rewrite code in a more performant way, others play with GC flags. What's your recommendation for people who are taking their first steps? At what point does one need to start using Censum?

Kirk Pepperdine: It’s hard to know what to tune if you haven’t first determined the root cause. Censum is one of the tools that we use to determine the root cause. Once you’ve identified it, what to tune and how should become obvious. Some problems may be solved by turning a few knobs, but in our experience most will be solved by refactoring the offending component. After that, Censum can help you make the tuning decisions that ensure GC is not getting in the way.

InfoQ: When talking about performance, a lot of the conversation centres on software, but obviously hardware plays a very important role. Does Censum give you hints as to what hardware you may need to change (e.g. a faster CPU, more memory)?

Kirk Pepperdine: Yes, Censum can help you determine how much hardware you need to run your JVM-based app smoothly from a memory management point of view. For example, Censum provides you with an estimate of your live set size, which is the amount of data that is always resident in the Java heap. This allows you to set your maximum heap size, which in turn tells you how much physical RAM you need on board.
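To make the live-set-to-heap step concrete, here is a minimal sketch. Purely for illustration, it assumes the commonly quoted rule of thumb that the maximum heap should be roughly three to four times the live set; that ratio and the hypothetical `HeapSizing` class are not taken from Censum. As a rough live-set proxy, it sums the heap occupancy that each memory pool reports right after its most recent collection via `MemoryPoolMXBean.getCollectionUsage()`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class HeapSizing {
    // ASSUMPTION: a commonly quoted 3-4x rule of thumb, not Censum's formula.
    static final double HEAP_TO_LIVE_SET_RATIO = 4.0;

    // Suggested maximum heap (-Xmx) in bytes for a given live-set estimate.
    static long suggestedMaxHeap(long liveSetBytes) {
        if (liveSetBytes < 0) {
            throw new IllegalArgumentException("liveSetBytes must be non-negative");
        }
        return (long) Math.ceil(liveSetBytes * HEAP_TO_LIVE_SET_RATIO);
    }

    // Rough live-set proxy: heap occupancy measured just after each pool's most recent
    // collection. Most meaningful after a full collection; pools that do not support
    // collection usage return null and are skipped.
    static long liveSetAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long liveSet = liveSetAfterLastGc();
        long maxHeap = suggestedMaxHeap(liveSet);
        System.out.printf("live set ~%d MiB -> suggested -Xmx ~%d MiB%n",
            liveSet >> 20, maxHeap >> 20);
    }
}
```

Physical RAM has to cover more than `-Xmx`: thread stacks, Metaspace, the code cache and direct buffers all live outside the Java heap, so a RAM estimate should leave headroom for them.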

Another typical issue is the allocation rate (the amount of memory allocated over a period of time). CPUs and memory subsystems these days are quite fast. You can generally allocate gigabytes of data per second without any noticeable impact on performance. However, if you were to do that a few bytes at a time, it would be a problem. Since most Java applications do exactly this (allocating a few bytes at a time), you don’t need a very high allocation rate before allocation becomes the main drag on performance. This is a problem that execution profilers can miss, and even when they do catch it, they tend to expose it in a confusing way. Since Censum tells you what the allocation rate is, you can use that information to decide whether to reach for an allocation profiler, or at least to better understand the information coming from an execution profiler. It simply helps you bring context to the other signals your system may be sending you. In turn, you’ll get better scalability, which means you’ll be able to handle more load with less hardware. As an example, one of our customers used Censum to reduce the hardware needed for a particular case from 80 servers to 4.
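Measuring an allocation rate yourself is mostly arithmetic: bytes allocated divided by elapsed time. The sketch below assumes a HotSpot-based JVM, where the `com.sun.management.ThreadMXBean` extension exposes `getThreadAllocatedBytes`; the `AllocationRate` class and its tiny-array workload are hypothetical stand-ins for an application that allocates a few bytes at a time.

```java
import java.lang.management.ManagementFactory;

public class AllocationRate {
    // Pure helper: bytes allocated over an interval, expressed per second.
    static double bytesPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return bytes * 1_000_000_000.0 / elapsedNanos;
    }

    // HotSpot/OpenJDK extension of the standard ThreadMXBean; on a JVM that does not
    // provide it, this cast fails when the class is initialised.
    private static final com.sun.management.ThreadMXBean THREADS =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // Cumulative bytes allocated by the calling thread (-1 if measurement is disabled).
    static long allocatedByCurrentThread() {
        return THREADS.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        // Storing each array here makes it escape, so the JIT cannot optimise the allocation away.
        Object[] sink = new Object[1024];
        long before = allocatedByCurrentThread();
        long start = System.nanoTime();
        // The "few bytes at a time" pattern: many tiny, short-lived arrays.
        for (int i = 0; i < 10_000_000; i++) {
            sink[i & 1023] = new byte[16];
        }
        long elapsed = System.nanoTime() - start;
        long allocated = allocatedByCurrentThread() - before;
        System.out.printf("%,d bytes in %.1f ms = %.1f MB/s%n",
            allocated, elapsed / 1e6, bytesPerSecond(allocated, elapsed) / 1e6);
    }
}
```

Censum works from GC logs rather than instrumenting threads like this; the per-thread counter is simply a quick way to get the number for a code path you already suspect.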

InfoQ: Censum provides a vast array of graphs and metrics. Can you tell us about your development process for defining these? How do you decide what makes a useful graph?

Martijn Verburg: Kirk has been tuning garbage collection algorithms since they first arrived in Java, so he has a lot of intuition for these. Of course, the rest of the team makes proposals too. However, one of the main sources of inspiration comes from answering our customers' questions: the idea is to create new graphs that give them the answer straight away.

For example, customers recently were wondering where some mysterious Full GCs in their applications were coming from. We dug in and realised that we were missing the rare case where PermGen or Metaspace overflows trigger Full GCs. So Kirk came up with some new graphs showing when these events occurred, and also built an analytic report that the end user can easily digest, something along the lines of "Go and make your PermGen size X".
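An application can also spot such collections in-process. The sketch below is an illustration under stated assumptions, not how Censum detects them: it registers a listener for HotSpot's GC notifications (`com.sun.management.GarbageCollectionNotificationInfo`) and reports collections whose cause indicates class-metadata pressure. The cause strings `"Metadata GC Threshold"` (Metaspace) and `"Permanent Generation Full"` (PermGen) are assumed to be the ones HotSpot reports, and `MetadataGcWatcher` is a hypothetical class.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

public class MetadataGcWatcher {
    // ASSUMPTION: HotSpot's cause strings for collections triggered by class metadata
    // (Metaspace on Java 8+, PermGen on older JVMs).
    static boolean isMetadataTriggered(String gcCause) {
        return "Metadata GC Threshold".equals(gcCause)
            || "Permanent Generation Full".equals(gcCause);
    }

    // Registers a listener on every collector MXBean that emits notifications;
    // returns how many listeners were registered.
    static int install() {
        int registered = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(gc instanceof NotificationEmitter)) {
                continue;
            }
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                    .from((CompositeData) notification.getUserData());
                if (isMetadataTriggered(info.getGcCause())) {
                    System.out.printf("%s (%s) caused by \"%s\", %d ms%n",
                        info.getGcName(), info.getGcAction(), info.getGcCause(),
                        info.getGcInfo().getDuration());
                }
            }, null, null);
            registered++;
        }
        return registered;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("listening on " + install() + " collector(s)");
        // Run or load the application here; listeners fire on a JMX notification thread.
        Thread.sleep(60_000);
    }
}
```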

InfoQ: Unlike the traditional collectors, which lay the heap out as large contiguous generations, G1 manages memory as a set of regions. How hard was it to adapt Censum to this?

Martijn Verburg: Censum reads GC data from a log file that the JVM produces; it doesn't touch your JVM. A side effect is that getting access to the new data was easy! The problem came when trying to parse that data, which proved more difficult because the log formats are neither defined nor standardized and can be changed by HotSpot engineers with very little notice.
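To illustrate why format drift is painful, here is a minimal regex-based parser for a single GC pause line. It assumes the JDK 9+ unified-logging summary format (for example `GC(12) Pause Young (Normal) (G1 Evacuation Pause) 120M->30M(256M) 4.567ms`), which is not what logs from Censum 3.0's era looked like; older JVMs emit different text, and any change to it means updating the pattern. `GcPauseLineParser` is a hypothetical name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseLineParser {
    // One stop-the-world pause, with sizes normalised to KiB.
    static final class Pause {
        final int gcId;
        final String kind;
        final long beforeKb, afterKb, heapKb;
        final double millis;

        Pause(int gcId, String kind, long beforeKb, long afterKb, long heapKb, double millis) {
            this.gcId = gcId;
            this.kind = kind;
            this.beforeKb = beforeKb;
            this.afterKb = afterKb;
            this.heapKb = heapKb;
            this.millis = millis;
        }
    }

    // ASSUMPTION: JDK 9+ unified-logging summary lines. The lazy "Pause .+?" group
    // tolerates extra parenthesised tags such as "(Normal)" that vary between releases.
    private static final Pattern PAUSE = Pattern.compile(
        "GC\\((\\d+)\\) (Pause .+?) (\\d+)([KMG])->(\\d+)([KMG])\\((\\d+)([KMG])\\) ([\\d.]+)ms");

    // Converts a size with a K/M/G suffix into KiB.
    static long toKb(String value, String unit) {
        long v = Long.parseLong(value);
        switch (unit) {
            case "K": return v;
            case "M": return v * 1024;
            case "G": return v * 1024 * 1024;
            default: throw new IllegalArgumentException("unknown unit: " + unit);
        }
    }

    // Returns the parsed pause, or empty for lines this pattern does not recognise.
    static Optional<Pause> parse(String line) {
        Matcher m = PAUSE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Pause(
            Integer.parseInt(m.group(1)),
            m.group(2),
            toKb(m.group(3), m.group(4)),
            toKb(m.group(5), m.group(6)),
            toKb(m.group(7), m.group(8)),
            Double.parseDouble(m.group(9))));
    }

    public static void main(String[] args) {
        String line = "[2.345s][info][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 120M->30M(256M) 4.567ms";
        parse(line).ifPresent(p -> System.out.printf("GC(%d) %s: %d KiB -> %d KiB in %.3f ms%n",
            p.gcId, p.kind, p.beforeKb, p.afterKb, p.millis));
    }
}
```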

After a few months of work, we had the parsing sorted, and were then able to focus on building the internal data model for G1 within Censum. This proved to be fairly easy as we had a well defined structure and interface for modelling memory pools and GC behaviour. Building the graphs for G1 was also fairly easy, since we could reuse a great deal of the infrastructure we had already created for the other collectors.

Building meaningful analytics has been the hardest part. Not much research has been done on G1 yet and, although people are starting to switch it on and generate information from real production systems, we did not have the same corpus of logs as we had for other collectors. This means we simply didn't have enough data to ascertain what is valuable and what's not.

There is still work to do in terms of analytics for G1, but at least we are now able to warn users about poorly performing applications or dangerous patterns of behaviour with this collector.

InfoQ: Censum now supports Safepoint Logs, meaning non-GC STW pauses can be analysed. Given that GC pauses tend to be the main target when analysing pauses, what should we look at when thinking about non-GC pauses?

Kirk Pepperdine: It's true that garbage collection is just one of the reasons for a JVM to safepoint your application threads. There are many other maintenance operations that will also bring your application to a screeching halt. Biased locking revocation is probably the most common one, but unfortunately we can't know for sure because we don’t have a lot of data. To make things worse, my guess is that very few people are aware that this data can be collected, and even fewer actually collect it. Now that Censum visualizes this data, we hope to change this and show people what they may be able to do to reduce the rate of safepointing.
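As a starting point for collecting this data, Java 8 HotSpot run with `-XX:+PrintGCApplicationStoppedTime` writes a line for every safepoint pause, GC or not, beginning `Total time for which application threads were stopped:`. The sketch below summarises those lines. It assumes that Java 8 line format (later JDKs log safepoints differently), treats the optional `Stopping threads took` value as time-to-safepoint, and uses a hypothetical `StoppedTimeSummary` class. It does not identify the cause of each pause, which would need additional safepoint logging.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoppedTimeSummary {
    // ASSUMPTION: the Java 8 HotSpot line format; the "Stopping threads took" part is optional.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([\\d.]+) seconds"
            + "(?:, Stopping threads took: ([\\d.]+) seconds)?");

    final int pauses;
    final double totalStoppedSeconds;
    final double maxStoppedSeconds;
    final double totalTimeToSafepointSeconds; // time spent bringing threads to the safepoint

    private StoppedTimeSummary(int pauses, double total, double max, double ttsp) {
        this.pauses = pauses;
        this.totalStoppedSeconds = total;
        this.maxStoppedSeconds = max;
        this.totalTimeToSafepointSeconds = ttsp;
    }

    // Aggregates every matching line; all other log output is ignored.
    static StoppedTimeSummary summarize(Iterable<String> lines) {
        int pauses = 0;
        double total = 0, max = 0, ttsp = 0;
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (!m.find()) {
                continue;
            }
            double stopped = Double.parseDouble(m.group(1));
            pauses++;
            total += stopped;
            max = Math.max(max, stopped);
            if (m.group(2) != null) {
                ttsp += Double.parseDouble(m.group(2));
            }
        }
        return new StoppedTimeSummary(pauses, total, max, ttsp);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StoppedTimeSummary <gc-log-file>");
            return;
        }
        StoppedTimeSummary s = summarize(Files.readAllLines(Paths.get(args[0])));
        System.out.printf("%d pauses, %.3f s stopped in total, worst %.3f s, %.3f s reaching safepoints%n",
            s.pauses, s.totalStoppedSeconds, s.maxStoppedSeconds, s.totalTimeToSafepointSeconds);
    }
}
```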

InfoQ: What's next for Censum?

Martijn Verburg: We're building a streaming version which can monitor multiple JVMs in real time and provide the same detailed metrics and analytics that Censum provides today. The use of real-time data will also enable alerting on hazardous events as they occur, or even predicting events that are likely to happen, something like "Your application has a 95% chance of dying with an Out Of Memory Error (OOME) in the next 2 minutes."
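To give a flavour of how such a prediction could work, here is a deliberately naive sketch: fit a least-squares line to heap occupancy measured after each GC and extrapolate it to the heap limit. It is an illustration only, not jClarity's method; real occupancy is rarely linear, and the `OomeForecast` class and its sample data are hypothetical.

```java
import java.util.OptionalDouble;

public class OomeForecast {
    // Fits a least-squares line to post-GC heap occupancy over time and extrapolates it
    // to the heap limit (occupancy and limit in any consistent unit). Returns seconds
    // remaining after the last sample, or empty when occupancy is flat or shrinking.
    static OptionalDouble secondsUntilExhaustion(double[] timesSec, double[] usedAfterGc,
                                                 double maxHeap) {
        int n = timesSec.length;
        if (n < 2 || usedAfterGc.length != n) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        double meanT = 0, meanU = 0;
        for (int i = 0; i < n; i++) {
            meanT += timesSec[i];
            meanU += usedAfterGc[i];
        }
        meanT /= n;
        meanU /= n;
        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            double dt = timesSec[i] - meanT;
            cov += dt * (usedAfterGc[i] - meanU);
            var += dt * dt;
        }
        if (var == 0) {
            throw new IllegalArgumentException("timestamps must not all be equal");
        }
        double slope = cov / var; // growth of retained memory per second
        if (slope <= 0) {
            return OptionalDouble.empty();
        }
        double intercept = meanU - slope * meanT;
        double tAtLimit = (maxHeap - intercept) / slope;
        return OptionalDouble.of(Math.max(0.0, tAtLimit - timesSec[n - 1]));
    }

    public static void main(String[] args) {
        // Hypothetical samples: seconds since start, MiB in use after each GC, 1024 MiB heap.
        double[] t = {0, 30, 60, 90, 120};
        double[] usedMiB = {200, 260, 330, 390, 450};
        secondsUntilExhaustion(t, usedMiB, 1024)
            .ifPresent(s -> System.out.printf("heap limit reached in ~%.0f s if the trend holds%n", s));
    }
}
```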

Censum will also be integrated with our Illuminate product. In short, Illuminate analyses an application and indicates what type of performance issue is impacting it the most, whether that is memory/GC related or something else (e.g. disk I/O, deadlocked threads, thread starvation, waiting for an external system to respond). If the problem is GC related, you'll be able to deep dive into the issue with the Censum feature set and get recommendations on how to fix it.

Martijn Verburg is co-founder and CEO of jClarity, a Java/JVM performance tuning company that provides industry leading analysis tools driven by machine learning. He is a leading expert on software methodology and technical team optimisation with years of experience in running large distributed organisations. You can find him delivering presentations at major conferences where he challenges the industry status quo as his alter ego The Diabolical Developer. He is the co-leader of the London Java Community and leads the global effort for working on Java standards via the Adopt a JSR and Adopt OpenJDK programmes. He was recognised as a Java Champion in 2012 for these contributions.

Kirk Pepperdine is co-founder and CTO of jClarity. He has been working in high performance and distributed computing for nearly 20 years. Initially Kirk was architecting, developing, and tuning applications running on Cray and other high performance computing platforms. He now specialises in Java, where he works on all aspects of performance and tuning across every phase of a project's lifecycle. Kirk was named a Java Champion in 2006 in recognition of his outstanding contributions to the Java community.
