BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Presentations Is Your Java Application Slow? Check out These Open-Source Profilers

Is Your Java Application Slow? Check out These Open-Source Profilers

Bookmarks
37:23

Summary

Johannes Bechberger focuses on understanding the basic concepts of profiling like flame graphs, usage of async-profiler and JMC, advantages and disadvantages of the different tools.

Bio

Johannes Bechberger is a JVM developer working on profilers and their underlying technology in the SapMachine team at SAP. This includes improvements to async-profiler and its ecosystem, a website to view the different JFR event types, and improvements to the FirefoxProfiler, making it usable in the Java world.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Bechberger: You probably all know the situation when you're asked by others the question, aren't you a software developer? Then, could you write me some awesome website? That's what happened to me. I'm part of the organizing team of the Night of Sciences in Karlsruhe, where PhDs and professors can show their research to the general public at an evening in half an hour talks. It was during Corona, where everyone had to be at home and where we had the event held remotely, and so we had the problem that we wanted to have some interactivity there. Essentially, we wanted to ask participants or the viewers some questions in a quiz, like, how old is the universe? After a talk on astronomy. I was asked, and what I did, I just created a small website. Essentially, we had a client that regularly asked the server, is there any question? If yes, please give me the question. Of course, it could also answer the questions using this client. Then you had an admin interface with which you can set the question. Then you had the server that stored all the information, all the data in a simple JSON file, because I thought, ok, if I just use the file system, then the file system does the whole synchronization and consistency for me. I thought it's because a premature optimization is the root of all evil, as Donald Knuth once wrote.

The event came, and I developed a software and was quite confident that it worked, because it's a simple software, a couple of hundred lines of code. The client asks every 10 seconds the server for, is there any questions? That's enough margin there. The event came in the evening, and the server was, of course, overwhelmed as one might expect, when I'm giving this example in historic. What was the problem? The problem was that the server just couldn't handle that many connections, and that many participants that wanted to participate in the quiz. Now we come back to Donald Knuth's original quote, which was a bit longer. It was originally, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." He wrote further, "Yet we should not pass up our opportunities in the critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code, but only after that code has been identified." We need to identify what went wrong, and profilers to the rescue, they are the tools that you use to see what went wrong, performance-wise. When you talk about profilers to someone who is not from computer science, you typically get a lot like, isn't this something that people in crime shows do? Isn't that something that Sherlock Holmes does, like a profiler solving crime mysteries? Essentially, that's what we're also doing here. We're not solving crime mysteries, but we are solving the problem on why is this application so slow. Essentially, we are Sherlock Holmes and Watson digging into the depths of the performance data to see what went wrong.

Flame Graphs

One of the most common ways to visualize some performance data that you get is a flame graph. Here you see the flame graph building up in the bottom. You see that the main function takes all the time because it's the main function, and on the top, you see the pseudocode building up of this. This is the simplified lookup but it's essentially what the server is executing. The main function just calls some initialization code and then calls the serverLoop. The serverLoop takes most of the time of the main function. The serverLoop has a wide bar. The width of the bar is proportional to the amount of time that it takes to execute this function. Then the serverLoop is just a simple loop that waits for a new request, gets it, and checks whether this is a QuestionRequest, and then calls the handleQuestionRequest method, and then handles also some other requests that's not important here. You see in the flame graph that the handleQuestionRequest takes a lot of time. Then, what does the handleQuestionRequest do? It just checks, is there any question enabled? Then it emits the current question. These methods shouldn't take that much time, but why do they? The flame graph also gives us this answer, essentially, because we're all the time parsing JSON. This is not a bad thing. We've seen again, to quote Mario Fusco, when you do something stupid, it punches you in your face, and it's impossible not to see it. That's the beauty of a flame graph, they are quite easy to understand. You see when you're doing something stupid, performance-wise.

If we then rewrite the handleQuestionRequest, and isQuestionEnabled, and the currentQuestion method to use something more sensible, for example, on SQLite database, then the performance of the whole application drastically improves, and we don't have any problems anymore with like 90 to 100 concurrent usage, which is really great, and which makes the application usable again. You see, the technique of profiling should be part of your toolbox, shining light on the performance issues. It should be in the toolbox, like debugging or testing, so when you come over a problem, you can open your toolbox and see what the appropriate tool is. It's important to know for each of these techniques, a few tools that you can readily apply.

Background

Who am I to tell you about profiling? I'm Johannes Bechberger. I work at the SapMachine team at SAP. The SapMachine team works on the OpenJDK, and is part of the open source teams at SAP. I'm specifically working there in the field of open source profilers. I develop profiler UIs like the one you see, but just based on Firefox Profiler. I work on async-profiler and related tools. I'm also working to get a profiling API into the OpenJDK. It's currently at JDK and is also a candidate.

What Is Profiling?

Who of you used a profiler? I chose the results of the JetBrains 2022 Developer Survey. In the survey, 40% said no. That's quite a large number because profiling is really important to know and to have in your toolbox when you're developing code, especially if the code goes to the outer world. What is profiling? This whole talk is on profiling, so I should probably define what I mean with this. I'm going with the definition of The Jargon File, which is that it's a technique to obtain a profile, and the profile is a report on the amounts of time spent in each routine of the program used to find and tune away the hot spots in it, like we did before with our introduction example.

Types of Profilers

There are of course different profilers. There are essentially two types of profilers. There are instrumenting profilers and sampling profilers. Instrumenting profilers insert instructions into your code automatically. For example, you see here the serverLoop function, by an instrumenting profiler which inserts at the beginning a log statement that says, ok, I'm entering serverLoop, and the end statement that calls a method that states, I'm exiting serverLoop. The cool thing is, that's quite accurate in the sense that you see every function entry and exit. That's also problematic, because JITs don't really like it when you increase the amount of code potentially by a large amount. It's also problematic, because it's not that fast, because, yes, you're executing a lot of statements. There's another type of that, it's sampling profilers, and they sample. They don't do anything on every loop entry and exit, but rather, they ask the JVM in regular intervals, usually, in every 10 to 20 milliseconds, what are you currently executing? What's your call trace? See it here, we ask the JVM four times. Then we combine this data. That's of course, approximate. It has its own problems regarding accuracy, but the advantage is that you're seeing really what the JVM is doing, and what your program is doing.

Safepoint

Regarding accuracy, that depends on Safepoint. What is a Safepoint? A Safepoint is, essentially, a point where all threads are in a safe state. Here, every thread checks at regular points in the code, is there a Safepoint requested? If yes, please stop, and the thread stops. It checks whether a Safepoint is requested, for example, at the beginning of a method, at the end of a method. As we see here, also at the end of a loop running. Why is this important for profiling? For profiling, there are methods that you can use to obtain a stack trace, and there are methods that are Safepoint biased, so they only work at a Safepoint, and there are methods that don't. Here in the following I use this U-boat, which is in a non-Safepoint state, or in the state where it doesn't check, can I go fast in a Safepoint, when it's lighter of color and darker when it's at a Safepoint. Safepoint are usually used to do things like garbage collection. Safepoint bias comes because, when you now say to the thread, please stop, and it only stops when it's really ready, then you're getting a stack trace or a call trace. In a fully asynchronous mode, you're just telling the thread, stop, and it just stops. You don't have to wait to check to a Safepoint, and is that a Safepoint? You get it really at the point that you're demanding it. You don't have any bias, which increases the accuracy. Also, it has the problem that you might be in undefined state, and so it's slightly less stable than going with the Safepoint bias route. Safepoint bias is especially important with multiple threads because here we want to essentially only stop one thread, but here we are asking like, please stop when you're ready. We have to wait till the second thread also is finished, to set a Safepoint. Here, fully asynchronous, we just tell it stop to the thread 1, and the thread 2 goes along.

How Profiling Works

How does then profiling work? We've seen, ok, there's some kind of Safepoint bias, and that we're doing some sampling, but what are we doing at every interval? A profiler first gets all the available threads. Then selects a random subset, because sampling all threads isn't usually possible because when you have few thousand threads, then you cannot really walk all the threads every 10 milliseconds, because this would take too much time, and the performance overhead would be too high. Then, you pre-allocate some traces on the data structure that you're storing, like your call traces later. Then for the asynchronous version, you're pinging with a Unix signal. For example, you have every thread saying, please ping, and the thread stops. You're going into a signal, and there you walk the stack. With the synchronous version, you can directly tell the thread, what's your stack trace? You're doing this with the other threads too. Then you have a set of traces and you collect them. Then you do some post-processing, for example, if I'm trying to write flame graphs as we see.

Sampling Profiler - External and Built-In

I showed you that there are two types of sampling profilers, but also there are different specific sampling profilers. There are external sampling profilers and built-in. External is quite simple. They are agents' native or Java agents that are attached to the JVM, and then use methods to obtain a stack trace that can be called from external agents. Built-in is just directly built into your JDK, or other JVMs. Here with externals, we have to distinguish between synchronous and asynchronous. Synchronous, this was with the Safepoint bias, and asynchronous is not with the Safepoint bias, but with signal handlers, as I've shown you before. One of the most common tools that uses the synchronous method is VisualVM which is based on NetBeans, which is available since 2010. It was shipped directly with the JDK from version 6 to version 8. This is the reason why many people use it, and also currently use it because it's just the first fully available, easily available tool that was there. Then, the asynchronous method were introduced for Forte Analyzers. This was a project from Sun that started in 1991, which is essentially a profiler for their systems. For this analyzer, they implemented the AsyncGetCallTrace call in 2002, and removed it three months later, but it's essentially still there. In 2013, people started to use it for their own sampling profilers, to write their own async-profiler, which is the most commonly used. It started in 2016. There's also the built-in profiler which is called Java, or later, JDK Flight Recorder, which is also non-Safepoint bias because, internally, it uses similar methods as async-profiler. There's a frontend for it, it's called Java, and later, JDK Mission Control, which started in 2012, and was open sourced in 2018. It's now one of the most extensive tools in the open source world for profiling. Then, many other tools usually use a combination of async-profiler. Many of them are also embedded, and some run JFR later on.

Obtaining a Profile

How do you obtain a profile using these tools? I will be focusing on two tools, on JFR, and on async-profiler. First, the async-profiler. It's a Java agent. It's a native agent, so you can just parse it, and tell it where it is. It's therefore platform specific. You have to download it for your specific platform from the GitHub page, you can tell it what it should sample for CPU [inaudible 00:19:24]. You can tell it that it should export the flame graph, or that it should export a JDK Flight Recorder format, because the JDK Flight Recorder format is one of the most commonly used formats in the Java profiling world. Async-profiler can also combine the results of a parallel JFR run, so it runs JFR on the JVM, and combines then this with its own sample so you get the best of both worlds. You get the sampling from async-profiler which has some minor advantages. You get all events and all the other things from JDK Flight Recorder. You can also later attach async-profiler to your JVM, for example, by using the profile sh script that's already present in async-profiler project, or by using the async-profiler loader which is a wrapper around all the async-profile functionality, and which is also platform independent, more or less. It supports the platform that the async-profiler supports. Async-profiler has many features. For example, many events can trigger sampling, like locks, perf-events, methods, so it can do some instrumentation. It's also embeddable, especially via my ap-loader project. It's hackable because it's also open source, like all the tools I'm showing. It's a relatively small code base. If you want to know more, I would really recommend you to look into the async-profiler - manual by use cases by Krzysztof Ślusarski, which tells you a lot more.

Then there's also, as I said, besides async-profiler, the built-in OpenJDK way to profile, so JDK Flight Recorder. You also just parse to your Java application, to your Java call the option StartFlightRecording, and tell it where to store it. In all the versions till JDK 12, you also had to parse the FlightRecorder option. Then, you also want to parse the DebugNonSafepoints option because usually or commonly there isn't that much debug information between two Safepoints to get to specific locations where you can only add during sampling, and so you have to enable it. It doesn't have any performance impact, so I'll just turn it on. That's usually a good idea if you want to profile the application. There's also the option to use j cmd to start and stop the flight recorder on a remote machine, or on a machine where you only get the process ID for. I don't have to start it at the beginning. Some of its features are that it's built into the OpenJDK, so it's more stable than async-profiler, usually. It works on all platforms, so also Windows which async-profiler doesn't support, and it also works on all CPU architectures. It has lots of events. You get events on garbage collection, on class loading, and can even add new events. You can add your custom events. There are so many events, like over 120, that I created a website which you find on the sapmachine.io/jfrevents, where you can see all the events and see all their properties, some examples, the descriptions if there are any. You see in which JDK versions it's supported.

JFR also supports custom events. These are quite cool. Consider, you write a Fibonacci Server for your new math as a service server where you just have an API fib, which is you parse n, and it returns you the nth Fibonacci number. We then would probably want to have some session event to record which n is parsed in, in every session. We can do this by creating a class SessionEvent which extends to jdk.jfr.Event class, and has some properties and also the construct that I'm showing here. When you then want to use it, you just create a new SessionEvent object, and then call event.begin when you want the event to start. Then your stuff that you will be doing in a handler for the Fibonacci API. Then we can call API. Then we can call event.commit, to start the event, and so we then have an event with a specific length, or time duration. It's then stored in the JFR file. The cool thing is that we can view these events in all UIs that support JFR events. For example, in this UI, we can see that one session event for the 37th Fibonacci number took like 67 milliseconds. We can also see it in other JFR viewers.

Inspecting a Profile

You can inspect a profile. I've shown you two tools already. There are different ways to view it. For one, there's a flame graph. A real tool would output something like this for the Fibonacci Server example. There are also other tools that wrap this. For example, this is by Krzysztof Ślusarski. You can go to his GitHub account and see this tool, and also, he linked in the manual by use cases for the async-profiler. Flame graphs are quite simple to use. You don't have that much interactivity, and you don't also have that much ways to configure it, so it's quite simple, but it usually helps quite a lot. There are also larger tools, and one is called JDK Mission Control, which is the UI for JDK Flight Recorder. When you open a file in it, you see some kind of automated analysis, for example, some analysis of the memory consumption. Then you get some method profiling information, some flame graphs, some information on the memory allocations, and on the garbage collection.

I started around half a year ago writing the Java JFR profiler plugin for IntelliJ, which is still a prototype, which can be used to view JFR files, and to start JFR profiling in your IDE. It's currently only supported for IntelliJ. I'm hoping to work on a VS Code version too. It's based on the Firefox Profiler UI. Some of you might have used it in Firefox to view some profiles there. You can just install it from the JetBrains marketplace. You can then, when you have installed it, right click on any main method on these arrows, and say, profile me, this with JFR, or with async-profiler, and it runs it as a profile. It can run configurations as a profile, and also arbitrary JFR files. Here you see some timelines, so you see the method tree, the call tree, and when you then click on the methods, you can jump back to your IDEs with IDE integration, and when you shift double click on the methods you have a source code [inaudible 00:29:00]. You can see where all the samples are. For example, we see it in this example, 136 samples contain just this fib line on the top of the call trace. Then, also a flame graph, of course. You can view all the events directly and get some information on all JFR events.

Impact on Performance

After seeing how you can inspect all this and how to obtain profiles, you probably also want to know what the impact on your performance is. The impact of your performance is small enough probably for most settings. JFR in its most reduced setting, where it's like for monitoring, has an overhead of typically lower than 2% in a benchmark or in an experiment that I did, it was between 0% to 5%. When you turn on the profiling configuration, which obtained more events, not all but more, and it has a smaller sampling interval, you get an overhead between 1% to 8%, in my experiment, and typically below 5%. With async-profiler, you're getting overhead typically lower than 2%. When you turn on jfrsync, I found an overhead between 3% to 10%, when you're combining async-profiler results with JFR, using the jfrsync option of async-profiler. It should be typically usable for most applications.

Take profiles with a grain of salt. Profiles are just applications also written in some code. They have the same problems as many applications. They lack sufficient tests in some areas. Also, regression testing is quite hard. Please have this in mind when you're trusting your profiler. There might be some bugs within. If you find one, please, open an issue in a Java bug tracker. This would be really great, so I can fix it. Because that's what I'm doing in doing work. For example, here's a bug that I found a few weeks back, which has a really nice and small reproducer. Here you have a main class. In the main method of the main class, it calls its method test, and test method calls another method called javaLoop. The javaLoop method is just a simple endless loop. What we would expect in a flame graph is to see on top the javaLoop, most of the time, because it's an endless loop, it runs a long time. Then we test and then we should see some internal reflection stuff. On the bottom, the main function. The problem is, profilers stumble. I think the API that the async-profiler uses and also JFR stumble after the first reflection calls. This bug surfaced after reflection call changed. It's quite problematic because it's a bug that could have easily been found, but it was not. This is not really real, but I think because in the real world, it's a bit harder to reproduce because it depends on a few conditions. For example, here it depends on tests being probably interpreted, but the problem is that's hard to find. It was just by chance that I found it. Please, when you're seeing a profile, and something isn't really correct like you see here, then it's probably a bug in the profiler. Please report it.

The Profiling Loop

To work better with this, you should use an experimentation technique. I borrowed this from the science part of computer science. I call it the profiling loop. First, when we start profiling, we want to have a mental model. Yes, it could be rough, but we should have at least some idea of the architecture of the program. Also, we should know the large libraries, at least we should know what they are, and maybe know a bit on them, for large libraries that your program or your application uses. Then you use your model to create your hypothesis in. For example, in my introduction example, my hypothesis was ok. Probably the method that the client calls remotely like, there's this API that has the client, do you have a question, and what is it? That's probably too slow. That fits well here in my mental model. It's not, I have to update the mental model so I can formulate the hypothesis in. Then with the hypothesis, I go into the evaluation. The evaluation means write some test code to do some profiling to all this. Then with the evaluation, you can either update your hypothesis, like make it more concrete. For example, in my example, the hypothesis is that it's slow, that this whole JSON parsing is too slow. That this is a bad idea, and just a new hypothesis, I evaluate, ok, this is the problem, and then might fix it. Then the hypothesis is, ok, now it's working, and then I'm back to evaluation. That tells me, yes, that's probably working. Also, you can use the evaluation to update your mental model. This also helps you when profiling has some internal problems but you might encounter back because you still have your mental model and your hypothesis, and when the evaluation is invalid in your mental model, and your mental model is incorrect, that's the normal case. Or there might also be the cases that your evaluation was just wrong.

Conclusion

Then, besides, keep in mind that profilers aren't rocket science as I've shown you. You can even write your own basic async-profiler clone that's just doing like the simple sampling in like, three days, or three afternoons even. If you want to know more, I have a blog series on it called, writing a profiler from scratch, where I write a profiler from scratch. Profiling techniques are good in your toolbox like debugging and testing, and it is really important to have tools for this, to know some tools. I will really recommend to try out async-profiler. Reading the async-profiler - manual by use cases, and looking into JMC and JFR.

 

See more presentations with transcripts

 

Recorded at:

Sep 29, 2023

BT