InfoQ Homepage Presentations Continuous Monitoring with JDK Flight Recorder (JFR)

Continuous Monitoring with JDK Flight Recorder (JFR)

View Presentation

Speed:

Download

50:14

Summary

Mikael Vidstedt gives an overview of JDK Flight Recorder (JFR), showcases the current feature, as well as some features coming in the future. JFR is a monitoring and troubleshooting framework built directly into the Java runtime. JFR has access to all the internal data of the JVM and can capture and surface data on a fine-grained level with extremely low overhead.

Bio

Mikael Vidstedt is Director of the Java Virtual Machine (JVM) in the Java Platform Group at Oracle. Over the last 20 years he has been working all the various aspects of the Java and JVM technologies, focusing on everything from compilers and optimization to operating systems and server virtualization. He has worn many hats, including that of JVM tech lead and Architect.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Vidstedt: Let's see if we can learn something about Flight Recorder today. I'm hoping that's why you're here because that's what I'm going to talk about. I'm Michael Vidstedt, I work for Oracle, I run the JVM team at Oracle and I've been working with JVMs my whole career, couple of different ones. This is Oracle's not so short way of saying that everything I'm about to say is lies, so don't take anything for real.

Let's imagine that you have a product. It's running a service somewhere and it's continuously running and you obviously want to keep it that way. One day, one night, it's probably Saturday night, 2:00 a.m., middle of the night, there's a problem somewhere and SLA gets breached, there's a problem in there somewhere. Obviously, your job now in the middle of the night is to figure out what went wrong, so obviously you start looking through all the data you can find: logs and whatever you have set up in your system. It may well be the case that that data is really available. You see the general area where it's crashing or not behaving, whatever it is. You can see that but you can't get to the actual root cause of the problem. What do you do? At some point, you may be tempted to add some more logging or add some specific code that tries to capture more of the data so that you can see what actually went wrong. That's great, the problem is obviously that over time you run it and nothing happens, so you don't get more data.

As a matter of fact, this code you just added probably has some kind of performance overhead or some problem associated with it, so at some point, you remove that again. You need to remove it because it's adding overhead to your service. Guess what happens? Saturday, 2:00 a.m. in the morning, you have a problem again. Obviously, what the flight industry has learned from this is let's capture data continuously and let's do that in the background, you don't even notice what's happening. They call it a flight recorder and that's effectively what I'm going to talk about. The flight recorder we have worked on is built into the Java Runtime itself, but it behaves very much like an actual real flight recorder and over the next 40 or so minutes, I'm going to talk a bit about what that is on our level.

The agenda looks something like this. I'll talk about the overview of what JDK Flight Recorder actually is. I will talk about events because it does turn out that those are central to how JFR works. By the way, I'm going to use JFR, the abbreviation, to refer to this functionality a lot as well. I'll talk about the fact that JDK Flight Recorder is meant for production. If you leave this room without remembering the fact that this is built for and designed for use in production, I think I'd failed, so I will try to weave that in a few different slides. I'm just going to say something quickly about how you can use JFR and obviously some future work because that's normally how you wrap up a presentation.

What Is JDK Flight Recorder?

What is Flight Recorder? JFR is short for JDK Flight Recorder. It is available now in the JDK close to you. I'm saying that because it does depend a bit on exactly what JDK you use, but chances are that the JDK you're using has JFR in it. It's been there since JDK 7. It was there in another JVM before that and I'll touch on that in a few slides as well, but it is probably in the JDK you're using. It is event-based. I'll go more into detail about what I mean about that, but it basically is the central to the concept of JFR is event and capturing events. Again, I'll talk more about that later.

It is built into the Java Runtime itself, and the nice part about that is that is an unfair advantage to us. We can piggyback on a lot of the stuff that is happening inside of the JVM and the inside of the libraries that make up the JDK and captures and you store a lot of the information that the JVM already has. The fact that we have built this into the Java Runtime itself makes it very efficient, it has very low overhead. As a matter of fact, it's designed to have less than 1% overhead in the common cases. That means, again, first iteration, production. This is meant for production, you should be able to run with it, turned on all the time and not even noticed that it is there in the background, and then if something goes wrong, you already have the data available.

The other nice part about being having this in the Java Runtime is that we can collect events from many different levels, so all the way down from the operating system in the CPU through the JVM, through the libraries, and also on the application level. You can get events from all these different levels and you can very easily correlate them, so if something goes wrong in your application level that not necessarily is breached, you can look into all the different details around what else was happening at the same time. We also have very powerful APIs that you can use both to produce your own events, but also consume the events and make sense of the manualized media around it.

This is in a nutshell and now we'll go through all of that more in detail. I could start talking more about theory but I'd like to try to just throw a demo at you and see how that works. I'm going to show you a very simple use case for how you can use JFR. In this case, we have all the events already in the Java Runtime itself and what I've done is to build a very small agent - a Java agent, in this case. It's a small program, that's where it runs in process next to the application that I'm going to run. It is monitoring the application, so it's taking the event stream, the data stream that is being produced, and it's analyzing that and it's displaying it in a simplistic way. In this case, it's going to be a text-based representation, but it could be something else, of course.

I'm also going to use a very old demo, Java 2D. How many people know what a Java agent is? Ok, some of you at least. Imagine that you have your main - the normal application main - as a way to say, "Here's the JAR file," and inside of that, I have a class with another sort of main that's called premain. It's like you are running in parallel with your applications, so you get the premain call first and you can do stuff in it. It's a handy way just to run something on the side of an application. Basically, what I'm saying is run my health report thing here, my special JFR analysis agent and run that in parallel with the application.

If I fire that up, basically, what you're seeing here is JFR in action. In the background in the JVM, in the Java Runtime, we collecting all these events. Some of them are related to GC, it's the environment we're running it with the amount of memory and all that, we have allocation rate at the top. There are details around what's going on in the JVM and then there are two different categories here.

The first one is "Top Allocations," so where is the application allocating memory. The second one is "Hot Methods," so where is execution time actually being spent. This is just a snapshot of what JFR can provide, if I show you the code, you'll see that it's very tight and streamline. I'll go more into the details about this later but just to get a feeling for the kind of information that we're collecting, hopefully, this will help put it in perspective. That was the demo.

Some history - I'm not going to go into all the details but JFR has been around for a long time. As a matter of fact, I worked on another JVM before called JRockit and we had a problem at some point where we were running into challenges ourselves understanding what was actually going on inside the JVM. We developed the JVM but we wanted to know what's going on so that we can improve on it and make it better.

Also, our support team, they were sitting next to us, and from time to time, they brought up the fact that, "We could sure use more insight into what your code actually is doing." That idea was the foundation for what later became Flight Recorder. It has changed its name over the years a bit, but it has been there for 15 years or so. In, let's say, the standard JDK, the HotSpot JVM, it's been around since JDK 7. The very first version was ported over to HotSpot in 2012 and we've made a number of improvements on it ever since. In JDK 9, we published APIs, made the APIs available for people to use both for producing and consuming events.

One of the biggest things that happened to JFR was that in 11, which is now a year ago, we open sourced JFR, so now it's available in open source and probably in every JDK more or less out there. You don't need to know all the details about when what happened but that said, it's been around for a long time so it's very mature technology. My point is, you can use it in introduction. I'll stop saying that at some point.

JFR Events

Let's look at what events actually are because, again, the events are very fundamental to how JFR works. An event is basically this; it is a small blob of data. I'll go more into the storage format of it but, logically, it has an event ID, so it's basically some kind of identifier telling which event this is, a unique identifier. There is a timestamp, so when did this happen? There's a duration, so how long did this take? Not all events have a duration, so some can be instant and therefore not having duration. There's a thread ID, so which thread generated this event or in which thread was this operation generated? Not all events have a thread ID, some of them are not tied to a thread specifically but many are.

There's a StackTrace ID, so basically associated with every event, there can be a StackTrace, I should say. But, in order to keep the footprint of this down, instead of storing the whole StackTrace with every event, we are putting the StackTraces on the side because they tend to be almost the same so we're basically storing them separately and we have an identifier to tell which StackTrace to use for this event. Then, there's event specific payload, so depending on which type of event you have, you have various other fields or data points in there as well. That's the lead strike, let's look at it from a Java perspective.

This is basically what it would look like if you wanted to generate your own event. There are two sources of events and I'll get to that on a later slide as but there are events that are generated from Java code. If you start picking up and using this and playing around with it, this is what you will see. There are also events that are being generated from inside the VM itself, so that's native C++ or CC++ code. Very similarly to the next few slides that I'm going to talk about on the Java level, there's a corresponding functionality inside of the JVM itself to capture events. If you want to create your own event, this is what you have to do. It's literally that easy. You inherit or you extend the class called event and, lo and behold, you have your own event.

How do we actually make use of that? Let's say that you have some really business critical code. This is your application, it does something really important and you want to put an event around it to get some more information. I don't know what it is, maybe it's processing HTTP request or it's calculating some prime or whatever it is. Basically what you do is you add this around it, so you allocate an instance of the event you just curated. You start the event, so you allocate the event, you start the event, and the "begin" here is basically taking a timestamp, so it's saying, "This is when the event started."

You do your business logic and then you call end, which is also just taking a timestamp, so it's basically now has the start and the end point for the duration of the event. Then, you say, "commit," and that's actually what takes the event data and puts it into the event stream and stores it off inside of the VM and we'll see more about how that works. The first question you might ask is, "Why would I call end and then commit? Couldn't we just coalesce them?" It does turn out that you don't have to call end, if you don't, commit will do the work for you. There is a case where it's useful to make a distinction between the two and that's when you want to actually have the end timestamp, but there's some kind of computation you want to do before storing the event into the stream.

In some cases, it may make sense to call end explicitly but if you don't, commit has your back. Typically, what you'll see is something like this, but this isn't very useful in itself. Now you have an event and the only thing you know is that that event happened. You may want to put some of your own data into the event. The way you do that is by adding fields to the class, not surprisingly, perhaps, and you initialize those fields as part of creating the event. You can do this whenever you want as long as it's before a commit, well enough to the allocation, of course. For example, in this case, I'm storing constant before the event is even generated, but maybe what you want to do is to take the business logic or whatever and maybe some value comes out of that so you want to store that in a field instead, that's totally fine. Do it whenever you want, as long as before the commit. That's basically it.

There are some additional things that are helpful to do with your event. There are a number of annotations that you can make use of. These things help when presenting the data in diagram. They don't modify how things are being produced, but if you want to visualize this later, you may want to provide some useful hints around what the field value actually is. I'm not sure I did my very best naming them here, it's just messaging value. Maybe you at least want to signify that in some way.

You can also annotate the event itself, the class. I'll go through a few more annotations on the next slide, but the default name of an event is actually the full package name and the class name. In some cases, that works out but in some cases, you'll find that your class is in a package called "com.internal.secret-something$," and it just isn't very nice when you want to present the data in diagram, you may want to have a name that is more easily understandable, let's say. Name the event in more human friendly way, so to speak. That's basically it, now you've created your event and you've put it into the event stream and you can consume it later and we'll see what that can look like.

A few additional annotations that you can use are these. I mentioned name and label, there is a description. Label is supposed to be very short, it's like a word or two. Description is not a full essay but maybe you want to add some color to it, so a few sentences, maybe two, so try to keep it short and interesting. There's also other metadata that you can annotate your events with. As I mentioned, most of the events have durations and one of the ways we're dealing with the amount of data is by filtering the events. Instead of just capturing everything all the time, chances are that some events are only relevant when the duration is long enough, so you can set a threshold on your event.

If you, for example, know that HTTP requests that take less than 100 milliseconds, not a problem, don't care, but the ones that take longer, those are typically the ones that I want to capture. Then, you can say, "Threshold, here's the filtering threshold for duration." Similarly, there's an Enabled annotation, which basically says this event enabled by default. There's also an annotation for saying if the stack trace should be included or not, so those are some of the configuration options for your events.

This list is in no way exhaustive, there are more annotations and more information. The Java docs in the jdk.jfr package is very helpful, so if you want to know more about which annotations exist and when and where to use them, I suggest you go look there. I mentioned that the Java Runtime itself generates a lot of events. This, likewise, is not an exhaustive list, but it does cover some of the key things that we are collecting inside of the JVM and the libraries. There are right now around the 140 different events, it is growing, but I showed you a bit earlier in the demo that we are capturing things like the environment, which CPU are you running on, how much memory do you have, that thing. Command line information, version information, both from the Java Runtime itself and operating system, things like that. There's I/O, both from the file and network side. GC, JIT compilation, more or less the expected stuff, I guess, is in there. Again, this is something we're constantly working on, expanding on.

In the background, this is how it works. I mentioned that you're getting two sources for events, one is generated from Java code. That's what we saw an example of just in the few previous slides here, that's the top cloud thing there. There's also JVM-level events, so things that we feed in from the JVM through the native code, more or less. These both go into buffers and those buffers are thread local.

That's very powerful because what it means is that there's no cross communication across sockets or CPUs or even threads. They all go into thread local buffers first and that's very efficient. They're storing things through thread local buffer and we're very good at optimizing that. What then happens is that as the thread local buffers fill up, we store them and we take the buffers and put them into a global list of buffers. This is, obviously, the uncommon case and also something that can happen in the background. This is being processed by another thread and once the global buffers fill up, we store things into what we call the repository, which you can think of as a recording file in the background.

Obviously, we stored that data to file when the buffers fill up. The other case, which is new with the event streaming functionality that we'll touch on later, is that we now also try to store things through the repository approximately every second. I'll go into this more in detail later, but we want to make sure that the repository can constantly be observed, that it's in an observed state and has the most recent data in it, but there's a trade off between putting the data there all the time and getting data in a timely manner. Basically, what you'll see going forward is roughly once a second, we're going to store the data into the repository. If you take anything away from this slide is that the overhead of producing events is, from the application perspective, more or less zero.

What does a flight recording file look like? We store things into the repository and the end data format that we use for this is the flight recording data. It's a binary representation and it's trying to be very compact. It is not compressed, which is something we're looking into potentially doing in the future, but it is very compact and this is something we've improved on over time as well. A lot of the identifiers and the stuff I mentioned goes into an event, you don't need the full 64 bits or whatever it is, it's like most of them are zero. Varint happened to be a very efficient representation of that, so we're using that extensively in the flight recorder format.

The files are also self-describing. There's a high level format you need to know about how JFR recordings work, but the events themselves are self-describing. We capture all the important information about what an event looks, which fields it has, how to interpret those fields, what types they have, all of that is captured in the recording itself. If you add the new events, for example, you don't have to have the site information about which fields you have, that's all in the recording. The recording is self-describing, it stands on its own feet. I'm not going into all that much more detail around what exactly the flight recording file looks like but hopefully, you get a rough understanding of that.

I mentioned that events can be filtered, the obvious one is on a type or a name basis, so you can say that, "These events, I want to include, these other events I don't want to include," and I also mentioned that you can filter by duration and we find that the combination is very powerful. Most of the time, as I mentioned, you're probably not interested in the nanoseconds or even necessarily microsecond events. It's when it's starting to come up in milliseconds or seconds, that's where it gets more interesting but you can filter on different durations for different events.

I also mentioned that the fact that we are capturing events from many different places can be very powerful. We capture things everywhere from the operating system, so what operating system are you using, the versions of things, the CPU load, that kind of information, through the JVM, so JIT compilation, GC, class loading in the JVM, locks, synchronization, and all the way through the libraries. If you do application-specific events, you get all of these events in the same stream and it can be very powerful to start from the top. You see that your HTTP request took a while to execute. I'm actually going to show them on this later as well. In order to figure out what actually happened, you can dive in and go as deep as you want to find what actually led up to that problem, why did it take so long in the end?

Designed for Use in Production

I have a whole section dedicated to the fact that you can use and should use this in production. A lot of this I already covered but it has extremely low overhead. The concept of having thread local buffers and the JVM piggybacking on the fact that a lot of the GC and JIT data is something that the JVM already collects and we can just take that data and store it into the event stream. That means that this does have very low overhead.

We generate the events into these thread local buffers, which means that the application won't block, it can continue executing, it's just doing some thread local storage and that's very fast. If you still aren't convinced, maybe this will help you. We have JFR default on for our Oracle Fusion apps. I'm not even sure what kinds of applications we have. We have customer relationship stuff and I'm a low level Java guy, but we have cool applications and they're all running with JFR enabled. We also see customers who extensively make use of JFR throughout their whole deployments. These are really huge companies, hundreds of thousands of machines or instances, that thing.

I tried to stop rubbing in now, but this is for production, it's not just for development. "Still, Michael, you keep talking about this performance thing. You keep telling us that it doesn't actually have overhead. Surely that can't be true. If we go back to our example here, I've now sprinkled all this code around my business logic and surely that will mess things up." To start with, let's have a look at what actually happens here and this is going to be a journey through how the JVM does optimizations.

I add a few lines of code, I will admit that. The begin thing, we'll look into what it does. Commit may be the obvious candidate for how can that possibly be efficient? Let's look at what commit looks like. The first thing I'll say is this is not what commit looks like but it's sort of what commit looks like. It actually is for various reasons, bytecode generated, not an actual method but it sort of looks like this. The first thing it does is ask, "Is the event enabled?" If it isn't, there's an early out to say, "Don't do anything else." There are other checks, so, for example, I mentioned duration. If the duration isn't big enough, if this event didn't take long enough, then we'll also early out and say, "Yes, the event isn't interesting."

There are also other checks that are being performed to see if this event actually should be stored into the event stream. These checks are all relatively cheap but obviously, they are there, and in the end, if the event is actually to be committed, then we call the equivalent of, let's say, actuallyCommit down there. Remember, even that is actually cheap, and goes into thread local buffers but let's have a look at what happens if this event isn't enabled, at least the first thing. If we can't get that right, surely the performance will not be good. So going back to our example.

This is what the code looks like right now, you have your really important business logic there in the middle and obviously, you want to make sure that that's executing and it doesn't have the rest of the overhead. The first thing we do inside of JIT compiler is we use one of our secret weapons, which is inlining. We start looking at all these method calls and we look at what they actually execute and then we put that in the method that is calling it. In the case of begin, I mentioned that the only thing it actually does is take a timestamp, so we inline the timestamp thing.

In the next step, it turns out that that is actually JVM intrinsic, it's something that the JVM knows really well how to get. It's actually a CPU register, they read the time stamp counter instruction in there, so it's very cheap and we know how to optimize that, so we put that there also.

Commit, likewise, we inline and I didn't inline all of commit here but I did the inline the top check. We're saying, "If this is enabled, do stuff." we can inline that as well. If it turns out that the event actually isn't enabled, we'll get the value false and the second secret trick we have in the JVM is dead code elimination. We know that that's not going to be executed, so we just remove it. Now we know this event, so we allocate it and we store a value in it. What we can do is use our third secret trick which is scalarization. We don't have to store things in the actual instance on the heap if nobody is going to be around to read it. This event clearly isn't used anywhere else, it's local to this stack frame, it's local to this thread, so therefore, we don't actually need to store the value on the heap. Great, now we have a stupid allocation, let's remove that because nobody will see it.

Guess what we'll do then? We'll use our second weapon again, that code elimination, and business logic. The story is similar for the other checks. In the end, basically what this comes down to is it may look like a lot of code, it may look scary, it may look like it's not performing but the JIT compiler is really good at this and basically, what you'll end up with is something that is extremely cheap. Zero if it's not enabled, really cheap if it is.

Here are some bar charts, because everybody loves bar charts. This is showing JFR disabled, which is zero as I said. I will come back to the stack depth part in a bit, but it does have an overhead. Yes, I'll admit, it's not zero when it's turned on but it's very small as you'll see. There are a few other comparisons here as there's lon4j with off, which still has an overhead. It's log4j with INFO, which as you can see that's pretty costly compared to the alternatives.

Anybody dare to guess what java.util.logging INFO and Redirected System.out, how big those will be? they didn't fit into this chart, so it's changing the scale a bit. Yes, they have ridiculous overhead. Now, it says, "Your Mileage May Vary," at the top, this does depend on what you're logging, where you're logging it, all of that. Just to get a feeling for the kind of performance you can expect from JFR, it is very close to zero and that's what we designed it for.

Less than 1% overhead is the goal, we're trying to keep it there. There are ways you can mess that up. If you actually start generating events with long stack traces, that's one of the key things that is costly for us, it is taking the stack trace of the running thread. If you have very deep stacks and you generate these events in some innermost top loop, then you will see overhead, but in the typical case, you won't. The default configuration is tuned so that it doesn't have more than 1%. There are other configurations you can run with, so if you start collecting allocation profiles, for example, or just capture a lot more data, then, yes, there is an overhead, but we get a lot of detailed information using the default profile.

Using JFR

If you want to use this, how do you get started? I'm going to cover the JDK 11+ case here, because I'd like to talk about JDK 11+. There are options for older JDKs as well but this is what it looks like in 11+. The option that you're looking for is StartFlightRecording and there are some options to that. The second example here shows how you can start the recording and ask it to store the data to a specific file. Otherwise, you will just record it in the background and you'll have to go in and get the data out explicitly. If you started as the second line says, then you store the data to a five called tmp/foo.jfr and you have your replication arguments after that. If you already have a VM up and running and you didn't specify StartFlightRecording but you still want to get data out of it, there's a command in your bin directory called jcmd or J command and it has the options to start the recording and to get the data out as a file as well, so that's the JFR.start. You can, by the way, specify the jcmd to either a process identifier or the name of the main class. Obviously, that doesn't work if you have many applications running the same main class but those are the options, let's say. There are also some options that you can specify to limit the amount of data you get out, and so on and so forth.

I'll show you a quick example of what that looks like. I'm going to still use the Java 2D demo and I'm going to run it and say, "Take the default profile, these default events, and store those into a file called /temp/j2d.jfr." I should remove this file to make it clear that I haven't cheated. I'll start this up and I'm going to run it for a short while to have it collect some information in the background. We've run it for a while and if we now go and look, we will see that we have a file.

There's another handy tool that you can use to look at this. Inside of the bin directory, if I go look in my JAVA_HOME, on the bin, I will find a JFR command. This is also new since, I want to say, 11, I could be off by a release or to. This is actually one of the challenges with the release cadence we have right now. We're releasing one release every six months and it's really hard to remember what went into which release but I think it was 11, it should be closed at least. If I run that, I'll get some handy information about what I can do with it, so it's also saying, "If you actually want to capture data, here's how you do that, you can use jcmd as well." It's telling you that if you want to print out and analyze what actually went into the recording, you can do something along these lines, so print the events.

There's large number of events now that scroll by. As you can see, they're human readable and nice. There are other ways you can get the data out, so you can, for example, say, "I want that exact same information but as JSON because JSON is cool nowadays," you will get it as JSON instead. You can look at summary information, so maybe I should have started with that. I can say, "Summary," and I will get information about what version of the JFR file format, whatever, is used here, how many chunks it has, so this is like you back to how is the data is stored. Inside of that file, we are storing things into chunks because we do want to deliver data continuously in some ways, so there are chunks that each stand on their own feet again. Inside of this file specifically, we only have one of them. We see when the recording was started, how long it was going for, and the events that went in. This is a histogram, the kind of events that came into that recording.

Those are all cool ways of looking at JFR, repeated on this slide for simplicity. You can also say categories; one of the annotations that I think I had on the slide was category. That's another high level presentation we have seen, this event is logically part of a group of other events, so we have one category called GC, for example. If you only want to print out GC events, you can say, "--categories GC," and you can define your own categories as well. I'm not going to go into the details.

Ok, so what is JFR useful for? Production. Apart from production, you can also use it for other things - the obvious thing is development. You want to know, while you're developing your application, what methods are actually hot? What is allocating memory? How can I make this run faster in a more stable? Development is another obvious use case. The one we found interesting was actually testing, so it turns out that as part of implementing JFR and especially the event streaming stuff I'll touch on in a minute, we found that using JFR to test JFR was actually extremely powerful. We can capture data that we otherwise just wouldn't be able to and then analyze whatever operation we expected to do a certain thing, actually did the thing in the end as well.

This is true for things like allocation or lock profiling, like the profile suddenly changes. You make a change and all of a sudden, you have lock contention and everything grinds to a halt - or maybe not a halt, it just goes more slowly, maybe 10% more slowly so that it's very hard to see. JFR can potentially help you figure things like that out.

That's in a nutshell what JFR has been for a long time. One of the key use cases we've been missing is continuous monitoring.

Future Work

For better or worse, the way JFR was implemented was more for, let's say, profiling, spending time, like a minute or three, collecting data from an application, dumping out that data, and then analyzing it. That's powerful for many use cases but one of the problems is if you actually want to use it continuously. If you have to wait for a full minute and then go through a number of steps, then that's cumbersome.

I'm going to talk a bit about the future. Some of the future is almost here already but I never promise that anything goes into releases so I can only say that the event streaming stuff is in mainline right now and is therefore highly likely to make it into the next release. Duke here is holding a beaker because that's what we do in the labs. If you watch carefully the beaker, it will now change to a coffee beaker thing - you wouldn't believe how much time we spent internally working on the color of that coffee, by the way.

As I mentioned, today, as in all the shipped versions of JFR, you basically have to go through the cycle of starting the recording, stopping the recording, and dumping out the file every time you want to look at the data and that's not exactly ideal. It can work well for your development use case where you're just doing profiling but if you want to actually look at your running service somewhere, that's not exactly what we want to do. We worked on something called event streaming, it has a JEP number associated with it. JEPs are JDK Enhancement Proposal, so they're a description of what the functionality is and some pointers to more data around what's going on.

The goal here was to provide functionality that you can use to continuously monitor what's happening inside of the instance and also provide the necessary APIs to make that easy. Our goal was to make it more or less a one liner to consume data and act on various things. We didn't get it down to a one liner but I'll show you on the next few slides that it's at least relatively straightforward. This is enabling you to do continuous monitoring while the recording is still in progress. We're continuously recording things and given that we now dump the data out to the repository approximately once every second, you can observe it, you can consume it and act on it approximately every second.

What can consuming look like? This is the canonical example of consuming events. It is opening the recording stream. I should have included the package here, I think it's jdk.jfr.something. We're opening up the recording, we're saying, "Enable one event," in this case, it's jdk.JavaMonitorEnter which is, locks synchronization starts and do that with a threshold of 10 milliseconds. Start the JavaMonitorEnter event with 10 millisecond duration, so any event that takes longer than that will be captured in the stream.

Then we're going to say, "If one of those events occur, if we see that in the event stream, then do this," so in this case, we're going to say, "Print it out to standard out." I'm not going to go into the details around here, but basically one of the fields in that event is called monitor class and it's the type we're synchronizing on and that call is not Java Lang class. You're getting a representation, a JFR specific representation of a class or a type. I'm not going to go into the details, but in any case, it's going to print out the event, the type associated with the event. Then you say, "Start," which is a synchronous call, it's going to block until the event stream is closed in the other end.

There are asynchronous calls as well that you can do, methods for processing this in a separate thread but it's going to do all of this and it's going to print it out. If you want to add another event, it's very similar, so in this case, we're saying, "CPU load, enabled that, do that with duration of one second and print it out when you get it."

I'm going to try a much more complicated demo. I know literally nothing about Spring Boot but I'm going to use that anyway. I've implemented the small agent, a monitoring agent, and actually this time it's not going to run in process. It's going to run on the side, so it's going to run on the same machine, my laptop, but it's going to run in a different process on the side.

What I'm going to do is to start up the Spring application, I'll show you the code in a minute. I'm saying, "Start flight recording," and I start up my Spring Boot thing. This could be any app. I'll fire that up and meanwhile in the background, I have this application going. I can't really talk to this anyway because it's Spring Boot and I don't know Spring Boot. There is a small application here, basically what it does is it sets up a few resting endpoints, I think they're called, and very specifically, they look like this. There's a /hello endpoint. If you go to this web service /hello, you will simply produce something that says hello.

There are a few other ones here and that's what we'll be demoing, so hello1, that's almost the same thing, it's just it does something CPU-intensive. Imagine that I have some weird requests here and it's consuming CPU, I'll obviously want to know what's going on. Keep that in mind, hello1 is CPU-intensive. hello2 Is GC-intensive, so it does something weird with allocation or something. It should maybe be clear from the name. hello3 will be using lock contention, so these are three use cases or three weird things that can happen.

Obviously, this is fake application but in theory, this could be your production code. I have a small agent on the side and it basically does what I just showed you on the slide, so it's opening up the recording stream. It doesn't do that in the same process, instead what it does is ask around JVMs that are running on the same machine and it's opening up the recording directly from the repositories, from the outside. Basically, what is happening now is that I have the Spring Boot instance that is producing data into the repository, into the file stream, and I have this other process that is now observing that exact data.

What we made sure with event streaming is that the data is always consumable, it's always consistent. You can always get to it and observe it on the side. What that looks like is something like this. By the way, if you're paying attention, you'll see that I'm running Java on a java file, I'm not compiling it first. This is new functionality in some release that is relatively recent that I've forgotten exactly which one it is but if you have a single main class with a main method in it, you don't need to run javac, you can just run it directly with Java.

As I mentioned, what this is doing now is looking around for processes, JVM processes on the same machine, and it's now found one, it's the Spring Boot thing, not surprisingly. What I'll do now is I'll first run hello just to see or show you what's happening. The monitoring agent did notice that the hello endpoint was being accessed. "hello1," if you remember, is a CPU-intensive one, so now it's going to run the CPU-intensive thing and hopefully, as you can see, it took slightly longer to access and this looks like a crash but it isn't. It's actually printing out that one event took a long time, the CPU event here. What I've done is, I've opened the recording stream, I've added the subscription to the CPU load event or the jdk.ExecutionSample event and I've said, "If it takes longer than this, print it out for me."

Basically, that's what we're seeing here, it found that something was taking the time and you can see that it's the CPU-intensive method much like we'd expect. "hello2" is the GC-intensive one and I'm not sure why it started accessing the error thing in there, but in any case, we see that there's a GC Pause and the reason being that we have a GC-intensive thing running. Then, finally, we have the lock contention case, which is going to tell you that there's a wait here in the lock contention method and you can also see that the class were synchronizing on this object.

These are examples of very simple monitoring, I'll show you just very quickly what the monitoring agent actually looks like, just to give you a quick view of that. It should be relatively similar to what you'd expect. Here's an event stream, we're opening it up and we're saying, "On the HTTPRequest event, if the request takes a long time, then get more information and show that," that's in a nutshell what happens. That's event streaming.

I want to say that with event streaming, in my mind, we have the first real version of JFR. It's always been powerful but the fact that we now have event streaming in there is providing that use case that we've been missing, in my mind. Obviously, there's more stuff that we can work on but that's one of the key things that we needed to do to have the full story in place.

Some of the other things we're working on is making it possible to access the events over JMX. We want to, as I mentioned earlier, keep adding events. There are other things, especially on the JDK libraries level, that we want to feed into the same stream, which crypto algorithms are being used, certificate expiration dates, that sort of thing, high level stuff. There's a product called Loom, which is looking at providing fibers and continuations. I'm not going to go into the details, but basically it turns around a bit the concept of what a thread is and since threads are so key to how JFR works, that's something that we need to look at how we can support with JFR as well.

It's always hard when you used to build in command lines but I think we can do some tuning to make that slightly easier to use. There are a couple of other things that we're looking into. Event throttling, so instead of just filtering on name and duration, maybe you want to record sample every nth event of a certain type. Deep tracing, in the sense of maybe for a short duration, you actually want to capture everything just to get a very detailed view of what's going on. This does put some stress on the code that is capturing the events and making that efficient and all that.

There are a lot of integration opportunities here. The stuff I've showed you is obviously very low level and provides the primitives for getting the data out. The next step that we'd like to see is more of the big libraries frameworks and IDEs picking up and making use of this, visualizing it, providing it to the user in a well-shaped form. We're providing the primitives for doing this. As I showed, hopefully getting specific data out isn't that hard, it's like a few lines of code, but I think what we're hoping to see is that this will be picked up by a lot of the vendors out there.

If you want to try this out, and please do try it out, we are relying on your help to figure out what to do next to make sure that everything works as you'd expect, so please do try it out. We have obviously shipped releases with JFR in it. We are also providing early access binaries of the next version of Java, so in this case, that's JDK 14. You can pick them up, you can try them out, and let us know what you think, there's an email list here that you can send your feedback to.

Just as a summary, this is actually the same slide as in a nutshell thing that I showed at the very start. Flight recorder is JFR JDK Flight Recorder, it's there right now already. Event streaming is not in the released version yet but it's coming soon. It's event based built into the Java Runtime with all the unfair advantages that come with it. It has very low overhead and it's production use. Remember that, production. You can correlate events from many different levels and we have APIs both for producing and consuming the events that we like to think are very simple to work with.

See more presentations with transcripts

Recorded at:

Feb 27, 2020

Mikael Vidstedt

InfoQ Software Architects' Newsletter

Continuous Monitoring with JDK Flight Recorder (JFR)

Summary

Bio

About the conference

Transcript

What Is JDK Flight Recorder?

JFR Events

Designed for Use in Production

Using JFR

Future Work

Related Sponsors

This content is in the DevOps topic

Related Topics:

Related Editorial

Popular across InfoQ