
Continuous Profiling in Production: What, Why and How



Richard Warburton and Sadiq Jaffer talk about the ins and outs of profiling in a production system, and the different techniques and approaches that help you understand what's really happening with a system. This helps solve new performance problems, track down regressions, and undertake capacity planning exercises.


Richard Warburton is a software engineer, teacher and Java Champion. He is the cofounder of Opsian and has a long-standing passion for improving Java performance. Sadiq Jaffer is Director at Opsian. His experience has included deep learning systems, embedded platforms, and desktop and mobile games development.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Warburton: My name is Richard [Warburton].

Jaffer: I'm Sadiq [Jaffer].

Warburton: We're here to talk about Continuous Profiling in Production: What, Why, and How. So, what does that mean? Well, we're going to talk a little bit about why performance tools matter, why you might need tools to understand performance problems in your application stack. We're going to talk about the differences between a development environment and a production environment, and why just looking at things in development might not be the right way to understand performance problems. Then we'll look at a few different high-level approaches to how we can solve these things.

Why Performance Tools Matter

Firstly, at a very high level, why do performance tools matter? Well, firstly, when you think about software development problems, I think there are three broad categories of things that you get, and this applies to performance problems and performance aspects of your software as much as others. The first one's Known Knowns. So these are theoretically the easy aspects of your software, right? You know that reading data out of your main memory is going to be faster than reading data off a hard disc. You can have that in your mental model; that's something that you can confidently bank on.

Then you've got your Known Unknowns. These are aspects of your software system where you know there could be a potential performance problem, but you don't know how big an issue it is. For example, every time you've got an unbounded-size queue in a production system, you don't know whether it'll be a performance problem, but you know it could be an issue. You know that there's this main memory versus disk speed ratio difference, but you don't know how big it is until you actually look at it in a particular system. So these are things that you may not know all the details on, but you can plan ahead for. You can mentally model, you can think through.

Then there are the Unknown Unknowns. They're real spanner-in-the-works problems. These don't necessarily have to be hard or complicated, but they could be something like, you've spent months building a beautiful, well-thought-through scalable parallel algorithm, and then you've got a sequential logger in the middle of this system that's slowing everything down to the speed of one thread, and you didn't think about that. You wouldn't necessarily think about those kind of issues ahead of time until you hit them in production, until they're real problems for your system. The things that you, maybe with hindsight, you would think, "Oh yes, I'd get that next time." But until you hit it, it's an Unknown Unknown.

A lot of the other aspects of software performance can be planned ahead. You can architect and design for Known Knowns and for Known Unknowns, but Unknown Unknowns, you're going to need to have something that measures how that system actually works to understand how you can solve those performance problems.

Development Isn’t Production

Now, when we look at a lot of aspects of software correctness, for example tests, we look at running automated tests in a development environment. So, Sadiq, if we do testing in development for the correctness of our programs, why can't we just do that for performance problems as well? And why can't we just do that for everything?

Jaffer: Well, over the next few slides, we're going to go through why, from a performance point of view, testing in development is not representative of production. We should take a step back just before we start this. Our goal for performance testing is to remove bottlenecks from production. We do this performance work to make production go quicker, or produce less waste. If our goal was to make development go quicker, well, then we could do it a different way. But if we're trying to make production go quicker, then we need to actually do our performance testing in production. That's a problem, because performance testing in development is easier. The tools that we're used to using are desktop tools, the desktop profilers, JProfiler, VisualVM, what have you.

You may not have access to production. There are many organizations I've worked in, and I'm sure many of you here, where there are strict rules around who has access to production, might be organizational, might be regulatory reasons. But, as we're going to see in the next few slides, as easy as it is doing performance testing in development, it is not representative of production. And our goal here is to actually understand the performance of production.

The first problem we've got is unrepresentative hardware. Who here develops on exactly the same kind of hardware they actually deploy to in production?

Warburton: There's always one person.

Jaffer: Yes, there's always one person at the back with a really big rig. You might develop on a 2- or, if you're lucky, 4-core laptop that's optimized for power consumption, to try and make that battery last as long as possible. Whereas when you deploy to production, you might be on 24-core, 32-core multiple-socket machines that have very different memory topologies.

Warburton: Sadiq, doesn't that just mean, though, that your production servers are a bit faster? If you can find the bottleneck on your laptop, and you can fix it on your laptop, it's going to run better in production, right? It's bigger hardware, it's faster.

Jaffer: Bigger is not always better. So, from the point of view of production, if you imagine that you have, say, a 2-core processor in your laptop, and you have a 24-core processor in your server, what you're going to see is a shift in the kind of bottlenecks that turn up. If you've got a 2-core processor in your laptop, you may just not see a lock contention issue that actually is a big problem in production when you now have many, many cores. So, just because the system is bigger doesn't necessarily mean it's any more representative.

You're going to see the same problem with, say, IO subsystems. On your server-class hardware, you're likely to have very high-performance, throughput- and latency-optimized NVMe SSDs and suchlike, whereas on your laptop you have things that are generally tuned for power, and that means the bottlenecks will shift. What might be an IO bottleneck on your laptop will probably turn into an actual CPU bottleneck on a server, where you're no longer sat there waiting on the discs all the time.

Warburton: And that's before we get to modern cloud environments, where the hardware you get is often very opaque, and sometimes what some of those servers are running on is basically a mechanical potato, very, very much slower than modern laptops and desktops in a lot of cases.

Jaffer: The next thing is unrepresentative software. Let's do a show of hands again, who is running exactly the same software on their development environment as they do in production?

Warburton: A few more people?

Jaffer: In organizations I've worked in over the last few years, development hasn't actually even happened on the same operating system as production. Often you've got Mac laptops and you have Linux running in production. But even if you're running the same operating system, the actual versions of the operating system that you run make a big difference. We saw that in a big way with the Spectre and Meltdown issues over the last few years, where different Linux versions, the way that they actually implement these mitigations, the different kernel versions, had dramatically different performance characteristics around them. So even small variations in the software that you run between your development environment and production can have a big difference on where your bottlenecks lie.

Warburton: One of the Patch Tuesday updates for Windows 10, back at the end of last May I think it was, dropped UDP throughput by 40%, which is an absolute stonking regression in performance. And that was just a security patch.

Jaffer: The next thing is unrepresentative workloads. Assuming we were going to test in development, well, we need some way of exercising our environment. Actually that can be very difficult, especially if we're trying to look for these Unknown Unknowns, the things that we don't actually know about beforehand. It's very easy to say, "Okay, what's the distribution of requests across my different endpoints?" Most people probably have that information. Then you start to say, "Well, for a given endpoint, what's the distribution of requests between hot and cold data?" Well, that's a bit more complicated to answer.

Then there are things like, what's the interdependence between requests to my different endpoints? And that's a much more complicated question to answer. I once worked at a company where we had a misbehaving SDK in the wild that would do three parallel requests to three of our different endpoints, but those endpoints actually ended up mutating the same data, and you can imagine that you'd find all kinds of weird and wonderful contention issues there, that you wouldn't have thought to actually encode in a workload test that you're running against your development environment. So it's very difficult to actually create a workload that is representative of how production is being exercised.

And of course the last wild card we have is the JVM. If you're using Hotspot, it has a JIT, it has adaptive optimization, it adapts its optimization to how the code is actually being used. So if you can't exercise the environment you're testing it in exactly the same way as production, then there's no guarantee that actually it's been optimized in the same way in production. So you might actually be looking at something that performs, again, differently to your production environment. So, in summary, don't be the dog. It's cool, but you don't really want to be the dog.

Profiling vs Monitoring

Warburton: Having said that we need to understand how systems work in a realistic production environment with a real load, what kind of information can we gather from that production system to understand it? Well, firstly, people look at metrics. By metrics I mean some pre-configured numerical measure on a regular time interval; you pull some data out of that system. So, things like, what's your CPU time usage? What are your page load times for individual requests? Some time-series metric data.

Metrics are often incredibly useful. They can narrow down problems incredibly rapidly, and tell you whether you've got bottlenecks on different components in your system, where within your system some of those problems can lie. They're also incredibly cheap to collect. When I say cheap to collect, I don't mean how much actual cash you necessarily directly spend on these things; I mean, what's the performance overhead for gathering these things. They're often very cheap to collect and very cheap to aggregate, query and store, and there's lots of open source tooling around system metrics. So, very useful. But system metrics aren't necessarily a solution in and of themselves. They're quite bad at telling you where inside your code base a problem specifically lies.

Looking at system metrics extensively can often lead to these kind of murder mystery style debugging situations. Now, that sounds like it's a really exciting [inaudible 00:11:47] role play situation, but unfortunately, the reality can be a little bit different. People go around, start systematically looking through these metrics, and often people are encouraged to collect as many metrics as their system can store, which can often be a lot. So you have a lot of different possible causes of a problem, and people start looking for anomalies and going, "Are you the metric that caused this problem? Are you the metric that caused this problem? Are you the metric that caused this problem?" Before eventually finding something that's unusual or problematic. It's systematic guesswork, in a way.

Logging is another way you can extract useful information from production systems. So, sometimes people have manual logging that they've instrumented their code base with, which can often be incredibly useful, incredibly detailed information about the system. But, A, that's manual work that you need to do yourself, B, logging in detail can rapidly become a big bottleneck or a big overhead, and often the cause of a lot of performance problems in itself. GC logging is another case where the performance information can be useful, often much, much lower overheads and very useful, but again, that's only for a specific subsystem, and often GC logs are quite hard to interpret anyway.

A lot of modern application performance monitoring (APM) tools use a different approach. They use a very coarse-grained instrumentation approach. What they do is, they have some agent that sits within your software system, and they weave some information into your byte code. It sounds very fancy, doesn't it? Weave it in there. And they take timings for a certain operation. So, measure the time at the beginning of the operation, measure the time at the end of the operation, do a bit of subtraction, and you know how long that operation took. Instrumentation can be a very, very useful way to see into your application in a way that metrics won't; it gives you visibility into the black box.

But instrumentation itself has a couple of big problems. The first problem is that, potentially, the more detailed your instrumentation is, the more instrumentation code you have to add into your application software itself. And the more code you add, the slower your system becomes. The second thing is, in practice, because people just look at very coarse-grained pictures, it has to be things that they've thought of a priori, before their system goes into production, in order to understand where that instrumentation should lie within the stack.

Then there's production profiling. So, what is profiling? Profiling is attributing some resource usage from a system to a component within your software stack. So, what methods are using up CPU time? What lines of code are allocating objects? It could be where your CPU cache misses are coming from; you could profile for almost anything. Now, profiling has some nice wins here. Firstly, it can be automatic. That's to say, once you've got a profiler that works for a JVM application, you don't necessarily need to customize your application to add any specific aspect to it. Production profiling can also be done very cheaply. We'll look later on in this talk at some of the technical approaches that make it really cheap. In practice, profilers are often not very cheap. If you try and hook up VisualVM to production, you are in for a world of hurt.

I said earlier that instrumentation can be a little bit blind in a lot of real-world situations. So let's look at a real example of that, which we had from a customer of ours. They had a problem where an HTTP endpoint was periodically slow; like, every five seconds the endpoint would be really slow. They were using an APM tool that was instrumenting all their HTTP requests. The graph on that was beautiful, green, you never got a bad request. If you just looked at that graph, everything was fantastic. If you talked to the customers, they were really annoyed that they couldn't log into their system periodically.

What was the root cause of that problem? Well, the root cause was Tomcat; it has a cache for HTTP resources, and it expires that cache periodically. When it tries to reload those resources, it goes and does a load of class path scanning, so the whole system grinds to a halt for two seconds whilst this happens. And that's something that an instrumentation-based system didn't pick up, because they were putting their instrumentation on the server request itself, rather than looking at what the underlying system was doing. So, they'd made this ahead-of-time assumption about where problems were, but it didn't really follow through in the real world.

The other aspect of this is the overhead. These are tools that are meant to be helping you solve performance problems. So, the more overhead the tools add, the worse the problem becomes. They can rapidly become The Problem rather than help you solve problems. And that's the case with instrumentation. If you have instrumentation-based approaches, which are very fine grained and very detailed, they rapidly add so much overhead that you have to get a lot of gains in order to win back that overhead from them.

Continuous Profiling

Surely there's a better way, right? Not just looking at metrics, but we want actionable insights. We want something that we can look at and say, "Where in our code base do we need to fix a problem?" What about profiling in production? So, that's what we call continuous profiling. How would we use continuous profilers, in terms of your interaction with them? Suppose you get a problem that you need to investigate: you narrow it down to a specific time period, or to some machines which are exhibiting the problem. You look at the type of profile that you might want to gather; we'll see the difference between CPU time and wall clock time in a sec. You look at the dominant consumers of the resource, so where in your code base, what's really using up that time, and then you fix that bottleneck, deploy, and iterate. So that's how this kind of stuff fits into the application developer approach, a very agile or iterative approach.

Well, I said you can think about different kinds of time in terms of what your profiler can extract from your system. CPU time and wall clock time, I would say, are the two biggest forms of time that you can think about. CPU time is time that you actually spend on the CPU actually executing code, and wall clock time is the time between the beginning and end of the whole operation that you're trying to measure. I like to think of this a bit like a coffee metaphor, maybe that's because I drink too much coffee. But yes, if you go to any good coffee shop, or even a Starbucks at lunchtime, you'll find that there's a big queue at the beginning of that system. So, your CPU time is a bit like the time it takes the actual barista to make the coffee for you, and the wall clock time is the whole process: time waiting for someone to prepare that coffee, waiting to be served on the other side, the whole operation time, begin to end.

Now, in order to understand problems, you need to understand both the information offered by CPU time and also wall clock time. CPU time is very good because it allows you to diagnose computational hotspots, see what's actually using up the CPU, see where the actual inefficiency in your algorithm is. CPU time is also really good if you're looking at a production system, because one of the things that production systems do, which systems in development under a load test don't do, is spend a lot of their time idle, and CPU time will tell you about the things which are actually executing rather than all the waiting around. Wall clock time is very, very useful because it helps you diagnose problems that are about not using your CPU: time you spend waiting on disk IO, time you spend waiting on lock contention issues, things like that. So it's really helpful, in order to understand the full range of problems, to look at both.
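As a rough sketch of the distinction, the JVM exposes both clocks: System.nanoTime for wall clock time and ThreadMXBean for per-thread CPU time. The class and method names below are ours for illustration; the snippet times a task that first computes and then sleeps, the sleep standing in for a thread blocked on IO or a lock:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWall {
    // Returns {cpuMillis, wallMillis} for a task that computes, then waits.
    public static long[] measure() throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();

        // Busy work: burns CPU time and wall clock time together.
        long acc = 0;
        for (long i = 0; i < 20_000_000L; i++) acc += i;
        if (acc == 42) System.out.println(acc); // keep the loop from being eliminated

        // Waiting: burns wall clock time but almost no CPU time,
        // like a thread blocked on disk IO or a contended lock.
        Thread.sleep(200);

        long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        return new long[] { cpuMs, wallMs };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] r = measure();
        System.out.println("cpu=" + r[0] + "ms wall=" + r[1] + "ms");
    }
}
```

On a typical run the wall clock figure exceeds the CPU figure by roughly the 200ms spent sleeping, which is exactly the gap a CPU-time-only profile would hide.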

There are a few different kinds of visualization you have for profiling data. Hotspot-type profiling, I would say, is the simplest kind of visualization for that data. This is just a list of methods in your system, sorted by the amount of time that they're using. Some hotspot visualizations also show you, for example, where within a method time is being spent, like line number information, or you can get a hotspot view with a bottom-up view of stack traces that tells you the context for that. Another common profiler visualization is a tree-view-based visualization. The tree view basically arranges all the methods in your profiling data in a big tree, where the parent-child relationship within the tree is the parent calling the child. For example, in Java you might see java.lang.Thread at the top of the tree and then child methods coming down off that.

Flamegraphs are a newer visualization for profiling data. Is anyone familiar with Flamegraphs, or does anyone use Flamegraphs? Quite a few people, that's good to see. If you're not familiar with Flamegraphs, the way they work is that each box within the Flamegraph represents a method. You can draw Flamegraphs either the more flamey way, going up, or top down, which some people call icicles. In a top-down view, the methods which are calling their children are placed in boxes above, the boxes that go down are the children, and the width of a box indicates the total time within that method, that is, how much time that method and its children are using. So if you have a method that's wide and then has a very narrow child, that can often indicate a lot of self time within that method.
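One concrete way to see how those boxes get their widths: flamegraph tooling commonly consumes a "folded stacks" text format, with one line per unique stack, frames joined root-to-leaf by semicolons, followed by a sample count. A minimal sketch of that folding step, with illustrative stack names of our own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Collapse sampled stacks into the folded-stacks format that flamegraph
// tools consume: "frame;frame;frame count", one line per unique stack.
public class FoldedStacks {
    public static List<String> fold(List<String[]> sampledStacks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] stack : sampledStacks) {
            // Each stack is root-to-leaf; identical stacks merge into one count.
            counts.merge(String.join(";", stack), 1, Integer::sum);
        }
        List<String> lines = new ArrayList<>();
        counts.forEach((stack, n) -> lines.add(stack + " " + n));
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> samples = Arrays.asList(
            new String[]{"Thread.run", "Controller.doSomething", "Person.<init>"},
            new String[]{"Thread.run", "Controller.doSomething", "Person.<init>"},
            new String[]{"Thread.run", "Logger.log"});
        fold(samples).forEach(System.out::println);
    }
}
```

The count at the end of each line becomes the width of the leaf box; shared prefixes become the shared parent boxes beneath (or above, in the icicle orientation).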

We talked about production profiling, what information we can get out of profilers, but how do they work under the hood?

Jaffer: We're going to cover basically how profiling works in Java. We'll start with the earliest type of profiler, and we'll go through some of the reasons why maybe people are resistant to putting profilers into production, and then we'll see why that's become less of an issue with the modern techniques for creating production-safe profilers. So, the earliest type of profiler is what's called an instrumenting profiler. Instrumenting profilers add instrumentation to the code that's being tested in order to measure the consumption of whatever resource we're interested in. If we were interested in, say, wall clock time, we could add instrumentation that would measure the time at the start and at the end of a method, so we'd add instructions there, and then, as the program runs, we can actually get information on how long it took to execute individual methods.

Now, the problem with an instrumenting profiler is that, in adding instructions to the actual program we're testing, it modifies the behavior of that program. It modifies the behavior in a way that's not uniform across all programs. Let's take the example that we were just talking about, a way of instrumenting method entry and exit. If you imagine a lovely, well-refactored code base full of short methods, and then imagine a horrendous code base that basically has only a half dozen methods that are tens of thousands of lines long, clearly the impact of instrumentation is going to be different between those two code bases. The size of your methods dictates how much of an overhead that instrumentation is going to add. So the overhead varies by the type of program we're actually testing, which means the results you get are actually very inaccurate, and, add to that, it's actually very slow.
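To make the overhead concrete, here is a sketch of the kind of code an instrumenting profiler effectively injects: a timestamp at method entry, a recording call at exit, accumulated per method. The class and method names are ours for illustration; real tools weave equivalent bytecode in automatically rather than editing the source:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of what an instrumenting profiler injects: timing code at method
// entry and exit, with elapsed nanoseconds accumulated per method name.
public class InstrumentingSketch {
    static final Map<String, Long> elapsedNanos = new HashMap<>();

    static void record(String method, long startNanos) {
        long delta = System.nanoTime() - startNanos;
        elapsedNanos.merge(method, delta, Long::sum);
    }

    static int busyWork() {
        long t0 = System.nanoTime();            // injected at method entry
        int sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i % 7;
        record("busyWork", t0);                 // injected at method exit
        return sum;
    }

    public static void main(String[] args) {
        busyWork();
        System.out.println(elapsedNanos);
    }
}
```

Every instrumented method pays for two clock reads and a map update regardless of how short it is, which is why a code base full of tiny methods suffers proportionally far more than one with a few enormous ones.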

Can we do any better? Well, instead of instrumenting the program under test, what we can do is statistical, or sampling, profiling. Sampling profiling works by taking the program under test, stopping it periodically, and then measuring the consumption of whatever resource we're interested in. If we were interested in wall clock time, we could take a program, we could stop it every, say, hundred milliseconds, and we could say, "What are you doing?" Then we can record that sample, and over time we can aggregate all those samples together and build up a statistical picture of the performance of the program.

Here we have a diagram which shows how we would go about profiling this particular program. This is WebServerThread.run, which calls Controller.doSomething, which calls a Person method, which then constructs a new Person. The green lines are the points where we basically interrupt the program and grab a sample. So the first sample we grab would be in WebServerThread.run, and the next one we'd grab is in the new Person constructor; then we can aggregate all this together and produce, say, a Flamegraph, or tree view, or hotspot view, whatever kind of profiling report you'd like.
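The naive sampling loop described above can be sketched in plain Java with Thread.getAllStackTraces, aggregating the top frame of each thread into a hotspot-style histogram. This is a deliberately simple sketch with names of our own; the safe point bias of this exact call is what the next section examines:

```java
import java.util.HashMap;
import java.util.Map;

// Naive sampling profiler sketch: stop periodically, record the top frame of
// every live thread, and aggregate the counts into a hotspot-style histogram.
// Thread.getAllStackTraces is the safe-point-biased call discussed in the talk.
public class SamplingSketch {
    public static Map<String, Integer> sample(int samples, long intervalMs)
            throws InterruptedException {
        Map<String, Integer> histogram = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (stack.length == 0) continue; // thread had no frames to report
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                histogram.merge(top, 1, Integer::sum);
            }
            Thread.sleep(intervalMs); // sampling interval
        }
        return histogram;
    }

    public static void main(String[] args) throws InterruptedException {
        sample(20, 10).forEach((frame, n) -> System.out.println(frame + " " + n));
    }
}
```

Sorting that histogram by count gives the simple hotspot view described earlier; keeping whole stacks instead of top frames gives the input for a tree view or Flamegraph.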

That's all nice in theory, but there's a problem. In Java there is a way of getting the state of Java application threads, and that's called getAllStackTraces. Before we discuss why getAllStackTraces is problematic, we need to take a little detour via safe points. How many people are aware of safe points, what they are? Just a show of hands. Safe points are a mechanism by which the JVM can bring Java application threads to a halt, normally to do some kind of work that would be difficult or problematic to do while the application was still running. So, things like biased locks and GC operations; and safe points are, as the name says, safe points at which to do this work.

The way safe points are implemented is: in the interpreter, there are safe points after every instruction; with the JIT in Hotspot, what I call safe point polls are added to compiled code. So, what is a poll? A poll is simply a read from a known memory location, and these instructions are added at the beginning and at the end of compiled methods and also inside certain types of loops. All that happens is these simple tests read from a known memory location, and then, when the JVM wants to bring the application to a halt, it can unmap or protect that memory location, all these threads trigger a segmentation fault, and the segmentation fault handler just stops them and gets on with whatever else needs to be done. So, that's how safe points work.

Now, the problem with getAllStackTraces, which is the JVMTI method for getting the stack traces of the threads in the system, is that it requires being at a safe point to get that information. That's the point where it can safely say what the actual threads are doing. So, we said safe points are at the beginning and at the end of a method and inside certain types of loops. Before, we had our idealized sampling; now we have our sampling using the naive method, which requires being at a safe point, and you can see now that instead of the green arrows, which should have been our samples, you're seeing these red points; those are the actual samples we would end up taking, because that's the nearest safe point. You can see that, using the naive implementation, you actually end up with biased samples. You end up seeing the safe points that are nearest to whatever you're trying to measure. But that's problematic, because it means the results you're getting are not right.

Then there's a second problem that makes this even worse. One of the optimizations that the JIT does that's actually very effective is inlining, very aggressive inlining. When it actually compiles code, if you've got a method that calls another small method, and it passes certain tests, then instead of that actually being done as a method call, the contents of the method being called are inlined into the caller. In doing that, you can optimize certain things away, one of which is the safe point, but also other things like having to store arguments in registers and such like. There's a big performance win from doing it, especially with a language like Java where, for example, you might have lots of getters and setters, which do well from being inlined.

Now, inlining removes safe points, and this complicates our problem even more. So, if we look again, now we've lost the safe points in these purple methods, which have been inlined into their caller. So now we're seeing an even more distorted picture. We're not even necessarily seeing the method that was actually causing the problem; we're potentially attributing it to a caller or a parent, and that's problematic.

Warburton: Another thing that can also cause this kind of problem, where you don't have safe points with the regularity that you would expect, is loops. Regular loops have safe point polls within their loop header, but when you have counted loops, something that looks like for (int i = 0; i < n; i++) with an int counter, the JIT removes the safe point poll from the loop, so a long-running counted loop can go a long time without ever reaching a safe point.
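The two loop shapes in question look like this. With the Hotspot C2 JIT, the int-counted version has historically been compiled without a safe point poll in the loop body, while the long-counted one keeps it; the difference only shows up in the JIT's output (for example via -XX:+PrintAssembly, or the -XX:+UseCountedLoopSafepoints flag that restores the poll), not in the results the methods compute:

```java
// Two loop shapes that compute the same sum but are treated differently
// by the Hotspot JIT with respect to safe point polls.
public class LoopShapes {
    // Counted loop: int counter with a known bound. The JIT has
    // historically elided the safe point poll from the loop body.
    static long countedInt(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    // Long counter: not treated as a counted loop, so the safe
    // point poll in the loop is retained.
    static long countedLong(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(countedInt(1_000_000));
        System.out.println(countedLong(1_000_000));
    }
}
```

Behaviorally they are identical; the practical consequence is that a hot int-counted loop can keep a thread away from a safe point for a long time, which matters for the time-to-safe-point problem described next.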

Jaffer: Sure. The other problem we've got is that not only are the results that we're getting biased, but, in making that call, we are triggering a safe point. Now, as we said, safe points bring application threads to a halt, and the polls are at the beginning and end of methods and in certain types of loops. And, as Richard pointed out, you might actually go for quite a while; in very nice, well-optimized code, you might go for a reasonable amount of time without seeing a safe point. Now, the problem with that is, this is a diagram that represents a system: you've got four threads in there, and we've triggered a safe point. The first thread hits a safe point very quickly, the second thread and the third thread join it a little while afterwards, but the fourth thread goes for quite a while before hitting a safe point. When it finally does hit a safe point, they're each parked and the VM can get on with whatever operation it wants to.

Now, that time there, between the first one and the last one, that's wasted time in the system. All this area where these threads are not running is extra latency in your system, actually wasted time. Why are we going through this? Well, it means that this limits the rate. Even if you are happy with the biasing in the naive method of getting profiling information out of the JVM, the fact that you have to park those application threads limits the frequency with which you can do it. So the data you're getting is biased, and the frequency with which you can sample is relatively low without impacting the performance of the application. And that's why you can't take a lot of the desktop profilers, hook them up to production, and expect that they won't ruin production's performance.

What can we do about it? We've said why getAllStackTraces is expensive to do frequently and is inaccurate. Also, it only gives us wall clock time; it's very hard to get CPU time out of it. What do we do about that? Well, as we said, for sampling profiling we need to do two things. We need to interrupt the application, and then we need to sample the resource of interest. And it turns out there's another way you can do this. For interrupting the application, we can rely on what are called operating system signals. How many people are aware of operating system signals? Just some hands as well. Signals are a way for the operating system to asynchronously deliver a message to a process. If you've ever killed a process, you've sent a signal to a process, SIGTERM or SIGKILL, one of them.

Now, you can use OS signals to deliver a message to a process, and the kernel will make sure that message is delivered to one of the threads in that process. It's a very lightweight mechanism, and it's used a lot. It has some quirky characteristics you have to be careful about, because a signal can be delivered to a thread at any point. You have to be very careful what you do inside the code that handles that signal, because the signal handler, which is triggered when the thread receives a signal, could be called inside a memory allocation, or while holding a mutex - you could be in any kind of state. You can't really reason much about where you are. So you have to be very careful about what you do inside that signal handler.

But within those constraints, it's possible to actually sample resources of interest that are useful. One of those is made possible by a call inside the JVM that's not an official one: AsyncGetCallTrace. What that does is say, I'd like the current call stack for the thread that I'm on. It's designed to be run inside signal handlers - that's what the async means. It means it is safe to be called inside a signal handler: it's not going to allocate memory, it's not going to take any locks or anything that might deadlock the application. So we can grab the current state, grab the stack, and then we can also, if we're careful, reach inside the JVM and do other things at that point in time. But we have to be very careful what we do, because we can't rely on the state being consistent.
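For reference, the AsyncGetCallTrace interface looks roughly like the following. It is unofficial and not in any public JDK header, so the JNI types are stubbed here to keep the sketch self-contained; a real agent would include `<jni.h>` and resolve the symbol from the running JVM (e.g. via `dlsym`), then call it from inside its signal handler.

```c
/* Stubbed JNI types - a real agent gets these from <jni.h>. */
typedef int   jint;
typedef void *jmethodID;
typedef void  JNIEnv;

typedef struct {
    jint      lineno;     /* bytecode index of this frame */
    jmethodID method_id;  /* method executing in this frame */
} ASGCT_CallFrame;

typedef struct {
    JNIEnv          *env_id;     /* JNIEnv of the sampled thread */
    jint             num_frames; /* frames collected, or < 0 on error */
    ASGCT_CallFrame *frames;     /* caller-allocated frame buffer */
} ASGCT_CallTrace;

/* Resolved from the JVM at runtime, called from the signal handler:
 *   void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth,
 *                          void *ucontext);
 * A negative num_frames encodes failure modes such as "thread not in
 * Java code" or "stack walk unsafe at this point in time". */

/* Helper a sampler might use to separate good traces from errors. */
int trace_ok(const ASGCT_CallTrace *t) { return t->num_frames > 0; }
```

The error codes matter in practice: because the handler can fire at any instruction, some fraction of samples will be unwalkable, and a sampler simply counts and discards those rather than trying to recover.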

But with those two, we can sample from a Java application without safe point bias, and we can do it at reasonable sampling rates with very, very low overhead. When a signal is delivered to a thread, only that thread runs the signal handler. The rest of the system carries on running; there's no safe point, there's no waiting around. It's not an approach used by any of the existing desktop profilers, but there are - and we'll come to them in further slides - some good open source profilers that do that.

We have a way of profiling with very low overhead, in a way that's usable in a production environment. By choosing our sampling rate, we can basically make the overhead as little as we like. So, why don't people actually do that? Well, our experience over the last few years has been that people are put off by practical issues as much as technical ones. What are the practical issues around using some of these advanced profilers in a production environment?

Well, the first thing is, they generally require access to production. You have to go to your ops people and say, "Hi, I'd like you to add this agent to our Java flags." You've either got to have access to production yourself, or you've got to get somebody to change something in production. In many organizations that doesn't happen; the developers don't have access to production. And even if you do have access, the downsides of messing up the production environment tend to be pretty catastrophic for you, so people shy away from doing that.

Warburton: In some organizations there might even be legal barriers, Chinese walls, between the people who write the software and the operations people who run it.

Jaffer: Sure. And then the process also involves manual work. If you're doing ad hoc production profiling, you're saying, "I've got a problem, I want to hook up a profiler, get some information, analyze it." That normally requires running the profiler on a server, collecting the information, copying it locally, running your analysis, then making your fix, and then repeating the whole process, right? So there's a lot of manual work involved. The other problem is that the profilers that can do this in production tend to be open source without commercial support. And that, for many organizations, can also be a problem for running them in a production environment.

How could we work around this? Well, we got thinking: instead of just profiling in response to a problem, if the overhead is low enough, why don't we just profile all the time? It removes some of the problems that we have. We don't have to get access to production each time we want to do an ad hoc profiling session, because we're just doing it all the time. Also, if we're profiling all the time, does it open up any new capabilities for us? Well, it does. For one, if your CPUs decide to peg at 3:00 in the morning and the next day you're doing a postmortem, you can go back and say, "Well, we have that historical profiling data. What happened? What was actually running at that point in time?" If you're only doing ad hoc profiling, you can't do that.

The other problem with doing ad hoc profiling in response to a problem is that you get this profile information, but how do you know what normal looks like? Because you're, by definition, looking at an abnormal system. With historical data, you can go back and look at what normal looked like, and compare the two. So, it opens up some new capabilities. You could say, our new version of the application performs poorly - how does that compare to our profiling information from the previous version?

Then it also lets you start putting your samples in context. We talked about version, but you could start attaching environmental parameters. I've got Amazon AWS C3s - how do they compare to the C4s? I'm running this version of the JVM - how does it compare to this other version? We can start attaching that information to our profiling data. And that's something you can't get with just ad hoc profiling.
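The idea above - tagging each sample with context so profiles can later be filtered and compared - can be sketched as follows. The field names, version tags, and instance types are purely illustrative; a real system would tag samples at collection time and run these queries in its aggregation service.

```c
#include <string.h>

/* One stack sample, tagged with its environmental context. */
typedef struct {
    long        when;       /* epoch seconds the sample was taken */
    const char *top_frame;  /* hottest frame of the sampled stack */
    const char *version;    /* application version tag */
    const char *instance;   /* e.g. cloud instance type */
} Sample;

/* Count how often a frame was sampled under one version tag,
 * enabling "v2.1 vs v2.0" style comparisons of the same code. */
int count_frame(const Sample *s, int n,
                const char *frame, const char *version) {
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(s[i].top_frame, frame) == 0 &&
            strcmp(s[i].version, version) == 0)
            hits++;
    return hits;
}
```

The same lookup keyed on `instance` instead of `version` would answer the C3-versus-C4 question; the point is that the comparison is only possible because the context was recorded with every sample.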

How to Implement Continuous Profiling

How can we actually implement continuous profiling? Well, there's a great paper from Google called "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". It's out there on their website as a PDF, and it details how their system works. They have collectors that take this data from their production environments. They have another system that collects the binaries and does symbolization, what have you, and that's then stored in databases which their users can run queries against to generate profiling reports, the kind of reports that we looked at earlier. They have a system for doing this, and it's been running in Google for years.

But what if you wanted to build it yourself? Well, we talked about some open source profilers. Of the open source profilers that use the low-overhead method we looked at before, or something very similar, there's async-profiler, and there's Honest Profiler, which was written by Richard here. There's also Java Flight Recorder as well. These are safe for hooking up to production, but it's worth pointing out, they are profilers that hook up to production for ad hoc profiling. You still need some method to collect and store that information and then report on it in some way. In some organizations I've worked in in the past, we've come up with ways of doing that: storing the information, and then being able to get at it if we need it for analysis.

Alternatively, there's what we do at Opsian: we have a continuous profiling platform which implements this. You hook up the profiling agent to your JVMs, they send their profiling data over to an indexing and aggregation service, which aggregates all of that and makes it available for reporting via a browser-based interface - basically the same kind of system we saw in the last few slides.


In summary, it's possible to profile in production with low overhead. To overcome some of the practical issues around doing it, you should profile all the time; it makes life a lot easier. Profiling all the time not only solves the practical issues from before, but opens up a whole bunch of new capabilities as well.

Warburton: In conclusion, we talked about why performance matters and why we need tooling to understand both the Known Unknowns and especially the Unknown Unknowns in our systems - things we couldn't necessarily think of in advance. We talked about how development and production environments often differ so much that you can have very different root causes for performance problems, and how load tests can differ so much from production traffic that you find it hard to replicate those problems. We talked about how metrics are incredibly useful to gather from a production system for understanding that there is a problem, but aren't necessarily going to give you the root cause analysis, the vector into your system. Instrumentation is an approach that can give you that, but it has a very high overhead at a fine-grained level, and at a coarse-grained level it gives less useful information. And continuous profiling can provide a lot of insight.

We think we need a bit of an attitude shift on profiling and monitoring. We want profiling to be done in a systematic way, not an ad hoc way. That means doing it continuously, all the time, on your system, and not having to hook up tooling when there's a problem or a one-off event that you need to resolve - being proactive about these things rather than reactive, having things in place to understand and resolve these issues. In other words, please do production profiling all the time; that's what we'd love you to do. If you use our tooling, that's great, but even if you don't, that's also great. Just please do it. It's really useful and I'm sure you will find it very helpful for your systems as well. Thank you very much.

Questions & Answers

Participant 1: If I listened correctly, you provide a solution where we can inject some agent into our application and you will collect these metrics and we can access them via some web-based interface. But the question is, is it possible to install some on-premises implementation of this?

Warburton: Is there an on-prem version of this, is that the question?

Participant 1: Yes. Because sometimes it's just from perspective of some ...

Jaffer: Yes, it doesn't have to be in the cloud. The aggregation and storage can happen inside your premises or in the cloud.

Participant 2: Have you ever considered using the output of your profiling in any form of automated testing to try and highlight code change problems within algorithms?

Jaffer: It's a really good question. We were having an interesting discussion earlier today about whether we could use the information to, say, identify things in very large code bases: "Hey, you're actually using this particular method, and that's only available in a really old version of this library you're using. If you used this newer method, it would be a lot quicker." So, yes, there are possibilities for that. There are possibilities for, say, security: you're actually using this method and it's deprecated because it's got a vulnerability in it, that kind of thing. So, yes, you could use the data you get out of a profiling system for that kind of stuff.

Warburton: One thing - maybe I've interpreted it in a different way - which was, can you profile JUnit tests? Is that the question you're asking?

Participant 2: No.

Warburton: No. Okay, great.

Participant 3: Is Opsian just gathering the statistics and presenting them, or does it include its own profiler, or is it built on somebody else's profiler?

Jaffer: It includes its own profiler. We basically did a lot of development work on top of Richard's Honest Profiler, which has a long history of being used in production. So, yes.

Participant 4: Great talk. Have you any experience of aggregating this kind of profiling in distributed computing like Spark, or Sarup, and stuff like that?

Warburton: We do, yes. We do in terms of aggregating jobs that are similar. A lot of what we've seen is more people with web request-response-oriented workloads, rather than something like Spark. But, in general, when you're looking at profiling data and where the bottlenecks are, often you can quite nicely aggregate over different machines. Things like knowing which individual thread had a specific problem don't come out so well, but if you're looking for bigger code-efficiency problems, it works.

Jaffer: When we've looked at things like Spark before, if your problem is a CPU problem - say you're converting stuff inside a particular part of your job - then it's pretty obvious. If your problem is deep within Spark, in how it's transferring the data, it's a bit more problematic because of the async nature of the whole thing. So it really depends on your bottleneck.

Participant 5: With implementing continuous profiling, do you risk shifting some things to the right? What I mean by that is, we often want things to be caught early, so we shift things to the left: people consider profiling and performance early, in testing. Do you risk people going, "Oh, I've got continuous profiling in production, I'm not going to bother. I'm just going to wait and see how it works in production"? Do people become complacent with continuous profiling?

Warburton: That's a really good question. I don't have any nice head-to-head, blind scientific studies on that kind of thing. But my general experience is that people basically do that anyway, and then when things hit production, they're blind. Even when they do have performance testing ahead of time, that can help them identify and solve certain problems, but there are often things that they miss anyway. So, I think there's a risk of complacency with either approach, and that's probably more of a cultural, team-honesty, reflection-type value issue rather than a technology-solution, tool-type issue, in my opinion at least.

Participant 6: My question is about your profiler. Do you provide some detailed information about the threads which are not currently doing Java stuff? I mean, native libraries, kernel calls and so on, as is done in async-profiler for example.

Jaffer: Is that wall clock time - in terms of when threads are not actively doing work, they're just blocked on something?

Participant 6: Yes, but you want to know where. Because if you just look at the stack trace in Java, you will see that it does something.

Jaffer: So you're thinking of the actual lock itself that it is blocked on, or?

Participant 6: I mean some deeper information. Where the Java thread is spending time inside the kernel, for example.

Jaffer: Oh, inside the kernel. No. As you were saying, with things like async-profiler you can go further into the kernel. We don't with our product. Mainly because, for one, there are security issues once you start getting into that, and it gets a bit more problematic saying to people, "Hey, install this on your production servers. Oh, and by the way, it needs to be root, or it needs to have certain elevated privileges." But the other issue is performance as well. What we found is, if you can get yourself within a small enough overhead envelope, there basically isn't the criticism that it's going to slow down production, and then it's a much easier process getting this into production. Does that answer your question?

Warburton: I think we're pretty much out of time. But thank you very much. It's been a pleasure to speak to you.

Jaffer: Thank you.




Recorded at:

Apr 19, 2019