
To Execution Profile or to Memory Profile? That is the question.

Posted by Kirk Pepperdine on Dec 04, 2013 |

I recently had a group of developers from my performance workshop troubleshooting a problem-riddled application. After dispensing with a couple of easy wins, the group was faced with a CPU that was running very hot. The group reacted exactly as I see most teams do when faced with a hot CPU: they fired up an execution profiler, hoping it would help them sort things out. In this particular case, the problem was related to how the application was burning through memory. While an execution profiler can find these problems, a memory profiler paints a much clearer picture. My group had somehow missed a key metric that was telling them they should have been using a memory profiler. Let's run through a similar exercise here so that we can see when and why it is better to use a memory profiler.

Profilers work by sampling the top of the stack, by instrumenting the code with probes, or by a combination of both. These techniques are very good at finding computations that happen frequently or take a long time. As my group experienced, the information gathered by execution profilers often correlates well with the source of the memory inefficiency; however, it points to an execution problem, which can sometimes be confusing.

The code in Listing 1 defines the method findCustomer(String,String). The problem here isn't so much the API itself as how the String parameters are treated by the method. The code concatenates the two strings to form a key that is used to look up the data in a map. This misuse of strings is a code smell in that it indicates a missing abstraction. As we will see, that missing abstraction is not only at the root of the performance problem; adding it also improves the readability of the code. In this case the missing abstraction is a CompositeKey<String,String>, a class that wraps the two strings and implements both the equals(Object) and hashCode() methods.

public class CustomerList {

  private final Map<String, Customer> customers = new ConcurrentHashMap<>();

  public Customer addCustomer(String firstName, String lastName) {
    Customer person = new Customer(firstName, lastName);
    customers.put(firstName + lastName, person);
    return person;
  }

  public Customer findCustomer(String firstName, String lastName) {
    return customers.get(firstName + lastName);
  }
}

Listing 1. Source for CustomerList

Another downside to the style of API used in this example is that it limits scalability because of the amount of data the CPU is required to write to memory. In addition to the extra work to create the data, the volume of data being written to memory by the CPU creates a back pressure that forces the CPU to slow down. Though this benchmark is artificial in how it presents the problem, the problem itself is not so uncommon in applications using the popular logging frameworks. That said, don't be fooled into thinking only String concatenation can be at fault. Memory pressure can be created by any application that is churning through memory, regardless of the underlying data structure.
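You can observe this churn directly by asking the JVM how many bytes a thread has allocated. The following is my own sketch, not part of the article's benchmark; it assumes a HotSpot-based JVM, where the platform ThreadMXBean also implements com.sun.management.ThreadMXBean:

```java
import java.lang.management.ManagementFactory;

// Sketch: observe the bytes allocated by a loop of String concatenations.
// Assumes HotSpot: the platform ThreadMXBean also implements
// com.sun.management.ThreadMXBean, which exposes per-thread allocation counts.
public class ConcatChurn {
    public static void main(String[] args) {
        com.sun.management.ThreadMXBean mx =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        long before = mx.getThreadAllocatedBytes(tid);

        String key = "";
        for (int i = 0; i < 100_000; i++) {
            key = "John" + ("Smith" + i); // a fresh StringBuilder, char[] and String each pass
        }

        long allocated = mx.getThreadAllocatedBytes(tid) - before;
        // keep 'key' live so the loop cannot be optimized away entirely
        System.out.println("allocated " + (allocated >> 20) + " MB (key length " + key.length() + ")");
    }
}
```

Even this tiny loop allocates tens of megabytes; multiply by the call rate of a busy lookup method and the back pressure described above follows.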

The easiest way to determine whether our application is burning through memory is to examine the garbage collection logs. Garbage collection logs report heap occupancy before and after each collection. Subtracting the occupancy after the previous collection from the occupancy before the current collection yields the amount of memory allocated between the two collections. If we do this for many records we get a pretty clear picture of the application's memory needs. Moreover, collecting the needed GC log is cheap and, with the exception of a couple of edge cases, has no impact on the performance of your application. I used the flags -Xloggc:gc.log and -XX:+PrintGCDetails to create a GC log with a sufficient level of detail, and then loaded the GC log file into Censum, jClarity's GC log analysis tool.
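The arithmetic described above can be sketched in a few lines. The records below are hypothetical numbers standing in for parsed gc.log entries, not output from the actual benchmark:

```java
// Sketch: deriving allocation rates from GC log records.
// Each record is {timestamp (ms), heap occupancy before GC (KB), occupancy after GC (KB)};
// the values are made up for illustration, not taken from a real gc.log.
public class AllocationRate {
    public static void main(String[] args) {
        long[][] gcRecords = {
            {1000, 524288, 65536},
            {1400, 524288, 65536},
            {1800, 524288, 65536},
        };
        for (int i = 1; i < gcRecords.length; i++) {
            // allocated = occupancy before this GC - occupancy after the previous GC
            long allocatedKb = gcRecords[i][1] - gcRecords[i - 1][2];
            double seconds = (gcRecords[i][0] - gcRecords[i - 1][0]) / 1000.0;
            double mbPerSec = allocatedKb / 1024.0 / seconds;
            System.out.printf("interval %d: %.0f MB/s%n", i, mbPerSec);
        }
    }
}
```

Averaging this figure over many intervals, as Censum does, gives the allocation-rate picture discussed below.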


Table 1. Garbage Collection Activity Summary

Censum provides a whole host of statistics (see Table 1), of which we're interested in the "Collection Type Breakdown" (at the bottom). The "% Paused" column (the sixth column in Table 1) tells us that the total time paused for GC was 0.86%. In general we'd like GC pause time to be less than 5%, which it is, so this number suggests that the collectors are able to reclaim memory without too much effort. Keep in mind, however, that when it comes to performance a single measure rarely tells the whole story. In this case we also need to see the allocation rates, and Chart 1 shows just that.


Chart 1. Allocation Rates

If we look at the chart we can see that the allocation rate initially starts out at about 2.75 gigabytes per second. The laptop I used to run this benchmark can, under ideal conditions, sustain an allocation rate of about 4 gigabytes per second, so 2.75 gigabytes per second represents a significant portion of the total memory bandwidth. In fact, the machine is not able to sustain that rate, as evidenced by the drop in allocation rates over time. While your production servers may have a larger capacity to consume memory, in my experience any machine trying to maintain object-creation rates greater than 500 megabytes per second will spend a significant amount of time allocating memory and will have a very limited ability to scale. Since memory efficiency is the overriding bottleneck in our application, the biggest wins will come from making it more memory efficient.

Execution Profiling

It should go without saying that if we're looking to improve memory efficiency we should be using a memory profiler. However, when faced with a hot CPU, our group decided that they should use execution profiling, so let's start with that and see where it leads. I used the NetBeans profiler, running in VisualVM in its default configuration, to produce the profile in Chart 2.

Chart 2. Execution Profile

Looking at the chart we can see that, outside of the Worker.run() method, most of the time is spent in CustomerList.findCustomer(String,String). If the source code were a bit more complex, you could imagine it being difficult to understand why this code is a problem or what to do to improve performance. Let's contrast this view with the one presented by memory profiling.

Memory Profiling

Ideally I would like my memory profiler to show me how much memory is being consumed and how many objects are being created. I would also like to know the causal execution paths; that is, the paths through the source code that are responsible for churning through memory. I can get these statistics using the NetBeans profiler, once again running in VisualVM, but I will need to configure the profiler to collect allocation stack traces. This configuration can be seen in Figure 1.


Figure 1. Configuring NetBeans memory profiler

Note that the profiler will not record every allocation, only every 10th. Sampling in this manner should produce the same result as capturing every allocation, but with much less overhead. The resulting profile is shown in Chart 3.


Chart 3. Memory Profile

The chart identifies char[] as the most popular object. With this information in hand, the next step is to take a snapshot and then look at the allocation stack traces for char[]. The snapshot can be seen in Chart 4.


Chart 4. char[] allocation stack traces

The chart shows three major sources of char[] creation, one of which is opened up so that you can see the details. In all three cases the root can be traced back to the firstName + lastName operation. At this point the group came up with numerous alternatives; however, none of the proposed solutions were as efficient as the code produced by the compiler. It was clear that to make the application run faster we were going to have to eliminate the concatenation. The solution that eventually worked was to introduce a Pair class that took the first and last name as arguments. We called this class CompositeKey, as it supplied the missing abstraction. The improved code can be seen in Listing 2.

public class CustomerList {

  private final Map<CompositeKey, Customer> customers = new ConcurrentHashMap<>();

  public Customer addCustomer(String firstName, String lastName) {
    Customer person = new Customer(firstName, lastName);
    customers.put(new CompositeKey(firstName, lastName), person);
    return person;
  }

  public Customer findCustomer(String firstName, String lastName) {
    return customers.get(new CompositeKey(firstName, lastName));
  }
}

Listing 2. Improved Implementation using CompositeKey abstraction

CompositeKey implements both hashCode() and equals(), eliminating the need to concatenate the strings. While the first benchmark completed in ~63 seconds, the improved version ran in ~21 seconds, a 3x improvement. The garbage collector ran only four times, making it impossible to get an accurate picture of allocation rates, but what can be said is that the application allocated just under 3 GB of data in aggregate, as opposed to the more than 141 GB consumed by the first implementation.
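The article doesn't show the source of CompositeKey itself. A minimal version, with field names and implementation details that are my own sketch rather than the original code, might look like:

```java
import java.util.Objects;

// Minimal sketch of the CompositeKey abstraction described above.
// The essential point is that equals() and hashCode() compare the two
// strings directly, so no concatenated key String is ever allocated.
public final class CompositeKey {
    private final String firstName;
    private final String lastName;

    public CompositeKey(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CompositeKey)) return false;
        CompositeKey that = (CompositeKey) other;
        return firstName.equals(that.firstName) && lastName.equals(that.lastName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}
```

Because equal keys produce equal hash codes, two CompositeKey instances built from the same names find each other in the ConcurrentHashMap exactly as the concatenated strings did, but without the per-lookup char[] churn.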

Two ways to fill a water tower

A colleague of mine once said that you can fill a water tower one teaspoon at a time. This example proves that you certainly can. However, it's not the only way to fill the tower; you could also open a large hose and fill it very quickly. In those cases it's unlikely that an execution profiler would pick up on the problem, but the garbage collector will see the allocation and the recovery, and a memory profiler will certainly see the allocation in sheer byte count. In one application where these large allocations were predominant, the development team had exhausted the vast majority of the gains they were going to get from an execution profiler, yet they still needed to squeeze more out of the app. At that point we turned on the memory profiler, and it exposed one allocation hotspot after another; with that information we were able to extract a number of significant performance gains. What that team learned is that memory profiling was not only giving them the right view, it was giving them the only view into the problem. This is not to say that execution profiling isn't productive; it is to say that sometimes it can't tell you where your application is spending all of its time, and in those cases getting a different perspective on the problem can make all the difference in the world.

About the Author

Kirk Pepperdine has worked in high performance and distributed computing for nearly 20 years. Since 1998 Kirk has worked on all aspects of performance and tuning in each phase of a project life cycle. In 2005 he helped author the foremost Java performance tuning workshop, which has been presented to hundreds of developers worldwide. Author, speaker, and consultant, Kirk was recognized in 2006 as a Java Champion for his contributions to the Java community. He was the first non-Sun employee to present a technical lab at JavaOne, an achievement that opened up the opportunity for others in the industry to do so. He was named a JavaOne Rockstar in 2011 and 2012 for his talks on garbage collection. You can reach him by email at kirk@kodewerk.com or on Twitter at @kcpeppe.

 


allocation is still execution by William Louth

Kirk makes a living looking at GC logs, writing tools that look at GC logs, and giving training to help people look at log files, so it is not surprising that he wants people to focus on memory profiling and discount execution profiling. But that seems to miss the point that allocation is still execution, a point I'd expect most who routinely code and deploy into production to recognize immediately.

The problem, I suspect, is that Kirk is still caught in the 90's profiler scene, thinking that execution measurement is only ever (wall) clock time. Modern performance measurement solutions such as JXInsight/OpenCore (www.jinspired.com/products/opencore) or Satoris (www.jinspired.org/satoris) support multiple measures (meters) at the entry and exit points of such execution constructs. You can have clock.time, gc.time, cpu.time, waiting.time, blocking.time, ... yes, you can sometimes have your cake and eat it.

Anyway, the code listed above looks like it was written by a complete amateur... it has more problems than just performance. Even those not skilled in writing high performance code would not need much in the IQ department to see and fix this code straight off, without ever starting up a tool.

Smart execution profilers help you find the best starting point in the execution flow to fix a problem which is not always the same place where the hotspot is.

www.jinspired.com/site/case-study-scala-compile...

For more information on the art in performance measurement and the multiple realities offered:

www.jinspired.com/site/reality-reactivity-relev...

Code executes. That execution has costs. That cost can be direct or indirect, constant or variable.

Re: allocation is still execution by Kirk Pepperdine

Yes William, did you happen to read this sentence: "This is not to say that execution profiling isn't productive." And I think the execution profile included in the article clearly shows that memory allocation *is* execution... well, most of the time. That said, execution profilers may not be the best tooling to provide clues as to why code may be running hot, and memory profilers are better at exposing the root cause of memory problems. I don't feel the need to counter the inaccuracies in your other claims, as it's obvious that you didn't carefully read the article.

Finally, how hypocritical it is for you to (incompletely) point out how I make a living and then go on to flog your products.

Re: allocation is still execution by William Louth

I don't think so. I showed your viewpoint is very much biased (and not just by experience). I've spent years designing and engineering products that are execution profilers that also measure (meter) time, allocation, and GC time. There's no underlying difference between the two from the definition of memory profiler presented here. I see memory profilers as heap dump analysis tools... period. I've never had to use a separate memory profiler (allocation tracking) in all my years, though I have viewed heap dumps, sent from customers, in tools such as Eclipse MAT.

From a performance-overhead point of view the difference can indeed be significant. Memory allocation profiling (and its even more expensive cousin, allocation tracking) creates so much overhead that in a multithreaded environment you are looking at a completely different system. For many developers this means resorting to crude forms of collection and analysis, such as GC logs, which are far removed from the underlying causes... but being productive is not always a primary concern for some.

If you ever bothered to read an article in some depth you would realize that the best an intelligent execution profiling solution can do is bring the developer to the appropriate point in the code from which to begin the change (something already understood from the analysis presented). In the general case of a database transaction this would be further up in the call stack than the driver or ORM library. A good profiler eliminates non-hotspots as it measures, reasons, and learns in the process. Eventually all that is left is a few key call sites that "root" calls and "cause" consumption. I wrote an article on this recently.

www.jinspired.com/site/instrumenting-every-line...

I don't know what inaccuracies you allude to, except the ones you made in the article itself.

And finally Satoris at its heart is an open source initiative to standardize on a common instrumentation API across real and simulated runtimes in whatever language.

www.jinspired.org/satoris/open-api

Personal attacks in comments will be deleted by Floyd Marinescu

William,

As Editorial steward of InfoQ, I am speaking on behalf of our editorial team to warn that any further comments from you that have any subtle or obvious personal attacks in them will be deleted.

InfoQ values technical debate and respects the opinions of all; our mission is to facilitate the spread of knowledge and innovation in software development, and the fact that this is a completely practitioner-driven site in all respects is what makes this possible. An environment where personal attacks are given in response to someone offering their expertise will make it hard for us to encourage more community participation, and will not be tolerated.

thank you,

InfoQ editorial team

allocation is still execution by Ian Ringrose

I have often wished that I could get an execution profiler to add a fake time to a method based on how much memory it allocates.

I expect that the example would run OK on .NET, as its GC is very fast at coping with short-lived objects, and these short-lived objects tend to never leave the CPU cache. However, I have seen lots of CPU issues in .NET that did come down to memory problems.

Re: allocation is still execution by Kirk Pepperdine

An interesting thought. What I've found is that it's easier to find a problem if you have the right perspective. I've not tried to say that execution profilers won't find the problem; it's more that a memory profiler offers a better perspective for these sorts of problems.

Re: allocation is still execution by William Louth

A meter in OpenCore, as well as in Satoris, can be mapped to any thread-specific counter, including some built-in counters within the JVM that track object allocation, though this does not come without some cost once enabled. We previously supported this in our own custom native agent, but recently offered support for a counter made available in the com.sun.* package.

www.jinspired.com/site/jxinsight-opencore-6-4-1...

Google engineers claimed at the recent JVM Language Summit to use this feature with little overhead, but from our benchmarking it can easily drop the throughput of a Java server, such as Apache Cassandra, by 10x. I'm going to assume their code never allocates Java objects ;-).
