BT

Do Java 6 threading optimizations actually work?

Posted by Jeroen Borgers on Jun 18, 2008 |

Introduction - Threading optimizations in Java 6

Much attention has been given by Sun, IBM, BEA and others to optimize lock management and synchronization in their respective Java 6 virtual machine offerings. Features like biased locking, lock coarsening, lock elision by escape analysis and adaptive spin locking are all designed to increase concurrency by allowing more effective sharing amongst application threads. As sophisticated and interesting as each of these features are, the question is; will they actually make good on these promises? In this two part article I will explore these features and attempt to answer the performance question with the aid of a single threaded benchmark.


Locking is Pessimistic

The locking model supported in Java (as is the case with most threading libraries) is a very pessimistic one. If there is even the most remote danger that two or more threads will utilize data in a way that interferes with each other, we are forced to the draconian solution of using locks to prevent this from happening. Yet research has shown that locks are rarely, if ever, contended for. Translation, a thread asking for a lock will rarely have to wait to acquire it. Yet the act of asking for a lock will trigger a sequence of actions that can result in accumulation of significant overheads, one that cannot be avoided.



We do have some choice. Consider for example usage of the thread safe StringBuffer. Ask yourself this, if you've ever used StringBuffer knowing that it would only be accessed from a single thread, why didn't you use StringBuilder instead?

Knowing that most locks won't be contended or will rarely be contended isn't very helpful because if there is even the smallest probability that two threads could access the same data, then we are pretty much forced to protect access using a lock via synchronization. It is only when we look at the lock in the context of a runtime environment can we finally answer the question, did we really need the lock? To obtain an answer to this question, the JVM developers have started experimenting with HotSpot and the JIT. From this work we now have adaptive spinning, biased locking and two forms of lock elimination known as lock coarsening, lock elision. Before we start benching, let's take some time to review these features so that we all understand how they work.

Escape analysis - lock elision explained

Escape analysis is examination of the scope of all references in a running application. The analysis is carried out as a normal part of the HotSpot profilers work. If HotSpot (via Escape Analysis) can determine that references to an object are limited to a local scope and none of those references can "escape" to a broader scope, it can direct the JIT to apply a number of runtime optimizations. One such optimization is known as lock elision. When references to a lock are limited to some local scope, it implies that only the thread that created the lock will ever have access to that lock. Under those conditions the values in the synchronized block will never be contended for. This implies that we never really needed the lock and it can safely be elided (omitted). Consider the following method:

publicString concatBuffer(String s1, String s2, String s3) {,
StringBuffer sb = new StringBuffer();
sb.append(s1);
sb.append(s2);
sb.append(s3);
return sb.toString();
}

Figure 1. String concatenation using local StringBuffer

If we examine the variable sb we can quickly determine that it only lives with in the confines of the concatBuffer method. Further more, references to sb never "escape" from the scope in which it has been declared. Thus there is no way another thread could ever gain access to our copy of sb. With the knowledge in hand we know that locks protecting sb can be elided.

On the surface it looks as though lock elision allows us to write thread safe code without any synchronization penalty for using in cases where it really wasn't needed. If it really works is a question we will investigate with our benchmark.

Biased locking explained

Biased locking is derived from the observation that most locks are never accessed by more than one thread during their life time. On those rare occasions when multiple threads do share data, that access is rarely contended. To understand the advantage of biased locking in light of the observation we first need to review how locks (monitors) are acquired.

Acquiring a lock is a two step dance. First you need to acquire a lease. Once you have the lease in hand you are free to grab the lock. To acquire the lease the thread is required to execute an expensive atomic instruction. Releasing the lock traditionally releases the lease. In light of our observation it seems as if we should be able to optimize access in the case where a thread is looping over a synchronized block of code. One of the things we could do is coarsen the lock to include the loop. This would cause the thread to access the lock once instead of loop count times. However this isn't a good solution because it may lock out the other threads from being able to have fair access. A more sensible solution would be to bias the lock towards the looping thread.

Biasing a lock to a thread means that the thread is not required to release the lease on the lock. Thus subsequent lock acquisitions are much less expensive. The thread will only release the lease if another thread attempts to acquire the lock. The Java 6 HotSpot/JIT implementation optimizes for biased locking by default.

Lock coarsening explained

Yet another threading optimization is lock coarsening or merging. Lock coarsening occurs when adjacent synchronized blocks may be merged into one synchronized block. A variation on this theme is to combine multiple synchronized methods into one. This optimization can be applied if the same lock object is used by all methods. Consider the example shown in Figure 2.

public static String concatToBuffer(StringBuffer sb, String s1, String s2, String s3) {
sb.append(s1);
sb.append(s2);
sb.append(s3);
returnsb.toString();
}

Figure 2. String concatenation using non-local StringBuffer

In this case, the StringBuffer has non-local scope and can be accessed by several threads. Consequently escape analysis will determine that its' lock cannot be safely elided. If the lock happens to be accessed by only one thread, biased locking can be applied. Interestingly, the decision to coarsen can be made independent to the number of threads contending for the locks. In this case an instance lock is acquired four times: three times for the append method and one time for the toString method, right after another. The first action is to in-line the methods. Then we can replace all four calls to acquire the lock with a single one that surrounds the entire method body.

Then net effect is that we will end up with a longer critical section which may result in other threads stalling and reduced throughput. Because of this, locks inside a loop are not coarsened to include the looping construct.

Thread suspending versus spinning

When a thread waits for a lock to be released by another thread, it usually is suspended by the operating system. Suspending a thread requires the operating system to swap it out of the CPU often before it's time quantum has been consumed. When the thread that has the lock leaves the critical section, the suspended thread needs to be woken up. The thread will need to be re-scheduling and context switched back into the CPU. All of this activity can place extra strain on the JVM, OS and hardware.

The observation that is useful in this instance is; locks are normally held for very brief periods of time. This implies that if we wait for a bit, we may be able to get the lock without being suspended. To wait all we do is put the thread into a busy loop (a spin). This technique is known as spin locking.

Spin locks work well in cases where lock durations are very short. On the other hand, if the lock is held for a longer time, spinning is wasteful as the spinning thread consumes CPU without doing anything useful. When introduced in the JDK 1.4.2, spin locking was split into two phases, spin for (default value of) 10 iterations before being suspended.

Adaptive spinning

The JDK 1.6 introduced adaptive spinning. With adaptive spinning the duration of the spin is not fixed anymore, but determined by a policy based on previous spin attempts on the same lock and the state of the lock owner. If spinning succeeded on that same lock object in the recent history and the thread holding the lock is running, then spinning is likely to succeed again. It will then be applied for a relatively longer time, e.g. 100 iterations. On the other hand, if spinning is very unlikely to succeed, it is left out altogether thereby not wasting any CPU cycles.

StringBuffer vs. StringBuilder benchmark

The road to deciding on exactly how to measure the effects of all of these nifty optimizations wasn't so smooth. First and foremost is the question of what should the benchmark look like? To answer this question I decided to look at some of the common idioms that people use in their code all the time. The one thing that stood out was the age old question of what is the cost savings of using StringBuffer instead of String.

The familiar advice is; use StringBuffer instead of a String if you are going to be mutating it. The reason for the advice is quite clear. String is immutable and if we are doing work that requires mutations, StringBuffer is a less costly alternative. Interestingly enough the advice fails to recognize StringBuffer's new (since 1.5) unsynchronized cousin, StringBuilder. Since the only difference between StringBuilder and StringBuffer is synchronization, it seems as though a benchmark that measures the performance differences between the two should reveal the cost of synchronization. The quest starts with the first question, what is the cost of uncontended locking.

The essence of the benchmark (as seen in listing 1) is to take a couple of strings and concatenate them together. The initial capacity of the underlying buffer is large enough to hold the three strings to be joined. This allows us to minimize the amount of work in the critical section which should bias the measurement towards the cost of synchronization.

Benchmark results

Below are results for the sensible combinations of the three options of EliminateLocks, UseBiasedLocking, and DoEscapeAnalysis.

Figure 3. Benchmark results

Discussion of results

The purpose of using the unsynchronized StringBuilder was to provide a baseline measurement of performance. I also wanted to see if the optimizations would have any effect on StringBuilder performance. As can be seen, StringBuilder performance remained constant throughout exercise. Because these flags are aimed directly a optimizing the use of locks, this result is as expected. At the other end of the performance spectrum we can see that using the synchronized StringBuffer without any of the optimization runs about 3 times slower.

Moving from left to right across the results presented in figure 3, we see a decent boost in performance that can be attributed to EliminateLocks. However that boost pales in comparison to the one provided by biased locking. In fact, with the exception of column C, every run with biased locking turned on ended up providing approximately the same boost in performance. But what of column C?

In the process of treating the raw data, it was noticed that 1 run in 6 took significantly longer. The difference was large enough that it looked as though the bench was reporting on two completely different optimizations.After some thought, I decided to report the higher and lower values separately (runs B & C). Short of a deeper investigation, I can only speculate that there is more than one possible optimization that can be applied (most likely two)  and that there exists some race condition where one where biased locking wins most of the time, but not all of the time. If the other optimization wins, the application of biased locking is either prevented or delayed.

The curious result is that produced by Escape Analysis. Given the single threaded nature of this benchmark, I was fully expecting Escape Analysis to elide the lock thus rendering StringBuffer performance equivalent to that of StringBuilder. Quite clearly that didn't happen. Yet another problem; timings taken on my machine varied from run to run. To further muddy the water, I had several of my colleagues run tests on their systems. In several cases the optimizations did not speed things up all that much.

Premature conclusions

Though the results exhibited in figure 3 are less than what I hoped for, they do indicate that the optimizations do remove most of the locking overhead. However, having the test runs made by my colleagues produce different results would seem to challenge the veracity of the results. Is the benchmark measuring locking overhead or is our conclusion premature and is there something else going on? In part 2 of this article, we'll take a deeper look at this benchmark to try and answer these questions. In the process we will discover that getting results is easy. Determining if the results have answered our question asked is a much more complex task.

public class  LockTest {
private static final int MAX = 20000000; // 20 million

public static void main(String[] args) throws InterruptedException {
// warm up the method cache
for (int i = 0; i < MAX; i++) {
concatBuffer("Josh", "James", "Duke");
concatBuilder("Josh", "James", "Duke");
}

System.gc();
Thread.sleep(1000);

System.out.println("Starting test");
long start = System.currentTimeMillis();
for (int i = 0; i < MAX; i++) {
concatBuffer("Josh", "James", "Duke");
}
long bufferCost = System.currentTimeMillis() - start;
System.out.println("StringBuffer: " + bufferCost + " ms.");

System.gc();
Thread.sleep(1000);

start = System.currentTimeMillis();
for (int i = 0; i < MAX; i++) {
concatBuilder("Josh", "James", "Duke");
}
long builderCost = System.currentTimeMillis() - start;
System.out.println("StringBuilder: " + builderCost + " ms.");
System.out.println("Thread safety overhead of StringBuffer: "
+ ((bufferCost * 10000 / (builderCost * 100)) - 100) + "%\n");

}

public static String concatBuffer(String s1, String s2, String s3) {
StringBuffer sb = new StringBuffer();
sb.append(s1);
sb.append(s2);
sb.append(s3);
return sb.toString();
}

public static String concatBuilder(String s1, String s2, String s3) {
StringBuilder sb = new StringBuilder();
sb.append(s1);
sb.append(s2);
sb.append(s3);
return sb.toString();
}

}

In part II we'll take a deeper look at techniques that we used to validate that the benchmarking results were valid and we will then answer the real question.

Running the Benchmark

I ran this benchmark on my 32 bit Windows Vista laptop with an Intel Core 2 Duo using Java 1.6.0_04. Please note that all optimizations are implemented in the server vm. That is not the default vm on my platform. It is not even available from the JRE, but only from the JDK. To ensure that I was using the server vm I specified the -server option on the command line. The other options are:

  • -XX:+DoEscapeAnalysis, off by default
  • -XX:+UseBiasedLocking, on by default
  • -XX:+EliminateLocks, on by default

To run the benchmark, compile the sources and then use a command line similar to

java-server -XX:+DoEscapeAnalysis LockTest

About Jeroen Borgers

Jeroen Borgers is a senior consultant with Xebia - IT Architects. Xebia is an international IT consultancy and project organization specialized in enterprise Java and agile development. Jeroen helps customers on enterprise Java performance issues and is instructor of Java Performance Tuning courses. He has worked on various Java projects in several industries since 1996, as a developer, architect, team lead, quality officer, mentor, auditor, performance tester and tuner. He works exclusively on performance assignments since 2005.

Acknowledgments

This article could only be achieved with the help of several others. Special thanks go to:

Dr. Cliff Click, former lead architect of Sun's server VM and currently working with Azul Systems; for helping in analysis and pointing at valuable resources.

Kirk Pepperdine, Java performance authority; for helping me in the process in addition to contributing and extensive editing.

David Dagastine, Sun JVM performance team lead, for explaining and pointing me in the right direction.

Several of my Xebia colleagues for performing the benchmarks.

Resources

Java concurrency in practice, Brian Goetz et all.
Java theory and practice: Synchronization optimizations in Mustang
Did escape analysis escape from Java 6
Dave Dice's Weblog
Java SE 6 Performance White Paper

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Lock elision only in Java 7 by Christoph Kutzinski

I remember reading something that lock elision is only coming in Java 7. Java 6 is just doing the escape analysis without taking advantage of the analysis results.
(I just found this article where this is mentioned, too).


Can anyone confirm this/give more details?

Premature conclusions.... by William Louth

Have you considered running the various benchmarks within the SPECjvm2008 benchmark suite with the options above? It might save you time in determining whether your numbers are correct and meaningful in a more realistic context.

William

Re: Lock elision only in Java 7 by Ben Loud

If you look at the OpenJDK HotSpot mailing list, you can see that the Escape analysis has been heavily worked on (eg mail.openjdk.java.net/pipermail/hotspot-compile...). It looks like it has been signifiantly improved, and even cooler is that its now being used to elimiate allocations entirely. I expect things to be MUCH better in JDK7 (and probably also in a future JDK6 update)

Re: Premature conclusions.... by William Louth

By the way if you are going to construct a micro-benchmark then at least try not to measure other things such as the garbage collector and the memory allocator.

Re: Premature conclusions.... by Kirk Pepperdine

Hi William,

Thanks for the comments on the article. I think if you read the end of the article it gives a hint that there *is* something else going on. Stay tuned for part duex ;-)

Kirk

Re: Premature conclusions.... by William Louth

Thanks Kirk for responding. By the way I do not like the fact that the classes tested share a common super class especially with micro benchmarks. I prefer for code to be duplicated across test classes. Again lets have these figures for SPECjvm2008 (freely available) to see how it stacks up in the wild.

Microbenchmark by Michael Bien

I am wondering why both loops haven't been optimized away completely ;) The result is never used and both methods have no "side effects". I always store the results in arrays in my own benchmarks because they spent sometimes no time after warming up.

little bit off topic:
If escape analysis really work that good, you also get a profile of the object life cycle for free. Couldn't this be used for fancy GC optimizations? I mean, why putting the not escaping objects in the heap if you already know how long you plan to reference them... (-> stack allocation?)

great article!

Re: Premature conclusions.... by Dieter Guendisch

William is absolutely right. Putting two benchmarks into one single class is useless as you (or at least I :) never know in which "state" the jvm is after running the first micro-benchmark.
Btw. I just tried to switch the order of the micro-benchmarks: LockTest2 which just runs the StringBuilder first and the StringBuffer benchmark second with the following result: while the StringBuilder execution time was stable (LockTest and LockTest2 each took about 5secs for StringBuilder) the outcome of StringBuffer was completely different: LockTest always took about 7secs, while LockTest2 took always about 13secs for StringBuffer!
That said, it would be interesting to see your benchmarks with the "loop" over these two micro-benches being implemented in some batch-script and not the same java program :)

Anyway, very interesting article, thanks for it !!!

Re: Microbenchmark by William Louth

Hi Michael it could never effectively optimize away the loops because constructors and method calls typically result in other indirect calls being made as well as static field initialization and class loading events. The object reference might be discarded but the object construction is rarely as simple as some primitive operation even in the case of these basic Java classes. There are other (temporal) object state issues that would have to considered. In our mind the model in the seems futile and localized but there are side effects which must be represented in terms of the runtime behavior that we ignore when viewing at face value the code.

Re: Premature conclusions.... by William Louth

Hi Dieter, the change in the timing for the stringbuffer could be caused a number of things including the garabage collection of stringbuilder objects created prior to the tests execution.

I am still not sure of the point in presenting such inaccurate figures especially as someone might read this article online and not the correction coming. The article content is good but the benchmark reporting and the final tease at the end is distracting and weakens the rest of the content.

Re: Microbenchmark by Michael Bien

Thank you for your answer William,

I know I oversimplified a lot and I have no experience to know what is possible in real time optimizations and what not.

But I am pretty sure if the Hotspot compiler can't do that because of profiling vs. gain issues you could do that in the bytecode compiler. In worst case a post processor ANT task could insert a thread local Stack into the bytecode in such cases.

public static Vector3f average(Vector3f v1, Vector3f v2) {
Vector3f tmp = new Vector3f(); // candidate for stack allocation
tmp.add(v1, v2);
tmp.scale(0.5f);
return tmp;
}

You are right the main problem is to distinct value objects from objects which have side effects but I think this is possible. The first step are method local escape analysis, the next steps would be to determine if add(), scale(), static{} and object construction have side effects.

(the main goal is to reduce gc stops not to increase execution performance... but both are related)

just a thought

The mistake of the cord by Tajima Kaz

Possibly, the following cord, does one line fall out?

public static String concatToBuffer(StringBuffer sb, String s1, String s2, String s3) {

sb.append(s1);

sb.append(s2);

sb.append(s3);

return

}


I think that "return sb.toString();" is necessary.

My post mistake by Tajima Kaz

Sorry, cord -> code

Re: The mistake of the cord by Jeroen Borgers

Thanks Kaz.
Mistakes are easy to make :-)
Will be corrected.


--Jeroen.

Great article by Leandro Iriarte

Hello Jeroen,

Great article! I would like to put a piece of it in my site (with your name and link obviously), and also I would like to traslate it into spanish (and I can send a copy to you if you want).

Mi site is www.leandroiriarte.com.ar

Please reply me, I'll be waiting for your auth.

Thanks

Re: Great article by Jeroen Borgers

Hi Leandro,

Sure, no problem. Please send me a copy.
Sorry for the delay, I was on vacation.

Regards,
Jeroen.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

16 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT