Concurrency bugs come in many shapes and forms: race conditions, code reordering, field visibility issues, livelocks, deadlocks, and of course the performance-related bugs, such as contention and starvation. The one thing they all have in common is that they are damn elusive.
In this InfoQ article, I will examine two threading bugs that I discovered, show how I found them and then explain the solution (if there is one). One happened in a customer project and the other was discovered in the JVM itself. But first, a quick discussion about the Java Memory Model (JMM). Depending on the feedback from this article, I might follow up with Hunting II, Hunting III and maybe even Good Will Hunting.
Java Memory Model (JMM)
Astute readers will have heard of the Java Memory Model (JMM). It is a theoretical description of how Java is supposed to behave with memory shared between different threads. Imagine it as the rule book for what has to happen when two threads want to read and write the same memory location. All Java Virtual Machines have to be implemented according to the JMM. However, it might surprise you that we do not have any tests to verify that a JVM actually adheres to the JMM. What makes this even more difficult is that even concurrency experts cannot agree on how the rules should be interpreted. A couple of years ago, the question was asked on the concurrency-interest mailing list: "Why can constructors not be synchronized?" In other words, why won't this compile?
public class MyObject {
  private List<String> objects;

  public synchronized MyObject() {
    objects = new ArrayList<String>();
  }
}
The discussion started with someone pointing out that it was not necessary, because the object has not been constructed yet while the constructor runs. However, if the object is not safely published (see Brian Goetz's book “Java Concurrency in Practice”), then we can still get concurrency bugs. About a hundred emails later, we still had no consensus on whether, according to the JMM, this was a problem or not.
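To make "improper publication" concrete, here is a minimal sketch of how this can escape during construction; the class and field names are mine, not from that discussion:

import java.util.*;

public class UnsafePublication {
  static UnsafePublication instance; // neither final nor volatile

  private List<String> objects;

  public UnsafePublication() {
    instance = this; // "this" escapes before the constructor finishes
    objects = new ArrayList<String>();
  }
}

A second thread that polls UnsafePublication.instance can observe a non-null reference whose objects field is still null, because nothing establishes a happens-before edge between the constructing thread and the reader. Whether a hypothetical synchronized constructor could close that gap was precisely what those hundred emails could not settle.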
Testing code is hard. How do you explain to your mother that in Java, for a and b both positive, you can have c = a + b with c < a and c < b?
import org.junit.*;
import static org.junit.Assert.assertTrue;

public class CrazyMath {
  @Test
  public void testSanity() {
    int a = 1_000_000_000;
    int b = 2_000_000_000;
    int c = a + b; // overflows to -1_294_967_296
    assertTrue(c < a);
    assertTrue(c < b);
  }
}
If testing single-threaded code for correctness is hard, then the multi-threaded case is exponentially harder. This explains why there are zero test cases for checking whether a particular JVM adheres to the JMM. You might think that there are no concurrency bugs in your JVM, but you'd be wrong.
Unbreakable Hard Spin
Let's start with a bug that was with us for a very long time: it appeared in Java 1.6.0_14 and was only fixed in 1.7.0_40. I reported it in 2010, but it took another three years before the bug was repaired. I first mentioned it in my Java Specialists' Newsletter. My code is correct according to the JMM, but the JVM somehow gets itself into a state where no safepoint can be reached, and thus it spins forever in a livelock. You cannot exit normally, and even jstack refuses to give us a thread dump. You might need to run it a few times before it fails; it also fails more quickly if you press CTRL+C whilst it is running. It failed on every OpenJDK version I tried, from 1.6.0_14 up to and including 1.7.0_25, on all platforms.
import java.util.*;

/**
 * Whilst writing the Interlocker, we discovered this bug in the
 * JVM server hotspot that has been around since 1.6.0_14,
 * released 2009-05-28. The JVM might end up in an unbreakable
 * hard spin if you run this code. It has been logged as a bug.
 * It was finally solved in 1.7.0_40, released 2013-09-10, more
 * than four years later.
 *
 * Described in http://www.javaspecialists.eu/archive/Issue188.html
 *
 * @author Dr Heinz M. Kabutz
 */
public class VirtualMachineLiveLock {
  private static final boolean BROKEN = true;
  public volatile boolean evenHasNextTurn = true;

  private class InterleavedTask implements Runnable {
    private final boolean even;

    InterleavedTask(boolean even) {
      this.even = even;
    }

    public void run() {
      while (!isDone()) {
        // busy wait - tends to hang up the process.
        while (BROKEN ?
            even && !evenHasNextTurn || !even && evenHasNextTurn :
            even ^ evenHasNextTurn)
          ;
        if (isDone()) {
          return;
        }
        call();
        evenHasNextTurn = !even;
      }
    }
  }

  private final int upto = 50;
  private volatile int count;

  public boolean isDone() {
    return count >= upto;
  }

  public void call() {
    int temp = count + 1;
    sleepQuietly();
    count = temp;
    System.out.println("temp = " + temp);
  }

  private void sleepQuietly() {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private void check() throws InterruptedException {
    Thread[] threads = {
        new Thread(new InterleavedTask(true)),
        new Thread(new InterleavedTask(false)),
    };
    for (Thread thread : threads) {
      thread.start();
    }
    for (Thread thread : threads) {
      thread.join();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Timer timer = new Timer(true);
    timer.schedule(new TimerTask() {
      public void run() {
        System.out.println("VM still alive (a bit)");
      }
    }, 1000, 1000);

    VirtualMachineLiveLock vmll = new VirtualMachineLiveLock();
    vmll.check();
  }
}
Now that we've seen the code that produces the error, I would like to go through the thought process that allowed me to discover it and to make it reproducible. It will hopefully also explain why these bugs are so elusive.
It all started with a question from one of my subscribers: how can you alternately call a method from two threads? Imagine you have two threads; first thread 1 calls a method, then thread 2, then thread 1 again, and so on. One thread should never call the method twice in a row.
My first attempt was simply to satisfy the requirements of the question. For that, I needed a test that would tell me when I had concocted code that did what we wanted; it does not help me when my code is fast but incorrect. But I also wanted it to be as fast as possible, so I wrote a task that did very little, in order to measure the communication overhead of my two threads taking turns to call the method.
I then wrote several solutions to this problem, starting with the most obvious approach using semaphores. This was easy to do, and the code ended up being the most elegant: you essentially have two semaphores with 1 and 0 permits respectively, and each task acquires and releases them in opposite order, as sketched below. I then wrote further solutions based on wait-notify, Condition and Exchanger. All of these had roughly the same overhead.
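Here is a minimal sketch of that semaphore approach; the class name, loop count and printouts are my own, not taken from the original interlocker:

import java.util.concurrent.Semaphore;

public class SemaphoreInterleaver {
  private final Semaphore even = new Semaphore(1); // "even" goes first
  private final Semaphore odd = new Semaphore(0);

  private void start(final boolean isEven) {
    new Thread(new Runnable() {
      public void run() {
        Semaphore mine = isEven ? even : odd;
        Semaphore theirs = isEven ? odd : even;
        try {
          for (int i = 0; i < 10; i++) {
            mine.acquire();   // block until it is our turn
            System.out.println((isEven ? "even " : "odd ") + i);
            theirs.release(); // hand the turn to the other thread
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }).start();
  }

  public static void main(String[] args) {
    SemaphoreInterleaver si = new SemaphoreInterleaver();
    si.start(true);
    si.start(false);
  }
}

Because each thread acquires its own semaphore before calling the task and releases the other's afterwards, the two threads alternate strictly, no matter how the scheduler interleaves them.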
The last solution I wrote used a lock-free spin loop: each thread simply spins on a shared volatile field until it is its turn to execute the task. This solution had the least overhead when the tasks took very little time. Obviously, if the task took too long, the additional CPU cost of spinning would make the lock-free version slower. My test looked like this:
public volatile boolean evenHasNextTurn = true;
// ...
while (!isDone()) {
  // busy wait - spin until it is our turn
  while (even && !evenHasNextTurn || !even && evenHasNextTurn)
    ;
  // ...
}
I ran it and it was blazingly fast. However, one time it hung up so badly that I had to kill the process with "kill -9". Now, it would have been very easy to simply refactor the code and move on; that's what most programmers would do. For example, the condition could have been written as the logically equivalent even ^ evenHasNextTurn. Instead, I spent a few hours trying to figure out how to reproduce the bug. When you're battling deadlines, it is tempting to move code around a bit until the error disappears and then move on to the next deliverable. Unfortunately, these types of bugs tend to reappear over time, especially when the system is under stress, such as when you have a promotion running on your website. At the worst possible moment, your concurrency bug will be back.
The bug was in the JVM, not in my code. I had been waiting since 2010 to publish it, because a malicious coder could insert this into his code and jam up your application server. Since you cannot connect JConsole, jstack or JVisualVM to the stuck process, nor generate a stack trace with CTRL+Break or CTRL+\, it can be quite tricky to discover where the hang is coming from.
As Java programmers, we often assume that all bugs are in our own code. But the JVM was also written by people, and we all make mistakes. The only reason it contains fewer bugs is that more people use the JVM than use your code, so its bugs tend to get rooted out more quickly.
OutOfMemoryError - Serialized HashMap
One of the more interesting bugs we discovered recently was a race condition in the Java 7 HashMap that could happen when the same key was inserted by multiple threads at the same time. HashMap is not thread-safe; we all know that. The JavaDoc states: "Note that this implementation is not synchronized." Before I show code that demonstrates the issue, I'd like to take you through the process we followed to discover this bug in some production code.
The system we were looking at used a lot of memory, so the maximum heap size was set to about 30 GB. However, it would sometimes cause an OutOfMemoryError even when it was using only 10 GB.
For some reason, we were not given a stack trace to help us diagnose where exactly the error was coming from. So we had run out of memory before the heap was even full. Rather strange, isn't it?
Since we did not have a clue where this could be coming from, we enabled -XX:+HeapDumpOnOutOfMemoryError. This generated a huge file of about 10 GB, and a file that big is difficult to view in a heap profiler. With a bit of digging, we discovered the offset of our humongous (no relation to G1GC) object inside the dump. Our next step was to extract it from the file using normal Java I/O (sketched below). When we opened the extracted bytes in a hex editor, we could see that some of the bytes kept repeating themselves over and over. After the first fifty or so bytes, we saw something like this:
q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~
q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~
q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~q~
q~q~q~q~q~
This led me to think that we had a cycle in a data structure and that serialization was somehow involved. In the Java serialization protocol, a back-reference to an object already written to the stream is encoded as TC_REFERENCE (the byte 0x71, ASCII 'q') followed by a handle counted up from 0x7E0000, whose high byte prints as '~'; with the unprintable bytes in between not showing up, an endless run of q~ is exactly what repeated back-references look like.
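The extraction step mentioned above needs nothing more than a seek and a copy. Here is a minimal sketch; the file name, offset and length are made up (ours came from searching the dump):

import java.io.*;

public class ExtractBytes {
  public static void main(String[] args) throws IOException {
    long offset = 0x12345678L; // hypothetical: where the object starts
    int length = 1024 * 1024;  // hypothetical: how much we want to see
    RandomAccessFile dump = new RandomAccessFile("java_pid1234.hprof", "r");
    FileOutputStream out = new FileOutputStream("extracted.bin");
    try {
      dump.seek(offset);
      byte[] buffer = new byte[8192];
      int remaining = length;
      int read;
      while (remaining > 0 && (read =
          dump.read(buffer, 0, Math.min(buffer.length, remaining))) != -1) {
        out.write(buffer, 0, read);
        remaining -= read;
      }
    } finally {
      dump.close();
      out.close();
    }
  }
}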
To confirm that this really was a serialized Java object, we constructed a file with the same first bytes as this large object and tried to read it with an ObjectInputStream. Unfortunately it failed, as we did not have all of the classes on our machine that our customer had in production, but it did start reading the file as an object stream, which meant that the header bytes matched. All we had was the heap dump; the source code and the classes were not available to us. Often when we have to analyze bugs in production systems, we have to rely on what the customer is willing to show us, because we usually do not get physical access to their systems.
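Checking the header is as simple as handing the bytes to an ObjectInputStream, whose constructor reads and validates the stream magic (0xACED) before anything else. A sketch, assuming the extracted.bin file from above:

import java.io.*;

public class CheckHeader {
  public static void main(String[] args) throws IOException {
    FileInputStream fis = new FileInputStream("extracted.bin");
    try {
      // the constructor throws StreamCorruptedException if the
      // magic and version bytes do not match
      ObjectInputStream in = new ObjectInputStream(fis);
      System.out.println("Header OK - looks like Java serialization");
      Object obj = in.readObject(); // fails without the customer's classes
      System.out.println("Read a " + obj.getClass().getName());
    } catch (StreamCorruptedException e) {
      System.out.println("Not a serialized object stream: " + e);
    } catch (ClassNotFoundException e) {
      System.out.println("Serialized data, but classes are missing: " + e);
    } finally {
      fis.close();
    }
  }
}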
Having confirmed that this was indeed a serialized Java object, we used the unix utility "strings" to extract all the readable text from it. The text that jumped out at us was java.util.HashMap, yet the map did not seem to contain much at all, except for that endless chain of references. This pointed to a java.util.HashMap being at the core of our woes. We suspected that someone was serializing a corrupted HashMap to a ByteArrayOutputStream and that this was eventually running out of space. We had not run out of memory entirely; after all, we still had 20 GB free on our heap, but the single object we were trying to construct was too large. Java arrays are limited to slightly less than Integer.MAX_VALUE elements. Because who would ever want to construct an array with more than two billion values? (And who would ever need more than 640 KB of RAM?)
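If you do not have the unix "strings" utility handy, a few lines of Java do the same job: print every run of four or more printable ASCII characters (four being the utility's default minimum run length). A sketch of my own, not part of the original analysis:

import java.io.*;

public class Strings {
  public static void main(String[] args) throws IOException {
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    try {
      StringBuilder run = new StringBuilder();
      int b;
      while ((b = in.read()) != -1) {
        if (b >= 0x20 && b <= 0x7E) { // printable ASCII
          run.append((char) b);
        } else {
          if (run.length() >= 4) System.out.println(run);
          run.setLength(0);
        }
      }
      if (run.length() >= 4) System.out.println(run);
    } finally {
      in.close();
    }
  }
}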
Our next step was to try to break a HashMap in the lab, in order to prove that this was most likely the root of our problem. In this experiment, we put the same keys (derived from System.currentTimeMillis()) into an ordinary HashMap from four threads at the same time. Since our processors are fast, the threads will often try to add their entries within the same millisecond. If we do this often enough, we should see corruption of the HashMap pretty quickly. To test whether we have created an infinite loop inside the data structure, we check whether iteration visits twice as many entries as the map claims to contain; if it does, we are almost certainly visiting entries repeatedly and have created a cycle inside the map. We then try to serialize the map to a ByteArrayOutputStream and see what error we get:
import java.io.*;
import java.util.*;

/**
 * @author kirk, heinz
 */
public class ThreadBug implements Serializable, Runnable {
  private HashMap<String, String> map = new HashMap<>();
  private volatile boolean infiniteLoop = false;

  public void run() {
    while (!infiniteLoop) {
      putField(Long.toString(System.currentTimeMillis()));
    }
  }

  public void putField(String fieldName) {
    map.put(fieldName, "dummy");
    testForInfiniteLoop();
  }

  private void testForInfiniteLoop() {
    try {
      Iterator<?> it = map.entrySet().iterator();
      int current = 0;
      while (current < map.size() * 2 && it.hasNext()) {
        it.next();
        current++;
      }
      if (it.hasNext()) {
        System.out.println("Detected infinite loop in iteration");
        infiniteLoop = true;
      }
    } catch (ConcurrentModificationException ex) {
      // ignore
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ThreadBug tb = new ThreadBug();
    Thread[] threads = new Thread[4];
    for (int i = 0; i < 4; i++) {
      threads[i] = new Thread(tb, "threadbug-" + i);
      threads[i].start();
    }
    for (int i = 0; i < 4; i++) threads[i].join();
    System.out.println("Threads all exited");
    System.out.println("Serializing");
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {
      ObjectOutputStream oos = new ObjectOutputStream(baos);
      oos.writeObject(tb);
      System.out.println("Serialization complete, " +
          "byte[] size = " + baos.size());
    } catch (Throwable ex) {
      ex.printStackTrace();
      System.err.printf("baos.size() = %,d%n", baos.size());
    }
    System.out.println("ThreadBug done");
  }
}
Please take a moment to run this code on your machine and confirm that it happens in Java 7, and spend a few minutes going over the code to understand what it does; your efforts will be rewarded with a better understanding of what is going on. Interestingly, this particular failure does not seem to happen with HashMap in Java 8. In Java 7, the usual culprit is a concurrent resize: the transfer() method reverses the order of the entries in each bucket as it moves them, and two threads resizing at the same time can link entries into a cycle. Java 8's rewritten HashMap preserves entry order during resize, but it is still not thread-safe and should never be shared without correct synchronization; we might simply have to write the test differently to see an effect in Java 8. On replacing the java.util.HashMap with a java.util.concurrent.ConcurrentHashMap, the problem was quickly eliminated, as sketched below.
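The fix in ThreadBug amounts to changing the field declaration: ConcurrentHashMap supports fully concurrent puts, and its iterators are weakly consistent rather than fail-fast. A sketch of the changed parts only (class name mine):

import java.util.*;
import java.util.concurrent.*;

public class ThreadBugFixed {
  // the only change from ThreadBug: a map that tolerates concurrent puts
  private final Map<String, String> map =
      new ConcurrentHashMap<String, String>();

  public void putField(String fieldName) {
    map.put(fieldName, "dummy"); // safe from any number of threads
  }
}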
Conclusion
Over the last decade and a half, I have seen a lot of concurrency bugs, some in the JVM and others caused by our own coding errors. I hope you enjoyed this article and that we will be able to explore more of them in future.
About the Author
Dr Heinz Kabutz is best known as the creator of The Java Specialists' Newsletter, read in 135 countries by 75,000 Java experts. In his newsletter, he examines advanced aspects of Java that have helped Java developers around the world to produce better code. Heinz writes Java code on contract, does Java performance consulting, and runs advanced core Java seminars.