Creating Highly-Scalable Components in Java
With multi-core processors becoming main-stream, there is significant pressure on application developers to write correct highly-scalable applications to take advantage of the underlying hardware. Further, legacy applications have to be ported to run on the new architectures. An efficient way to ensure scalability of applications is to use highly-scalable components to build the applications. For example, in various applications, java.util.concurrent.ConcurrentHashMap can replace a synchronized HashTable and make the application more scalable. Thus, providing a set of highly scalable building blocks that can be directly introduced into applications to introduce parallelism will be very useful.
We have created a set of highly-scalable concurrent Java components as part of the Amino Library Project. In this article we will describe some of the ideas used in creating this open-source library.
As software engineers, we have to parallelize our applications in order to make them scale well on multi-core machines. One way to parallelize applications is to break them up into subtasks, where each subtask while executing communicate and synchronize with other subtasks of the application. This is often referred to as task-based parallelism. We can categorize applications based on the communication patterns of their subtasks.
- Embarrassingly Parallel – These are applications where subtasks never or very rarely communicate with each other. Mostly, each task only manipulates its private data. Some examples of these types of problems are:
- Monte Carlo simulation program
- Quick sort function, it's easy to use fork/join pattern to parallelize it. It is easy to get good scalability when we are dealing with embarrassingly parallel applications.
- Coarse-Grained and Fine-Grained – These are applications where the subtasks need to communicate with each other. Based on the frequency of communication between subtasks, applications are said to exhibit coarse-grained or fine-grained parallelism. An example of this type of computation is the producer/consumer problem. Producer generates the data that drives the consumer. The sending of the data from the producer to the consumer entails communication. Compared to embarrassingly parallel applications, it is much more difficult to get good scalability when dealing with applications which require frequent communications between subtasks.
Components optimized for scalability can be very helpful to solve these difficult problems. For example, if there is a highly scalable queue component, it becomes relatively easy to make producer/consumer scale well. In this article, we offer several general purpose techniques to make software components scale better.
Profile for Scalability
Application profiling is an important aspect of the development process. Profiling is done to understand the execution characteristics of the applications. Profile data is often used to improve the performance and scalability of applications. Most profilers follow a simple execution pattern. There is an instrumentation phase. During this phase, depending on type of information required, different levels of the computation stack are instrumented. Hardware counters can be used to monitor hardware events. Operating system (OS) can be monitored by instrumenting various OS events. If there is a virtual machine (VM) as part of the computation stack, it can be instrumented to get access to events in the VM. Finally the application can be instrumented to get events from the actual application. A good profiler will instrument everything on-the-fly without requiring the application to be re-compiled. Most profilers typically instrument the VM and the application layer. During execution of the instrumented application, a trace is generated. The trace captures various interesting information like function execution time, resource utilization, hardware performance counters, lock usage information, system time, user time etc. Depending on the type of the profiler, computation can be done using the trace information, before it is written out to the file-system. Finally, there is a display component which is used to visualize the trace data.
Profilers targeting concurrent applications need to provide detailed information about threads, lock contentions and synchronization points. Some of the important profilers that we have used as part of the Amino performance analysis work include:
- Java Lock Monitor(JLM) – This tool is used by Java developers to understand the usage of locks in their applications. During runtime, the instrumentation component gathers various lock statistics and this information is used to figure out lock contentions. Tables containing detailed information about locks and contentions can be generated by the tool. IBM Java Lock Analyzer (JLA) is another tool that can be used to provide lock information. It provides information similar to JLA, except it has a graphical user interface to display the lock and contention data. Both tools are supported on x86 and Power.
- THOR – This tool can be used to do detailed performance analysis of concurrent Java applications. It does an in-depth analysis of the complete execution stack, starting from the hardware to the application layer. Information is gathered from all four layers of the stack – hardware, operating system, jvm and application. The user is provided information about lock contentions and where in the code the contention occurred, bottlenecks, thread migrations between cores and performance degradation due to contentions. This tool is supported on x86 and Power.
- AIX Performance Tools – IBM’s AIX operating system comes with a set of low-level performance profiling and debugging tools. Some of the important ones include:
- Simple Performance Lock Analysis Tool (SPLAT) – This is a lock analysis tool available with the AIX operating system. The tool can be run from the command line and it can analyze various kernel-level locks. It can also analyze and display contentions of user-level locks (read/write and mutexes). The application execution has to be done with the trace option enabled. The trace data is used by the SPLAT tool.
- XProfiler – This is an X-Windows based profiling tool used to do function level profiling for C applications. The viewer can be used to display the compute intensive functions in the user code. To use the XProfiler the code has to be compiled with special flags (-pg option).
- prof, tprof and gprof – The various prof commands can be used to profile the user application. The prof command can be used to get a flat profile of all the external symbols and routines called in an application. The gprof is a superset of prof and can be used to get a call-graph profile of the application. Finally, tprof is the command used to get both macro and micro profiling information about an application. tprof can be used to get timing data about individual instructions, subroutines, threads and processes. It can be used to get processor usage of user mode routines, library routines, Java class, Java methods, kernel routines, kernel extension, etc.
Traditional performance tuning methods. i.e. those used for sequential applications, are applicable to parallel applications. After we've done what can be done using traditional approach, it may be useful to check how shared data are accessed by multiple threads in the process. By using tools like JLM or JLA, we can find how threads contended with each other during access of a shared resource..
For example, if an application has 100 threads and all of them need to put/get elements from/to a java.util.HashTable, performance of this application won't be good due to thread contention. Each thread will need to wait for a long time before it can access the HashTable.
If we use a tool like JLM with the above example, it will show that the monitor associated with the hash table object is hot. One way to get rid of this hotspot would be to replace the hashtable with a highly scalable component with the same basic functionality. In this particular case, we can easily replace the hash table by a java.util.concurrent.ConcurrentHashMap. ConcurrentHashMap divides its internal data structure into multiple segments. Replacing the HashTable by a ConcurrentHashMap in the application, allows threads to contend for multiple sub-components rather than a single large component. There is a good chance contention rate in the application with the ConcurrentHashMap will be much lower than before.
There is another way to reduce the contention rate for accessing shared components. If a component has operations with reverse semantics like the pushes and pops on a stack, the two operations can succeed without touching the central data structure. This technique was introduced by Danny Hendler, Nir Shavit and Lena Yerushalmi and has been implemented for certain datastructures in the Amino libray. The performance improvement of this method in a stack is shown in Figure 1.
Figure 1. Performance comparison: EBStack and TreiberStack
Using Lock-Free/Wait-Free Algorithm
Traditionally, a lock-based approach is pervasively used to ensure consistency of shared data, and exclusive access to critical section. Locks are relatively easy to understand but lock based algorithms introduce a large set of challenges. Some of the well known problems introduced by locks are dead-locks, live-locks, priority inversion, lock-contentions etc. Lock-contention tends to reduce scalability of components and algorithms.
Lock-Free and wait-free algorithms have a history of more than two decades now. They have been thought of as an approach that solves most of the problems associated with locks. These kinds of algorithms allow concurrent updates of shared data structures without using any locking mechanisms. It not only solves some of the basic problems associated with using locks in the code but it helps create algorithms that show good scalability. At the beginning, these lock-free and wait-free algorithms were of pure theoretical interest. But with progress both in the algorithms community and development in newer hardware support, lock-free techniques have gotten more and more usage in real products, such as OS kernels, VMs, thread libraries, etc.
Since version 1.5, there are lock-free algorithms implemented in JDK such as ConcurrentLinkedQueue, AtomicInteger, etc. They often scale better than corresponding lock-based components. When we began implementing new components as part of the Amino library, we chose to base it on the latest non-blocking algorithms from the research community. This made the Amino datastructures highly scalable and efficient. Sometimes, especially at low core counts, lock-free datastructures can have worse throughput than lock-based datastructures but in general they have better throughput characteristics.
Reduce CAS operations Whenever Possible
Since JDK 1.5, there is a highly efficient lock-free FIFO queue in java.util.concurrent package developed by Doug Lea. The algorithm for the queue is based on concurrent manipulation of a singly-linked list and was originally developed by Michael and Scott. Its dequeue operation requires one compare-and-swap (CAS) while the enqueue operation requires two successful CASs in order to complete. This may seem simple enough for enqueue and dequeue operation. But profiling (see Fig.2), shows that CASs take up a big part of the execution time and the enqueue requirement of two successive successful CASs increases the chances of failure. On modern multiprocessors, even a successful CAS operations cost an order-of-magnitude longer to complete than a load or a store, since they require exclusive ownership and ﬂushing of the processor’s write buffers.
Figure 2. Execution time of CAS operation
From Figure 2, the percentage of execution time of CAS operation is 46.08%, nearly half of the whole execution time. There is one-instruction delay of profile data. The actual time is accumulated on the following instruction "SETE AL".
Mozes and Shavit (MoS-queue) in their lock-free queue algorithm, provided a novel way to reduce the number of CAS operations in enqueue. The key idea is replacing the singly-linked list, whose pointers are inserted using a costly compare-and-swap (CAS) operation, by an "optimistic" doubly-linked list whose pointers are updated using a simple store, yet can be fixed if a bad ordering of events occur which causes the doubly linked list to become inconsistent.
We have done a benchmark to compare performance of ConcurrentLinkedQueue from JSR166y and Mos-Queue. The result is presented in Figure 3. As you can see for large number of threads, the performance of Mos-Queue is better than the JDK ConcurrentLinkedQueue.
Figure 3. Performance comparison: ConcurrentLinkedQueue and Mos-Queue
A more detailed explanation of Mos-Queue can be found in Mozes and Shavit's paper "An Optimistic Approach to Lock-Free FIFO Queues".
Reduce Memory Allocation
The Java Virtual Machine (JVM) has a very powerful and efficient memory management system. The garbage collector (GC) used in the JVM can compact live objects, hence there are exists no holes in the heap after GC. Since free space will be continuous after GC, memory allocation is no more complex than increasing a pointer.
This claim is also true for multi-threaded applications, if memory consumption is not intensive. The JVM assigns a thread-local buffer to each thread. During allocation of memory, the thread-local buffer is used first. . Global heap is not touched until the thread-local buffer is exhausted.
The feature of allocating memory in a thread local way is very helpful for application performance, but it doesn't work if allocation is too frequent. According to our experience, thread-local buffer will be exhausted quickly if frequency of allocation is high.
ThreadLocal class may be helpful if temporary object is needed in a loop. If we store the temporary object inside a ThreadLocal object, we can reuse it in each iteration of loop. Although the ThreadLocal class has overhead associated with it, most of the time it is better than doing frequent allocation inside the loop.
Figure 4. Performance comparison: ThreadLocal and Allocation
In this article, we have given introduction to several important principles for creating highly-scalable components in Java. In general, these principles are often helpful, but they can't replace careful testing and performance tuning.
- JLA, IBM Lock Analyzer for Java
- "CAS-Based Lock-Free Algorithm for Shared Deques" by Maged. It is a highly efficient algorithm of lock-free deque.
- "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms" by Michael and Scott. It is a highly efficient algorithm of lock-free queue.
- "An Optimistic Approach to Lock-Free FIFO Queues" by Mozes and Shavit. It is another highly efficient algorithm of lock-free queue.
- In the Website for Amino project, get the sources of Amino project website.
- Browse the technology bookstore for books on these and other technical topics.
- A Scalable Lock-Free Stack Algorithm, Danny Hendler, Nir Shavit and Lena Yerushalmi, SPAA 2004, pages 206-215.
About the authors
Zhi Gan is a software engineer at IBM's China Development Lab, joined IBM after receiving a Ph.D. in computer security from Shanghai JiaoTong University. Dr. Gan has extensive experience in SOA (service-oriented architecture), AOP (aspect-oriented programming), and Eclipse. His current focus is mainly on parallel software development in Java/C++.
Raja Das is a software architect in IBM Software Group. Currently, he is developing libraries and frameworks for multicore/manycore systems. Previously, he was the WebSphere® Partner Gateway product architect. Dr. Das's interests include programming languages, parallel software, and systems.
Xiao Jun Dai is a software engineer at IBM's China Development Lab, joined IBM after receiving an M.D. in computer science from Institute of Software, Chinese Academy of Sciences. He has extensive experience in agile methodology and programming languages. Mr. Dai's current focus is mainly on concurrent programming and multi-core platform.
Frequent Memory Allocation
Multiverse Software Transactional Memory
Re: Frequent Memory Allocation
If you really use malloc/new in C/C++ and you do not preallocate memory, performance in C/C++ will suffer even more than it does in Java.
Today's JVM implementations can allocate memory much quicker than a typical malloc implementation but as soon as you allocate quickly in parallel in a lot of threads performance can go down, because the new space will be overflowed frequently, leading to more fully GC's because objects get promoted to old space.
Figure 4. Performance comparison: ThreadLocal and Allocation
ThreadLocal and Allocation
THOR is now a part of Multi-core SDK, which can be downloaded at www.alphaworks.ibm.com/tech/msdk
Re: ThreadLocal and Allocation
Thanks for the good question! Let me explain it a little more.
It's often that a function needs to create some temporary variable. If such function is frequently executed, memory system will have big pressure.
A work around here is to keep a static variable for storing value of the original temporary variable. But under parallel execution, we need a thread-local version of "static variable" for preventing data race. In the past, thread-local is notoriously slow. But according to our test, modern JDK improved performance of ThreadLocal class a lot. So it become an useful option now.
Re: Frequent Memory Allocation
On AIX platform, it's possible to get better scalability by using an env variable:
The downside of above approach is the space efficiency. That's the reason why it hasn't be used as default allocator.
Could you please publish your test programs?
Re: Could you please publish your test programs?
Here is the project website. You can download all source code.
Craig Motlin Sep 01, 2014