10 tips on how to prevent business value risk
One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.
The content has been bookmarked!
There was an error bookmarking this content! Please retry.

Posted by Zhi Gan, Raja Das, Xiao Jun Dai on Aug 27, 2009
With multi-core processors becoming main-stream, there is significant pressure on application developers to write correct highly-scalable applications to take advantage of the underlying hardware. Further, legacy applications have to be ported to run on the new architectures. An efficient way to ensure scalability of applications is to use highly-scalable components to build the applications. For example, in various applications, java.util.concurrent.ConcurrentHashMap can replace a synchronized HashTable and make the application more scalable. Thus, providing a set of highly scalable building blocks that can be directly introduced into applications to introduce parallelism will be very useful.
We have created a set of highly-scalable concurrent Java components as part of the Amino Library Project. In this article we will describe some of the ideas used in creating this open-source library.
As software engineers, we have to parallelize our applications in order to make them scale well on multi-core machines. One way to parallelize applications is to break them up into subtasks, where each subtask while executing communicate and synchronize with other subtasks of the application. This is often referred to as task-based parallelism. We can categorize applications based on the communication patterns of their subtasks.
Components optimized for scalability can be very helpful to solve these difficult problems. For example, if there is a highly scalable queue component, it becomes relatively easy to make producer/consumer scale well. In this article, we offer several general purpose techniques to make software components scale better.
Application profiling is an important aspect of the development process. Profiling is done to understand the execution characteristics of the applications. Profile data is often used to improve the performance and scalability of applications. Most profilers follow a simple execution pattern. There is an instrumentation phase. During this phase, depending on type of information required, different levels of the computation stack are instrumented. Hardware counters can be used to monitor hardware events. Operating system (OS) can be monitored by instrumenting various OS events. If there is a virtual machine (VM) as part of the computation stack, it can be instrumented to get access to events in the VM. Finally the application can be instrumented to get events from the actual application. A good profiler will instrument everything on-the-fly without requiring the application to be re-compiled. Most profilers typically instrument the VM and the application layer. During execution of the instrumented application, a trace is generated. The trace captures various interesting information like function execution time, resource utilization, hardware performance counters, lock usage information, system time, user time etc. Depending on the type of the profiler, computation can be done using the trace information, before it is written out to the file-system. Finally, there is a display component which is used to visualize the trace data.
Profilers targeting concurrent applications need to provide detailed information about threads, lock contentions and synchronization points. Some of the important profilers that we have used as part of the Amino performance analysis work include:
Traditional performance tuning methods. i.e. those used for sequential applications, are applicable to parallel applications. After we've done what can be done using traditional approach, it may be useful to check how shared data are accessed by multiple threads in the process. By using tools like JLM or JLA, we can find how threads contended with each other during access of a shared resource..
For example, if an application has 100 threads and all of them need to put/get elements from/to a java.util.HashTable, performance of this application won't be good due to thread contention. Each thread will need to wait for a long time before it can access the HashTable.
If we use a tool like JLM with the above example, it will show that the monitor associated with the hash table object is hot. One way to get rid of this hotspot would be to replace the hashtable with a highly scalable component with the same basic functionality. In this particular case, we can easily replace the hash table by a java.util.concurrent.ConcurrentHashMap. ConcurrentHashMap divides its internal data structure into multiple segments. Replacing the HashTable by a ConcurrentHashMap in the application, allows threads to contend for multiple sub-components rather than a single large component. There is a good chance contention rate in the application with the ConcurrentHashMap will be much lower than before.
There is another way to reduce the contention rate for accessing shared components. If a component has operations with reverse semantics like the pushes and pops on a stack, the two operations can succeed without touching the central data structure. This technique was introduced by Danny Hendler, Nir Shavit and Lena Yerushalmi and has been implemented for certain datastructures in the Amino libray. The performance improvement of this method in a stack is shown in Figure 1.
Figure 1. Performance comparison: EBStack and TreiberStack

Traditionally, a lock-based approach is pervasively used to ensure consistency of shared data, and exclusive access to critical section. Locks are relatively easy to understand but lock based algorithms introduce a large set of challenges. Some of the well known problems introduced by locks are dead-locks, live-locks, priority inversion, lock-contentions etc. Lock-contention tends to reduce scalability of components and algorithms.
Lock-Free and wait-free algorithms have a history of more than two decades now. They have been thought of as an approach that solves most of the problems associated with locks. These kinds of algorithms allow concurrent updates of shared data structures without using any locking mechanisms. It not only solves some of the basic problems associated with using locks in the code but it helps create algorithms that show good scalability. At the beginning, these lock-free and wait-free algorithms were of pure theoretical interest. But with progress both in the algorithms community and development in newer hardware support, lock-free techniques have gotten more and more usage in real products, such as OS kernels, VMs, thread libraries, etc.
Since version 1.5, there are lock-free algorithms implemented in JDK such as ConcurrentLinkedQueue, AtomicInteger, etc. They often scale better than corresponding lock-based components. When we began implementing new components as part of the Amino library, we chose to base it on the latest non-blocking algorithms from the research community. This made the Amino datastructures highly scalable and efficient. Sometimes, especially at low core counts, lock-free datastructures can have worse throughput than lock-based datastructures but in general they have better throughput characteristics.
Since JDK 1.5, there is a highly efficient lock-free FIFO queue in java.util.concurrent package developed by Doug Lea. The algorithm for the queue is based on concurrent manipulation of a singly-linked list and was originally developed by Michael and Scott. Its dequeue operation requires one compare-and-swap (CAS) while the enqueue operation requires two successful CASs in order to complete. This may seem simple enough for enqueue and dequeue operation. But profiling (see Fig.2), shows that CASs take up a big part of the execution time and the enqueue requirement of two successive successful CASs increases the chances of failure. On modern multiprocessors, even a successful CAS operations cost an order-of-magnitude longer to complete than a load or a store, since they require exclusive ownership and flushing of the processor’s write buffers.
Figure 2. Execution time of CAS operation

From Figure 2, the percentage of execution time of CAS operation is 46.08%, nearly half of the whole execution time. There is one-instruction delay of profile data. The actual time is accumulated on the following instruction "SETE AL".
Mozes and Shavit (MoS-queue) in their lock-free queue algorithm, provided a novel way to reduce the number of CAS operations in enqueue. The key idea is replacing the singly-linked list, whose pointers are inserted using a costly compare-and-swap (CAS) operation, by an "optimistic" doubly-linked list whose pointers are updated using a simple store, yet can be fixed if a bad ordering of events occur which causes the doubly linked list to become inconsistent.
We have done a benchmark to compare performance of ConcurrentLinkedQueue from JSR166y and Mos-Queue. The result is presented in Figure 3. As you can see for large number of threads, the performance of Mos-Queue is better than the JDK ConcurrentLinkedQueue.
Figure 3. Performance comparison: ConcurrentLinkedQueue and Mos-Queue

A more detailed explanation of Mos-Queue can be found in Mozes and Shavit's paper "An Optimistic Approach to Lock-Free FIFO Queues".
The Java Virtual Machine (JVM) has a very powerful and efficient memory management system. The garbage collector (GC) used in the JVM can compact live objects, hence there are exists no holes in the heap after GC. Since free space will be continuous after GC, memory allocation is no more complex than increasing a pointer.
This claim is also true for multi-threaded applications, if memory consumption is not intensive. The JVM assigns a thread-local buffer to each thread. During allocation of memory, the thread-local buffer is used first. . Global heap is not touched until the thread-local buffer is exhausted.
The feature of allocating memory in a thread local way is very helpful for application performance, but it doesn't work if allocation is too frequent. According to our experience, thread-local buffer will be exhausted quickly if frequency of allocation is high.
ThreadLocal class may be helpful if temporary object is needed in a loop. If we store the temporary object inside a ThreadLocal object, we can reuse it in each iteration of loop. Although the ThreadLocal class has overhead associated with it, most of the time it is better than doing frequent allocation inside the loop.
Figure 4. Performance comparison: ThreadLocal and Allocation

In this article, we have given introduction to several important principles for creating highly-scalable components in Java. In general, these principles are often helpful, but they can't replace careful testing and performance tuning.
Zhi Gan is a software engineer at IBM's China Development Lab, joined IBM after receiving a Ph.D. in computer security from Shanghai JiaoTong University. Dr. Gan has extensive experience in SOA (service-oriented architecture), AOP (aspect-oriented programming), and Eclipse. His current focus is mainly on parallel software development in Java/C++.
Raja Das is a software architect in IBM Software Group. Currently, he is developing libraries and frameworks for multicore/manycore systems. Previously, he was the WebSphere® Partner Gateway product architect. Dr. Das's interests include programming languages, parallel software, and systems.
Xiao Jun Dai is a software engineer at IBM's China Development Lab, joined IBM after receiving an M.D. in computer science from Institute of Software, Chinese Academy of Sciences. He has extensive experience in agile methodology and programming languages. Mr. Dai's current focus is mainly on concurrent programming and multi-core platform.
I have experienced the same problem with frequent memory allocations and reduced performance. Is this a problem related to Java/gc or do other languages like c/c++ also suffer from the same problem?
Peter Veentjer
Multiverse Software Transactional Memory
Hi Peter,
If you really use malloc/new in C/C++ and you do not preallocate memory, performance in C/C++ will suffer even more than it does in Java.
Today's JVM implementations can allocate memory much quicker than a typical malloc implementation but as soon as you allocate quickly in parallel in a lot of threads performance can go down, because the new space will be overflowed frequently, leading to more fully GC's because objects get promoted to old space.
Regards,
Markus (kohlerm.blogspot.com/)
Any references to THOR?
What exactly was your test in Figure 4? It seems pretty vague to me.
Ashwin (ashwinjayaprakash.com)
I don't understand the "ThreadLocal vs. Allocation". Based on the chart on Figure 4, it seems like a huge performance gain. Can you give an example of how you would use ThreadLocal class instead of allocation on the heap inside a loop?
Markus,
THOR is now a part of Multi-core SDK, which can be downloaded at www.alphaworks.ibm.com/tech/msdk
Sergejs,
Thanks for the good question! Let me explain it a little more.
It's often that a function needs to create some temporary variable. If such function is frequently executed, memory system will have big pressure.
A work around here is to keep a static variable for storing value of the original temporary variable. But under parallel execution, we need a thread-local version of "static variable" for preventing data race. In the past, thread-local is notoriously slow. But according to our test, modern JDK improved performance of ThreadLocal class a lot. So it become an useful option now.
Yes, scalability of memory allocation in C/C++ is often not good.
On AIX platform, it's possible to get better scalability by using an env variable:
computing.llnl.gov/mpi/aix_env_vars.html#MALLOC...
The downside of above approach is the space efficiency. That's the reason why it hasn't be used as default allocator.
I would be interested in repeating your tests. Could you please tell me where I can find your test programs?
amino-cbbs.sourceforge.net/
Here is the project website. You can download all source code.
One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.
InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.
Alex Papadimoulis discusses ugly code, where it comes from, how to avoid it, and how to get rid of it.
John Davies examines Visa’s architecture and shows how enterprises have architected complex integrations incorporating Hadoop, memcached, Ruby on Rails, and others to deliver innovative solutions.
Sean Comerford unveils ESPN.com’s architecture, what components are used and why, and the current changes the website goes through.
Are there repeated patterns of failure on Enterprise Agile Enablement efforts? Sanjiv and Arlen discuss Seven Deadly Sins to avoid when adopting Agile in an enterprise.
Erik Dörnenburg answers: What is Enterprise and Evolutionary Architecture?, discussing 4 issues: Turning strategy into execution, Ensuring conformance, Where do the architects sit? Buying or building?
Sean Cribbs explains what Map-Reduce and Riak are, why and how to use Map-Reduce with Riak, and how to convert SQL queries into their Map-Reduce equivalents.
10 comments
Watch Thread Reply