InfoQ

News

JavaOne: Cliff Click on a Scalable Non-Blocking Coding Style

Posted by Charles Humble on May 27, 2008 01:28 PM

Community
Java
Topics
Performance & Scalability ,
Programming
Tags
Parallel Programming ,
JavaOne 2008
In 1967 Gene Amdahl pointed out what appeared to be a fundamental limit in how fast you can make concurrent code. Only a portion of a program's processing can be run entirely in parallel, and only that portion can scale directly on machines having more and more processor cores. The rest of the program's work is sequential. The main problem that Amdahl’s law highlights is locking contention, a problem exacerbated as the number of processor cores increase. Most large CPU shared memory hardware systems support very fast concurrent reads but are speed limited to "1-cache-miss-per-write" or "1-memory-bus-update-per-write," so it is important to avoid all CPUs writing to the same location. Even with reader-writer lock, it is not possible to scale past the 50-100 CPU range. Multi-core processing is an increasing industry trend with almost all hardware manufacturers now exploring the possibilities. Azul is making production hardware with 768 cores, Sun's Rock is around 64, and even x86 based commodity hardware is expanding the number of cores it uses. Thus the problem of lock contention is becoming a pressing one for developers looking to write performant code.

Dr Cliff Click, a distinguished engineer at Azul Systems, gave a talk (slides here) at this year’s JavaOne describing a set of techniques that have allowed him to get quite some way towards a scalable, non-blocking coding style in Java. In very broad terms his approach implements a non-blocking algorithm such that stopping one particular thread does not stop global progress.

The major components of Click’s work are:

  1. A large, fast array to hold all the data which allows rapid parallel reads of the data and also allows a parallel, incremental, concurrent copy.
  2. Atomic-update on those array words (using java.util.concurrent.Atomic.*). The Atomic update will use either Compare and Swap (CAS) if the processor is Azul/Sparc/x86, or Load Linked/Store-conditional (LL/SC) on the IBM platform.
  3. A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes.

With these basic elements in place, Click then constructs an algorithm from many FSM steps that is lock free, ie each CAS makes progress. A CAS success is a local success, whilst a CAS failure means another CAS succeeded. If the CAS is successful the state machine advances, whilst a failed CAS retries.

Click has implemented two examples (Bit Vector and Hash Table) with the code available on SourceForge and is working on a third (a FIFO queue). To take one example in slightly more detail, the hash table is an array of Key Value pairs with the keys in even slots and values in odd slots. Each word is Compared and Swept separately but the state machine spans both words and during copy includes words from both arrays. The hash table implementation supports concurrent insert, remove test and resize and passes the Java Compatibility Kit (JCK) for ConcurrentHashMap. On Azul’s hardware it obtains linear scaling to 768 CPUs with more than a billion reads/second simultaneous with more than 10 million updates/second.

InfoQ spoke to Dr. Click to get a little more background on the work. During his JavaOne talk he highlighted a couple of issues with writing the hash table implementation in Java so we asked him how suitable Java was for this kind of work. He responded that it was "actually fairly suitable... it has a well understood memory model (and well implemented). It's lacking in some fine-control issues, which I can ignore as they cost a little bit of performance. This lack of fine-control (i.e. direct ASM access) would be a problem with e.g. OS design or device drivers, but not for performant data structures.”

We also asked when he would recommend using one of his data structures. His general advice was to use it when the "tried and true" alternatives are too slow to be useful:

“When a single data structure is highly contended; and you've already tried e.g. java.util.concurrent.ConcurrentHashMap. My stuff is generally slightly faster with no load (so no real reason to use it), and works much better
- with more than 32 cpus contending, or
- with a high percentage of writes vs reads.
Your mileage may vary, of course, so test before using.”

There's a lot of activity around concurrency in Java at the moment and Dr. Click's work tackles similar problems to the fork/join framework being considered for Java 7. Although not a member of the expert group himself, Click is regularly consulted by them.

Compare and Swap by Steven Reynolds Posted May 27, 2008 4:36 PM
Re: Compare and Swap by Steven Reynolds Posted May 27, 2008 4:47 PM
Re: Compare and Swap by Charles Humble Posted May 28, 2008 1:21 PM
fork/join by Alex Miller Posted May 27, 2008 9:51 PM
Re: fork/join by Geert Bevin Posted May 28, 2008 12:45 AM
  1. Back to top

    Compare and Swap

    May 27, 2008 4:36 PM by Steven Reynolds

    Um, the intel instruction should be "Compare and Swap". Very nice overview of Click's work.

  2. Back to top

    Re: Compare and Swap

    May 27, 2008 4:47 PM by Steven Reynolds

    Sorry, typo above. Should have said the CAS instruction should be "Compare and Swap".

  3. Back to top

    fork/join

    May 27, 2008 9:51 PM by Alex Miller

    As I mentioned in my coverage of this talk, I asked Dr. Click a question during the Q&A regarding the relationship with fork/join, particularly the non-blocking queue work in progress. Both Brian Goetz and Cliff Click made the point that while the Executor pattern was a great improvement over what most people were doing pre-Java 5, it doesn't scale well beyond 4-8 CPUs. At the time, that wasn't a problem but that's changed and will get worse fast. The problem of course is single-queue contention. Brian is solving the problem in fork-join by splitting work among many queues and using some smart optimizations like work-stealing to reduce contention at both head and tail of each queue. Fork/join will serve us well as CPUs in the 10s become common. Cliff's work is really more targeted up an order of magnitude in the 100s or 1000s of CPUs. It pushes the idea of many queues even farther by effectively using N 1-item queues. What worries me is that while I find Executors easy and the implementation of something like fork/join fun but challenging, I find Dr. Click's work fairly intimidating. If this is where we're all gonna live in 5-10 yrs we all need to get a lot smarter quickly or find a better way to do things.

  4. Back to top

    Re: fork/join

    May 28, 2008 12:45 AM by Geert Bevin

    Hey Alex, I suppose that, like with most complex things, this will be encapsulated into the data structures that have been written to support his work. I do agree though, Cliff's work is humbling and intimidating. I always feel like I have to get my chest wet before reading anything that he has done ... respect!

  5. Back to top

    Re: Compare and Swap

    May 28, 2008 1:21 PM by Charles Humble

    Many thanks for this - have amended now.
    Charles

Educational Content

Bindings, Platforms, and Innovation

This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.

Orchestrating Long Running Activities with JBoss / JBPM

This article explores the use of JBoss and jBPM to implement design solutions that effectively address the issue of orchestrating long running activities.

Neo4j - The Benefits of Graph Databases

This presentation covers the use of graph databases as an optimal solution for data that is difficult to fit in static tables, rapidly evolving data or data that has a lot of optional attributes.

Realistic about Risk: Software development with Real Options

This session introduces Real Options and shows how it can help in running your project. Real Options is a decision-making process that can be used to manage risk.

Communication Flexibility Using Bindings

This article discusses the use of bindings on services and references (including the instance of non-configured bindings) as the means to implement SCA communications in a Web and SOA environment.

Writing DSLs in Groovy

After a short introduction to DSLs, Scott Davis plays with the keyboard showing how to approach the creation of a DSL by typing working snippets of Groovy code that get executed.

Scaling Agile with C/ALM (Collaborative Application Lifecycle Management)

IBM Rational and InfoQ present, Scaling Agile with C/ALM, an eBook showing organizations how to become “finely tuned software delivery machines” by enabling team integration and scaling.

Concurrent Programming with Microsoft F#

Amanda Laucher presents a real life enterprise application written in F#. She shows actual code snippets, explaining design decisions and suggesting how to use some of the F# constructs.