
Azul's Zing 4.1 Virtualisation System for Java Gets up to 80% Better Performance Than Zing 4.0

by Charles Humble on May 25, 2011. Estimated reading time: 8 minutes

Azul Systems, one of the five vendors to make Gartner's 2011 “Cool Vendors in Application and Integration Platforms” list this year, have released version 4.1 of their Zing Virtualisation system for Java, which, they claim, is up to 80% more performant than the already impressive Zing 4.0. Vice President of Technology and CTO for Azul Systems, Gil Tene, explained to InfoQ that the 80% figure was based on the original Liferay demo test we described here. Whilst the previous test was able to run 800 users and handle around 3.5GB garbage/second, Zing 4.1 was able to sustain around 1500 users. At that point the Zing garbage collector, referred to as the Continuously Concurrent Compacting Collector or C4, was handling 5.5GB of garbage/second with no loss of performance, according to Tene. By way of contrast, using the same hardware, the standard Oracle distributed JVM is able to sustain around 45 users when running the Concurrent Mark Sweep garbage collector that Oracle recommends for applications that are sensitive to garbage collection. 

Any JVM will exhibit observable time delays in executions that would have been instantaneous had it not been for the behaviour of the JVM and the underlying system stack, a quality Tene refers to as 'jitter'. With the most recent updates to C4, measured JVM jitter is no longer strongly affected by garbage collection, with scheduling and thread contention artifacts dominating.

The improvements between Zing 4.0 and Zing 4.1 come from a mixture of improved Just-in-Time compilation techniques, enhanced thread scheduling, and other advanced algorithms that increase performance and sustainable throughput, and lower response times. As an example of the kind of optimization techniques Azul are using, they have proven away the need for certain of the read barriers (referred to as "Loaded Value Barriers" or LVBs) that the C4 algorithm uses to support concurrent compaction, remapping, and incremental update tracing. So for instance, in the case of an object array copy, the values of the elements being copied do need to be LVB'ed, but neither the source nor target addresses in the copies need to be LVB'ed more than once. Tene went on:

This pattern also shows up in things like string compares and more generic user-written code - where the code accessing an array's contents is doing so one at a time, with each access happening in an accessor method (so it would seem like each accessor call needs to LVB the array), but through accessor method inlining into the loop and proper recognition that the repeated LVB in the loop is redundant, the LVB can be optimized to be done once, and hoisted to be outside of the loop.
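As a rough illustration of the hoisting Tene describes (this sketch is not Azul's code), the Java below models a Loaded Value Barrier as an ordinary method that only counts how often it is called. The names `lvb`, `Holder`, `sumNaive`, and `sumHoisted` are hypothetical; in a real JVM the barrier is emitted by the JIT compiler and is not visible in user code.

```java
// Hypothetical model: lvb() stands in for a Loaded Value Barrier and only counts calls.
final class LvbHoistingSketch {
    static long barrierCount = 0;

    // Stand-in for the barrier applied to a freshly loaded reference (an assumption, not a Zing API).
    static <T> T lvb(T ref) {
        barrierCount++;
        return ref;
    }

    static final class Holder {
        int[] data;
        Holder(int[] data) { this.data = data; }
    }

    // Naive shape: the array reference is reloaded and barriered on every access.
    static long sumNaive(Holder h) {
        long sum = 0;
        int n = lvb(h.data).length;
        for (int i = 0; i < n; i++) {
            sum += lvb(h.data)[i];
        }
        return sum;
    }

    // Hoisted shape: the reference is loaded and barriered once, before the loop.
    static long sumHoisted(Holder h) {
        int[] data = lvb(h.data);
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }
}
```

For a four-element array the naive shape runs the modelled barrier five times and the hoisted shape once, while both return the same sum. The hoisted form is only valid as long as the reference cannot change underneath the loop.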

Another simple example is comparisons with null. It is common to find user code that looks for non-null values, often in otherwise sparse (mostly null) structures. The code's common operation would load a reference (from an array, or a list, or some such), compare it with null, and skip ahead as long as it finds a null. By recognizing that we can tell whether a reference is null or not without having to LVB it, we can avoid LVB'ing all the null references we load - the LVB only happens ahead of the reference being used when it survives the null check (which on sparse structures eliminates most of the LVBs).
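The null-check case can be sketched the same way. In this minimal model, which is an illustration and not Azul's implementation, `lvb` is again a hypothetical counting stand-in for the barrier, and `countNonEmpty` is an invented example of user code scanning a sparse array.

```java
// Hypothetical model: the barrier is applied only to references that survive the null check.
final class NullCheckBeforeLvbSketch {
    static long barrierCount = 0;

    // Counting stand-in for a Loaded Value Barrier (an assumption, not a Zing API).
    static <T> T lvb(T ref) {
        barrierCount++;
        return ref;
    }

    // Counts the non-empty strings in a mostly-null array.
    static int countNonEmpty(String[] slots) {
        int count = 0;
        for (int i = 0; i < slots.length; i++) {
            String raw = slots[i];   // raw load, no barrier yet
            if (raw == null) {
                continue;            // null can be detected without a barrier
            }
            String s = lvb(raw);     // barrier only before the reference is actually used
            if (!s.isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

With an eight-slot array holding only two non-null entries, the modelled barrier runs twice instead of eight times.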

We can also take a small segue into safepoint handling: We never want to go very far without crossing a safepoint opportunity at which threads can scan their own stacks and correct any refs that need to be "fixed" - since the GC would not be able to make forward progress unless the threads cross that line "readily" (our checkpoint mechanism relies on this - and any blocking code is always at a safepoint opportunity that allows the GC to do the thread scan if the thread doesn't do it on its own). We will make sure that any long array copy will have frequent safepoint opportunities within it, but they don't have to happen on each and every iteration. For example, we can copy 256 array elements between safepoint opportunities.
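The idea of copying in chunks between safepoint opportunities might look roughly like the sketch below. It is a user-level model only: `safepointOpportunity` is a hypothetical placeholder that just counts calls, whereas a real JVM polls for safepoints in compiled code. The chunk size of 256 is taken from Tene's example.

```java
// Hypothetical model of an array copy with a safepoint opportunity every 256 elements.
final class ChunkedCopySketch {
    static final int CHUNK = 256;   // elements copied between safepoint opportunities
    static int safepointPolls = 0;

    // Placeholder for the point at which a thread could scan and fix its own stack.
    static void safepointOpportunity() {
        safepointPolls++;
    }

    static void copy(Object[] src, Object[] dst, int length) {
        int i = 0;
        while (i < length) {
            int end = Math.min(i + CHUNK, length);
            // Within a chunk no safepoint is crossed, so src and dst can be
            // treated as constants in this model.
            for (; i < end; i++) {
                dst[i] = src[i];
            }
            safepointOpportunity();
        }
    }
}
```

Copying 1000 elements this way crosses four safepoint opportunities instead of 1000.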

An important thing to realize about safepoint opportunities (and this is true for all GC mechanisms) is that an object reference value on the stack can change when you cross a safepoint opportunity (e.g. a stop-the-world compactor like ParallelGC could have run at the safepoint, changing the reference to point to a new location). This means that loaded reference values cannot be cleanly treated as constants in various operations and optimizations. For example if one reference's current address is "greater than" another reference's current address, that relationship may change when you cross a safepoint opportunity, so you can't safely make code that uses an established "greater than" relationship to make certain optimizations.

However, between any two safepoint opportunities, references are known to not change "on their own" (i.e. we don't have to worry about someone coming and changing an object reference on the stack to a different value). A reference can be loaded from memory (and LVB'ed at that point), or clobbered in some way by the thread's instructions, but no one else will change it, and no event will occur that will make the thread self-change it. That allows us to do various optimizations that treat the reference as a constant between the safepoints, and allow optimizers to eliminate some redundant operations and make certain optimizations that would rely on constant relationships.

Azul have also made improvements to Zing's default behaviours, essentially allowing it to work across a wider range than previously without the need for tuning the collector settings. Tene suggested that the total smooth operating range is around 1.5-2 times wider than in the 4.0 product.

Zing 4.1 also benefits significantly from improvements to the underlying hypervisor virtualisation technology on which Zing sits. So, for example, a JVM running on a hypervisor in a Red Hat Enterprise Linux (RHEL) 6.0 environment can now support 64 cores. Tene told us:

KVM on RHEL 5.x supports up to 16 virtual cores for a guest OS (which the Zing Virtual Appliance is). KVM on RHEL 6 upped that to 64 vCores. RHEL 6 itself can run on hosts with up to 256 cores, with each guest OS running under KVM within it limited to 64 vCores. Modern Intel E7 based commodity servers will have those levels - e.g. a 4 socket server (available from Dell/HP/IBM/...) has 40 physical cores and 80 hyper-threads (aka virtual cores at the host level). An 8 socket E7 system (which at least HP is shipping) will have 80 physical and 160 virtual cores.

Tene has also created a simple, but ingenious, open source tool, called JitterMeter, for measuring the latency between a client and a Java application node, or within a node. JitterMeter adds a thread that goes to sleep for 1 msec and measures how long it takes to wake up. Any difference between the perfect 1 msec and the actual elapsed time gets measured and recorded. A webinar (registration required) provides more details.
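The measurement loop described above can be approximated in a few lines of Java. This is a minimal sketch of the sleep-and-measure idea, not JitterMeter's actual source; the class and method names are invented for illustration, and a real tool would run the loop in its own thread for the lifetime of the application.

```java
import java.util.Arrays;

// Minimal sketch of the sleep-and-measure idea; not JitterMeter's actual code.
final class SleepJitterSketch {
    static final long INTERVAL_NANOS = 1_000_000L;   // the "perfect" 1 msec

    // Sleeps 1 msec per sample and records how much longer than 1 msec each wake-up took.
    static long[] sample(int samples) {
        long[] excessNanos = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // preserve the interrupt for the caller
                return Arrays.copyOf(excessNanos, i);  // keep only the completed samples
            }
            long elapsed = System.nanoTime() - start;
            // Clamp at zero in case the timer reports slightly under 1 msec.
            excessNanos[i] = Math.max(0L, elapsed - INTERVAL_NANOS);
        }
        return excessNanos;
    }

    public static void main(String[] args) {
        long worst = 0;
        for (long v : sample(1000)) {
            worst = Math.max(worst, v);
        }
        System.out.println("worst observed excess (usec): " + worst / 1000);
    }
}
```

The numbers this prints depend entirely on the machine, OS timer resolution, and load, so no expected output is shown.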

JitterMeter is a very simple, open sourced tool that can be run with any Java application, and on any JVM, and is used to establish the underlying platform's contribution to application response time behavior. By measuring the jitter experienced by an empty work unit while the JVM is under [your] application load, JitterMeter reports the best-case platform jitter (coming from the JVM, OS, hypervisor, hardware, etc.) that an application can expect to experience, all without needing to change code, application deployment, or load measurement and generation methods. This information provides valuable insight into the causes of application response time issues, and allows developers and performance engineers to focus their efforts in the right place. For example, if the platform jitter is small (in the 10s of msec), response time tuning efforts should probably focus on application code and behavior. However, if application jitter is dominated by platform artifacts (as is often demonstrated by JitterMeter results), tuning application code will have little effect, and tuning the platform should be the main focus. JitterMeter builds on a very simple observation: If an empty work unit in an independent Java thread experiences a latency "hiccup" for whatever reason (e.g. a 2 second GC pause, or a 1/2 second hypervisor VMotion event, or a high latency due to scheduling overload) it's a safe bet that other Java threads on the same JVM would also experience the same magnitude of "hiccup" at the same time, and that application response time during that interval would be *at least* as long as that measured by the empty work unit. By reporting a detailed histogram of the response time distribution experienced by the empty work unit, JitterMeter establishes a baseline, best-case histogram of what the actual application transactions must have experienced on the same system, at the same time.
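To make the histogram idea concrete, the sketch below buckets measured hiccups by order of magnitude. The bucket boundaries and all names are assumptions chosen for illustration; the article does not describe JitterMeter's actual histogram format.

```java
// Illustrative order-of-magnitude histogram of hiccup durations in msec.
// Bucket boundaries are an assumption, not JitterMeter's real output format.
final class HiccupHistogramSketch {
    static final String[] LABELS = {"<1 ms", "1-9 ms", "10-99 ms", "100-999 ms", ">=1 s"};

    static int bucketFor(long millis) {
        if (millis < 1) return 0;
        if (millis < 10) return 1;
        if (millis < 100) return 2;
        if (millis < 1000) return 3;
        return 4;
    }

    // Returns one count per entry in LABELS.
    static long[] histogram(long[] hiccupsMillis) {
        long[] counts = new long[LABELS.length];
        for (long ms : hiccupsMillis) {
            counts[bucketFor(ms)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = histogram(new long[] {0, 0, 3, 15, 40, 250, 2000});
        for (int i = 0; i < counts.length; i++) {
            System.out.println(LABELS[i] + ": " + counts[i]);
        }
    }
}
```

Because every application thread on the same JVM sees at least these hiccups, the highest occupied bucket gives a lower bound on the worst response time the application could have delivered in that interval.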

Our experience shows that response time distribution for Java applications tends to be strongly "multi-modal", consisting of multiple dominant peaks, and that it looks nothing like the normal standard distribution graphs most people would like to model. There is usually a dominant common case of good response times ("typical", "average", "mean", and 90%'ile all tend to be within this main mode), with additional concentrated "modes" of higher response time events. At the low end of the spectrum, we see OS scheduling related artifacts in the low 10s of msec, and Hypervisor artifacts (where applicable) in the 10s to low 100s of msec. The higher end of the spectrum, and the one that represents the most visible response time inconsistencies, is usually dominated by GC related artifacts. Artifacts that correlate with minor GC events (such as NewGen collection) are typically in the 100s of msec range, while artifacts that range into multiple seconds are typically correlated with Full GC events, permgen collections, and promotion failures. The higher you go in the spectrum, the lower the frequency of events is, but it is usually those 2-5 second events that occur every few minutes that give application tuning professionals the biggest headaches...
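A small worked example shows why the median and 90th percentile can sit comfortably in the main mode while the tail tells a very different story. The data here is entirely synthetic and invented for illustration (95 samples at 5 msec, 4 at 200 msec, 1 at 3000 msec); it is not a measurement from Azul or from the article.

```java
import java.util.Arrays;

// Synthetic illustration of a multi-modal response time distribution.
final class TailPercentileSketch {
    // Nearest-rank percentile; pct is a whole-number percentage from 1 to 100.
    static long percentile(long[] samples, int pct) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = Math.max(1, (pct * sorted.length + 99) / 100);   // ceiling of pct * n / 100
        return sorted[rank - 1];
    }

    // Made-up data: a dominant 5 msec mode plus two small higher modes.
    static long[] syntheticSamples() {
        long[] s = new long[100];
        Arrays.fill(s, 0, 95, 5L);      // main mode
        Arrays.fill(s, 95, 99, 200L);   // "minor GC"-sized events
        s[99] = 3000L;                  // one multi-second event
        return s;
    }

    public static void main(String[] args) {
        long[] s = syntheticSamples();
        System.out.println("p50=" + percentile(s, 50)
                + " p90=" + percentile(s, 90)
                + " p99=" + percentile(s, 99)
                + " max=" + percentile(s, 100));
        // prints: p50=5 p90=5 p99=200 max=3000
    }
}
```

On this synthetic data the 50th and 90th percentiles both report 5 msec, so a report that stops at the 90th percentile would miss the 200 msec and 3 second modes altogether.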

Zing 4.1 is shipping in the next week. The pricing is unchanged, and is based on an annual subscription per server. A free trial copy can be requested from www.azulsystems.com/trial.
