Bio: Gil Tene, VP of Technology and CTO at Azul, has been involved with virtual machine technologies for the past 20 years and has been building Java technology-based products since 1995. He co-founded Azul Systems in 2002, where he pioneered Pauseless Garbage Collection, Java Virtualization, and various managed runtime and systems stack technologies that deliver a scalable and robust Java platform.
SpringOne 2GX is a co-located event covering the entire Spring ecosystem and Groovy/Grails technologies. SpringOne 2GX is a one-of-a-kind conference for application developers, solution architects, web operations and IT teams who develop, deploy and manage business applications. This is the most important Java event of 2010, especially for anyone using Spring technologies, Groovy & Grails, or Tomcat. Whether you're building and running mission-critical business applications or designing the next killer cloud application, SpringOne 2GX will keep you up to date with the latest enterprise technology.
I’d be happy to. We started Azul in 2002, right after the big crash of 2001, at a very interesting time to start new companies. The premise and purpose of starting Azul was really to address some of the challenges we perceived at the time in deploying and scaling Java applications. Azul aimed to solve both individual application scalability and infrastructure scalability and deployment paradigms. When we started, we set off to solve multiple different problems. The key one was the ability to provide infrastructure that would allow applications to smoothly scale and operate across a wide range of operating ranges, share headroom, be elastic in nature and survive a lot of the things that we saw individual Java instances at the time being limited by.
When we started off, that included building hardware platforms that could support what we needed, solving key problems in a platform like garbage collection and allowing Java to scale to many cores of CPU to provide throughput in large enterprise environments. We spent a few years developing the solution and our first product, the Vega 1 Platform came out in early 2005. We’ve since developed and shipped three successive generations of the Vega product line, which is a custom hardware-based product and we’ve recently announced the general availability of our fourth generation product, which is a pure software product aimed at virtualized x86 commodity servers, but delivering the same capabilities that we’d had to build custom hardware to support in the past.
To start with scalability: at the time, this was in the first several years of the 2000s, it really wasn’t viable on commodity hardware. If you wanted to provide a scalable system with a lot of memory and a lot of CPU power, you pretty much had to use custom chips of some sort. And if you looked at the systems available at the time that were able to do that, whether they were SPARC-based or POWER-based or others, they all tended to be fairly customized and the server vendors basically owned the hardware that could do that. We found that the architectures across the board weren’t quite as well matched to Java execution as we’d have liked.
Key requirements like a truly symmetric system, massive memory bandwidth and other features drove us to building basically our own hardware platform from the ground up. We built chips; we built processors on those chips and drove multi-core very early. At the time when people were experimenting with two cores per chip, we were shipping 24 cores per chip. Our latest generation of Vega is actually 54 cores per chip, so we’ve taken the multi-core thing quite far. We also took the memory and symmetry parts of the system quite far for the time. Our systems had multiple memory controllers per chip, a symmetric mesh of interconnects between the chips that supported symmetric high bandwidth memory access from any core to any memory.
We had to build all those because we couldn’t build or buy parts on the open market that would allow us to do that. In fact, normal high-end servers didn’t have those qualities either. An interesting aspect in designing our fourth generation of product is that every time we designed another generation, we looked around to see if there were parts we could build a better product out of without building our own, and for three successive generations the answer was "No." By the time it came to the fourth-generation product, the roadmaps from Intel primarily, and certainly to some degree AMD as well, convinced us that there would be systems that could carry our software without custom hardware.
The things that convinced us were really that multi-core had come along and 8 cores per chip and more were coming, but more importantly, symmetric memory bandwidth systems were coming: chip-to-chip interconnects with enough bandwidth, memory controllers on each chip and coherency that could support symmetric memory access from any core to any memory. The interesting thing I find today is that if you draw a Nehalem-EX architecture of a four-socket system from any commodity vendor, crisscrossed with interconnects, four memory controllers per chip and such, and you draw a Vega 3 system next to it and erase all the labels, the system architectures are identical.
So, somebody else had finally built the systems we were building and done it with commodity parts, which allows us to focus on the value add that we are best at, which is delivering Java execution and scalable Java execution, smooth Java execution on top of those platforms without having to carry the burden of building the hardware ourselves into the future.
Zing brings a lot of the benefits that we have been able to deliver with our previous generations with custom hardware, and they really have to do with taking Java to the next level of scale. When we started off with Vega systems, a system with 384 cores and 256 GB of memory seemed like a big monster, but last I checked, we can go on the Dell website today and we can buy systems with 24 virtual x86 cores that are quite fast and about 100 GB of memory for less than 10,000 dollars. So commodity hardware has now caught up with the kind of sizes we were doing before. We’ve built JVMs that could scale to these sizes from the beginning, from inception.
It was clear to us that there was no point in building this kind of capability without solving the core scale problems for individual JVMs. What surprises us, actually, is that eight years later, general industry JVMs haven’t really moved from where they were. While 1 GB or 2 GB seemed like a large amount of memory, it was quite practical to run that in a single JVM in 2001 and 2002. Today, that is still roughly where the practical limit of individual Java instances is. But today, that limit represents about 2-3% of the capacity of a commodity server, which means individual Java instances today are highly limited by the JVM’s capability to actually use the hardware.
Azul JVMs, Zing being an incarnation of them, are able to use the entire capacity of the underlying hardware and scale smoothly within it without sacrificing response times or SLAs, and that is one key characteristic of the platform. As a result, we can leverage large amounts of memory without fear of pause times or response time issues, we can elastically grow and shrink the footprint of individual JVMs so we can share headroom, and we can react to loads as they happen on an instantaneous basis. We can use all those features together to provide both robustness and good infrastructure and scale.
But at the heart of all that are key differentiating capabilities around being able to scale memory to begin with, without the bad impact of garbage collection and other things that have become all too well known to Java developers out there.
The answer is "Yes" and "No." The Zing product actually comes in multiple parts that work together. One part of that is the Zing JVM and that is a piece of downloadable code that runs on Linux, on Solaris. In fact, it will be available for Windows soon and other operating systems as well. In the Vega world we actually have zLinux machines running the Azul JVM and leveraging our capabilities. However, the Zing JVM that you run in any of these operating systems is a virtualized JVM. The platform itself that you execute is simply a small virtualization proxy for an actual JVM that gets pushed onto an Azul appliance, which is a better place to run and execute a JVM.
That appliance has underlying capabilities that simply don’t exist in other operating systems, and that allow us to support things like pauseless garbage collection, elastic memory and scalability of the JVM to orders of magnitude beyond what a regular JVM can do on a regular operating system. We ourselves do not know - at least do not know today - how to execute a JVM like Zing within the capabilities a regular operating system provides on common hardware. But with Zing we’ve figured out how to deliver the capabilities we needed on top of a virtual appliance, on top of a virtualized infrastructure like VMware or KVM on commodity hardware.
It’s that backend part, which we call the Zing virtual appliance, that delivers the actual capability. That is also a downloadable piece of software, but one you need to run on a hypervisor of some sort. Together, the Zing virtual appliance and the Zing virtual machine combine into the Zing platform.
5. One of the things which the Azul VM is known for is the pauseless garbage collector. Can you describe in general how the pauseless garbage collector works and how it’s different from the garbage collectors that are present in other VMs?
Sure, I’ll be happy to. Pauseless garbage collection is a core part of our technology and without it we wouldn’t be able to reach the scales that we did on Vega or the scales that we can do on Zing. Key to it is the complete elimination of garbage collection pauses, as they relate to heap size, to allocation rate, to mutation rate, to various metrics that applications exhibit as they scale. Our approach to garbage collection differs dramatically from the common approach in the industry in that we’ve taken the opposite direction in garbage collection tuning and dealing with what would be generally thought of as rare events in garbage collection.
If we look historically at what garbage collection has looked like in commercial JVMs and in other types of virtual machines, stop-the-world events are really the bad things in garbage collection, and tuning of garbage collection is generally focused on delaying stop-the-world events as far as possible, trying to push them into the future. Some stop-the-world events can be eliminated in some cases, but there are some that simply can't be on commercial JVMs today: pretty much every single commercial JVM shipping today has code in its collector that performs compaction of the heap, or full compaction of the heap, under a full stop-the-world condition where the application cannot run.
All those collectors have various filters and tuning knobs that allow you to try to run that code later, push it further out into the future, make it rarer, but the code is there and it’s there for a reason. It’s there because eventually, invariably, the heap will fragment like Swiss cheese and get to the point where there is empty heap, but not enough contiguous room for one object that you really want to put in there. At that point, some amount of the heap will need to be compacted and defragmented. That operation is the one we focused on to begin with. At Azul we took the approach of saying "There will never be a rare event" and we have to be able to do the worst possible thing at all times.
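To make the Swiss-cheese problem concrete, here is a toy sketch (my own illustration, not Azul code): a heap can be half empty in total and still have no contiguous run big enough for even a two-slot object, which is exactly the point where a collector must compact.

```java
// Toy illustration: total free space says nothing about the largest
// contiguous hole, which is what actually limits allocation.
public class FragmentationDemo {
    // Length of the longest contiguous run of free (unused) slots.
    public static int largestFreeRun(boolean[] used) {
        int best = 0, run = 0;
        for (boolean u : used) {
            run = u ? 0 : run + 1;
            best = Math.max(best, run);
        }
        return best;
    }

    public static void main(String[] args) {
        boolean[] heap = new boolean[16];               // 16 slots, all free
        for (int i = 0; i < 16; i += 2) heap[i] = true; // every other slot live
        int free = 0;
        for (boolean u : heap) if (!u) free++;
        // Half the heap is free, yet no two-slot object fits anywhere:
        System.out.println("free=" + free + " largestRun=" + largestFreeRun(heap));
        // prints: free=8 largestRun=1
    }
}
```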
We test the collector and we stress it so that that bad thing is the thing we do well and don’t mind doing. By doing that, we are able not just to handle large scales and do so robustly, we’re able to do it without any significant tuning, because most of the motivation for tuning garbage collection has disappeared. That motivation has generally been to delay a really bad thing, and if the bad thing is gone, there is no need to delay it anymore and you’re off to simple efficiency tuning, which is much more easily done.
So compaction is one of the key things we took head-on, and the key thing that differentiates pauseless GC from Azul from every other collector out there is that we relocate objects on the fly, without stopping the application during the relocation of the objects themselves. We’re able to do that without having to locate and remap the potentially billions of references to those relocated objects before we allow the application to run. We’ve got various mechanisms within the collector that allow us to do that safely and without stopping the application.
Those mechanisms allow us to build what is actually a very simple and straightforward collector that always does the same thing, always performs the same algorithm: marking, identifying the live stuff, and compaction of the live stuff. There is no other way we recoup memory, so there is nothing else to worry about.
Azul announced an initiative we call the "Managed Runtime Initiative" I believe back in June of this year. The Managed Runtime Initiative is aimed at actually taking the capabilities that we have in our own virtual appliance, the kernel level support for virtual memory and other behaviors that allow us to perform things like pauseless garbage collection and try and bring those into the community as features for platforms like Linux and potentially other operating systems. The Managed Runtime Initiative is really a platform for advocating for new features in operating systems and other underlying layers of the entire system stack that are focused specifically on making managed runtimes like JVMs better.
As a specific example, we’ve put out there and open sourced as part of the Managed Runtime Initiative two key components that demonstrate this capability hand in hand. On the one hand, a set of kernel enhancements, in this case Linux kernel enhancements, that provide new functionality, new semantic capability for managing memory and scheduling. On the other hand, a Java runtime based on the OpenJDK platform that uses those capabilities to deliver pauseless garbage collection on a system that has them built in. We put those two pieces out there together because there is a "chicken and egg" problem.
It’s virtually impossible to convince people to build new kernel features for non-existent applications, and it’s virtually impossible to convince people to build an application that requires non-existent kernel features. So to avoid the discussion of what comes first, we took both, put them out there and demonstrated that together they deliver extreme value - two orders of magnitude improvement in scale and reduction of response time and things like that - with the hope that that value will convince people in the various communities to include these features in upcoming upstream efforts.
We hope to be able to drive features into the Linux community, to be able to support runtimes like Java and others on future Linux platforms with these capabilities. We also hope that other operating system vendors, not necessarily open source vendors, will look at the capabilities themselves and provide some APIs (they don’t have to be the same APIs) that would match them and provide similar abilities. So operating systems like Windows and Solaris and maybe others would be able in the future to support pauseless garbage collection behavior as well in a native way.
But until then, until the Managed Runtime Initiative bears fruit and we see off-the-shelf operating systems deliver these capabilities, virtualization is a solution that’s available immediately and right now. And Azul leverages virtualization with the Zing platform to deliver the full pauseless capability, the full scale and the full capacity of commodity servers into existing operating systems today without having to wait for operating system evolution. Eventually we’ll have both, but for the near future, you can have a JVM that never pauses for GC that supports 10 and 15 and 20 and 100 and 300 GB, if you want, without ever pausing or stopping. And you don’t have to wait for other solutions to do that; it’s available as of yesterday.
7. There have been more recent garbage collectors added in to the JVM; for instance, there is Concurrent Mark Sweep and there is the Garbage-First collector. Do these address some of the issues with stop-the-world garbage collection that you mentioned?
The answer is yes, they do address it to some degree, but even their results are still far behind the capacities of current servers. Concurrent Mark Sweep, which started off in about 2001, has matured over the years. It is a solution that allows you to delay big pauses, but not to avoid them. It delays them further than a full stop-the-world collector like Parallel GC does. It does a good job of delaying, but the key thing to look at in all collectors, and a simple question to ask any vendor shipping a collector, is "Does your code have a mode that collects and compacts the entire heap in a single stop-the-world pause?" If the answer is "Yes," that code exists for a reason, and the next question is obviously "What will make that happen and when will it run?"
Looking at collectors like the Concurrent Mark Sweep collector, which is in a class of its own, and newer experimental collectors like the G1 Garbage-First collector, it’s worth discussing what they do well and what sensitivities remain that force large pauses to still be there. Concurrent collectors are pretty much a necessity for anybody who wants to meet response-time requirements better than 5 or 10 seconds with more than 1 GB or so of heap. In our experience, Concurrent Mark Sweep does a pretty good job of meeting up to 5-second response times with up to about 3 GB of heap on modern servers.
It can pretty much contain the worst-case pause time on 1 GB of live set and 3 GB of heap within five seconds, but that pause will still occur. Concurrent Mark Sweep is a mostly concurrent collector. The parts that are concurrent in its operation are finding live objects in the heap, often referred to as marking, and then sweeping the heap, detecting and tracking down dead objects to recycle them. It’s a Concurrent Mark Sweep collector, which means those two phases are concurrent. The marking phase is a truly concurrent phase with a very small stop-the-world operation at the end, just to catch up with the last little set of references that might need to be marked.
It is sensitive to mutation rate, but unless the application completely overwhelms the collector, the collector is generally able to keep up with modest allocation rates and mutation rates while tracking down the live objects without stopping the application. Sweeping is done as a concurrent operation there as well: you find the dead objects, you put them on free lists, and when you need to allocate new things, you just use that dead space. The part of the operation that Concurrent Mark Sweep does not perform concurrently is compaction. If it ever needs to move even a single object from point A to point B in memory, in order to free up a larger contiguous space, it will have to stop the world, scan the entire heap for references that might point to those objects and remap them.
That compaction pause is generally the worst possible pause a collector can have and it remains in the Concurrent Mark Sweep collector as it does in other similar collectors that attempt to do some sort of concurrent operations to delay the inevitable pause, but do not compact to avoid it. Newer collectors, experimental collectors like the G1 collector for example, go a step further. They actually perform stop-the-world compaction, but they attempt to do it in small increments instead of in one large stop-the-world operation. While G1 is probably a couple of years from being stable, it already demonstrates that when compaction happens, it is able to cut it into pieces often, but the key word here is "often."
Full heap compaction is still a necessity that appears in these collectors as popular objects and popular parts of the heap end up being referred to from pretty much everywhere in the heap. And compacting those parts of the heap will end up requiring a full scan of the entire heap before allowing the application to continue the execution. The thing that makes the Azul collector unique today in the market, at least in the commercial JVM market, is that when we compact and relocate objects, we do not need to go and track down any references to these objects before we allow the application to continue running.
The reason we’re able to do that is that we introduced what is called a "reference read barrier" or "reference load barrier" into the execution of the JVM. We’re able to intercept any attempts to use references to relocated objects before they’re ever actually accessed by the Java applications. That read barrier is key to what allows us to build a fairly simple mechanism that never has to stop to remap references.
The Azul JVM uses, as I said, a read barrier, which basically means that every time a Java application, or the runtime on behalf of the application, attempts to load a Java reference from memory - not use the reference, but simply get it to begin with - the reference undergoes some sanity checks that verify it’s OK to use. Those sanity checks, for example, make sure that the reference isn’t pointing to an object that’s been relocated to another place. We also use the exact same read barrier to verify that the reference has already been visited by the collector, traversed by the collector, or is known to be traversed by it in the future. This allows us to build another key part: a deterministic single-pass marker.
The read barrier is a new thing. Other commercial JVMs today do not use a load barrier of this sort, and while there is plenty of academic work on the subject, read barriers have generally been considered expensive and did not provide a complete algorithmic solution to the GC problem. The Azul read barrier, which we actually provided details of in a 2005 academic paper titled "The Pauseless Garbage Collection Algorithm," promises an invariant: the application is unable to observe a reference to a relocated object, and is unable to observe a reference to an object that the collector is not assured to be marking, if the collection is in the mark phase. And that intercept is done very efficiently.
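As a rough illustration of that invariant, here is a sketch in plain Java (the names, fields and structure are mine, not Azul's; a real barrier operates on tagged pointer bits, not wrapper objects). A loaded reference passes the barrier only if it points to an object that has not been relocated and, during a mark phase, has already been logged for marking:

```java
// Toy model of the two checks a load-value-barrier applies on every
// reference load. Fields stand in for metadata a real JVM keeps in
// pointer bits or side tables.
public class LoadBarrierSketch {
    static class ObjRef {
        final Object target;
        final boolean relocated;   // target object has been moved elsewhere
        final boolean marked;      // marker has seen (or is assured to see) this ref
        ObjRef(Object t, boolean r, boolean m) { target = t; relocated = r; marked = m; }
    }

    // Fast path: the reference is safe to hand to the mutator only if
    // both invariant checks hold; otherwise a "trap" handler must fix it.
    static boolean passesBarrier(ObjRef ref, boolean inMarkPhase) {
        if (ref.relocated) return false;              // must be remapped first
        if (inMarkPhase && !ref.marked) return false; // must be logged for marking
        return true;
    }

    public static void main(String[] args) {
        ObjRef ok = new ObjRef(new Object(), false, true);
        System.out.println(passesBarrier(ok, true)); // prints: true
    }
}
```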
In the Vega machines we actually had a hardware-specific instruction, a load value barrier instruction, that was able to complete the entire barrier in a single cycle. On x86 virtualized hardware we don’t quite get that single-cycle capability, but we’re lucky enough to run on an extremely fast superscalar pipelined architecture that can do out-of-order execution and speculation. In effect, we sprinkle the microcode equivalent of the simple operations of our read barrier along with the rest of the compiled code that we generate with our JIT compilers, so it tends to hide pretty well in the execution pipeline and be of fairly low or nearly no cost to the execution of the Java code itself. But the barrier is there, and if an attempt to use a bad reference occurs, we deal with it.
We logically call that a trap. It’s not an actual CPU trap; it’s a logical trap in execution, and instead of executing what the Java application wanted to do, which is load the reference and then perhaps use it, we sidetrack into handling the trap condition. In that handling we do the obvious thing: we fix the reference, making sure it points to the right place, where the object really is, or making sure the collector is assured to mark the reference by logging it somewhere. In some cases, where the object we point to should be relocated but hasn’t yet been, the trap handler might actually relocate the object on its own, so it doesn’t have to wait for anybody else.
In all those cases, the result of handling the read barrier trap is a good reference that would have passed the read barrier. We can always do that, we can always push forward. We never have to wait for any other thread, including garbage collector threads, to proceed past the read barrier. A very important quality of the read barrier is that it is self-healing, a term we coined back in our 2005 paper. When we refer to "self-healing" we mean that beyond just fixing the reference itself, the one we loaded from memory and are about to use in the application, we also go back to the place in memory that we loaded the reference from and heal that memory location, so that nobody else - not us in another iteration of a loop, and no other thread that might read the same content - will ever have to deal with that trap again.
So we atomically replace the content of memory with the fixed reference that points to the right place or indicates that marking has already been done. By doing that healing, we basically assure that every reference in memory will trap at most once. That assurance means there are no hot storms of reference traps. It means the collector can actually be lazy about fixing references as it goes and is in no hurry to fix them. The collector might reach a reference first and mark through it or relocate it or remap it, or the mutator (the application) might reach it first, do a tiny little bit of GC work on its own and then keep executing. Those tiny little bits of GC work are on the order of a microsecond or two of execution when a trap occurs, so fast that it doesn’t really matter.
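The self-healing step can be sketched as a compare-and-swap back into the slot the reference was loaded from. This is my own simplification (the forwarding table and method names are hypothetical, not Azul's API), but it shows the key property: the slot is fixed at most once, and the mutator proceeds with a good reference whether or not its CAS won:

```java
// Toy sketch of a self-healing trap handler: compute the corrected
// reference, CAS it back into the memory slot it was loaded from, and
// return it. Losing the CAS race is fine - someone else already healed.
import java.util.Map;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class SelfHealingSketch {
    // "forwarded" stands in for relocation info: old reference -> new location.
    static Object heal(AtomicReferenceArray<Object> slots, int i, Map<Object, Object> forwarded) {
        Object seen = slots.get(i);
        Object fixed = forwarded.getOrDefault(seen, seen); // remap if relocated
        slots.compareAndSet(i, seen, fixed);               // heal the slot atomically
        return fixed;                                      // always a good reference
    }

    public static void main(String[] args) {
        AtomicReferenceArray<Object> slots = new AtomicReferenceArray<>(1);
        Object oldLoc = "old", newLoc = "new";
        slots.set(0, oldLoc);
        Object r = heal(slots, 0, Map.of(oldLoc, newLoc));
        System.out.println(r + " " + slots.get(0)); // prints: new new
    }
}
```

After the heal, no later load of that slot can trap on the same stale reference, which is what prevents trap storms.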
We generally see these events spread very evenly over time, so it’s not like you get a problematic storm of them. The read barrier is really designed to do this well and efficiently both in the fast mode, where the reference is just fine and you keep going at virtually no overhead, and in the slow mode, where you’ve actually run into a reference that needs some healing and you execute the healing very quickly, as I said, without ever blocking or waiting for another thread. That basically is the essence of pauseless operation: the mutators can always push through whatever condition they run into and can always run and deal with references as they load them.
The implementation of the Azul collector does have what we call "phase flip" pauses. Technically we will bring all threads to a stop, flip phase and then let them all go. The operations that happen in those phase flips are basically global changes of state that have nothing to do with heap size, nothing to do with allocation rate, nothing to do with the activity of threads; they’re simply a crossing of a barrier, globally. They generally happen in sub-millisecond times.
Part of our pauseless garbage collection mechanism includes what we call "quick release," which is really the ability to recycle memory immediately after compaction without waiting for the garbage collection cycle to complete. The way it works is interesting: the collector is actually a very simple mechanism. It goes through a single-pass mark phase; it detects what’s alive and where in memory, and it tracks liveness by page - 2MB pages in the x86 case. So it knows how much memory it can recoup out of every 2MB of memory. We then do the obvious thing of going after the empty and sparse memory first, recouping it by relocating all the live objects away from a page and compacting them somewhere else, which gives us an empty 2MB piece of memory right there.
In a normal collector, once you’ve relocated an object somewhere else, you have to find all the references to it in the heap and remap them; as we said, we just relocate, and we have time. In fact, we will delay our remap operation until the next mark phase. We’re really not in a hurry to remap. The next mark phase is going to scan all the objects and look at all the references anyway, so it will also fix any references that need to be remapped, and if the mutator runs into them before that, that’s fine, they just get fixed earlier. However, getting the memory itself to be reusable is something we want to do without waiting for the cycle to complete, primarily because we don’t want to be in a hurry to complete the cycle.
In fact, we want to be able to perform these compaction operations in very low memory conditions, which requires us to do what we call "hand-over-hand" compaction: compacting some objects away and recouping an empty MB in order to compact other objects into that MB. So even when we have only a few MB left, we want to be able to compact the heap using those MB. We use a trick that we call "quick release" to separate the virtual memory mappings from the physical memory content, so we can recycle the physical resources without having to remap virtual references. At the point where we’ve relocated objects to compact, we have objects that have moved from point A to point B, and we have references to the old virtual addresses of those objects laid out all over memory.
Until we know that no such references remain, we can’t recycle those addresses; we can’t take the virtual address space that we relocated away from and reuse it for something else, because there could be pointers still pointing into it. However, we keep all our relocation information, the information of where objects went and how to correct references, outside of the 2MB page that the objects came from. By doing that, we know there is no physical content, no actual content, in the page that we need in order to complete the operations. So once we’ve relocated the objects away from a page, we unmap the page: we retain the virtual memory mapping, but we release the physical content and put it on a free list, so it can be immediately mapped somewhere else in memory and used for actual allocation.
In fact, it will likely be used to allocate newly compacted objects to continue hand-over-hand compaction. But if not, we’re freeing pages at very high rates and that’s also where the Java allocations will go. In general, the lifecycle in this collector is extremely simple: all objects are compacted away, we leave empty pages behind, and the only way we free pages is to unmap them. We simply drop them and give them back to the operating system; they are not even retained by the JVM. The only way we allocate, whether it’s for new Java objects or for compaction in the collector itself, is to map new pages and ask for physical backing store for them. It’s those two operations that churn through memory; they are the only operations you will see us do. It does mean that we interact with the virtual memory system at a very high rate.
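The separation of virtual mappings from physical frames can be modeled with a small data-structure sketch (entirely my own toy model, not Azul code): after relocation, the physical frame goes on the free list immediately, while the virtual page stays reserved, mapped to nothing, until stale references into it have been remapped.

```java
// Toy model of "quick release": physical frames recycle immediately
// after compaction; virtual pages are only retired later, once no stale
// references into them can remain.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class QuickReleaseSketch {
    // virtual page number -> physical frame number (null = reserved, unbacked)
    final Map<Long, Integer> virtualToPhysical = new HashMap<>();
    final Deque<Integer> freePhysical = new ArrayDeque<>();

    // Relocation done: release the physical frame right away, but keep the
    // virtual slot reserved so stale pointers can still be remapped later.
    void quickRelease(long virtPage) {
        Integer frame = virtualToPhysical.put(virtPage, null);
        if (frame != null) freePhysical.push(frame); // immediately reusable
    }

    // New allocation (or hand-over-hand compaction) grabs a recycled frame
    // under a fresh virtual address. Returns null if no frame is free.
    Integer mapNewPage(long virtPage) {
        Integer frame = freePhysical.isEmpty() ? null : freePhysical.pop();
        if (frame != null) virtualToPhysical.put(virtPage, frame);
        return frame;
    }

    public static void main(String[] args) {
        QuickReleaseSketch s = new QuickReleaseSketch();
        s.virtualToPhysical.put(100L, 7); // page 100 backed by frame 7
        s.quickRelease(100L);             // frame 7 is free, page 100 reserved
        System.out.println(s.mapNewPage(200L)); // prints: 7
    }
}
```

This is why even a few MB of physical memory suffice to compact a large heap: the same frames cycle through compaction hand over hand.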
10. Regarding that kind of interaction with the virtual memory system: in my experience, with the Sun VM for instance, what you’ll often do is set the min and max heap sizes to be exactly the same, so that everything is allocated up front and you aren’t giving and taking away pages like that. How are you able to do that?
The Azul collector relies on operating system features in the Zing virtual appliance and in the Managed Runtime Initiative that allow us to perform operations a regular operating system can’t do. Most of those features focus on virtual memory management and physical memory management underneath. It is important to us to be able to map, remap, protect and unmap pages at a continuous rate of allocation that matches the capabilities of modern CPUs. A modern Nehalem core is able to easily and comfortably generate 1GB/second of garbage, which means we want to be able to handle and keep up with tens of GB/second of sustained churn through memory.
That translates in turn into hundreds of GB/second, or even TB/second, of remapping operations at peak points, and the need to keep changing those mappings at sustained rates. Regular operating systems were simply never built to do this. There is nothing super-hard about it; it’s really just a matter of knowing that those operations are wanted because they’re useful. But existing operating systems, in their semantics around key operations like map, remap, protect and unmap, are very limited in throughput. In fact, if you measure the rate at which a modern Linux operating system is able to remap memory while a multithreaded program is actually executing, not when only one thread is performing the operation, you’ll find that on most platforms Linux is unable to remap at higher than a few hundred MB/second.
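A rough single-threaded illustration (not Azul’s benchmark) of measuring a Linux kernel’s remap rate: bounce a mapping between two addresses with `mremap` and divide bytes moved by elapsed time. The function name and sizes are invented for the sketch; a real measurement, as described above, would run many remapping threads concurrently.

```c
#define _GNU_SOURCE            /* for mremap */
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

/* Bounce a mapping between two addresses with mremap and report
   how many MB/second of address space were remapped. */
double measure_remap_rate(size_t chunk, int iters) {
    void *src = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *dst = mmap(NULL, chunk, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED) return -1.0;
    munmap(dst, chunk);        /* keep the address as a landing spot */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        void *moved = mremap(src, chunk, chunk,
                             MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        if (moved == MAP_FAILED) return -1.0;
        void *tmp = src; src = dst; dst = tmp;  /* bounce back next time */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    munmap(src, chunk);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return ((double)chunk * iters) / (1024.0 * 1024.0) / secs;
}
```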
So we have a gap of about two to three orders of magnitude between what we need and what the operating system can do. Luckily, by using large pages, using finer-grained locking, and using interesting semantics, like not requiring TLB invalidates and hard consistency across CPUs until the process indicates that it wants them, we’re able to cover those orders of magnitude and more. When we measure the rates at which the Managed Runtime Initiative kernel mods and the Zing virtual appliance can deal with virtual memory, operations like remaps tend to be orders of magnitude faster than a typical Linux kernel, about 10,000 times faster in remaps/second.
And some key operations that are critical for us at the phase-flip parts of our collector are a million times faster. A million times faster is the difference between flipping a phase in 20 microseconds and flipping that same phase in 20 seconds. It’s the difference between a practical pauseless collector and one that would be nice on a whiteboard but not really practical to implement. The kernel features we rely on are key to our being able to complete this algorithm and actually perform the operations we do, and that million-times-faster operation I’m talking about is the operation of changing a lot of mappings on pages from which we’re about to relocate a lot of content.
We need to slap protections on those pages and remap them to other locations so we can deal with them, while knowing that the runtime can no longer access them without us intercepting the access. We need to do that atomically for all Java threads at the same time. To do that with a regular Linux remap operation, we would have to stop the threads, do all the remaps and then let them go. That would be a pretty long pause. With the Managed Runtime Initiative kernel features, we actually maintain what is effectively a double-buffered page table, where you have the currently active page table with the current mappings, and we prepare a large batch of remapping and re-protecting operations on the side, but we don’t actually have that batch take hold until we’re ready.
When we’re ready, we bring all the threads to a stop, we flip the page table, and then we let them go. That flip costs us one pointer copy per GB of memory, so we can flip quite a few GB in a single 20-microsecond flip.
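The shape of that flip is the familiar prepare-then-publish-by-pointer-swap pattern. A hypothetical user-space sketch (the real mechanism lives in kernel page tables; the table layout and names here are invented):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N_SLOTS 1024

/* A toy stand-in for a page table: one word per "page" mapping. */
typedef struct { uintptr_t mapping[N_SLOTS]; } table_t;

/* Reader threads load this pointer; the flip is one atomic store. */
_Atomic(table_t *) active;

/* Prepare a large batch of remaps in a shadow copy, off to the side;
   nothing is visible to readers while this runs. */
table_t *prepare_batch(const table_t *cur, uintptr_t delta) {
    table_t *shadow = malloc(sizeof *shadow);
    memcpy(shadow, cur, sizeof *shadow);
    for (int i = 0; i < N_SLOTS; i++)
        shadow->mapping[i] += delta;   /* stand-in for each remap */
    return shadow;
}

/* The "phase flip": one pointer store publishes every new mapping,
   regardless of how many GB the batch covered. */
void flip(table_t *shadow) {
    atomic_store(&active, shadow);
}
```

The expensive work all happens in `prepare_batch` while threads keep running; the stop-the-world window only has to cover the single store in `flip`.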
Yes, it is quite similar to a lot of concurrent algorithms out there, where you prepare a large operation and then publish it with a single atomic operation. The atomic operation is usually the critical bottleneck, so you want to make sure it’s as short as possible, and that’s why you perform the bulk of the work ahead of time. In our collector, the collector controls the ordering semantics and decides when those flips should happen; we just need the underlying OS capabilities to perform the operations with the looser semantics we need. Our collector has no need for the operating system to invalidate the TLB caches of every CPU running threads on the system for every 4KB of remapping we do.
We can do 64GB of remapping and ask for a single flip with a single TLB invalidate at the end. It’s really all about cost elimination: eliminating work that we never needed, work that is there simply because of the historical semantics of virtual memory manipulation in operating system kernels.
Elastic memory is something that is enabled by solving garbage collection to begin with. In fact, at Azul we shipped our first pauseless garbage collection system more than five years ago now. We’ve spent the time since then evolving pauseless garbage collection, making it run across wider spans of memory and higher throughputs, but also figuring out what the next thing is that you can do once GC is basically solved. For us, garbage collection is a solved problem; it’s gone, it’s never going to be in your way again. What can you do with that, other than eliminate pause times? The thing we found most annoying about large-scale Java deployments is that they tend to be very rigid and inflexible.
You pretty much have to decide how big your JVM is before you start it, and if you got that number right, great. But if you got it wrong, one of two things will happen: either you are unlucky, it’s too small, and you will run out of memory and crash; or you are lucky, it was too big, and you just wasted a whole bunch of memory that nobody else gets to use. Your chances of actually getting it right are close to zero, and if you got it right today, it’s probably wrong tomorrow. So elasticity, or elastic memory, is simply the ability to grow and shrink your actual memory footprint, your heap footprint, according to the needs and usage of the application.
Because our garbage collector releases memory back to the Zing virtual appliance immediately in each collection, and recoups that memory for allocation on demand at a very high rate, it’s able to share memory headroom across JVMs elastically. We use that capability, along with SLAs and policy, to provide JVMs with a robust, assured committed amount of memory, plus headroom above that for two distinct and separate purposes. The first purpose, the one most people will gravitate to, is what we call performance elastic memory.
It’s a nice way of sharing headroom: if there is available memory, I don’t have to garbage collect back to back, I don’t have to try to fit into a very tight space; I can just grow into what’s there. It’s there, it’s empty, let’s use it rather than waste CPU. It’s really a tradeoff between CPU and memory at that point. Or maybe my live set is going up and down, and other people’s live sets are going up and down, and I want to be able to share headroom for the live set itself. That performance elastic memory is a feature that helps efficiency and robustness. But more important, and something that has delivered value in Vega appliances for several years now in commercial, mission-critical business applications, is what we call insurance elastic memory.
Insurance elastic memory is there to correct your mistakes, or your inability to predict the future. A way to think of the difference between insurance elastic memory and performance elastic memory is this: insurance elastic memory is memory you really don’t want to be using. You didn’t plan on using it; you’ve sized your committed level of memory and you are supposed to fit within it, and if you need more, you made a mistake. You actually want warning lights going off, you want somebody to tell you that something is wrong; maybe you just misconfigured it, maybe you have a memory leak. But this is memory that is saving you from crashing. The only alternative to using that memory would be to throw an out-of-memory exception, or to incur a really long GC pause and then throw an out-of-memory exception.
Insurance memory is memory you are really happy to be using when you do, but you shouldn’t be using it, and it’s the job of the garbage collector to keep you away from it at all costs. So if you are in insurance memory, we are spending CPU back to back, collecting, trying to get you out of there. If you have a memory leak and you can’t get out, we’ll keep trying to collect, which is fine: you have CPU cores to spare, we’re never stopping your application, you are just working harder. We try to stay out of insurance memory, and it’s a warning condition when you are tapping into it. You are just really happy to have it there.
Performance elastic memory is the opposite kind: it’s memory you want to be using. You’d rather use it than collect within your committed memory, because you’d be more efficient. That allows you to set a base commitment to meet base-level SLAs, use performance elastic memory to meet peak SLAs or high loads if you want to, and use insurance memory to protect against getting the other two wrong. Some people will use pure commitments and pure insurance, with nothing in the middle. Some people will say "I don’t need insurance, this is a QA system, it’s free for all. Who cares?" Some people will commit everything and not let elasticity happen at all.
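The three tiers just described can be modeled as a simple accounting policy. This is a hypothetical sketch of the decision logic, not Zing’s actual policy engine; the type names and the idea of a single `grow` entry point are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical model of the memory tiers described above. */
typedef struct {
    size_t committed;    /* assured base commitment (base SLA) */
    size_t performance;  /* elastic headroom you're happy to use */
    size_t insurance;    /* last resort: using it is a warning */
    size_t in_use;
} heap_budget;

typedef enum {
    TIER_COMMITTED,    /* normal operation */
    TIER_PERFORMANCE,  /* using shared headroom: fine, more efficient */
    TIER_INSURANCE,    /* warn the operator: misconfiguration or leak */
    TIER_OOM           /* only now would an out-of-memory error occur */
} tier;

/* Account for a heap growth request; report which tier it lands in. */
tier grow(heap_budget *b, size_t bytes) {
    size_t want = b->in_use + bytes;
    if (want > b->committed + b->performance + b->insurance)
        return TIER_OOM;   /* request refused, budget unchanged */
    b->in_use = want;
    if (want > b->committed + b->performance) return TIER_INSURANCE;
    if (want > b->committed) return TIER_PERFORMANCE;
    return TIER_COMMITTED;
}
```

Note the asymmetry: crossing into `TIER_PERFORMANCE` is routine, while crossing into `TIER_INSURANCE` should trip monitoring, exactly the "warning lights" distinction drawn above.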
It’s up to the customer to decide how much of each kind of memory to use, and there are settings and command line flags for each JVM that let you say "this is the maximum you are allowed to use above -Xmx" or "don’t use more than -Xmx because I’m QA’ing the system and I really want to make sure that you fit in there." But generally, our experience with production customers is that when they go to production, they’ll release the maximums and allow the JVM to use as much as the insurance would allow, because it’s always better to let the JVM run and warn you than it is to crash it in those situations, as long as we can do that without hurting your SLA. Because of pauseless garbage collection, we can keep growing.
We can take a JVM that was configured with 2GB of -Xmx, borrow 20GB of insurance memory, and make it through the weekend without crashing even though you have a bug and a memory leak. And we will do that without slowing down or pausing your application, so it’s OK to do that.
Thank you and I invite everybody to try this new Zing thing out.
Will the Hypervisor running Zing be able to run other applications side-by-side or only the Azul JVM?
Re: x86 read barrier
The details of the barrier's ref-sanity checking operations, and how they are implemented, vary: barriers exist in C2 (-server) compiled code, in C1 (-client, or -tiered on the Azul VM) compiled code, in interpreted code, and in the runtime itself (there are a whole bunch of places in the C++ HotSpot code that reference objects, and they all have to have good read barriers too). The level of optimization done for each differs (in the C2 case it is highly optimized to fit well in the pipeline).
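For readers unfamiliar with the idea, a loaded-value-style read barrier conceptually looks like the following. This is a hypothetical C sketch, not Azul's actual barrier; the bit position, the `fix_ref` helper, and its behavior are invented for illustration.

```c
#include <stdint.h>

#define BAD_REF_BIT ((uintptr_t)1 << 62)  /* hypothetical metadata bit */

/* Slow-path stand-in: the runtime corrects a "bad" reference.
   Here it just clears the bit; a real one would relocate or mark. */
static uintptr_t fix_ref(uintptr_t ref) {
    return ref & ~BAD_REF_BIT;
}

/* Fast path emitted at every reference load: test the loaded value,
   take the (rare) slow path if it looks bad, and heal the slot so
   later loads of the same slot stay on the fast path. */
uintptr_t read_barrier(uintptr_t *slot) {
    uintptr_t ref = *slot;
    if (ref & BAD_REF_BIT) {   /* rarely taken */
        ref = fix_ref(ref);
        *slot = ref;           /* self-healing store-back */
    }
    return ref;
}
```

The check itself is a test and a predicted-not-taken branch, which is why, as noted above, an optimizing compiler can schedule it to fit well in the pipeline.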