Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Gil Tene on Zing, Low Latency GC, Responsiveness


1. I’m Ari Zilka, here at QCon, and I’m interviewing Gil Tene, CTO of Azul Systems. Gil, why don’t you introduce yourself for folks who somehow haven’t heard of Azul by now and give us a quick overview?

Thanks, I’m happy to be here. I’m the CTO of Azul Systems and I’ve been working on Garbage Collection and high performance managed runtimes for the last ten years or so. I’ve been using runtimes for twenty years, and building all kinds of other things and systems before that. At Azul we basically build scalable, consistent-execution Virtual Machines, and we’ve been doing whatever it takes to do that for about ten years. That included doing some crazy things like building our own chips, our own hardware and our own appliances, and we’ve shipped very impressive things in the large scale SMP market over the years. In the last couple of years it’s become possible for us to take our entire stack and move it to pure software on commodity x86 servers; the hardware just got good enough for us to do that, so we no longer have to build our own. We have a product today called Zing that is basically a native JVM for Linux. It provides pauseless Garbage Collection and the ability to use large heaps or small heaps with very consistent execution, so that the worst case jitter that comes out of your application and the computer you're running on is very, very contained.


2. So you said SLAs and you said pause times and things like that – it sounds like you are getting latency predictability of Java based applications, is that correct?

Yes. I think there is a wide spectrum of what predictability people want to have. Historically we served the enterprise market, the interactive human response time market: large portals, big retail stores, banks, self-service things and telcos. But recently, especially with our Zing product that is native to Linux, we’ve been drawn into the low latency market in the financial world, where people go further and further down in the consistency levels they want. There it is no longer about human response time needs, or even machine-to-machine response time needs with things like e-commerce; it is now about trading and reaction time to market, things like that.


3. We are here in New York, in the hub of Financial Services. Ca. 2006 there was something called “a low latency race” that many companies were embroiled in: can we get from below a second down to low single-digit or low double-digit milliseconds? Are we there today, six years later, and why or why not? And is there more to squeeze out of Java based systems?

I think that the race for Low Latency is probably never ending. It’s a race to the bottom and it’s an arms race, because in trading the fastest guy wins and rings the bell. And I think the interesting thing is that Java has had interesting use cases in that world and interesting challenges. When latency was hundreds of milliseconds, Java did pretty well, and then when the latency race over the last few years started going down and down, things like Garbage Collection started being big issues for people in that world. They’ve done a lot to work around them, to tune and figure things out, and I think that today they're doing tens of milliseconds pretty well, if they have a very restricted application behavior.


4. Over what kind of time frames? Hours, days, sustained tens of milliseconds for even weeks?

I don’t think that is actually practical in Java in most JVMs. We are an exception, we are there to solve that problem, but in regular JVMs there is an almost inevitable pause, probably in the big fractions of a second, maybe multiple seconds, called an Old Generation or Full GC. That will eventually happen; it’s a question of how long before it happens. People are good at tuning that into the future. Most of the people I know in financial services in these markets tune it into a future that doesn’t happen, meaning they'll tune it so it'll happen only every two or three days, and they'll reboot every day, so it never happens. With contained applications with known loads, known code and very predictable overall data sets, that is doable.

So the multi-second thing is not really a problem in that space; it is just the multi-millisecond thing that is a problem. In the more complex enterprise space I think the multi-second stuff is very real, but specifically in the lower latency markets people have managed to live a day without a huge pause. Now they are dealing with "I have this problem every few minutes or every few seconds, I have tens of milliseconds of jitter in my system because it freezes." We are seeing people go well below that and do some pretty aggressive things, bringing Java maybe to ten milliseconds or even slightly below that on regular JVMs by using very aggressive coding practices, such as not allocating objects. But we actually believe that Java can go all the way to the hundreds of microseconds and better, and one of the things that Zing does is provide a Garbage Collector that not only eliminates the large old generation pauses but also eliminates the small young generation pauses, which a human being typically won’t feel but machines certainly do.


5. The young generation pauses, what is that about?

As Java allocates objects, the collector has to clean them up every once in a while. The young generation is a very efficient way of focusing Garbage Collection only on the recently allocated objects, most of which have already died. It’s a well-known technique, it’s been around for a few decades now, and it's a practical necessity for any Garbage Collector in a server today. By running young generation collections and promoting surviving objects after a while to an old generation, you are able to push that big bad pause, that multi-second one, into the future. But the young generation pause itself is, in most JVMs, a stop-the-world event: you freeze everything, you copy all the live objects from a from-space to a to-space, compacting them, you get a lot of free space, and you keep running and allocating into that, and do that again and again.

The leakage from the young generation into the old generation has to be very slow, and that’s why people in this space have been able to push that problem into the future. But the frequency of young generation pauses is pretty simple math: how fast are you allocating, and how much empty memory do you have? Every time you fill that up you have to do a pause to clean it up. In most financial systems that happens every few tens of seconds at least. The length of the pause is usually measured in milliseconds; it could be ten, twenty, thirty, eighty milliseconds in some cases, sometimes larger than that.
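
The "simple math" can be sketched as a single division: empty young generation space divided by allocation rate. The class name, method name and the figures below (2048 MB of empty young generation, 100 MB/s of allocation) are hypothetical, chosen only for illustration and not taken from the interview.

```java
public class YoungGenPauseMath {

    /**
     * Approximate seconds between young generation collections:
     * empty young generation space divided by the allocation rate.
     */
    static double secondsBetweenPauses(double emptyYoungGenMb, double allocationRateMbPerSec) {
        if (allocationRateMbPerSec <= 0) {
            throw new IllegalArgumentException("allocation rate must be positive");
        }
        return emptyYoungGenMb / allocationRateMbPerSec;
    }

    public static void main(String[] args) {
        // Hypothetical figures, for illustration only.
        double emptyYoungGenMb = 2048;
        double allocationRateMbPerSec = 100;

        double interval = secondsBetweenPauses(emptyYoungGenMb, allocationRateMbPerSec);
        System.out.println("Approximate seconds between young generation pauses: " + interval);
    }
}
```

Doubling the empty space or halving the allocation rate doubles the interval between pauses; in this simplified model neither changes how long each pause lasts.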


6. That is the jitter you were referring to, basically apps kind of hovering on a throughput instead of being predictable and flat?

Yes, the application will run really fast and then it will stop, and then it will run really fast again. You’ll see this almost perfect bimodal behavior. It’s not an average, it’s not a normal curve that a standard deviation would describe well. It’s very fast because no Garbage Collector pauses are happening and you are running the optimized JIT compiled code at a hundred microsecond time, and then you do nothing for twenty milliseconds, and then you go back. I like to call these hiccups: the system just goes through a hiccup, and it's a discontinuity in execution. The computer just didn't do anything for you for that time, and then it runs as fast as it can again. For human response time it is virtually invisible at the young generation level, but for trading it could be devastating; somebody could arbitrage you in that twenty millisecond window and make a lot of money off of you.
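
A small sketch of why an average hides this bimodal behavior. The data set is synthetic and assumed for illustration: 999 operations at the hundred microseconds mentioned above and a single operation that absorbs a twenty millisecond pause. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class HiccupStats {

    static double mean(long[] samples) {
        long sum = 0;
        for (long s : samples) sum += s;
        return (double) sum / samples.length;
    }

    static long max(long[] samples) {
        long m = Long.MIN_VALUE;
        for (long s : samples) m = Math.max(m, s);
        return m;
    }

    /** Nearest-rank percentile; p is in the range (0, 100]. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        idx = Math.max(0, Math.min(sorted.length - 1, idx));
        return sorted[idx];
    }

    /** Synthetic latencies in microseconds: 999 fast samples and one that hit a pause. */
    static long[] syntheticSamples() {
        long[] s = new long[1000];
        Arrays.fill(s, 100L);   // normal operation: about 100 microseconds
        s[500] = 20_000L;       // one operation stalled by a 20 millisecond pause
        return s;
    }

    public static void main(String[] args) {
        long[] s = syntheticSamples();
        System.out.println("mean (us):   " + mean(s));
        System.out.println("median (us): " + percentile(s, 50));
        System.out.println("max (us):    " + max(s));
    }
}
```

With this data the mean stays close to the fast mode and the median equals it, so only the maximum (or a very high percentile) exposes the twenty millisecond stall.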


7. So it sounds like you are moving the low latency race an order of magnitude forward with Zing. Are you at the microsecond scale today, how predictable are you getting, and why doesn’t Zing suffer from the copy collector hiccups or jitter?

So I'll answer that in reverse. Zing has a concurrent Garbage Collector for both the old generation and the young generation, which means that we don’t do the Garbage Collection work with the application stopped; we perform all the work that is needed, compaction and the rest, concurrently with the application running. In practical terms, Zing does bring the application to tiny phase-flip stops during the garbage collection cycle, but we don’t do work in there that has anything to do with the objects, the heap size or the processing. It is mostly telling threads to get to a point, flipping the phase and going. Depending on how well people can tune their systems, and in financial services low latency they do things like make sure there are more cores than threads, you can get those flips to be sub-millisecond and the like.

I don’t like to make promises of reaching numbers, because then everybody expects the same, but here is how I describe what you usually get. Out of the box, without much tuning, we easily get to below ten milliseconds worst case. With a little bit of tuning, which is usually a day, you easily get to one to two milliseconds worst case. Beyond that it starts depending on what your application does, how well you can figure out which code paths might be causing problems, and how well you can make sure you don’t have threads contending for the same CPUs at the same time. But we’ve seen people chop that down below a millisecond to hundreds of microseconds, and the best I’ve seen reported is actually by Martin Thompson, who has blogged about a system running with a worst case of eighty microseconds on Zing.

So we’ve seen these numbers, but what I like to warn people is that your experience may vary, your mileage may vary, and it takes work to get to below a millisecond of consistent execution. Probably the most common thing I see is that the system itself, not the JVM, not your application, but the system itself, also needs to reach that level of continuity. I usually recommend that people run a test, like a one-hour idle test, and run some sort of jitter meter on it. We have jHiccup, which we open sourced. Just measure your idle system and see that it doesn't have a hiccup of a few milliseconds here and there because Linux fired up a cron job and a lot of threads are running. Once you get the system quiet enough to that level, we’ve been able to get people to hundreds of microseconds of consistency, and we are looking forward to doing a lot of that.
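
The idea of a jitter meter on an idle system can be sketched in a few lines: ask to sleep for a fixed interval and record how much longer than requested the wake-up took. This is only a rough illustration of the idea, not jHiccup's implementation; the class name, the one millisecond interval and the ten second default run are assumptions, and a real measurement would run far longer (for example the one-hour test described above) and record a full histogram rather than just the maximum.

```java
import java.util.concurrent.locks.LockSupport;

public class IdleHiccupMeter {

    /**
     * Repeatedly parks for intervalNanos and returns the largest overshoot
     * (actual elapsed time minus requested interval) seen during the run.
     */
    static long measureMaxHiccupNanos(long durationMillis, long intervalNanos) {
        long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        long maxHiccup = 0;
        while (System.nanoTime() - deadline < 0) {
            long start = System.nanoTime();
            LockSupport.parkNanos(intervalNanos);
            long overshoot = (System.nanoTime() - start) - intervalNanos;
            // parkNanos may return early; negative overshoots are ignored.
            if (overshoot > maxHiccup) {
                maxHiccup = overshoot;
            }
        }
        return maxHiccup;
    }

    public static void main(String[] args) {
        // Assumed defaults: 10 second run, 1 millisecond sampling interval.
        long runMillis = args.length > 0 ? Long.parseLong(args[0]) : 10_000L;
        long worst = measureMaxHiccupNanos(runMillis, 1_000_000L);
        System.out.println("worst observed hiccup (us): " + (worst / 1_000));
    }
}
```

Because the loop does no application work, anything it reports comes from the JVM, the operating system or the hardware, which is the baseline to check before tuning the application itself.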


8. So last question: what should Zing not be used for? It sounds like it is the holy grail of JVMs.

What should Zing not be used for? We would like people to use it for everything. It’s a good JVM, and we think it can be used for everything on a server. We are not a client-side JVM; we are very focused on servers. The other part I have to point to is that we are about making applications do high throughput while maintaining good responsiveness. Whether it’s human response time with big portals or financial services, if you have a Garbage Collection pause issue, we have a solution for that. But there are places where you don’t have a Garbage Collection pause issue. Maybe the responsiveness is good enough for what your customer needs, or maybe you are running an overnight batch application and you just don’t care. We are not going to make an overnight batch application finish any quicker, we're just going to make it run without a twenty minute pause in the middle, and maybe a twenty minute pause is the right way to do it. So if you care about the response time of individual transactions and Garbage Collection is in your way, Zing should be a great fit. If you don’t have a response time issue or need, then, you know, any JVM should be able to do the job.

Sep 06, 2012