BT
x Your opinion matters! Please fill in the InfoQ Survey about your reading habits!

Cliff Click on In-Memory Processing, 0xdata H20, Efficient Low Latency Java and GCs
Recorded at:

Interview with Cliff Click by Werner Schuster on Jan 10, 2014 |
29:36

Bio Cliff Click (@hexadata) is the CTO and Co-Founder of 0xdata, a firm dedicated to creating a new way to think about web-scale data storage and real-time analytics. He worked on the HotSpot Server Compiler (the Sea of Nodes IR). He helped Azul Systems build an 864 core pure-Java mainframe that keeps GC pauses on 500Gb heaps to under 10ms, and worked on all aspects of that JVM.

Code Mesh London is an annual conference dedicated to non-mainstream technologies. In 2013 it featured talks from over 50 inventors and experts in languages, libraries, operating systems and technologies that handle the programming and business challenges of today. Programming languages discussed ranged from technologies that have been around for a while such as Haskell, Clojure or Erlang to new languages such as Elixir, Rust, Go and Julia.

   

1. We’re here at CodeMesh 2013 in London, I’m sitting here with Cliff Click. So, Cliff, who are you?

Yes, that’s a good question. Well, I am Cliff Click, of course. I am an old school hacker, I am a compiler, writer, I am one of the primary drivers for making Java go fast, I worked for many years at Azul Systems, doing low latency GC, as well as custom hardware for doing Java and then for the last year and a half I am the CTO and one of the main hackers at 0xdata, which is a company focused on doing big data analytics, predictive math modeling for all the data that fits in the RAM of a cluster. So, 0xdata will happily inhale data from Hadoop and HDFS, S3, local disk, URLs, URIs, whatever, stripe it around a cluster, optimize and compress it so we can pack more memory in, then do high-end math modeling on it, so logistic regression, generalized linear modeling, gradient boosting method, Random Forest, KNN, a couple of other techniques, and general data munging.

A lot of this data comes in dirty, it has to be cleaned, we replace missing values with computed means or drop outliers , do all kinds of things you would normally do to data on a desktop environment using a desktop tool like R or RStudio or Excel, except we do it on data that’s hundreds of GBs or TBs. And we’ll do it fast because it’s all in-memory analytics. And that in turn, lets you deal with data in these very large sets, the large data lets you build models that are more predictive and that drives the accuracy of the model and in turn drives what you are doing with it. So, a sharper distinction between fraud and non-fraud, better ad placement, better health insurance risk analysis, there are a bunch of use cases, they are all over the map, people have been reaching out to us very steadily for the last year, we have a number of paying customers, some of whom are in production, basically doing math modeling of big data.

   

2. What’s the interaction model with the user in that case, do you infer information automatically and how do I as a user interact with your tool?

We have a couple of different ways that we interact with the tool. That first way we demo is that we have a nice web interface, so it’s like a clicky-clicky drive-through shooting, you can click your way through some “Load a file” menus, the file is hundreds of GB or it’s directories of Hive files where there is a large amount of data, pull it all in, investigate it, look at it, add or remove features and columns and do things with it, and then just click a button and get a logistic regression running. Logistic regression, we have easily the fastest one on the planet, it’s usually over within under a minute for hundreds of GB and then from that you get a model out and there is some viewing of the model’s results which will is called a confusion matrix, area-under-curve kind of metrics, the quality of the model as it is presented on test and training data kind of things.

From there you might click your way through another button and get the model dumped out as Java code. And you can take that Java code as is, and basically put it into production as a stand-alone, no allocating, just doing math, low latency kind of we can go, apply your model to new data and do the prediction the model is going to do. So that layer if very useful for demoing because it’s a nice web interface, very useful for quick looking at stuff and surfing around, when people begin model building in earnest they usually work out a workflow they like and at this point the web interface gets in the way and they want to script it. So the web interface is in turn driven by JSON, REST API under the hood, and the rest of the JSON API is essentially auto-generated and the same thing that generates that, generates the web interface piece of it, so that they are one and the same, just the syntax is different, so the web form, you click through, the URLs are totally identical, you can write a script that does REST and JSON.

With the REST and JSON interface we then supply wrappers for Python, for R and for RStudio, for Excel, for Mathematica, and from there you get it into a more normal environment for doing math modeling. I guess R and RStudio are the bigger, more commonly used things, but some of the others are definitely at play, and you can sit on the R console and basically have it forward your workflow to the H2O cluster and have it do all the big math. So it looks like RStudio, it feels like RStudio, because it is RStudio, and yet you are talking about data that is 100x bigger than you can fit on your local machine. So that is the second layer that we expect people go to. And at that layer you are dealing with the math that we have devised and we are using state-of-the-art algorithms and state-of-the-art implementations, but it’s not necessarily all the math there is and you might have some more interesting stuff that you want to do that we haven’t implemented directly, in which case you need to crack open git and pull the source code down and then you can start writing your own math and at that layer you are writing Java code and the primary mechanism is a very clean version of Map/Reduce, sort of theoretically clean Map/Reduce, where map is a function that takes something of type a and produces something of type b, and reduce takes two things of type b and makes one thing out of it.

And then you can implement almost anything you want in a Map/Reduce call that’s plain old Java and we will distribute it and scale it around a cluster for you automatically. So, all the parallelism is taken care of and all the load balancing and all the data roll ups and the management of the cluster is handled by H2O. You basically can write a map in a line, a reduce in a line of Java code and that will run all CPUs, all machines across the clusters, as fast as it goes. And we have people who are doing that, they know the math they want is some business specific thing, it’s their math, but they are using H2O as a distributed system server to spread the math out around the cluster.

   

3. What’s the relationship to something like Hadoop, because you say they can write their own Map/Reduce code?

The relationship is interesting and subtle and everyone jumps to the wrong conclusion. So, the first thing people jump to is they hear Map/Reduce and they hear Hadoop and they think we use Hadoop’s Map/Reduce. And the real story there is we’ll input and output Hadoop as actually HDFS, not Hadoop, just HDFS, as a parallel distributed file system, we’re reading the data, and we want to read it in parallel distributed ways, so we use HDFS. And that’s fine. And then all the big data and most of it is in HDFS, great. We’re using Map/Reduce because it’s a paradigm that scales well. This is not Hadoop’s Map/Reduce, there is not shuffle step, there is no sort step, the maps are small and in-memory only, the reduces come early, come often, every time two maps are done, a reduction is done between the two maps to shrink that data result immediately, there is no collection of all the maps, we get fine-grained load balancing, we’re using Doug Lea's Fork/Join for load balancing within a node and we don’t see the curse of last reducer, we get good data distribution and when we load the data in and therefore almost all the math we are doing is on the order of magnitude of the size of the data and therefore if the data is even the amount of work is even, so the CPU load utilization is very even and we don’t get the last reducer issues.

Reductions come early, come often, it’s not Hadoop’s Map/Reduce by any stretch. That’s said for a production deployment, we need a cluster to run on and Hadoop is one of the main sources of clusters, so we can be engineered to run as a good member of the Hadoop ecosystem, as a giant mapper that sticks in all machines and doesn’t go away until our math is done. When we do that, we don’t use Hadoop’s clustering technology, we use our own, we don’t care about the name node or the job tracker, Zookeeper, that stuff, we’re just using Hadoop as a source of a cluster and then we’re parked over the data, so the data ingest is aware of that and pull the data in that fashion as well, and we’ll supply the necessary info to the job tracker, so standard Hadoop manager can look and see that H2O Java is running on the cluster. But that said, we are not using Hadoop for its Map/Reduce capability, we are using it as is a source of a cluster and we are using Hadoop for its HDFS, but not the Map/Reduce technology that’s built in it, but we have Map/Reduce, it’s just not Hadoop’s.

   

4. You mentioned your computations are all in-memory, there is no reading a database while you are working? How do you fit all that data in?

There are a couple of different things to go here. The first thing we do is just the observation that memory is cheap. And I can’t really buy a machine with less than 32 gigs and if I go down my Fry's I can buy a TB of RAM for 20k, I think we have a TB of RAM in our server room on four modern but not bleeding edge thing, just Dell server blades, cost us 20K. So, as soon as I get that much data, I am almost two orders or magnitude beyond what people normally model with. Modeling up to a couple of gigs has been happening and people pay for an expensive desktop server and they run R console and RStudio, it’s slow, but it works. Buy more memory and run it out of memory, that’s the first observation.

The next thing we did was a column compressed store, and column compressed stores have been around, a lot of people do that, we are apparently particularly aggressively good about it, we typically see 2x- 4x better, smaller footprint than gzip on disk, so being 4x better then gzip on disk is pretty good when we’re talking about big data, and that allows us to squeeze more data in. The next thing we observe is that the time it takes to do these models is limited by the passes of the data, the time it takes to r pull data in, most of the passes become memory bandwidth bound, how fast can we pull the data in. Highly compressed data, you get more data in for unit of cache bandwidth that you’re pushing trough. The modern x86 as a cache line moves through even it’s highly pipelined but you’re trying to feed 32 CPUs, every CPU has got hundreds of clock cycles to do something with that cache line, so it decompresses. We have a compression strategy which is essentially every individual element is compressed by itself and it’s decompressed by itself, often with a Shift, Scale, Add kind of thing. Some of them have more sophisticated bit length and bit run encoding and things like that, but the decompression essentially happens in the cache miss shadow, it’s free, so we compress the heck out of it and then we decompress on the fly and we do the math in the CPU and the roll ups and the math is typically pretty small and that just lives in your L1 cache and it’s happy. And the other aspect of this is the data is big and can get arbitrarily large, so you can ask for data sets that are beyond what will fit in memory.

In that case we will spill to disk, we have a user mode spill mechanism and then roll through the data in and out of disk. But as soon as you do that, you lose the advantage of in-memory calculation, the disk is 100x slower than memory, you must touch all the data, you must go through it, usually several passes, you have to make all these passes to the data off-disk, what goes from minutes goes to hours, it will work, it’s just really slow. Instead we suggest you down-sample until it will fit whatever amount of RAM you have and run it in memory. And we’ll take memory that’s easily 100x more than what you’re typically used to, we’ll use all those CPUs that come with that memory and it will run really fast. So that’s the desired, useful, sweet spot for us.

   

5. You say you have a clustering solution, but I hear you say you can have a server with a TB of memory; do I need just one server or what’s the parallelism story there?

We will run parallel within a server and then distribute it across a cluster to whatever extent you like, most of the math algos can use more memory and to that extent you’d rather have fat servers relative to the number of CPUs and then every time you cross a JVM boundary, you have to open a socket and send your data across and that has speed issues. So we would rather have fat JVMs as well. So the optimal solution for us is fewer fat JVMs and we will happily live with the old school GC collector at 200 GB mark, you get a 256 GB machine, use the -Xmx 250, fine, that works great with us. And that’s because the data is structured in such a way that the GC costs are very, very low. There are some of the algorithms which have more CPU than memory use for and then you might want to go trade off, you’re asking for more CPUs and less memory, but that’s a typical tradeoff. In any of those scenarios, if it’s more CPU or less CPU to relative memory, you always want fewer JVMs because every time you have another JVM you have a boundary to cross.

Most algorithms do a log tree break down of things, so you only cross the boundary once per pass over the data, but if you are making a bunch of passes, you can be latency bound on the network traffic crossing the cluster. In particular the gradient boosting method has very fast passes, it has to make a lot of them and it can get network bound, whereas the logistic regression typically is pretty CPU heavy and goes over the network only a handful of times to produce an answer and it’s more CPU bound, and you’d rather have more CPUs and the fact that you have more or fewer network connections doesn’t matter so much. So you go to Amazon, spin up a 60 node cluster, and logistic regression runs really fast, whereas if you’re running a gradient boosting method, that same 60 node cluster is less useful because your network speeds are unreliable from Amazon and you’re always latency bound by the slowest guy, so now you do have some last reducer kind of issues. You’re better off there sticking to your internal machine room servers where you are probably guaranteed high, fast, steady throughput.

Werner: 0xdata has written using Java, is written with Java.

It’s all pure Java, it’s all open source, it’s on GitHub, you go to github.com/0xdata and download and go. Our website, 0xdata.com, has the links to GitHub, links to the latest release and links to the docs and that kind of stuff and yes, it’s all pure Java.

   

6. If it’s open source, what’s your business model?

It’s service and support, it’s a RedHat model. It turns out that people who put a large, interesting, complicated piece of software in their production, their business critical line of work, want to know there is somebody there on the phone they can call 24/7 and say “hey, what the heck, come fix me”. And so we really see a poll model, we’re doing a lot of meetups, Twitter and a lot of non-traditional marketing channels and most of the people we talk to have already downloaded H2O, they are already messing around with it and they are headed for “we’ve been screwing around, we’re getting these really great models out, really fast, we want to know what the production story is, how do we get it into production”, the answer is “talk to us about a license and we’ll do a 10k per node, something different for enterprise-wide licenses” and get it straightened out from there.

   

7. Getting back to Java, it’s written in Java, how do you make the code fast, how do you optimize the Java code for cache efficiency and things like that?

We’re using Java, as I said, it’s pure Java within certain limits of what it means to be pure, we’re using Unsafe strictly to unpack and pack the data, so the data is ints, bytes, shorts, longs, floats, doubles, whatever, not necessarily aligned because the compression strategy is going on, so we use Unsafe to peel out of large byte arrays. And that’s the only use of Unsafe we have, we send stuff over the wire using NIO and ByteBuffers, under the hood that does Unsafe, past that, it’s all pure Java. So, how does it go fast? It turns out that the JVM JIT will do this dumb linear algebra math at the same kinds of speed as the C or the FORTRAN compiler will up to doing register tiling or blocking for cache.

So, as long as you are in one dimensional arrays, the JVM does a perfectly fine job of stripping out all the range checks and null checks and unrolling all the loops and scheduling all the loads and stores, it does the right thing. So we just use plain old Java in the inner loops, but clean Java and most of it is done in one-dimensional arrays, so there are some arrays of structs kind of things going on, but not classical two dimensional array math, it’s usually one dimension, although the dimension is billions long, it’s much bigger than what a Java array can hold because it’s striped across the cluster. Individual CPUs are generally exposed to a slice of that and they are marching away forward through a thousand to a million elements by how many features you have across your data. So the data is typically structured as a big table, billions of rows, hundreds to hundreds of thousands of columns and a typical math algorithm has to touch all the columns or some subset of the columns and then all the rows and that looks like a for loop over all the rows and then move over to the next column and make another for loop of those. So, these are sort of linearly-striding-through kind of things, the JIT does a fine job, you’re memory bandwidth bound, within the limits of memory bandwidth the CPU can totally keep up with whatever is going on there.

After that, you’re more interested in memory hierarchy effects, like caching, and you want to know that the small data fits in memory in cache and the big data is just not going to fit in cache. So, you want to touch the big data once, and that’s what we call pass over the data and that’s your entire memory bandwidth is streaming all that through and all the small data is sort of fairly carefully constructed to fit in the L1 cache as much as possible and there are some variations in algorithms where if you have enough features of one kind or another, some of the intermediate structures will get bigger than that and you’re stuck running in L2 and it’s a bit slower and that’s how it goes. But typically just reasonable engineering gets you to have all your in between math states kept in the L1 cache. We do do some cache oblivious algorithm coding where such algorithms apply and that of course gets you in cache pretty directly.

   

8. I guess that explains how you do the live decompression of data from memory, I guess. Do you just hope that it fits and that it does that in time?

We just hope it does it in time and observation is that it does, it varies by the dataset what you are doing, so the logistic regression is really fast, but in the end it does n2 amount of math, what we call a p feature squared amount of math per row that comes in. So, for p features worth of data you suck in, you get p2 work. So, for large enough p you blow out, you do more math computation than you have time it takes to load the next cache line in. But that is all done in a way that is entirely friendly to the JIT as a classic [nested for loop] where the JIT does the best it’s going to do short of going to SIMD ops. And so, if that gets CPU bound, it does, and you have more time to decompress, most of the time you wait on memory band width to pull the thing in and then you want to decompress and we usually do the decompression something pretty cheap, you have a scale factor and a bias, is the common case one, and those can be done in a couple of clock cycles, pipeline on x86, there is one add, onefloating point multiply one floating point add plus a byte or short extraction to a float , those are all one clock throughput instructions, they have a latency to them but we are all just pipelining them through.

So, the end result is three or four clocks to decompress that. The more complicated ones, like the run-length encoding ones where you have very sparse data, it’s mostly zeros, you have this really super high compression rate, 100:1, 1000:1, whatever, but you have an indicator in your code that the next thousand elements are zero, some of the algorithms know about that and will ask how many zeros in a row they can skip, some of the algorithms do not and they say “count my rows” and they just keep getting zeros back in which case the decompression has some overhead because it has to say “find RAM and the run-length coding, keep tracking, I’ve got the next thousand elements so I have a little state machine counting up to a thousand and I trigger”. Still it’s not very much overhead relative to the cost of hauling everything in from memory, sort of the usual story.

   

9. You used to be at Azul and you had the benefit of using Zing, these highly optimized JVMs, and I think nowadays you have to stick to regular, off-the-shelf JVMs. So, the first question is, I guess, how do you optimize the code with that, do you just test it and hope it works and then curse if the next version kills the optimization?

We’re using this totally standard JVM and totally standard OS, no OS tweaks, no JVM tweaks, we have targeted an audience where no command line flags beyond Xmx, so we have a very clean, very straightforward command line setup and in some ways I am sort of cheating, I wrote the JIT so I know what it’s capable of doing, so I rarely step outside the bounds of what the JIT will do well on, unless I don’t care. We have thrown YourKit at it profiling at it periodically and YourKit is doing ok, usually we know where time goes because we have small inner loops that were applying billions of times, it’s pretty clear where the time is going to be. Sometimes somebody screws up something and you don’t understand what’s going on and suddenly time goes somewhere weird and YourKit will point that out and you go fix whatever the problem is. As I mentioned earlier we don’t really have GC problems, so we are not needing something like Zing to do low latency GC, it’s a standard JVM doing standard stuff, does a good job, the stock GC. We only have huge amounts of bulk primitive arrays that aren’t moving anywhere and the small amount of allocation and the standard default collector does a fine job on that.

Werner: I guess if you want to achieve low latency, it’s the way to go, just basically large arrays or buffers and don’t allocate.

Break out low latency from the different use cases because Zing clearly has a use case for low latency that is different from what we are doing. So we are mostly doing bulk Java, batch Java, we want to be interactive but it’s not really low latency for the math modeling piece of it, that’s the part that says “you run over this TB of data”. When we produce a model out, we’ll provide support for people executing a model and the model might be used in low latency environments such as credit card swipe, is this fraud or non-fraud, are you actually buying your Starbucks coffee or not. There you have a low latency stack in mind and you want something out of the model that you know has fixed latency requirements.

So the model needs to run fast, needs to not do allocation that will trigger GC issues, needs to not lock, be geared for running in a low latency environment, we provide that model, but we are not trying to provide a low latency environment per se. So you are already have to have made decisions in your head about how to do low latency. That said, if you are looking for a low latency solution, Zing has gone down that path a long time and Azul Systems has a lot of expertise there and there are a bunch of issues involved in doing low latency that are sort of outside of the domain of Java, they are really tied up in the OS and then there is some stuff that you need out of Java. And everyone knows about GC pauses as latency problem and so on, but once you get below some milliseconds worth of pause time, you typically will have, in an un-tuned system, you will have pauses that are not related to Java or GC. There is a fabulous tool that Gil Tene at Azul has put out, I think it’s on Github, called jHiccup, and it basically just runs a tiny amount of load in the background on a regular basis and measuring the latency between something you ask to happen and when it actually happens. This records and plots and shows you , it uses 1% of your CPU, it sits in a background process, you let it run a little while, and if you do that and you look at many a modern system you’ll discover you have pause times of tens to hundreds of milliseconds that have nothing to do with Java, nothing to do with GC pauses, they have to do with your OS, your OS side is doing disk compression or one guy is doing all the disk I/O interrupts, and someone is got a big copy going on and the disk Interrupts are slamming this one CPU out of how many and that one guy is just saturated and he can’t respond in a timely fashion. And the other CPUs are hanging out and if you land on the bad guy, life sucks for you. Or the network traffic is pounding one particular CPU or another. Certainly if you’re running a hypervisor situation, hypervisors are doing all sorts of weird things to latency. I’ve seen one of the major standard hypervisors hand out a 500 milliseconds, half a second pause, which was clearly outside the OS or anything the JVM or Java was doing, the hypervisor put the OS slice to sleep for half a millisecond doing God only knows what and if you’re trying to do a low latency situation here and you take a random 500 millisecond pause, you’re pretty messed up.

So, you do a bunch of tuning at the OS layer, you get rid of hypervisors, you lock interrupts to particular CPUs, you keep processes away from them, you use Cgroups and use that to kind of define that, you have to use priorities and stuff and that kind of tuning is necessary to get below some a level of latency. At that point, once you’ve done that, it makes sense to bring Zing in and have them bring Java latency down to the next layer. So, Zing’s done a lot of work bringing latency down, when I left the QA people were viewing pauses beyond a few tens of microseconds as a bug. And those would be tracked, tested for, determined, and tracked as a bug and hunted down and fixed. I don’t know where they are at these days, but I think you can get below ten micros pretty reliably as your max pause times. But as part of that comes this package notion that you have to tune your OS for low latency because it’s not just the Java now, it’s well past being the Java issues.

Werner: Well, Cliff, you’ve given us a lot of interesting stuff to check out, so 0xdata is available online, it’s on GitHub.

Spelled 0xdata, H2O is the product, but 0xdata is the company, 0xdata.com and on GitHub.

Werner: It’s on GitHub, it’s open source and everybody who is watching this should check it out and thank you, Cliff.

Yes, please. You’re welcome.

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT