Bio Ari Zilka founded Terracotta in 2003 and is the company’s CTO. Combining business and technology leadership, Ari was an Entrepreneur-in-Residence at Accel Partners and worked at PriceWaterhouseCoopers. He was Chief Architect at Walmart.com. In the mid 1990s, Ari invented a new object relational database that still exceeds the capabilities and performance of database technology today.
Absolutely. RAM is the new disk. You need certain technologies to make sure that RAM is not volatile, that you're not going to lose the data you leave there, but RAM is cheap now. It's very cheap. It's so much faster than talking on the network to other machines and, you know, it's something that developers can use very simply. It doesn't require complex APIs, libraries, or interfaces to get at. So RAM as the new disk is something I think people should start thinking about.
Sure. So, I think, you know, going along the theme of RAM as the new disk, if you think about backup and recovery, years ago, we saw disk becoming the new tape, right? Because tapes could handle a couple of hundred gigs per medium and you need an autoloader or some complex software, etc. You can just load a SAN full of hundreds of disks and back up and rotate through those disks over time.
So, the things I'm seeing now are that disk is basically the new tape, right. It's how a lot of my customers are talking to us. I don't want to put stuff in the database because I realized that it doesn't belong in the database long term. So, the customers are seeing the database as a system of record sure, but they're seeing that database as this data is there for analytics purposes, for compliance purposes, for audit and logging, and security purposes. And you know the data that is in the database is mostly consumed by applications, not by humans directly and not by 4GL programs either. So you end up basically realizing that the database is just a place to put the data at rest for a while, it is your disk. So RAM is where you need the data. The more RAM you can get online, the more data you can handle at low latency.
So the trend we're seeing is people starting to treat databases like tape or like disk, as the place where they archive, and realizing that their applications are the thing consuming most of the data and creating most of the data. They also have spent the last 10 years plumbing up crazy amounts of telemetry into their applications. So now, they're throwing off data like you wouldn't believe.
You have applications like ESPN.com on eight servers, but mobile ESPN is on a thousand servers, right, to handle all the device proliferation: iPad, iPhone, Android, Blackberry. Everyone's checking the scores of the latest game while they're on the move. So the mobile cluster is way bigger. That's throwing off tons of telemetry that the business wants. The business wants to know who's watching what scores, who's reading which articles, etc., so you have a data explosion. And couple all of that together with increased complexity.
You know, I don't know about most folks, but I would rather run an 8-node cluster than an 8000-node cluster. It just scares me to think of 8000 nodes. I know it's sexy to a lot of people and Hadoop users talk about the hammer cluster, the disk cluster, 40,000 nodes and there's a place for all of that stuff. But if you want to build a website, if you want to build a customer self-service portal, if you want to build a financial services, you know, planning and trades engine, you want it as stable as possible. You want as much business functionality as possible. You don't really go into solving business problems with a desired number of servers.
So the trends I'm seeing are basically database is sort of like the new tape. RAM is getting really cheap. Simultaneously, data is exploding, the volumes are exploding and complexity is also on the rise and people want to solve all of those problems, simultaneously.
That's exactly what I see. Essentially, if you have a system where -- if you have a fundamental assumption where RAM is, you know, cheap, that means I no longer play in the 1-gig range, in the 2-gig range as a window of RAM on to a disk that's more like a terabyte of data. I can actually buy 144-gig machines easily, you know, for a couple of thousand -- not a couple of thousand dollars, but for less than $10,000. I can buy 250-gig machines. We recently bought a bunch of 512-gig machines at Terracotta and I thought they would be $25,000, $30,000, or $40,000 a piece with that much RAM in them. They were more like $16,000, right.
I actually have machines with 140-gig disk and 512 gigs of RAM. Why is that? Because all my business data is in the database, I don't need a bunch of local disk on the machine. But my database has a terabyte SAN attached to it so if I can get 512 gigs of RAM, I can pull the entire database into memory and where the application needs that data is right next to the applications processing. That's really a way to go about NoSQL fundamentally because the NoSQL problem is two-fold, interface and latency, right. The problem that NoSQL sets out as a community to address is interface and latency.
So in-memory, we're talking objects, right. So if I have to go through a marshalling overhead be it a relational format or a columnar format meaning like column families, rows, cells, things like that, whatever interface I have to go through to store my data, it's fundamentally a marshalling problem. I have to move things from objects into some non-object format and then back. So there's no more natural simple interface than objects and memory.
The problem with objects and memory is -- and they're also the lowest latency. So theoretically, if I defined NoSQL as low latency simple interfaces to data in a more natural fashion then in-memory is the best solution to that problem space. The only issue is in-memory is volatile. I'll lose it when I lose the process. And in-memory is not centralized or coordinated access across instances, right. Fundamentally, all applications end up scaling out. That's the attractiveness of the cloud is I could scale out on demand and one node is never enough first of all from an availability and reliability perspective, but also from a throughput perspective, you typically need more than what one box can do.
So all I'm really saying is if the problem is one of data access, data volumes, then the best way to solve the data access problem and the volumes problem is to pull as much of the data into as few nodes as you can in-memory.
Basically, you can't. If you tried to create -- I've come across two or three or four customers who have 60-gig heaps, meaning -Xmx and -Xms for the Java gear heads, set to 64g for example. And that works -- one guy said I've not had any pauses in five months. I delved into it a bit because I was curious. How can you have no GC, garbage collection, problems in a 64-gig heap? It's all read-only data. He never mutates that data. He loads it all from a database and then holds on to it and then people just do memcache-style remote access calls, get and put key values.
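For readers less familiar with the flags mentioned here, the following is a minimal sketch of how a running JVM can report the heap limits it was started with. The class name `HeapReport` is illustrative and not part of the interview; `-Xms` sets the initial heap size and `-Xmx` the maximum.

```java
import java.util.Locale;

// Reports the heap limits of the running JVM.
// Launch with, for example:  java -Xms64g -Xmx64g HeapReport
class HeapReport {
    /** Maximum heap the JVM will attempt to use, in bytes (governed by -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap currently reserved by the JVM, in bytes (starts near -Xms). */
    static long committedHeapBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    /** Formats a byte count as gibibytes with two decimals. */
    static String toGiB(long bytes) {
        return String.format(Locale.ROOT, "%.2f GiB", bytes / (1024.0 * 1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        System.out.println("max heap:       " + toGiB(maxHeapBytes()));
        System.out.println("committed heap: " + toGiB(committedHeapBytes()));
    }
}
```

Setting `-Xms` equal to `-Xmx`, as in the 64-gig example, asks the JVM to reserve the full heap up front instead of growing it over time.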
So I could brush that use case aside. Okay, that guy has a 64-gig heap, but it's not really, you know, an application node, it's a special case, key value server that he built by hand. Most folks I come across who are doing 64 gigs, they have multi-minute pauses multiple times throughout the day. One of my customers actually restarted his applications three times a day every four hours during the business day because garbage collection would get worse and worse and he couldn't figure it out.
Another customer figured out garbage collection pauses on 12-gig heaps. He figured it out for one instance of his application, but he had 12 different customers. It was a software-as-a-service use case. So he had 12 different deployments of the exact same software, and the 12 different customers sent such different workloads to the system that his tuning for his largest customer didn't apply to his other 11. He's tearing out his hair: I can't tune the same system for 12 different workloads, right.
So fundamentally, a Java heap is challenged to keep up with the data volumes and when you want to pull all the data into memory to get microsecond latency, you find that you end up suffering many-second pauses from the garbage collector. And, you know, so we've started working on ways to solve that problem at Terracotta and, you know, I have done a lot of work with customers hands-on with the solutions we've come up with and found that fundamentally, in most of these use cases, what's in that Java heap is a lot of cached information. They're pulling data from a web service, they're pulling data from a database, they're pulling data from a NoSQL store, they're pulling data from memcache and then holding on to it in memory because it's multi-milliseconds to get at that data store. And they're holding on to it in memory, but the in-memory cache of that data is itself causing Java to pause more and more and more.
So why the 64-gig heap, why the 12-gig heap? It's because it's full of business data. It's not that they need that much elbowroom if you will for parallel user request response or an application flow to run. Of course, there's analytics engines that are actually pulling in a bunch of data that crunch on them, but transactional applications need giant heaps because they're holding on to caches of information.
What most people are doing is chopping the box up, right. VMware bought SpringSource in part because they saw a huge pull of VMware usage into the Java market. And so why is there that pull for virtualization? It's because people are slicing up.
Let's take the 64-gig box you're talking about because that's a great size because that's the largest box Amazon will give you for a reasonable price, right, the extended memory server. Many of our customers run Memcached or Terracotta on a box that large in EC2. 64 gigs, you don't want to run one JVM on that large a box so you'll run eight JVMs at 8 gigs a piece or 6 gigs a piece leaving some room for the operating system to run in. And then you'll find out that 8 JVMs at 8 gigs a piece is very fragile. If the operator goes in and says kill -9 java, for example, on a UNIX box, he'll kill all 8 of them and now you're down. So you used 8 instances of your application to get more throughput, but it causes operational complexity. It also causes developer complexity because I have to keep those eight instances in sync.
How does the average developer keep eight copies of an application in sync? They build a stateless application flushing all its state to a database. So as soon as you scale out that application even if you're doing scale out on one box, one physical box, I'll still consider that scale out. If you scale out to multiple copies of your application, you cause more pressure on your database.
Those are the typical tools people have. It's basically I don't want big JVMs so I chop up a box. If I chop up a box and find that operational complexity is causing error, then I'll go to VMware where each instance is isolated. So now, I have eight guests running inside that 64-gig machine and that will work and that will address the operational complexity. It will be safer, it will be more stable, it will work better than if I ran it all on the multi-process on one physical host. But I still have the problem that I'm overloading my database as my central point of coordination.
So, you know, really fundamentally, you can't tackle scale up or scale out in a vacuum, right. And that's why Java devs are looking at NoSQL because it's a central store. Sometimes those stores are clustered, sometimes they're not. It's also a way that they can lay out the data in a more application-friendly fashion. But fundamentally, people are looking at their data store and saying that's my point of coordination, it's also my point of contention. And the tools they're using right now are: let's start with a simple application, let's then buy big boxes, 32-gig or 64-gig, let's chop up those boxes into many instances, and then let's go to cloud-computing frameworks and let's go to distributed caches to get those boxes, those instances, to stay in sync without bottlenecking on the central data store.
Right. That's really the elephant in the room, right. Because if we're buying 64-gig boxes in operations or 32-gig boxes and then chopping them up and scaling out on a single piece of hardware and then we realize, hey, we might as well go cloud because we have virtual machines. It's a vicious cycle sort of. Like we have to virtualize machines as the assumption because we can't use a whole box. And then once we virtualize, we realize we can stack a rack full of these machines and not need to know how many there are and we could dynamically allocate them using Amazon-style techniques, be it vCloud or Eucalyptus or Citrix Xen, what have you.
BigMemory is basically another way to skin this problem, right. If your first assumption is that you must chop up the machine, the physical machine into a lot of little ones and therefore that accelerates you into cloud-based deployment that's an interesting approach. But what we're saying is basically don't chop the machine up, right. If you're chopping the machine because you want a lot of data in-memory, but you don't want too much data to where Java slows down, that's what BigMemory is for.
BigMemory has been tested, it's in production at many shops and doing Java VMs north of 150, 250, 350 gigs of data in a single Java process. So when you do a PS in UNIX or you do Task Manager in Windows, you're literally looking at one process taking up as much RAM as you want it to. The value in that is not necessarily clear. People are saying what's wrong with chopping up the box, right. But we talked about it, it's complexity, operational risk, it's also a central point of coordination overhead. There's a point of contention. Everyone has to know what everyone else is doing. So 8 copies of a box in the worst case could have 8 times the -- 8 copies of an app in the worst case could have 8 times the overhead of one copy, right.
So BigMemory has -- the principles behind BigMemory are scale up first and then scale out.
BigMemory is essentially our own memory manager. You get to it through one of our interfaces, Ehcache or Quartz or what have you. Basically focusing on Ehcache for a second, when you do a get or a put, BigMemory is a separate chunk of memory that we've asked the Java process to hand us that we will manage ourselves. And so you may have a 1-gig or a 2-gig heap, but BigMemory could be a hundred-gig slab alongside it inside the process but not part of the heap, so the garbage collector won't look at it. So your put will actually optionally go to either Java heap or to BigMemory native memory that we're managing ourselves. With that, you get all the benefits of being able to deal with fragmentation and deal with cleanup when objects are removed from BigMemory in a much more efficient way than Java garbage collection.
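The general idea of a slab that lives inside the process but outside the garbage-collected heap can be sketched with a standard direct `ByteBuffer`. This is only an illustration of the concept, not BigMemory's actual implementation: the class `OffHeapSlab` is hypothetical, it is append-only, and it does none of the fragmentation handling or space reclamation described above.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical append-only off-heap store: values live in one direct buffer
// that the garbage collector does not scan; only a small index stays on heap.
class OffHeapSlab {
    private final ByteBuffer slab;                              // native memory, outside the Java heap
    private final Map<String, int[]> index = new HashMap<>();   // key -> {offset, length}

    OffHeapSlab(int capacityBytes) {
        this.slab = ByteBuffer.allocateDirect(capacityBytes);
    }

    /** Copies the value into the slab; returns false if it does not fit. */
    boolean put(String key, byte[] value) {
        if (value.length > slab.remaining()) {
            return false;
        }
        int offset = slab.position();
        slab.put(value);
        // Overwriting a key leaves the old bytes behind; a real manager would reclaim them.
        index.put(key, new int[] {offset, value.length});
        return true;
    }

    /** Copies the value back onto the heap, or returns null if the key is absent. */
    byte[] get(String key) {
        int[] location = index.get(key);
        if (location == null) {
            return null;
        }
        byte[] out = new byte[location[1]];
        ByteBuffer view = slab.duplicate();   // independent position, same underlying memory
        view.position(location[0]);
        view.get(out);
        return out;
    }

    /** True when the backing buffer is allocated outside the Java heap. */
    boolean isOffHeap() {
        return slab.isDirect();
    }
}
```

On HotSpot, the total amount of direct memory a process may allocate is capped by the `-XX:MaxDirectMemorySize` option, so a large slab generally needs that limit raised alongside a small `-Xmx`.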
So essentially if you look at BigMemory, as I said, sitting underneath the Ehcache interface and you have heap as one place Ehcache could store data with the classic garbage collection challenges and then you have BigMemory as this giant slab like we discussed with, you know, 64-gig or 300 gigs of data in it, all this would reside in the single JVM.
And the question is how do you decide what goes in heap and what goes in BigMemory? The answer is the system will decide it on your behalf given some tuning parameters that you specify.
So, the way to think about it is as a tiered store or a layered pyramid of options and option 1 would be the heap and option 2 would be BigMemory. The heap has the characteristics of basically being free to get to. I'm going to call it zero microseconds to access that data. But as we've discussed, you really don't want to put more than 1 gigabyte of data in that heap. That's that guy right there.
BigMemory has a tradeoff to it meaning you have to actually marshal data through a Java serialization interface. So everything that you put in here needs to be serialized and it will serialize on its way in and deserialize on its way out. We've tuned that serialization cost to be one microsecond. So I wouldn't be afraid of that interface because on the flipside of that one microsecond, which is almost imperceptible for most apps versus the native heap call, you get access to, you know, just under a terabyte of data, and that's in memory, in the JVM next to you. That is really the fundamental invention of BigMemory is give me access to a totally new scale of data at near the same latency I used to see.
Then comes in the Terracotta classic distributed cache solution. On the one hand, you can choose to put multiple copies of your app. This guy here, I can have 1, 2, 3, 4 of those scaling out using Terracotta or I could put data on my local disk. Why do I want to put data on my local disk on this JVM? That's basically a third place to tee off data because that way I can put lots of data in and, you know, I'm not too far away, 4 milliseconds let's call it from either of these two stores. But what happens is this JVM dies, right. I want to be able to get the data back into the JVM into BigMemory without having to go back to the system of record that I took many hours to warm up from as one example.
So I've got local disk and Ehcache and basically these are settings in Ehcache in the XML file or through the APIs. I say on heap equals 1000 objects, off heap equals -- and off heap is the internal name of the BigMemory store -- off heap equals 500 gig and disk persistence equals true. That would give me this layout, basically pure config.
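To make the tiering concrete, here is a rough sketch of the first two tiers of the pyramid: a small on-heap tier capped by entry count, overflowing into an off-heap tier capped by bytes, with Java serialization on the way in and out. Everything in it is an assumption made for illustration (the class `TieredStore`, its constructor parameters, the LRU overflow policy); it is not Ehcache's API or BigMemory's implementation, and the disk tier is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical two-tier store: live objects on heap, serialized bytes off heap.
class TieredStore {
    private final int maxHeapEntries;
    private final long maxOffHeapBytes;
    private long offHeapBytesUsed = 0;
    // Access-ordered map, so iteration starts at the least recently used entry.
    private final LinkedHashMap<String, Serializable> heap = new LinkedHashMap<>(16, 0.75f, true);
    private final Map<String, ByteBuffer> offHeap = new HashMap<>();

    TieredStore(int maxHeapEntries, long maxOffHeapBytes) {
        this.maxHeapEntries = maxHeapEntries;
        this.maxOffHeapBytes = maxOffHeapBytes;
    }

    void put(String key, Serializable value) {
        dropOffHeap(key);                          // a fresh put supersedes any spilled copy
        heap.put(key, value);
        if (heap.size() > maxHeapEntries) {        // heap tier full: spill the eldest entry
            Iterator<Map.Entry<String, Serializable>> it = heap.entrySet().iterator();
            Map.Entry<String, Serializable> eldest = it.next();
            String eldestKey = eldest.getKey();
            Serializable eldestValue = eldest.getValue();
            it.remove();
            spill(eldestKey, eldestValue);
        }
    }

    Object get(String key) {
        Serializable onHeap = heap.get(key);       // tier 1: no marshalling
        if (onHeap != null) return onHeap;
        ByteBuffer stored = offHeap.get(key);      // tier 2: deserialize on the way out
        return stored == null ? null : deserialize(stored);
    }

    int heapEntries() { return heap.size(); }
    int offHeapEntries() { return offHeap.size(); }

    private void spill(String key, Serializable value) {
        byte[] bytes = serialize(value);
        if (offHeapBytesUsed + bytes.length > maxOffHeapBytes) return; // tier full: entry is dropped
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        buf.put(bytes);
        buf.flip();
        offHeap.put(key, buf);
        offHeapBytesUsed += bytes.length;
    }

    private void dropOffHeap(String key) {
        ByteBuffer old = offHeap.remove(key);
        if (old != null) offHeapBytesUsed -= old.capacity();
    }

    private static byte[] serialize(Serializable value) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Object deserialize(ByteBuffer stored) {
        byte[] bytes = new byte[stored.remaining()];
        stored.duplicate().get(bytes);             // duplicate keeps the stored buffer's position intact
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With `new TieredStore(1000, 500L * 1024 * 1024 * 1024)`, the two limits loosely mirror the "on heap equals 1000 objects, off heap equals 500 gig" settings described above. A caller only sees `get` and `put`; which tier an entry lands in is decided by the store.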
Now in the Terracotta array, it's important to point out that this pyramid repeats itself one more time. I have heap for the Terracotta servers and I have BigMemory for the Terracotta servers, which is really where the invention got its start. It's we wanted to make it so that when I stored Ehcache data across nodes to coordinate and centralize access to all that data. We wanted to make sure that the Terracotta servers I stored all that data on were tuning free and never paused and so we created BigMemory for ourselves and then pulled it up into your application tier for you to be able to use as well.
Sure. That's a great question. The way I'd want to answer it is let's take a specific example of for example 100 gigs of data right and let's say in this pane, I'll paint it in a fictitious 100-node grid. Basically, what will happen is I have my app here and let's assume the grid is separate from the app. So I'll have my grid layer below this. They'll basically be grid nodes 1, 2, 3, to 100. And the app has a small library in it, the grid access library, that will actually decide which node to go get a particular key or value from. So this is a system that will give me a hundred nodes worth of throughput. The latency to get to this is somewhere between -- you know, it's milliseconds latency. It depends on the technology that's in use, be it ours or someone else's, but it's somewhere in the milliseconds range. So it could be, call it 4 milliseconds for the average technology that I've seen. Some technologies might be a 100 milliseconds, right? But this is milliseconds to get at your data.
The thing is I can get at my data on this node at the same time that I'm getting at data on this node or this node. So I have a hundred parallel channels at 4 milliseconds a piece. One way to think about it is I could get a hundred keys at the same time in the same 4 milliseconds latency. But a hundred gigs of data would basically be 1 gigabyte per node and let's say I had a query that needed 4 keys out of it, I would have to make a call maybe to this node, this node and this node. And so a data grid gives me unbounded capacity for the amount of data it stores and it gives me fixed latency, but it gives me, like I said earlier, high latency and a lot of nodes, a lot of complexity.
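The "grid access library" that decides which node holds a key can be sketched as simple hash partitioning. The interview does not say how any particular product routes keys, so the modulo scheme and the `GridRouter` class below are assumptions; real grids often use consistent hashing or a partition table instead.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical client-side router: maps each key to one of N grid nodes.
class GridRouter {
    private final int nodeCount;

    GridRouter(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        this.nodeCount = nodeCount;
    }

    /** Node index in [0, nodeCount) that owns the key. */
    int nodeFor(String key) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    /** Distinct nodes that must be contacted to fetch all of the given keys. */
    Set<Integer> nodesFor(Collection<String> keys) {
        Set<Integer> nodes = new TreeSet<>();
        for (String key : keys) {
            nodes.add(nodeFor(key));
        }
        return nodes;
    }

    public static void main(String[] args) {
        GridRouter router = new GridRouter(100);
        // A query needing 4 keys may fan out to as many as 4 different nodes.
        Set<Integer> nodes = router.nodesFor(
                Arrays.asList("user:1", "user:2", "score:nba", "article:42"));
        System.out.println("nodes to contact: " + nodes);
    }
}
```

Because the calls to different nodes can be issued in parallel, the query latency stays close to one network round trip, but each additional key can add another node to the fan-out.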
So the other way you can tackle this is the traditional database way, which is, and you can do this now with BigMemory, app, 100-gig data store, and then connected into that app 100-gig BigMemory slab. So this gives me all my data at 1 microsecond. I don't have to traverse the 25 milliseconds to get to the database and back and this is also a SQL interface. So I have my objects in object format in BigMemory at 1 microsecond that would give me the same hundred gigs in BigMemory, but I'd have the problem that if I lost this node, I'd lose all these data.
So the best way to do this is hybrid of these two approaches and it's basically let me run two copies of my app for high availability so app instance 1 and app instance 2. I could for example put all 100 gigs into BigMemory on both app instances and then I would coordinate access to those two instances so that if this guy mutates the data, this guy can see it using a Terracotta array instead of a database below. So I'd put BigMemory here, here and here and end up with a small Terracotta array. Instead of a 100-node data grid, I'd end up with a small number of Terracotta servers, one or two maximum. I'd end up with two copies of my app. I'd have HA of the data, I'd have central coordination with no single point of failure and I'd have very few application nodes each with 1 microsecond latency. So this is 1 microsecond, this is 4 milliseconds, this is 4 milliseconds. So I get the best of both worlds, I get the 1 microsecond access, I get the centralized coordination, but I don't get the proliferation of nodes at the same time. So I've solved the latency problem and I've solved the operational complexity problem.
Sure. So in a nutshell, garbage collectors are the antithesis of predictability. It's possible to tune them, but the typical garbage collector will do incremental collections on the order of milliseconds and then full pauses on the order of seconds or minutes. The point of BigMemory is to take the cache out of the view of the garbage collector so the old space never needs a full pause to clean. So, concurrent mark sweep collectors will work better in the HotSpot VM for example and you'll end up in a place where you're doing microsecond operations. Instead of zero microseconds, you're doing 1 or 2-microsecond operations for every cache call, but in exchange for that, your multi-second or multi-minute pauses theoretically go away.
One example of this would be -- on that thread, I also talked about a customer who had gone seven days without a full pause. That was a while ago. At this point, they still haven't paused and we're talking months later. So the name of the game for BigMemory is to get rid of the full pauses and that's where predictability comes from. It's not that you have a pauseless application, you're still using Java in the Java runtime but you will pause for the milliseconds and do all the incremental work only.
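One way to check a claim like "no full pause in months" on your own application is to read the collection counters the JVM exposes through the standard `java.lang.management` API. The sketch below is only a starting point: the class name `GcReport` is made up, collector names vary by JVM and by the collector selected, and `getCollectionTime()` is an approximate accumulated elapsed time that, for concurrent collectors, is not purely stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints how often each collector has run and how long it has spent collecting.
class GcReport {
    /** Total collections across all collectors; undefined counts (-1) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
        System.out.println("total time (ms):   " + totalCollectionMillis());
    }
}
```

Sampling these counters periodically and watching the old-generation collector's count is a simple way to see whether full collections are still happening after a cache has been moved out of the heap.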
11. You mentioned something you learned at QCon about total cost of ownership and consolidating on those boxes. So how does that play in when you can actually come down to a few very expensive large boxes versus a lot of cheap small boxes?
Right. I think the most succinct way to answer total cost of ownership is through a quick anecdote. So back to the 4-terabyte customer who I said we could do with 10 boxes, I actually got on a call with their head of operations and he wanted to start buying machines and he said, what spec machines in Terracotta's experience should I buy? And my answer to him was the kind of boxes I buy, which are these 500-gig monsters from Dell for, you know, less than $20,000 each. He said, I'm not interested in 500-gig machines, Ari, because they're only useful to Terracotta, you're the only guys with the BigMemory technology, what would I do with a 500-gig machine if not for Terracotta?
And I said, I understand that, but remember you can always take that kind of box and virtualize it the way you used to, chop it up into a lot of small machines. So it's not just us that could use that. You can run databases on it with huge file system caches. A lot of applications would benefit from a lot of RAM today. We just benefit from it at -- you know, we max out a machine's RAM so we really enjoy a lot of RAM.
But what I did for that customer is I took a bunch of vendor solutions and built a quick Excel spreadsheet for them showing what if I did this at a hundred gigs per box, which he was comfortable with. Don't ask me the difference between 100 gig and 500 gig those are both big machines in my view. But I did it on 100-gig machines, I did it on 500-gig machines. I did it on 300-gig machines from Cisco that can do 300 gigs with only two CPU sockets used versus Dell, which needs four sockets to get that much RAM. And what we found is the most expensive part of the machine was (a) obviously the total number of machines led to price. So doing it 100 gigs at a time cost him $300,000 or $400,000 for that 4 terabytes of data. Doing it 500 gigs at a time cost him as low as $120,000. And doing it 500 gigs at a time on Dell with four CPUs per box versus Cisco with two CPUs per box, with Cisco's UCS technology we got the $120,000 from Cisco, we got $180,000 from Dell. So while Cisco is more expensive per box, it gives you a better RAM to CPU density ratio.
And so total cost of ownership basically came down to very few -- medium size boxes worked great. CPUs were expensive. Disk, you know, machines were expensive. RAM was not the driving price so 4 terabytes of RAM done 100 gigs at a time was the most expensive by a factor of 4. I think that's a really telling tale, right, because most apps are idle on the CPU. So I like the Cisco story, I like the Dell story as well. I like, you know, 500 gigs per box, 200 gigs per box. It gives me a lot better ROI from my operational dollar, but I really like the story of -- since most Java apps are, you know, using 30% to 50% of CPU at best and they're not running at 90%, why not go for the fewest number of CPUs possible to get the job done.
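The box-count side of this comparison is simple arithmetic, sketched below. It assumes 4 terabytes means 4000 gigs and ignores any headroom for heap, operating system or replication, which may be why the anecdote speaks of roughly 10 boxes where the raw division gives 8. The dollar totals are the ones quoted in the answer above; the class `BoxCount` and its method names are illustrative.

```java
// Back-of-the-envelope sizing for holding a data set entirely in RAM.
class BoxCount {
    /** Number of boxes needed to hold dataGb when each box offers ramPerBoxGb (ceiling division). */
    static long boxesNeeded(long dataGb, long ramPerBoxGb) {
        return (dataGb + ramPerBoxGb - 1) / ramPerBoxGb;
    }

    /** How many times more expensive one quoted total is than another. */
    static double costRatio(double higherTotal, double lowerTotal) {
        return higherTotal / lowerTotal;
    }

    public static void main(String[] args) {
        long dataGb = 4000; // assumption: 4 TB taken as 4000 GB, with no headroom

        System.out.println("100 GB boxes: " + boxesNeeded(dataGb, 100)); // 40
        System.out.println("300 GB boxes: " + boxesNeeded(dataGb, 300)); // 14
        System.out.println("500 GB boxes: " + boxesNeeded(dataGb, 500)); // 8

        // Totals quoted in the interview: $300,000 to $400,000 at 100 GB per box,
        // as low as $120,000 at 500 GB per box.
        System.out.println("low-end ratio:  " + costRatio(300_000, 120_000));
        System.out.println("high-end ratio: " + costRatio(400_000, 120_000));
    }
}
```

On the quoted totals the small-box option comes out roughly 2.5 to 3.3 times as expensive, in the same ballpark as the "factor of 4" mentioned above.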
[InfoQ thread: http://www.infoq.com/news/2010/09/bigmemory]
The short answer to that is most things in computer science have been done before, right. I didn't invent distributed caching. I didn't invent large slab memory managers. I didn't invent most of the things -- I didn't invent the network. I didn't invent the processor. So I agree it's probably been done 10 years before, but it hasn't been done by the distributed caching vendors. Distributed caches are fundamentally about chop up a data set into a few gigs at a time. And when you take those systems and run them at 64 gigs, you're running Java heaps that large and then you're suffering pause problems, predictability problems.
So, while direct memory byte buffers have been used by many applications before, we're not the first to do that. Huge slab memory managers have been built by Linux for one example any operating system vendor. We are the first to put it all together in one place for one user under the Ehcache interface. And basically saying we've brought high density, low latency, high reliability, you know, read/write memory to bear for you in the Java application underneath the Ehcache interface with no tuning required, no special API or special complex way to program your apps.
That's a great question, right, because I kind of look at that as why take this approach, why take Terracotta at its word, why take Ari at his word and just use BigMemory and not look anywhere else. The answer is anywhere else you're going to look is going to take a divide and conquer approach. That leads to complexity of implementation. So there's not anything out there right now giving you a Java process with in-memory microsecond access.
And you said regardless of interface. I think interface is important because Ehcache is already there for more than half of the applications out there, they're already using it. But if you take Ehcache and set the interface aside so interface doesn't matter, you don't want to run a 64-gig heap and tune it by hand. That's option 1 and I don't recommend that. You don't want to divide and conquer because divide and conquer immediately introduces distributed cache and coordinated access, centralized access. If I write to the data on one node, the other nodes need to know about it. So you want to avoid distributed caching simply for the sake of handling 64 gigs of data. If you need more than one node, you need more than one node, but do that when you need more CPU or more network throughput not when you need more RAM.
Other options would be like virtual machines that are supposed to remove the complexity and the risk of operation of many instances of an application. They achieve that they seek to address, which is the operational complexity. But what they don't do is avoid the central point of coordination or bottleneck. So fundamentally you're dealing with run a giant Java machine and tune it by hand or divide and conquer, chop up the data and run multiple copies of your app, which thrust you into the complexity of multi-node computing potentially just for the sake of large data.
BigMemory vs Azul
It would be interesting to have a comparison between these 2 solutions.
Maybe one advantage of Azul system, is that it can manage a single large heap,
so complex graphs of objects don't have to be serializable.
Re: BigMemory vs Azul
For me, I think we need to bring some kind of managed memory concept to JVMs for managing very large heaps. It's been 30 years of GC research right now and it's proving to be a hard problem to solve. Most long term state in a JVM is likely sessions or cache related and thus is usually managed by some framework. Such frameworks could leverage a managed memory API (think java collections for off heap). Care would need to be taken to isolate on-heap and off-heap (no references between them and so on).
The Azul solution is pretty cool though if you can get it. I think caching solutions will all be pushed towards some kind of off-heap solution to allow larger JVM sizes, more efficient memory usage (POJOs are seriously inefficient memory wise) and fewer JVMs to manage from an operations point of view.
BigMemory vs. Azul vs. Redis