Bio Manik Surtani is a core R&D engineer at JBoss, a division of Red Hat. He is the founder of the Infinispan project, which he currently leads, and also leads the JBoss Cache project. Surtani is a strong proponent of open source development methodologies, ethos, and collaborative processes, and has been involved in open source since his first forays into computing.
JavaOne is all about exploring Java and related technology. From special exhibitions to demos and partner offerings to community networking opportunities, you'll find an endless source of inspiration as you explore JavaOne 2011. JavaOne offers the best Java-focused content, training and networking.
1. Hello, my name is Rick Hightower, here at JavaOne with Manik Surtani, and we’re going to ask him some questions with regard to data grids, Infinispan and JSR 347. Manik, why don’t we just start with a quick introduction about who you are and a little bit about your background?
My name is Manik Surtani. I work for Red Hat, specifically Red Hat’s middleware division, also known as JBoss. My background is that I’ve been hacking on JBoss for years. Even prior to the acquisition by Red Hat I used to write the clustering code for JBoss AS, and most recently I’ve been working on Infinispan, which is a data grid platform that I started a couple of years ago.
Data grids are very interesting things. They’ve been around for a while; they’re not necessarily very new technology in terms of concept. People tend to use data grids primarily as a distributed cache, or at least that’s how they used to be used; that is changing fast and we can talk about that in a bit. But as a distributed cache, they are extremely useful things, especially if you are building an application that needs certain characteristics like high availability, scalability and elasticity, but at the same time needs very fast, low-latency access to data. Traditional storage mechanisms like file systems and databases don’t really provide this for you. They make it kind of hard to do that: the latency of touching disk is quite high, it’s expensive. If your application needs very low-latency, very fast access to data, data grids are the way to go.
The reason why I call them a distributed cache is because traditionally they’ve been used in front of a database, so you still have a database or a file system or some sort of persistent storage mechanism. But you stick a data grid on top of that and your application talks to the data grid, so you get really quick access to data, but at the same time the data is actually stored somewhere else, in a database.
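The "cache in front of a database" arrangement described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `ReadThroughCache` is a hypothetical name, the backing store is modelled as a plain function rather than a real database, and everything runs in a single JVM with no distribution at all.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: an in-memory layer sitting in front of a slower store.
class ReadThroughCache<K, V> {
    private final Map<K, V> memory = new ConcurrentHashMap<>();
    // Stands in for a database or file system lookup (an assumption of this sketch).
    private final Function<K, V> backingStore;

    ReadThroughCache(Function<K, V> backingStore) {
        this.backingStore = backingStore;
    }

    // Serve from memory when possible; otherwise load from the store and keep a copy.
    V get(K key) {
        return memory.computeIfAbsent(key, backingStore);
    }
}
```

The application only ever calls `get`; the first request for a key pays the cost of the slow store, and later requests for the same key are answered from memory.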
Like I said, they’ve been fairly popular things - data grids - and they’ve been around for a fair amount of time. When I say "fair amount" it’s maybe 8 years or so. They’re quite popular and there are a few products out there that provide this functionality, but each of these products has kind of evolved in its own space, at its own pace, in its own style. And while they have tended to converge today in terms of feature sets (they all give you some form of clustering, they all give you some form of persistence, they all give you some form of eventing and listening and so on and so forth), they all do it in very different ways. Their APIs are very divergent and that makes it very hard for the average Java developer. It means that your learning curve is very steep, it means that portability is almost nonexistent and so on and so forth, and most people don’t like that. I mean, one of the things Java has always done right is standardization: to make things easy, to make things portable and so on and so forth, and suddenly you can’t do that with data grids. So this is why I think it’s very important to build a standard around data grids as well, just the way Java has with servlets, with EJBs, with a whole host of other things.
Personally I think that while data grids have been popular in the past, it’s usually been for very niche industries - things like financial services, telcos, extremely high volume e-commerce sites and things like that, but not general purpose applications. So by and large, your average Java developer did not need to know or care about data grids, but that’s been changing over the last few years, especially with the advent of Cloud and Platform-as-a-Service and being able to deploy stuff on Infrastructure-as-a-Service. Suddenly data grids are extremely important because cloud nodes tend to be ephemeral, they tend not to have as much persistent storage, or persistent storage is difficult, and data grids try to help alleviate that problem; they act as a layer in between. So suddenly data grids are a lot more popular for a much wider class of application.
5. There is a set of APIs that have come out recently or are in the process of being formed, one being JSR 107, the other being JSR 347, which you are leading up. Where do these overlap, what are their differences, and where does one sort of pick up and the other leave off, if you will?
JSR 107 is the temporary caching API for Java, and the goal of JSR 107 is to provide a temporary caching mechanism for applications, a standard API for this. That’s very useful and very important in its own right: it provides a mechanism to store and retrieve state in memory, to attach things like expiry and eviction parameters to it, to do "write-through" and "write-behind" and so on and so forth. But that’s not a data grid, that is just a temporary caching API. While that temporary caching API is very useful for a standalone cache or even a distributed cache, that’s still not quite a data grid. I think JSR 347, which is the data grids API, kind of adds to 107. It builds on top of 107 and adds things like asynchronous operations, where you can do a "get" and have it return a future as opposed to the actual value, because you know that this is going to go across the network. Your biggest bottleneck is going to be network communications, so having a non-blocking API is important.
It also adds things like grouping APIs to be able to control the colocation of entries - this is stuff again that’s irrelevant to 107 but very relevant if you are running in a distributed environment, to be able to say "These items of data need to be stored together, just as an optimization for efficiency." For mapreduce there is remote code execution, to be able to say "I know my data lives out there in the grid, I don’t want to pull the data back here, I want to send in a process instead and move that process to where the data is and run that locally." That again is very important for distributed computing and distributed data structures like this. Eventual consistency, that’s kind of important as well for data grids because the network is always going to be your most - for want of a better word - flaky part of the system, the most unreliable and unstable part of the system. So you want your system to be resilient to network failures and things like partitions and split brains and so on and so forth, so an eventually consistent API, again, is important as well.
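To make the "get that returns a future" idea concrete, here is a minimal sketch. It is not the JSR 347 or Infinispan API: `AsyncCacheSketch` and `getAsync` are names chosen for this illustration, the "remote" data is just a local map, and `CompletableFuture` from the JDK is used as a stand-in for whatever future type a real data grid API would return.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a non-blocking cache read.
class AsyncCacheSketch<K, V> {
    // Stands in for entries held on other nodes (an assumption of this sketch).
    private final Map<K, V> remote = new ConcurrentHashMap<>();

    void put(K key, V value) {
        remote.put(key, value);
    }

    // Returns immediately with a future; the lookup itself runs on another thread,
    // which is where a real grid would be waiting on the network.
    CompletableFuture<V> getAsync(K key) {
        return CompletableFuture.supplyAsync(() -> remote.get(key));
    }
}
```

The point of the design is that a caller can issue several reads back to back and only wait once, when the values are actually needed, instead of paying one network round trip after another.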
Keep in mind though, that all the stuff that I said about JSR 347, none of these is carved in stone, it’s a very new JSR. In fact, just yesterday evening we had a BOF session to try and hash out exactly what should go into 347, so at this stage it’s still very much open. These are some of the things I’m hoping we will see in 347, there may be more, there may be less.
6. Looking at JSR 347 and looking at Infinispan and then reading various blogs written by you and others, it seems to me that there may be some overlap between what some would consider even an enterprise cache versus a data grid versus a NoSQL solution. How would you differentiate between for example a data grid solution like Infinispan and NoSQL?
From my point of view, when you talk about NoSQL, I’m assuming you are referring to a distributed NoSQL solution as opposed to a standalone NoSQL solution like CouchDB or Redis, which tend to be single node, single VM, if you will. If you want to compare distributed systems, data grids and NoSQL have kind of come from different starting points, if you will. They solve different problems, but where they stand today they’ve kind of converged. Data grids have been primarily in-memory, but now they spill off onto disk and so on and so forth, and they’ve added querying and mapreduce on top, while NoSQL systems have primarily been on disk, but now cache stuff in memory anyway for performance. They are starting to look the same now, or are very similar.
One big difference though that I see between data grids and NoSQL, something that still exists today, is how you actually interact with these systems. Data grids tend to be in-VM, they tend to be embedded: you tend to launch a Java or JVM program, you tend to connect to a data grid API and you work with it, whereas NoSQL tends to be a little bit more client-server, a bit more like old-fashioned databases where you open a socket to your NoSQL database or your NoSQL grid, if you will, and start talking to it. That’s the biggest difference I see today, but even that will eventually go away.
Prior to Infinispan I used to run a project called JBoss Cache. JBoss Cache was the clustering toolkit we used in JBoss AS to kind of build out the clustered session state within the app server so we could have a clustered app server. We spun it off as a separate project in the end, just for modularization of code and things like that, and since it was open source a lot of folks started using JBoss Cache as a data grid. JBoss Cache was never designed to be a data grid, it was designed to be a clustering toolkit and in-memory cache. Of course, things didn’t work as well. I mean, it wasn’t designed to be a data grid, people were trying to use it as one, and that’s when we realized how many people actually do want a data grid and that we actually can build a data grid out of something very similar to JBoss Cache.
So we took a lot of ideas from JBoss Cache, a little bit of code as well, and that’s where Infinispan came from. We kind of fundamentally re-architected the core data structure and stuff to actually make it scalable. As a result, Infinispan kind of serves the purpose of that clustering toolkit that JBoss Cache once was, but also serves the purpose of a data grid. Infinispan in fact, in the current JBoss AS 7 series, is used as the clustering mechanism for HTTP session state and EJB session state.
To add to your question as to where one would run into Infinispan and who would use it and why, I mentioned at the very beginning that people tend to use data grids as a performance boost over databases, as a distributed cache. So that’s where I see a lot of usage at the moment, where people tend to have applications that rely on a database and they realize the database is a bottleneck - it’s either too slow or it does not scale and so on and so forth - and then they introduce Infinispan as a layer in between their application and their database and that helps them scale out.
That’s more of your distributed cache use case, that’s today probably the most popular use case for Infinispan, but that’s also changing. People are now starting to look at Infinispan as a data grid, as a primary store of data without the database altogether, more like a NoSQL solution and that’s more interesting to me, because I think the scope there is far greater.
8. One of the things I noticed about Infinispan and actually other solutions in general just out there in the wild, is a lot of them had added wire protocol support for Memcached. So it has the Memcached wire protocol, which is important for various reasons, but I also saw that you guys had added your own wire protocol called Hot Rod and people have started developing non-Java clients. Can you go into a little bit of why you support Memcached wire protocol and why you decided to come up with a different wire protocol and have people write clients to that?
A little bit about wire protocols in the first place and why we did it. Like I said earlier, data grids tend to be primarily in-VM. You primarily communicate with a data grid within your Java VM. That’s kind of cool and very handy, but one of the consequences is that your application has got to be a Java VM application, it’s got to be something that lives within the Java VM. It does not necessarily have to be Java, it could be Scala, JRuby, whatever, but it’s got to live in the VM. Now that’s not so cool for certain applications. There are lots of applications out there written in C, in C#, in .Net that also want to use Infinispan and they can’t, and this is why we ended up with this whole wire protocol thing, where remote clients can talk to Infinispan more like a NoSQL solution these days and you can just speak to the grid over a particular wire protocol.
We initially used Memcached as the wire protocol because it was pretty well-known, it was quite easy to implement and it was fairly mature, but more importantly there are clients for this Memcached protocol that have been written for pretty much any platform on the planet, whether it’s Erlang or Perl or whatever. There are loads of Memcached clients for pretty much anything. That made it very easy for people to ramp up and start using Infinispan regardless of what they’ve written their application in. That was a good starting point but we thought we could do better, so that’s where Hot Rod comes from. Hot Rod is a binary protocol that we came up with, very similar to the Memcached protocol in that the verbs are similar. You do "puts", you do "gets", you do "removes", but unlike Memcached, Hot Rod is a two-way protocol. It’s not just clients talking to the server or to the server cluster, the server cluster can also talk to the clients in the case of Hot Rod and that’s interesting for a number of reasons.
Initially, in version 1 of Hot Rod, we used this capability to inform clients of the server topology changing. So if nodes die and new nodes are provisioned and so on and so forth, you don’t need to sit and restart your clients or reconfigure your clients saying "This is the new address on the backend that you can talk to." That stuff is dynamically pushed to the clients every time there is a change in the backend. That makes things very easy to manage, for one thing. Further to that, we also use that capability to push the consistent hash function being used in the backend cluster to the clients. What I mean by this is that Infinispan uses a consistent hash algorithm to determine which nodes in the backend act as hosts for certain entries, for certain bits of data. With the Memcached protocol, clients aren’t aware of this and they just ping a request randomly to any node in the backend to retrieve a certain piece of data. If that node does not have it, that node will ping another node in the cluster to find it and bring it back.
It works, but it can get expensive, whereas if you are using Hot Rod, your client can be smart about it, it can be intelligent and it knows precisely which node in the backend has that data it’s looking for and it can ping a request directly there, grab something locally out of memory much faster and much more efficient. That’s Hot Rod version 1.
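As a rough illustration of how a client that knows the hash function can work out the owning node by itself, here is a small consistent hash ring. It is a generic textbook sketch, not Infinispan’s or Hot Rod’s actual algorithm: the class name `HashRingSketch`, the use of CRC32 and the "points per node" parameter are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical sketch of a consistent hash ring that a smart client could hold locally.
class HashRingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    HashRingSketch(List<String> nodes, int pointsPerNode) {
        if (nodes.isEmpty() || pointsPerNode < 1) {
            throw new IllegalArgumentException("need at least one node and one point per node");
        }
        // Place each node at several positions on the ring to spread keys more evenly.
        for (String node : nodes) {
            for (int i = 0; i < pointsPerNode; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    // The owner is the first node position at or after the key's hash, wrapping around.
    String ownerOf(String key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // CRC32 is used only because it ships with the JDK (an assumption of this sketch).
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

With a ring like this on the client side, a request for a key can go straight to `ownerOf(key)` rather than to a random node, and when a node leaves the cluster only the keys that node owned are reassigned.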
We’re working on ideas for version 2 of Hot Rod and we want to add a lot more capabilities to make use of this two-way channel. One of the things we are looking at is eventing, so your remote client can almost act as an embedded client, as an in-VM client and attach listeners and stuff to the entire backend grid saying "Tell me when these things happen. When this particular bit of data has entered or these particular keys change" and things like that, which makes this very powerful.
In terms of people writing clients for Hot Rod we started out with a very simple Java client as reference implementation, if you will, "This is how you write a Hot Rod client." People have taken that and have translated that to Python and Ruby, so we now have Java, Python and Ruby clients for Hot Rod. We’d like to see some more as well. I’d like to see maybe not as many clients as Memcached has, but quite a few more. Specifically, I’m calling out to the C communities, the .Net communities - we’d love to see clients for your platforms for Hot Rod.
Yes. The specification is quite clearly defined on the Infinispan wiki, but more than that, because we already have clients in Ruby, in Python and in Java and they are all open source, you can read those and use them as a reference, literally, and then write one for whatever other platform you choose.
Yes, that is correct. 5.1 is the current series, code name "Brahma".
5.1 is our latest release. We are currently at BETA1, which was released today actually, and there have been a bunch of changes, not just in transaction management, but I’ll start with that, I’ll talk about that first. We kind of refactored a lot of how we deal with JTA and XA transactions within Infinispan. Most of it is just simple refactoring of code, but in the process, a lot of other interesting things have come out, lots of interesting optimizations, and we are going to see all of that hit in this release. One of the big things is we used to have this concept of transactions and they were primarily optimistic, in that locks were acquired at prepare time and they would fail lazily, if you will, if you could not acquire them. But that was not enough: a lot of people wanted a pessimistic approach as well, where you could fail fast if you could not acquire the locks you need, so we had this thing called eager locking. Most of the work is just how we normalize all of this stuff, because in the past we had layer upon layer of different locking schemes, and this time we’ve cleaned all that up to provide two locking schemes - optimistic and pessimistic - over your XA adaptors. That makes things much cleaner to understand, much easier to configure and it performs a lot better as well. It’s just a cleaner code base, but that’s just step one.
We’ve done a lot of other interesting things as well, including lock reordering if you are using optimistic locking and that suddenly means your probabilities for deadlocks drop significantly. Even though we do have deadlock detection and things like that, deadlock detection does add a certain overhead to your regular transactions. Reordering locks makes things much more efficient. Further to that, the other very interesting thing we’ve done with transactions is we’ve added support for XA recovery - in the past we never had that. We understand that not a lot of people use recovery in an XA environment, it’s kind of an extreme case because it does add a big overhead to things, but for those who do and for those who need it, we now support that as well.
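The reason lock reordering reduces deadlocks can be shown with a small sketch. The interview does not say how Infinispan orders its locks, so this example simply assumes keys are sorted by their natural ordering; `OrderedLockingSketch` and its methods are hypothetical names, and the locks are ordinary in-JVM `ReentrantLock`s rather than anything cluster-wide.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: always acquire per-key locks in one agreed order.
class OrderedLockingSketch {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Sort (and de-duplicate) the keys so every caller locks them in the same sequence.
    static List<String> lockOrder(Collection<String> keys) {
        return new ArrayList<>(new TreeSet<>(keys));
    }

    void runLocked(Collection<String> keys, Runnable work) {
        List<ReentrantLock> held = new ArrayList<>();
        try {
            for (String key : lockOrder(keys)) {
                ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
                lock.lock();
                held.add(lock);
            }
            work.run();
        } finally {
            // Release in reverse order of acquisition.
            for (int i = held.size() - 1; i >= 0; i--) {
                held.get(i).unlock();
            }
        }
    }
}
```

If one transaction touches keys "a" then "b" and another touches "b" then "a", both end up locking "a" first, so neither can hold one lock while waiting forever for the other.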
12. One other thing that’s in JSR 107 is CDI support and since JSR 347 basically extends, if you will, or builds on top of JSR 107, or I should say uses the same base API, it also has CDI support. I just wanted to get your thoughts on CDI, why it’s important and then specifically maybe about the CDI support that’s in JSR 347, 107 and Infinispan.
CDI I think is one of the major innovations we’ve seen in the enterprise Java space in the last few years. I think it’s suddenly made a lot of relatively complex things very much easier to consume and to use by developers. It makes things easier to test as well, to unit test, and generally to understand and maintain. CDI is generally a very important, very good thing and we wanted to extend that to Infinispan as well, and then to data grids and JSR 107 as well. This is stuff that other platforms and other languages have actually had for a long time, that Java has actually missed out on, if you will. A very simple example: with JSR 107 you can actually annotate a method saying "Make sure you cache the results of this method" so that on subsequent calls of that method you don’t actually invoke the expensive code path of generating stuff or building objects or whatever.
It’s very transparently placed into a cache, with expiry timeouts if need be and things like that, without all the complex juggling of dealing with the cache: taking some stuff and putting it in, then reading it out of the cache, and if it’s not there you go and generate it again and put it in. All that stuff is done for you automatically by just annotating a method saying "This method should be cached." Stuff like that, that’s an example of CDI making caching easy. I think that applies to data grids as well, as you said, because it is a distributed cache in some senses. We are at the moment talking with the CDI expert group about adding more support for data grids, what else we can do. Some of the ideas we’ve tossed around are things like distributed code execution and mapreduce. If you have a piece of code that you want to execute in the grid, what can we do to make that easy for people, instead of using the mapreduce API, which can get quite complicated and where you’ve got to kind of understand how mapreduce works for that to happen.
The whole thing about CDI is just to make things easy. You don’t actually have to understand the nitty-gritty of how things actually work underneath.
13. One thing that was being proposed perhaps in JSR 347 that is also at the heart of Infinispan is this idea of a fluent API. So, in addition to CDI, there is a fluent API that makes things like mapreduce easier? Why don’t you expand on that? I want to hear your thoughts on that.
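The "juggling" that such an annotation hides looks roughly like the following when written by hand. This is a plain-Java sketch, not the JSR 107 or CDI machinery: `CachedResultSketch` is a hypothetical wrapper, the expiry handling is deliberately simplistic, and the clock is passed in as a parameter purely so the behaviour is easy to exercise.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch of what "cache the result of this method" does behind the scenes.
class CachedResultSketch<A, R> implements Function<A, R> {
    private static final class Entry<T> {
        final T value;
        final long expiresAt;
        Entry(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<A, Entry<R>> cache = new ConcurrentHashMap<>();
    private final Function<A, R> expensive; // the method whose results are cached
    private final long ttlMillis;           // expiry timeout for each cached result
    private final LongSupplier clock;       // current time in milliseconds

    CachedResultSketch(Function<A, R> expensive, long ttlMillis, LongSupplier clock) {
        this.expensive = expensive;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public R apply(A arg) {
        long now = clock.getAsLong();
        Entry<R> entry = cache.get(arg);
        // Miss or expired: run the expensive code once and remember the result.
        if (entry == null || now >= entry.expiresAt) {
            entry = new Entry<>(expensive.apply(arg), now + ttlMillis);
            cache.put(arg, entry);
        }
        return entry.value;
    }
}
```

With annotation-driven caching, none of this lookup, expiry check and store logic appears in application code; the container wraps the method call in something equivalent.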
The way we did mapreduce in Infinispan was that we looked at a lot of the existing mapreduce APIs in Java, so stuff like Hadoop, Cassandra - things like that - how do they do mapreduce? Not implementation details, not how things actually work under the hood, but from a developer’s point of view, how do I interact with this? How do I write a mapreduce task? Is it hard? Is it easy? Is it testable? - important stuff, in my opinion. Frankly, we found them kind of clunky and kind of hard to use, not necessarily because they were poorly designed or anything, but they were the first mapreduce implementations for Java out there. So, you learn from your mistakes, you move on, and with Infinispan we kind of took it a step further. We have a fluent API, as you said, to define mappers and reducers and so on and so forth, which makes it very conversational in the way you actually write this code. It makes it much easier to read, easier to maintain, so that’s a start. But you’re still expressly doing mapreduce and I think that can be taken a step further in simplification by introducing CDI to make it even easier.
There certainly has been an increase in interest, as I mentioned earlier as well. I think the big contributor to that has been Cloud, deploying applications on Cloud nodes, and because such Cloud nodes are ephemeral, you can’t trust them as much, so you always have another copy, a backup and so on and so forth, so suddenly everything is distributed. Whereas in the past you could just deal with one big server, here you will have 2 or 3 "just in case" and as a result you need to have your data stored in a distributed fashion as well. The other reason why I see a big bump in popularity is again due to Cloud: because people have the infrastructure, they have the ability to very cheaply provision a large number of nodes, they actually make use of it, just because they can. So in terms of demand and traffic, in terms of response times and low latency and so on, caching suddenly becomes very important. Hitting a database from multiple nodes at the same time makes things a lot slower than hitting it from a single node once did. As a result, having some sort of distributed cache is very important.
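To give a feel for what a "conversational" fluent mapreduce definition can look like, here is a toy word-count style sketch. It is not Infinispan’s API: `FluentMapReduceSketch`, `over`, `mappedWith`, `reducedWith` and `execute` are names invented for this example, and the whole thing runs over a local collection in one JVM rather than across a grid.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

// Hypothetical sketch of a fluent, chainable mapreduce task definition.
class FluentMapReduceSketch<V, K, R> {
    private final Iterable<V> data;
    private BiConsumer<V, BiConsumer<K, R>> mapper; // receives a value and an "emit" callback
    private BinaryOperator<R> reducer;              // combines two results for the same key

    private FluentMapReduceSketch(Iterable<V> data) {
        this.data = data;
    }

    static <V, K, R> FluentMapReduceSketch<V, K, R> over(Iterable<V> data) {
        return new FluentMapReduceSketch<>(data);
    }

    FluentMapReduceSketch<V, K, R> mappedWith(BiConsumer<V, BiConsumer<K, R>> mapper) {
        this.mapper = mapper;
        return this;
    }

    FluentMapReduceSketch<V, K, R> reducedWith(BinaryOperator<R> reducer) {
        this.reducer = reducer;
        return this;
    }

    // Runs the map phase over every value and folds emitted results per key.
    Map<K, R> execute() {
        Objects.requireNonNull(mapper, "mapper not set");
        Objects.requireNonNull(reducer, "reducer not set");
        Map<K, R> result = new TreeMap<>();
        for (V value : data) {
            mapper.accept(value, (key, partial) -> result.merge(key, partial, reducer));
        }
        return result;
    }
}
```

A word count then reads almost like a sentence: `FluentMapReduceSketch.<String, String, Integer>over(lines).mappedWith((line, emit) -> { for (String w : line.split(" ")) emit.accept(w, 1); }).reducedWith(Integer::sum).execute()`, where `lines` is any collection of strings.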
To add to that, that’s also been expressed in Java EE, so the next edition of Java EE, which is EE 7 has expressly stated as one of their goals to be Cloud-friendly, to be the application development platform for the Cloud, if you will. As a result, having a caching API in Java EE is critical, it’s very important. That really has put a lot of pressure on distributed cache vendors and things like that to say "We need to make sure we’re up-to-speed to be a part of Java EE 7". Java EE 8 is going to be even more interesting because they take Cloud one step further and that’s where we’re hoping JSR 347, which is the data grid API, will slot in.