Java Data Grid Specification: JSR-347
JSR-347 is the data grid specification. This JSR came to life with a bit of controversy and some confusion about where it fits relative to JSR-107 (JCache). InfoQ caught up with Manik Surtani to get his take on JSR 347 and JSR 107, as well as his thoughts on caching, NoSQL, data grids, Infinispan and related topics.
Manik is the specification lead of JSR-347 and a long-time committer and maintainer of JBoss Cache and JBoss Infinispan, leading open source Java cache and data grid implementations respectively. The Infinispan data grid project was announced in April 2009; Manik had worked on data grid prototypes for at least four months before the announcement. Infinispan is part of the inspiration behind JSR 347, and many features from Infinispan are in the current proposed feature list of JSR 347.
InfoQ: What are the goals of JSR 347? How do they differ from JSR 107?
JSR 347, also known as Data Grids for the Java Platform, has been proposed to standardise a set of APIs, a programming model, and expected behavior of distributed, fault-tolerant in-memory key value stores. It differs from JSR 107 (Temporary Caching for the Java Platform) in a number of ways:
- Permanence of data. JSR 347 attempts to be a store of record, using its inherently distributed nature to provide durability. JSR 107 makes an assumption that data stored is temporary and transient.
- Distribution. JSR 107 allows that implementations may be distributed, while JSR 347 mandates that implementations are distributed. As such, the standard can offer users richer APIs to make better use of the data store. For example, exposing APIs to control where data is stored in the grid, asynchronous and non-blocking APIs, and APIs to support eventually consistent implementations only make sense when you know the implementation is distributed.
- Map/reduce and distributed code execution. When data is distributed/partitioned across a grid, it sometimes makes sense to move code execution to the data rather than the other way around. JSR 347 will offer standard APIs for such features as well.
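To make the distribution points above concrete, here is a minimal sketch in plain JDK code. It is not the JSR 347 API, which was still being defined at the time of this interview; the class `PartitionedStoreSketch` and all of its methods are hypothetical names chosen for illustration. It shows two of the ideas mentioned: a key is mapped to an owning partition, and reads are exposed through a non-blocking call that returns a future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: in-process maps stand in for grid nodes.
final class PartitionedStoreSketch<K, V> {
    private final List<Map<K, V>> partitions = new ArrayList<>();

    PartitionedStoreSketch(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ConcurrentHashMap<>());
        }
    }

    // Maps a key to the index of the partition that owns it.
    int ownerOf(K key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Stores the value on the owning partition only.
    void put(K key, V value) {
        partitions.get(ownerOf(key)).put(key, value);
    }

    // Non-blocking read: the lookup runs on another thread and the caller gets a future.
    CompletableFuture<V> getAsync(K key) {
        return CompletableFuture.supplyAsync(() -> partitions.get(ownerOf(key)).get(key));
    }

    int sizeOfPartition(int index) {
        return partitions.get(index).size();
    }
}
```

A caller would write `store.getAsync("a").thenAccept(...)` rather than blocking on the result. A real grid would also replicate each entry to backup owners, which this sketch leaves out.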
InfoQ: Which vendors have signed on to implement JSR 347? Why aren't GemFire and Coherence involved in JSR-347 yet?
So far, the expert group includes Red Hat, Gigaspaces and GridGain. Oracle and IBM are going through legal approval before they formally sign on, but have expressed interest.
Manik went on to say he hoped that the Oracle Coherence team would get involved in JSR 347; they had expressed some interest and were waiting on internal processes before formally signing on. He also stated that the JSR 347 team contacted GemFire but received no response.
InfoQ: How has JBoss Cache evolved over time? How has JBoss Cache evolved into Infinispan?
JBoss Cache was a clustering toolkit we used to cluster the JBoss Application Server. We used it for HTTP and EJB session clustering as well as a transactional Hibernate/JPA second level cache.
Manik explained that developers started to use JBoss Cache as a data grid with permanent store features. As JBoss Cache was not designed to be a data grid, Infinispan was created. Infinispan both supersedes JBoss Cache as a clustering toolkit and provides more powerful data grid capabilities.
InfoQ: Given that JBoss AS users use Infinispan by default for things like session replication, how many JBoss users are actually using all of Infinispan's data grid features? Given there is not a standard interface for Caching or Distributed Caching from the JSRs (yet), how many JBoss users actually use Infinispan for distributed caching or data grids?
This is hard to tell. Both JBoss AS and Infinispan are open source projects and we have limited visibility into precisely what the community is doing or how they interact with Infinispan. But if questions on the user forums and on IRC are anything to go by, I see a majority of people asking questions about using Infinispan's direct APIs from within a webapp or an EJB, deployed in JBoss AS. But these are just those who ask questions.
InfoQ: What defines a data grid solution: queries, transactions, read-through caches, write-behind caches, data sharding, data replication, map/reduce, etc.? What list of features must a data grid support?
Well, this is subjective of course, but I think a data grid needs to offer transactions, read through, write through and behind, some form of sharding or partitioning, as well as listeners. Queries and Map/Reduce are more advanced features, although they're quickly becoming what everyone expects of a data grid, so I suppose they should be added to that list.
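As an illustration of three of the features named in this answer (read-through, write-through and write-behind), here is a minimal single-threaded sketch in plain JDK code. It is not Infinispan's or JSR 347's API; `CacheStoreSketch` and its method names are hypothetical, and a `Map` stands in for the slower backing store such as a database.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: an in-memory cache in front of a slower backing store.
final class CacheStoreSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Map<K, V> backingStore;              // stands in for a database
    private final Queue<K> dirtyKeys = new ArrayDeque<>();
    private int loads = 0;                             // number of backing-store reads

    CacheStoreSketch(Map<K, V> backingStore) {
        this.backingStore = backingStore;
    }

    // Read-through: on a cache miss, load from the backing store and keep the result.
    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        loads++;
        V loaded = backingStore.get(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    // Write-through: the backing store is updated before the call returns.
    void putWriteThrough(K key, V value) {
        backingStore.put(key, value);
        cache.put(key, value);
    }

    // Write-behind: only the cache is updated now; the store is updated on flush().
    void putWriteBehind(K key, V value) {
        cache.put(key, value);
        dirtyKeys.add(key);
    }

    // Applies queued writes to the backing store. A real grid would do this in the background.
    void flush() {
        K key;
        while ((key = dirtyKeys.poll()) != null) {
            backingStore.put(key, cache.get(key));
        }
    }

    int loadCount() {
        return loads;
    }
}
```

The trade-off between the two write paths is latency against durability: write-through pays the store's cost on every write, while write-behind returns quickly but can lose queued writes if the process fails before `flush()` runs.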
InfoQ: How would you define Infinispan's map reduce and why is it important for Java developers?
Map/reduce itself is an important concept when processing data distributed across a large number of servers, as it exhibits greater CPU and core utilization while minimizing network traffic.
Infinispan's map/reduce follows the original Google paper fairly closely in concept, however in implementation, we follow principles of fluent APIs, human-readable and intuitive interfaces, and general best practices for modern Java API design. As such, unlike other Java Map/Reduce implementations like Hadoop, we feel Infinispan's implementation is far more intuitive and developer-friendly.
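For readers unfamiliar with the model, the following is a minimal word-count sketch of the map and reduce phases in plain JDK code. The `Mapper` and `Reducer` interfaces here are hypothetical and defined inside the example; they are not Infinispan's actual interfaces. Each `Map` in the input list stands in for the data held by one grid node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of map/reduce over partitioned data; not Infinispan's API.
final class MapReduceSketch {
    interface Mapper<K, V, KOut, VOut> {
        // Emits zero or more intermediate values per input entry into the collector.
        void map(K key, V value, Map<KOut, List<VOut>> collector);
    }

    interface Reducer<KOut, VOut> {
        // Combines all intermediate values for one key into a single result.
        VOut reduce(KOut key, List<VOut> values);
    }

    static <K, V, KOut, VOut> Map<KOut, VOut> run(List<Map<K, V>> partitions,
                                                  Mapper<K, V, KOut, VOut> mapper,
                                                  Reducer<KOut, VOut> reducer) {
        Map<KOut, List<VOut>> intermediate = new HashMap<>();
        // Map phase: in a real grid this loop would run on each node against its local data.
        for (Map<K, V> partition : partitions) {
            for (Map.Entry<K, V> entry : partition.entrySet()) {
                mapper.map(entry.getKey(), entry.getValue(), intermediate);
            }
        }
        // Reduce phase: collapse the values gathered for each intermediate key.
        Map<KOut, VOut> result = new HashMap<>();
        for (Map.Entry<KOut, List<VOut>> entry : intermediate.entrySet()) {
            result.put(entry.getKey(), reducer.reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    // Counts case-insensitive word occurrences across all partitions.
    static Map<String, Integer> wordCount(List<Map<String, String>> partitions) {
        Mapper<String, String, String, Integer> mapper = (key, text, out) -> {
            for (String word : text.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
                }
            }
        };
        Reducer<String, Integer> reducer =
                (word, counts) -> counts.stream().mapToInt(Integer::intValue).sum();
        return run(partitions, mapper, reducer);
    }
}
```

The network saving mentioned above comes from the map phase: only the small intermediate results, not the stored documents, would need to leave each node.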
InfoQ: Will Infinispan be the RI for JSR 347?
No. The RI will need to be Apache licensed, and Infinispan uses the LGPL license.
InfoQ: I noticed that Infinispan supports Memcached Text Wire protocol. Why?
We supported the memcached wire protocol originally as a way to gain acceptance on non-Java platforms. Memcached has so many client libraries, for almost any platform on the planet. Supporting the memcached wire protocol meant that almost any system could make use of Infinispan.
Subsequently, we designed and implemented Hot Rod as well, as an alternative to the memcached wire protocol, and after writing the "reference" Java client, we've seen the community build Hot Rod clients for Python and Ruby.
Manik went on to explain that Memcached's protocol was too simple for a data grid solution, as it is strictly client/server with a request/response style. Hot Rod, by contrast, allows servers to contact clients and push backend topology changes to them, which is very important for elasticity, that is, adding new data grid nodes on the fly. Future versions of Hot Rod will add eventing, which Manik says will open up a world of opportunity. While the Memcached wire protocol is a de facto standard for distributed caching, it seems like Hot Rod could grow legs and become a de facto standard wire protocol for data grids.
InfoQ: How do JSR 347 and/or Infinispan compare to features found in Oracle Coherence, Enterprise EhCache and VMware GemFire?
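The value of pushing topology changes to clients can be sketched as follows. This is plain JDK code and does not reproduce the Hot Rod wire format or any real client library; `TopologyAwareClientSketch`, its methods, and the integer topology id are assumptions made for illustration. The client keeps a view of the server list, accepts newer views pushed to it, and routes each key using the current view.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a client that tracks server topology; not the Hot Rod protocol.
final class TopologyAwareClientSketch {
    private List<String> servers;
    private int topologyId;

    TopologyAwareClientSketch(int topologyId, List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        this.topologyId = topologyId;
        this.servers = new ArrayList<>(servers);
    }

    // Called when a server pushes a new view. Stale or empty views are ignored.
    synchronized boolean onTopologyUpdate(int newTopologyId, List<String> newServers) {
        if (newTopologyId <= topologyId || newServers.isEmpty()) {
            return false;
        }
        servers = new ArrayList<>(newServers);
        topologyId = newTopologyId;
        return true;
    }

    // Picks the server for a key from the current view.
    synchronized String serverFor(String key) {
        return servers.get(Math.floorMod(key.hashCode(), servers.size()));
    }

    synchronized int topologyId() {
        return topologyId;
    }

    synchronized int serverCount() {
        return servers.size();
    }
}
```

With a plain request/response protocol, the client would keep routing to the old server list until reconfigured. The modulo routing used here is the simplest possible choice and remaps most keys when the list changes; a production client would more likely use consistent hashing.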
For the large part, all of the features planned for JSR 347 are already supported by the products you mention. The main difference being the specific APIs themselves. Of course this isn't comprehensive, some products may not have certain features, such as Map/Reduce, but they probably have the building blocks atop which the missing features could be added.
InfoQ: Is JSR 347 and/or Infinispan a NoSQL solution specification and/or a NoSQL solution? Why or why not? What features are missing from JSR-347 that would make it a NoSQL specification?
JSR 347 is a standard. And it isn't a NoSQL standard but a data grid standard. Infinispan, on the other hand, while it will implement JSR 347 and as such is a data grid, is also evolving to add more NoSQL-like features. The gap between NoSQL and data grids is small enough as it stands; it's just set to get even smaller.
Manik went on to explain that JSR-347 is a precursor to a full-blown NoSQL specification with the major exception being its Java focus.
The big difference would be platform independence. JSR-347 is still a Java specification; many NoSQL databases go beyond Java.
InfoQ: Are queries part of JSR-347?
This is something the expert group will need to decide on.
InfoQ: How would you delineate data grids, NoSQL and object caching?
I see object caching as a temporary, in-memory store for objects that are expensive to retrieve or hard to calculate. Data grids take this a step further by providing a degree of durability thanks to their elastic and distributed nature. NoSQL takes another approach, typically using a disk store as your primary storage engine but providing characteristics of elasticity and scalability, at least in the case of distributed NoSQL engines.
InfoQ: What are the most important features for a NoSQL implementation?
In my opinion, this would be elastic scalability. Otherwise you may as well use an RDBMS and take advantage of the far greater familiarity in setup and usage.
InfoQ: What in the design of Infinispan differentiates it from its competitors (Coherence, Enterprise EhCache, GemFire)?
I have no idea of the internal design of proprietary products.
InfoQ: Can you describe the design philosophy of Infinispan?
Pluggability and extensibility are key. We expect people to do all sorts of things with Infinispan, not just as end-users following usage patterns we prescribe. Interceptors, commands, behavior can be added, in some cases dynamically. And being open source, all the benefits of code and design transparency make it easy for people to extend Infinispan.
Manik went on to explain some ways to learn more about Infinispan and JSR 347. The next release, Infinispan 5.1.0, should reach beta in the next week or so. The JSR 347 Wiki is a good place to see where JSR 347 is heading, as it is a work in progress. There are some videos on Infinispan and CDI integration, which could be a precursor to what will be in the specification. He explained that you can use the Infinispan Maven Archetypes to jump-start a project and explore what might be in JSR 347.
Where are the Hazelcast guys?
Re: Where are the Hazelcast guys?
I believe the participation is just on a public Google group, so anyone can join in. To be on the JSR takes a bit more work, and a legal contract or two. To be honest, I am part of a few JSRs and they all seem to be run fairly differently. I am sure someone knows... I guess you could send Talip Ozturk an email and ask him to get involved.