Jags Ramnarayan on In-Memory Data Grids
In-memory databases or in-memory data grids (IMDG) have been gaining a lot of attention recently because of their support for dynamic scalability and high performance for data-intensive applications compared to traditional databases. Some of the products in this space include Oracle Coherence, IBM’s WebSphere eXtreme Scale and DataPower XC10, Red Hat’s JBoss Infinispan, VMWare’s GemFire and Terracotta’s Ehcache BigMemory.
InfoQ spoke with Jags Ramnarayan, Chief Architect for GemFire products at VMWare, about the architecture of in-memory data grids, their advantages compared to the traditional data stores, and emerging trends in this space.
InfoQ: How are in-memory data grids (IMDG) architecturally different from traditional databases?
Jags Ramnarayan: Traditional relational databases are designed with good principles of data durability, isolation and independence but the design is centralized and tends to get "disk IO" bound for high write workloads. The use of locking at different levels (row, page, table level) for consistency/isolation and the need to flush transaction state to disk introduces scalability challenges requiring users to scale by deploying "beefier" machines (vertical scaling).
In-memory data grids (as the name implies) primarily manage data as objects in memory avoiding expensive disk seeks. The focus shifts from disk IO optimization to optimizing for data management over a network (data in a grid). Scalability is offered through a combination of replication (for access to slow changing but often requested data) and partitioning (for data volume). Data changes are synchronously managed in multiple nodes for protection against failures and in some advanced data grids, data could also be asynchronously replicated over the WAN for disaster recovery.
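As a rough illustration of partitioning combined with synchronously maintained redundant copies, here is a minimal sketch. It is not taken from any product: the class `PartitionedStore`, the CRC32-based placement, and the choice of two copies per key are all assumptions made for the example, and failure detection, rebalancing and WAN replication are left out.

```python
import zlib


class PartitionedStore:
    """Toy grid: each key lives on `copies` consecutive nodes (hypothetical design)."""

    def __init__(self, node_count, copies=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.copies = copies

    def _owners(self, key):
        # Partitioning: a stable hash picks the primary node,
        # and the following node(s) hold the redundant copies.
        first = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.copies)]

    def put(self, key, value):
        # The write is applied to every live owner before returning,
        # which stands in for synchronous replication.
        for n in self._owners(key):
            if self.alive[n]:
                self.nodes[n][key] = value

    def get(self, key):
        for n in self._owners(key):
            if self.alive[n] and key in self.nodes[n]:
                return self.nodes[n][key]
        return None

    def fail(self, node):
        # Simulate a crash: the node's in-memory data is lost.
        self.alive[node] = False
        self.nodes[node].clear()


store = PartitionedStore(node_count=3)
for i in range(100):
    store.put(f"order:{i}", i)
store.fail(0)  # lose one node; every key still has a surviving copy
recovered = [store.get(f"order:{i}") for i in range(100)]
```

With two copies on distinct nodes, losing any single node leaves every key readable, which is the protection against failures described above.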
You could think of data grids as distributed caches with lots of other valuable features. For instance, data grids can co-exist with relational databases by offering 'out of the box' integration services like "read through", transactional "write through" and asynchronous batch "write behind".
InfoQ: What are the advantages of using IMDGs compared to traditional data stores?
JR: For the most part, IMDGs complement traditional databases. The most common design pattern is as a "distributed cache" for one or more databases. Well, compared to a traditional database, you get better scale, much lower and predictable latencies for data access, significantly higher write throughput and generally higher availability. You can de-couple your data sources from your application by moving your database specific logic into Java code that runs in the data grid and not be dependent on the availability of the data sources for the application(s) to keep running. Most prominent data grids are built on Java allowing the data grid to run embedded inside your application server cluster cutting off (or dramatically reducing) the traffic to the database servers.
InfoQ: How are IMDGs different than distributed caching solutions like memcached?
JR: Distributed caching products like memcached provide a simple, high performance, in-memory key-value store and scale really well. Its "scalability" secret lies in making servers completely independent of each other. It is the client (configured with a list of all servers) that ties all the data in the servers together. A hash function resolves the keys to servers on each client. This implies consistency of data even in the simplest cases requires all clients to have identical server lists. If different clients have different server lists or different hash functions all bets on consistency are OFF. This also means that scaling the data tier requires a code update and the testing overhead that goes with it. And with no built-in support for replication there is no native support for high availability, so any network partition or server crash leads to a loss of availability.
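The point about client-side hashing can be sketched in a few lines. This is an illustrative example only: `pick_server` and the host names are hypothetical, and it uses plain modulo hashing, whereas real memcached clients often use consistent hashing to reduce (but not eliminate) remapping.

```python
import hashlib


def pick_server(key, servers):
    """Resolve a key to a server on the client, using a stable hash modulo the list size."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Two clients configured with different server lists (hypothetical hosts).
client_a_servers = ["cache1:11211", "cache2:11211", "cache3:11211"]
client_b_servers = ["cache1:11211", "cache2:11211", "cache3:11211", "cache4:11211"]

keys = [f"user:{i}" for i in range(1000)]

# Count the keys that the two clients would read from or write to different servers.
disagreements = sum(
    1
    for k in keys
    if pick_server(k, client_a_servers) != pick_server(k, client_b_servers)
)
print(f"{disagreements} of {len(keys)} keys resolve to different servers")
```

Because the servers never talk to each other, nothing detects or repairs these disagreements; two clients can each hold a different value for the same key.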
IMDG servers are fully clustered and are always aware of each other. They use a variety of algorithms to establish distributed consensus and ensure higher levels of consistency guarantees. The list of IMDG capabilities is vast but often it is assumed that you would get higher levels of performance with memcached. IMDGs support tiered caching so most frequently used data can be cached in client processes which is automatically synchronized with data in the server cluster offering better performance with no network hops.
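A minimal sketch of tiered caching follows, assuming a simple invalidate-on-update scheme. `GridServer` and `NearCache` are hypothetical stand-ins, not a real IMDG API, and the "network hop" is only a counter.

```python
from collections import OrderedDict


class GridServer:
    """Stand-in for the server cluster; counts simulated network reads."""

    def __init__(self):
        self.data = {}
        self.remote_reads = 0
        self.listeners = []

    def get(self, key):
        self.remote_reads += 1
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value
        for listener in self.listeners:
            listener(key)  # tell near caches to drop their stale copy


class NearCache:
    """Small LRU cache living inside the client process."""

    def __init__(self, server, capacity=2):
        self.server = server
        self.capacity = capacity
        self.local = OrderedDict()
        server.listeners.append(self._invalidate)

    def _invalidate(self, key):
        self.local.pop(key, None)

    def get(self, key):
        if key in self.local:
            self.local.move_to_end(key)  # mark as most recently used
            return self.local[key]
        value = self.server.get(key)  # miss: go to the server tier
        if value is not None:
            self.local[key] = value
            if len(self.local) > self.capacity:
                self.local.popitem(last=False)  # evict least recently used
        return value


server = GridServer()
server.put("price:IBM", 100)
near = NearCache(server)
first = near.get("price:IBM")   # miss: one simulated network hop
second = near.get("price:IBM")  # hit: served from the client process
server.put("price:IBM", 101)    # server notifies the near cache
third = near.get("price:IBM")   # refetched after invalidation
```

Here the second read never leaves the client process, and the update on the server side keeps the client copy from going stale.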
InfoQ: What about NoSQL? What are the key differences?
JR: The NoSQL solutions come in a variety of different colors and flavors. Solutions like MongoDB, Cassandra, HBase, etc. all have a similar horizontal scalability value proposition and are often positioned as an alternative to the traditional database. IMDGs are positioned as complementary, with distributed caching as the primary use case. All the products in the space are rapidly evolving. IMDGs (like GemFire) have already evolved with sophisticated "shared nothing" disk persistence and some in the NoSQL camp have optimized their use of distributed memory for higher performance. So, there is increasing overlap in the underlying technology.
That said, there are many differences. Support for distributed transactions, scatter-gather parallel query processing, tiered caching, publish-subscribe event processing, a framework to integrate data with existing databases, and replication over wide area networks are some of the areas where, I think, IMDGs have an advantage.
InfoQ: A recent Gartner report says that In-Memory Data Grid technology enables new computing paradigms such as cloud, complex-event processing (CEP), and data analysis. What do you think is the role of IMDGs in these computing paradigms?
JR: Among many things, cloud deployments promise scalability and higher availability. The service can dynamically scale irrespective of the spikes in demand. So, from this perspective, IMDG is a better fit. When spikes occur, the automatic detection and provisioning of resources (h/w capacity) is handled using virtualization (for instance, automatic provisioning using vCloudDirector in VMWare environments) and integration with IMDG means, now, the IMDG elastically expands or contracts without any operator assistance.
Advanced IMDGs offer CEP through a feature called "Continuous querying" - clients subscribe to data of interest using queries and as and when the query result set is impacted due to updates the IMDG can asynchronously push "change events" to clients. This feature enables a new breed of real-time, push-oriented applications where events can be pushed all the way to thousands of devices running applications.
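The continuous-query idea can be sketched as follows. This is a simplified, single-process illustration: `ContinuousQueryRegion` is a hypothetical name, the "query" is a plain Python predicate rather than a query language, and callbacks run synchronously, whereas an IMDG would push the change events to remote clients asynchronously.

```python
class ContinuousQueryRegion:
    """Toy data region that notifies subscribers when a query's result set changes."""

    def __init__(self):
        self.data = {}
        self.subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self.subscriptions.append((predicate, callback))

    def put(self, key, value):
        old = self.data.get(key)
        self.data[key] = value
        for predicate, callback in self.subscriptions:
            was_match = old is not None and predicate(old)
            is_match = predicate(value)
            if is_match and not was_match:
                callback("added", key, value)    # entry entered the result set
            elif is_match and was_match:
                callback("updated", key, value)  # entry changed inside the result set
            elif was_match and not is_match:
                callback("removed", key, value)  # entry left the result set


events = []
region = ContinuousQueryRegion()
# Subscribe to "all quotes priced above 100".
region.subscribe(
    lambda quote: quote["price"] > 100,
    lambda kind, key, value: events.append((kind, key)),
)

region.put("IBM", {"price": 90})   # does not match: no event
region.put("IBM", {"price": 120})  # enters the result set
region.put("IBM", {"price": 130})  # changes within the result set
region.put("IBM", {"price": 80})   # leaves the result set
```

Only updates that affect the subscribed result set produce events, so a client does not need to poll or re-run the query.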
Application behavior can be parallelized and executed colocated with the data. Complex analysis can now be done in a fraction of the time taken by traditional stored procedures, with the data in memory and the application able to harness the processing power across the grid.
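To make the scatter-gather idea concrete, here is a minimal sketch in which a function runs against each partition and only the small partial results are combined. It is an assumption-laden toy: the partitions are dictionaries in one process, threads stand in for grid nodes, and names such as `scatter_gather_sum` are invented for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4


def partition_for(key):
    """Pick the partition (node) that owns a key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def load(partitions, records):
    for key, value in records.items():
        partitions[partition_for(key)][key] = value


def local_sum(partition):
    """Runs where the data lives and returns only a partial result."""
    return sum(partition.values())


def scatter_gather_sum(partitions):
    # Scatter: run the function on every partition in parallel.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_sum, partitions))
    # Gather: combine the partial results on the caller.
    return sum(partials)


partitions = [dict() for _ in range(NUM_PARTITIONS)]
load(partitions, {f"order:{i}": i for i in range(100)})
total = scatter_gather_sum(partitions)
print(total)  # 4950
```

The data never moves to the caller; only one number per partition crosses the (simulated) network.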
InfoQ: What are some polyglot persistence patterns supported by in-memory data grids (e.g. write-behind)?
JR: As mentioned before, IMDG can be used as a cache in front of other data sources. The framework allows the application developer to interface to a data service, access files, RDBs, etc. It supports 3 core patterns:
- Read on a miss - execute a "data loader" if the objects being accessed are missing in the data grid.
- Transactional "write through" - Changes are synchronously updated in backend repositories and in a transactionally consistent manner. The update to the IMDG is successful if and only if the update is also in the backend repository.
- Asynchronous "write behind" - Enqueue updates across the grid and transfer the changes in batches to a backend repository. The queuing can be configured to be in-memory replicated for HA or also be done to disk for replay if the cluster were to fail.
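The three patterns above can be sketched roughly as follows. `Backend` and `Grid` are hypothetical stand-ins, not a product API; a real write-through would enlist the backend in a transaction, and a real write-behind queue would be replicated or persisted as described, which this single-process sketch does not attempt.

```python
class Backend:
    """Stand-in for a relational database or other repository."""

    def __init__(self):
        self.rows = {}
        self.batches = 0

    def load(self, key):
        return self.rows.get(key)

    def store(self, key, value):
        self.rows[key] = value

    def store_batch(self, items):
        self.batches += 1
        self.rows.update(items)


class Grid:
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}
        self.queue = []  # pending write-behind updates

    def get(self, key):
        # Read on a miss: the "data loader" runs only if the grid lacks the entry.
        if key not in self.cache:
            value = self.backend.load(key)
            if value is not None:
                self.cache[key] = value
            return value
        return self.cache[key]

    def put_write_through(self, key, value):
        # Backend first: if store() raises, the grid entry is left untouched.
        self.backend.store(key, value)
        self.cache[key] = value

    def put_write_behind(self, key, value):
        # Update the grid immediately and enqueue the change for later.
        self.cache[key] = value
        self.queue.append((key, value))

    def flush(self):
        # Transfer queued changes in one batch; later updates to a key win.
        if self.queue:
            self.backend.store_batch(dict(self.queue))
            self.queue.clear()


backend = Backend()
backend.rows["k1"] = "v1"
grid = Grid(backend)

loaded = grid.get("k1")               # read on a miss
grid.put_write_through("k2", "v2")    # synchronous update of the backend
grid.put_write_behind("k3", "a")      # queued
grid.put_write_behind("k3", "b")      # queued, supersedes "a"
pending_before_flush = "k3" not in backend.rows
grid.flush()                          # one batch reaches the backend
```

In a running grid the flush would be triggered by a timer or a batch-size threshold rather than called by hand.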
InfoQ: Can IMDGs be used to store non-relational data (semi-structured and unstructured) like we can with NoSQL databases?
JR: Yes. IMDGs store keys and values as objects. These objects can be deeply nested, semi-structured or mimic the structure in a relational database. Just like NoSQL, developers are not constrained by a rigid SQL schema - 2 rows in a collection can contain completely different fields. The application classes associated with the stored objects can be versioned and several IMDGs support forward and backward compatibility (multiple versions of the objects).
InfoQ: What are some limitations of IMDGs?
JR: Even though the JVM GC problems can be circumvented through memory over-provisioning, most IMDGs don't offer a good solution for "cold restart" - how to recover TBs of data from some backend repository? Parallel recovery can take hours and overloading the backend database could stall all other apps connected to the DB. It becomes important to evaluate the persistence capabilities within IMDGs and whether parallel recovery is supported.
Any distributed platform that offers elasticity (grow or shrink dynamically) is exposed to a myriad of failures (Byzantine failures), and its fault handling capabilities become an important measure of the reliability of the system.
None of the IMDGs today offer "Change Data Capture (CDC)" capabilities - i.e. if the backend enterprise repository is updated from other sources, how do these events propagate to the IMDG? Users have to use 3rd party products or a combination of triggers and messaging to accomplish this.
InfoQ: What security considerations need to be taken into account when using IMDG in applications, in terms of authentication, authorization, and data encryption?
JR: Security is a complex discipline with many sub-disciplines. From a developer productivity standpoint, and for applications that aren't operating with sensitive data, basic username/password authentication will suffice. Anything beyond that and you should make sure the product offers pluggable security - a way to authenticate users using LDAP or PKCS, or to handle Kerberos tickets. Similarly, you need simple ways to plug in an SSL provider for connection authentication and/or "on-the-wire" tamper protection. Authorization (permissions on what operations a user can do) can be challenging. You need a way to grant or revoke permissions dynamically, but given that operations can be executed in under a millisecond you cannot hop to an external process for authorizing operations. The "access control rules" have to be cached and you have to make sure the IMDG processes themselves are always operating in a secure environment.
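The idea of caching access control rules inside the grid process, while still allowing dynamic grant and revoke, might look like the sketch below. `AccessControlCache` and its methods are hypothetical; how the rules reach every grid member, and how users are authenticated in the first place, is outside the scope of the example.

```python
class AccessControlCache:
    """In-process rule cache: authorization is a local lookup, not a remote call."""

    def __init__(self):
        self._rules = {}  # user -> set of permitted operations

    def grant(self, user, operation):
        self._rules.setdefault(user, set()).add(operation)

    def revoke(self, user, operation):
        self._rules.get(user, set()).discard(operation)

    def is_allowed(self, user, operation):
        return operation in self._rules.get(user, ())


acl = AccessControlCache()
acl.grant("alice", "put")
allowed_before = acl.is_allowed("alice", "put")  # checked without leaving the process
acl.revoke("alice", "put")                       # takes effect on the next check
allowed_after = acl.is_allowed("alice", "put")
```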
InfoQ: What are the emerging trends in the in-memory data grid space?
JR: Being able to handle larger volumes of data and increasing use of the technology as the primary operational data store (i.e. with no other database backend). Updates are only propagated to other data marts or data warehouses in batches and enqueued data in the IMDG has to persist to disk in a very reliable manner.
There are many other interesting trends we will see as cloud based deployments become more prominent - seamless integration with virtualization for dynamic resources, better provisioning and automated management for "data as a service", evolutions in the programming models as we go from a traditional web application architecture (thin web client -> app server -> cache -> DB) to a more device centric architecture (richer apps on client -> distributed services -> IMDG/NoSQL -> repositories).
IMDGs were originally designed to complement traditional databases, by allowing critical pieces of fast-changing data and application logic to operate at the memory layer with much higher throughput and lower latency. These high performance needs have become so prevalent that we increasingly see people who want to use the IMDG not as a complement, but actually as a store-of-record, taking advantage of the built-in persistence that some IMDGs have.
A big trend industry-wide is to have truly global applications where users in different parts of the world are using the same app and updating the same data set, all in real time. Because IMDGs are distributed by design, they are an excellent starting point for building a global data grid, and in fact we see IMDGs playing a big role in global data grids.
About the Interviewee
As the Chief Architect for GemFire products at VMWare, Jags is responsible for the technology direction for its high performance distributed data grid. Jags has been a part of several Java standards - EJB, J2EE while at GemStone Systems, XML based JSRs like JAXM while at BEA. He also represented BEA in the W3C SOAP protocol specification. Jags has presented at several conferences in the past on data management, clustering and grid computing. He has over 20 years of experience, a Bachelor’s degree in Computer Science and a Master’s degree in Management of Science and Technology.
Change Data Capture
"None of the IMDGs today offer "Change Data Capture (CDC)" capabilities" - Golden Gate Adapter for Coherence that aims to solve this problem in Coherence IMDG.
IMDG vs Distributed caches
Also, are there popular benchmarks comparing IMDGs and distributed caches like memcached? Does memcached outperform IMDGs since it provides fewer features and all its operations are performed in O(1)?
Re: IMDG vs Distributed caches
That said, you could use the Yahoo cloud serving benchmark (specifically written for NoSQL) as a starting point.
To your second question: Does memcached outperform IMDG? No, not necessarily. Most IMDGs provide "near caching" - you can host an LRU cache embedded in your client app process that will be orders of magnitude faster than memcached.
Re: IMDG vs Distributed caches
good to clarify grid versus cache
When I come to select a data grid, the distinguishing features are whether it supports distributed transactions and whether I can use it as a system of record without some other underlying store. Both are critical for data grids, but not for caches.
I agree with Jags' list of products, and I'd add Gigaspaces to it for any effective selection process.
Data-Grid benchmark report
The full report can be found here: