
Distributed Cache as a NoSQL Data Store?

by Srini Penchikala on Nov 04, 2011

NoSQL data stores offer alternative storage options for non-relational data such as documents, object graphs, and key-value pairs. Can a distributed cache be used as a NoSQL store? Greg Luck of Ehcache wrote about the similarities between a distributed cache and a NoSQL data store. InfoQ caught up with him to talk about this use case and its advantages and limitations.

InfoQ: Can you explain how a distributed caching solution is comparable with a NoSQL Data Store?

Greg Luck: A distributed cache typically keeps its data in memory and is designed for low latencies. A NoSQL store is a DBMS without the R (relations), and usually also lacks support for transactions and other sophisticated features. Joins across table relations are the most problematic part of SQL for systems that don't support relations, and they are where the NoSQL moniker comes from.

One type of NoSQL store is a key-value store. Some examples are Dynamo, Oracle NoSQL Database and Redis. Caches are also key-value stores, so these two are related. While many cache implementations can be configured for persistence, much of the time they are not because the system of record is in a database, and they are aimed at performance rather than persistence. In contrast, a NoSQL store is always aimed at persistence.
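To make the key-value parallel concrete, here is a minimal sketch (not from the interview) of put/get against an Ehcache 2.x cache. The cache name is illustrative, and whether entries survive a restart depends on how the cache is configured for persistence.

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class KeyValueCacheSketch {
        public static void main(String[] args) {
            // Create a cache manager and a cache named "users" (illustrative; production
            // caches are usually declared in ehcache.xml, where persistence can also be
            // switched on).
            CacheManager cacheManager = CacheManager.create();
            cacheManager.addCache("users");
            Cache users = cacheManager.getCache("users");

            // The same put/get-by-key contract a key-value NoSQL store exposes.
            users.put(new Element("user:42", "Greg"));
            Element hit = users.get("user:42");
            System.out.println(hit == null ? "miss" : hit.getObjectValue());

            cacheManager.shutdown();
        }
    }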

But a persistent cache can be used in much the same way as a key-value NoSQL store. NoSQL is also addressed at Big Data, which usually means bigger than you want to put in a single RDBMS node. It is typically thought of as starting at a few terabytes and moving up into petabytes.

Distributed caches are typically aimed at lowering latencies for valuable transactional data, which tends to be smaller in size, perhaps ending roughly where Big Data begins. Because caches store data in memory there is a higher cost per unit stored, which also tends to limit their size. If they rely on heap storage they might have as little as 2 GB per server node; it depends on the distributed cache. Ehcache also stores data off-heap, holding hundreds of GBs per server for multi-terabyte caches.

A persistent, distributed cache can address some NoSQL use cases. A NoSQL store can also address some caching use cases, albeit with higher latencies.

InfoQ: What are the similarities between Distributed Cache and NoSQL DB from an architecture stand-point?

Greg: Both want to deliver more TPS and scalability than an RDBMS. To do so, both do less than an RDBMS, with the idea that simplifying the problem and leaving out problematic things like joins, stored procedures and full ACID transactions gets you there.

Both tend to use proprietary rather than standardized interfaces, although in the area of Java caches we now have JSR 107, which will provide a standard caching API for Spring and Java EE programmers.
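For illustration, here is a sketch of the kind of code JSR 107 standardizes. It follows the javax.cache API as it was eventually finalized, so details may differ slightly from the draft that existed when this interview was published; the cache name and types are hypothetical.

    import javax.cache.Cache;
    import javax.cache.CacheManager;
    import javax.cache.Caching;
    import javax.cache.configuration.MutableConfiguration;
    import javax.cache.spi.CachingProvider;

    public class JCacheSketch {
        public static void main(String[] args) {
            // Obtain whatever JSR 107 provider (Ehcache, Hazelcast, ...) is on the classpath.
            CachingProvider provider = Caching.getCachingProvider();
            CacheManager manager = provider.getCacheManager();

            MutableConfiguration<String, String> config =
                    new MutableConfiguration<String, String>()
                            .setTypes(String.class, String.class);
            Cache<String, String> quotes = manager.createCache("quotes", config);

            quotes.put("GOOG", "580.11");
            System.out.println(quotes.get("GOOG"));

            provider.close();
        }
    }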

Both scale out using partitioning of data that is transparent to clients. The non-Java products also scale up quite well. With Terracotta BigMemory we are pretty much unique in scaling up on the Java platform. Finally, both deploy on commodity hardware and operating systems, making them ideal for use in the cloud.

InfoQ: What are the architectural differences between these two technologies?

Greg: NoSQL and RDBMS are generally on disk. Disks are mechanical devices and exhibit large latencies due to seek time as the head moves to the right track and read or write times dependent on the RPM of the disk platter. NoSQL tends to optimise disk use, for example, by only appending to logs with the disk head in place and occasionally flushing to disk. By contrast, caches are principally in memory.

NoSQL and RDBMS have thin clients (think Thrift or JDBC) that always send data across the network whereas some caches like Ehcache use in-process and remote storage so that common requests are satisfied locally. In a distributed caching context, each application server will have the hot part of the cache in-process so that adding application servers does not necessarily increase network or back end load.

RDBMS is focused on being the System of Record ("SOR") for everyone. NoSQL comes in flavors, and each flavor wants to be the SOR for a specific type of data, namely key-value pairs, documents, sparse databases (wide columnar) or graphs. Caches are focused on performance and are typically used in front of an RDBMS or a NoSQL store, which acts as the SOR for that type of data. But oftentimes a cache might store the result of a web service call, a computation, or even a business object that might require tens or hundreds of calls to an SOR to build up.
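The pattern implied here is often called cache-aside: check the in-process cache first and only make the expensive call on a miss. A minimal sketch, with all class and field names hypothetical and the Ehcache 2.x API assumed:

    import java.math.BigDecimal;
    import net.sf.ehcache.Cache;
    import net.sf.ehcache.Element;

    public class QuoteService {

        /** Hypothetical client for a paid-for web service. */
        public interface RemoteQuoteClient {
            BigDecimal fetchQuote(String symbol);
        }

        private final Cache quoteCache;          // in-process, possibly clustered via Terracotta
        private final RemoteQuoteClient client;

        public QuoteService(Cache quoteCache, RemoteQuoteClient client) {
            this.quoteCache = quoteCache;
            this.client = client;
        }

        public BigDecimal quoteFor(String symbol) {
            // Hot data is served locally, so adding app servers does not add back-end load.
            Element cached = quoteCache.get(symbol);
            if (cached != null) {
                return (BigDecimal) cached.getObjectValue();
            }
            // Miss: make the expensive remote call once, then cache the result.
            BigDecimal quote = client.fetchQuote(symbol);
            quoteCache.put(new Element(symbol, quote));
            return quote;
        }
    }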

Caches like Ehcache tend to run partly within the operating system process of the application and partly in their own processes on machines across the network. Not all of them do this: memcached is an example of a cache that only stores data across the network.

InfoQ: What type of applications are best candidates for using this approach?

Greg: Following on from the earlier question, distributed caches fit in with your existing applications. They often require very little work to implement whereas NoSQL requires a large, jarring architectural change.

So the first types of applications suited to distributed caching are the ones you already have, and in particular applications that need to:

  • scale out because of increasing use or load
  • achieve lower latencies to meet their required SLAs
  • minimize use of very expensive infrastructure such as mainframes
  • reduce their use of paid-for web service calls
  • cope with extreme load peaks (e.g. Black Friday sales events)

InfoQ: What are the limitations of this approach?

Greg: Caches, being in memory, are limited in size by the cost of memory and their technical limitations in how much memory they can use (more on this below).

A cache, even if it does provide persistence, is unlikely to be a good choice for your system of record. Caches purposely avoid sophisticated tooling for backup to and recovery from disk, although simple tools may exist. RDBMSs have rich backup, restore, migration, reporting and ETL features, which have been developed over the last 30 years. NoSQL tends to lie somewhere in between.

Caches provide programming APIs for mutating and accessing data. By contrast, NoSQL and RDBMS provide tools that can execute scripting-style languages (e.g. SQL, UnQL, Thrift).

But the key thing to remember is that the cache is not trying to be your SOR. It coexists very easily with your RDBMS and for that reason simply does not need the sophisticated tools of the RDBMS.

InfoQ: How do you see the distributed cache solutions, NoSQL Databases as well as the traditional RDBMS working together in the future?

Greg: Because distributed caches are anywhere from one to three orders of magnitude faster than an RDBMS or a NoSQL store, depending on deployment topology and data access patterns, those looking for lower latencies will use a cache as a complement to NoSQL just as they do today with an RDBMS.

Where there will be a difference is this: it is often hard to scale an RDBMS beyond a single node, or doing so affects the programming contract or forces CAP trade-offs, whereas with NoSQL you treat the store as a multi-node installation even when you are running against one node, so nothing changes when you scale out. With an RDBMS, a cache is added to avoid these scale-out difficulties; often the cache alone meets the capacity requirements of the system and you need go no further. So the cache gets added when scale-out is needed.

For NoSQL, scale-out is built in, so the cache will get used when lower latencies are required.

 


I think the future direction is the combination of computation WITH storage by Faisal Waris

One distinguishing feature is that some NoSQL stores allow computations to be pushed to the storage nodes, whereas caches generally have only get/set APIs. Consider, for example, MongoDB, which allows map-reduce queries (written in JavaScript) to be executed on the storage nodes.
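As a rough illustration of what pushing computation to the storage nodes looks like with MongoDB, here is a sketch using the MongoDB Java driver of that era; the collection, field names and connection details are hypothetical, and the exact mapReduce overload varies between driver versions.

    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MapReduceOutput;
    import com.mongodb.Mongo;

    public class MongoMapReduceSketch {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost");
            DBCollection orders = mongo.getDB("shop").getCollection("orders");

            // The JavaScript map and reduce functions run on the MongoDB servers,
            // not in the client JVM.
            String map = "function() { emit(this.customerId, this.total); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            MapReduceOutput out = orders.mapReduce(map, reduce, "totals_by_customer", null);
            for (DBObject row : out.results()) {
                System.out.println(row);
            }
            mongo.close();
        }
    }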

On the hardware side, memristors (www.physorg.com/news190016024.html) may allow us to do something similar.

The success of Hadoop, to a large extent, stems from being able to efficiently co-locate processing and data.

Re: I think the future direction is the combination of computation WITH sto by Steven Dick

Many of the distributed cache guys are moving into grid compute (or are already there) - think Oracle Coherence, GigaSpaces, GemFire.

So for me I don't see such a clear separation between the NoSQL and distributed cache guys. Frankly, it leaves me a bit confused over which products are a good fit for the areas I work in.

Re: I think the future direction is the combination of computation WITH sto by Greg Luck

Actually most distributed caches have search APIs that execute on the storage nodes - Enterprise Ehcache does this.
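As a hedged sketch of what such a search API looks like in Ehcache (the "age" attribute is hypothetical and would need to be declared searchable in the cache configuration):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.search.Attribute;
    import net.sf.ehcache.search.Query;
    import net.sf.ehcache.search.Result;
    import net.sf.ehcache.search.Results;

    public class CacheSearchSketch {
        // The query criteria are evaluated on the storage nodes rather than by
        // pulling every entry back to the client.
        static void printCustomersOver35(Cache cache) {
            Attribute<Integer> age = cache.getSearchAttribute("age");
            Query query = cache.createQuery()
                               .includeKeys()
                               .includeValues()
                               .addCriteria(age.ge(35));
            Results results = query.execute();
            for (Result result : results.all()) {
                System.out.println(result.getKey() + " -> " + result.getValue());
            }
        }
    }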

Re: I think the future direction is the combination of computation WITH sto by Greg Luck

Starting with the use cases is a good way to go. Then line up your requirements with the features. I for one think it is a good thing that we have more than one hammer for all of those nails out there.

Re: I think the future direction is the combination of computation WITH sto by peter lin

That's not correct. Many data grids have supported queries and work managers for several years. For example, Coherence has provided query and work manager capability since version 3: download.oracle.com/docs/cd/E15357_01/coh.360/e.... I could be wrong, but version 3 was released back in 2004/2005. According to Wikipedia, MongoDB was started in 2007. I'm sure Cameron can provide the exact date of when query and work manager features were added to Coherence.

Yep - Right tool for the job based on the implementation. by S Iyer

Yes, conceptually NoSQL and distributed caches are quite similar and several use cases overlap, in that the same use case can be addressed by either one. Several high-scale applications I have worked on have used one or the other (Ehcache, Memcached, MongoDB, Redis, Coherence, etc.); for me, it really boils down to a question of implementation and specific feature set. So I tend to put them all in the same bucket and look at pros/cons vis-a-vis the use case at hand.

1. Example - the article does talk about the advantages of a local in-memory cache, which distributed Ehcache and Coherence provide, and this is key for applications that need microsecond latency, assuming the data being cached features a hot set and/or good locality and does not mutate too much. However, Memcached (a popular distributed cache) does not feature such an "L1 cache".

2. Typically NoSQL mimics a traditional database:
  - In terms of better backup/recovery capabilities.
  - In keeping the data format agnostic of the application (e.g. JSON/BSON in the case of MongoDB), whereas distributed cache implementations are typically guilty of inventing proprietary and opaque persistence formats.
  - In allowing a much richer query interface, both programmatically and via an application-agnostic mechanism (e.g. the Mongo shell). Caches, on the other hand, typically offer only programmatic, key-based access. Although search APIs are being added, these are still reasonably primitive in terms of expressiveness and run-time performance.

So basically the article sums it up right. If latency trumps everything else and the data access patterns are right, go with a distributed cache. If the data set does not lend itself to caching (e.g. no real hot set, no locality, constant mutation, and consistency is important) and/or queryability and other dimensions are key for your use case, then one of the NoSQL implementations may fit the bill.

Distributed Caching by Wes Nur

I am a big fan of distributed caching because it is perhaps the easiest way to enhance the performance and scalability of an application. During peak load times the performance of an app usually goes down, which ultimately affects the end users (customers). So the use of a third-party distributed cache like NCache or any other product can really help to solve problems of performance, scalability and high availability.
