Virtual Panel: NoSQL Database Patterns and Polyglot Persistence
The NoSQL database space includes a variety of databases that support different data storage patterns. InfoQ spoke with four panelists about the current state of NoSQL adoption, the architecture patterns supported by different NoSQL databases, and the security aspects of using NoSQL databases. The panelists we spoke with, and the NoSQL databases they discussed, are listed below.
- Robin Schumacher from DataStax to speak about Cassandra, a column-family based database
- Jared Rosoff, 10gen to talk about MongoDB, a Document based NoSQL database
- Darren Wood, Chief Architect at Objectivity Inc. to discuss InfiniteGraph, a graph-oriented database
- John Davies, CTO and co-founder of Incept5
We asked these panelists the following questions:
- What is the current state of NoSQL databases in terms of enterprise adoption?
- Can you discuss the core architectural patterns your NoSQL database product supports in the areas of data persistence, retrieval, and other database related concerns?
- What are the use cases or applications that are best suited to use your product?
- What are the limitations or constraints of using the product?
- The Cross-Store Persistence concept is getting a lot of attention lately. It can be used to persist data into different databases, including NoSQL and relational databases. Can you talk about this approach and how it can help application developers who need to connect to multiple databases from a single application?
- What do you see as the role of In-memory data grids (IMDG) in the polyglot persistence space?
- What is the current state of security in this space and what's coming?
- What is the future road map for your product in terms of new features and enhancements?
InfoQ: What is the current state of NoSQL databases in terms of enterprise adoption?
Robin Schumacher: Here at DataStax, the trend we’re experiencing is pretty clear. At first, the types of companies that were adopting and using Cassandra were the types you’d expect: fairly young, Internet-type businesses that liked what NoSQL databases offered, both in terms of abilities and cost.
But in the last 8 or so months, we have been signing up major industry names that are recognizable to everyone. Moreover, they’re putting our software in the critical path of their key business lines so that it’s responsible for systems that keep the lights on.
What’s surprised me is how fast all this has happened. When I was at MySQL, the adoption timeline of open source database software in the enterprise was much slower, especially where running key systems was concerned. The uptick for NoSQL in the enterprise seems to be happening much faster, perhaps because of the requirement to manage big data or the increasing need new applications have for a different data model than those found in an RDBMS.
Darren Wood: We are working with a lot of prospects/customers and they fall broadly into two categories.
First, newer markets like social networking, gaming, internet/mobile advertising and other large scale web based applications are building systems from the ground up based on NoSQL technologies. In many cases, multiple NoSQL data stores are being used together, each providing different capabilities in line with their individual strengths. For example, there is extensive use of distributed key-value stores for user/device profile information and in some cases initial capture of interaction and event data. In the backend, information derived from interactions is being used to populate various data stores used for specific types of queries. Graph databases like InfiniteGraph are a good example of this, as they can be used to store and discover connections between entities in the system, e.g. how groups of users are related.
Secondly, in our more traditional markets (Telco, Government, Healthcare, Sciences etc), there is a steady increase in NoSQL adoption to help find new value in existing data. While there is some displacement of traditional RDBMS, we are seeing a lot of projects in these markets which are pulling data from various existing data platforms and storing them in more efficient ways to become the basis for "information discovery" or data analytics type applications.
John Davies: We (the investment / wholesale banking world) have been using "NoSQL" for a decade now. We didn't have a name for it before, but we've been storing data in non-relational storage for years.
InfoQ: Can you discuss the core architectural patterns your NoSQL database product supports in the areas of data persistence, retrieval, and other database related concerns?
Robin: Sometimes you’ll see Internet FUD that says NoSQL databases aren’t safe for data. With Cassandra, I can tell you that claim is absolutely false.
Cassandra was built with the understanding that hardware failures can and do occur, and that data loss is always to be avoided. To ensure hardware or other such breakdowns don’t threaten the data assets of a company, Cassandra was architected with a peer-to-peer design that ensures continuous availability without compromise where performance and ease of use is concerned.
Data coming into Cassandra is always written to a commit log that guarantees data durability, and then to an in-memory structure called a memtable. That is then flushed periodically to disk to what’s called an "sstable" (sorted strings table).
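The write path described above can be sketched as a toy model in Python. This is an illustration only, not Cassandra's implementation: the class name `ToyNode`, the `memtable_limit` flush threshold, and the in-process lists standing in for disk files are all assumptions made for the example, and details such as compaction are omitted.

```python
class ToyNode:
    """Toy model of a commit log -> memtable -> sstable write path."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write (durability)
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, key-sorted tables standing in for disk files
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write in the log first
        self.memtable[key] = value            # 2. then update the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. persist the memtable as a sorted, immutable table and start a new one
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then the sstables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode()
for k, v in [("b", 2), ("a", 1), ("c", 3), ("d", 4)]:
    node.write(k, v)
```

After these four writes the first three keys have been flushed into one sorted table, while the fourth still lives only in the memtable and the commit log.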
Cassandra is designed for divide-and-conquer database use cases where scale out architectures are desired. Data is automatically distributed across all nodes in a cluster based on the primary key in a column family (akin to a table in an RDBMS). Data is also replicated among other nodes based on how much redundancy a user wants, with the replication being easily set up no matter if a database cluster is just in one data center, many data centers, or a combination of on-premise data centers and the cloud. The bottom line is, these features ensure that a hardware failure or data loss on one or more nodes does not threaten uptime or cause data loss for the cluster as a whole.
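As a rough illustration of key-based distribution with replication, the sketch below hashes each primary key onto a ring of nodes and picks the next nodes clockwise as replicas. It is a simplified model and not how any particular Cassandra version places data: the `ToyRing` class, the use of MD5 as the hash, one token per node, and the node names are all assumptions for the example.

```python
import hashlib
from bisect import bisect_right


class ToyRing:
    """Toy hash ring: a key is owned by the next node clockwise, plus replicas."""

    def __init__(self, nodes, replication_factor=2):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the node count")
        self.rf = replication_factor
        # Assumption: each node gets exactly one position (token) on the ring.
        self.ring = sorted((self._token(n), n) for n in nodes)
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _token(value):
        # Hash a string to a large integer position on the ring.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, primary_key):
        # First node clockwise from the key's token, then the following rf - 1 nodes.
        start = bisect_right(self.tokens, self._token(primary_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = ToyRing(["node1", "node2", "node3", "node4"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)  # three distinct nodes; which ones depends on the hash values
```

Because placement depends only on the key's hash, any node can compute where a row lives, and losing one node still leaves the other replicas of that row available.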
Because all Cassandra nodes are the same, data can be read and written to any node. Cassandra uses an eventual consistency model that is tunable in that a developer can decide per-operation how consistent they want each insert or query to be. If you look at the RDBMS ACID paradigm, Cassandra has the A-I-D, but not the "C" because there is no such thing as referential integrity or foreign keys.
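One common way to reason about per-operation consistency is to count how many replicas must acknowledge a write (W) and a read (R) for a given replication factor (N): when R + W > N, every read set overlaps the most recent write set. The sketch below encodes that rule for the ONE, QUORUM and ALL levels; the function names are hypothetical and the code is a back-of-the-envelope aid, not a driver API.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def overlap_guaranteed(write_level, read_level, replication_factor):
    """True when the read replica set must intersect the write replica set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(write_level, read_level, overlap_guaranteed(write_level, read_level, 3))
```

With a replication factor of 3, writing and reading at QUORUM overlaps on at least one replica, while ONE/ONE trades that guarantee for lower latency.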
At DataStax, we use Cassandra in our DataStax Enterprise product to bring the same data protection and continuous availability to Hadoop for analytics and Solr for enterprise search. With DataStax Enterprise, a customer has one database cluster for real-time, analytics, and search, that supports everything described above.
Jared:
- All data is stored as BSON (Binary JSON) objects.
- Persistence is achieved through an asynchronous framework of wait conditions. Clients can wait for data to be written to disk, replicated to slaves, or wait for nothing at all. This enables tunable durability under control of the developer.
- Data is written to disk first using a write-ahead log, which ensures data hits disk as soon as possible after writing (by default, the write-ahead log is flushed to disk once every 100ms).
- The database maintains B-Tree indexes over document data that are updated atomically when document data is updated.
- Queries are processed using a query planner that determines the optimal index to use based on the fields included in the query from the client.
- In sharded deployments, the system partitions documents across servers according to a user selected shard key. Documents are kept in contiguous ranges called Chunks and these Chunks are distributed evenly across available servers. The system automatically moves Chunks between servers to keep the system balanced. Once a query hits a shard, query processing proceeds as if in a non-sharded environment (e.g. query planner determines best plan & uses appropriate indexes to retrieve data).
- Replication is achieved through the oplog. The oplog is maintained by a primary member of the shard and contains the operations applied to that shard's data in the order in which those updates happened. Secondary members tail the oplog and apply updates locally in the same order they occurred on the primary.
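The oplog-based replication in the last bullet can be illustrated with a small in-memory model: a primary records each operation in order, and a secondary tails that log from its last known position. This is a conceptual sketch only. The classes `ToyPrimary` and `ToySecondary`, the `"set"`/`"delete"` operation names, and the list-based log are assumptions for the example and do not reflect MongoDB's actual oplog format.

```python
def apply_entry(data, entry):
    """Apply one (op, key, value) log entry to a dict acting as the data set."""
    op, key, value = entry
    if op == "set":
        data[key] = value
    elif op == "delete":
        data.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


class ToyPrimary:
    def __init__(self):
        self.data = {}
        self.oplog = []  # operations in the order they were applied

    def apply(self, op, key, value=None):
        entry = (op, key, value)
        apply_entry(self.data, entry)
        self.oplog.append(entry)


class ToySecondary:
    def __init__(self):
        self.data = {}
        self.position = 0  # index of the next oplog entry to apply

    def tail(self, primary):
        # Replay, in the primary's order, every entry not yet applied locally.
        for entry in primary.oplog[self.position:]:
            apply_entry(self.data, entry)
        self.position = len(primary.oplog)


primary, secondary = ToyPrimary(), ToySecondary()
primary.apply("set", "a", 1)
primary.apply("set", "b", 2)
secondary.tail(primary)          # secondary catches up to two entries
primary.apply("set", "a", 10)
primary.apply("delete", "b")
stale_copy = dict(secondary.data)
secondary.tail(primary)          # secondary applies the remaining two entries
```

Because entries are replayed in the same order, the secondary converges to the primary's state once it has consumed the whole log; until then it may serve older values.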
Darren: InfiniteGraph is a Distributed Graph Database. Like many NoSQL solutions, it stores data in a way that makes it very efficient for performing particular types of queries. InfiniteGraph treats connections in the data as first class citizens, persisting them as a kind of "virtual pointer" that is optimized for traversal. In contrast to other database types which store foreign keys to related objects (requiring nested/recursive queries to perform traversals), InfiniteGraph traversals are orders of magnitude faster.
In addition to the physical storage differences, InfiniteGraph's unique distributed data model allows the database to make use of high performance computing clusters or cloud infrastructures to distribute both insert/update operations and navigational query load.
John: We build most of our NoSQL "databases" behind APIs or a façade so that we can abstract the implementation, the vast majority are in-memory as performance was the first driver away from SQL and classic RDBMSs. We have no concerns with the technology, it's standard practice, the biggest issues we have are reporting and analytics as most off-the-shelf packages are SQL based.
InfoQ: What are the use cases or applications that are best suited to use your product?
Robin: With Cassandra, the underlying data model is a Google Bigtable design, which is column-family in nature. This means you have a sparse table format that is much more flexible than an RDBMS table, so that one row may have 3 columns and another row in the same column family could have 3,000 columns.
This type of data model is perfect for use cases that involve time series data (e.g. machine generated data, financial tick data, web clickstream data, etc.), retail data models that are used for Web shopping carts and user transactions, social media applications, online gaming, and systems that are very write intensive.
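A sparse column family can be pictured as a map from row key to that row's own set of columns. The Python sketch below is a mental model only, not Cassandra's storage format; the row keys, column names and the `insert` helper are made up for illustration.

```python
# Row key -> {column name: value}. Each row carries only the columns it has.
column_family = {}


def insert(row_key, columns):
    """Add or overwrite columns on a row without touching any other row."""
    column_family.setdefault(row_key, {}).update(columns)


# A narrow row with 3 columns.
insert("user:1", {"name": "Ann", "email": "ann@example.com", "city": "Sydney"})

# A wide row in the same column family: one column per reading, as in a time series.
insert("sensor:7", {f"reading:{i:04d}": i * 0.1 for i in range(3000)})

print(len(column_family["user:1"]), len(column_family["sensor:7"]))
```

No schema change is needed to go from the narrow row to the wide one, which is what makes the format convenient for data that keeps accumulating per key.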
With DataStax Enterprise, we support all those use cases as well as Hadoop batch analytic use cases and enterprise search use cases with Solr. For example, SourceNinja, a leader in helping organizations manage open source, uses DSE 2.0 to correlate multiple distributions of a single open source project to a single project. "For example, Linux is a single project - but there are multiple distributions and versions of Linux - Red Hat, Debian, Amazon and more," said Matt Stump, Co-founder, SourceNinja. "With all the different versions available, a search solution like what’s in DataStax Enterprise 2.0 is really the only option that makes sense to keep all projects straight."
Jared: MongoDB is a general purpose data store and is being used in a wide variety of applications. Typical use cases include content management systems, real time analytics, machine-to-machine communication, e-commerce catalogs, distributed state and counter management, and sensor data collection. In general, you can use MongoDB almost anywhere you might use a relational database. MongoDB will be particularly effective in applications that require multi-variate data (which leverages MongoDB's rich data model), high transaction rate applications (which benefit from sharding and replication to spread load across multiple servers), agile development cycles (with MongoDB's dynamic schema model) and cloud delivered applications (where MongoDB's scale out architecture makes it easier to compensate for the small, virtualized machines available to the developer).
Darren: InfiniteGraph is all about making use of the connections in your data. InfiniteGraph allows you to discover how something (a customer, user, product etc) is connected to other things using a rule based traversal algorithm. Once you start thinking about data as a graph, it becomes obvious how powerful traversals can be.
One of the most well-known use cases is "how are these two things related", finding all the paths in the data between two entities in the system. For example, the relationships between two people in a telephone call network would reveal possible ways for information to be passed between them and even identify the intermediate callers. This is a very powerful concept that can be applied in a range of problem sets.
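The "how are these two things related" question can be sketched in plain Python as a search for all simple paths between two people in a call graph. This is not InfiniteGraph's API. The `all_paths` function, the `max_hops` limit and the sample call data are assumptions for illustration.

```python
def all_paths(graph, start, goal, max_hops=4):
    """Return every simple path from start to goal using at most max_hops edges."""
    paths = []

    def walk(node, path):
        if node == goal:
            paths.append(path)
            return
        if len(path) - 1 >= max_hops:   # path already uses max_hops edges
            return
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in path:    # never revisit a person on the same path
                walk(neighbor, path + [neighbor])

    walk(start, [start])
    return paths


# Hypothetical call network: who has been on a call with whom.
calls = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

for path in all_paths(calls, "alice", "erin"):
    print(" -> ".join(path))
# alice -> bob -> dave -> erin
# alice -> carol -> dave -> erin
```

Each returned path names the intermediate callers, which is the information the telephone network example is after.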
Graphs are everywhere in the data world. There are countless real world examples where data can be expressed in terms of a series of vertices connected by edges. Another example is airline flight routing, where individual segments between airports can be used to construct different routes between two cities. Weighting on a graph like this can be used to query the most cost effective routes etc.
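The flight routing example maps onto a classic weighted shortest-path search. The sketch below uses Dijkstra's algorithm over one-way segments; the airports and fares are invented for illustration, and `cheapest_route` is a hypothetical helper, not part of any graph database API.

```python
import heapq


def cheapest_route(segments, origin, destination):
    """Dijkstra's algorithm over (from, to, cost) segments with non-negative costs."""
    graph = {}
    for src, dst, cost in segments:
        graph.setdefault(src, []).append((dst, cost))

    queue = [(0, origin, [origin])]   # (total cost so far, airport, route taken)
    settled = {}                      # cheapest known cost per visited airport
    while queue:
        cost, airport, route = heapq.heappop(queue)
        if airport == destination:
            return cost, route
        if airport in settled and settled[airport] <= cost:
            continue                  # already reached this airport more cheaply
        settled[airport] = cost
        for nxt, price in graph.get(airport, []):
            heapq.heappush(queue, (cost + price, nxt, route + [nxt]))
    return None                       # destination not reachable


# Hypothetical one-way segments with made-up fares.
segments = [
    ("SFO", "JFK", 400),
    ("SFO", "ORD", 150),
    ("ORD", "JFK", 120),
    ("SFO", "DEN", 90),
    ("DEN", "ORD", 80),
    ("DEN", "JFK", 330),
]

print(cheapest_route(segments, "SFO", "JFK"))  # (270, ['SFO', 'ORD', 'JFK'])
```

Here the two-segment route through ORD beats both the direct flight and the longer route through DEN, which is the kind of answer edge weights make possible.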
John: By product I assume you're referring to C24's Integration Objects (C24-iO). Our product is a Java binding code generator; we take a model-driven approach to messages and therefore to their storage. We're ideally suited to an environment where you're persisting messages, for example SWIFT, FpML, FIX, ISO-20022, SEPA, ISO-8583, DTCC, Minos etc. What we generate fits nicely into ESBs, SOAs, integration frameworks such as Mule and Spring Integration, and in-memory databases such as GemFire and GigaSpaces.
InfoQ: What are the limitations or constraints of using the product?
Robin: The primary limitations with Cassandra are that there are no join operations between column families so the data is highly denormalized. Also, if there’s a need for complex or nested transactions (with savepoints, rollbacks, etc.) then Cassandra isn’t the best database choice.
Jared: Today, MongoDB does not support multi-statement transactions or joins. In many cases, this is not an issue, as MongoDB's rich document model and atomic update capability makes it easier to model objects that would otherwise need to be normalized in a relational database. For cases where true multi-statement transactions or joins are required, then an RDBMS might be the right solution. Also, MongoDB is designed as an operational datastore, not an analytical data store. If most of your queries are aggregations over the entire data set that must run quickly, then an OLAP database might be the right solution.
John: We've never gotten around to doing a native Windows version so we're Java only but it's never really been an issue in our environment.
Darren: Like many NoSQL databases, InfiniteGraph specializes in a certain range of use cases. For us it is storing data as a graph and queries that involve deep relationship traversal over several degrees of separation. This makes us less suitable for traditional queries across the dataset or statistical type analysis like "the average revenue per customer over 55". Relational, Document and Column storage models are typically more applicable here. For batch oriented or very large datasets, a distributed processing platform like Hadoop/MapReduce can also be appropriate and effectively used alongside InfiniteGraph.
InfoQ: The Cross-Store Persistence concept is getting a lot of attention lately. It can be used to persist data into different databases, including NoSQL and relational databases. Can you talk about this approach and how it can help application developers who need to connect to multiple databases from a single application?
Robin: Some of our customers have done a full replace of existing RDBMSs with either Cassandra or DataStax Enterprise, but others definitely have a co-existence strategy that's in place.
For a co-existence strategy to work, you need a number of things. Obviously you need the right client libraries/drivers for the app to connect to the NoSQL database in the first place. Next, you need a way to move data back and forth between an RDBMS and a NoSQL database that’s easy to manage. This last part in particular can cause lots of headaches if not managed well.
For us, we support sqoop with DataStax Enterprise, which makes it really easy and efficient to move data from RDBMS tables to Cassandra column families. If the data needs to be transformed in flight, users can utilize the free Pentaho Kettle product that can slice and dice data in most every way imaginable to and from Cassandra and Hadoop.
Jared: This is indeed a common approach. Many application frameworks such as the Spring Data Framework, Rails and Django already support running on top of multiple data stores. While this offers the developer increased choice, one should be careful not to go overboard. Supporting 2-3 data stores is reasonable, but as you start to add your 5th or 10th data store, it can become a nightmare to manage, because it becomes increasingly difficult to manage data that is spread out over a lot of repositories.
Darren: Combining a number of database types in a system allows you to capitalize on the benefits of all of them and use them individually for their most appropriate purpose. It is very common for us to see entity data remain in an existing RDBMS since SQL is still a great tool for performing ad-hoc attribute based queries. They typically also provide good indexing capabilities and are normalized for efficient storage of data.
SQL however has no real concept of performing deep connection based queries like a Graph Database can. When combining RDBMS and InfiniteGraph, you can build a system that can identify the endpoints of a traversal using SQL, but exploit the power of a Graph Database for processing navigational style queries. We are currently involved in a number of systems where this type of approach is taken.
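The division of labour described here, SQL for picking endpoints by attribute and a graph for navigating between them, can be mocked up with Python's built-in `sqlite3` module and a plain adjacency list standing in for the graph database. The table layout, the sample rows, the `knows` relationship and the `shortest_path` helper are all assumptions for illustration, not InfiniteGraph code.

```python
import sqlite3
from collections import deque

# Relational side: entity attributes live in an ordinary SQL table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "Sydney"), (2, "Ben", "Perth"), (3, "Cy", "Sydney"), (4, "Di", "Hobart")],
)

# Graph side (stand-in for a graph database): directed "knows" edges by customer id.
knows = {1: [2], 2: [4], 3: [], 4: [3]}


def shortest_path(graph, start, goal):
    """Breadth-first search returning one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Step 1: an attribute-based SQL query identifies the traversal endpoints.
endpoints = [row[0] for row in db.execute(
    "SELECT id FROM customers WHERE city = ? ORDER BY id", ("Sydney",))]

# Step 2: the graph answers the navigational question between those endpoints.
path = shortest_path(knows, endpoints[0], endpoints[1])
print(endpoints, path)  # [1, 3] [1, 2, 4, 3]
```

The SQL query never has to express the multi-hop relationship, and the graph never has to index customers by city, so each store does the work it is suited to.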
John: This is nothing unusual, even with a "traditional" database we design different schemas for different uses, some with indices some without so why should this be any different? Some of the data/messages we store in a NoSQL in-memory store, some transform and store in a classic RDBMS for analysis or reporting.
InfoQ: What do you see as the role of In-memory data grids (IMDG) in the polyglot persistence space?
Robin: They have their place for sure. However, big data use cases oftentimes overwhelm them and so users end up turning to a solution like Cassandra/DataStax Enterprise instead. We certainly aren’t main memory, but because of the scale out architecture, you get the benefit of smart automatic sharding across nodes, and you can utilize the memory on those nodes so response times are very fast because most of the data is in memory.
Jared: I think that many of these solutions are becoming less and less important. Data Grids compensated for the slow performance of disk drives by keeping objects in memory to decouple applications from the slow storage layer. But products like MongoDB use RAM to cache as many objects as possible, meaning that the data grid can become redundant when used as a cache.
Data Grids are useful for situations where you have data you want to keep in memory that must be shared across multiple application servers, but which will never be written to a database (however, in many cases, it's simpler to put this data in the database and eliminate the operational complexity of running another data store).
While data grids can provide an abstraction layer, making all of your data stores conform to a similar API, this is often not a very good choice, because it forces the developer to work with least common denominator of features across their data stores. Typically, you pick a cross-database persistence mechanism because you have different needs of different data. If you try to normalize all of this data into a single API, you might defeat the purpose of using multiple data stores in the first place.
Darren: There is certainly a lot of interest around IMDG and I have seen them playing various roles in customer deployments. In many cases, IMDGs are being used to "extend" the lifetime of a system built on traditional RDBMS, as a way to cache "transient" shared data that is not written to the database alongside existing information from backend databases. Many of the IMDG packages are also incorporating processing engines and enabling MapReduce style capabilities on top of the memory grid. This is an interesting story for doing real time analytics over data that is changing constantly and I think there is a big future for this kind of capability.
In contrast, most modern databases (like InfiniteGraph) use distributed servers with built-in cache to combine the effects of a data grid and database in one solution.
John: Mercury delay-lines, magnetic core, hard-disks, RAM and SSD, it's all the same, some better suited to certain tasks than others. We (the investment banking world) moved to in-memory data grids about 10 years ago so its role is a primary one.
InfoQ: What is the current state of security in this space and what's coming?
Robin: All NoSQL vendors are pretty much in the same boat here; security is very minimal at present. Our customers either utilize the security framework in Cassandra to build their own, use operating system security, or do security in their application.
I do believe you’ll see this change before long, especially as government agencies and the financial market adopt NoSQL more than they have today.
Jared: Security is evolving. It's hard to comment in general about security across all data stores. But for MongoDB, significant effort has gone into a number of fronts:
- improving the authentication and user management within the database
- adding SSL encryption to client and replication interfaces to enable encryption of data in motion
- supporting encrypted file systems like eCryptFS to enable encryption of data at rest.
Darren: There is definitely a trend toward moving security out of the database and higher up the stack where the concept of a user is better understood. This is especially true for polyglot systems where there are differing security models in use and little support for replicating user information between them. Some systems use a master data management layer and overlay security at this level, but many are simply storing security attributes with the data and processing them in the application against authorization information stored elsewhere.
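The last approach mentioned, storing security attributes with the data and checking them in the application, might look like the following sketch. The record layout, the label names, the `authorizations` map (standing in for authorization information stored elsewhere) and the rule that a user must hold every label on a record are all assumptions for illustration.

```python
# Records as they might come back from any data store, each carrying its own labels.
records = [
    {"id": 1, "body": "press release", "labels": {"public"}},
    {"id": 2, "body": "quarterly forecast", "labels": {"finance"}},
    {"id": 3, "body": "salary review", "labels": {"finance", "hr"}},
]

# Authorization data kept outside the data stores (hypothetical users and grants).
authorizations = {
    "alice": {"public", "finance"},
    "bob": {"public"},
}


def visible_records(user, records, authorizations):
    """Keep only records whose labels are all covered by the user's grants."""
    granted = authorizations.get(user, set())
    return [record for record in records if record["labels"] <= granted]


print([r["id"] for r in visible_records("alice", records, authorizations)])  # [1, 2]
print([r["id"] for r in visible_records("bob", records, authorizations)])    # [1]
```

Because the check runs in the application, the same rule applies regardless of which underlying store a record came from, at the cost of every code path having to remember to call it.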
InfoQ: What is the future road map for your product in terms of new features and enhancements?
Robin: Where Cassandra is concerned, we’ll continue to work with the strong community that is out there to make Cassandra even easier to use than it is today, plus focus on making it even faster.
With DataStax Enterprise, we have the data domains of real-time, analytics, and enterprise search covered well, so we’ll be looking into other data-related domains that we don’t directly handle yet. There’ll be more coming from us on this later in 2012.
Darren: Based on feedback from current customers, we are currently expanding our distributed query and data placement capabilities. As a company we are focusing on these two areas and should have some major announcements in the coming months.
John: We continue to build tight integration into partner products, we have deployed into Mule for example for 7 years, GigaSpaces, GemFire, Spring Integration too. We can load complex messages into a GemFire cache through Spring Integration without even writing a line of code. Our roadmap is to expand on these capabilities and increase the performance and scalability as we move forward.
About the Panelists
Robin Schumacher has spent the last 20 years working with databases and big data. He comes to DataStax from EnterpriseDB, where he built and led a market-driven product management group. Previously, Robin started and led the product management team at MySQL for three years before they were bought by Sun, and then by Oracle. He also started and led the database product management team at Embarcadero Technologies, which was the #1 IPO in 2000. Robin is the author of three database performance books, was the database software reviewer for Intelligent Enterprise and other magazines, and is a frequent speaker at industry events. Robin holds BS, MA, and Ph.D. degrees from various universities.
Jared Rosoff is the Director of Product Marketing at 10gen. Before joining 10gen, Jared ran Product Development at Yottaa where he developed a real-time analytics engine on top of MongoDB and Ruby on Rails. Jared began his career by dropping out of Brown University to become the founder and CTO of TAZZ Networks.
Darren Wood is the Architect and Lead Developer of InfiniteGraph, a distributed graph-oriented data management solution from Objectivity Inc. Darren has spent the majority of his career architecting and building distributed systems with an emphasis on elastic scalability and data management.
Prior to joining Objectivity in 2007, Darren held positions as a Senior Consultant with IONA Technologies and a Development Team Lead for Citect Australia. Darren holds a First Class Honors Degree in Computer Systems Engineering from the University of Technology in Sydney, Australia.
John Davies is CTO and co-founder of Incept5 & C24, two international companies specialising in enterprise consultancy and integration. John was the original technical architect behind what is now Visa's V.me, and in the past has been chief architect at JP Morgan and BNP Paribas and technical director of 2 NASDAQ companies. John has co-authored several enterprise Java and architecture books and is a frequent speaker at banking and technology conferences.