Data Modeling with Key Value NoSQL Data Stores – Interview with Casey Rosenthal
In Key Value data stores, data is represented as a collection of key–value pairs. The key–value model is one of the simplest non-trivial data models, and richer data models are implemented on top of it.
These databases offer RESTful APIs as well as Protocol Buffers interfaces for data access. Key Value data stores like Riak also support the following additional features:
- Search: Distributed, full-text search engine with a query language.
- Secondary Indexes: Tag objects stored with additional values and query by exact match or range.
- MapReduce: Non-key-based querying for large datasets.
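As a rough illustration of the secondary-index feature listed above, the following minimal sketch keeps an in-memory index next to the key–value data. The `TinyKVStore` class and its method names are hypothetical and are not Riak's client API; the sketch only shows the idea of tagging an object with extra values and querying by exact match or range.

```python
from collections import defaultdict


class TinyKVStore:
    """Hypothetical in-memory key-value store with secondary indexes."""

    def __init__(self):
        self._data = {}                 # key -> opaque value
        self._index = defaultdict(set)  # (index name, index value) -> keys
        self._tags = {}                 # key -> index entries currently held

    def put(self, key, value, indexes=None):
        # Remove index entries left over from a previous write of this key.
        for entry in self._tags.pop(key, ()):
            self._index[entry].discard(key)
        self._data[key] = value
        entries = list((indexes or {}).items())
        for entry in entries:
            self._index[entry].add(key)
        self._tags[key] = entries

    def get(self, key):
        return self._data.get(key)

    def query_exact(self, name, value):
        """Return the keys tagged with exactly this index value."""
        return sorted(self._index.get((name, value), set()))

    def query_range(self, name, low, high):
        """Return the keys whose index value lies in [low, high]."""
        keys = set()
        for (idx_name, idx_value), members in self._index.items():
            if idx_name == name and low <= idx_value <= high:
                keys |= members
        return sorted(keys)


store = TinyKVStore()
store.put("user:1", b"...", indexes={"age": 31, "city": "Boston"})
store.put("user:2", b"...", indexes={"age": 45, "city": "Portland"})
store.put("user:3", b"...", indexes={"age": 38, "city": "Boston"})
boston_users = store.query_exact("city", "Boston")
older_users = store.query_range("age", 35, 50)
```

The values themselves stay opaque to the store; only the tags supplied at write time are queryable.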
Data modeling efforts when using Key Value databases focus on the access patterns.
InfoQ spoke with Casey Rosenthal, General Manager of Professional Services at Basho, the company behind the open source KV database Riak, about data modeling concepts and best practices when using these NoSQL databases for data management.
InfoQ: What type of data is not suitable for storing in a Relational Database but is a good candidate to store in a Key Value Database?
Casey: Three types of data are better fit for Key Value than Relational:
- Data that have an indeterminate form. HTML pages, for example, all have different structures. Some have headers, some have tables, some have images; others do not. The variety of structure in HTML pages makes it difficult to construct a schema for them, and Relational databases require a schema. Key Value databases do not require a schema, and can store data like HTML pages that have indeterminate form.
- Data that is large in size or quantity. Relational databases are optimized for small rows, few enough in number that a table can fit on one server. Large objects are easier to store in Key Value databases, as are collections of objects so numerous that they must be spread across multiple servers.
- Data that isn’t related. Authors, books and publishers are all related, and within a single application might be a good fit for a Relational database. Log files and cached application data probably aren’t related to each other, but still need to be stored for an application that requires both. It would be easier to store these disparate data types in a Key Value database since no relations are ever going to be modeled between them.
InfoQ: What are the advantages of using a KV Database over a relational database?
Casey: In the general case, the complexity of the query engine of a database corresponds to the difficulty of scaling that database. Most Relational databases have very sophisticated query engines. By contrast, most KV databases don’t really have a query engine at all, since the lookup path can be traced as a straight line from the request to the object in memory or on disk somewhere. As a result, most KV databases are much easier to scale than Relational databases. This is particularly true of distributed databases that are designed to exist on multiple servers. Relational databases have fundamental limits to how well they can scale, based on a combination of where they store the relation indexes, how much data exists in the system, the speed of the network within the distributed system, and other factors. A KV database does not have this fundamental limitation, since relations between data don’t have to be calculated by a query engine.
InfoQ: Conceptually all NoSQL Databases store the data in a key value fashion whether the value is a JSON Document, or a Column Family data set. What are the advantages of using a KV Database over other NoSQL databases like a Document or Column Family database?
Casey: The restrictions of handling data only in JSON or Column Family format carry implications about how the data is stored in the system and how the query engine must process requests. These restrictions and implications have further impacts on scaling profiles of those databases. KV databases don’t have these restrictions, and they rely on application code to parse the data. As a result, it is easier to scale the KV database irrespective of the type of data being stored within it. This is particularly true of distributed databases.
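The point that a KV database relies on application code to parse the data can be sketched as follows. This is an illustrative example under assumed names (`OpaqueStore`, `Author`, `encode_author`, `decode_author` are all hypothetical): the store accepts only bytes and never inspects them, while the application owns the encoding.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Author:
    name: str
    books: list


class OpaqueStore:
    """Hypothetical store that treats every value as an uninterpreted blob."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, blob: bytes) -> None:
        if not isinstance(blob, bytes):
            raise TypeError("values must be bytes; encoding is the application's job")
        self._blobs[key] = blob

    def get(self, key: str) -> bytes:
        return self._blobs[key]


# The application, not the store, decides how an Author becomes bytes.
def encode_author(author: Author) -> bytes:
    return json.dumps(asdict(author)).encode("utf-8")


def decode_author(blob: bytes) -> Author:
    return Author(**json.loads(blob.decode("utf-8")))


store = OpaqueStore()
store.put("author:1", encode_author(Author("A. Writer", ["First Book"])))
# A value with a completely different shape lives in the same store.
store.put("page:home", b"<html><body><h1>Home</h1></body></html>")
author = decode_author(store.get("author:1"))
```

Because the store never looks inside a value, switching the encoding later is an application change only.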
InfoQ: Can you discuss the typical data modeling process when using a KV database?
Casey: Best practices in KV data modeling focus on the access pattern. A developer is encouraged to approach the problem from the point of view of the application fetching the data out of the system. If the data can be written in such a way that it matches the format required by the application that fetches the data, then the data model is nearly transparent. Good KV data models “fall out” of the access-pattern approach to design.
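A minimal sketch of the access-pattern approach, assuming a hypothetical application that renders a profile page: the value is written in exactly the shape the page needs, so reading it back is a single lookup with no reassembly. The key scheme `profile_page:<user id>` and the function names are assumptions made for illustration, and a plain dictionary stands in for the database.

```python
import json

kv = {}  # plain dict standing in for a key-value database


def profile_key(user_id):
    # The key is derived from what the reader already knows: the user id.
    return f"profile_page:{user_id}"


def save_profile_page(user_id, name, recent_orders):
    # Write the data in the same shape the page will consume.
    document = {"name": name, "recent_orders": recent_orders}
    kv[profile_key(user_id)] = json.dumps(document)


def load_profile_page(user_id):
    # One lookup, one parse; returns None when the page was never written.
    raw = kv.get(profile_key(user_id))
    return None if raw is None else json.loads(raw)


save_profile_page(42, "Dana", [{"order": "A-100", "total": 19.5}])
page = load_profile_page(42)
```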
InfoQ: Where should the modeling happen for the NoSQL databases, in the database or application layer?
Casey: In KV databases, modeling should happen within the application layer. In NoSQL databases that have more restrictive APIs, such as graph databases which only deal with nodes and edges, the modeling should happen within the database.
InfoQ: Can you discuss the design considerations for the key value data management requirements?
Casey: Besides the access pattern, design considerations include: whether the data will be encrypted, versioned, or otherwise transformed when it is persisted, whether it will be read more often than it is written, and whether it will ever be modified. Data that will never be modified is called “immutable” data, and immutable data often provides advantages to the architecture of a system.
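One way to see why immutable data can simplify a design is content addressing, where the key is derived from the value itself. This technique is not mentioned in the interview; it is offered here only as an illustrative sketch, and the `WriteOnceStore` class is hypothetical.

```python
import hashlib


class WriteOnceStore:
    """Hypothetical store for immutable blobs, keyed by their SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        # The key depends only on the content, so writing the same blob
        # twice is harmless and an existing value is never overwritten.
        key = hashlib.sha256(blob).hexdigest()
        self._blobs.setdefault(key, blob)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


images = WriteOnceStore()
first_key = images.put(b"raw image bytes")
second_key = images.put(b"raw image bytes")  # duplicate write, same key
```

Since a value can never change under its key, readers may cache it indefinitely and concurrent writers cannot conflict.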
InfoQ: Are there any anti-patterns when working with KV data?
Casey: Treating KV data as though it were relational data is an anti-pattern. Normalizing data and trying to construct objects that only represent relationships between metadata are two anti-patterns that fall within this category.
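The cost of the normalization anti-pattern can be sketched by counting reads. In this hypothetical example (the `CountingStore` class, key names, and sample records are all made up for illustration), the normalized layout needs three lookups joined in application code, while the denormalized layout needs one.

```python
class CountingStore:
    """Hypothetical dict-backed store that counts read operations."""

    def __init__(self):
        self._data = {}
        self.reads = 0

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        self.reads += 1
        return self._data[key]


def seed(store):
    # Relational-style layout: separate records linked by keys.
    store.put("book:1", {"title": "Sample Title",
                         "author_key": "author:7",
                         "publisher_key": "publisher:3"})
    store.put("author:7", {"name": "A. Writer"})
    store.put("publisher:3", {"name": "Example Press"})
    # Denormalized layout: one record shaped for the reader.
    store.put("book_view:1", {"title": "Sample Title",
                              "author": "A. Writer",
                              "publisher": "Example Press"})


def load_book_normalized(store, book_id):
    # The application performs the "join" by hand: three round trips.
    book = store.get(f"book:{book_id}")
    author = store.get(book["author_key"])
    publisher = store.get(book["publisher_key"])
    return {"title": book["title"],
            "author": author["name"],
            "publisher": publisher["name"]}


def load_book_denormalized(store, book_id):
    # A single round trip returns everything the reader needs.
    return store.get(f"book_view:{book_id}")
```

The trade-off is that the denormalized copy must be rewritten whenever the author or publisher name changes, which is why this layout suits data that is read far more often than it is updated.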
InfoQ: Can you talk about any gotchas or limitations of KV databases?
Casey: KV databases do not have the “richness” of a query language like SQL. Developers expecting an SQL-like query language on top of a KV database will suffer an expectation mismatch.
InfoQ: What is the current status of standards in KV data management space in the areas of data queries, traversal, analytics etc.?
Casey: The principles of a REST-style API are well established and well understood by most developers, and the semantics of such an API generally correspond to the operations of most KV databases. Formal consensus on a specific KV API or KV query language does not yet exist.
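The correspondence between REST semantics and KV operations can be sketched without a real HTTP server. The `/keys/<key>` path layout and the `handle` function below are assumptions made for illustration, not the URL scheme of Riak or any other product; the status codes follow common HTTP conventions.

```python
def handle(store, method, path, body=None):
    """Map an HTTP-style request onto a dict-backed KV store.

    Returns a (status code, response body) tuple.
    """
    prefix = "/keys/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        return 400, None           # malformed path or empty key
    key = path[len(prefix):]

    if method == "GET":            # read the value for a key
        if key in store:
            return 200, store[key]
        return 404, None
    if method == "PUT":            # create or replace the value
        store[key] = body
        return 204, None
    if method == "DELETE":         # remove the key
        if key in store:
            del store[key]
            return 204, None
        return 404, None
    return 405, None               # method has no KV equivalent here


carts = {}
put_result = handle(carts, "PUT", "/keys/cart:42", b'{"items": 3}')
get_result = handle(carts, "GET", "/keys/cart:42")
```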
InfoQ: What is the future road map of KV databases in general and Riak in particular?
Casey: KV databases in general are moving toward co-existence with other styles of databases. Riak in particular is a solid, highly available, fault-tolerant, scalable data platform. The KV database in Riak is itself the platform, a solid foundation, and in the future we at Basho will leverage that strength to provide other non-KV APIs to developers. The large-object S3 and Swift APIs, for example, are already provided on top of Riak in the form of Riak CS. In Riak 2.0, we will be providing a Solr API on top of the data platform. In future versions, we will expand the set of APIs offered on this platform.
Casey also mentioned the following about data modeling and best practices with KV databases.
KV databases are the most fundamental of all databases, since they are the simplest to represent and don’t require query planners. As such, KV databases provide the best foundation for more sophisticated data platforms. Data platforms can be built for different use case properties, like highly available systems, or fault tolerant systems. If these data platforms are correctly built on a solid foundation, then the choice of data model can be made as a matter of developer convenience, rather than as an operational tradeoff. It is reasonable to expect that future data platforms will be built upon solid KV databases, with richer data models, query languages, and APIs being exposed over time.
About the Interviewee
Casey Rosenthal is General Manager of Professional Services at Basho, where he installs and tests Riak clusters, and provides training to clients so that they can do the same. As Chief Software Engineer for Port Forty Nine, Casey worked for NASA, Caltech, and JPL to engineer systems for storing and disseminating the image archives of space telescopes such as Hubble, Spitzer, and Chandra. He came in fourth place at the BotPrize 2K competition in Copenhagen for Discordia, a software bot written in JRuby that plays Unreal Tournament like a human based on a new artificial intelligence algorithm. He won a seed grant from the Maine Institute of Technology to commercialize a discrete event simulation framework written in Ruby. His Twitter ID is: @caseyrosenthal