Finding the Right Data Solution for Your Application in the Data Storage Haystack
Thanks to the NoSQL movement, data storage solutions are no longer a solved problem. Many are working hard to build new storage solutions, and even more are willing to use them. On the flip side, if you are a programmer or a solution architect who wants a data storage solution for your application, you have to face the daunting task of weighing and understanding the tradeoffs associated with the application and make a decision. This article explores the data needs of end user applications and various tradeoffs. It provides guidelines on the criteria for selecting data storage choices enabling the architects and developers to make an informed decision.
A few years ago, most systems were small and relational databases could handle those systems without any trouble. Therefore, the storage choices for architects and programmers were simple. However, the size and scale of these systems have grown significantly over last few years. High tech companies like Amazon and Google faced the challenge of scale before others. They soon observed that the relational databases could not scale to handle those use cases. Eric Brewer's CAP theorem explains this phenomenon as "a distributed system can only have two of the three properties - Consistency, Availability, and Partition Tolerance". Motivated by the CAP theorem and Werner Vogel's (Amazon CTO) work on loose consistency, both Google and Amazon came up with two storage solutions called Big Table and Dynamo respectively. These solutions scale by relaxing ACID (Atomic, Consistency, Isolation, Durable) guarantees. They gave rise to the NoSQL movement, which in turn resulted in many data storage solutions.
If pre-NoSQL era was boring for people who built storage solutions, the NoSQL/NewSQL era has been exactly the opposite. The architects now have a wide variety of storage choices like Local memory, Relational, Files, Distributed Cache, Column Family Storage, Document Storage, Name value pairs, Graph DBs, Service Registries, Queue, and Tuple Space etc.
Programmers now face the same dilemma that we all face when choosing a TV from Best Buy - too many choices. Different applications have different needs from a data store. For example, an online retail application needs transactions while a search application like Google needs high scalability and eventually consistent data. My recommendations described in this article will provide a good starting point for programmers and solution architects who want to find the right solution in the data storage haystack. I believe the reader will walk away with a broader understating of data storage models and which model should be used when.
Data Storage Requirements
To decide on the best data solution for a given application, we should understand what the application expects from a data storage. Some expectations like reliability of data, data security, Vendor Agnostic nature, availability and simple user experience are expected by most of the applications. However, there are others like type of stored data, support for search depends from application to application, and we shall focus on those. For this discussion, we will focus on the following characteristics of an application.
- Types of data stored
- Scalability requirements
- Nature of data retrieval (i.e. Types of Queries)
- Consistency requirements
Let's discuss each decision parameter in detail, and break values of each property into several categories so that we can analyze them case-by-case.
Decision Parameters in Selecting a Data Storage Solution
Types of Data
Data is usually one of the three types: unstructured, structured, and semi-structured.
Unstructured Data: We classify a given data product as unstructured when the application or the storage that handles data does not understand its structure. Examples of unstructured data are documents written in natural language, images, videos, and raw observation data. The storage handles unstructured data as Files or Blobs (sequence of bytes). Applications either present unstructured data as is to humans who make sense of it (e.g. Web) or convert to structured data through data processing (e.g. AI, statistical techniques, and semantic web).
Structured Data: Has a well-defined structure (e.g. employee record in a database), and it's possible to extract or search by specific parts of the data. A good example of structured data is a relational database where each column in the database is predefined to have a type and a name.
Structured data always has an associated schema that describes the structure of the data. Often, we define the schema beforehand for data validation, to provide user-friendly interfaces/APIs, and to improve the performance. Any data stored in a given relational database has to match the schema of that database.
Relational databases use a two dimensional schema. However, there are other schema models like Star Schema often used by data mining (OLAP) use cases.
Semi-structured Data: Semi-structured data has some structure (e.g. tree, graph, or a list) in terms of relationships or hierarchy. However, it only defines an abstract structure. For example, XML data has relationships defined to each data elements by XML syntax and semantics, but it does not limit the type of data included within a XML document. Some examples of semi-structured data are XML, Graphs, and Name-value pairs, and HTML documents.
Need for Scalability
There are two ways for a system to scale, vertical scalability and horizontal scalability. Vertical scalability adds more resources to the computer (e.g. more memory, run on a faster machine etc.), and typically does not need any changes to the application or special features from the application. Horizontal scalability adds more nodes (computers) to the system. Support for vertical scalability needs to be planned from the early stages of the architecture lifecycle. It comes almost orthogonal to the application, but hardware limits the amount of resources possible within the same server. Therefore, we only focus on horizontal scalability in this discussion. To facilitate the discussion in this article, we define three scalability classes.
- Low scalability (2-5 nodes)
- Scalable (10 nodes)
- Highly scalable (100s to 1000s of nodes)
The low scalability class roughly denotes the limits of RDBMS where they can be scaled by adding few replicas. However, data synchronization is expensive and usually RDBMSs do not scale for more than 2-5 nodes. The "Scalable" class roughly denotes data sharded (partitioned) across many nodes, and high scalability means ultra scalable systems like Google.
It is worth noting that "the number of nodes" in the definition means the number of nodes in the data storage tier, not the application. For example, for some use cases, even the low scalability setup may support hundreds of application nodes if the load is small.
Nature of Retrieval
Applications store data for three primary use cases: Archival, Analytics (OLAP), and Online Transaction Processing (OLTP). Here archival use cases store data as a safeguard or for history, OLAP use cases analyze a collection of data to compute some results, and OLTP use cases are driven by end users and there are clients waiting in the other end for a responses.
Data storage solutions support data retrieval in different ways.
- Sequential access - client can access data item by item, and this model does not provide random access. This model is mainly used with archival use cases.
- Key based access - this model stores each data item against a key, and the client later retrieves the data item using the same key. This model provides random access, but does not support searching.
- Query based access - users provide a query, and data storage returns data that matches the query. User queries come in many forms.
- Search by properties (WHERE clause equivalent)
- JOIN based queries (they may include WHERE clause). These queries join data from multiple tables together.
- Semi-structured data supports queries that use underlying data structure. XQuery/XPath for tree like data and RDF Data Query Language for triple data are examples of such query forms.
- Temporal Queries - Also OLAP use cases have very complex queries, which often involve temporal (time-based) conditions (e.g. give me a maximum and minimum temperature in last 6 hours).
OLAP queries often take time to process the data, and therefore, are better implemented using methods like MapReduce.
Users retrieve unstructured data using keys, and unstructured data does not support search directly. There are two main methods to search unstructured data.
- a. Extracting or associating metadata with unstructured data items (e.g. attributes of a file), and using them to search for data items.
- b. Building an index based on the content of the data in the data items. Although, the exact meaning of each data item is not known, queries can provide important information. For example, most document search systems work in this manner by indexing words
Semi structured data cannot support detailed search as structured data does, but it can support search using the structure. For example, we can query XML using tools like XPath and XQuery, and graph based structures using graph searching techniques.
Finally, structured data mainly supports query-based retrieval although key based retrieval is also supported. We use the following four classes of data retrieval for structured data.
- Retrieve by Key
- Retrieve by WHERE clause or its equivalent
- Need Join Queries
- Retrieve by offline queries
We talk about consistency in terms of data operations like adding, editing, or deleting data. By consistency, we generally mean three guarantees. First, the failed operations do not leave side effects meaning the data changes are fully applied or they are not applied at all. Second, when multiple users are writing or reading, the data changes are applied in a way that readers always see the write as one atomic operation. In other words, readers never see an intermediate value. Finally, when there are multiple copies of the same data, the system applies updates to all copies in the same order ensuring the consistency.
We define three classes of consistency based on scope and how fast the changes are applied.
- Eventual Consistency: each data copy eventually comes to the same consistent state given enough time.
- Consistent Operations: Each operation single operation is applied in atomic manner. Difference between this and transactions is that transactions can apply multiple operations in atomic manner.
- Transactions: a series of operations can be applied with ACID guarantees.
Choosing the Right Storage Solution
This section provides Data storage recommendations. Each of the following subsections considers unstructured, semi-structured, and structured data, and other properties (Scalability, Consistency, and Types of Queries needed) are discussed within those sections.
Handling Unstructured Data
(Click on the image to enlarge it)
Listing 2 shows recommendations for unstructured data where the File Systems is the defacto solution. Different file system architectures provide different levels of scalability. For instance, File systems like Network File System (NSF) have limited scale and work well only within a LAN, but systems like Lustre are scalable and work across WANs. On the other hand, when individual files are not very large, Key-value based systems can be used, and they are massively scalable.
Due to the lack of structure, unstructured data only supports searching through metadata or content search through structure. In the table above, different rows show different search methods.
Metadata is data about files, and are often represented as name-value pairs although in some cases, the values can be of a complex type. File systems have metadata, but are often limited to a default set of properties. Metadata catalogs keep track of user-defined metadata about files and facilitate searching for files based on some criteria.
Another alternative is building an index. Index is a mapping from substrings (often words) in files to files. Search on substrings is supported through the index. Google uses this model to a great effect.
Since unstructured data storages do not understand the structure, they do not provide any constancy guarantees for data stored in them, and Parallel access is often handled through locks, where one process has exclusive access to the files while the data is being updated. Google File System has the concept of "append only files", where many processes can append records to files concurrently.
(Click on the image to enlarge it)
Listing 3 above depicts the recommendations for storing and managing the semi-structured data. They come in many forms (e.g. XML, Graphs), and different rows in the table represent each type. Each type often has data storages optimized for the data type and they are often the best storage choice for that type.
However, semi-structured data should only be stored as semi-structured data if one wants to store it as-is, and query the data through its structure. Otherwise, if the structure of data is known, it is better to convert the data to structured data and store it in structured form. For example, receiving XMLs as requests does not mean that one should have XML database as a storage mechanism. Most of the time, the schema for the XML is well known and it makes sense to extract the data and store in structured form.
At the same time, if the application never needs to search within XML through XPath, then XML can be stored in unstructured storage as XML database features are not useful to the application. XML databases help only in cases where there is XML whose schema is not known, and the application needs to search within that data using tools like XPath. The same is true for many semi-structured data types. In addition, it is worth noting that semi-structured data stores not widely used and proved in production much. Therefore, they should not be used unless they can provide some advantages over the structured model.
A good example of using semi-structured storage is storing a social graph. Since the social graph does not fit nicely with the structured data like relational data model, and since applications need to perform graph-based queries that are often very inefficient with the relational model, it often makes sense to keep social graphs in a graph database.
(Click on the image to enlarge it)
Table 4 summarizes the recommendations for structured data. It presents three dimensions: scale, consistency, and search support in a 2D form. The rows show different types of queries. We categorize columns first by scale, and then further categorize by consistency. Each cell provides recommendations for a specific scale, consistency model, and type of queries. The acronyms used in the table are as follow:
KV = Key-Value Systems, CF = Column Families, Doc = Document Based Systems
The recommendations in the table are based on the following general guidelines:
- Only databases have out-of-the-box support for transactions. It may be an option to implement transactions using transaction manager (e.g. Atomikos , Zookeeper, but that is likely to be slower than DB transactions.
- NoSQL solutions do not support JOIN operations unless it is offline processing where Map-Reduce can be used.
- If the use case doesn't need search functionality, but just needs the key based retrieval, the key/value based model is the best choice. Since they are often implemented with Distributed Hash Table (DHT) they are highly scalable and fast.
Let's consider each case in detail.
For the small-scale category, relational databases (DB) are the defacto solution. Since databases are widely used, accepted, and battle-tested, you should use them unless you need more scalability. However, in cases where transactions and joins are not required, one can use Key-Value storage, Column Families, or Document databases. One potential reason is that users may have a plan to scale up later. However, unless there are any perceivable benefits, one should stay with databases. Typical deployment in this case includes a database cluster that replicates data across 2-3 nodes. This provides a lot of benefits if the application has a read-dominated workload. However, the benefits of clustering are not as significant with more write-dominated use cases.
When search is not needed, Key-Value based systems are the best choice. Column Families are also useful, but the former is more scalable. The only exception for this is when transactions are needed where, either databases are needed or a custom transaction implementation should be used.
Offline data processing cases (such as data warehouses, Business Intelligence (BI), indexing) are not affected by most of the parameters with transactions being the only exception. There are custom multi-dimensional schemas (e.g. star schema) for these cases. Another alternative is to use NoSQL solutions like document-based systems and Column Families. In either case, joins can be supported through Map-Reduce queries.
JOIN queries work reliably only with databases, and do not work with large-scale systems. Parallel databases have explored the topic, and have developed methods to join two data sets that are on different machines. However, they have limited scalability, and do not work with all types of queries and data. Hence, this approach might work with medium-scale systems when amount of information that needs to be transferred while carrying out a join operation is small.
NoSQL storages like Column Families and Document databases do not support searches that have WHERE clauses inherently and support for those are implemented on top of a scatter-and-gather style where queries are scattered to all nodes and results are collected. This approach can handle medium-scale systems, but the same approach is unlikely to work with large-scale systems.
This section discuses some use cases, potential data solutions for those cases, and the rationale behind the choice of data storage solution.
- Document/ Video/ Image Repositories: Since the system serves the data as is and does not understand the data, these are typically handled through either file systems or Metadata catalogs. Most use cases support searching documents through text indexing and support searching Video and Images by associating metadata with each video and storing them in a metadata catalog. For example, most web sites are served from a file system, and most new agencies keep a repository of images and videos annotated by their metadata for quick retrieval.
- Social Networks and Relationships: Social graphs (e.g. Facebook), dependency graphs, and relationship graphs have graph structure as their basis. Furthermore, we can cast most queries on that type of data as graph searches. Therefore, semi-structured graph data storage is the best choice for such use cases.
- Analytical Processing: Analytical processing needs very fast write throughput to collect data and then walk through the data and process them to generate results. These results are either directly presented to the users or accessed by them at a later time. Often, Column family storages are used for these use cases as they provide high write-throughput. Since these use cases need data processing queries in an offline mode, lack of support for JOINs in the data storage implementation doesn't cause problems. Often, complex processing can be done using Map-Reduce methods.
- Web-based Retail Systems: Retail systems generate many read queries compared to few write queries that actually buy things. One potential design can keep the data about transactions in a RDBMS while keeping the data about items in a Column space. Column space serves the data for most listings while most of data about buying an item will be stored in the RDBMS.
This article provides data storage recommendations based on four decision parameters about a given application: type of data, scalability, consistency, and supported queries. We also describe the rationale for these choices while providing rules of thumb useful to the architects. It is worth noting that some use cases need more than one type of data storage as they inherently have several types of data. It makes sense to use two or more data storage types for those scenarios. The Analytical Processing use case is an example of such a scenario. However, such hybrid solutions cannot support transactions across multiple data stores. If needed, transactions can be added using an external transaction manager.
It is likely that there are other decision parameters and aspects of storage decisions that are not covered in this article. We would appreciate your feedback on any such omission or improvement.
About the Author
Srinath Perera works as a Senior Software architect at WSO2 Inc., where he overlooks the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at Lanka Software Foundation and teaches as a visiting faculty at Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of Apache Axis2, and he has been involved with the Apache Web Service project since 2002 and is a member of Apache Software foundation, PMC and the Apache Web Service project. Srinath is also a committer of Apache open source projects Axis, Axis2, and Geronimo.
His primary research interest is scale in distributed systems. He has been working on Web Services, SOA, Grid, Cloud, System management, E-Science, and he has publish many peer reviewed articles as well as technical articles.
Srinath received his Ph.D. and M.Sc. in Computer Sciences from Indiana University, Bloomington, USA and received his Bachelor of Science in Computer Science and Engineering from University of Moratuwa, Sri Lanka.
are relational databases really as unscalable as you say?
One sentiment that is reflected throughout your article is the feeling that relational databases do not scale beyond 2 to 5 nodes. Parallel databases with hundreds and even thousands of nodes are very much in operation and nothing in the relational database inherently prevents it from going to much larger configurations. (In 2002, the first Netezza system I worked on had 108 nodes, today systems with over 800 nodes are commonplace.)
Several applications have been developed with transparent sharding and tens of nodes running MySQL. Your characterization of table 4 therefore, where you entirely eschew traditional databases in scales over 3 nodes strikes me as excessively harsh. Parallel databases support distributed joins irrespective of the location of the tuples being joined.
An important aspect that I believe one should consider when choosing the database management system is the kind of "distribution" among the nodes. If your nodes are all going to be in one location, the considerations are very different from geographically distributed systems. I agree that if you are building a geographically distributed system, the implications of CAP theorem are significant and there is a reason to consider relaxing consistency the way Google and Amazon
did. However, for the vast majority of systems that require scalability, and where the "horizontal scaling" is accomplished with nodes that are essentially co-located, there is little practical reason to compromise consistency. For a detailed explanation take a look at bit.ly/rUF3gj.
Joins are an important aspect of a database management system because they eliminate anomalies. Without normalization and joins, data in a database can potentially become meaningless and in considering a NoSQL solution, one should wonder whether the kinds of things that they will do in the application could expose them to anomalies.
Re: are relational databases really as unscalable as you say?
I agree Parallel DBs scale much more than 2-3 nodes. Actually, I have them included in Scalable category as the only solution if you need transactions + scale to like 10+ nodes. Similar arguments can be made about Oracle. However, I felt scaling of parallel DBs really depend on many parameters. For example, if workload is mostly read, with very fast network, with join queries only have to move reasonable amount of data, then system will scale more. But if some of the parameters are not favorable, think start to get complicated.
Another side is that, in a Shareded/ Parallel DB setup, it needed lot of work/expertise to set up thing, optimize, make sure it can hold up etc. Often such a setup will need a reasonably big team monitoring and fine tuning. 2-3 node limit what I felt was possible to setup without too much trouble and keep running without too much trouble.
Re: are relational databases really as unscalable as you say?
Would love to chat with you more about this if you'd like.
I think it's a great article, and provide a good cover of the NoSQL market. However - you left out newSQL... Those solutions give you scalability (at the very least in the scalable level you describe, but some for the Highly Scalable level too) with SQL access. I think it's important to note those solutions since data is never independent - most of the time you need to access it from outside the application (backup, reporting), and the eco-system around SQL solutions is much more mature.
For full disclosure - I work for ScaleBase which does transparent sharding - so your relational database can run in this Highly Scalable environment that you mention. But the newSQL is very diverse and has some very good players there.
Great article. I want to point out this highly relevant discussion on Quora which discusses some great points regarding scalability of relational databases.
Yes, it takes someone that understands data to setup and maintain such systems, but in my opinion that is not a bad thing. I cringe whenever I hear people using this as a reason not to use a relational database. I don't care what the data store is - if you trust your architecture and data to someone that does not know the ins and outs of data management, sooner or later you will get burned.
MySQL at Facebook and Twitter
Re: MySQL at Facebook and Twitter
Also,if most queries and read-only, or using a shared nothing architectures, one can run lot of MySQL servers can scale up. But, then system scale due to the architecture, not due to the virtue of the storage system. Here, I am considering a work load with read/write mix.
Most systems are small
A few years ago, most systems were small and relational databases could handle those systems without any trouble. Therefore, the storage choices for architects and programmers were simple.
Why are we pretending that most systems are huge? No they're not. Most systems are still small and can be serviced by a classic RDBMS. In fact it's still statistically a safer choice to go with an RDBMS by default. I work in telecoms at the backend and we handle terabytes of data every day with Oracle.
This article reinforces the you will have billions of users scalability myth.
Every single architect has to understand what are the real maximum number of users of a system.
For enterprise systems the biggest that can be is Walmart at about 2million. Unless you are the next gen social networking site, you are never going to have anything comparable to even Walmart.
This article does bring value in the form of a matrix, though seems quite uninformed in places about practical state of the RDBMS'es.
Re: Most systems are small
However, at the same time, I believe the era of RDBMS is the solution for everything is over, and there are many other choices. And sometime so called NoSQL solutions are natural and cheaper choice for some cases. My goal of the article was to explore the choices, and nothing more.