The State of NoSQL
After at least four years of tough criticism, it's time to come to an intermediate conclusion about the state of NoSQL. So many things have happened around NoSQL that it is hard to get an overview and to evaluate which goals have been achieved and where NoSQL failed to deliver.
In many fields NoSQL has been more than successful, both in industry and in academia. Universities are starting to understand that NoSQL must increasingly be adopted into the curriculum. It is simply not enough to teach database normalization up and down. This, of course, does not mean that a profound relational foundation is wrong. On the contrary, NoSQL is certainly a perfect and important addition.
The NoSQL Space has exploded in just 4-5 years to about 50 to 150 new databases. nosql-database.org lists about 150 such databases, including some quite old but still strong dinosaurs like Object Databases. And, of course, some interesting mergers have happened, such as the CouchDB and Membase deal leading to CouchBase. But we will discuss each major system later in this article.
Many people have been expecting a huge consolidation in the NoSQL space. However, this has not happened. The NoSQL space simply exploded and is still exploding. As in all areas of computer science, programming languages for example, more and more gaps are opening up for a huge number of databases. And this is all in line with the explosion of the Internet, big data, sensors and many more technologies to come, leading to more data and different requirements for its treatment. In the past four years we saw only one significant system leaving the stage: the German graph database Sones. The vast number of NoSQL databases continue to live happily either in the open-source space, without any considerable money turnaround, or in the commercial space.
Visibility and Money
Another important point is visibility and industry adoption. In this space we can see a huge difference between the old industry, which is protecting its investment, and the new industry, which is mostly startups. While nearly all of the hot web startups such as Pinterest or Instagram have a hybrid (SQL + NoSQL) architecture, the 'old' industry is still struggling with NoSQL adoption. But the observation here is that more and more of these established companies are trying to cut out a part of their data streams to be processed and later analyzed with NoSQL solutions like Hadoop, MongoDB, Cassandra, etc.
This also leads to a strongly increased demand for developers and architects with NoSQL knowledge. A recent survey showed the following latest developer skills requested by the industry:
- Mobile Apps
- Social Media
So there are 2 NoSQL databases in the top ten for technology requirements here. And one even ranks before iOS. If this isn't praise, what is?!
But NoSQL adoption is going faster and deeper than one might think at first glance. In a well-known whitepaper Oracle stated in the summer of 2011 that NoSQL DBs feel like an ice cream flavor, but you should not get too attached because it may not be around for too long. Only a few months later Oracle showed its Hadoop integration into a Big Data Appliance. What is more, we saw the launch of their own NoSQL database, which was a revised BerkeleyDB. Since then, there has been a race among all vendors to integrate Hadoop. Microsoft, Sybase, IBM, Greenplum, Pervasive, and many more already have a tight integration. A pattern that can be seen everywhere: can't fight it, embrace it.
But one of the strongest yet silent signs of broad NoSQL adoption is that NoSQL databases are becoming a PaaS standard. Thanks to the easy setup and management of many NoSQL databases, DBs like Redis or MongoDB can be seen in dozens of PaaS offerings such as Cloud Foundry, OpenShift, dotCloud, Jelastic, etc. As everything moves more and more into the cloud, this becomes a huge momentum for NoSQL to put pressure on classic relational databases. Having the choice to select either MySQL/PostgreSQL or MongoDB/Redis, for example, will force users to think twice about their model and requirements, and will raise other important questions.
An interesting indicator for technologies is also the ThoughtWorks radar which always contains a lot of interesting stuff, even if you do not fully agree with everything contained in it. Let's have a look at their radar from October 2012 in picture 1:
Picture 1: ThoughtWorks Radar, October, 2012 - Platforms
In their platform quadrant they list five databases:
- Neo4j (adopt)
- MongoDB (trial but close to adopt)
- Riak (trial)
- CouchBase (trial)
- Datomic (assess)
If you look at this you see that at least four of these have received a lot of venture capital. If you add up all the venture capital in the entire NoSQL space you will surely arrive at something between $100M and a billion dollars! Neo4j is one of these examples, getting $11M in a series B funding. Other companies that received $10-30M in funding were Aerospike, Cloudera, DataStax, MongoDB, CouchBase, etc. But let's have a look at the list again: Neo4j, MongoDB, Riak and CouchBase have been in this space for the last four years and have constantly proven to be among the market leaders for specific requirements. Then DB number 5, Datomic, is more than astonishing: a completely new database with a completely new paradigm, written by a small team. It must be really hot stuff, and we will dig into it a bit later when discussing all DBs briefly.
Many people have asked for NoSQL standards, failing to see that NoSQL covers a really wide range of models and requirements. Hence unified languages for all major areas such as Wide Column, Key/Value, Document and Graph Databases will surely not be available for a long time because it's impossible to cover all areas. Several approaches, such as Spring Data, try to add a unified layer but it's up to the reader to test if this layer is a leap forward in building a polyglot persistence environment or not.
It is mostly the graph and the document databases that have come up with standards in their own domain. The graph world is more successful with its TinkerPop Blueprints, Gremlin, SPARQL, and Cypher. In the document space we have UnQL and Jaql filling some niches, although the first lacks real-world support by a NoSQL database. But with the force of Hadoop, many projects are working on bridging famous ETL languages such as Pig or Hive to other NoSQL databases. So the standards world is highly fragmented, but only due to the fact that NoSQL luckily is a very wide area.
One of the best overviews of the database landscape has been given by Matt Aslett in a report of the 451 Group. He recently updated his picture giving us more insights to the categories he mentioned. As you can see in the following picture, the landscape is highly fragmented and overlapping:
Picture 2: The database landscape by Matt Aslett (451 group)
As you can see there are several dimensions in one picture: Relational vs. Non-relational, Analytic vs. Operational, NoSQL vs. NewSQL. The last two categories have well-known sub-categories: Key-Value, Document, Graph and Big Tables for NoSQL, and Storage Engines, Clustering/Sharding, New Databases and Cloud Service Solutions for NewSQL. The interesting part of this picture is that it is increasingly difficult to pin a database to an exact location. Everyone is now trying fiercely to integrate features from databases found in other spaces. NewSQL systems implement core NoSQL features. NoSQL systems try more and more to implement 'classic' features such as SQL support, ACID, or at least configurable persistence.
It all started with the integration of Hadoop that tons of relational databases now offer. But there are many other examples: MarkLogic, for instance, is now starting to ride the JSON wave and is thus also hard to position. Furthermore, more multi-model databases are appearing, such as ArangoDB, OrientDB or AlchemyDB (which is now a part of the promising Aerospike DB). They allow you to start with one database model (e.g. the document / JSON model) and add other models (graph or key-value) as new requirements pop up.
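To make the multi-model idea a bit more tangible, here is a minimal in-memory sketch in Python. It is not the API of ArangoDB, OrientDB or any other product; the class `MultiModelStore` and its methods are hypothetical names chosen for illustration. It only shows the general idea of starting with documents and later adding graph edges over the same keys.

```python
from collections import defaultdict, deque


class MultiModelStore:
    """Toy store (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}                 # document model: key -> JSON-like dict
        self.edges = defaultdict(set)  # graph model, added later: key -> neighbor keys

    def put(self, key, doc):
        self.docs[key] = doc

    def get(self, key):
        return self.docs[key]

    def link(self, src, dst):
        """Add a directed edge between two existing document keys."""
        self.edges[src].add(dst)

    def reachable(self, start):
        """Breadth-first traversal: keys reachable from start (start excluded)."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.edges.get(node, ()):
                if nxt not in seen and nxt != start:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


store = MultiModelStore()
# Phase 1: plain document usage.
store.put("alice", {"name": "Alice", "city": "Berlin"})
store.put("bob", {"name": "Bob", "city": "Hamburg"})
store.put("carol", {"name": "Carol", "city": "Munich"})
# Phase 2: a new requirement needs relationships, so edges are added
# on top of the same keys without migrating the documents.
store.link("alice", "bob")
store.link("bob", "carol")

friends_of_friends = store.reachable("alice")
```

The point of the sketch is that the documents stay untouched when the graph requirement arrives; a real multi-model database would of course add persistence, indexing and a query language on top.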
Another wonderful sign of beginning maturity is the book market. After two German books published in 2010 and 2011 we saw the Wiley book by Shashank Tiwari, structured like a hurricane and full of great deep insights. The race continued with two nice books in 2012. 'Seven Databases in Seven Weeks' is surely a masterpiece. Freshly written and full of practical 'hands-on' insights, it takes six famous NoSQL databases and adds PostgreSQL to the mix, making it a top recommendation. On the other side, P.J. Sadalage and Martin Fowler take a more holistic approach to cover all the characteristics and help you evaluate your path and decisions with NoSQL.
But there is more to come. It is just a matter of time till a Manning book appears on the scene: Dan McCreary and Ann Kelly are writing a book called: "Making Sense of NoSQL" and the first MEAP chapters are already available.
After starting with concepts and patterns, their chapter 3 will surely look attractive:
- Building NoSQL Big Data solutions
- Building NoSQL search solutions
- Building NoSQL high availability solutions
- Using NoSQL to increase agility
This is a new fresh approach and will surely be worth reading.
State of the Leaders
Let's give each NoSQL leader a quick consideration. As one of the clear market leaders, Hadoop is a strange animal. On one hand it has an enormous momentum. As mentioned before, each classic database vendor is in a hurry to announce Hadoop support. Companies such as Cloudera and MapR continue to grow and new Hadoop extensions and successors are announced every week.
Hive and Pig also continue to gain acceptance. Nevertheless, there is a fly in the ointment: companies still complain about an unstructured mess (reading and parsing files could be faster), MapReduce is far 'too batch' (even Google is moving away from it), management is still hard, there are stability issues, and local training and consultants are still hard to find. But even if some of these issues can be addressed, it is still a hot question whether Hadoop will grow as it is or change dramatically.
The second leader, MongoDB, also suffers from flame wars, and it might be the nature of things that leading DBs get the most criticism. Nevertheless, MongoDB goes at a fast pace and criticism mostly is:
a) concerning old versions or
b) due to the lack of knowledge on how to deal with it in a correct way. This recently culminated in absurd complaints that the 32 bit version can only handle 2GB, although MongoDB states this clearly in the download section and recommends the 64 bit version.
Anyway, MongoDB's partnerships and funding rounds push ambitious roadmaps with hot stuff:
- the industry called for some security / LDAP features which are currently being developed
- full text search will be in soon
- V8 for MapReduce is coming
- even a finer level than collection level locking will come
- and a Hash Shard Key is on the way
Especially this last point catches the interest of many architects. MongoDB was often blamed (also by competitors) for not implementing a concise consistent hashing, which is not entirely correct because such a key can easily be defined. But in the future there will be a config option for a hash shard key. This means it is up to the user to decide whether a hash key for sharding is useful or whether he needs some (perhaps even rare) advantages of selecting his own sharding key. Surely this increases the pressure on other vendors and will lead to fruitful discussions about when to use which sharding key.
Cassandra is next in line and is doing quite well, adding more and nicer features such as better querying. However, rumors won't stop telling that running a Cassandra cluster is not a piece of cake and requires some hard work. But the most attractive issue here is surely DataStax. The new company on top of Cassandra, with a $25 million round C funding, is mostly addressing analytics and some operational issues. Especially the analytics was a surprise for many, because in the early days Cassandra was not known as a powerful query machine. But as this has changed in the latest version, the query capabilities may be sufficient for some modern analytics.
CouchBase also looks like a brilliant solution in terms of scalability and latency, despite the strong winds that Facebook and hence Zynga are now facing. It's surely not a hot query machine, but if they could improve querying in the future the portfolio would be quite complete. The merger with the CouchDB founders was definitely a strong step and it's worthwhile to see the great influences of CouchDB in CouchBase. At every database conference it's also funny to hear the discussions about whether CouchDB is doing better or worse after Damien, Chris and Jan have left. One can only hear extreme opinions here. But who cares as long as the DB is doing fine. And it looks like it does.
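The trade-off between a hashed shard key and a self-chosen (range-based) shard key can be illustrated with a small Python sketch. This is not MongoDB's implementation: the functions `range_shard` and `hash_shard`, the split points and the shard count are assumptions made for illustration, and the hashing shown is simple modulo hashing rather than a consistent hashing ring.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_SHARDS = 4  # assumed cluster size for this sketch


def range_shard(key: int, boundaries: list[int]) -> int:
    """Range-based placement: the shard is the key's position among sorted split points."""
    return bisect_right(boundaries, key)


def hash_shard(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based placement: hash the key first, then map the hash to a shard."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Monotonically increasing keys, e.g. counters or timestamps.
keys = range(1000, 1100)
# Split points that were chosen earlier, when all keys were still below 1000.
boundaries = [250, 500, 750]

range_load = Counter(range_shard(k, boundaries) for k in keys)
hash_load = Counter(hash_shard(k) for k in keys)

print("range-based:", dict(range_load))
print("hash-based: ", dict(hash_load))
```

With the range-based key every new insert lands on the last shard, while the hashed key spreads the same inserts across shards. The price of hashing is that range queries over the original key can no longer be routed to a single shard, which is one of the "advantages of selecting his own sharding key" a user may want to keep.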
The last NoSQL DB to be mentioned here is of course Riak, which also improved dramatically in functionality and monitoring. It continues to have a good reputation mostly in terms of stability: "rock solid, invisible and good for your sleep". The Riak CS fork also looks interesting in terms of the modularity of this technology.
Beside the market leaders, newcomers are always interesting to evaluate. Let's dig into some of them.
Elastic Search surely is one of the hottest new NoSQL products and just got $10M in series A funding, and that for a good reason. As a scalable search engine on top of Lucene it brings many advantages: a) a company on top providing services and b) leveraging all the achievements that Lucene has accumulated in the last years. It will surely infiltrate the industry now more than before, attacking all the big players in the semi-structured information space.
Google also sent its small but fast LevelDB into the field. It serves as a basis for many usages with specific requirements such as compression integration. Even Riak integrated LevelDB. It remains to be seen when all the new Google internal databases such as Dremel or Spanner will find their way out as open-source projects (like Apache Drill or Cloudera Impala).
Another tectonic shift surely was DynamoDB at the start of 2012. They call it the fastest growing service ever launched at Amazon. It's the ultimate scaling machine. New features are coming slowly but the focus on SSDs and latency is quite amazing.
Multi-model databases are also a field worth having a look at. OrientDB, its famous representative, is by far not a newcomer but it is improving its capabilities quite fast. Perhaps too fast, because some customers might now be happy that OrientDB has reached Version 1.0 and thus hopefully gained a lot more stability. Graph, Document and Key-Value support combined with transactions and SQL are reasons enough to give it a second try. Especially the good SQL support makes it interesting for analytic solutions such as Pentaho. Another newcomer in this space is ArangoDB. It is moving fast and it doesn't flinch from comparing itself in benchmarks against the established players.
However, again the native JSON and graph support saves a lot of effort if new requirements have to be implemented and the new data has a different model that must be persisted.
By far the biggest surprise this year was Datomic. Written by some rock stars of the Clojure programming language in an incredibly short time, it unveils a whole bunch of new paradigms. Furthermore it has made its way into the ThoughtWorks radar with the recommendation to have a look at it. And although it is 'just' a layer on top of established databases, it brings a huge number of advantages, such as:
- a time machine
- a fresh and powerful query approach
- a new schema approach
- caching & scaling features
Currently, DynamoDB, Riak, CouchBase, Infinispan and SQL are supported as the underlying storage engine. It even allows you to mix and query different DBs simultaneously. Many veterans have been surprised that such a radical paradigm shift can be possible. Luckily it is.
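The 'time machine' item in the list above can be sketched in a few lines of Python. This is explicitly not Datomic's API or data model; the class `FactStore` and its methods `assert_fact` and `as_of` are hypothetical names. The sketch only illustrates the general idea of an append-only log of facts that can be viewed as of any past transaction.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Toy append-only store: facts are never overwritten, only added."""

    log: list = field(default_factory=list)  # entries: (tx, entity, attribute, value)
    tx: int = 0

    def assert_fact(self, entity, attribute, value) -> int:
        """Append a new fact and return its transaction number."""
        self.tx += 1
        self.log.append((self.tx, entity, attribute, value))
        return self.tx

    def as_of(self, tx: int) -> dict:
        """Rebuild the view of all entities as it looked after transaction tx."""
        view = {}
        for t, entity, attribute, value in self.log:
            if t > tx:
                break  # the log is ordered by transaction number
            view.setdefault(entity, {})[attribute] = value
        return view


store = FactStore()
t1 = store.assert_fact("user-1", "email", "old@example.com")
t2 = store.assert_fact("user-1", "email", "new@example.com")

past = store.as_of(t1)     # the database as it was after the first transaction
present = store.as_of(t2)  # the current view
```

Because nothing is updated in place, the old e-mail address is still available by asking for an earlier point in time, which is the essence of the time-machine idea.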
To conclude, let us address three points:
Some new articles by Eric Brewer on the CAP theorem should have come several years earlier. In one of these articles he states that "2 of 3" is misleading, explaining why the world is more complicated than a simple CP/AP, i.e. ACID/BASE, choice. Nevertheless, thousands of talks and articles kept on praising the CAP theorem without any critical review for years. Michael Stonebraker was the strongest critic of the NoSQL movement (and the NoSQL space owes him a lot), pointing to these issues some years ago! Unfortunately, not many were listening. But now that Eric Brewer has updated his theorem, the time of simple CAP statements is definitely over. Please be at the very front in pointing out the true and diverse CAP implications.
As we all know, the weaknesses of the classical relational databases have led to the NoSQL field. But it was just a matter of time for the empire to strike back. Under the term "NewSQL" we can see a bunch of new engines (such as database.com, VoltDB, GenieDB, etc.; see picture 2), improving classic solutions, sharding and cloud solutions. Thanks to the NoSQL movement.
But as many DBs try to implement every feature, clear frontiers vanish.
The decision for a database is getting more complicated than ever.
You have to know about 50 use cases, 50 DBs and you should answer at least 50 questions. The latter have been gathered by the author in over 2 years of NoSQL consulting and can be found here: Select the Right Database, Choosing between NoSQL and NewSQL.
- It's common wisdom that every technology shift, since client-server and before, is about ten times more costly to switch to. For example, switching from Mainframe to Client-Server, Client-Server to SOA, SOA to WEB, RDBMS to Hybrid Persistence, etc. As a consequence, many companies hesitate and struggle in adding NoSQL to their portfolio. But it is also known that the early adopters who are trying to get the best out of both worlds, and thus integrate NoSQL fast, will be better positioned for the future. In this regard, NoSQL solutions are here to stay and will always be a gainful area for evaluations.
About the Author
Prof. Dr. Stefan Edlich is a senior lecturer at Beuth HS of Technology Berlin (University of App. Sc.). He wrote more than 10 IT books for publishers such as Apress, O'Reilly, Spektrum/Elsevier and others. He runs the NoSQL Archive, did NoSQL consulting, organizes NoSQL Conferences, wrote the world’s first two NoSQL books and is addicted to the Clojure programming language.