Michael Stonebraker: Major RDBMSes are legacy technology
Michael Stonebraker, co-founder of the Ingres and Postgres relational database management systems (RDBMS) and CTO of Vertica Systems, laid the framework for a debate in the database community by declaring that most major databases should be considered legacy technology.
Stonebraker begins his analysis by noting that the databases in question (IBM's DB2, Microsoft's SQL Server, and Oracle) are based on two platforms (System R and Ingres) that were architected more than 25 years ago, and that they are intended as general-purpose tools rather than as industry-specific products. He also points out that the environment they were designed for is unlike today's, with very different hardware characteristics and database usage scenarios. In particular, Online Transaction Processing (OLTP) was the only area where databases were used at that time -- now there are unrelated applications like data warehouses and semi-structured data to consider.
He goes on to argue that the "one size fits all" approach is now no longer the correct concept to use, and that "In every major application area I can think of, it is possible to build a SQL DBMS engine with vertical market-specific internals that outperforms the 'one size fits all' engines by a factor of 50 or so". He also says:
[...] my prediction is that column stores will take over the warehouse market over time, completely displacing row stores. Since many warehouse users are in considerable pain (can't load in the available load window, can't support ad-hoc queries, can't get better performance without a "fork-lift" upgrade), I expect this transition to column stores will occur fairly quickly, as customers search for better ways to improve performance.
In the longer term, I expect a transition of the same sort to occur in other markets where there is great user pain and the possibility of radical performance improvement from a specialized software architecture.
- Column databases store data on a per-column basis, as opposed to a per-row basis
- Because similar data is close together, column databases minimize disk read time for many types of queries (e.g. data warehouse queries)
- Google's BigTable is a column-oriented database which powers many Google applications (e.g. Google Maps and Google Reader)
Lai also points out that row databases have advantages over column databases, such as when writing data to disk: writing a row is a single operation, whereas writing the same record across multiple columns requires multiple writes.
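The layout difference behind these points can be sketched in a few lines. This is a minimal illustration, not any particular product's storage format; the table, column names, and values are all made up for the example.

```python
# Hypothetical three-column table, held in both layouts.
rows = [
    ("alice", 30, "NY"),
    ("bob",   25, "SF"),
    ("carol", 35, "LA"),
]

# Row store: each record is stored contiguously.
row_store = list(rows)

# Column store: each column is stored contiguously.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# A warehouse-style query such as AVG(age) touches only the "age"
# column in the column store, instead of scanning every full row.
avg_age = sum(columns["age"]) / len(columns["age"])  # 30.0

# Writing one new record: a single append in the row store...
row_store.append(("dave", 40, "CHI"))
# ...but one append per column in the column store -- Lai's point
# about multiple writes.
columns["name"].append("dave")
columns["age"].append(40)
columns["city"].append("CHI")
```

On disk the same trade-off shows up as one sequential write per row versus one write per column file, which is why write-heavy workloads tend to favor row stores.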
Regarding the obsolescence question, one size fits all will be good enough for most for some time to come. Increasingly, people are more than happy with lightweight options that are even less efficient, onto which they slap persistence layers that reduce performance even more, just because doing so lets them autogenerate all the code that deals with stuffing boring data into some storage. Not having to deal with that makes it irrelevant how the database works, and allows you to focus on how you work with the data rather than worrying about tables, rows, and ACID properties. Autogenerating the code that interacts with the database allows you to do all sorts of interesting things in the generated code and the layers underneath.
Others agree with Stonebraker's argument:
Column stores are great (better than a row store) if you're just reading tons of data, but they're much more costly than a row store if you're writing tons of data.
Therefore, pick your method depending on your needs. Are you writing massive amounts of data? Column stores are probably not for you...Your application will run better on a row store, because writing to a row store is a simple matter of adding one more record to the file, whereas writing to a column store is often a matter of writing a record to many files...Obviously more costly.
On the other hand, are you dealing with a relatively static dataset, where you have far more reads than writes? Then a row store isn't the best bet, and you should try a column store. A query on a row store has to read entire rows, which means you'll often end up hitting fields you don't give a damn about while looking for the specific fields you want to return. With column stores, you can ignore any columns that aren't referenced in your query...Additionally, your data is homogeneous in a column store, so you lose the overhead attached to dealing with different datatypes and can choose the best data compression by field rather than by data block.
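The per-column compression point above can be made concrete with a toy example. Because every value in a column shares one type, a low-cardinality column compresses well with a scheme like run-length encoding; this is a hedged sketch of the idea, not the encoding any specific column store uses, and the column data is invented for illustration.

```python
def rle_encode(values):
    """Run-length encode a list into [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Extend the current run.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((v, 1))
    return runs

# A sorted, low-cardinality "city" column: 9 values...
city_column = ["NY"] * 4 + ["SF"] * 3 + ["NY"] * 2
encoded = rle_encode(city_column)
# ...collapse to 3 runs: [("NY", 4), ("SF", 3), ("NY", 2)].
```

In a row store, "NY" and "SF" would be interleaved with names and ages of different types, so no single compression scheme fits the whole block nearly as well.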
Why do people insist that one size really does fit all?
The debate appears to only be getting started. What is your opinion?
I wrote about this recently in a post called The RDBMS is dead
Uh, where has this guy been for the last 20 years?
If Vertica is now at a point where its database can be used in production environments, I would love to try it out. I agree with his assertions, and also hope that row stores will be considered legacy reasonably soon. But I also think the transition will go considerably faster if there's a free, open-source C-Store available. Currently the main RDF store is Sesame, and more alternatives would be very interesting. Would a Community Edition of Vertica be too much to ask for? :-)
Only the beginning... but certainly an inexorable one :)
The overwhelming majority of business applications, though, are still not intense enough to warrant a change in approach. This will change over time, but it will be evolutionary rather than revolutionary.
Another contributing factor is awareness - people go through entire careers thinking RDBMSes are the only destination for structured data. When it comes to data management, IT tends to be cautious, which is why this transition is going to take a long time - regardless of the potential benefits.