
Michael Stonebraker: Major RDBMSes are legacy technology

by Ryan Slobojan on Sep 07, 2007

Michael Stonebraker, co-founder of the Ingres and Postgres relational database management systems (RDBMS) and CTO of Vertica Systems, laid the framework for a debate in the database community by declaring that most major databases should be considered legacy technology.

Stonebraker begins his analysis by noting that the databases in question (IBM's DB2, Microsoft's SQL Server, and Oracle) are based upon two platforms (System R and Ingres) which were architected more than 25 years ago, and that they are intended as general-purpose tools rather than as industry-specific products. He also points out that the environment for which they were designed is unlike today's: both hardware characteristics and database usage scenarios are very different. In particular, Online Transaction Processing (OLTP) was the only area where databases were used at that time; now there are unrelated applications, like data warehouses and semi-structured data, to consider.

He goes on to argue that the "one size fits all" approach is no longer the right one, and that "In every major application area I can think of, it is possible to build a SQL DBMS engine with vertical market-specific internals that outperforms the 'one size fits all' engines by a factor of 50 or so". He also says:

[...] my prediction is that column stores will take over the warehouse market over time, completely displacing row stores. Since many warehouse users are in considerable pain (can't load in the available load window, can't support ad-hoc queries, can't get better performance without a "fork-lift" upgrade), I expect this transition to column stores will occur fairly quickly, as customers search for better ways to improve performance.

In the longer term, I expect a transition of the same sort to occur in other markets where there is great user pain and the possibility of radical performance improvement from a specialized software architecture.

Erik Lai of ComputerWorld provided some background on column-oriented databases:

  • Column databases store data on a per-column basis, as opposed to a per-row basis
  • Because similar data is close together, column databases minimize disk read time for many types of queries (e.g. data warehouse queries)
  • Google's BigTable is a column-oriented database which powers many Google applications (e.g. Google Maps and Google Reader)

Lai also points out that row databases have advantages over column databases, such as when writing data to disk: writing a row is a single operation, whereas writing the same record to a column database requires a separate write for each column.
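The trade-off Lai describes can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not how any real engine is implemented: the three-column table, the `RowStore` and `ColumnStore` classes, and the way "writes" and "values touched" are counted are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical three-column table, used only for illustration.
COLUMNS = ("order_id", "region", "amount")


@dataclass
class RowStore:
    """Keeps each record together, as one tuple per row."""
    rows: list = field(default_factory=list)

    def insert(self, record):
        # The whole record is appended in one step.
        self.rows.append(tuple(record[c] for c in COLUMNS))
        return 1  # write operations in this toy model

    def sum_column(self, name):
        idx = COLUMNS.index(name)
        # Each row is read in full, so every field of every row is touched.
        touched = len(self.rows) * len(COLUMNS)
        return sum(row[idx] for row in self.rows), touched


@dataclass
class ColumnStore:
    """Keeps each column together, as one list per column."""
    columns: dict = field(default_factory=lambda: {c: [] for c in COLUMNS})

    def insert(self, record):
        # One append per column.
        for c in COLUMNS:
            self.columns[c].append(record[c])
        return len(COLUMNS)  # write operations in this toy model

    def sum_column(self, name):
        # Only the referenced column is read.
        values = self.columns[name]
        return sum(values), len(values)


records = [
    {"order_id": 1, "region": "EU", "amount": 10},
    {"order_id": 2, "region": "US", "amount": 25},
    {"order_id": 3, "region": "EU", "amount": 5},
]
row_store, col_store = RowStore(), ColumnStore()
row_writes = sum(row_store.insert(r) for r in records)
col_writes = sum(col_store.insert(r) for r in records)

print(row_writes, col_writes)           # 3 9
print(row_store.sum_column("amount"))   # (40, 9)
print(col_store.sum_column("amount"))   # (40, 3)
```

In this model the row layout needs a third of the writes to load the data, while the column layout touches a third of the values to answer the aggregate query. Real systems complicate both numbers with buffering, indexes, and compression.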

There has also been a lot of debate about this on Slashdot, with some disputing the demise of the "one size fits all" concept:

Regarding the obsolescence question, one size fits all will be good enough for most for some time to come. Increasingly people are more than happy with lightweight options that are even less efficient on which they slap persistence layers that reduce performance even more just because it allows them to autogenerate all the code that deals with stuffing boring data in some storage. Not having to deal with that makes it irrelevant how the database works and allows you to focus on how you work with the data rather than worrying about tables, rows and ACID properties. Autogenerating code that interacts with the database allows you to do all sorts of interesting things in the generated code and the layers underneath.

And others agreeing with Stonebraker's argument:

Column stores are great (better than a row store) if you're just reading tons of data, but they're much more costly than a row store if you're writing tons of data.

Therefore, pick your method depending on your needs. Are you storing massive amounts of data? Column stores are probably not for you...Your application will run better on a row store, because writing to a row store is a simple matter of adding one more record to the file, whereas writing to a column store is often a matter of writing a record to many files...Obviously more costly.

On the other hand, are you dealing with a relatively static dataset, where you have far more reads than writes? Then a row store isn't the best bet, and you should try a column store. A query on a row store has to query entire rows, which means you'll often end up hitting fields you don't give a damn about while looking for the specific fields you want to return. With column stores, you can ignore any columns that aren't referenced in your query...Additionally, your data is homogenous in a column store, so you lose overhead attached to having to deal with different datatypes and can choose the best data compression by field rather than by data block.

Why do people insist that one size really does fit all?
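The commenter's point about choosing compression per field can be illustrated with a small sketch. Run-length encoding is used here purely as an assumed example of a scheme that suits a homogeneous, repetitive column; the comment does not name a specific technique, and the `region_column` data is made up.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column with long runs of repeated values.
region_column = ["EU", "EU", "EU", "US", "US", "EU"]

encoded = rle_encode(region_column)
print(encoded)                               # [('EU', 3), ('US', 2), ('EU', 1)]
print(rle_decode(encoded) == region_column)  # True
```

Because a column holds values of a single type, a scheme like this can be picked column by column. In a row layout the same values are interleaved with unrelated fields, so the runs that make the encoding effective do not appear.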

 

The debate appears to be only getting started. What is your opinion?

RDBMSs by Arnon Rotem-Gal-Oz

It seems to me we are starting to see two distinct groups of applications. One is applications where any database is good enough, and the other is applications where something else would work better. In the first group we see smaller applications where you can afford to automate database code generation (e.g. Rails) and you care more about productivity. On the other hand, you see sites like Google, Amazon, eBay etc. where the traditional RDBMS approach is limiting (where you need linear scalability, real high availability etc.)
I wrote about this recently in a post called "The RDBMS is dead".

Arnon

Uh, where has this guy been for the last 20 years? by Jonathan Allen

We already have column-oriented storage for performance; it's called an index.

Finally by Rickard Öberg

I've been waiting for decent RDF stores for some time, and was pointed to C-Store (the academic version of Vertica) some time ago by the SIMILE guys. It looked very interesting, and the prospect of a commercial offering through Vertica was encouraging.

If Vertica is now at a point where their database can be used in production environments I would love to try it out. I agree with his assertions, and also hope that row-stores will be considered legacy reasonably soon. But, I also think that it will go considerably faster if there's a free OpenSource version of a C-Store available. Currently the main RDF store is Sesame, and more alternatives would be very interesting. Would a Community Edition of Vertica be too much to ask for? :-)

Only the beginning... but certainly an inexorable one :) by Zubin Wadia

I would agree with Arnon here. Clearly, there are more applications today that demand column-oriented stores vs. row-oriented ones, or at least more applications that would benefit from one.

An overwhelming majority of the business applications though are still not intense enough to warrant a change in approach. This will change over time, but it will be evolutionary rather than revolutionary.

Another contributing factor is awareness - people go through entire careers thinking RDBMSes are the only destination for structured data. When it comes to data management, IT tends to be cautious, which is why this transition is going to take a long time - regardless of the potential benefits.

Cheers,

Zubin.

Column-orientation = vertical partitioning? by Ileana Somesan

Hi,

is the column-orientation-paradigm another name for vertical partitioning?
If so, isn't vertical partitioning a built-in feature in traditional RDBMS like Oracle?

Best regards,
Ileana

InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.