William McKnight on Columnar Databases
Columnar databases offer better data storage capabilities for certain business use cases compared to the traditional relational database management systems (RDBMS). William McKnight spoke at the NoSQL Now 2011 Conference last week about the columnar databases and how they can be effective for certain data storage needs.
He said the data queries using RDBMS solutions (which are based on the row-wise design) send up a lot of data. Data Input/Output (I/O) has become the true bottleneck in the data processing needs today and when you do I/O, it’s better to get more data while you are there. The real way to avoid this problem is to only do the I/O that you really need. Columnar databases provide the ability to pick the columns needed instead of getting the whole row and not using the other columns (overhead) after the data retrieval. They offer a better solution in use cases where the work load needs a small percentage of the overall column bytes.
In columnar databases, the data is stored in columns keeping all columns in the same order. William discussed the data page layout of relational database record and compared it with that of a column database table. There is some overhead involved in the row page design (in RDBMS databases) because the row scan or index scan is used for data queries and it can be an expensive option given all the data involved. He showed an example of a use case where the data query took 500,000 I/Os for a row-based database versus 235 I/Os for a Columnar database.
There are different columnar data storage options like Decomposed Storage Model, Positional Representation, Modified B-Tree/Row Length Encryption, and Bitmap. He also talked about materialization strategies which include Function of 'projection', Early and Late Materialization.
William said that the relational row based data warehouses and data marts will still be there. Beside the data warehouse and Hadoop, you will have column databases to process the data lot faster. He concluded the session by saying the database designers should start with good design principles and then decide if you want to put the data in row based or column based solution.