BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Articles Big Data Solutions with MS SQL ColumnStore Index

Big Data Solutions with MS SQL ColumnStore Index

Bookmarks

The latest solutions in big data processing demand working with increasingly large amounts of data in ever shorter periods of time. In 2012, MS SQL introduced the ColumnStore (CS) Index. It was their first step away from classic row store organization to a column orientated storage and the promise of faster performance with minimal extra work.

The primary purpose of the MS SQL CS Index was to enable the download of as much data as possible to memory and work with this memory when processing data, as opposed to reading it from the disc. Two advantages of this innovation were higher speed and lower HDD IOPS costs. But the product was not perfect. Even though the problem of read-only mode in the 2012 version was fixed with the 2014 clustered CS Index that lets you modify data in the table, it turned out with the clustered CS, it was impossible to have simple indices with calculable fields, foreign keys and triggers. When working with data, it is important to know which method of indexing is the most effective in which scenarios.

A Few Historical Context Facts

The idea behind CS codes is not an innovation of Microsoft. Column-based storage appeared together with the first classical RowStore database management systems (RDBMS) in the 1970s. One of the most well-known column-based relational database software systems was Sybase IQ, founded in the 1990s and currently owned by SAP. In later years, more products appeared, changing the landscape of the business intelligence (BI) and big data markets. Among the most famous of them were Vertica, ParAccel, Kognito, Infobright, and SAND. CS codes first appeared on the MS SQL Server in 2012. The basic idea was to store the data grouped by columns rather than by rows.

A classic RDBMS looks like the table below; for each page of data there is more than one column, and possibly more rows:

And here is a data example:

ID

NAME

STATE

CITY

1

John

NY

New York

2

Jim

CA

City 1

3

James

NY

City 2

4

Lee

NY

City 3

So far, so good.

Excess Data Reading and Reduced Productivity

Problems begin when there is a table with many columns, and to achieve a designated result, only one to three columns of data are needed. In this case, MS SQL will read the entire page with all the columns and "give back" only the necessary data. This leads to excess data reading, and therefore, reduces productivity.

To see how MS SQL helps to solve the issue, let’s consider the storage options MS offers:

ID

1

2

3

4

NAME

John

Jim

James

Lee

STATE

NY

CA

NY

NY

CITY

New York

City 1

City 2

City 3

In the above tables, data is grouped by columns; it has a high degree of uniformity, which increases the data compression ratio. This, in turn, allows for a greater amount of data to be downloaded straight to memory as opposed to reading it from the disc. If only one column of data is needed, the MSSQL server will read only the pages where the data highlighted in red is located. As I mentioned earlier, this demonstrates two advantages: higher speed and lower HDD IOPS costs.

There is a definite correlation between the degree of data uniformity and the data compression ratio. Column-based data organization often presupposes uniform data with only a few variables (in our example, just two: NY and CA), which means more data can be compressed and downloaded to memory. The more varied the data is, the slower the CPU and the lower the compression ratio, but the CS method still remains effective and should be used since only the necessary data is processed, instead of going through all of the pages containing data. Therefore, when working with data, it is important to remember the type of data selected influences the speed of processing.

It is also important to keep in mind the limitations of using SQL Server CS Indexes. There is a big difference between MSSQL 2012 and 2014. MSSQL 2014 supports both clustered and nonclustered CS Indexes; the 2012 version supports only nonclustered indexes. A few limitations and restrictions include:

  • The following data type cannot be included in the CS Index:
    • binary(n), varbinary(n) (can be included in 2014 or above except varbinary(max))
    • image, text, ntext, varchar(max), nvarchar(max);
    • SQL variant
    • xml
  • You must drop and create the index instead of using the ALTER INDEX command
  • A CS Index cannot be created in view or indexed view
  • A CS Index cannot be combined with the following features:
  • A CS Index cannot have more than 1,024 columns
  • A table cannot have unique constraints, primary key constraints, or foreign key constraints

The Hidden Challenges of CS Index

Even though all of this sounds very promising, in 2012, the server had serious limitations with the use of the CS Index.

One of the greatest inconveniences of the system was the inability to change, add, or delete data from the table that had a CS Index. In fact, the transition table went into read-only mode. Users were forced to act accordingly: remove the index – load changes – create an index. For large amounts of data, the whole process would take a long time and completely counteract the positive effect of using the CS Index.

In the 2014 version of MSSQL MS, a clustered option of the CS Index was announced that permits data modifications within the table. However, it turned out that, with the cluster CS, it was impossible to work with calculated fields, foreign keys and triggers. Therefore, when using the CS Index, the above mentioned precautions should be kept in mind.

The “Wow” Factor of CS Index

To experience the CS Index’s “wow” factor, let us look at the following example; sales data is stored in a table. There are more than 50 million lines in that table. Let's try to calculate total sales for each customer.

Executing the query:

- 
-CLEAR SQL SERVER QUERY CACHE
DBCC DROPCLEANBUFFERS 
DBCC FREEPROCCACHE 

SET STATISTICS IO ON 
SET STATISTICS TIME ON 

SELECT CUSTOMER_ID, SUM(AMOUNT) 
FROM ORDER_DETAILS 
GROUP BY CUSTOMER_ID

We get the following result:

Table 'ORDER_DETAILS'. Scan count 5, logical reads 132615, physical reads 0, read-ahead reads 132622, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 12699 ms, elapsed time = 11064 ms.

Execution Plan:

(Click on the image to enlarge it)

Details about clustered scans:

Now let’s build a clustered CS Index and execute the query again (the cache must be cleared prior to this query execution). It should be noted, with such a volume of data building, an index may take more than a minute. We must delete a clustered index because we cannot have two clustered indexes, and that’s a limitation of the CS Index, as mentioned earlier.

CREATE CLUSTERED COLUMNSTORE INDEX columnstore_idx ON ORDER_DETAILS 

Now let’s execute the previous query one more time:

--CLEAR SQL SERVER QUERY CACHE 

DBCC DROPCLEANBUFFERS 
DBCC FREEPROCCACHE 

SET STATISTICS IO ON 
SET STATISTICS TIME ON 

SELECT CUSTOMER_ID, SUM(AMOUNT) 
FROM ORDER_DETAILS 
GROUP BY CUSTOMER_ID 

We get the following result:

Table 'ORDER_DETAILS'. Scan count 4, logical reads 35262, physical reads 23, read-ahead reads 48195, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 1248 ms, elapsed time = 1634 ms.

Execution Plan:

Details about the ColumnStore index scan:

And there you have it – the “wow” factor we’ve been waiting for.

As you can see, the use of the CS Index decreased the number of reads by 4 times and decreased the execution time by 10.

But what will happen to the data search requests?

Let's try to run the following query on a table with a CS Index:

--CLEAR SQL SERVER QUERY CACHE 
DBCC DROPCLEANBUFFERS 
DBCC FREEPROCCACHE 

SET STATISTICS IO ON 
SET STATISTICS TIME ON 

declare @p1 float 
declare @p2 float 

set @p1 = 100.0 
set @p2 = 200.0 

SELECT CUSTOMER_ID, AMOUNT FROM ORDER_DETAILS 
WHERE CUSTOMER_ID = 651 AND AMOUNT BETWEEN @p1 AND @p2 

(1571 row(s) affected)

Table 'ORDER_DETAILS'. Scan count 4, logical reads 36031, physical reads 21, read-ahead reads 52794, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 79 ms, elapsed time = 1125 ms.

It looks okay to me.

Now let’s delete the CS Index and create the index we need:

CREATE INDEX idx1 ON ORDER_DETAILS(CUSTOMER_ID, AMOUNT)

Let’s make the same query one more time:

(1571 row(s) affected)

Table 'ORDER_DETAILS'. Scan count 1, logical reads 9, physical reads 1, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 254 ms.

Now the CS Index is clearly in the losing position. Certainly, there is no out-of-this world “wow” effect, and it’s safe to say in situations with a larger number of records and an increased server load, the CS Index will be less and less effective. All of this happens because it has to read a large amount of information and cannot apply the "search" indexes. The good news is Microsoft has announced support for both CS Indexes and B-Tree indexes in MSSQL 2016.

But what do we do if we have a need for a fast search with a simultaneous ability to make fast calculations and obtain statistical data? For the MSSQL 2012 version, we only have one option: creating a copy of the table at regular intervals, and adding the necessary clustered index. Due to the fact a clustered index was added in the 2014 MSSQL version, we now have the opportunity to update the data – adding and modifying it in the master table as well as in the copy. But if you ask me, the best way to do it is to have patience and wait for the 2016 version!

Here are the main changes we will see with SQL Server 2016:

  • Previously nonclustered CS Indexes were read-only; in 2016, rowstore tables will be able to have one updateable nonclustered CS Index.
  • Previously, the CS Index did not support nonclustered indexes. The nonclustered CS Index definition supports using a filtered condition. A clustered CS Index can have one or more nonclustered rowstore indexes.
  • It will provide support for primary keys and foreign keys by using a B-Tree index to enforce these constraints on a clustered CS Index.

To conclude, the CS Index is not a silver bullet, but can show phenomenal results. It works perfectly and boosts productivity when online analytical processing (OLAP) cubes are built with the main purpose of data aggregation. However, regular index use and regular table optimization methods also have their place in big data processing and produce great results – if you know how and when to use them.

About the Author

Aleksandr Shavlyuga graduated from Belarusian State University in 1999, and began working as a Delphi developer, software architect, database architect and administrator in different IT companies. He joined Itransition in 2007 as a .NET developer and was promoted to senior developer in 2008. Aleksandr also holds Brainbench ANSI SQL and Brainbench MS SQL Programmer certifications. Today, he specializes in building scalable web applications and database design. His technical interests are in .NET, as well as Oracle and MS SQL databases.

Rate this Article

Adoption
Style

BT