New Compression Features in SQL Server 2008

The SQL Server compression story actually starts with SQL Server 2005. The Enterprise and Developer editions added a new storage format called vardecimal. This table-level option affects decimal and numeric fields. Values that require a lower precision than the field allows, for example storing 1.5 in a decimal(18,9), the storage needs are correspondingly reduced. In effect, it is a numeric version of varchar.

SQL Server 2008 goes far beyond this comparably simple trick. Chad Boyd writes,

However, Sql Server 2008's data compression is vastly different from this (though it still supports/includes the vardecimal format as well) - so different in fact, that if you enable data compression on a given table/index, the underlying row/page format is different - yes, that's right, you heard correctly - if you use compression (ROW or PAGE), the row/page format for Sql 2008 will be DIFFERENT from the existing row/page format (only for the table/index(es) you are using compression for). So, in Sql 2008, there are 2, yes 2, row/page formats for data. You may now be wondering "well, if the row/page format changed, how in the world did you have enough time to re-engineer every component in Sql Server that is aware of the format in such a small amount of time? The answer is that we didn't - the Storage Engine is the ONLY component in Sql 2008 that is aware of the new row/page format.

Row level compression drastically reduces the meta-data needed for variable length columns. Instead of storing the length of each field in 2 bytes, it now takes a mere 3 bits. Fields themselves are also smaller. Storing a 1 in an int field now only takes a single byte, though of course larger values may use up to 4 bytes.

Page level compression allows common data to be shared across rows. The first two techniques that Chad discusses are column prefix and page dictionary.

Assume you have a column of data on a single page of rows that contain values like 'Chad', 'Chadwick', 'Chadly', 'Chad', 'Chadster', 'Chadwick', and 'Chadly' (values repeated purposely) - as you can tell, there's quite a bit of redundant data 'prefixing' each of the rows in this column on this page, yes? So, what you might end up with in a scenario like this would be a column prefix value of 'Chad' stored in the CI structure, and each column ending up with pointers to that prefix value - resulting in values like this on-disk: '', '1wick', '1ly', '', '1ster', '1wick', and '1ly'

So, using the same column value samples as above with the Chad's, after the "column prefix" values are calculated and stored as mentioned above, you'd potentially end up with a page dictionary that contained the values '1ly', and '1wick', then the in-line row-values would ultimately look something like '', '2', '3', '', '1ster', '3', and '2'. In this case, we went from something in the lines of our original ~25 bytes of storage to around ~17 bytes of storage - another ~30+% savings.

Each page is compressed separately, with the prefixes and dictionaries stored in the page itself. Compression does not occur until the page is nearly full, as pages are allocated atomically and compressing a half a page to a quarter page wouldn't gain anything.

There are performance trade-offs for using row and page compression. CPU utilization will go up, but I/O utilization and memory utilization will go down.

Backup Compression is another feature for 2008. This is done using normal file system-style compression techniques. There is no tuning options, backup compression is either on or off for a given database.

All of the compression options are most likely going to be an enterprise-only option, though non-enterprise servers can restore a compressed backup.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Infrastructure topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter