InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

New Compression Features in SQL Server 2008

Posted by Jonathan Allen on Nov 15, 2007

Sections
Operations & Infrastructure,
Enterprise Architecture,
Development,
Architecture & Design
Topics
.NET ,
Database Design ,
Data Warehousing
Tags
SQL Server 2008

The SQL Server compression story actually starts with SQL Server 2005. The Enterprise and Developer editions added a new storage format called vardecimal. This table-level option affects decimal and numeric fields. Values that require a lower precision than the field allows, for example storing 1.5 in a decimal(18,9), the storage needs are correspondingly reduced. In effect, it is a numeric version of varchar.

SQL Server 2008 goes far beyond this comparably simple trick. Chad Boyd writes,

However, Sql Server 2008's data compression is vastly different from this (though it still supports/includes the vardecimal format as well) - so different in fact, that if you enable data compression on a given table/index, the underlying row/page format is different - yes, that's right, you heard correctly - if you use compression (ROW or PAGE), the row/page format for Sql 2008 will be DIFFERENT from the existing row/page format (only for the table/index(es) you are using compression for). So, in Sql 2008, there are 2, yes 2, row/page formats for data. You may now be wondering "well, if the row/page format changed, how in the world did you have enough time to re-engineer every component in Sql Server that is aware of the format in such a small amount of time? The answer is that we didn't - the Storage Engine is the ONLY component in Sql 2008 that is aware of the new row/page format.

Row level compression drastically reduces the meta-data needed for variable length columns. Instead of storing the length of each field in 2 bytes, it now takes a mere 3 bits. Fields themselves are also smaller. Storing a 1 in an int field now only takes a single byte, though of course larger values may use up to 4 bytes.

Page level compression allows common data to be shared across rows. The first two techniques that Chad discusses are column prefix and page dictionary.

Assume you have a column of data on a single page of rows that contain values like 'Chad', 'Chadwick', 'Chadly', 'Chad', 'Chadster', 'Chadwick', and 'Chadly' (values repeated purposely) - as you can tell, there's quite a bit of redundant data 'prefixing' each of the rows in this column on this page, yes? So, what you might end up with in a scenario like this would be a column prefix value of 'Chad' stored in the CI structure, and each column ending up with pointers to that prefix value - resulting in values like this on-disk: '', '1wick', '1ly', '', '1ster', '1wick', and '1ly'

So, using the same column value samples as above with the Chad's, after the "column prefix" values are calculated and stored as mentioned above, you'd potentially end up with a page dictionary that contained the values '1ly', and '1wick', then the in-line row-values would ultimately look something like '', '2', '3', '', '1ster', '3', and '2'. In this case, we went from something in the lines of our original ~25 bytes of storage to around ~17 bytes of storage - another ~30+% savings.

Each page is compressed separately, with the prefixes and dictionaries stored in the page itself. Compression does not occur until the page is nearly full, as pages are allocated atomically and compressing a half a page to a quarter page wouldn't gain anything.

There are performance trade-offs for using row and page compression. CPU utilization will go up, but I/O utilization and memory utilization will go down.

Backup Compression is another feature for 2008. This is done using normal file system-style compression techniques. There is no tuning options, backup compression is either on or off for a given database.

All of the compression options are most likely going to be an enterprise-only option, though non-enterprise servers can restore a compressed backup.

50 New features of SQL Server 2008 by Dutt Dutt Posted
  1. Back to top

    50 New features of SQL Server 2008

    by Dutt Dutt

Educational Content

10 tips on how to prevent business value risk

One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.

Interview: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives

InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.

Beauty Is in the Eye of the Beholder

Alex Papadimoulis discusses ugly code, where it comes from, how to avoid it, and how to get rid of it.

Architecting Visa for Massive Scale and Continuous Innovation

John Davies examines Visa’s architecture and shows how enterprises have architected complex integrations incorporating Hadoop, memcached, Ruby on Rails, and others to deliver innovative solutions.

Max Protect: Scalability and Caching at ESPN.com

Sean Comerford unveils ESPN.com’s architecture, what components are used and why, and the current changes the website goes through.

The Seven Deadly Sins of Enterprise Agile Adoption

Are there repeated patterns of failure on Enterprise Agile Enablement efforts? Sanjiv and Arlen discuss Seven Deadly Sins to avoid when adopting Agile in an enterprise.

Questions for an Enterprise Architect

Erik Dörnenburg answers: What is Enterprise and Evolutionary Architecture?, discussing 4 issues: Turning strategy into execution, Ensuring conformance, Where do the architects sit? Buying or building?

Wrap Your SQL Head Around Riak MapReduce

Sean Cribbs explains what Map-Reduce and Riak are, why and how to use Map-Reduce with Riak, and how to convert SQL queries into their Map-Reduce equivalents.