Posted by Jonathan Allen on Mar 07, 2011
Like most major databases, SQL Server normally stores a table as a B-tree if it has a clustered index and as a heap otherwise. Both structures are essentially row-based, with the number of rows per page depending on the overall row size. Starting with SQL Server 2011, a third option becomes available: by applying a “Columnstore Index”, SQL Server will store data by column instead of by row.
For a table with 1 TB of data and 1.44 billion rows, Microsoft claims that column-oriented queries saw a 16X speed-up in CPU time and a 455X improvement in elapsed time. In real terms, a query that originally took 501 seconds was reduced to 1.1 seconds. The test was performed on a machine with 32 logical processors and 256 GB of RAM.
This improvement comes from isolating each column in its own set of pages. When a query runs, only the columns it references are loaded from disk; the pages containing the other columns are simply ignored.
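To make the I/O argument concrete, here is a minimal sketch in Python. It is a toy model, not SQL Server's actual page format: the page size and both function names are assumptions invented for illustration, and every value is treated as the same size.

```python
# Toy model of page reads. PAGE_SIZE and both functions are hypothetical;
# real SQL Server pages are 8 KB and hold variable numbers of values.
PAGE_SIZE = 4  # values per page (assumed for illustration)


def pages_needed(value_count, page_size=PAGE_SIZE):
    """Number of pages required to hold value_count values (ceiling division)."""
    return -(-value_count // page_size)


def row_store_pages_read(num_rows, num_columns):
    """Row layout: all columns of a row are stored together, so a scan
    touches every page no matter how few columns the query needs."""
    return pages_needed(num_rows * num_columns)


def column_store_pages_read(num_rows, columns_in_query):
    """Column layout: each column has its own pages, so a scan touches
    only the pages of the columns the query references."""
    return columns_in_query * pages_needed(num_rows)


if __name__ == "__main__":
    # 1,000 rows, 10 columns, query references 2 columns.
    print(row_store_pages_read(1000, 10))     # 2500
    print(column_store_pages_read(1000, 2))   # 500
```

In this model a query that references all 10 columns reads 2,500 pages in either layout, so the I/O saving disappears as the query approaches “SELECT *”.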
This is very much like having a covering index for every conceivable combination of columns. But instead of costing huge amounts of hard drive space, it actually takes less space than a traditional table. Since SQL Server compression occurs at the page level, and a column is more likely to contain repeating data than a row, tables with columnstore indexes are expected to achieve higher compression levels.
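The intuition that columns repeat more than rows can be sketched with run-length encoding. This is only a stand-in chosen for simplicity; it is not claimed to be the compression scheme SQL Server uses, and the sample rows are made up for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sample data: (country, year, status)
rows = [
    ("US", 2011, "shipped"),
    ("US", 2011, "shipped"),
    ("US", 2011, "pending"),
    ("DE", 2011, "pending"),
]

# Row layout: values of one row are adjacent on the page.
row_major = [value for row in rows for value in row]

# Column layout: values of one column are adjacent on the page.
column_major = [value for column in zip(*rows) for value in column]

if __name__ == "__main__":
    print(len(run_length_encode(row_major)))     # 12 runs: nothing collapses
    print(len(run_length_encode(column_major)))  # 5 runs
```

In the row layout, neighbouring values come from different columns and almost never match, so nothing collapses. In the column layout, equal values sit next to each other and the same 12 values shrink to 5 runs.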
The decision to use a columnstore index should not be taken lightly. First and foremost, a table with a columnstore index is not updatable: once the index has been created, no inserts, updates, or deletes are allowed against the table. Microsoft expects most shops to use a daily refresh cycle and otherwise treat the data as read-only. During the refresh cycle the index is dropped, the data is updated, and then the index is recreated. As this is certainly an expensive operation, one could use vertical partitioning to limit the churn to a subset of the logical table.
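The drop, update, recreate cycle described above can be sketched as a toy Python class. Everything here (`ToyTable`, `ReadOnlyError`, `daily_refresh`) is hypothetical and only mimics the behavior the article describes; it does not call SQL Server or any real database API.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the columnstore index exists."""


class ToyTable:
    """In-memory stand-in for a table that becomes read-only once indexed."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.columnstore = None  # None means no columnstore index exists

    def create_columnstore_index(self):
        # Pivot the rows into one list per column.
        self.columnstore = [list(column) for column in zip(*self.rows)]

    def drop_columnstore_index(self):
        self.columnstore = None

    def insert(self, row):
        if self.columnstore is not None:
            raise ReadOnlyError("table is read-only while the columnstore index exists")
        self.rows.append(row)


def daily_refresh(table, new_rows):
    """Drop the index, load the new data, then rebuild the index."""
    table.drop_columnstore_index()
    for row in new_rows:
        table.insert(row)
    table.create_columnstore_index()


if __name__ == "__main__":
    table = ToyTable([(1, "a")])
    table.create_columnstore_index()
    daily_refresh(table, [(2, "b")])
    print(table.columnstore)  # [[1, 2], ['a', 'b']]
```

Note that the rebuild pivots every row again, not just the new ones, which is why the article calls the refresh expensive.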
The use of columnstore indexes can also result in performance degradation. If a query works with most or all of the columns, recombining the rows can be quite expensive. This means OLTP-style queries should be avoided in favor of OLAP-style queries. In other words, if you find yourself writing “SELECT *” or pulling back one row at a time, then columnstore isn’t appropriate for you.
Column-base storage, create only, no insert/update/delete? I'm thinking lite edition of Sybase IQ.
But it is certainly a great value-add for MSSQL anyway.
Microsoft in 2011 will be able to do what Google could do in 1997!
Too bad Google didn't sell their engine to be used for any purpose to the public...
Hadoop, and there's plenty others.