InfoQ

News

Microsoft Claims to Hold the ETL Record at 1 TB in 30 Minutes

Posted by Jonathan Allen on Feb 28, 2008 12:53 AM

Community
.NET
Topics
Data Warehousing ,
Data Access ,
Performance & Scalability
Tags
SQL Server Integration Services ,
SQL Server 2008

Microsoft and Unisys are claiming that they hold the record for loading information into a relational database. The unofficial benchmark was 1 TB of TPC-H data moved in under 30 minutes using an Extract, Transform, and Load (ETL) tool. The previous record for that volume was 45 minutes and was held by Informatica.

The specific claim is,

More than one terabyte of data was parsed from flat files, transferred over the network and loaded into the destination database in less than 30 minutes, a world record beating all previously published results using an ETL tool. That is a rate in excess of 2 TB per hour (650+ MB/second). To be precise, 1.18TB of flat file data was loaded in 1794 seconds. This is equivalent to 1.00TB in 25 minutes 20 seconds or 2.36TB per hour.

The ETL benchmark uses TPC-H data but is not an official benchmark of the Transaction Processing Performance Council. But this has not stopped companies such as Informatica from bragging about their performance. Microsoft says that ETL benchmarks are important because they represent common real world scenarios.

It is rare in businesses today that data is always available on the destination system, and does not need to be standardized or corrected for errors before loading. These rare cases are the times that bulk loading data makes sense. Data integration can involve complex transformation rules, error checking and data standardization techniques. ETL tools like SSIS can perform these functions such as moving data between systems, reformatting data, integrity checking, key lookups, tracking lineage, and more. SSIS has proven itself to be a versatile ETL tool, and now it is shown to be the fastest one as well.

The hardware to perform this feat was certainly non-standard and out of the reach of most companies.

The database server ran on a Unisys ES7000/one Enterprise Server , with 32 socket dual core Intel® XeonTM 3.4 Ghz (7140M) processors , 256 GB RAM and 8 dual port 4Gbit HBA’s . The SQL Server data was stored on an EMC Clariion CX3-80 SAN with 165 (146 GB/15 krpm) spindles. The database server ran a pre-release build of SQL Server 2008 Enterprise Edition (V10.0.1300.4, built just before the “February 2008 CTP”) on the Windows Server 2008 x64 Datacenter Edition operating system.

Four servers acted as data sources, modeling the fact that data comes from a variety of systems in a modern enterprise. Each source server ran SSIS packages that sent data across the network to the database server. The source servers ran SSIS from SQL Server build V10.0.1300.4, on the Windows Server 2008 operating system. Source data came from flat files, as it was generated by DBGEN.

For the source servers, 4 Unisys ES3220L servers with Windows2008 x64 Enterprise Edition were used. Each server is equipped with 2 x 2.0GHz quad core Intel® processors, 4GB RAM, a dual port 4Gbit Emulex HBA and Intel PRO1000/PT network card. The source data was read from 2 x EMC Clariion CX600 SAN’s with 45 spindles each.

The white paper on the benchmark has not been released yet.

No comments

Reply

Exclusive Content

Ruby.rewrite(Ruby)

In this RubyFringe talk, Reginald Braithwaite writes Ruby code to read, write, and rewrite Ruby. Demos include extending Ruby with conditional expressions, call-by-name and more.

Book Except and Interview : Aptana RadRails, An IDE for Rails Development

Aptana RadRails: An IDE for Rails Development by Javier Ramírez discusses the latest Aptana RadRails IDE, a development environment for creating Ruby on Rails applications.

Fast Bytecodes for Funny Languages

Cliff Click discusses how to optimize generated bytecode for running on the JVM. Click analyzes and reports on several JVM languages and shows several places where they could increase performance.

Scott Ambler On Agile’s Present and Future

Scott Ambler, Practice Lead for Agile Development at IBM, speaks on the current status of the Agile community and practices having a look at the perspective of the Agile’s future.

Manager's Introduction to Test-Driven Development

Dave Nicolette and Karl Scotland try to introduce non-technical managers to one of the most popular Agile development techniques: Test-Driven Development (TDD).

Structured Event Streaming with Smooks

Smooks is best known for its transformation capabilities, but in this article Tom Fennelly describes how you can also use it for structured event streaming.

How to Work With Business Leaders to Manage Architectural Change

Successful architectures evolve over time to meet changing business requirements. Luke Hohmann presents how to collaborate with key members of your business to manage architectural changes.

Colors and the UI

In this article, Dr. Tobias Komischke explains how colors used in a GUI can influence our interaction with a computer and offers advice on using the appropriate colors for the interface.