Google Scalability Session Report

| by Stefan Tilkov on Jun 25, 2007. Estimated reading time: 1 minute |

In a blog post, Microsoft’s Dare Obasanjo shared his notes on a session given by Google’s Jeff Dean at the Google Seattle Conference on Scalability, “MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets”. According to Dare, the talk covered the three main elements of Google’s massively scalable architecture: GFS (the Google File System); MapReduce, an infrastructure for processing large datasets in parallel; and BigTable, Google’s distributed store for structured data.

The report contains some fascinating details about Google’s infrastructure. About GFS:

There are currently over 200 GFS clusters at Google, some of which have over 5000 machines. They now have pools of tens of thousands of machines retrieving data from GFS clusters that run as large as 5 petabytes of storage with read/write throughput of over 40 gigabytes/second across the cluster.

On MapReduce:

A developer only has to write their specific map and reduce operations for their data sets which could run as low as 25 - 50 lines of code while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce I/O bandwidth consumed, providing system monitoring and making the service scalable across hundreds to thousands of machines.
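To make the division of labor in that description concrete, here is a minimal single-process sketch of the programming model, using word counting as the example job. It is not Google's API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the tiny driver only stands in for the real infrastructure, which would also handle distribution, machine failures, and data locality.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """User-written map step: emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> int:
    """User-written reduce step: add up all counts emitted for one word."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical stand-in for the framework: map, group by key, reduce.

    Everything runs in one process here; a real system would spread the
    map and reduce calls across many machines.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the "shuffle" step
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog and the fox",
}
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

Only the two small functions at the top are specific to the job; everything a developer would otherwise have to build around them sits behind the driver.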

Concerning BigTable:

BigTable is not a relational database. It does not support joins nor does it support rich SQL-like queries. Instead it is more like a multi-level map data structure. It is a large scale, fault tolerant, self managing system with terabytes of memory and petabytes of storage space which can handle millions of reads/writes per second. BigTable is now used by over sixty Google products and projects as the platform for storing and retrieving structured data.
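The "multi-level map" idea can be illustrated with nested dictionaries. The sketch below assumes the row, column, timestamp layering described in Google's published BigTable material; the class `TinyTable`, its methods, and the sample keys are hypothetical, and it ignores everything that makes the real system hard (distribution, persistence, fault tolerance).

```python
from collections import defaultdict
from typing import Dict, Optional


class TinyTable:
    """Toy in-memory model of a multi-level map: row -> column -> timestamp -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[int, str]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        """Store one version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row: str, column: str) -> Optional[str]:
        """Return the most recent version of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]


table = TinyTable()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 2, "<html>v2</html>")
latest = table.get("com.example/index", "contents:html")
print(latest)
```

Lookups go by key at each level, which is why there is no place in this model for joins or ad hoc SQL-style queries.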

For those who want to try these ideas out on their own, the Apache Lucene Hadoop subproject might be a good start: it contains an implementation of MapReduce as well as HDFS, a GFS-like distributed file system.
