Google Scalability Session Report

In a blog post, Microsoft’s Dare Obasanjo shared his notes on “MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets”, a session given by Google’s Jeff Dean at the Google Seattle Conference on Scalability. According to Dare, the talk covered the three main elements of Google’s massively scalable architecture: GFS (the Google File System); MapReduce, an infrastructure for processing large datasets in parallel; and BigTable, Google’s distributed store for structured data.

The report contains some fascinating details about Google’s infrastructure. About GFS:

There are currently over 200 GFS clusters at Google, some of which have over 5000 machines. They now have pools of tens of thousands of machines retrieving data from GFS clusters that run as large as 5 petabytes of storage with read/write throughput of over 40 gigabytes/second across the cluster.

On MapReduce:

A developer only has to write their specific map and reduce operations for their data sets which could run as low as 25 - 50 lines of code while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce I/O bandwidth consumed, providing system monitoring and making the service scalable across hundreds to thousands of machines.
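To make the division of labour concrete, here is a minimal single-process sketch in Python of the programming model described above. It is not Google's API: the names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, and the tiny driver only stands in for the infrastructure that, in the real system, handles distribution, failures and data locality.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step (user code): emit a (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step (user code): combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Toy stand-in for the framework: run the mapper, group by key, run the reducer.

    A real deployment would spread these phases across many machines;
    this version runs everything in one process for illustration only.
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: (document id, document text) pairs.
docs = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
```

The two user-written functions come to only a handful of lines, which is in the spirit of the "25 - 50 lines" figure quoted above; everything else would belong to the shared infrastructure.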

Concerning BigTable:

BigTable is not a relational database. It does not support joins nor does it support rich SQL-like queries. Instead it is more like a multi-level map data structure. It is a large scale, fault tolerant, self managing system with terabytes of memory and petabytes of storage space which can handle millions of reads/writes per second. BigTable is now used by over sixty Google products and projects as the platform for storing and retrieving structured data.
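The "multi-level map" idea can be sketched with nested dictionaries. The example below is only an illustration in Python, not BigTable's interface: the class `TinyTable` and its methods are hypothetical, and the choice of levels (row key, then column, then timestamp) is an assumption taken from Google's published description of BigTable rather than from the session notes.

```python
from collections import defaultdict


class TinyTable:
    """In-memory toy: row key -> column name -> timestamp -> value."""

    def __init__(self):
        # Two nested defaultdicts give the row and column levels;
        # the innermost plain dict maps timestamps to values.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store one versioned cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent value for a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]


# Hypothetical row and column names, for illustration only.
table = TinyTable()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 2, "<html>v2</html>")

latest = table.get("com.example/index", "contents:html")
missing = table.get("no-such-row", "contents:html")
```

Lookups go by key at each level, which is why there are no joins or SQL-like queries in this model: any richer query has to be built by the application on top of key-based reads and writes.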

For those who want to try these ideas out on their own, the Apache Lucene Hadoop subproject might be a good place to start: it contains an implementation of MapReduce as well as HDFS, a GFS-like distributed file system.
