InfoQ

News

Run Your Own Google Style Computing Cluster with Hadoop and Amazon EC2

Posted by Scott Delap on Nov 10, 2006 09:01 AM

Community: Java
Topics: Grid Computing, Clustering & Caching
Tags: Amazon, EC2, MapReduce, Hadoop

Clustered grid computing software does not simply happen; efficient architectures must be designed. One of the core technologies used by Google is the MapReduce programming model, which supports processing and generating large data sets. By defining a scalable program structure up front, MapReduce lets algorithms scale easily across machines. As Google describes its implementation:

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Doug Cutting, the creator of Lucene and now an employee of Yahoo, has been working on Hadoop, an open source implementation of MapReduce written in Java that also includes a distributed file system. Hadoop has already been tested on clusters of up to 600 nodes. The project describes the framework as follows:

Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
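The division of work described above can be illustrated with a minimal single-process sketch in plain Java. It does not use Hadoop's API; the class name `WordCountSketch` and its methods are hypothetical names chosen for this example. Each call to `map` handles one fragment of input independently, which is the property that would let a framework execute or re-execute a fragment on any node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of the map/reduce paradigm; it does not use Hadoop's API. */
class WordCountSketch {

    /** Map phase: turn one fragment of input (a line) into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: combine all values emitted for one key into a single result. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Driver: map every line, group the pairs by key, then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog", "the fox");
        System.out.println(run(lines)); // {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

In this sketch the driver runs everything in one loop on one machine. In a real cluster the framework would take over that role, distributing the `map` and `reduce` calls across nodes and rerunning any that fail.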

Amazon recently released its Elastic Compute Cloud (EC2), which lets developers acquire computing power at the rate of $0.10 per hour consumed. Recently, work has been done to allow Hadoop to run on EC2. This combination will allow developers to write scalable algorithms, bring up large numbers of servers for computing power, and then shut them down when they are no longer needed.
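To give a feel for the pricing, here is a rough cost sketch in Java. The $0.10 hourly rate comes from the article. The class name `Ec2CostSketch`, the method `costInCents`, and the rule that partial hours are rounded up per instance are assumptions made for illustration, not a statement of Amazon's billing terms.

```java
/** Back-of-the-envelope cost estimate for a temporary cluster. */
class Ec2CostSketch {

    /** $0.10 per hour consumed, as quoted in the article, expressed in cents. */
    static final long CENTS_PER_INSTANCE_HOUR = 10;

    /**
     * Assumption: each instance is billed in whole hours, with partial hours rounded up.
     */
    static long costInCents(int instances, double hoursPerInstance) {
        long billedHours = (long) Math.ceil(hoursPerInstance);
        return instances * billedHours * CENTS_PER_INSTANCE_HOUR;
    }

    public static void main(String[] args) {
        long cents = costInCents(20, 2.5);
        // 20 instances x 3 billed hours x 10 cents = 600 cents
        System.out.printf("20 instances for 2.5 hours: $%d.%02d%n", cents / 100, cents % 100);
    }
}
```

Under these assumptions, a 20-node cluster that runs a job for two and a half hours and is then shut down would cost about $6.00.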

typo by anjan bacchu Posted Nov 10, 2006 6:22 PM

    "Recently work as been done to " you mean : "Recently work has been done to " ? BR, ~A
