InfoQ

News

Open Source Google-Like Infrastructure Project Hadoop Gains Momentum

Posted by Scott Delap on Aug 10, 2007 12:09 PM

Community
Java
Topics
Clustering & Caching,
Grid Computing
Tags
EC2,
Hadoop
While it has been in existence for over a year, open source Google-like infrastructure project Hadoop is just now receiving wider noticed by the development community. From the Hadoop website:
Hadoop is a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...

Hadoop, which is written using Java, has been backed extensively by Yahoo in recent years with project lead Doug Cutting being hired by Yahoo to work full time on the project in January of 2006. The University of Washington has when so far as to run a course on distributed computing using Hadoop as a base. The course material has been posted on Google Code for other developers interested in the technology.

Recently Yahoo's Jeremy Zawodny provided a status update of Hadoop:

For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges ... The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy ... To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project...

Zawodny goes on to provide data sort benchmarks over the last year. In the tests each node sorts the same amount of input data. So as an example of ratios, 20 nodes might sort 100 records each for a total of 2000 records while 100 nodes would sort 100 records each for a total of 10000. Recent benchmarks are as follows:

Date:   Nodes Hours
April 2006 188 47.9
May 2006 500 42.0
December 2006 20 1.8
December 2006 100 3.3
December 2006 500 5.2
December 2006 900 7.8
July 2007 20 1.2
July 2007 100 1.3
July 2007 500 2.0
July 2007 900 2.5

Tim O'Reilly picked up on Zawodny's post and confirm that support comes from the top of the Yahoo organization:

...Yahoo! had hired Doug Cutting, the creator of hadoop, back in January. But Doug's talk at Oscon was kind of a coming out party for Hadoop, and Yahoo! wanted to make clear just how important they think the project is. In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top...

...why is Yahoo!'s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor ... Supporting Hadoop and other Apache projects not only gets Yahoo! deeply involved in open source software projects they can use, it helps give them renewed "geek cred." ... Second, and perhaps equally important, Yahoo! gives hadoop an opportunity to be tested out at scale...

Blogger John Munsch has summed up the Yahoo involement by saying "Hadoop And The Opposite Of The Not-Invented-Here Syndrome".

Microsoft's Sriram Krishnan shifts the discussion considering the problem facing startups and developers with the industry moving to such large scale solutions and evolving solutions like Hadoop and Amazon EC2:

...Most of the value in Web 2.0 comes from data (generated by lots of users). E.g - del.ico.us, Digg, Facebook ... It is beyond the financial means of any one person to run large scale server software E.g Gmail, Google Search, Live, Y! Search ... Your typical long-haired geek probably can't get his hands on [# Large scale blob storage (S3, Google File System), Large scale structured storage ( Google's Bigtable), Tools to run code across such infrastructure (MapReduce, Dryad).]...I'm not sure how far along Doug Cutting's open source equivalents have come along for these - so that might be an answer...

9 comments

Reply

Hadoop is kewl. Might wanna try this as well. by ARI ZILKA Posted Aug 10, 2007 1:45 PM
Parallel Grid Agents by Cameron Purdy Posted Aug 10, 2007 11:00 PM
Hadoop - good for distributed files - what about distributed objects? by Nati Shalom Posted Aug 11, 2007 2:11 PM
Erlang? by Thamaraiselvan Poomalai Posted Aug 13, 2007 1:55 AM
Re: Erlang? by Cameron Purdy Posted Aug 13, 2007 9:59 AM
Re: Erlang? by Sven Heyll Posted Sep 17, 2007 7:32 PM
What to do..... by Dan Creswell Posted Aug 13, 2007 3:30 PM
Re: What to do..... by ARI ZILKA Posted Aug 14, 2007 10:52 PM
Re: What to do..... by Cameron Purdy Posted Sep 6, 2007 1:45 AM
  1. FWIW, Terracotta users enjoy hadoop-style distribution of workload using the WorkManager pattern and java.util.concurrent without too much more. I looked at some of the stuff the Hadoop guys @ Yahoo showed me and the interface is definitely elegant. Still, TC might be worth looking at because it seemed a tad easier to me to ramp onto the concept of work queues, units of work, and push/pop as my interface between masters and workers: http://www.terracotta.org/confluence/display/labs/WorkManager And to read more about it... http://jonasboner.com/2007/01/29/how-to-build-a-pojo-based-data-grid-using-open-terracotta/

  2. Back to top

    Parallel Grid Agents

    Aug 10, 2007 11:00 PM by Cameron Purdy

    .. a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...
    How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence? http://wiki.tangosol.com/display/COH32UG/Provide+a+Data+Grid Peace, Cameron Purdy Oracle Coherence: The Java Data Grid

  3. From what i could tell Hadoop is designed primarily to process distributed files. It is designed for the following use-cases:

    distributed grep distributed sort web link-graph reversal term-vector per host web access log stats inverted index construction document clustering machine learning statistical machine translation
    The core principles behind hadoop are common to other patterns used in parallel computing since the 80's such as Black Board System and TupleSpaces/JavaSpaces While Hadoop is primarily designed for processing of distributed files the later was primarily designed to do the same for distributed objects. One of the key aspect behind TupleSpaces/JavaSpaces model is the fact that it is transactional and therefore can handle partial failure scenario. It also guaranties once and only once execution of a task. You can also use it in synchronous and asynchronous model. A significantly simplified abstraction that utilizes the power behind the JavaSpaces/Master-Worker pattern is provided through our openspaces/spring framework as part of a work that was done collaboratively with interface21. This abstraction makes parallel execution as simple as a remote call to a remote POJO Service. The details on how transaction is handled, whether execution will be synchronous or asynchronous and in addition to that how to control the association between the execution and the location of the data through affinity key is done behind the scenes in a declarative manner - keeping your code pretty much unaware of all those details. You can find example and more details on OpenSpaces and how it utilizes the SpringRemoting abstraction here Nati S GigaSpaces Write Once Scale Anywhere

  4. Back to top

    Erlang?

    Aug 13, 2007 1:55 AM by Thamaraiselvan Poomalai

    Hey, How about erlang?. Erlang just fits well for aforementioned discussion points. As distribution, concurrency built into erlang language, it can accomplish more in less time than any other language implementation.

  5. Back to top

    Re: Erlang?

    Aug 13, 2007 9:59 AM by Cameron Purdy

    Hey, How about erlang?. Erlang just fits well for aforementioned discussion points. As distribution, concurrency built into erlang language, it can accomplish more in less time than any other language implementation.
    It depends on the problem. Erlang is a functional language, so it can be incredible effective for a certain class of problems. It has not proven as effective in the "general purpose language" field, which is still dominated by imperative languages (e.g. Java). Peace, Cameron Purdy Oracle Coherence: The Java Data Grid

  6. Back to top

    What to do.....

    Aug 13, 2007 3:30 PM by Dan Creswell

    Cameron asks: "How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?" Nati says: "While Hadoop is primarily designed for processing of distributed files......." So I think overall whichever way you go, you end up with roughly the same patterns and effect. The difference is likely for the most part in the level of granularity, overall scale, cost vs performance and programming model areas. One element of the tradeoff would be: Large amounts of data that can’t be easily held entirely in memory (kind of implying you’ve got it all in files on disk) won’t favour Coherence or GigaSpaces. By contrast lots of small bits of data that can all be fitted into memory doesn’t favour MapReduce. So it seems to me to come down to cost of computation versus cost of I/O and whether you can avoid I/O (i.e. moving data to node). And avoiding I/O might come down to whether or not you can afford enough memory for all your data. If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too). If you can't cram it all in memory well Hadoop might well be your best friend.... As an aside, MapReduce also handles failures but not with transactions, rather it throws away a chunk of results and redoes them (basically, it’s using checkpointing and restarts). Horses for courses.....

  7. Back to top

    Re: What to do.....

    Aug 14, 2007 10:52 PM by ARI ZILKA

    If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too).
    FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk. Very sound advice in terms of building a test that ACCURATELY depicts your use case (which is harder than most people realize) and picking which works best for you. Each has its design targets and associated sweet spot use cases. Enjoy, --Ari

  8. Back to top

    Re: What to do.....

    Sep 6, 2007 1:45 AM by Cameron Purdy

    If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too).
    FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk.
    Although completely different in terms of what it's solving, Coherence has the same disk capabilities. However, benchmarks with data on disk (i.e. enough data that it's not going to fit in the OS cache) are obviously much worse than benchmarks with data all in memory. If you need efficient disk data access, you're may need to choose a much different architectural model than a data grid, for example. Peace, Cameron Purdy Oracle Coherence: The Java Data Grid

  9. Back to top

    Re: Erlang?

    Sep 17, 2007 7:32 PM by Sven Heyll

    I am actually curious what these problems are, that erlang does not perform as good as java does. Also what are the problems of "functional programming"?

Exclusive Content

Rob Windsor on WCF with REST, JSON and RSS

WCF is not just for SOAP based services and can be used with popular protocols like RSS, REST and JSON. Join Rob Windsor as he introduces WCF 3.5 and its new native support for non-SOAP services.

Christophe Coenraets Discusses Flex 3, AIR, and BlazeDS

Christophe Coenraets discusses Flex 3, Flex Builder, AIR, BlazeDS, Adobe and open source, integrating Flex with existing applications, and integrating RIAs with search engines and browsers.

Debunking Common Refactoring Misconceptions

Danijel Arsenovski attempts to dispel some of the myths around refactoring and how it applies to .NET developers.

REST Eye for the SOA Guy

In this presentation, recorded at QCon San Francisco, CORBA guru Steve Vinoski explains REST from the view of someone who comes to SOA from a traditional, RPC-oriented background.

Choose Feature Teams over Component Teams for Agility

Feature teams are key to scaling agility for large teams. In an excerpt from "Scaling Lean and Agile Development," Larman & Vodde show how feature teams resolve traditional problems & raise new issues

Billy Newport explains Virtualization

Billy Newport talks about virtualization, eXtreme Transaction Processing (XTP) and WebSphere Virtual Enterprise. He discusses hardware, hypervisor, JVM, application and data virtualization.

Virtualization and Security

While virtualization provides many benefits, security can not be a forgotten concept in its application.

Introduction to Agile for Traditional Project Managers

This session is specifically aimed at traditionally trained project managers who are new to Agile, and who would like to be able to relate the PMI's best practices to their Agile equivalents.