InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

Open Source Google-Like Infrastructure Project Hadoop Gains Momentum

Posted by Scott Delap on Aug 10, 2007

Sections
Operations & Infrastructure,
Enterprise Architecture,
Development,
Architecture & Design
Topics
Java ,
Clustering & Caching ,
Grid Computing
Tags
Hadoop ,
EC2
While it has been in existence for over a year, open source Google-like infrastructure project Hadoop is just now receiving wider noticed by the development community. From the Hadoop website:
Hadoop is a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...

Hadoop, which is written using Java, has been backed extensively by Yahoo in recent years with project lead Doug Cutting being hired by Yahoo to work full time on the project in January of 2006. The University of Washington has when so far as to run a course on distributed computing using Hadoop as a base. The course material has been posted on Google Code for other developers interested in the technology.

Recently Yahoo's Jeremy Zawodny provided a status update of Hadoop:

For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges ... The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy ... To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project...

Zawodny goes on to provide data sort benchmarks over the last year. In the tests each node sorts the same amount of input data. So as an example of ratios, 20 nodes might sort 100 records each for a total of 2000 records while 100 nodes would sort 100 records each for a total of 10000. Recent benchmarks are as follows:

Date:   Nodes Hours
April 2006 188 47.9
May 2006 500 42.0
December 2006 20 1.8
December 2006 100 3.3
December 2006 500 5.2
December 2006 900 7.8
July 2007 20 1.2
July 2007 100 1.3
July 2007 500 2.0
July 2007 900 2.5

Tim O'Reilly picked up on Zawodny's post and confirm that support comes from the top of the Yahoo organization:

...Yahoo! had hired Doug Cutting, the creator of hadoop, back in January. But Doug's talk at Oscon was kind of a coming out party for Hadoop, and Yahoo! wanted to make clear just how important they think the project is. In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top...

...why is Yahoo!'s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor ... Supporting Hadoop and other Apache projects not only gets Yahoo! deeply involved in open source software projects they can use, it helps give them renewed "geek cred." ... Second, and perhaps equally important, Yahoo! gives hadoop an opportunity to be tested out at scale...

Blogger John Munsch has summed up the Yahoo involement by saying "Hadoop And The Opposite Of The Not-Invented-Here Syndrome".

Microsoft's Sriram Krishnan shifts the discussion considering the problem facing startups and developers with the industry moving to such large scale solutions and evolving solutions like Hadoop and Amazon EC2:

...Most of the value in Web 2.0 comes from data (generated by lots of users). E.g - del.ico.us, Digg, Facebook ... It is beyond the financial means of any one person to run large scale server software E.g Gmail, Google Search, Live, Y! Search ... Your typical long-haired geek probably can't get his hands on [# Large scale blob storage (S3, Google File System), Large scale structured storage ( Google's Bigtable), Tools to run code across such infrastructure (MapReduce, Dryad).]...I'm not sure how far along Doug Cutting's open source equivalents have come along for these - so that might be an answer...

Hadoop is kewl. Might wanna try this as well. by ARI ZILKA Posted
Parallel Grid Agents by Cameron Purdy Posted
Hadoop - good for distributed files - what about distributed objects? by Nati Shalom Posted
Erlang? by Thamaraiselvan Poomalai Posted
Re: Erlang? by Cameron Purdy Posted
Re: Erlang? by Sven Heyll Posted
What to do..... by Dan Creswell Posted
Re: What to do..... by ARI ZILKA Posted
Re: What to do..... by Cameron Purdy Posted
  1. Back to top

    Hadoop is kewl. Might wanna try this as well.

    by ARI ZILKA

    FWIW, Terracotta users enjoy hadoop-style distribution of workload using the WorkManager pattern and java.util.concurrent without too much more. I looked at some of the stuff the Hadoop guys @ Yahoo showed me and the interface is definitely elegant. Still, TC might be worth looking at because it seemed a tad easier to me to ramp onto the concept of work queues, units of work, and push/pop as my interface between masters and workers:

    www.terracotta.org/confluence/display/labs/Work...

    And to read more about it...
    jonasboner.com/2007/01/29/how-to-build-a-pojo-b...

  2. Back to top

    Parallel Grid Agents

    by Cameron Purdy

    .. a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...


    How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?

    wiki.tangosol.com/display/COH32UG/Provide+a+Dat...

    Peace,

    Cameron Purdy
    Oracle Coherence: The Java Data Grid

  3. Back to top

    Hadoop - good for distributed files - what about distributed objects?

    by Nati Shalom

    From what i could tell Hadoop is designed primarily to process distributed files. It is designed for the following use-cases:


    distributed grep distributed sort web link-graph reversal
    term-vector per host web access log stats inverted index construction
    document clustering machine learning statistical machine translation


    The core principles behind hadoop are common to other patterns used in parallel computing since the 80's such as Black Board System and TupleSpaces/JavaSpaces

    While Hadoop is primarily designed for processing of distributed files the later was primarily designed to do the same for distributed objects. One of the key aspect behind TupleSpaces/JavaSpaces model is the fact that it is transactional and therefore can handle partial failure scenario. It also guaranties once and only once execution of a task. You can also use it in synchronous and asynchronous model.

    A significantly simplified abstraction that utilizes the power behind the JavaSpaces/Master-Worker pattern is provided through our openspaces/spring framework as part of a work that was done collaboratively with interface21.

    This abstraction makes parallel execution as simple as a remote call to a remote POJO Service. The details on how transaction is handled, whether execution will be synchronous or asynchronous and in addition to that how to control the association between the execution and the location of the data through affinity key is done behind the scenes in a declarative manner - keeping your code pretty much unaware of all those details.

    You can find example and more details on OpenSpaces and how it utilizes the SpringRemoting abstraction here


    Nati S
    GigaSpaces
    Write Once Scale Anywhere

  4. Back to top

    Erlang?

    by Thamaraiselvan Poomalai

    Hey, How about erlang?. Erlang just fits well for aforementioned discussion points. As distribution, concurrency built into erlang language, it can accomplish more in less time than any other language implementation.

  5. Back to top

    Re: Erlang?

    by Cameron Purdy

    Hey, How about erlang?. Erlang just fits well for aforementioned discussion points. As distribution, concurrency built into erlang language, it can accomplish more in less time than any other language implementation.


    It depends on the problem. Erlang is a functional language, so it can be incredible effective for a certain class of problems. It has not proven as effective in the "general purpose language" field, which is still dominated by imperative languages (e.g. Java).

    Peace,

    Cameron Purdy
    Oracle Coherence: The Java Data Grid

  6. Back to top

    What to do.....

    by Dan Creswell

    Cameron asks:

    "How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?"

    Nati says:

    "While Hadoop is primarily designed for processing of distributed files......."

    So I think overall whichever way you go, you end up with roughly the same patterns and effect. The difference is likely for the most part in the level of granularity, overall scale, cost vs performance and programming model areas.

    One element of the tradeoff would be:

    Large amounts of data that can’t be easily held entirely in memory (kind of implying you’ve got it all in files on disk) won’t favour Coherence or GigaSpaces. By contrast lots of small bits of data that can all be fitted into memory doesn’t favour MapReduce.

    So it seems to me to come down to cost of computation versus cost of I/O and whether you can avoid I/O (i.e. moving data to node). And avoiding I/O might come down to whether or not you can afford enough memory for all your data.

    If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too). If you can't cram it all in memory well Hadoop might well be your best friend....

    As an aside, MapReduce also handles failures but not with transactions, rather it throws away a chunk of results and redoes them (basically, it’s using checkpointing and restarts).

    Horses for courses.....

  7. Back to top

    Re: What to do.....

    by ARI ZILKA


    If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too).


    FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk.

    Very sound advice in terms of building a test that ACCURATELY depicts your use case (which is harder than most people realize) and picking which works best for you. Each has its design targets and associated sweet spot use cases.

    Enjoy,

    --Ari

  8. Back to top

    Re: What to do.....

    by Cameron Purdy


    If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too).


    FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk.


    Although completely different in terms of what it's solving, Coherence has the same disk capabilities. However, benchmarks with data on disk (i.e. enough data that it's not going to fit in the OS cache) are obviously much worse than benchmarks with data all in memory. If you need efficient disk data access, you're may need to choose a much different architectural model than a data grid, for example.

    Peace,

    Cameron Purdy
    Oracle Coherence: The Java Data Grid

  9. Back to top

    Re: Erlang?

    by Sven Heyll

    I am actually curious what these problems are, that erlang does not perform as good as java does. Also what are the problems of "functional programming"?

Educational Content

New-age Transactional Systems - Not Your Grandpa's OLTP

John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.

Cool Code

Kevlin Henney examines code samples to see what can be learned from them starting from the premise that one won’t write great code unless he knows how to read it.

Collaboration: At the Extremities of Extreme

Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.

Yesod Web Framework

Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).

Transactions without Transactions

Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.

Attila Szegedi on JVM and GC Performance Tuning at Twitter

Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.

10 tips on how to prevent business value risk

One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.

Interview: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives

InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.