BT

Open Source Google-Like Infrastructure Project Hadoop Gains Momentum

by Scott Delap on Aug 10, 2007 |
While it has been in existence for over a year, open source Google-like infrastructure project Hadoop is just now receiving wider noticed by the development community. From the Hadoop website:
Hadoop is a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...

Hadoop, which is written using Java, has been backed extensively by Yahoo in recent years with project lead Doug Cutting being hired by Yahoo to work full time on the project in January of 2006. The University of Washington has when so far as to run a course on distributed computing using Hadoop as a base. The course material has been posted on Google Code for other developers interested in the technology.

Recently Yahoo's Jeremy Zawodny provided a status update of Hadoop:

For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges ... The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy ... To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project...

Zawodny goes on to provide data sort benchmarks over the last year. In the tests each node sorts the same amount of input data. So as an example of ratios, 20 nodes might sort 100 records each for a total of 2000 records while 100 nodes would sort 100 records each for a total of 10000. Recent benchmarks are as follows:

Date:   Nodes Hours
April 2006 188 47.9
May 2006 500 42.0
December 2006 20 1.8
December 2006 100 3.3
December 2006 500 5.2
December 2006 900 7.8
July 2007 20 1.2
July 2007 100 1.3
July 2007 500 2.0
July 2007 900 2.5

Tim O'Reilly picked up on Zawodny's post and confirm that support comes from the top of the Yahoo organization:

...Yahoo! had hired Doug Cutting, the creator of hadoop, back in January. But Doug's talk at Oscon was kind of a coming out party for Hadoop, and Yahoo! wanted to make clear just how important they think the project is. In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top...

...why is Yahoo!'s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor ... Supporting Hadoop and other Apache projects not only gets Yahoo! deeply involved in open source software projects they can use, it helps give them renewed "geek cred." ... Second, and perhaps equally important, Yahoo! gives hadoop an opportunity to be tested out at scale...

Blogger John Munsch has summed up the Yahoo involement by saying "Hadoop And The Opposite Of The Not-Invented-Here Syndrome".

Microsoft's Sriram Krishnan shifts the discussion considering the problem facing startups and developers with the industry moving to such large scale solutions and evolving solutions like Hadoop and Amazon EC2:

...Most of the value in Web 2.0 comes from data (generated by lots of users). E.g - del.ico.us, Digg, Facebook ... It is beyond the financial means of any one person to run large scale server software E.g Gmail, Google Search, Live, Y! Search ... Your typical long-haired geek probably can't get his hands on [# Large scale blob storage (S3, Google File System), Large scale structured storage ( Google's Bigtable), Tools to run code across such infrastructure (MapReduce, Dryad).]...I'm not sure how far along Doug Cutting's open source equivalents have come along for these - so that might be an answer...

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Hadoop is kewl. Might wanna try this as well. by ARI ZILKA

FWIW, Terracotta users enjoy hadoop-style distribution of workload using the WorkManager pattern and java.util.concurrent without too much more. I looked at some of the stuff the Hadoop guys @ Yahoo showed me and the interface is definitely elegant. Still, TC might be worth looking at because it seemed a tad easier to me to ramp onto the concept of work queues, units of work, and push/pop as my interface between masters and workers:

www.terracotta.org/confluence/display/labs/Work...

And to read more about it...
jonasboner.com/2007/01/29/how-to-build-a-pojo-b...

Parallel Grid Agents by Cameron Purdy

.. a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...


How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?

wiki.tangosol.com/display/COH32UG/Provide+a+Dat...

Peace,

Cameron Purdy
Oracle Coherence: The Java Data Grid

Hadoop - good for distributed files - what about distributed objects? by Nati Shalom

From what i could tell Hadoop is designed primarily to process distributed files. It is designed for the following use-cases:


distributed grep distributed sort web link-graph reversal
term-vector per host web access log stats inverted index construction
document clustering machine learning statistical machine translation


The core principles behind hadoop are common to other patterns used in parallel computing since the 80's such as Black Board System and TupleSpaces/JavaSpaces

While Hadoop is primarily designed for processing of distributed files the later was primarily designed to do the same for distributed objects. One of the key aspect behind TupleSpaces/JavaSpaces model is the fact that it is transactional and therefore can handle partial failure scenario. It also guaranties once and only once execution of a task. You can also use it in synchronous and asynchronous model.

A significantly simplified abstraction that utilizes the power behind the JavaSpaces/Master-Worker pattern is provided through our openspaces/spring framework as part of a work that was done collaboratively with interface21.

This abstraction makes parallel execution as simple as a remote call to a remote POJO Service. The details on how transaction is handled, whether execution will be synchronous or asynchronous and in addition to that how to control the association between the execution and the location of the data through affinity key is done behind the scenes in a declarative manner - keeping your code pretty much unaware of all those details.

You can find example and more details on OpenSpaces and how it utilizes the SpringRemoting abstraction here


Nati S
GigaSpaces
Write Once Scale Anywhere

Erlang? by Thamaraiselvan Poomalai

Hey, How about erlang?. Erlang just fits well for aforementioned discussion points. As distribution, concurrency built into erlang language, it can accomplish more in less time than any other language implementation.

Re: Erlang? by Cameron Purdy

Hey, How about erlang?. Erlang just fits well for aforementioned discussion points. As distribution, concurrency built into erlang language, it can accomplish more in less time than any other language implementation.


It depends on the problem. Erlang is a functional language, so it can be incredible effective for a certain class of problems. It has not proven as effective in the "general purpose language" field, which is still dominated by imperative languages (e.g. Java).

Peace,

Cameron Purdy
Oracle Coherence: The Java Data Grid

What to do..... by Dan Creswell

Cameron asks:

"How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?"

Nati says:

"While Hadoop is primarily designed for processing of distributed files......."

So I think overall whichever way you go, you end up with roughly the same patterns and effect. The difference is likely for the most part in the level of granularity, overall scale, cost vs performance and programming model areas.

One element of the tradeoff would be:

Large amounts of data that can’t be easily held entirely in memory (kind of implying you’ve got it all in files on disk) won’t favour Coherence or GigaSpaces. By contrast lots of small bits of data that can all be fitted into memory doesn’t favour MapReduce.

So it seems to me to come down to cost of computation versus cost of I/O and whether you can avoid I/O (i.e. moving data to node). And avoiding I/O might come down to whether or not you can afford enough memory for all your data.

If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too). If you can't cram it all in memory well Hadoop might well be your best friend....

As an aside, MapReduce also handles failures but not with transactions, rather it throws away a chunk of results and redoes them (basically, it’s using checkpointing and restarts).

Horses for courses.....

Re: What to do..... by ARI ZILKA


If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too).


FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk.

Very sound advice in terms of building a test that ACCURATELY depicts your use case (which is harder than most people realize) and picking which works best for you. Each has its design targets and associated sweet spot use cases.

Enjoy,

--Ari

Re: What to do..... by Cameron Purdy


If you can cram it all in memory and you figure you can keep it thay way you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, factor in things like ease of installation etc too).


FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk.


Although completely different in terms of what it's solving, Coherence has the same disk capabilities. However, benchmarks with data on disk (i.e. enough data that it's not going to fit in the OS cache) are obviously much worse than benchmarks with data all in memory. If you need efficient disk data access, you're may need to choose a much different architectural model than a data grid, for example.

Peace,

Cameron Purdy
Oracle Coherence: The Java Data Grid

Re: Erlang? by Sven Heyll

I am actually curious what these problems are, that erlang does not perform as good as java does. Also what are the problems of "functional programming"?

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

9 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT