Bindings, Platforms, and Innovation
This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.
Tracking change and innovation in the enterprise software development community
Posted by Scott Delap on Aug 10, 2007 12:09 PM
Although it has existed for over a year, the open source, Google-like infrastructure project Hadoop is only now receiving wider notice from the development community. From the Hadoop website: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...
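To make the "many small blocks of work" idea concrete, here is a minimal single-process sketch of the map, group, and reduce steps, written in Python for brevity. It is not Hadoop's API: the names `map_reduce`, `count_words`, and `sum_counts` are hypothetical, and a real cluster would run the map calls on the nodes that hold each data block.

```python
from collections import defaultdict

def map_reduce(blocks, mapper, reducer):
    """Run mapper over each block, group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    # Map phase: each block is an independent unit of work.
    for block in blocks:
        for key, value in mapper(block):
            grouped[key].append(value)  # grouping by key stands in for the shuffle step
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in grouped.items()}

def count_words(block):
    """Hypothetical mapper: emit (word, 1) for every word in a block of text."""
    for word in block.lower().split():
        yield word, 1

def sum_counts(word, counts):
    """Hypothetical reducer: add up the counts emitted for one word."""
    return sum(counts)

# Three small "blocks" stand in for data blocks spread around a cluster.
blocks = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = map_reduce(blocks, count_words, sum_counts)
print(word_counts)
```

Because each mapper call sees only its own block, the map phase can be spread across machines without coordination; only the grouping step needs to move data between them.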
Hadoop, which is written in Java, has been backed extensively by Yahoo in recent years, with project lead Doug Cutting hired by Yahoo in January 2006 to work on the project full time. The University of Washington has gone so far as to run a course on distributed computing that uses Hadoop as its base. The course material has been posted on Google Code for other developers interested in the technology.
Yahoo's Jeremy Zawodny recently provided a status update on Hadoop:
For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges ... The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy ... To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project...
Zawodny goes on to provide data sort benchmarks from the past year. In these tests each node sorts the same amount of input data, so the total grows with the cluster: for example, 20 nodes sorting 100 records each handle 2,000 records in total, while 100 nodes sorting 100 records each handle 10,000. Recent benchmarks are as follows:
| Date | Nodes | Hours |
| April 2006 | 188 | 47.9 |
| May 2006 | 500 | 42.0 |
| December 2006 | 20 | 1.8 |
| December 2006 | 100 | 3.3 |
| December 2006 | 500 | 5.2 |
| December 2006 | 900 | 7.8 |
| July 2007 | 20 | 1.2 |
| July 2007 | 100 | 1.3 |
| July 2007 | 500 | 2.0 |
| July 2007 | 900 | 2.5 |
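One way to read these figures: since every node sorts the same amount of data, a perfectly scaling system would take the same time at every cluster size. The Python sketch below computes the ratio of the smallest cluster's time to each larger cluster's time for the December 2006 and July 2007 runs. The term "weak-scaling efficiency" and the function name are our own labels, not Zawodny's, and the April and May 2006 rows are left out because each has only one cluster size.

```python
# Sort times in hours, keyed by node count, copied from the benchmark table above.
SORT_HOURS = {
    "December 2006": {20: 1.8, 100: 3.3, 500: 5.2, 900: 7.8},
    "July 2007": {20: 1.2, 100: 1.3, 500: 2.0, 900: 2.5},
}

def weak_scaling_efficiency(hours_by_nodes):
    """Return baseline_time / time per node count, using the smallest cluster as baseline.

    1.0 means the larger cluster took no longer than the baseline even though it
    sorted proportionally more data; lower values mean scaling overhead.
    """
    smallest = min(hours_by_nodes)
    baseline_hours = hours_by_nodes[smallest]
    return {
        nodes: baseline_hours / hours
        for nodes, hours in sorted(hours_by_nodes.items())
    }

for run, hours_by_nodes in SORT_HOURS.items():
    for nodes, efficiency in weak_scaling_efficiency(hours_by_nodes).items():
        print(f"{run}: {nodes:>3} nodes, efficiency {efficiency:.2f}")
```

By this measure the 900-node run retains roughly 0.48 of the 20-node baseline in July 2007, compared with roughly 0.23 in December 2006.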
Tim O'Reilly picked up on Zawodny's post and confirmed that support comes from the top of the Yahoo organization:
...Yahoo! had hired Doug Cutting, the creator of hadoop, back in January. But Doug's talk at Oscon was kind of a coming out party for Hadoop, and Yahoo! wanted to make clear just how important they think the project is. In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top......why is Yahoo!'s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor ... Supporting Hadoop and other Apache projects not only gets Yahoo! deeply involved in open source software projects they can use, it helps give them renewed "geek cred." ... Second, and perhaps equally important, Yahoo! gives hadoop an opportunity to be tested out at scale...
Blogger John Munsch has summed up Yahoo's involvement as "Hadoop And The Opposite Of The Not-Invented-Here Syndrome".
Microsoft's Sriram Krishnan shifts the discussion to the problem facing startups and individual developers as the industry moves toward large-scale, still-evolving solutions such as Hadoop and Amazon EC2:
...Most of the value in Web 2.0 comes from data (generated by lots of users). E.g - del.ico.us, Digg, Facebook ... It is beyond the financial means of any one person to run large scale server software E.g Gmail, Google Search, Live, Y! Search ... Your typical long-haired geek probably can't get his hands on [# Large scale blob storage (S3, Google File System), Large scale structured storage ( Google's Bigtable), Tools to run code across such infrastructure (MapReduce, Dryad).]...I'm not sure how far along Doug Cutting's open source equivalents have come along for these - so that might be an answer...
FWIW, Terracotta users enjoy hadoop-style distribution of workload using the WorkManager pattern and java.util.concurrent without too much more. I looked at some of the stuff the Hadoop guys @ Yahoo showed me and the interface is definitely elegant. Still, TC might be worth looking at because it seemed a tad easier to me to ramp onto the concept of work queues, units of work, and push/pop as my interface between masters and workers: http://www.terracotta.org/confluence/display/labs/WorkManager And to read more about it... http://jonasboner.com/2007/01/29/how-to-build-a-pojo-based-data-grid-using-open-terracotta/
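The work-queue idea described in this comment (a master pushes units of work, workers pop them) can be sketched in a single process. The example below uses Python's standard `queue` and `threading` modules as a stand-in; it is not Terracotta's WorkManager API and does not distribute anything across machines, and `run_master_worker` is a hypothetical name.

```python
import queue
import threading

def run_master_worker(work_items, do_work, num_workers=4):
    """Master pushes every unit of work onto a queue; workers pop until it is empty."""
    work_queue = queue.Queue()
    results = queue.Queue()

    # Master side: enqueue all units of work before any worker starts.
    for item in work_items:
        work_queue.put(item)

    def worker():
        while True:
            try:
                item = work_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to pop, so this worker exits
            results.put((item, do_work(item)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # All workers have finished, so the results queue can be drained safely.
    collected = {}
    while not results.empty():
        item, value = results.get()
        collected[item] = value
    return collected

squares = run_master_worker(range(10), lambda n: n * n)
print(squares)
```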
.. a software platform lets one easily write and run applications that process vast amounts of data...Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...
How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?
http://wiki.tangosol.com/display/COH32UG/Provide+a+Data+Grid
Peace,
Cameron Purdy
Oracle Coherence: The Java Data Grid
From what I could tell, Hadoop is designed primarily to process distributed files. It is designed for the following use cases: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation.
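As an illustration of one of these use cases, inverted index construction can be phrased as a map step that emits (term, document) pairs and a reduce step that collects the documents for each term. This is a single-process Python sketch under that assumption; the function names and sample documents are hypothetical and none of this is Hadoop code.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_postings(term, doc_ids):
    """Reduce step: turn the document ids emitted for a term into a sorted posting list."""
    return sorted(doc_ids)

def build_inverted_index(documents):
    """Group map output by term, then reduce each group into a posting list."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return {term: reduce_postings(term, ids) for term, ids in grouped.items()}

documents = {
    "a": "hadoop sorts data",
    "b": "hadoop stores data blocks",
}
index = build_inverted_index(documents)
print(index["hadoop"])
```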
The core principles behind Hadoop are common to other patterns used in parallel computing since the 1980s, such as the Blackboard System and TupleSpaces/JavaSpaces.
While Hadoop is primarily designed for processing distributed files, the latter were primarily designed to do the same for distributed objects. One of the key aspects of the TupleSpaces/JavaSpaces model is that it is transactional and can therefore handle partial-failure scenarios. It also guarantees once-and-only-once execution of a task. You can use it in both synchronous and asynchronous models.
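The transactional behaviour described here can be pictured as: take a task out of the space, and either commit its result or roll the task back so another worker can pick it up. The Python sketch below shows that idea in one process. It is an assumption-laden analogy, not the JavaSpaces API: `TaskSpace`, `drain`, and `flaky_square` are hypothetical names, and a real space would also need leases or timeouts to recover tasks from workers that die silently.

```python
import threading

class TaskSpace:
    """Toy task space: a taken task is either committed once or returned on rollback."""

    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._pending = list(tasks)
        self._in_flight = set()
        self.results = {}

    def take(self):
        """Remove and return the next pending task, or None if the space is empty."""
        with self._lock:
            if not self._pending:
                return None
            task = self._pending.pop(0)
            self._in_flight.add(task)
            return task

    def commit(self, task, result):
        """Record the result; the task never returns to the space."""
        with self._lock:
            self._in_flight.remove(task)
            self.results[task] = result

    def rollback(self, task):
        """Put a failed task back so it can be taken again."""
        with self._lock:
            self._in_flight.remove(task)
            self._pending.append(task)

def drain(space, handler):
    """Process tasks until the space is empty, rolling back any task whose handler fails."""
    while (task := space.take()) is not None:
        try:
            result = handler(task)
        except Exception:
            space.rollback(task)
        else:
            space.commit(task, result)

attempts = {}

def flaky_square(task):
    """Hypothetical handler that fails on the first attempt for even-numbered tasks."""
    attempts[task] = attempts.get(task, 0) + 1
    if task % 2 == 0 and attempts[task] == 1:
        raise RuntimeError("simulated worker crash")
    return task * task

space = TaskSpace([1, 2, 3, 4])
drain(space, flaky_square)
print(space.results)
```

Note that the handler may run more than once for a task that fails; what happens once and only once is the committed result.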
A significantly simplified abstraction that harnesses the power of the JavaSpaces/Master-Worker pattern is provided through our OpenSpaces/Spring framework, as part of work done in collaboration with Interface21.
This abstraction makes parallel execution as simple as a remote call to a remote POJO service. How the transaction is handled, whether execution is synchronous or asynchronous, and how an affinity key ties the execution to the location of the data are all configured declaratively behind the scenes, keeping your code largely unaware of those details.
You can find an example and more details on OpenSpaces and how it uses the Spring Remoting abstraction here
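The affinity-key idea mentioned above, routing an invocation to wherever its data lives, can be illustrated with a small hash-partitioning sketch in Python. This is not the OpenSpaces or Spring Remoting API; `partition_for` and `PartitionedStore` are hypothetical names, and CRC32 is simply an assumed choice of stable hash.

```python
import zlib

def partition_for(affinity_key, partition_count):
    """Map an affinity key to a partition index using a process-independent hash."""
    return zlib.crc32(str(affinity_key).encode("utf-8")) % partition_count

class PartitionedStore:
    """Toy partitioned store: writes and task execution are routed by the same key."""

    def __init__(self, partition_count):
        self.partitions = [dict() for _ in range(partition_count)]

    def write(self, affinity_key, value):
        index = partition_for(affinity_key, len(self.partitions))
        self.partitions[index].setdefault(affinity_key, []).append(value)

    def execute(self, affinity_key, task):
        """Run the task against the partition that holds the key's data."""
        index = partition_for(affinity_key, len(self.partitions))
        local_values = self.partitions[index].get(affinity_key, [])
        return task(local_values)

store = PartitionedStore(partition_count=4)
for customer, amount in [("alice", 10), ("bob", 5), ("alice", 7)]:
    store.write(customer, amount)

total_for_alice = store.execute("alice", sum)
print(total_for_alice)
```

Because writes and executions share one routing function, a task keyed by "alice" always runs against the partition that received her data, so no data has to move to the caller.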
Nati S
GigaSpaces
Write Once Scale Anywhere
Hey, how about Erlang? Erlang fits the discussion points above well. Because distribution and concurrency are built into the language, it can accomplish more in less time than any other language implementation.
It depends on the problem. Erlang is a functional language, so it can be incredibly effective for a certain class of problems. It has not proven as effective in the "general purpose language" field, which is still dominated by imperative languages (e.g. Java).
Peace,
Cameron Purdy
Oracle Coherence: The Java Data Grid
Cameron asks: "How does MapReduce compare to the once-and-only-once parallel execution guarantees provided by Coherence?" Nati says: "While Hadoop is primarily designed for processing of distributed files......." So I think overall, whichever way you go, you end up with roughly the same patterns and effect. The differences are likely, for the most part, in the level of granularity, overall scale, cost versus performance, and programming model.
One element of the tradeoff: large amounts of data that can't easily be held entirely in memory (which implies you have it all in files on disk) won't favour Coherence or GigaSpaces. By contrast, lots of small bits of data that all fit into memory don't favour MapReduce. So it seems to me to come down to the cost of computation versus the cost of I/O, and whether you can avoid I/O (i.e. moving data to a node). Avoiding I/O might in turn come down to whether you can afford enough memory for all your data. If you can cram it all in memory and you figure you can keep it that way, you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, and factor in things like ease of installation too). If you can't cram it all in memory, Hadoop might well be your best friend.
As an aside, MapReduce also handles failures, but not with transactions; rather, it throws away a chunk of results and redoes them (basically, it's using checkpointing and restarts). Horses for courses.
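The failure handling described in this comment, discarding a chunk's partial results and redoing the chunk, can be sketched as follows. This is a single-process Python illustration of the idea as the commenter describes it, not Hadoop's actual scheduler; `run_with_reexecution` and `upper_flaky` are hypothetical names.

```python
def run_with_reexecution(chunks, process_record, max_attempts=3):
    """Process each chunk; on any failure, discard its partial output and redo the chunk."""
    published = {}
    for chunk_id, records in enumerate(chunks):
        for _attempt in range(max_attempts):
            partial = []  # output stays private until the whole chunk succeeds
            try:
                for record in records:
                    partial.append(process_record(record))
            except Exception:
                continue  # throw the partial results away and restart the chunk
            published[chunk_id] = partial
            break
        else:
            raise RuntimeError(f"chunk {chunk_id} failed {max_attempts} times")
    return published

failures_left = {"b": 1}

def upper_flaky(record):
    """Hypothetical record processor that fails once when it first sees record 'b'."""
    if failures_left.get(record, 0) > 0:
        failures_left[record] -= 1
        raise OSError("simulated node failure")
    return record.upper()

output = run_with_reexecution([["a", "b"], ["c"]], upper_flaky)
print(output)
```

No transaction is involved: correctness relies on the chunk being safe to reprocess from its input and on partial output never being published.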
If you can cram it all in memory and you figure you can keep it that way, you want Coherence or Giga or Terracotta (benchmark all of them with a realistic test to get an idea of which you should choose, and factor in things like ease of installation too).
FWIW, Terracotta is not restricted to an in-memory dataset. It spills to disk as space is needed OR can be configured to keep all data consistent on disk.
Very sound advice in terms of building a test that ACCURATELY depicts your use case (which is harder than most people realize) and picking which works best for you. Each has its design targets and associated sweet spot use cases.
Enjoy,
--Ari
Although completely different in terms of what it's solving, Coherence has the same disk capabilities. However, benchmarks with data on disk (i.e. enough data that it's not going to fit in the OS cache) are obviously much worse than benchmarks with data all in memory. If you need efficient disk data access, you may need to choose a much different architectural model than a data grid, for example.
Peace,
Cameron Purdy
Oracle Coherence: The Java Data Grid
I am actually curious what these problems are where Erlang does not perform as well as Java does. Also, what are the problems of "functional programming"?