Posted by Scott Delap on Jan 18, 2008 03:40 PM
A recent article on the Database Column by David J. DeWitt and Michael Stonebraker attempts to compare the increasingly popular MapReduce programming paradigm to a relational database. The article goes so far as to say:
...As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968....Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.
The article goes on to list criteria such as:
The blogosphere has quickly called foul on the comparison and its reasoning. Greg Jorgensen provides a detailed rebuttal. Among the items he notes is that MapReduce is not a database but an algorithmic technique for distributed processing, and should not be compared to one. Jorgensen proposes that a better comparison would have been to SimpleDB:
...What the authors really want to gripe about is distributed “cloud” data management systems like Amazon’s SimpleDB; in fact if you change “MapReduce” to “SimpleDB” the original article almost makes sense...
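To make Jorgensen's distinction concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, building an inverted index from raw text. It is illustrative only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`) and the sample documents are assumptions made for this example, not Google's API, and a real framework would distribute each phase across many machines and handle failures.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word in a single document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Collapse all values for one key into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, shuffle and reduce in-process (hypothetical driver)."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())


# Hypothetical sample input: raw text keyed by document id.
docs = {
    "d1": "map reduce is not a database",
    "d2": "a database is not map reduce",
}
index = build_inverted_index(docs)
```

Nothing here stores or queries data. The sketch only transforms raw input into derived data, which could then be loaded into whatever storage system is appropriate.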
Rich Skrenta comments on the angle of disruption:
...The thing that disrupts you is always uglier and worse in some way. Less features, less developed. But if there's a 10X price win in there somewhere, the cheap rickety thing wins in the end. Think Linux vs. AT&T Unix, or mysql vs. Oracle...
Lengthy debate and comment on the topic can also be found on reddit and ycombinator.
I think the authors either do not understand Map/Reduce (doubtful) or are clearly mistaken about where it is used and for what scenarios. What I think the authors are also missing is the combination of data partitioning and affinity map/reduce that is becoming the prevailing design pattern for grid applications. I blogged about it here in more detail. All in all, it is sad to read such a misguided piece… Best, Nikita Ivanov. GridGain – Grid Computing Made Simple
MapReduce is not a DB. I don't see the parallel here to Teradata or any others of similar ilk. Swapping it with SimpleDB or BigTable is a more logical perspective. Also - if they were referring to BigTable, then in fact, it does support indexes and doesn't do brute force searches.
The authors would have done well to read the introduction to the MapReduce paper they cited:
Prior to our development of MapReduce, the authors and many others at Google implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, Web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of Web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
MapReduce is for doing computation on raw data. In Google's case this data is usually crawled from the web. Google likely stores some of the data they glean from raw data they process using MapReduce in a ... DBMS. *sigh*
Yeah, I definitely think that the relational database goo-roos who annually inflict billions of dollars in monetary damages on unsuspecting IT departments are in a grand position to tell Google how to do search. Perhaps David and Michael would also like to offer me parenting advice...?
Not that I parent as well as Google does search. Oh, you know what I meant...