InfoQ

MapReduce A Step Backwards: Is Comparison to Relational Databases Fair?

Posted by Scott Delap on Jan 18, 2008

Sections: Operations & Infrastructure, Enterprise Architecture, Development, Architecture & Design
Topics: Grid Computing, Java
Tags: MapReduce, Hadoop

A recent article on the Database Column by David J. DeWitt and Michael Stonebraker attempts to compare the increasingly popular MapReduce programming paradigm to a relational database. The article goes so far as to say:

...As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968....Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.

The article goes on to list criticisms such as:

  • MapReduce is a poor implementation (it relies on brute-force scans rather than indexes such as B-trees)
  • MapReduce is not novel
  • MapReduce is missing features (such as loading and indexing)
  • MapReduce is incompatible with the DBMS tools

The blogosphere has quickly called foul on the comparison and its reasoning. Greg Jorgensen provides a detailed rebuttal. Among the points he makes is that MapReduce is not a database but an algorithmic technique for distributed processing, and so should not be compared to one. Jorgensen proposes that a better comparison would have been to SimpleDB:

...What the authors really want to gripe about is distributed “cloud” data management systems like Amazon’s SimpleDB; in fact if you change “MapReduce” to “SimpleDB” the original article almost makes sense...

Rich Skrenta comments on the angle of disruption:

...The thing that disrupts you is always uglier and worse in some way. Less features, less developed. But if there's a 10X price win in there somewhere, the cheap rickety thing wins in the end. Think Linux vs. AT&T Unix, or mysql vs. Oracle...

Lengthy debate and comment on the topic can also be found on reddit and ycombinator.
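
To make the "technique, not a database" point concrete, here is a minimal single-process sketch of the map and reduce steps for one of the tasks named in the MapReduce paper's introduction: counting pages crawled per host. The function names and the sample URLs are assumptions invented for illustration; this is not Google's or Hadoop's API, and a real framework would run the same two functions across many machines and handle failures.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url):
    """Map step: emit one (host, 1) pair per crawled URL."""
    yield urlparse(url).netloc, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(host, counts):
    """Reduce step: sum the counts emitted for one host."""
    return host, sum(counts)


def pages_per_host(urls):
    """Run map, shuffle, and reduce in sequence on a list of URLs."""
    pairs = (pair for url in urls for pair in map_page(url))
    return dict(reduce_counts(h, c) for h, c in shuffle(pairs).items())


# Hypothetical crawl log, invented for illustration.
crawled = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/index.html",
]
print(pages_per_host(crawled))
```

Note that nothing here stores, indexes, or queries data; the sketch only describes how a computation over raw input is split into independent steps.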

  1. Sad...

    by Nikita Ivanov

    I think the authors either do not understand Map/Reduce (doubtful) or are clearly mistaken about where it is used and for which scenarios. What I think the authors are also missing is the combination of data partitioning and affinity map/reduce that is becoming the prevailing design pattern for grid applications. I blogged about it here in more detail.

    All in all, it is sad to read such a misguided piece…

    Best,
    Nikita Ivanov.
    GridGain – Grid Computing Made Simple

  2. more research needed.

    by Zubin Wadia

    MapReduce is not a DB. I don't see the parallel here to Teradata or any others of similar ilk.

    Swapping it with SimpleDB or BigTable is a more logical perspective.

    Also - if they were referring to BigTable, then in fact, it does support indexes and doesn't do brute force searches.

  3. Stooopid

    by Kevin Teague

    The authors would have done well to read the introduction to the MapReduce paper they cited:


    Prior to our development of MapReduce, the authors and many others at Google implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, Web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of Web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.


    MapReduce is for doing computation on raw data. In Google's case this data is usually crawled from the web. Google likely stores some of the data they glean from raw data they process using MapReduce in a ... DBMS. *sigh*

  4. Holy crap...

    by Kurt Christensen

    Yeah, I definitely think that the relational database goo-roos who annually inflict billions of dollars in monetary damages on unsuspecting IT departments are in a grand position to tell Google how to do search. Perhaps David and Michael would also like to offer me parenting advice...?

  5. Re: Holy crap...

    by Kurt Christensen

    Not that I parent as well as Google does search. Oh, you know what I meant...
