Google's Daniel Peng and Frank Dabek published a paper, "Large-scale Incremental Processing Using Distributed Transactions and Notifications," explaining that databases do not meet the storage or throughput requirements of Google's indexing system, which stores tens of petabytes of data and processes billions of updates per day on thousands of machines.
At the recent Hadoop Summit, Jay Kreps of LinkedIn presented details of how the company processes data. Kreps described how LinkedIn crunches 120 billion relationships per day and blends large-scale data computation with high-volume, low-latency site serving.
The Hadoop Summit of 2010 started off with a vuvuzela blast from Blake Irving, Chief Product Officer for Yahoo. Yahoo delivered keynote addresses that outlined the scale of their Hadoop use, technical directions for their contributions, and architectural patterns in how they apply the technology.
Adobe recently released to the community the Puppet recipes that it uses to automate Hadoop/HBase deployments. InfoQ spoke with Luke Kanies, founder of PuppetLabs, to learn more about what this means.
The Apache Mahout project, a set of highly scalable machine-learning libraries, recently announced its first public release. InfoQ spoke with Grant Ingersoll, co-founder of Mahout and a member of the technical staff at Lucid Imagination, to learn more about this project and machine learning in general.
It has been possible to run Hadoop on EC2 for a while. Today Amazon simplified the process by announcing Amazon Elastic MapReduce, which automatically deploys EC2 instances for computational use and includes an API for interacting with them.
Cascading is a new API for data processing on Hadoop clusters that supports building complex processing workflows in an expressive, declarative style.
Aster Data Systems has announced an in-database MapReduce implementation for their nCluster database platform.
The MapReduce design pattern for distributing data processing was introduced by Google in 2004, initially with a C++ implementation. A new Ruby implementation, Skynet, has now been released by Adam Pisoni. InfoQ had the chance to catch up with Adam about its features and how it compares to an existing Ruby implementation called Starfish.
A recent article on the Database Column by David J. DeWitt and Michael Stonebraker attempts to compare the increasingly popular MapReduce programming paradigm to a relational database. The blogosphere has quickly called foul on the comparison and its reasoning.
Amazon's Elastic Compute Cloud (EC2) allows developers to acquire computing power at the rate of $0.10 per hour consumed. Work has been done to allow Hadoop, an open source MapReduce implementation written in Java, to run on EC2. This combination will allow developers to write scalable algorithms and then bring up large numbers of servers as computing power for them as needed.