InfoQ

News

Cascading - Data Processing API for Hadoop MapReduce

Posted by R.J. Lorimer on Oct 10, 2008 01:09 PM

Community
Java
Topics
Cloud Computing
Tags
Hadoop ,
MapReduce
Cascading is a new processing API for data processing on Hadoop clusters, and supports building complex processing workflows using an expressive API as opposed to directly implementing Hadoop MapReduce algorithms.
The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce. And to efficiently schedule them based on their dependencies and other available meta-data.
The core concepts of the cascading API are pipes and flows. A pipe is a series of processing steps (parsing, looping, filtering, etc) that defines the data processing to be done, and a flow is the association of a pipe (or set of pipes) with a data-source and data-sink. In other words, a flow is a pipe with data flowing through it. Going one step further, a cascade is the chaining, branching and grouping of multiple flows.

There are a number of key features provided by this API:
  • Dependency-Based 'Toplogical Scheduler' and MapReduce Planning - Two key components of the cascading API are its ability to schedule the invocation of flows based on dependency; with the execution order being independent of construction order, often allowing for concurrent invocation of portions of flows and cascades. In addition, the steps of the various flows are intelligently converted into map-reduce invocations against the hadoop cluster.
  • Event Notification - The various steps of the flow can perform notifications via callbacks, allowing for the host application to report and respond to the progress of the data processing.
  • Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby - making it readily accessible for popular dynamic JVM languages
There are a number of documents available for learning the concepts and implementation of the cascading API. There is an introductory overview presentation PDF that walks through, at a high level, the core concepts of the cascading API. There is also a 'gentle introduction' example available that walks through creating a simple Apache log parser. Lastly, there is a full Javadoc of the Cascading API.

No comments

Watch Thread Reply

Educational Content

Bindings, Platforms, and Innovation

This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.

Orchestrating Long Running Activities with JBoss / JBPM

This article explores the use of JBoss and jBPM to implement design solutions that effectively address the issue of orchestrating long running activities.

Neo4j - The Benefits of Graph Databases

This presentation covers the use of graph databases as an optimal solution for data that is difficult to fit in static tables, rapidly evolving data or data that has a lot of optional attributes.

Realistic about Risk: Software development with Real Options

This session introduces Real Options and shows how it can help in running your project. Real Options is a decision-making process that can be used to manage risk.

Communication Flexibility Using Bindings

This article discusses the use of bindings on services and references (including the instance of non-configured bindings) as the means to implement SCA communications in a Web and SOA environment.

Writing DSLs in Groovy

After a short introduction to DSLs, Scott Davis plays with the keyboard showing how to approach the creation of a DSL by typing working snippets of Groovy code that get executed.

Scaling Agile with C/ALM (Collaborative Application Lifecycle Management)

IBM Rational and InfoQ present, Scaling Agile with C/ALM, an eBook showing organizations how to become “finely tuned software delivery machines” by enabling team integration and scaling.

Concurrent Programming with Microsoft F#

Amanda Laucher presents a real life enterprise application written in F#. She shows actual code snippets, explaining design decisions and suggesting how to use some of the F# constructs.