Cascading - Data Processing API for Hadoop MapReduce

| by R.J. Lorimer Follow 0 Followers on Oct 10, 2008. Estimated reading time: 1 minute |

A note to our readers: You asked so we have developed a set of features that allow you to reduce the noise: you can get email and web notifications for topics you are interested in. Learn more about our new features.

Cascading is a new processing API for data processing on Hadoop clusters, and supports building complex processing workflows using an expressive API as opposed to directly implementing Hadoop MapReduce algorithms.
The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce. And to efficiently schedule them based on their dependencies and other available meta-data.
The core concepts of the cascading API are pipes and flows. A pipe is a series of processing steps (parsing, looping, filtering, etc) that defines the data processing to be done, and a flow is the association of a pipe (or set of pipes) with a data-source and data-sink. In other words, a flow is a pipe with data flowing through it. Going one step further, a cascade is the chaining, branching and grouping of multiple flows.

There are a number of key features provided by this API:
  • Dependency-Based 'Toplogical Scheduler' and MapReduce Planning - Two key components of the cascading API are its ability to schedule the invocation of flows based on dependency; with the execution order being independent of construction order, often allowing for concurrent invocation of portions of flows and cascades. In addition, the steps of the various flows are intelligently converted into map-reduce invocations against the hadoop cluster.
  • Event Notification - The various steps of the flow can perform notifications via callbacks, allowing for the host application to report and respond to the progress of the data processing.
  • Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby - making it readily accessible for popular dynamic JVM languages
There are a number of documents available for learning the concepts and implementation of the cascading API. There is an introductory overview presentation PDF that walks through, at a high level, the core concepts of the cascading API. There is also a 'gentle introduction' example available that walks through creating a simple Apache log parser. Lastly, there is a full Javadoc of the Cascading API.

Rate this Article

Adoption Stage

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread


Login to InfoQ to interact with what matters most to you.

Recover your password...


Follow your favorite topics and editors

Quick overview of most important highlights in the industry and on the site.


More signal, less noise

Build your own feed by choosing topics you want to read about and editors you want to hear from.


Stay up-to-date

Set up your notifications and don't miss out on content that matters to you