InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

Cascading - Data Processing API for Hadoop MapReduce

Posted by R.J. Lorimer on Oct 10, 2008

Sections
Architecture & Design,
Development,
Operations & Infrastructure
Topics
Java ,
Cloud Computing
Tags
Hadoop ,
MapReduce
Cascading is a new processing API for data processing on Hadoop clusters, and supports building complex processing workflows using an expressive API as opposed to directly implementing Hadoop MapReduce algorithms.
The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce. And to efficiently schedule them based on their dependencies and other available meta-data.
The core concepts of the cascading API are pipes and flows. A pipe is a series of processing steps (parsing, looping, filtering, etc) that defines the data processing to be done, and a flow is the association of a pipe (or set of pipes) with a data-source and data-sink. In other words, a flow is a pipe with data flowing through it. Going one step further, a cascade is the chaining, branching and grouping of multiple flows.

There are a number of key features provided by this API:
  • Dependency-Based 'Toplogical Scheduler' and MapReduce Planning - Two key components of the cascading API are its ability to schedule the invocation of flows based on dependency; with the execution order being independent of construction order, often allowing for concurrent invocation of portions of flows and cascades. In addition, the steps of the various flows are intelligently converted into map-reduce invocations against the hadoop cluster.
  • Event Notification - The various steps of the flow can perform notifications via callbacks, allowing for the host application to report and respond to the progress of the data processing.
  • Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby - making it readily accessible for popular dynamic JVM languages
There are a number of documents available for learning the concepts and implementation of the cascading API. There is an introductory overview presentation PDF that walks through, at a high level, the core concepts of the cascading API. There is also a 'gentle introduction' example available that walks through creating a simple Apache log parser. Lastly, there is a full Javadoc of the Cascading API.

No comments

Watch Thread Reply

Educational Content

New-age Transactional Systems - Not Your Grandpa's OLTP

John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.

Cool Code

Kevlin Henney examines code samples to see what can be learned from them starting from the premise that one won’t write great code unless he knows how to read it.

Collaboration: At the Extremities of Extreme

Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.

Yesod Web Framework

Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).

Transactions without Transactions

Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.

Attila Szegedi on JVM and GC Performance Tuning at Twitter

Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.

10 tips on how to prevent business value risk

One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.

Interview: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives

InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.