MapReduce Patterns, Algorithms, and Use Cases
With the explosion of Hadoop and big data usage, many people are looking for ways to convert their existing implementations to MapReduce. Unfortunately, with the notable exceptions of "Data-Intensive Text Processing with MapReduce" and "Mahout in Action", there are very few publications dedicated to the design of MapReduce implementations. In his new article, "MapReduce Patterns, Algorithms, and Use Cases", Ilya Katsov provides a systematic overview of problems that can be solved using a MapReduce framework.
The article starts with a fairly straightforward use of MapReduce as a general-purpose parallel execution framework, which applies to many implementations that need large clusters for compute- and data-intensive calculations, including physical and engineering simulations, numerical analysis, and performance testing. The next group of algorithms, commonly used in log analysis, ETL, and data querying, includes counting and summing, data collating (based on specific functions), filtering, parsing, validation, and sorting.
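The counting and summing pattern mentioned above can be illustrated with a minimal sketch. It is not taken from Katsov's article: `run_mapreduce` is a hypothetical in-memory stand-in for a real framework such as Hadoop, and the log lines are invented sample data.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for a MapReduce job."""
    # Map phase plus shuffle: group every emitted value by its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: call the reducer once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def count_mapper(line):
    # Emit (term, 1) for every whitespace-separated term in the line.
    for term in line.split():
        yield term, 1


def sum_reducer(term, counts):
    # Sum all partial counts collected for one term.
    yield term, sum(counts)


# Invented sample input standing in for log records.
log_lines = ["GET /index", "GET /about", "POST /index"]
term_counts = dict(run_mapreduce(log_lines, count_mapper, sum_reducer))
```

Filtering, parsing, and validation fit the same shape: the mapper emits only the records that pass a check, and the reducer can simply pass its values through.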
The second large group of MapReduce patterns discussed by Katsov covers relational patterns, often used by data warehousing applications. These patterns are widely leveraged by Hive and Pig implementations and include predicate- or function-based data selection, data projection, data union, difference and intersection, and groupBy aggregations. A separate discussion is dedicated to implementing data joins and includes such algorithms as repartition joins and replicated joins.
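The two join strategies named above can be contrasted in a short sketch. It rests on several assumptions: both functions run in a single process rather than on a cluster, the join key is the first field of each tuple, and the `users` and `orders` tables are invented for illustration. A repartition join tags each record with its source and joins in the reducer. A replicated join loads the small table into memory and joins in the mapper.

```python
from collections import defaultdict


def repartition_join(left, right):
    """Reduce-side join sketch: shuffle both tables by key, then pair them up."""
    # Map phase: tag each record with its source table and group by join key.
    shuffled = defaultdict(list)
    for tag, table in (("L", left), ("R", right)):
        for record in table:
            shuffled[record[0]].append((tag, record))
    # Reduce phase: for each key, combine every left record with every right one.
    joined = []
    for key in sorted(shuffled):
        lefts = [rec for tag, rec in shuffled[key] if tag == "L"]
        rights = [rec for tag, rec in shuffled[key] if tag == "R"]
        for l_rec in lefts:
            for r_rec in rights:
                joined.append(l_rec + r_rec[1:])
    return joined


def replicated_join(small, large):
    """Map-side join sketch: the small table is held in memory by every mapper."""
    # Build an in-memory index of the small table, keyed by the join key.
    index = defaultdict(list)
    for record in small:
        index[record[0]].append(record)
    # Map phase only: look up each large-table record in the index.
    joined = []
    for record in large:
        for match in index.get(record[0], []):
            joined.append(match + record[1:])
    return joined


# Invented sample tables: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
by_repartition = repartition_join(users, orders)
by_replication = replicated_join(users, orders)
```

Both functions compute the same inner join here. The practical difference is that the replicated variant avoids the shuffle but only works when one table fits in memory.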
Moving further up the chain of complexity, the article discusses more advanced MapReduce processing algorithms, including graph processing, search algorithms (breadth-first search), PageRank, and data aggregation algorithms that can be leveraged in graph analysis, web indexing, and general search applications. It also covers common text analysis and market analysis use cases that require cross-correlation calculations. This part covers both the "pairs" and "stripes" design patterns and their comparative merits.
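To make the "pairs" and "stripes" distinction concrete, here is a minimal sketch that counts item co-occurrences both ways. It makes some assumptions: the map and reduce phases are collapsed into in-memory loops, each item counts at most once per basket, and the baskets are invented sample data. In the pairs approach each emitted key is an (item, neighbor) tuple. In the stripes approach each key is a single item whose value is a map of neighbor counts.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_pairs(baskets):
    """Pairs: one counter entry per ordered (item, neighbor) pair."""
    counts = Counter()
    for basket in baskets:
        # A mapper would emit ((a, b), 1) here; a reducer would sum per pair.
        for a, b in permutations(set(basket), 2):
            counts[(a, b)] += 1
    return dict(counts)


def cooccurrence_stripes(baskets):
    """Stripes: one associative array of neighbor counts per item."""
    stripes = defaultdict(Counter)
    for basket in baskets:
        items = set(basket)
        for a in items:
            # A mapper would emit (a, stripe); a reducer would merge stripes
            # element-wise, which Counter.update does for us here.
            stripes[a].update(items - {a})
    return {item: dict(stripe) for item, stripe in stripes.items()}


# Invented sample input standing in for shopping baskets.
baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["eggs", "milk"]]
pairs = cooccurrence_pairs(baskets)
stripes = cooccurrence_stripes(baskets)

# Flatten the stripes so the two results can be compared directly.
flattened = {(a, b): n for a, stripe in stripes.items() for b, n in stripe.items()}
```

Pairs produce many small key-value records and need little memory per key. Stripes produce fewer, larger records, but each stripe has to fit in memory.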
Finally, Katsov provides a good bibliography of more complex MapReduce implementations in the field of machine learning.
Most of the algorithms described in the article are accompanied by pseudocode, basic information on their applicability, advantages, and disadvantages, and some real-world use cases.
Many people today still struggle with the applicability of Hadoop and MapReduce to their business problems. Some still consider it a "technical approach in search of a business problem". The article is an important step toward filling an existing void in the field of MapReduce algorithms, use cases, and design patterns. It shows MapReduce’s power far beyond the infamous "word count" example and the ways it can be leveraged to solve a wide range of practical problems.