BT
x Your opinion matters! Please fill in the InfoQ Survey about your reading habits!

Twitter Open-Sources its MapReduce Streaming Framework Summingbird

by Michael Hausenblas on Jan 16, 2014 |

Twitter has open sourced their MapReduce streaming framework, called Summingbird. Available under the Apache 2 license, Summingbird is a large-scale data processing system enabling developers to uniformly execute code in either batch-mode (Hadoop/MapReduce-based) or stream-mode (Storm-based) or a combination thereof, called hybrid mode.

In order for Twitter to be able to keep up processing 500 millions tweets and growing, they had to find a replacement for their existing stack that required manually integrating MapReduce (Pig/Scalding) and streaming-based (Storm) code. The main motivation to create Summingbird, mentioned by the Twitter engineers, came from the realization that running a fully real-time system on Storm was difficult due to:

  • Re-computation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log-loading mechanism.
  • Storm is focused on message passing and random-write databases are harder to maintain.

This insight led to Summingbird, a flexible and general solution addressing the engineers’ practical issues with the existing approach:

  • Two sets of aggregation logic have to be kept in sync in two different systems
  • Keys and values must be serialized consistently between each system and the client
  • The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results

Summingbird is also one of the first openly available Lambda Architecture compliant systems. Similar projects include Yahoo’s Storm-YARN and a Spanish start-up’s upcoming Lambdoop, a Java framework for developing Big Data applications in a Lambda Architecture conformant way. The characteristics of the Lambda Architecture - immutable master dataset and the combination of batch, serving, and speed layer - enables people to build robust large-scale data processing systems that can deal with both batch and stream processing and has use cases from social media platforms (such as Twitter, LinkedIn, etc.) over the Internet of Things (smart city, wearables, manufacturing, etc.) to the financial sector (fraud detection, recommendations).

The main authors of Summingbird - Oscar Boykin, Sam Ritchie (nephew of computer science legend Dennis Ritchie) and Ashutosh Singhal - have further revealed the roadmap for Summingbird:

  • support for Apache Spark and the columnar data storage format Parquet
  • libraries of higher-level mathematics and machine learning code on top of Summingbird’s Producer primitives, and
  • deeper integration with related open source projects such as Algebird or Storehaus.

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT