Twitter Open-Sources its MapReduce Streaming Framework Summingbird

by Michael Hausenblas on Jan 16, 2014

Twitter has open-sourced its MapReduce streaming framework, called Summingbird. Available under the Apache 2 license, Summingbird is a large-scale data processing system that lets developers run the same code in batch mode (Hadoop/MapReduce-based), in stream mode (Storm-based), or in a combination of the two, called hybrid mode.
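To make the idea of one job definition running in two modes concrete, here is a minimal sketch in Python. It is not Summingbird's API (Summingbird itself is a Scala library); the functions `tokenize`, `run_batch` and `run_stream` and the sample tweets are hypothetical and only illustrate that the same logic can be driven by a batch runner or a streaming runner.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def tokenize(tweet: str) -> List[str]:
    """The job's only business logic: split a tweet into lowercase words."""
    return tweet.lower().split()


def run_batch(tweets: Iterable[str]) -> Counter:
    """Batch mode: process the complete dataset and return the final counts."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def run_stream(tweets: Iterable[str]) -> Iterator[Counter]:
    """Stream mode: yield a snapshot of the counts after every incoming tweet."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
        yield Counter(counts)  # copy, so earlier snapshots are not mutated


# Hypothetical sample input
tweets = ["just setting up my twttr", "setting up summingbird"]

batch_result = run_batch(tweets)
stream_result = list(run_stream(tweets))[-1]  # last snapshot of the stream
```

Both runners reuse `tokenize` unchanged, so once the stream has seen every tweet, its latest snapshot matches the batch result.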

To keep up with processing 500 million tweets per day, a number that is still growing, Twitter had to find a replacement for its existing stack, which required manually integrating MapReduce (Pig/Scalding) and streaming-based (Storm) code. According to the Twitter engineers, the main motivation for creating Summingbird was the realization that running a fully real-time system on Storm is difficult for two reasons:

  • Re-computation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log-loading mechanism.
  • Storm is focused on message passing, and random-write databases are harder to maintain.

This insight led to Summingbird, a flexible and general solution addressing the engineers’ practical issues with the existing approach:

  • Two sets of aggregation logic have to be kept in sync in two different systems
  • Keys and values must be serialized consistently between each system and the client
  • The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results
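The client-side work described in the last bullet can be sketched as follows. This is an illustration in Python under stated assumptions, not code from Twitter: the two in-memory stores, their contents, and the names `merge_counts` and `serve` are hypothetical, and the stores are assumed to cover disjoint time ranges so that nothing is counted twice.

```python
from collections import Counter


def merge_counts(batch: Counter, realtime: Counter) -> Counter:
    """Final aggregation: counts from disjoint time ranges can simply be added."""
    return batch + realtime


# Hypothetical store contents. The batch store holds counts computed by the
# last Hadoop run; the realtime store holds counts for events seen since then.
batch_store = Counter({"#scala": 1200, "#storm": 300})
realtime_store = Counter({"#scala": 7, "#hadoop": 2})


def serve(key: str) -> int:
    """Read from both stores and return the combined count for one key."""
    return merge_counts(batch_store, realtime_store)[key]
```

In the hand-integrated setup, every client has to implement this merge itself and must agree with both pipelines on how keys and values are encoded, which is the maintenance burden the bullets describe.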

Summingbird is also one of the first openly available systems that comply with the Lambda Architecture. Similar projects include Yahoo’s Storm-YARN and the upcoming Lambdoop from a Spanish start-up, a Java framework for developing Big Data applications in a way that conforms to the Lambda Architecture. The characteristics of the Lambda Architecture (an immutable master dataset and the combination of batch, serving, and speed layers) enable people to build robust large-scale data processing systems that can deal with both batch and stream processing. Use cases range from social media platforms (such as Twitter and LinkedIn) through the Internet of Things (smart cities, wearables, manufacturing) to the financial sector (fraud detection, recommendations).

The main authors of Summingbird - Oscar Boykin, Sam Ritchie (nephew of computer science legend Dennis Ritchie) and Ashutosh Singhal - have further revealed the roadmap for Summingbird:

  • support for Apache Spark and the columnar data storage format Parquet
  • libraries of higher-level mathematics and machine learning code on top of Summingbird’s Producer primitives, and
  • deeper integration with related open source projects such as Algebird or Storehaus.
