Yahoo! Releases S4, a Real Time, Distributed Stream Computing Platform

This month, Yahoo! released a new open source framework for "processing continuous, unbounded streams of data." The framework, named S4, allows for massively distributed computations over data that is constantly changing.

Although described by some as "real-time map reduce," S4 is better compared to a streaming database or a complex event processing system. Such systems have been available for many years from vendors like StreamBase, Oracle, Tibco and Sybase, as well as from open source projects like Active Insight, Esper and OpenESB. Many of the vendor systems have been folded into larger Enterprise Service Bus offerings.

Yahoo!'s announcement has left many developers confused about what exactly S4 does. The Yahoo! team admits that the documentation is lagging behind the code, but says that it wanted to make the code public as soon as possible so that developers could try it out. To help orient new developers, the team points to the example implementations that ship with S4.

To understand the examples, developers should know that the central concepts in S4 are Events and Processing Elements. An event is any arbitrary Java object passing through S4. Processing Elements are pieces of logic that take in events and act on them. Most often, a processing element takes in an event and emits another event. Most events have a "key," and S4 makes sure that two events with the same key end up being processed on the same machine. Crucial to the scalability of S4 is the idea that every instance of a processing element handles only one key. To paraphrase the S4 Overview, this means that a word-counting application, for example, creates one processing element instance per word being counted.
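The keyed model described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the S4 API: the names KeyedDispatchSketch, Event, ProcessingElement, WordCountPE, Dispatcher and nodeFor are all hypothetical, and hashing the key to pick a node is only one plausible way to keep equal keys on the same machine.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: these types are hypothetical and are not the S4 API.
final class KeyedDispatchSketch {

    /** An event: a key plus an arbitrary payload object. */
    static final class Event {
        final String key;
        final Object payload;

        Event(String key, Object payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    /** A processing element (PE) handles the events for exactly one key. */
    interface ProcessingElement {
        void process(Event event);
    }

    /** Counts the events seen for its single word. */
    static final class WordCountPE implements ProcessingElement {
        final String word;
        int count;

        WordCountPE(String word) {
            this.word = word;
        }

        @Override
        public void process(Event event) {
            count++;
        }
    }

    /** Routes each event to the PE instance for its key, creating it on first use. */
    static final class Dispatcher<P extends ProcessingElement> {
        private final Map<String, P> instances = new HashMap<>();
        private final Function<String, P> factory;

        Dispatcher(Function<String, P> factory) {
            this.factory = factory;
        }

        void dispatch(Event event) {
            instances.computeIfAbsent(event.key, factory).process(event);
        }

        Map<String, P> instances() {
            return instances;
        }
    }

    /** Assumed placement rule: equal keys always hash to the same node index. */
    static int nodeFor(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        Dispatcher<WordCountPE> dispatcher = new Dispatcher<>(WordCountPE::new);
        for (String word : "to be or not to be".split(" ")) {
            dispatcher.dispatch(new Event(word, null));
        }
        // One PE instance exists per distinct word; print each count and its node.
        for (WordCountPE pe : dispatcher.instances().values()) {
            System.out.println(pe.word + " -> " + pe.count + " (node " + nodeFor(pe.word, 4) + ")");
        }
    }
}
```

Because each instance owns a single key, its state needs no coordination with other instances, which is what lets the work spread across machines.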

One example implementation tracks the top ten hash tags on Twitter. The example code consists of only eight classes, each with a relatively simple job. First, a TwitterFeedListener converts JSON from the Twitter "fire hose" to Java events for S4 to use. Then, a TopicExtractorPE pulls hash tags out of each tweet and creates one new event per hash tag. The TopicExtractorPE gives each event a "key" that matches the hash tag from the tweet. Next, a TopicCountAndReportPE (one instance per hash tag) counts the number of times its hash tag has been seen, emitting a new event with the hash tag and the count. Lastly, a single TopNTopicPE consumes all hash tag counts and keeps a sorted list of the top ten.
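The flow of that pipeline can be approximated in a single process. The sketch below is not the code that ships with S4: the class TopHashTagsSketch and its methods are hypothetical stand-ins for the extract, count and top-N stages, the JSON parsing step is skipped, and the regular expression used to find hash tags is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a single-process stand-in, not the code shipped with S4.
final class TopHashTagsSketch {
    // Assumed hash tag syntax: '#' followed by word characters.
    private static final Pattern HASH_TAG = Pattern.compile("#(\\w+)");

    // Counting stage: one counter per hash tag (one PE instance per key in S4 terms).
    private final Map<String, Integer> countsByTag = new HashMap<>();
    // Top-N stage: the latest count reported for each hash tag.
    private final Map<String, Integer> latestCounts = new HashMap<>();
    private final int n;

    TopHashTagsSketch(int n) {
        this.n = n;
    }

    /** Extract stage: returns one entry per hash tag found in the tweet text. */
    static List<String> extractHashTags(String tweetText) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = HASH_TAG.matcher(tweetText);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    /** Entry point: feeds one tweet through extract, count and top-N stages. */
    void onTweet(String tweetText) {
        for (String tag : extractHashTags(tweetText)) {
            int count = countsByTag.merge(tag, 1, Integer::sum);
            onCount(tag, count);
        }
    }

    /** Top-N stage input: records the newest count for a hash tag. */
    private void onCount(String tag, int count) {
        latestCounts.put(tag, count);
    }

    /** Returns up to n hash tags, highest count first, ties broken alphabetically. */
    List<String> top() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(latestCounts.entrySet());
        Comparator<Map.Entry<String, Integer>> byCountDesc =
                Map.Entry.<String, Integer>comparingByValue().reversed();
        entries.sort(byCountDesc.thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(n, entries.size()))) {
            tags.add(entry.getKey());
        }
        return tags;
    }

    public static void main(String[] args) {
        TopHashTagsSketch sketch = new TopHashTagsSketch(2);
        sketch.onTweet("Trying #s4 for #streaming");
        sketch.onTweet("More #s4 news");
        sketch.onTweet("#hadoop vs #s4, #streaming wins");
        System.out.println(sketch.top());
    }
}
```

In a distributed setting the two maps would live in separate processing elements, with the counting stage partitioned by hash tag and the top-N stage running as a single instance.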

Other typical uses for computing over continuous streams of data include: listening for stock trading signals, watching for fraud in transactions, and monitoring process logs to look for signs of trouble.

S4's sweet spot is processing huge volumes of short-lived data where the business mostly wants aggregates rather than every detail. S4 itself takes care of data locality and fault detection, leaving the developer free to concentrate on writing the processing logic.
