Yahoo! Releases S4, a Real Time, Distributed Stream Computing Platform
This month, Yahoo! released a new open source framework for "processing continuous, unbounded streams of data." The framework, named S4, allows for massively distributed computations over data that is constantly changing.
Although described by some as "real-time map reduce," S4 is better compared to a streaming database or a complex event processing system. Such systems have been around for many years from vendors like StreamBase, Oracle, Tibco and Sybase as well as in some open source projects like Active Insight Esper and OpenESB. Many of these vendor systems have been folded into larger Enterprise Service Bus offerings.
Yahoo's announcement has left many developers confused by what exactly S4 does. The Yahoo! team admits that the documentation is a little bit behind, but say that they wanted to get the code out in public as soon as possible to allow developers to try it out. To help orient new developers, the Yahoo! team points to the example implementations that come with S4.
To understand the examples, developers should know that the central concepts in S4 are Events and Processing Elements. An event is any arbitrary Java object floating through S4. Processing Elements are pieces of logic that take in events and do something based on them. Most often, a processing element takes in an event and emits another event. Most events have a "key" and S4 makes sure that two events with the same key end up being processed on the same machine. Crucial to the scalability of S4 is the idea that every instance of a processing element handles only one key. To paraphrase the S4 Overview, this means, for example, a word counting application will create one instance of a processing element per word being counted.
One example implementation tracks the top ten hash tags on Twitter. The example code consists of only eight classes with relatively simple jobs. First, a TwitterFeedListener converts JSON from the Twitter "fire hose" to Java events for S4 to use. Then a TopicExtractorPE pulls hash tags out of each tweet and creates one new event per hash tag. The TopicExtractorPE gives each event a "key" that matches the hash tag from the tweet. Next a TopicCountAndReportPE (one per hash tag) counts the number of times its hash tag has been seen, emitting a new event with the hash tag and the count. Lastly, a single TopNTopicPE consumes all hash tag counts and keeps a sorted list of the top 10.
Other typical uses for computing over continuous streams of data include: listening for stock trading signals, watching for fraud in transactions, and monitoring process logs to look for signs of trouble.
S4's sweet spot is in processing huge volumes of short-lived data where most of what the business wants is aggregation, not keeping every detail. The way S4 works, it keeps track of data locality and fault detection and lets the developer concentrate on only writing logic.
StreamInsight = CEP + LINQ
The interesting part is that CEP processing is written as a LINQ query. The LINQ query (which looks like SQL) 'observes' the stream of data passing by and 'yields' values whenever the query conditions are satisfied.
Also, because LINQ is a monad, you can compose various queries together to handle very complex scenarios.
We plan to test StreamInsight with low-level RFID events to generate meaningful business events. We'll see how goes it.
S4 is API driven (read - more work) and lacks some usability features - currently. But the distributed feature is the most appealing.
S4 is *not* Complex Event Processing
Re: S4 is *not* Complex Event Processing
Re: S4 is *not* Complex Event Processing
To compare classify S4 as a complex event processing solution is inaccurate. S4 is missing a number of key components required to be classified as such. Those components are 1) time or length based sliding windows, 2) pattern matching, 3) continuous query, and 4) a language specifically incorporating these aspects. I don't know how to best characterize S4, but it's certainly not CEP and bears very little resemblance to products from either Oracle or StreamBase.
I believe the author is stating that S4 is not an identical implementation as ESPER ,IBM InfoSphere Stream, StreamBase,etc.. but it is a solution that addresses the same business problems other stream computing platforms tackle..
Ronny Kohavi Dec 12, 2013
Christian Legnitto Dec 12, 2013