Stream Processing and Lambda Architecture Challenges

Kartik Paramasivam, who works in streams infrastructure at LinkedIn, wrote two articles over the summer on why and how his company attempts to avoid Lambda architecture using Apache Samza for data processing.

Stream processing technologies are useful when it comes to getting quick results out of streams of data, but they may fall short for demanding use cases with high consistency and robustness requirements.

Lambda architecture has been a popular solution that combines batch and stream processing. The fundamental idea is to have two separate data lanes: a speed layer to provide low-latency results using stream processing technologies and a batch layer to run jobs to provide accurate results on bulk data. Lambda architecture is complex to implement as it relies on multiple technologies and implies the merge of results from both data lanes.
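The merge of the two lanes is where much of the complexity lives. As a rough sketch (hypothetical class and method names, not tied to any particular product), a serving layer can combine a periodically recomputed batch view with a per-event speed view:

```python
# Minimal sketch of a Lambda-style serving layer (hypothetical names).
# The batch layer periodically overwrites accurate totals; the speed
# layer accumulates low-latency increments for events the latest batch
# run hasn't covered yet. Queries merge both views.

class LambdaServingLayer:
    def __init__(self):
        self.batch_view = {}   # accurate, recomputed in bulk
        self.speed_view = {}   # approximate, updated per event

    def absorb_batch_run(self, totals):
        """A completed batch run replaces the batch view and resets the
        speed view, which only needs to cover events since that run."""
        self.batch_view = dict(totals)
        self.speed_view = {}

    def on_event(self, key, value):
        self.speed_view[key] = self.speed_view.get(key, 0) + value

    def query(self, key):
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)
```

The fragile part is the hand-off: absorbing a batch run must replace the batch view and reset the speed view together, or queries briefly double-count events that both layers have seen.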

At LinkedIn, Samza is used to perform the processing on data streamed from Apache Kafka. One of the problems described in the article is the late arrival of events. A RocksDB-based key value store was added to Samza to hold the input events for a longer time. On a late arrival, the framework will re-emit any result for the affected window-based computation, given there’s enough information locally to recompute it. A RocksDB based solution was found preferable since relying on an external store (e.g. NoSQL) would imply network and CPU overhead for communication and serialization.
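A minimal sketch of the idea (hypothetical names; this is not Samza's actual API): windowed results are kept recomputable from a local store, so a late event inside the retention horizon triggers a corrected re-emission for its window:

```python
# Hypothetical sketch of late-arrival handling with a local store.
# Events are retained per window in a local key-value map (standing in
# for a RocksDB-backed store); a late event within the retention
# horizon causes the affected window's result to be re-emitted.

class WindowedCounter:
    def __init__(self, window_ms, retention_ms):
        self.window_ms = window_ms
        self.retention_ms = retention_ms
        self.store = {}       # window_start -> retained events (local store)
        self.emitted = []     # (window_start, count) results, possibly re-emitted
        self.max_ts = 0       # high-water mark of observed event time

    def on_event(self, ts):
        self.max_ts = max(self.max_ts, ts)
        if ts < self.max_ts - self.retention_ms:
            return  # too late: no longer enough local data to recompute
        window = ts - ts % self.window_ms
        self.store.setdefault(window, []).append(ts)
        # emit (or re-emit) the corrected result for this window
        self.emitted.append((window, len(self.store[window])))
```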

Apache Flink, another stream processing framework, can compute over time windows based on timestamps (either event-specific or assigned at ingestion) and provide consistent results for out-of-order streams. It holds the data in memory and triggers window computation on the receipt of a watermark event. Watermarks represent a clock tick and give Flink a notion of time, which can be event-specific.
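The watermark mechanism can be sketched as follows (hypothetical names, not Flink's actual API): events are buffered per event-time window, and a watermark with timestamp W finalizes every window that ends at or before W, since the watermark asserts no older event will arrive:

```python
# Hypothetical sketch of watermark-driven event-time windowing.
# Events are buffered by window; a watermark with timestamp wm_ts
# asserts that no event older than wm_ts will arrive, so all windows
# ending at or before wm_ts can be finalized, even if their events
# arrived out of order.

class EventTimeWindows:
    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.buffers = {}   # window_start -> buffered events
        self.results = []   # finalized (window_start, count) pairs

    def on_event(self, ts):
        window = ts - ts % self.window_ms
        self.buffers.setdefault(window, []).append(ts)

    def on_watermark(self, wm_ts):
        closed = [w for w in self.buffers if w + self.window_ms <= wm_ts]
        for w in sorted(closed):
            self.results.append((w, len(self.buffers.pop(w))))
```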

Other problems such as the processing of duplicated messages due to at-least-once delivery guarantees have been addressed by most frameworks with internal checkpoint mechanisms.
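Checkpoints bound how much input is replayed after a failure, but replay still means redelivery; a common complementary technique, sketched here with hypothetical names, is to make processing idempotent by tracking message identifiers:

```python
# Hypothetical sketch of idempotent consumption under at-least-once
# delivery: each message carries a unique id, and ids already
# processed are skipped, so a redelivered duplicate has no effect.

class DedupingConsumer:
    def __init__(self):
        self.seen = set()
        self.total = 0

    def on_message(self, msg_id, value):
        if msg_id in self.seen:
            return False  # duplicate redelivery, ignore
        self.seen.add(msg_id)
        self.total += value
        return True
```

In practice the seen-id set must itself be bounded (e.g. by retention time) and persisted alongside the checkpoint, or a restart reintroduces the duplicates it was meant to suppress.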

The last set of stream processing challenges concerns the ability to experiment with data in an interactive fashion as it flows through the system. Agile experimentation is usually performed on batch systems using high-level languages such as SQL, available in commercial products such as Azure Stream Analytics.

Samza SQL is described by Julian Hyde as an effort that applies Apache Calcite to the Samza stream processor. Samza SQL is not production-ready yet. Instead, LinkedIn uses Hadoop's batch nature to perform iterative experimentation on offline datasets, some of them copied from the same live service databases that Samza-based stream processors query when handling a stream.

Flink is also working towards robust streaming SQL support. Flink 1.1, released in August 2016, supports filters and unions on streaming data. Flink also supports Complex Event Processing as a high-level way to describe how to react to sequences of events.


Community comments

  • What most stream processing frameworks ignore

    by peter lin,

    Stream processing is fundamentally tied to temporal logic, yet most frameworks ignore it. Many stream processing frameworks just provide sliding windows or sliding counts, but that tends to produce garbage for a couple of reasons. When new developers try to aggregate data and compute some value, they aren't taught how to model the problem. They just try something simple like calculating an average or a rolling sum. It's when you start trying to answer questions like "for what time period was this trend valid?" that the problems appear, and developers end up having to rewrite the application.

    The first step to building a good stream processing framework is understanding concepts like temporal data, temporal distance, temporal patterns, temporal entities and temporal databases. Without a solid understanding of these concepts and how they affect the design and implementation, you end up building a tool that is only good for a few use cases. There are literally 30 years of research in this field, but people aren't reading the literature.

  • Re: What most stream processing frameworks ignore

    by Ant hony,

    What literature would you recommend for developers that want to get into big data and stream processing, which also explains all the concepts you mention & provides a good foundation?

  • Re: What most stream processing frameworks ignore

    by peter lin,

    The first place to start is understanding what the terms mean on Wikipedia.

    Next, search acmqueue for papers on these topics. There are too many to list in a response. For temporal databases, timedb provides a decent introduction.

    Unlike data "at rest" in a traditional database, streaming data can be temporal while "in-flight" but non-temporal when persisted. Take for example sensor data from road traffic sensors. The threshold for detecting congestion or a potential accident varies depending on the road. This means a static or simplistic filter like "no movement on a street for 5 minutes indicates a traffic jam" can be wrong: it could be that there's a funeral procession and police officers are directing traffic. Capturing this kind of pattern requires support for first-order logic and temporal logic. When the sensor data is stored, it is temporal in nature. By that I mean the system has to store each data point with the time of the event. This way, the application can run a temporal query for a specific time interval. If the data were stored without the time element, it would be much harder to figure out what is happening.
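    That point about keeping the time element can be illustrated with a minimal sketch (hypothetical schema: readings stored as (timestamp, value) pairs), where a temporal query selects a specific interval:

```python
# Hypothetical sketch: sensor readings persisted with their time
# element as (timestamp, value) pairs, so the application can run a
# temporal query over a specific interval. Without the timestamps,
# this question could not be asked at all.

def readings_in_interval(readings, start_ts, end_ts):
    """Return the readings whose timestamp lies in [start_ts, end_ts)."""
    return [(ts, v) for ts, v in readings if start_ts <= ts < end_ts]
```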

    Another example is trying to detect trends in real time. Every piece of data has to have a timestamp, either generated at the source or assigned when it is received by the event processing engine. Related to all of this is the issue of totally ordered, partially ordered and unordered data. How a system reasons over the data differs dramatically depending on the ordering.
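    A toy sketch of the ordering point (hypothetical functions): the same events checked in arrival order versus sorted event-time order can disagree about whether a trend exists:

```python
# Hypothetical sketch: events are (timestamp, value) pairs. Reasoning
# in arrival order vs. event-time order gives different answers when
# the stream is out of order, which is why ordering assumptions matter.

def is_rising(events):
    """True if values are strictly increasing in the given order."""
    values = [v for _, v in events]
    return all(a < b for a, b in zip(values, values[1:]))

def is_rising_event_time(events):
    """Same check, but after sorting events by their timestamp."""
    return is_rising(sorted(events))
```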

  • Re: What most stream processing frameworks ignore

    by Vincent Ye,

    Could you give some references to that literature? Thank you!
