
Stream Processing and Lambda Architecture Challenges

by Alexandre Rodrigues on Oct 19, 2016. Estimated reading time: 2 minutes


Kartik Paramasivam, who works in streams infrastructure at LinkedIn, wrote two articles over the summer on why and how his company attempts to avoid Lambda architecture using Apache Samza for data processing.

Stream processing technologies are useful when it comes to getting quick results out of streams of data, but they may fall short for demanding use cases with high consistency and robustness requirements.

Lambda architecture has been a popular solution that combines batch and stream processing. The fundamental idea is to have two separate data lanes: a speed layer to provide low-latency results using stream processing technologies and a batch layer to run jobs that provide accurate results on bulk data. Lambda architecture is complex to implement as it relies on multiple technologies and requires merging the results from both data lanes.
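The merge of the two data lanes can be sketched as follows. This is an illustrative toy, not any product's API: here the batch layer's per-window counts are authoritative wherever they exist, and the speed layer fills in the windows the batch job has not yet covered.

```python
# Toy Lambda-style serving layer: batch results override speed-layer
# estimates; the speed layer covers windows batch has not reached yet.

def merge_views(batch_view, speed_view):
    """Merge per-window counts, preferring the batch layer's results."""
    merged = dict(speed_view)   # start with low-latency estimates
    merged.update(batch_view)   # batch results win where present
    return merged

batch_view = {"09:00": 120, "09:05": 98}   # accurate, but lags behind
speed_view = {"09:05": 95, "09:10": 40}    # approximate, up to date
merged = merge_views(batch_view, speed_view)
# window 09:05 takes the batch value 98; 09:10 keeps the speed estimate
```

The complexity the article alludes to lives precisely in this merge step: in a real system the two views disagree, arrive on different schedules, and are stored in different technologies.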

At LinkedIn, Samza is used to process data streamed from Apache Kafka. One of the problems described in the articles is the late arrival of events. A RocksDB-based key-value store was added to Samza to hold input events for a longer time. On a late arrival, the framework re-emits the result for the affected window-based computation, given there's enough information locally to recompute it. The RocksDB-based solution was found preferable because relying on an external store (e.g. NoSQL) would imply network and CPU overhead for communication and serialization.
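The recomputation idea can be illustrated with a minimal sketch (this is not Samza's actual API; a plain dict stands in for the RocksDB-backed local store): raw events are retained per window, so a late arrival simply triggers a recompute and re-emit of that window's result.

```python
# Sketch: keep raw events per window in a local store so late arrivals
# can trigger recomputation of the affected window's result.

WINDOW = 60  # window size in seconds

store = {}   # stands in for the RocksDB-backed store: window start -> events

def on_event(ts, value, emit):
    window = ts - ts % WINDOW
    store.setdefault(window, []).append(value)
    # Re-emit the (possibly corrected) result for the affected window;
    # for a late event this overwrites the previously emitted result.
    emit(window, sum(store[window]))

results = {}
emit = lambda w, r: results.__setitem__(w, r)
on_event(10, 5, emit)
on_event(70, 2, emit)
on_event(30, 1, emit)   # late event for the first window -> re-emit
# results now holds {0: 6, 60: 2}
```

Keeping the events local is what makes the re-emit cheap: no network round-trip or serialization is needed to rebuild the window.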

Apache Flink, another stream processing framework, can compute over time windows based on timestamps, which can be event-specific or assigned at ingestion, and can provide consistent results for out-of-order streams. It holds the data in memory and triggers window computation on receipt of a watermark event. Watermarks represent a clock tick and give Flink a notion of time (which can be event-specific).
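A hedged sketch of the watermark mechanism described above (illustrative only, not Flink's API): events are buffered by event-time window, and a window is only computed and emitted once a watermark passes its end time, which is what tolerates out-of-order input.

```python
# Sketch of watermark-driven event-time windows: buffer by window,
# fire a window only when a watermark passes its end time.

WINDOW = 60
buffers, closed = {}, {}

def on_element(event_time, value):
    start = event_time - event_time % WINDOW
    buffers.setdefault(start, []).append(value)

def on_watermark(watermark):
    # Fire every window whose end time is at or before the watermark.
    for start in sorted(w for w in buffers if w + WINDOW <= watermark):
        closed[start] = sum(buffers.pop(start))

on_element(10, 3)
on_element(75, 4)
on_element(30, 2)   # out of order, but its window is still open
on_watermark(60)    # closes window [0, 60) with the complete result 5
# window [60, 120) stays buffered until a later watermark
```

The trade-off the watermark encodes is latency versus completeness: the later the watermark, the more out-of-order events are captured before the window fires.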

Other problems such as the processing of duplicated messages due to at-least-once delivery guarantees have been addressed by most frameworks with internal checkpoint mechanisms.
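One simple way such duplicate processing can be suppressed is by checkpointing the last processed position, so that messages replayed after a failure are recognized and skipped. The sketch below is illustrative (no framework's API is used); real checkpoint mechanisms additionally persist the offset atomically with the state.

```python
# Sketch: at-least-once delivery tamed by checkpointing the last
# processed offset, so replayed messages do not double-count.

checkpoint = {"offset": -1}
total = 0

def process(offset, value):
    global total
    if offset <= checkpoint["offset"]:
        return                       # duplicate from a replay: skip
    total += value
    checkpoint["offset"] = offset    # in practice, persisted atomically

for off, val in [(0, 1), (1, 2), (1, 2), (2, 3)]:  # offset 1 arrives twice
    process(off, val)
# total is 6, not 8: the redelivered message was ignored
```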

The last set of stream processing problems concerns the ability to experiment with data as it flows through the system, in an interactive fashion. Agile experimentation is usually performed on batch systems using high-level languages such as SQL, which is available on commercial products such as Azure Stream Analytics.

Samza SQL is described by Julian Hyde as an effort that applies Apache Calcite to the Samza stream processor. Samza SQL is not production-ready yet. Instead, LinkedIn uses Hadoop's batch nature to perform iterative experimentation on offline datasets, some copied from the same live services' databases that Samza-based stream processors query when handling a stream.

Flink is also working towards robust streaming SQL support. Flink 1.1, released in August 2016, supports filters and unions on streaming data. Flink also supports Complex Event Processing as a high-level description of how to react to sequences of events.


What most stream processing frameworks ignore by peter lin

Stream processing is fundamentally tied to temporal logic, yet most frameworks ignore it. Many stream processing frameworks just provide sliding window or sliding count, but that tends to result in garbage for a couple of reasons. When new developers try to aggregate data and compute some value, they aren't taught how to model the problem. They just try something simple like calculating the average or a rolling sum. It's when you start trying to answer questions like "for what time period was this trend valid?" that the problems appear and developers end up having to rewrite the application.

The first step to building a good stream processing framework is understanding concepts like temporal data, temporal distance, temporal patterns, temporal entities and temporal databases. Without a solid understanding of these concepts and how they affect the design and implementation, you end up building a tool that is only good for a few use cases. There's literally 30 years of research in this field, but people aren't reading the literature.

Re: What most stream processing frameworks ignore by Anthony

What literature would you recommend for developers that want to get into big data and stream processing, which also explains all the concepts you mention & provides a good foundation?

Re: What most stream processing frameworks ignore by peter lin

First place to start is understanding what the terms mean on Wikipedia.

en.wikipedia.org/wiki/Temporal_logic

Next, search acmqueue for papers on these topics. There are literally too many to list in a response. For temporal databases, timedb provides a decent introduction: www.timeconsult.com/TemporalData/TemporalData.html

Unlike data "at rest" in a traditional database, streaming data can be temporal while "in-flight" but non-temporal when persisted. Take for example sensor data from road traffic sensors. The threshold for detecting congestion or a potential accident varies depending on the road, so a static or simplistic filter like "no movement on a street for 5 minutes indicates a traffic jam" isn't reliable: it could be that there's a funeral procession and police officers are directing traffic. Capturing this kind of pattern requires support for first-order logic and temporal logic. When the sensor data is stored, it is temporal in nature. By that I mean the system has to store each data point with the time of the event. This way, the application can run a temporal query for a specific time interval. If the data were stored without the time element, it would be much harder to figure out what is happening.
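The point about storing each data point with its event time can be sketched in a few lines (the sensor names and values are made up for illustration): once readings carry timestamps, an interval query is trivial, and without them it would be impossible.

```python
# Illustrative only: timestamped readings make temporal queries easy.

readings = [
    (100, "sensor-A", 0.0),   # (event_time, sensor, speed)
    (160, "sensor-A", 0.0),
    (220, "sensor-A", 42.0),
]

def in_interval(events, start, end):
    """All readings whose event time falls in [start, end)."""
    return [e for e in events if start <= e[0] < end]

stalled = in_interval(readings, 100, 200)
# two zero-speed readings in the interval; whether that means a traffic
# jam still needs context (road type, time of day, nearby events)
```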

Another example is trying to detect trends in real time. Every piece of data has to have a timestamp, either generated at the source or assigned when it is received by the event processing engine. Related to all of this is the issue of totally ordered, partially ordered and unordered data. How a system reasons over the data differs dramatically depending on the ordering.

Re: What most stream processing frameworks ignore by Vincent Ye

Could you give some references to that literature? Thank you!
