BT

Questions About the Lambda Architecture

by Seth Cousins on Sep 03, 2014 |

In a blog post suggesting limits to the usefulness and applicability of the Lambda Architecture, Jay Kreps argues that Lambda contains valuable ideas but that ultimately it is a temporary solution due to immature tools rather than the future of big data. He offers an alternative architecture based on his experience building Kafka and Samza at Linkedin which he claims offers similar performance characteristics with better development and operational characteristics.

The Lambda Architecture is an approach to large scale stream processing which mixes batch processing with realtime processing to achieve both scalability and fault tolerance. The pattern is typically implemented via MapReduce on Apache Hadoop and Apache Storm respectively. Nathan Marz designed it based on his experiences at Twitter and announced it in a blog post titled How To Beat the CAP Theorem. He has a web site and an upcoming book on the architecture.

Kreps acknowledges two important tenets of Lambda: that data inputs are immutable and that raw data inputs can be reprocessed to recompute outputs. Retaining raw inputs allows processing data even in ways that had not been considered originally as well as providing a recovery mechanism in the case that bad data is stored for any reason. Reprocessing is required when new output fields are required or when a bug in a previous version of the processing code has rendered the output incorrect.

At the same time Kreps believes that Lambda includes inherent development and operational complexity. Lambda requires that all algorithms are implemented twice, once for the batch system and once for the realtime system, and that queries stitch together the results from each system. Given the challenges with implementing complex algorithms correctly even once, the task of doing so twice and debugging the inevitable issues is clearly larger. On top of that, operating two distributed multi-node services will certainly be more complex than operating one.

Kreps summarizes his high level guidance:

These days, my advice is to use a batch processing framework like MapReduce if you aren't latency sensitive, and use a stream processing framework if you are, but not to try to do both at the same time unless you absolutely must.

So, why the excitement about the Lambda Architecture? I think the reason is because people increasingly need to build complex, low-latency processing systems. What they have at their disposal are two things that don't quite solve their problem: a scalable high-latency batch system that can process historical data and a low-latency stream processing system that can't reprocess results. By duct taping these two things together, they can actually build a working solution.

For use cases requiring both low latency and large historical data sets, Kreps suggests that a realtime steam processing system is sufficient. When reprocessing is required, increasing parallelism and replaying history very quickly is the solution. Linkedin currently uses this solution, implemented with Kafka and Samza. Kreps sees no reason why a similar approach would not work with Storm or other stream processing systems. He believes that the runtime efficiencies of these two architectures are roughly equal while the single system is significantly easier to develop against and to debug.

Community reaction to Kreps' post was almost universally positive with several people agreeing that it matched their experience. @jcsalterego wrote "very well written, thank you. much of it resonates with what i've seen, though on a much smaller scale than linkedin" and @jijoejv wrote "Nice write up. We've been using the "Kappa" arch at @Nextag for 1.5 years now, with Storm+Kafka (90d hist). Very easy to fix bugs." Nathan Marz does not appear to have responded yet.

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

The Past, Present and Future will be Simulated by William Louth

This is much ado about nothing really at least in terms of the evolution of system architectures. We need something that offers much more bang for the buck considering the investment any change incurs and that would be mirrored simulation.

I wrote some articles about this:

The Past, Present and Future will be Simulated
www.autoletics.com/posts/the-past-present-and-f...

Transcending Code, Containers and The Cloud
www.autoletics.com/posts/transcending-code-cont...

A Time Lord for the Java and JVM Universe
www.autoletics.com/posts/a-time-lord-for-the-ja...

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

1 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT