Spark, Storm and Real Time Analytics

by Alex Giamas on Jan 31, 2014

Big Data analytics has advanced rapidly in recent years as the amount of available information has exploded. Hadoop is the de facto platform of choice for Big Data analysis and computation. However, as data Volume, Variety and Velocity increase, Hadoop, as a batch processing framework, cannot cope with the requirement for real-time analytics.

Databricks, the company behind Apache Spark, recently raised $14 million to accelerate development of Spark and Shark. Spark is an engine for large-scale data processing written in Scala, while Shark is a Hive-compatible SQL query engine that runs on top of Spark.

Like Spark, Storm also aims to work around Hadoop’s batch nature by providing event processing and distributed computation capabilities. By designing a topology of transformations as a Directed Acyclic Graph, the architect can perform arbitrarily complex computations, one transformation at a time.
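
To make the idea of a topology of transformations concrete, here is a minimal sketch in plain Python. It does not use the Storm API; the node names, the `topology` dictionary and the `run_topology` helper are all hypothetical and exist only for illustration. Each node is one transformation, and an event flows through the graph in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical topology: node name -> (transformation, list of upstream nodes).
# Nodes without upstream nodes receive the raw event.
topology = {
    "split": (lambda sentence: sentence.lower().split(), []),
    "hashtags": (lambda words: [w for w in words if w.startswith("#")], ["split"]),
    "length": (len, ["split"]),
    "summary": (lambda tags, n: {"tags": tags, "words": n}, ["hashtags", "length"]),
}


def run_topology(topology, event):
    """Push a single event through every node, upstream nodes first."""
    graph = {name: deps for name, (_, deps) in topology.items()}
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        func, deps = topology[name]
        inputs = [results[d] for d in deps] if deps else [event]
        results[name] = func(*inputs)
    return results


out = run_topology(topology, "Storm and #Spark meet #Hadoop")
print(out["summary"])  # {'tags': ['#spark', '#hadoop'], 'words': 5}
```

A real Storm topology distributes these nodes across a cluster and processes an unbounded stream of events; the sketch only shows how a complex computation can be composed one transformation at a time.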

Nathan Marz, the creator of Storm, experienced the limits of batch-only processing first hand and came up with the Lambda Architecture paradigm to solve this fundamental architectural problem. The Lambda Architecture consists of a batch layer, a serving layer that is updated infrequently from the batch layer, and a speed layer that computes real-time analytics to compensate for the slow batch layer. Essentially, Hadoop computes analytics in batches, and between batch runs the speed layer incrementally updates metrics by examining events in a streaming fashion.
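
The interplay of the layers can be sketched as a toy, single-process event counter. This is only an illustration under simplifying assumptions: the `LambdaCounter` class and its method names are hypothetical, the batch run is a full recomputation over an in-memory list, and no real Hadoop or Storm component is involved.

```python
from collections import Counter


class LambdaCounter:
    """Toy event counter with a batch view and a speed view (illustrative only)."""

    def __init__(self):
        self.master = []             # append-only master dataset of all events
        self.batch_view = Counter()  # precomputed by the slow, infrequent batch run
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """A new event is stored for the next batch run and counted immediately."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master)
        # Assumption: no events arrive while the batch runs, so everything
        # in the speed view is now covered by the batch view.
        self.speed_view.clear()

    def query(self, key):
        """Serving side: merge the batch result with the real-time delta."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for page in ["home", "home", "about"]:
    counter.ingest(page)
counter.run_batch()          # batch view now holds home=2, about=1
counter.ingest("home")       # arrives between batch runs
print(counter.query("home"))  # 3
```

The query stays up to date between batch runs because the speed view covers exactly the events the batch view has not yet seen. A production system would also have to handle events that arrive while a batch run is in progress.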

Both Spark and Storm can operate in a Hadoop cluster and access Hadoop storage. Storm-YARN is Yahoo’s open source implementation of Storm and Hadoop convergence, while Spark provides native integration with Hadoop. Integration with Hadoop is achieved through YARN (NextGen MapReduce). Integrating real-time analytics with Hadoop-based systems allows for better utilization of cluster resources through computational elasticity, and running in the same cluster keeps network transfers to a minimum.

In terms of commercial support, Cloudera has already announced support for Spark and included it in CDH (Cloudera’s Distribution Including Apache Hadoop). Hortonworks is planning to include Apache Storm in HDP (Hortonworks Data Platform) in the first half of 2014.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.