Spark, Storm and Real Time Analytics
Big Data analytics has advanced rapidly in recent years as the amount of information has exploded. Hadoop is the platform of choice for Big Data analysis and computation. But as data volume, variety and velocity increase, Hadoop, being a batch processing framework, cannot meet the requirement for real-time analytics.
Databricks, the company behind Apache Spark, recently raised $14 million to accelerate development of Spark and Shark. Spark is an engine for large-scale data processing written in Scala, while Shark is a Hive-compatible variation of Spark.
Like Spark, Storm also aims to work around Hadoop’s batch nature by providing event processing and distributed computation capabilities. By designing a topology of transformations as a Directed Acyclic Graph, the architect can perform arbitrarily complex computations, one transformation at a time.
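To make the idea of a topology more concrete, here is a minimal sketch in plain Python. It is not the Storm API: the `Topology` class, its `add` and `emit` methods, and the node names are all hypothetical and exist only for illustration. Each node is a transformation that receives one value and returns a list of outputs, and every output is forwarded to the nodes registered downstream.

```python
from collections import defaultdict, deque


class Topology:
    """A toy DAG of transformations (illustrative only, not the Storm API)."""

    def __init__(self):
        self.steps = {}                       # node name -> transformation function
        self.downstream = defaultdict(list)   # node name -> names of consumer nodes

    def add(self, name, fn, after=()):
        # A node may only consume from nodes that already exist,
        # so the graph cannot contain a cycle.
        if name in self.steps:
            raise ValueError(f"duplicate node: {name}")
        for parent in after:
            if parent not in self.steps:
                raise ValueError(f"unknown upstream node: {parent}")
            self.downstream[parent].append(name)
        self.steps[name] = fn

    def emit(self, source, event):
        """Push one event into `source`; return the outputs of each leaf node."""
        results = defaultdict(list)
        queue = deque([(source, event)])
        while queue:
            name, value = queue.popleft()
            # Every transformation returns a list of zero or more outputs.
            for out in self.steps[name](value):
                consumers = self.downstream[name]
                if not consumers:
                    results[name].append(out)
                for consumer in consumers:
                    queue.append((consumer, out))
        return dict(results)


# Hypothetical topology: split a line, normalise words, then fan out.
topology = Topology()
topology.add("split", lambda line: line.split())
topology.add("lower", lambda word: [word.lower()], after=["split"])
topology.add("long_words", lambda w: [w] if len(w) > 5 else [], after=["lower"])
topology.add("lengths", lambda w: [len(w)], after=["lower"])

result = topology.emit("split", "Storm processes Events")
print(result)
```

The event flows through the graph one transformation at a time, and the two leaf nodes each see every normalised word. A real Storm topology distributes these nodes across a cluster; this sketch runs them in a single process.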
Nathan Marz, the creator of Storm, experienced the latency of batch processing first hand and came up with the lambda architecture paradigm to solve this fundamental architectural problem. The lambda architecture consists of a serving layer that is updated infrequently from the batch layer, and a speed layer that computes real-time analytics to compensate for the slow batch layer. Essentially, Hadoop computes analytics in batches, and between batch runs the speed layer incrementally updates the metrics by examining events in a streaming fashion.
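The interplay of the three layers can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the `LambdaViews` class and its method names are hypothetical, everything runs in a single process, and the "batch job" is a simple recount over an in-memory log where a real system would run a Hadoop job.

```python
from collections import Counter


class LambdaViews:
    """Toy page-view counter organised along the lambda architecture."""

    def __init__(self):
        self.master_log = []          # append-only record of every event
        self.batch_view = Counter()   # recomputed occasionally by the batch layer
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, page):
        # Every event is stored for the batch layer and also
        # counted immediately by the speed layer.
        self.master_log.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # The batch layer recomputes the view from the full log.
        # Assumption: no events arrive while the batch is running, so the
        # speed view can simply be discarded afterwards.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, page):
        # The serving side merges the stale batch view with the fresh speed view.
        return self.batch_view[page] + self.speed_view[page]


views = LambdaViews()
for page in ["home", "docs", "home"]:
    views.ingest(page)

before_batch = views.query("home")   # answered by the speed layer alone
views.run_batch()
views.ingest("home")                 # arrives after the batch run
after_batch = views.query("home")    # batch view plus speed view
print(before_batch, after_batch)
```

Queries stay accurate between batch runs because the speed layer covers exactly the events the batch view has not yet seen.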
Both Spark and Storm can operate in a Hadoop cluster and access Hadoop storage. Storm-YARN is Yahoo’s open source implementation of Storm and Hadoop convergence, while Spark provides native integration for Hadoop. In both cases, integration is achieved through YARN (NextGen MapReduce). Integrating real-time analytics with Hadoop-based systems allows better utilization of cluster resources through computational elasticity, and running in the same cluster keeps network transfers to a minimum.
In terms of commercial support, Cloudera has already announced support for Spark and included it in CDH (Cloudera’s Distribution Including Apache Hadoop). Hortonworks is planning to include Apache Storm in HDP (Hortonworks Data Platform) in the first half of 2014.
Juergen Hoeller Jul 22, 2014