Spark, Storm and Real Time Analytics


Big Data analytics has advanced in recent years as the amount of information has exploded. Hadoop has become the platform of choice for Big Data analysis and computation, but as data volume, variety and velocity increase, Hadoop, as a batch processing framework, cannot cope with the requirement for real time analytics.

Databricks, the company behind Apache Spark, recently raised $14 million to accelerate development of Spark and Shark. Spark is an engine for large-scale data processing written in Scala, while Shark is a Hive-compatible variation of Spark.

Like Spark, Storm also aims to work around Hadoop’s batch nature by providing event processing and distributed computation capabilities. By designing a topology of transformations as a Directed Acyclic Graph, the architect can perform arbitrarily complex computations, one transformation at a time.
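
To make the idea of a topology concrete, here is a minimal sketch in plain Python. It does not use Storm's API; the `Topology` class, its `add` and `emit` methods, and the `split` and `count` steps are all hypothetical names chosen for illustration. It only shows how events can flow through a graph of transformations one event at a time, using a running word count as the example.

```python
from collections import defaultdict, deque


class Topology:
    """Toy DAG of transformations; a stand-in for the idea, not Storm's API."""

    def __init__(self):
        self.steps = {}                      # node name -> transformation function
        self.downstream = defaultdict(list)  # node name -> names of subscribed nodes

    def add(self, name, fn, after=()):
        # A node may only subscribe to nodes registered before it, and names
        # cannot be reused, so the graph stays acyclic by construction.
        if name in self.steps:
            raise ValueError(f"node already registered: {name}")
        for parent in after:
            if parent not in self.steps:
                raise ValueError(f"unknown upstream node: {parent}")
        self.steps[name] = fn
        for parent in after:
            self.downstream[parent].append(name)

    def emit(self, source, event):
        """Push one event into `source` and let it flow through the graph."""
        queue = deque([(source, event)])
        while queue:
            name, item = queue.popleft()
            # Each step returns zero or more events for its subscribers.
            for result in self.steps[name](item):
                for child in self.downstream[name]:
                    queue.append((child, result))


counts = defaultdict(int)  # state kept by the counting step


def split(line):
    # One line in, one event per word out.
    return line.split()


def count(word):
    # Update the running total as each word arrives.
    counts[word] += 1
    return [(word, counts[word])]


topology = Topology()
topology.add("split", split)
topology.add("count", count, after=["split"])

for line in ["spark storm", "storm hadoop storm"]:
    topology.emit("split", line)

print(dict(counts))  # {'spark': 1, 'storm': 3, 'hadoop': 1}
```

In a real deployment the nodes would run in parallel across a cluster and the events would come from a live source; the sketch keeps everything in one process to show only the shape of the computation.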

Nathan Marz, the creator of Storm, experienced this limitation first hand and came up with the lambda architecture paradigm to solve this fundamental architectural problem. Lambda architecture consists of a serving layer that is updated infrequently from the batch layer, and a speed layer that computes real time analytics to compensate for the slow batch layer. Essentially, Hadoop computes analytics in batches and, in between batch runs, the speed layer incrementally updates metrics by examining events in a streaming fashion.
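
The interplay of the three layers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a reference implementation: the `LambdaCounter` class and its method names are hypothetical, everything lives in memory, and the batch run is treated as instantaneous, whereas a real batch job would run on Hadoop while events keep arriving.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter split into batch, serving and speed layers."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # serving layer: replaced after each batch run
        self.speed_view = Counter()  # speed layer: events since the last batch run

    def on_event(self, page):
        # Every event is stored for the next batch run and counted immediately.
        self.master_log.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Recompute the view from scratch over the full log (the slow,
        # infrequent job), then drop the incremental counts it now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, page):
        # Queries merge the complete but stale view with the recent increments.
        return self.batch_view[page] + self.speed_view[page]


views = LambdaCounter()
views.on_event("/home")
views.on_event("/home")
views.run_batch()          # the batch view now holds 2 views of /home
views.on_event("/home")    # these arrive between batch runs
views.on_event("/about")
print(views.query("/home"), views.query("/about"))  # 3 1
```

The point of the split is that `query` stays up to date between batch runs, while any error accumulated in the speed layer is discarded the next time the batch layer recomputes from the full log.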

Both Spark and Storm can operate in a Hadoop cluster and access Hadoop storage. Storm-YARN is Yahoo’s open source implementation of Storm and Hadoop convergence, while Spark provides native integration for Hadoop. Integration with Hadoop is achieved through YARN (NextGen MapReduce). Integrating real time analytics with Hadoop based systems allows for better utilization of cluster resources through computational elasticity, and being in the same cluster means that network transfers can be minimal.

In terms of commercial support, Cloudera has already announced support for Spark and included it in CDH (Cloudera’s Distribution Including Apache Hadoop). Hortonworks is planning to include Apache Storm in HDP (Hortonworks Data Platform) in the first half of 2014.

Community comments

  • exciting prospect

    by Nitin Singh,

    The adoption of Spark by commercial big data platform providers is really good news for those already on a Scala/Spark learning journey.

  • Spark, Storm and Real Time Analytics

    by Sonam Gupta,

    Like Storm, Spark is designed for massive scalability, and the Spark team has documented users of the system running production clusters with thousands of nodes. In addition, Spark won the recent 2014 Daytona GraySort contest, turning in the best time for a workload consisting of sorting 100TB of data. The Spark team also documents Spark ETL operations with production workloads in the multiple Petabyte range. More at
