InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

Twitter Storm: Open Source Real-time Hadoop

Posted by Bienvenido David III on Sep 26, 2011

Sections
Architecture & Design,
Development
Topics
Java ,
Languages ,
Websphere ,
IBM ,
Application Servers ,
Parallel Programming ,
Agile in the Enterprise ,
Companies ,
Programming ,
Agile ,
Twitter ,
Storm ,
Hadoop

Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, and is mostly written in Clojure.

Storm provides a set of general primitives for doing distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for "continuous computation", doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for "distributed RPC", running an expensive computation in parallel on the fly. According to Nathan Marz, the lead engineer of Storm:

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

The important properties of Storm are:

  1. Simple programming model. Similar to how MapReduce lowers the complexity of doing parallel batch processing, Storm lowers the complexity for doing real-time processing.
  2. Runs any programming language. You can use any programming language on top of Storm. Clojure, Java, Ruby, Python are supported by default. Support for other languages can be added by implementing a simple Storm communication protocol.
  3. Fault-tolerant. Storm manages worker processes and node failures.
  4. Horizontally scalable. Computations are done in parallel using multiple threads, processes and servers.
  5. Guaranteed message processing. Storm guarantees that each message will be fully processed at least once. It takes care of replaying messages from the source when a task fails.
  6. Fast. The system is designed so that messages are processed quickly and uses ØMQ as the underlying message queue.
  7. Local mode. Storm has a "local mode" where it simulates a Storm cluster completely in-process. This lets you develop and unit test topologies quickly.

The Storm cluster is composed of a master node and worker nodes. The master node runs a daemon called "Nimbus" which is responsible for distributing code, assigning tasks, and checking for failures. Each worker node runs a daemon called the "Supervisor" which listens for work and starts and stops worker processes. Nimbus and Supervisor daemons are fail-fast and stateless, which makes them robust, and coordination between them is handled by Apache ZooKeeper.

Storm terminology includes Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies. Streams are the data being processed. Sprouts are the data source. Bolts process the data. Tasks are threads that run within a Spout or Bolt. Workers are the processes that run these threads. Stream Groupings specify what data a Bolt receives as input. Data can be randomly distributed (Shuffle), "sticky" by field value (Fields), broadcasted (All), always goes to a single task (Global), don't care (None), or determined by custom logic (Direct). Topology is the network of Spouts and Bolts nodes connected by Stream Groupings. These terms are described in more detail in the Storm Concepts page.

Systems comparable to Storm are Esper, Streambase, HStreaming and Yahoo S4. Among these, the closest comparable system is S4. The biggest difference between Storm and S4 is that Storm guarantees message processing. Some of these systems have a built-in data storage layer while Storm does not. You will need to use an external database like Cassandra or Riak with your Storm Topologies if you need persistence.

A good way to get started is to read the official Storm Tutorial at GitHub. If talks about the different Storm concepts and abstractions, and shows sample code so you can run a Storm Topology. During development, run Storm on local mode so you can develop and test topologies in process on your local machine. When ready, run Storm in remote mode and submit Topologies for execution on a cluster of machines. Maven users can use the storm dependency from the clojars.org repository at http://clojars.org/repo.

To run a Storm Cluster, you will need Apache Zookeeper, ØMQ, JZMQ, Java 6 and Python 2.6.6. ZooKeeper is used to manage the different components of the cluster, ØMQ is used as the internal messaging system and JZMQ is the Java Binding for ØMQ. There is also the storm-deploy sub project, which allows one click deployments of Storm clusters on AWS. Read Setting up a Storm cluster from the Storm Wiki for detailed instructions.

For more information about Storm, visit the official Storm Wiki. You can also join the Storm mailing list and the Storm IRC (#storm-user) at freenode.

  • This article is part of a featured topic series on Agile
Comparable to actor framework such as Akka? by Rainer Guessner Posted
It's not real-time by Tim Fox Posted
Is it different from Berkeley's Spark? by BM Sundar Posted
I don't understand the Guys@GridGain... by Danny Trieu Posted
  1. Back to top

    Comparable to actor framework such as Akka?

    by Rainer Guessner

    While the terminology is somewhat different, I'd say the overlap of features that Akka has compared to S4 and Storm is apparent. Akka is an actor framework for Scala framework, also aiming to event based processing.

  2. Back to top

    It's not real-time

    by Tim Fox

    Sounds very nice, but you're losing credibility by calling it real-time when it's not.

    Real-time has a specific, well defined meaning in computer science en.wikipedia.org/wiki/Real-time_computing . Real-time does *not* mean anything that is not batch processing.

  3. Back to top

    Is it different from Berkeley's Spark?

    by BM Sundar

    Whether Storm is similar to Spark? (Ref:www.ibm.com/developerworks/library/os-spark/)

  4. Back to top

    I don't understand the Guys@GridGain...

    by Danny Trieu

    3 years ago, at a security software company I developed a POC that pretty much what Storm does. The cool thing is, our realtime MapReduce can expand and contract adapting to the throughput and bandwidth. It has multi-Stream as its out put task execution are dynamic. And I am using nothing more than just GridGain. Since then I kept saying the guys at GridGain must not know what they sale. The tools is so cool and so useful that you can do beyond java framework for MapReduce. I can do dynamic push of catch and/or content to cluster; Control and monitor cluster health; Dynamic task execution and much more..... I used similar approach at an Ads company and the best I can sell is moving log data in realtime(not bad). And when I am at Google's satellite office, all the guys think I am nuts and BUZZ word..... I've been thinking and wanting to do something like this to scale Esper in a large scale and and in or near real-time for years. It is very difficult to do such when added on top message ordering, message correlation and window slicing.

Educational Content

Eventually Consistent HTTP with Statebox and Riak

Bob Ippolito explains how to solve concurrent update conflicts with Statebox, an open source library for automatic conflict resolution, running on top of Riak.

Java.next

Erik Onnen attempts to demonstrate that Java is still the best programming language for the JVM if simplified idioms are used along with proper tooling.

Evolution in Data Integration From EII to Big Data

Approaches to integrating data are changing with emergence of cloud computing.

Winning Hearts and Minds: How to Embed UX from Scratch in a Large Organization

Michele Ide-Smith presents the lessons learned in the process of introducing UX principles and techniques into a large organization through a series of small steps.

LMAX Disruptor: 100K TPS at Less than 1ms Latency

Dave Farley and Martin Thompson discuss solutions for doing low-latency high throughput transactions based on the Disruptor concurrency pattern.

Thoughts on Test Automation in Agile

Rajneesh Namta shares his thoughts, experiences, and some of the critical lessons learned while implementing software test automation on a recent Agile project.

Actor Interaction Patterns

Dale Schumacher presents several patterns of actor interaction that can be used in collaborative programs written in any language.

Scalaz: Functional Programming in Scala

Rúnar Bjarnason discusses Scalaz, a Scala library of pure data structures, type classes, highly generalized functions, and concurrency abstractions to perform functional programming in Scala.