BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Twitter Storm: Open Source Real-time Hadoop

Twitter Storm: Open Source Real-time Hadoop

This item in japanese

Lire ce contenu en français

Bookmarks

Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, and is mostly written in Clojure.

Storm provides a set of general primitives for doing distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for "continuous computation", doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for "distributed RPC", running an expensive computation in parallel on the fly. According to Nathan Marz, the lead engineer of Storm:

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

 

The important properties of Storm are:

  1. Simple programming model. Similar to how MapReduce lowers the complexity of doing parallel batch processing, Storm lowers the complexity for doing real-time processing.
  2. Runs any programming language. You can use any programming language on top of Storm. Clojure, Java, Ruby, Python are supported by default. Support for other languages can be added by implementing a simple Storm communication protocol.
  3. Fault-tolerant. Storm manages worker processes and node failures.
  4. Horizontally scalable. Computations are done in parallel using multiple threads, processes and servers.
  5. Guaranteed message processing. Storm guarantees that each message will be fully processed at least once. It takes care of replaying messages from the source when a task fails.
  6. Fast. The system is designed so that messages are processed quickly and uses ØMQ as the underlying message queue.
  7. Local mode. Storm has a "local mode" where it simulates a Storm cluster completely in-process. This lets you develop and unit test topologies quickly.

 

The Storm cluster is composed of a master node and worker nodes. The master node runs a daemon called "Nimbus" which is responsible for distributing code, assigning tasks, and checking for failures. Each worker node runs a daemon called the "Supervisor" which listens for work and starts and stops worker processes. Nimbus and Supervisor daemons are fail-fast and stateless, which makes them robust, and coordination between them is handled by Apache ZooKeeper.

Storm terminology includes Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies. Streams are the data being processed. Sprouts are the data source. Bolts process the data. Tasks are threads that run within a Spout or Bolt. Workers are the processes that run these threads. Stream Groupings specify what data a Bolt receives as input. Data can be randomly distributed (Shuffle), "sticky" by field value (Fields), broadcasted (All), always goes to a single task (Global), don't care (None), or determined by custom logic (Direct). Topology is the network of Spouts and Bolts nodes connected by Stream Groupings. These terms are described in more detail in the Storm Concepts page.

Systems comparable to Storm are Esper, Streambase, HStreaming and Yahoo S4. Among these, the closest comparable system is S4. The biggest difference between Storm and S4 is that Storm guarantees message processing. Some of these systems have a built-in data storage layer while Storm does not. You will need to use an external database like Cassandra or Riak with your Storm Topologies if you need persistence.

A good way to get started is to read the official Storm Tutorial at GitHub. If talks about the different Storm concepts and abstractions, and shows sample code so you can run a Storm Topology. During development, run Storm on local mode so you can develop and test topologies in process on your local machine. When ready, run Storm in remote mode and submit Topologies for execution on a cluster of machines. Maven users can use the storm dependency from the clojars.org repository at http://clojars.org/repo.

To run a Storm Cluster, you will need Apache Zookeeper, ØMQ, JZMQ, Java 6 and Python 2.6.6. ZooKeeper is used to manage the different components of the cluster, ØMQ is used as the internal messaging system and JZMQ is the Java Binding for ØMQ. There is also the storm-deploy sub project, which allows one click deployments of Storm clusters on AWS. Read Setting up a Storm cluster from the Storm Wiki for detailed instructions.

For more information about Storm, visit the official Storm Wiki. You can also join the Storm mailing list and the Storm IRC (#storm-user) at freenode.

Rate this Article

Adoption
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

  • Comparable to actor framework such as Akka?

    by Rainer Guessner,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    While the terminology is somewhat different, I'd say the overlap of features that Akka has compared to S4 and Storm is apparent. Akka is an actor framework for Scala framework, also aiming to event based processing.

  • It's not real-time

    by Tim Fox,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Sounds very nice, but you're losing credibility by calling it real-time when it's not.

    Real-time has a specific, well defined meaning in computer science en.wikipedia.org/wiki/Real-time_computing . Real-time does *not* mean anything that is not batch processing.

  • Is it different from Berkeley's Spark?

    by BM Sundar,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Whether Storm is similar to Spark? (Ref:www.ibm.com/developerworks/library/os-spark/)

  • I don't understand the Guys@GridGain...

    by Danny Trieu,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    3 years ago, at a security software company I developed a POC that pretty much what Storm does. The cool thing is, our realtime MapReduce can expand and contract adapting to the throughput and bandwidth. It has multi-Stream as its out put task execution are dynamic. And I am using nothing more than just GridGain. Since then I kept saying the guys at GridGain must not know what they sale. The tools is so cool and so useful that you can do beyond java framework for MapReduce. I can do dynamic push of catch and/or content to cluster; Control and monitor cluster health; Dynamic task execution and much more..... I used similar approach at an Ads company and the best I can sell is moving log data in realtime(not bad). And when I am at Google's satellite office, all the guys think I am nuts and BUZZ word..... I've been thinking and wanting to do something like this to scale Esper in a large scale and and in or near real-time for years. It is very difficult to do such when added on top message ordering, message correlation and window slicing.

  • Event Processors & High-Velocity Data

    by Tim Thumb,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Storm and Yahoo S4 now have a competitor in Linkedin's Samza based on Kafka (CEP). Databases are also getting into the game with streaming persistence to handle high-velocity data more here scaledb.com/high-velocity-data.php

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

BT