BT

Twitter Storm: Open Source Real-time Hadoop

by Bienvenido David on Sep 26, 2011 |

Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, and is mostly written in Clojure.

Storm provides a set of general primitives for doing distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for "continuous computation", doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for "distributed RPC", running an expensive computation in parallel on the fly. According to Nathan Marz, the lead engineer of Storm:

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

 

The important properties of Storm are:

  1. Simple programming model. Similar to how MapReduce lowers the complexity of doing parallel batch processing, Storm lowers the complexity for doing real-time processing.
  2. Runs any programming language. You can use any programming language on top of Storm. Clojure, Java, Ruby, Python are supported by default. Support for other languages can be added by implementing a simple Storm communication protocol.
  3. Fault-tolerant. Storm manages worker processes and node failures.
  4. Horizontally scalable. Computations are done in parallel using multiple threads, processes and servers.
  5. Guaranteed message processing. Storm guarantees that each message will be fully processed at least once. It takes care of replaying messages from the source when a task fails.
  6. Fast. The system is designed so that messages are processed quickly and uses ØMQ as the underlying message queue.
  7. Local mode. Storm has a "local mode" where it simulates a Storm cluster completely in-process. This lets you develop and unit test topologies quickly.

 

The Storm cluster is composed of a master node and worker nodes. The master node runs a daemon called "Nimbus" which is responsible for distributing code, assigning tasks, and checking for failures. Each worker node runs a daemon called the "Supervisor" which listens for work and starts and stops worker processes. Nimbus and Supervisor daemons are fail-fast and stateless, which makes them robust, and coordination between them is handled by Apache ZooKeeper.

Storm terminology includes Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies. Streams are the data being processed. Sprouts are the data source. Bolts process the data. Tasks are threads that run within a Spout or Bolt. Workers are the processes that run these threads. Stream Groupings specify what data a Bolt receives as input. Data can be randomly distributed (Shuffle), "sticky" by field value (Fields), broadcasted (All), always goes to a single task (Global), don't care (None), or determined by custom logic (Direct). Topology is the network of Spouts and Bolts nodes connected by Stream Groupings. These terms are described in more detail in the Storm Concepts page.

Systems comparable to Storm are Esper, Streambase, HStreaming and Yahoo S4. Among these, the closest comparable system is S4. The biggest difference between Storm and S4 is that Storm guarantees message processing. Some of these systems have a built-in data storage layer while Storm does not. You will need to use an external database like Cassandra or Riak with your Storm Topologies if you need persistence.

A good way to get started is to read the official Storm Tutorial at GitHub. If talks about the different Storm concepts and abstractions, and shows sample code so you can run a Storm Topology. During development, run Storm on local mode so you can develop and test topologies in process on your local machine. When ready, run Storm in remote mode and submit Topologies for execution on a cluster of machines. Maven users can use the storm dependency from the clojars.org repository at http://clojars.org/repo.

To run a Storm Cluster, you will need Apache Zookeeper, ØMQ, JZMQ, Java 6 and Python 2.6.6. ZooKeeper is used to manage the different components of the cluster, ØMQ is used as the internal messaging system and JZMQ is the Java Binding for ØMQ. There is also the storm-deploy sub project, which allows one click deployments of Storm clusters on AWS. Read Setting up a Storm cluster from the Storm Wiki for detailed instructions.

For more information about Storm, visit the official Storm Wiki. You can also join the Storm mailing list and the Storm IRC (#storm-user) at freenode.

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Comparable to actor framework such as Akka? by Rainer Guessner

While the terminology is somewhat different, I'd say the overlap of features that Akka has compared to S4 and Storm is apparent. Akka is an actor framework for Scala framework, also aiming to event based processing.

It's not real-time by Tim Fox

Sounds very nice, but you're losing credibility by calling it real-time when it's not.

Real-time has a specific, well defined meaning in computer science en.wikipedia.org/wiki/Real-time_computing . Real-time does *not* mean anything that is not batch processing.

Is it different from Berkeley's Spark? by BM Sundar

Whether Storm is similar to Spark? (Ref:www.ibm.com/developerworks/library/os-spark/)

I don't understand the Guys@GridGain... by Danny Trieu

3 years ago, at a security software company I developed a POC that pretty much what Storm does. The cool thing is, our realtime MapReduce can expand and contract adapting to the throughput and bandwidth. It has multi-Stream as its out put task execution are dynamic. And I am using nothing more than just GridGain. Since then I kept saying the guys at GridGain must not know what they sale. The tools is so cool and so useful that you can do beyond java framework for MapReduce. I can do dynamic push of catch and/or content to cluster; Control and monitor cluster health; Dynamic task execution and much more..... I used similar approach at an Ads company and the best I can sell is moving log data in realtime(not bad). And when I am at Google's satellite office, all the guys think I am nuts and BUZZ word..... I've been thinking and wanting to do something like this to scale Esper in a large scale and and in or near real-time for years. It is very difficult to do such when added on top message ordering, message correlation and window slicing.

Event Processors & High-Velocity Data by Tim Thumb

Storm and Yahoo S4 now have a competitor in Linkedin's Samza based on Kafka (CEP). Databases are also getting into the game with streaming persistence to handle high-velocity data more here scaledb.com/high-velocity-data.php

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

5 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT