Twitter Storm: Open Source Real-time Hadoop
Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2. Storm is mostly written in Clojure.
Storm provides a set of general primitives for doing distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for "continuous computation", doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for "distributed RPC", running an expensive computation in parallel on the fly. According to Nathan Marz, the lead engineer of Storm:
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
The important properties of Storm are:
- Simple programming model. Similar to how MapReduce lowers the complexity of doing parallel batch processing, Storm lowers the complexity for doing real-time processing.
- Runs any programming language. You can use any programming language on top of Storm. Clojure, Java, Ruby, and Python are supported by default. Support for other languages can be added by implementing a simple Storm communication protocol.
- Fault-tolerant. Storm manages worker processes and node failures.
- Horizontally scalable. Computations are done in parallel using multiple threads, processes and servers.
- Guaranteed message processing. Storm guarantees that each message will be fully processed at least once. It takes care of replaying messages from the source when a task fails.
- Fast. The system is designed to process messages quickly, and it uses ØMQ as the underlying message queue.
- Local mode. Storm has a "local mode" where it simulates a Storm cluster completely in-process. This lets you develop and unit test topologies quickly.
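The "guaranteed message processing" property in the list above can be illustrated with a small, self-contained sketch. This is not Storm's implementation or API. It is a toy model in plain Python, assuming a source that re-queues any message whose handler raises an error. The names `process_at_least_once`, `FlakyHandler`, and `max_attempts` are hypothetical and exist only for this illustration.

```python
from collections import deque


def process_at_least_once(messages, handler, max_attempts=5):
    """Feed every message to handler, replaying it after a failure.

    Returns a dict mapping each message to the attempt number on which
    it finally succeeded. Toy model only, not Storm's mechanism.
    """
    pending = deque((msg, 1) for msg in messages)
    attempts = {}
    while pending:
        msg, attempt = pending.popleft()
        attempts[msg] = attempt
        try:
            handler(msg)
        except Exception:
            if attempt >= max_attempts:
                raise  # give up instead of looping forever
            # Replay: put the message back at the source for another try.
            pending.append((msg, attempt + 1))
    return attempts


class FlakyHandler:
    """Hypothetical handler that fails the first time it sees "b"."""

    def __init__(self):
        self.seen = []
        self.failed_once = False

    def __call__(self, msg):
        self.seen.append(msg)
        if msg == "b" and not self.failed_once:
            self.failed_once = True
            raise RuntimeError("simulated task failure")


handler = FlakyHandler()
attempts = process_at_least_once(["a", "b", "c"], handler)
```

Note that the handler sees "b" twice. That is the practical consequence of an at-least-once guarantee: a message may be processed more than once, so handlers that update a database generally need to tolerate duplicates.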
The Storm cluster is composed of a master node and worker nodes. The master node runs a daemon called "Nimbus" which is responsible for distributing code, assigning tasks, and checking for failures. Each worker node runs a daemon called the "Supervisor" which listens for work and starts and stops worker processes. Nimbus and Supervisor daemons are fail-fast and stateless, which makes them robust, and coordination between them is handled by Apache ZooKeeper.
Storm terminology includes Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies. Streams are the data being processed. Spouts are the data sources. Bolts process the data. Tasks are threads that run within a Spout or Bolt. Workers are the processes that run these threads. Stream Groupings specify what data a Bolt receives as input. Data can be randomly distributed (Shuffle), "sticky" by field value (Fields), broadcast to all tasks (All), always sent to a single task (Global), distributed with no stated preference (None), or routed by custom logic (Direct). A Topology is the network of Spout and Bolt nodes connected by Stream Groupings. These terms are described in more detail in the Storm Concepts page.
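To make the Stream Groupings more concrete, here is a rough sketch of how four of them could choose target tasks for a tuple. It is plain Python and does not use or reproduce Storm's API. The function names, the dict representation of a tuple, the CRC32 hash, and the choice of task 0 for the Global grouping are all assumptions made for illustration.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle: pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field):
    """Fields: hash one field so equal values always reach the same task."""
    key = str(tup[field]).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All: broadcast the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global: send everything to a single task (task 0 in this sketch)."""
    return [0]


# Hypothetical word-count style tuples.
tuples = [{"word": "storm"}, {"word": "hadoop"}, {"word": "storm"}]
targets = [fields_grouping(t, 4, "word") for t in tuples]
```

In this sketch the first and third tuples carry the same `word`, so the Fields grouping sends them to the same task. That property is what lets a Bolt keep a per-key count in local memory.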
Systems comparable to Storm are Esper, Streambase, HStreaming and Yahoo S4. Among these, the closest comparable system is S4. The biggest difference between Storm and S4 is that Storm guarantees message processing. Some of these systems have a built-in data storage layer while Storm does not. You will need to use an external database like Cassandra or Riak with your Storm Topologies if you need persistence.
A good way to get started is to read the official Storm Tutorial at GitHub. It covers the different Storm concepts and abstractions, and shows sample code so you can run a Storm Topology. During development, run Storm in local mode so you can develop and test topologies in-process on your local machine. When ready, run Storm in remote mode and submit Topologies for execution on a cluster of machines. Maven users can use the storm dependency from the clojars.org repository at http://clojars.org/repo.
To run a Storm Cluster, you will need Apache ZooKeeper, ØMQ, JZMQ, Java 6 and Python 2.6.6. ZooKeeper is used to manage the different components of the cluster, ØMQ is used as the internal messaging system, and JZMQ is the Java binding for ØMQ. There is also the storm-deploy subproject, which allows one-click deployments of Storm clusters on AWS. Read Setting up a Storm cluster from the Storm Wiki for detailed instructions.
Comparable to actor framework such as Akka?
It's not real-time
Real-time has a specific, well defined meaning in computer science en.wikipedia.org/wiki/Real-time_computing . Real-time does *not* mean anything that is not batch processing.
Is it different from Berkeley's Spark?
I don't understand the Guys@GridGain...