Eventually Consistent HTTP with Statebox and Riak
Bob Ippolito explains how to solve concurrent update conflicts with Statebox, an open source library for automatic conflict resolution, running on top of Riak.
The content has been bookmarked!
There was an error bookmarking this content! Please retry.
Posted by Bienvenido David III on Sep 26, 2011
Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, and is mostly written in Clojure.
Storm provides a set of general primitives for doing distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for "continuous computation", doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for "distributed RPC", running an expensive computation in parallel on the fly. According to Nathan Marz, the lead engineer of Storm:
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
The important properties of Storm are:
The Storm cluster is composed of a master node and worker nodes. The master node runs a daemon called "Nimbus" which is responsible for distributing code, assigning tasks, and checking for failures. Each worker node runs a daemon called the "Supervisor" which listens for work and starts and stops worker processes. Nimbus and Supervisor daemons are fail-fast and stateless, which makes them robust, and coordination between them is handled by Apache ZooKeeper.
Storm terminology includes Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies. Streams are the data being processed. Sprouts are the data source. Bolts process the data. Tasks are threads that run within a Spout or Bolt. Workers are the processes that run these threads. Stream Groupings specify what data a Bolt receives as input. Data can be randomly distributed (Shuffle), "sticky" by field value (Fields), broadcasted (All), always goes to a single task (Global), don't care (None), or determined by custom logic (Direct). Topology is the network of Spouts and Bolts nodes connected by Stream Groupings. These terms are described in more detail in the Storm Concepts page.
Systems comparable to Storm are Esper, Streambase, HStreaming and Yahoo S4. Among these, the closest comparable system is S4. The biggest difference between Storm and S4 is that Storm guarantees message processing. Some of these systems have a built-in data storage layer while Storm does not. You will need to use an external database like Cassandra or Riak with your Storm Topologies if you need persistence.
A good way to get started is to read the official Storm Tutorial at GitHub. If talks about the different Storm concepts and abstractions, and shows sample code so you can run a Storm Topology. During development, run Storm on local mode so you can develop and test topologies in process on your local machine. When ready, run Storm in remote mode and submit Topologies for execution on a cluster of machines. Maven users can use the storm dependency from the clojars.org repository at http://clojars.org/repo.
To run a Storm Cluster, you will need Apache Zookeeper, ØMQ, JZMQ, Java 6 and Python 2.6.6. ZooKeeper is used to manage the different components of the cluster, ØMQ is used as the internal messaging system and JZMQ is the Java Binding for ØMQ. There is also the storm-deploy sub project, which allows one click deployments of Storm clusters on AWS. Read Setting up a Storm cluster from the Storm Wiki for detailed instructions.
For more information about Storm, visit the official Storm Wiki. You can also join the Storm mailing list and the Storm IRC (#storm-user) at freenode.
Monitor your Production Java App - includes JMX! Low Overhead - Free download
Using Drools? See what you're missing! Get the Power of Drools with the Assurance of Red Hat
In today’s hyper-competitive world, later may be too late to adopt Agile development and this Roadmap for Success will help you get started. Download "Agile Development: A Manager's Roadmap for Success" now!
While the terminology is somewhat different, I'd say the overlap of features that Akka has compared to S4 and Storm is apparent. Akka is an actor framework for Scala framework, also aiming to event based processing.
Sounds very nice, but you're losing credibility by calling it real-time when it's not.
Real-time has a specific, well defined meaning in computer science en.wikipedia.org/wiki/Real-time_computing . Real-time does *not* mean anything that is not batch processing.
Whether Storm is similar to Spark? (Ref:www.ibm.com/developerworks/library/os-spark/)
3 years ago, at a security software company I developed a POC that pretty much what Storm does. The cool thing is, our realtime MapReduce can expand and contract adapting to the throughput and bandwidth. It has multi-Stream as its out put task execution are dynamic. And I am using nothing more than just GridGain. Since then I kept saying the guys at GridGain must not know what they sale. The tools is so cool and so useful that you can do beyond java framework for MapReduce. I can do dynamic push of catch and/or content to cluster; Control and monitor cluster health; Dynamic task execution and much more..... I used similar approach at an Ads company and the best I can sell is moving log data in realtime(not bad). And when I am at Google's satellite office, all the guys think I am nuts and BUZZ word..... I've been thinking and wanting to do something like this to scale Esper in a large scale and and in or near real-time for years. It is very difficult to do such when added on top message ordering, message correlation and window slicing.
Bob Ippolito explains how to solve concurrent update conflicts with Statebox, an open source library for automatic conflict resolution, running on top of Riak.
Erik Onnen attempts to demonstrate that Java is still the best programming language for the JVM if simplified idioms are used along with proper tooling.
Approaches to integrating data are changing with emergence of cloud computing.
Michele Ide-Smith presents the lessons learned in the process of introducing UX principles and techniques into a large organization through a series of small steps.
Dave Farley and Martin Thompson discuss solutions for doing low-latency high throughput transactions based on the Disruptor concurrency pattern.
Rajneesh Namta shares his thoughts, experiences, and some of the critical lessons learned while implementing software test automation on a recent Agile project.
Dale Schumacher presents several patterns of actor interaction that can be used in collaborative programs written in any language.
Rúnar Bjarnason discusses Scalaz, a Scala library of pure data structures, type classes, highly generalized functions, and concurrency abstractions to perform functional programming in Scala.
4 comments
Watch Thread Reply