Storm is a distributed, fault-tolerant, real-time computation system that was originally developed at BackType and later open sourced by Twitter. Storm Applied is a new book from Manning that aims to provide a practical guide to using Storm, both in a development and in a production setting.
Storm Applied illustrates how Storm can be used for several different use cases. The first four chapters introduce Storm’s core concepts of spouts, bolts, topologies, processing guarantees, etc. The book then moves on to describe what a cluster is in detail, including its components, its role in fault tolerance, and how to deploy a given topology on a cluster. Chapters 6 to 8 are especially valuable in a production setting and deal with tuning a Storm topology, metrics collection, resource contention, and debugging techniques. Finally, chapter 9 covers Trident, a higher-level abstraction built on top of Storm and aimed at making it easier to build and work with Storm topologies.
Most chapters in the book present a concrete use case that provides the context needed to make the discussion of Storm’s features tangible and easy to understand, along with a reference implementation in Java. The use cases allow the authors to gradually introduce new concepts and typical concerns in a stream processing system, such as handling unreliable sources, identifying and resolving bottlenecks through parallelization, reliability, and message replaying.
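To make those building blocks concrete, here is a minimal sketch of how a topology is wired together with Storm’s TopologyBuilder API. The SentenceSpout and SplitSentenceBolt classes are toy placeholders written for this illustration rather than code from the book, and the org.apache.storm package names are those of Apache Storm 1.0 and later (earlier releases used backtype.storm).

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SentenceTopology {

        // A spout is the source of a stream; this toy spout emits the same sentence forever.
        public static class SentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values("the cow jumped over the moon"));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sentence"));
            }
        }

        // A bolt consumes a stream and emits a new one; this one splits sentences into words.
        public static class SplitSentenceBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                for (String word : tuple.getStringByField("sentence").split(" ")) {
                    collector.emit(new Values(word));
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        public static void main(String[] args) throws Exception {
            // The topology graph: spout -> bolt, with parallelism hints and a stream grouping.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentence-spout", new SentenceSpout(), 1);
            builder.setBolt("split-bolt", new SplitSentenceBolt(), 2)
                   .shuffleGrouping("sentence-spout");

            // Run in-process for development; a production deployment would instead
            // go through StormSubmitter.submitTopology(...) against a cluster.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("sentence-topology", new Config(), builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }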
Overall, the book is very easy to read thanks to the gradual approach taken by its authors, who also provide many diagrams to help illustrate Storm’s fundamental concepts. No prior knowledge of Storm or stream processing is required, yet the book still succeeds in dealing with complex topics such as scaling, tuning, and debugging. The discussion stays strictly on topic, keeping the book concise.
InfoQ has spoken with two of the book’s authors, Sean T. Allen and Matthew Jankowski.
InfoQ: What was the main motivation that drove you to write this book?
We were pretty early adopters of Storm. We went from an organization that used Storm for a few non-essential processes to one where Storm was a large part of our infrastructure, driving several essential business processes. Throughout this transformation, we felt like we gained a lot of hard-earned knowledge in terms of running Storm in a “production” environment.
While there was a decent amount of online documentation for Storm, we didn’t feel like there was a lot of practical, experience-driven advice out there. The main motivation for writing this book was wanting to share our experience, hopefully allowing people to sidestep some of the pitfalls we encountered.
InfoQ: Who should read your book and what could they expect to learn from it?
We feel this book really caters to a wide audience, from the novice just getting started with Storm to the advanced Storm user looking for an answer to a specific question. Our initial audience for the book was more advanced users; however, when mapping out the chapters, we found that establishing a solid understanding of the core concepts in Storm was necessary. Because of this, we feel that someone with little to no experience with Storm can pick up the book and find value in the earlier chapters. The more advanced user should be able to find value in some of the later chapters. So really, anyone who is interested in big data technologies, and stream processing systems in particular, will find value in this book.
InfoQ: What has been the impact of Storm on the way highly scalable, batch or real-time stream processing systems are implemented today?
For the most part, we don’t think this is a particularly answerable question. There’s a lot of work going on in the “big data” space, and it’s hard to know what the impact of any particular project has been. Unless a project comes out explicitly name-checking Storm, like Heron and Hailstorm, it’s mostly guesswork. We would be comfortable saying that we think Storm played a large role in bringing stream processing to prominence in terms of developer mindshare.
InfoQ: What makes Storm the right choice for real-time stream processing?
There is no right choice for stream processing. As with anything engineering related, it’s a matter of tradeoffs. We think Storm is an excellent choice for people just dipping their toes into the stream processing waters. It provides a large number of features you are going to want (fault tolerance, reliability, visibility via the UI, etc.). As you use Storm, it will also get you thinking about what you need in a stream processor and what you don’t (where Storm does or doesn’t meet your needs, where it adds complexity you don’t need, etc.). Once you know those, you will be in a position to make an educated decision about sticking with Storm, using another processor, or writing your own.
InfoQ: What topics have you left out of the book due to considerations such as space, complexity, or the material being too advanced?
There are several topics that were left out due to space and time constraints. A couple of topics that come to mind are testing and using Storm with different languages.
We felt an entire chapter could have been dedicated to testing, but there were so many other topics we wanted to cover. We did our best to provide some examples of testing Storm topologies in the sample code that accompanies the book.
Being able to use Storm with different programming languages is a characteristic that appeals to many people. At the end of the day, we wanted the focus to be on Storm itself and not necessarily the language being used with Storm, so we went with Java. We actually use Scala for a majority of our own Storm topologies, so we definitely understood the disappointment when it was decided we would stick with a single programming language throughout the book.
InfoQ: How has Storm evolved since it was open-sourced in 2011?
The core concepts of Storm have remained fairly unchanged since the beginning, which has been quite nice. It has allowed us to continue to build topologies in a consistent manner. In terms of actual changes to Storm, some of the improvements that come to mind are a faster internal queuing system, improved metrics-collection capabilities, better visualizations in the Storm UI, and overall improved stability.
We think one thing that has definitely changed since Storm was open-sourced in 2011 is the general community support. Storm has a much larger community today than it did in the early days. That, and the number of technologies, projects, and frameworks being built that can be used with Storm, are what excite us the most.
InfoQ: In which areas do you think Storm should improve or offer additional features?
The areas we think Storm most needs to improve are chronicled in Twitter’s initial blog post and subsequent paper on Heron. We would draw the same conclusions that they did and say that addressing them within Storm would be a painful exercise. A number of the problems identified by Twitter in that paper are issues of scale, and most people won’t hit them. Most people aren’t going to be building a system where Storm’s latency issues matter, and if they are, another system would be a better starting point.
We’d say the #1 problem that everyone will face with Storm, and that was identified in the Heron paper, is debuggability. When you have multiple threads of execution in a JVM as part of a Storm topology and they are all writing to the same log file, it can be really hard to piece together what is going on. Twitter came up with a solution to that problem that they detail in the Heron paper (one thread per JVM). I think that only addresses part of the debuggability problems of Storm. It’s equally difficult to trace the movement of data from a single originating message through a topology and see each transformation along the way. End-to-end tracing as a built-in facility would be great. We’re not sure what that would look like, but it’s something we’d love to see.
InfoQ: A full chapter of your book deals with Trident. Could you briefly comment on a typical scenario where one would want to use Trident? When would it be advisable, on the other hand, to go Storm-only and build your own abstractions on top of it?
If you are performance sensitive, then you should immediately cross Trident off your list; you’ll get much better performance using basic Storm primitives. If you are looking to do “exactly once” processing, then Trident might be what you want. It doesn’t absolve you from understanding what “exactly once” means in a distributed system, but it can help get you started. Spark Streaming occupies a similar space to Trident, and if you are interested in doing micro-batch and batch processing, as opposed to micro-batch and stream processing, then you should be looking at Spark rather than Storm.
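As a rough point of comparison, below is a sketch in the style of the canonical Trident word-count example from the Storm documentation, showing the higher-level, micro-batch flavor of the API. FixedBatchSpout and MemoryMapState are in-memory testing utilities; real “exactly once” counting would require a transactional spout and a durable state backend instead.

    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.BaseFunction;
    import org.apache.storm.trident.operation.TridentCollector;
    import org.apache.storm.trident.operation.builtin.Count;
    import org.apache.storm.trident.testing.FixedBatchSpout;
    import org.apache.storm.trident.testing.MemoryMapState;
    import org.apache.storm.trident.tuple.TridentTuple;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TridentWordCount {

        // A Trident "function" applied to every tuple: splits a sentence into words.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) {
            // A test spout that replays a fixed set of sentences in small batches.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("the man went to the store"));
            spout.setCycle(true);

            // Trident works on micro-batches with declarative operations
            // (each, groupBy, persistentAggregate) instead of hand-wired bolts.
            TridentTopology topology = new TridentTopology();
            topology.newStream("sentences", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    // persistentAggregate stores running counts in state; with a
                    // transactional spout and durable state this is where Trident's
                    // "exactly once" guarantee comes from.
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                                         new Fields("count"));

            // topology.build() produces a StormTopology that is submitted like any other.
        }
    }

The plain-Storm equivalent would need hand-written bolts for the splitting, grouping, and counting, which is the convenience-versus-performance tradeoff described above.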
InfoQ: Finally, how do you think that Twitter’s adoption of Heron will affect Storm’s future?
Operating under the assumption that Heron is open sourced, Storm is going to lose some people to Heron. Heron was designed to address some shortcomings in Storm while still being API compatible, and that is going to make switching attractive to consider. That said, Storm is already a complex beast to wrap your head around, and Heron adds an additional layer of complexity by requiring Mesos. We could be wrong (as there is limited information to go on), but Heron appears to raise the bar above what Storm requires just to get running. It promises to make debugging easier, which might balance that out, but I doubt it. Requiring Mesos is a large hurdle for a lot of people.
Yahoo recently doubled down on their support for Storm, so while it’s easy to make Storm-to-Heron comparisons now and speculate about the future of Storm, I think it’s safe to say that we don’t really know right now what the pros and cons of each will be.
A lot of work is going into Storm now to make it “enterprise ready” in ways that weren’t mentioned at all in the Heron blog post and paper. I would guess that Storm is going to have a large lead there. In the end, it’s going to come down to which project is better run and can attract a better community. Open source projects usually live and die based on the community they can build up around them, which is a human rather than a technical concern.
Discount code ‘storm40iq’ for 40% off ‘Storm Applied’, valid for all formats at manning.com.
About the Book Authors
Sean T. Allen has worked in software for over 20 years. He currently works as VP of Architecture at Sendence where he plays David to the Big Data Goliath. You can follow Sean on Twitter at @SeanTAllen.
Matthew Jankowski has worked in software for 13 years. He currently works as a Lead Software Engineer for TheLadders. His passions include clean code, domain driven design, the various big data frameworks/tools, and continuously learning. You can follow Matt on Twitter at @mattjanks16.