Hazelcast Release Jet, Open-Source Stream Processing Engine
Hazelcast is primarily known for its open-source in-memory data grid (usually referred to as Hazelcast IMDG, or just Hazelcast). However, over the last two years, they have been working on a major new open-source project, called Hazelcast Jet, and this week have announced the release of this new technology.
InfoQ caught up with Greg Luck, CEO of Hazelcast and Marko Topolnik, Jet Core Team engineer for a closer look at what makes this new release exciting.
InfoQ: Jet is presented as a significant improvement in the stream-based processing space. Can you explain why?
Luck: The real aim of Jet is to enable fast big data as just another part of application infrastructure. Technologies like Spark and Hadoop can intrude too far into the application developers architecture and thinking. We want Jet to give developers the tools to focus on their problem, not the application plumbing.
Jet provides breakthrough performance as well - we have apples-to-apples performance numbers for Jet and other engines - all based on 10 node clusters with rotating disks - and the performance numbers speak for themselves.
InfoQ: Can you speak to the architectural and technical decisions you made when producing Jet? What makes it different from some of the incumbent players already in this market - especially Apache Spark?
Topolnik: We decided to model our core execution engine on the principle of cooperative multithreading, often referred to as “green threads”. This means that, instead of letting the OS schedule our work, we run everything on just as many threads as there are available CPU cores. Our basic processing unit, which we call a tasklet, cooperates with the execution engine by doing a little work each time it is called, then yielding back. Since we use bounded concurrent queues with smart batching, this work pattern comes naturally: each tasklet call processes the data that’s ready in the queue.
Why do we think this is a good approach? Firstly, our cost of context switching is virtually zero. There is hardly any logic needed to hand over from one tasklet to the next. Second, we get the effect of CPU core affinity: tasklets don't jump between threads and each worker thread is highly likely to remain pinned to a core. This means high CPU cache hit rate. Finally, we have immediate insight into which tasklet is ready to run by inspecting its input/output queues. If we used native threads, we would have to use blocking queues with comparatively heavyweight wait/notify mechanisms and we’d be at the mercy of the OS to run our task as soon as it can make progress.
A second important decision was to use single-producer/single-consumer concurrent queues everywhere. To connect N upstream tasklets with M downstream tasklets we need NxM queues; however, it allows us to use extremely fast wait-free algorithms on both ends. We don’t even need volatile writes because we can use lazySet which just enqueues the item on the CPU’s store buffer. On the consumer side we need just one lazySet after we drained the whole queue into thread-local storage.
Luck: These type of ideas are directly influenced by Martin Thompson and the Mechanical Sympathy movement he founded, of course.
InfoQ: Hazelcast IMDG already had a fairly clean & straightforward approach to partitioning. How does Jet improve on that? What sort of use cases will see significant improvement over the simple "send a Runnable to a particular data partition"?
Topolnik: Sending a runnable to a partition is analogous to the work of a single DAG vertex. The advantage of Jet comes from the ability to have the vertex transform the data it reads, producing items which no longer belong to the same partition, then reshuffle them while sending to the downstream vertex so they are again correctly partitioned. This is essential for any kind of map-reduce operation where the reducing unit must observe all the data items with the same key. To minimize network traffic, Jet can first reduce the data slice produced on the local member, then send only one item per key to the remote member that combines the partial results.
Luck: We also have a distributed version of java.util.stream, and that plays well with Jets architecture because we have sources and sinks as a core part of Jet's architecture. We'll also be adding Map-with-Predicate as a source in a future version, to allow filtering and field projection to act as a source of streaming data for Jet.
InfoQ: Are there any particular industries or use cases where you think Jet will have particular impact / success?
Luck: We think Jet will be of great benefit to IoT, financial services, payments processing, fraud and any industry that's making heavy use of CEP (Complex Event Processing). We think the key thing about Jet is how useful it is when you need to perform a DAG calculation in an operational context, rather than as a piece of analytics.
InfoQ: Will you be following the same sort of product strategy for Jet as you have followed for IMDG? I.e. OSS edition, with support and premium features available for a fee?
Luck: Not necessarily. From today (Feb 8th) we'll be offering professional support for Jet as well as IMDG. Jet will work very well with IMDG, so we anticipate some increased use of IMDG as a result of the Jet launch, but no aspects of Jet are closed source and there's nothing on the horizon that is. We might add management monitoring as a paid feature later in the year, as that's kind of an obvious angle, but nothing's decided yet.
We're not focused on monetising Jet at the moment - we just want it out there as a successful open-source project with an Apache 2 license.
InfoQ: What does the roadmap for Jet look like?
Luck: 0.3 is out now, and then we're aiming for a monthly cadence after that.
We're also planning a 0.3.1 in two weeks time - just to tidy up a couple of bits that missed the 0.3 release. In particular, 0.3.1 will come with IMDG 3.8 and also pick up elastic scaling of Jet clusters (even with jobs that are already inflight).
The 0.4 release should contain a lot of performance work. Even though Jet's performance is already outstanding, we'll be making some further improvements in 0.4. We'll also add JCache support and IMDG listeners as a true stream source. In the current release IMDG is supported, but as a batch source, so we want to add in the true streaming support as well.
We have support for Kafka and HDFS in already, but they need some performance work and additional documentation to bring them to first-class supported status.
We have some other features as well, including a DAG visualisation tool, which we're hoping to release as an Eclipse and IntelliJ plugin.
We want to get Jet out in front of our community and then listen to what they're saying - so once Jet's established, the roadmap will be very community-driven.