Hazelcast Introduces MapReduce API
An interview with Christoph Engelbert, Senior Software Engineer at Hazelcast Inc.
InfoQ: Can you give us a quick rundown on what Hazelcast is and what your company's business model is?
Christoph: Hazelcast is an open source in-memory data grid solution available under the Apache License 2.0. It implements a lot of the typical Java APIs, such as Map, List, Set, Queue, Lock, ExecutorService, etc., in a distributed manner, and adds features for partitioned cluster environments, for example distributed queries (Predicate) or executing a Runnable/Callable on a specific node.
Additionally, there is Hazelcast the company, which develops, distributes and supports the open source project. We offer commercial support, training and help with everything you need in mission-critical environments. Further, we provide commercial extensions that include solutions for monitoring and management of the cluster, higher predictability in terms of performance, cluster security, and native API clients for other languages such as C++ and C#. While Hazelcast is entirely written in Java, we provide polyglot language support through a memcached-compliant API as well as a RESTful interface.
InfoQ: In which applications, verticals or open source projects is Hazelcast used and also, how do developers use it, typically?
Christoph: Hazelcast is taking off in the financial domain, for low-latency trading applications, risk, financial exchange and other comparable applications. It's also being used by big telcos, network equipment makers and cloud providers. We're starting to see uptake in areas like Internet and mobile payments, gaming and gambling, travel and hospitality, and also eCommerce. Most of our use cases are caching or application scaling. Further, a lot of companies and projects build their own solutions on top of Hazelcast, such as OrientDB, Vert.x, MuleSoft, WSO2 or Apache Shiro.
InfoQ: You've recently released a MapReduce API and I take it you're the main developer. What was the motivation for this?
Christoph: I started CastMapR as a research project. I wanted to get into the new Hazelcast 3 API, and since I had recently been working with another MapReduce API (I was using a JBoss cluster at the time), it seemed like a good fit. Then, when I joined Hazelcast at the end of 2013, we started a discussion about making it part of the main Hazelcast distribution.
InfoQ: So you ported CastMapR into Hazelcast?
Christoph: Kind of, yes. The first idea was to just move the codebase into the core distribution, but over time we figured we wanted to go for a more reactive programming style, so I rewrote most of the internals. We also discussed the exposed API a lot. CastMapR was mostly inspired by Infinispan's API, simply because I liked it. For the new MapReduce API we decided on a more Hadoop-like API (closer to the original Google paper), but I stuck with the DSL way of defining jobs. Eventually, only small pieces of the old implementation were reused. The new implementation is fully concurrent in design: the mapping and reducing phases run fully in parallel, and the whole system works in a streaming fashion (based on chunked processing). Therefore, the old implementation is now discontinued and full effort goes into the Hazelcast-internal MapReduce API.
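To make the DSL style of defining jobs concrete, here is a self-contained plain-Java sketch. These are simplified stand-in types invented for illustration, not Hazelcast's actual classes (whose names, generics and execution model differ); the point is only the fluent chaining of a mapper and a reducer onto a job:

```java
import java.util.*;
import java.util.function.BiConsumer;

public class MapReduceDslSketch {

    // Hypothetical, simplified stand-ins for real MapReduce building blocks.
    interface Mapper<KIn, VIn, KOut, VOut> {
        void map(KIn key, VIn value, BiConsumer<KOut, VOut> emitter);
    }

    interface Reducer<V> {
        V reduce(List<V> values);
    }

    // A toy "job" over an in-memory map; runs sequentially in-process.
    static class Job<KIn, VIn> {
        private final Map<KIn, VIn> source;
        Job(Map<KIn, VIn> source) { this.source = source; }

        <KOut, VOut> MappedJob<KIn, VIn, KOut, VOut> mapper(Mapper<KIn, VIn, KOut, VOut> m) {
            return new MappedJob<>(source, m);
        }
    }

    static class MappedJob<KIn, VIn, KOut, VOut> {
        private final Map<KIn, VIn> source;
        private final Mapper<KIn, VIn, KOut, VOut> mapper;
        MappedJob(Map<KIn, VIn> source, Mapper<KIn, VIn, KOut, VOut> mapper) {
            this.source = source;
            this.mapper = mapper;
        }

        Map<KOut, VOut> reducer(Reducer<VOut> reducer) {
            // "Shuffle": group emitted values by their output key.
            Map<KOut, List<VOut>> grouped = new HashMap<>();
            source.forEach((k, v) -> mapper.map(k, v,
                (ko, vo) -> grouped.computeIfAbsent(ko, x -> new ArrayList<>()).add(vo)));
            // Reduce each key's value list to a single result.
            Map<KOut, VOut> result = new HashMap<>();
            grouped.forEach((k, vs) -> result.put(k, reducer.reduce(vs)));
            return result;
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> articles = new HashMap<>();
        articles.put(1, "hazelcast is fast");
        articles.put(2, "hazelcast is distributed");

        // DSL-style chaining: source -> mapper -> reducer.
        Map<String, Integer> counts = new Job<>(articles)
            .mapper((Integer id, String text, BiConsumer<String, Integer> emit) -> {
                for (String word : text.split(" ")) emit.accept(word, 1);
            })
            .reducer(values -> values.stream().mapToInt(Integer::intValue).sum());

        System.out.println(counts.get("hazelcast")); // prints 2
    }
}
```

In the real API the job runs distributed across the cluster and returns asynchronously, but the shape of the definition, chaining the phases fluently rather than wiring them through configuration, is what the DSL approach refers to.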
InfoQ: OK, great. Now, what are typical (envisioned) use cases for Hazelcast’s MapReduce API and how does it compare to, say, MongoDB’s MapReduce API or Hadoop?
Christoph: Typical scenarios for the Hazelcast MapReduce API are distributed computations where the EntryProcessor is not a good fit: either you want to transform data, or you want to utilize multiple data sources. It is also a good fit for long-running operations; since all of the current operations work directly on partition threads, you do not have to do explicit locking for data changes. In one of the next versions I will add continuous map-reduce support, so you can have a fully streaming analysis running. The best example for this is always Twitter, which processes tweets in realtime to collect information like retweets, favorites and a lot of other statistics. This is also useful for risk management and analysis.
The biggest difference from Hadoop is the in-memory, realtime processing. In Hadoop you have distinct phases, each executed one after the other, whereas in Hazelcast you get full performance thanks to the internal concurrent design, where mapping and reducing run in parallel on all nodes. The phases themselves are pretty similar to what you find in Hadoop: there are mapping (and combining), shuffling (partitioning to the nodes) and reducing phases, but they are not as clearly separated as in Hadoop.
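The phase pipeline described above can be sketched in plain Java with no Hazelcast dependency; all names here are illustrative. Each "partition" is mapped and combined (locally pre-aggregated) on its own thread, and the combined chunks are then merged by key in a final reduce:

```java
import java.util.*;
import java.util.concurrent.*;

public class PhasesSketch {

    // Map + combine one partition: tokenize and pre-aggregate locally.
    // The combiner cuts down how much data must be shuffled between nodes.
    static Map<String, Long> mapAndCombine(List<String> partition) {
        Map<String, Long> local = new HashMap<>();
        for (String line : partition)
            for (String word : line.split("\\s+"))
                local.merge(word, 1L, Long::sum); // combiner: local pre-sum
        return local;
    }

    static Map<String, Long> wordCount(List<List<String>> partitions) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            // Mapping/combining phase: all partitions in parallel.
            List<Future<Map<String, Long>>> futures = new ArrayList<>();
            for (List<String> p : partitions)
                futures.add(pool.submit(() -> mapAndCombine(p)));

            // Shuffle + reduce phase: merge the combined chunks by key.
            Map<String, Long> result = new HashMap<>();
            for (Future<Map<String, Long>> f : futures)
                f.get().forEach((k, v) -> result.merge(k, v, Long::sum));
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("to be or not to be"),
            List.of("be quick"));
        System.out.println(wordCount(partitions).get("be")); // prints 3
    }
}
```

This sequential merge at the end is where the sketch diverges from Hazelcast's design: there, reducing also runs concurrently with mapping in a streaming, chunked fashion instead of waiting for all mappers to finish.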
A comparison with MongoDB is hard since I have never used their MapReduce API. It seems to lack combiners, which are very helpful with huge datasets, but as I said, I'm not familiar with their implementation.
InfoQ: Cool, one last question: is there anything else you want to share with our readers about the Hazelcast MapReduce API?
Christoph: Yeah, I have a personal request: I'd like people to test-drive the API and provide us with as much feedback as possible. The API is fully stable and I'm absolutely happy with it, but I also want to learn about real-world users' experiences to find more areas to tweak, since I'm sure we can still improve it.
Thank you very much for taking the time to do this interview, Christoph!