Apache Flink 1.0.0 is Released
Apache Flink 1.0.0 was released recently. Flink is a platform for distributed stream and batch data processing. The 1.0.0 release guarantees backward compatibility with future 1.x.x releases and resolves about 450 JIRA issues, with contributions from 64 contributors. Beyond the fixes, it brings a variety of new user-facing features.
InfoQ caught up with Stephan Ewen, a project committer, about the 1.0.0 release, its new features, and what has happened with the project since the earlier InfoQ news item.
InfoQ: We talked to you previously just before the 0.9 release. Tell us about the journey since then and especially about new user-facing features in 1.0.0.
Stephan Ewen: Since the last time we spoke, Flink made quite a journey. The past releases had a strong focus on building up the stream processing capabilities. We reworked the DataStream API heavily since version 0.9. I personally subscribe to the vision that data streaming can subsume many of today’s batch applications, and Flink has added many features to make that possible.
Today, the DataStream API incorporates many advanced features, like support for Event Time, support for out-of-order streams, a very flexible and customizable windowing mechanism, and fault tolerant distributed state. Flink’s runtime can handle high throughput data streams while maintaining a low processing latency and provides exactly-once processing semantics with respect to state and windows. We also added a lot of operational features, like high availability (no single point of failure), and better monitoring (metrics, web dashboard). There is even a compatibility layer for Apache Storm to reuse code written as part of Storm topologies.
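The interplay of event time and windowing that Ewen mentions can be pictured with a toy example. The following is a plain-Java sketch of the idea only, not Flink's DataStream API; all names are invented for illustration. Events are assigned to tumbling windows by their event timestamp, so out-of-order arrival does not change the result.

```java
import java.util.*;

public class EventTimeWindows {
    // Assign each event (here just a timestamp) to a tumbling window of the
    // given size, keyed by the window's start time. Because grouping uses the
    // event timestamp rather than arrival order, out-of-order streams yield
    // the same counts -- the core idea of event time. Assumes timestamps >= 0.
    public static Map<Long, Integer> countPerWindow(List<Long> timestamps, long windowSize) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : timestamps) {
            long windowStart = (t / windowSize) * windowSize;
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Events arrive out of order: timestamps 12, 3, 7, 15; window size 10.
        List<Long> events = Arrays.asList(12L, 3L, 7L, 15L);
        System.out.println(countPerWindow(events, 10)); // {0=2, 10=2}
    }
}
```

A real Flink program would additionally need watermarks to decide when a window is complete, which this sketch glosses over.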
The most prominent user-facing features of the 1.0 release are probably the “savepoints”, the complex event processing (CEP) library, support for Kafka 0.9, out-of-core state, and better monitoring capabilities (for example, for checkpoints and backpressure). The CEP library allows users to define sequences of events and conditions to search for in a data stream. Savepoints snapshot the complete state of the stream processor at a certain point in time and allow you to resume the program (or another program) from that state. That is actually a very powerful mechanism: it can be used to roll back or re-run streaming computations, or to upgrade programs without losing any intermediate state.
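The kind of search a CEP pattern performs can be pictured with a heavily reduced plain-Java sketch. This is not Flink's CEP API (which is built around declarative pattern definitions); the names here are invented. The sketch checks whether a stream of events contains a given sequence, with other events allowed in between.

```java
import java.util.*;

public class SequenceMatch {
    // Return true if the events contain the pattern as an in-order
    // subsequence: each pattern element must occur after the previous match.
    // This mimics, in a very reduced form, what a CEP pattern searching for
    // "A, followed (possibly later) by B" does over a stream.
    public static boolean matches(List<String> events, List<String> pattern) {
        int next = 0;
        for (String e : events) {
            if (next < pattern.size() && e.equals(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList("login", "browse", "addToCart", "purchase");
        System.out.println(matches(stream, Arrays.asList("login", "purchase")));  // true
        System.out.println(matches(stream, Arrays.asList("purchase", "login")));  // false
    }
}
```

Real CEP patterns also carry per-event conditions and time constraints; this sketch only captures the sequencing aspect.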
Of course, other parts of Flink are making progress as well, including the DataSet (batch) API and the graph processing library “Gelly”.
InfoQ: Is 1.0.0 the first release to guarantee backwards compatibility? How do you plan on achieving this? How easy or difficult is it to migrate from previous versions of Apache Flink to 1.0.0?
Stephan Ewen: The backwards compatibility refers to the classes and methods that are part of the public stable API, and those classes are marked with a special Java Annotation “Public”. This stable API includes the majority of DataStream and DataSet API constructs. For future 1.x versions, these will stay the same.
There are also some parts that are marked as “Public Evolving”. For such features, the details may be subject to change in the next version based on feedback from the community.
Migrating DataStream programs from version 0.10 to 1.0 should be quite simple, because the big breaking changes to the DataStream API happened between versions 0.9 and 0.10. The DataSet API has actually been stable for several releases now.
InfoQ: We did talk about the similarities to Apache Spark at the outset of the previous interview. The streaming platform space seems to be very competitive, and there is no dearth of projects, it seems. Why should a developer use Apache Flink? What’s the sweet spot in comparison with others?
Stephan Ewen: Flink’s strength is in my opinion its combination of API, operational features, and performance that allows you to build streaming solutions that were not possible with other frameworks, or become significantly easier now:
The DataStream API is both concise and flexible, with support for features that are important for advanced applications, for example event time, custom windows, or state. Flink works on a proper streaming model, which gives it well-behaved latency and backpressure; it is fault tolerant with strong semantics (exactly once), and it integrates very well with Kafka and YARN.
Finally, Flink has quite sophisticated support for managing state, including the “savepoints” mentioned before. These savepoints can be used for recomputation, or for forking off different versions of the program (think A/B testing).
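The savepoint idea, snapshotting the state at some point and then resuming one or more runs from it, can be reduced to a few lines of plain Java. This is a hypothetical toy, not Flink code: the "processor" is just a running sum whose state can be copied and resumed.

```java
import java.util.*;

public class SavepointDemo {
    // A trivially simple "stream processor": a running sum whose state is a
    // single number that can be snapshotted and resumed.
    public static long process(long state, List<Long> events) {
        for (long e : events) state += e;
        return state;
    }

    public static void main(String[] args) {
        long state = process(0, Arrays.asList(1L, 2L, 3L)); // run up to a point
        long savepoint = state;                             // snapshot the state (= 6)

        // Fork two independent continuations ("A/B") from the same savepoint.
        long runA = process(savepoint, Arrays.asList(10L));
        long runB = process(savepoint, Arrays.asList(100L));
        System.out.println(runA + " " + runB); // 16 106
    }
}
```

In real Flink, the snapshot covers all operator state of a distributed job, but the resume-from-a-copy principle is the same.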
InfoQ: Performance is often a differentiator in this space. Can you characterize the performance of Apache Flink with respect to projects in the same space?
Stephan Ewen: Flink’s runtime by itself has been measured to have very competitive performance, both throughput- and latency-wise. In many real-world applications, however, the stream processor itself is not the performance bottleneck. The performance is often limited by the interaction with other services, for example with external databases.
Because of that, even more important, in my opinion, is the fact that Flink’s strong support for state and consistency allows users to build applications differently than before, circumventing these bottlenecks. A recent article, for example, showed a case that eliminated the need for a key/value store alongside the stream processor to store and expose realtime statistics. The performance gain, and the corresponding reduction in required resources, was immense.
InfoQ: How does Apache Flink compare with Apache Kafka for instance? What does the Kafka connector do?
Stephan Ewen: Flink and Kafka solve quite different problems and complement each other very well. Kafka offers the durable storage for streams, Flink the computation and analytics on top of these streams. Flink’s Kafka connector receives the event streams from Kafka brokers and tracks a bit of metadata to make sure that during any failure and recovery, the exactly-once semantics are preserved.
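The offset-tracking idea behind those exactly-once semantics can be simulated in a few lines of plain Java. This is a toy model, not the actual connector code: state and read position ("offset") are checkpointed together, and after a failure the consumer rewinds to the checkpointed offset, so every event is reflected in the state exactly once even though some events are read twice.

```java
public class OffsetReplay {
    // Simulate a consumer that checkpoints (offset, state) atomically and,
    // after a crash, resumes from the last checkpoint. Events between the
    // checkpoint and the crash are re-read, but because the state is rolled
    // back to the same point, each event counts exactly once in the result.
    public static long runWithCrash(long[] events, int checkpointAt, int crashAt) {
        long state = 0;
        long ckptState = 0;
        int ckptOffset = 0;

        int offset = 0;
        while (offset < events.length) {
            if (offset == crashAt && crashAt >= 0) {
                // Crash: discard in-flight state, rewind to the checkpoint.
                state = ckptState;
                offset = ckptOffset;
                crashAt = -1; // crash only once
                continue;
            }
            state += events[offset];
            offset++;
            if (offset == checkpointAt) { // checkpoint (offset, state) together
                ckptOffset = offset;
                ckptState = state;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        long[] events = {1, 2, 3, 4, 5};
        // Checkpoint after 2 events, crash after 4: the sum is still 15,
        // because events 3 and 4 are replayed against the rolled-back state.
        System.out.println(runWithCrash(events, 2, 4));  // 15
        System.out.println(runWithCrash(events, 2, -1)); // 15 (no crash)
    }
}
```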
InfoQ: Can you talk about state in the context of Apache Flink and how it uses RocksDB to store state? Why the choice of RocksDB?
Stephan Ewen: State in streaming programs is essential for all but the simplest applications. Time or count windows have state, as does the CEP library, which tracks, for example, the current position in a sequence pattern. Earlier stream processors, like Apache Storm, had to rely on external key/value stores for state. Flink maintains the state internally in the stream processor (in the worker processes) and checkpoints the state periodically for fault tolerance.
Flink has multiple so-called “state backends” that a user can choose from, which describe the data structures in which Flink keeps the state. One choice of state backend is RocksDB, which we chose because it is a solid and scalable on-disk (or on-flash) key/value index. It is a great choice for programs that keep very large state exceeding the memory of the worker machines. On the other hand, for applications with state that strictly fits into main memory, one can use the pure main-memory state backend.
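The notion of a pluggable state backend can be sketched as an interface with interchangeable implementations. This is hypothetical plain Java, not Flink's actual classes: the program sees one key/value interface, while the storage behind it can vary.

```java
import java.util.*;

public class StateBackends {
    // The idea behind pluggable state backends: the program works against one
    // key/value interface; what sits behind it (heap, disk, RocksDB) varies.
    public interface StateBackend {
        void put(String key, long value);
        long get(String key);
    }

    // In-memory backend: fast, but the state must fit on the heap.
    public static class MemoryBackend implements StateBackend {
        private final Map<String, Long> map = new HashMap<>();
        public void put(String key, long value) { map.put(key, value); }
        public long get(String key) { return map.getOrDefault(key, 0L); }
    }
    // A RocksDB-style backend would implement the same interface on top of an
    // on-disk key/value index, allowing state larger than main memory.

    public static void main(String[] args) {
        StateBackend state = new MemoryBackend(); // the user's choice of backend
        state.put("clicks", state.get("clicks") + 1);
        state.put("clicks", state.get("clicks") + 1);
        System.out.println(state.get("clicks")); // 2
    }
}
```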
InfoQ: Can you talk a little bit about unfinished work in 1.0.0 and a roadmap for the next few releases and what is in store for the project a year or two down the road?
Stephan Ewen: There are a lot of very exciting features on the roadmap and currently under development. The biggest API addition in the near future is probably going to be streaming SQL.
On the side of operational features, we are working on making the parallelism dynamic, meaning that streaming programs can scale in and out while running. Furthermore, integration with the Mesos resource management framework is ongoing, as well as performance improvements for checkpointing very large state. There is also constant work to add and improve connectors to other systems, as well as improving the monitoring.
Finally, people from the Flink community are also active in the Apache Beam (incubating) project. The Flink runner for Beam pipelines is currently the most feature-complete open source Beam runtime, and we are working with the Beam community to evolve both the Flink runner and the Beam programming model further.
Downloads for Apache Flink, built against the appropriate Hadoop versions, and other getting-started material are available in the docs.