Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Articles Rethinking Flink’s APIs for a Unified Data Processing Framework

Rethinking Flink’s APIs for a Unified Data Processing Framework

Leia em Português

Key Takeaways

  • Within the Flink community, all data sources are considered naturally unbounded, and bounded data sources are a slice from the unbounded data.
  • Stream processing is best suited for processing data that changes rapidly, using application logic that stays unchanged. Conversely, batch processing is useful for running unchanged data through evolving queries and logic.
  • In Flink, there is a tradeoff between latency and completeness. With stream processing, developers consider tuning parameters between early results and completeness of the computation, while with batch-style processing tuning is unnecessary because all data is available at the beginning of the computation.
  • Flink’s current API structure includes the DataSet API (for batch style processing), the DataStream API (for real time processing) and the Table API/SQL for declarative-style programming. 
  • The framework’s roadmap incorporates changes to the APIs such as deprecating the DataSet API while enhancing the DataStream API to fully subsume batch processing-style use cases. 

Batch and Streaming: Two sides of the same coin

Since its very early days, Apache Flink has followed the philosophy of taking a unified approach to batch and streaming. The core building block is the "continuous processing of unbounded data streams, with batch as a special, bounded set of those streams."

The "streaming first, with batch as a special case of streaming" philosophy is supported by various projects (for example Flink, Beam, etc.) and has often been cited as a powerful way to build data applications that unify batch and stream data processing and help greatly reduce the complexity of data infrastructures.

With the recent donation of Blink—Alibaba’s internal fork of Apache Flink—to the community, Flink committers and contributors work towards making this "streaming first, with batch as a special case of streaming" philosophy a reality with Flink 2.0.  

Before diving into the nuts and bolts of how the existing APIs in Apache Flink might change to accommodate the unification of batch and streaming workloads, it is worth noting a few requirements that will ultimately impact how the system’s design will look as we move toward the next evolutionary step of Apache Flink.

Within the Flink community, we consider all data sources to be naturally unbounded, and bounded data sources are what you get when you take a slice out of that unbounded data.

This means that streams become the natural base primitive of sources while boundedness is one potential additional property of them. Following this paradigm, we can neatly cover both traditional batch workloads and streaming workloads with one abstraction. The following picture gives a good visual description of this model.

[Click on the image to enlarge it]

As simple as this idea might be, making you wonder why it is that we have separate batch and stream processing systems — including Flink, with its DataSet and DataStream APIs — we live in a complex world where requirements are vastly different when it comes to data processing. More specifically, between batch and streaming, there is a difference that relates to what changes faster: the data or the program. For stream processing-style use cases (such as data pipelines, anomaly detection, ML evaluation, or continuous applications) we need to process data very fast and produce results in real-time, while at the same time the application logic stays relatively unchanged or is rather long-lived. On the other hand, for most batch processing-style use cases like data exploration, ML training, and parameter tuning, the data changes relatively slowly compared to the fast-changing queries and logic.

Put another way, there is a tradeoff between latency and completeness. With stream processing use cases, data comes continuously from the infinite stream of events while we use processing-time timers and watermarks for tuning between early (but possibly incorrect) results and completeness. In batch-style use cases, however, we have all the input data at the beginning of the computation and we don’t need to use timers or watermarks for fine-tuning of the output characteristics. We can always produce a complete and correct result.

This difference in requirements is reflected in different execution styles that are illustrated in the following picture. On the right, stream-style processing scenarios have an "always-on" processing graph (sources, user-defined functions, sinks, etc.) with all parts running continuously and at the same time. On the left, for batch-style processing scenarios, the processing graph goes live in stages throughout the processing, allowing for better resource allocation in a given stage. For example, Flink can have specific sources running with the parallelism that they need to read data, which are then shut down as later stages of the computation come online, thus leading to better resource allocation.

[Click on the image to enlarge it]

A look inside Flink’s current API structure

Now that we have explained the differences in the requirements and optimization potential between batch and streaming, let’s take a closer look at Flink’s existing API structure and its future direction as we head towards a unified data processing framework.

As illustrated in the picture below, Flink’s existing API stack consists of the Runtime as the lowest level abstraction of the system that is responsible for deploying jobs and running Tasks on distributed machines. It provides fault-tolerance and network interconnection between the different Tasks in the JobGraph. On top of Flink’s Runtime sit two separate APIs, the DataSet and DataStream APIs. The DataSet API has its own DAG (directed acyclic graph) representation for tying together the operators of a job, as well as operator implementations for different types of user-defined functions. The DataStream API has a different DAG representation as well as its own set of operator implementations. Both types of operators are implemented on a disjointed set of Tasks which are given to the lower-level Runtime for execution. Finally, we have the Table API / SQL which supports declarative-style programming and comes with its own representation of logical operations and with two different translation paths for converting Table API programs to either the DataSet or DataStream API, depending on the use case and/or the type of sources that the program comes with.

[Click on the image to enlarge it]

Looking at the existing stack from system-design, code-cleanliness, and development perspectives, we notice a few areas of improvement:

  • The graph representation is separate for each API, resulting in code duplication.
  • There are multiple translation components between the different graphs which also results in code duplication.
  • Operator implementations are separate and incompatible, resulting in duplicate operator implementations which execute in different low-level Task implementations.
  • The Table API is translated into two distinct lower-level APIs, resulting in multiple translation stacks.
  • The different APIs currently have separate connectors resulting in duplication of maintenance work and different implementations for each API. (For example, the DataStream API has a Kafka connector while this is non-existent for the DataSet API. The DataSet API has an HBase connector that is currently missing from the DataStream API.)

Future Flink APIs for a unified batch and stream processing framework

Since Apache Flink views "Batch as a Special case of Streaming," the overall theme of Flink’s roadmap is evolving around enhancing the DataStream API so that it can fully subsume batch processing-style use cases. Once this is achieved, the DataSet API could then be deprecated. Below we describe a few ideas that the community is already discussing and has active improvement proposals and stories already being worked on.

Introduction of BoundedStream in the DataStream API

One of the enhancements to the DataStream API includes the introduction of BoundedStreams, a concept that will allow the DataStream API to harness the optimization potential and semantics of the DataSet API — that is, excluding processing-time timers and watermarks from any batch-style workloads and using stage-wise execution if possible for better resource allocation.  

A New Unified Source Interface

Another important step is the introduction of a unified source interface which subsumes the current streaming source interface (SourceFunction) and batch interface (InputFormat). This work is tracked as FLIP 27.

Enhancements to DataStream API’s graph representation

The graph representation of the API will be enhanced to support additional information about boundedness. Moreover, components such as translations, scheduling and deployment of the operators, as well as memory management and the network stack will be enhanced to support batch-style executions, including leg-wise scheduling for efficient execution of hash-join type use cases.

Enhancements to DataStream API’s translators and operators

The DataStream API will be further optimized with enhancements to its operators and translators, making the streaming execution Flink’s default processing operation. Meanwhile, batch-style executions will simply bring up additional optimization rules which enable feature parity in the framework. For example, for processing graphs that incorporate both bounded and unbounded streams of data, the StreamOperators can bootstrap the state of some operators and efficiently manage resources between the two streams, by first executing any bounded parts of the workload and then continuing with the unbounded streams. The stream operator API will be further enhanced to support selective reading and EndOfInput events for batch-style executions (FLINK 11875).  

Enhancements to the Table API / SQL

Since the Table API / SQL in Apache Flink is already unified for both batch and stream-style workloads, the development work of the community is primarily focused on adding a new Table API runner implementation based on work done at Alibaba (FLINK 11439). This is a new runtime for the Table API that makes use of the advanced capabilities of the newly improved Streaming API.

All the enhancements mentioned above will eventually guide the framework’s future stack to look like the picture below.

[Click on the image to enlarge it]

Flink’s Runtime is at the core of the framework along with an enhanced Stream Transformation DAG. Both the DataStream API and the Table API / SQL sit on top of those same abstractions, the first as a Physical Application API capable of handling both batch and stream-style executions, and the second as a more declarative API executing SQL for both bounded and unbounded data streams.

Stay tuned for more updates on Flink’s journey towards becoming a unified data processing framework by subscribing to the public Apache Flink mailing lists and following the FLIPs and stories mentioned in the article.

About the Author

Aljoscha Krettek is a PMC member at Apache Beam and Apache Flink, where he focuses predominantly on the Streaming API as well as the design and implementation of additions to Flink’s APIs. Aljoscha is a co-founder and software engineer at Ververica (formerly data Artisans). Previously, he worked at IBM Germany and the IBM Almaden Research Center in San Jose. Aljoscha has spoken at Hadoop Summit, Strata, Flink Forward, Big Data Madrid, and several Meetups about stream processing and Apache Flink. He studied computer science at TU Berlin.

Rate this Article