Fabian Hueske on Apache Flink Framework

Apache Flink is a distributed data flow processing system for performing analytics on large data sets. It can be used for real time data streams as well as batch data processing. It supports APIs in Java and Scala programming languages.

Flink contains a memory management component, serialization framework, and type inference engine. It runs on YARN and HDFS and has a Hadoop compatibility package.

Fabian Hueske, PMC member of Apache Flink, spoke about the data processing framework at the recent ApacheCon Conference.

He discussed the programming model & APIs and how to use Flink API in applications, with code examples using DataSets and Transformations.

The Table API allows to evaluate SQL-like expressions on structured data streams or data sets and is tightly integrated with the Java and Scala APIs. It can be used in both batch and streaming programs.

Flink also supports graph analysis requirements. Gelly and Flink ML are libraries for analyzing large graphs and defining machine learning pipelines respectively. These libraries can be combined with Flink's other APIs.

InfoQ spoke with Fabian about the framework features and how they can help with big data processing and analytics requirements.

InfoQ: How is Apache Flink framework different from other Big Data frameworks from Apache community, like Hadoop and Storm?

Hueske: Apache Flink is a distributed data processor for large-scale data analysis. It provides high-level APIs in Scala and Java and has its own runtime that supports both stream processing and batch processing natively. Flink integrates nicely with Apache Hadoop. It supports HDFS as well as Hadoop Input- and OutputFormats to read and write various data formats from and to different data stores. Flink can run on Hadoop YARN clusters alongside MapReduce and other frameworks, which is actually very common in practice. Compared to Apache Storm, the stream analysis functionality of Flink offers a high-level API and uses a more light-weight fault tolerance strategy to provide exactly-once processing guarantees.

InfoQ: Flink API looks similar to what Apache Spark provides, in terms of Expressive Transformations and streaming operations. Why create another framework if Spark already provides the same capabilities for big data processing?

Hueske: Data flows are an intuitive programming abstraction and popular among many people. Consequently this concept has been adopted by many systems including Apache Spark, Apache Crunch, Cascading/Scalding, Microsoft LINQ, and as well Apache Flink. While Flink and Spark look quite similar from an API point of view, their back ends put emphasis on different aspects. Flink was initially designed as a hybrid of MapReduce and parallel relational database systems. This led to a number of design decisions including pipelined data transfers, active memory management, and RDBMS-style physical operators. Due to these concepts, Flink can process real-time data streams with low latency as well as heavy and complex batch programs using the same runtime. Given enough memory resources Flink efficiently operates in-memory but will gracefully de-stage data to disk if data volumes exceed the available memory budgets.

InfoQ: Flink's goal is to provide both Streaming and Batch data processing in one system. Apache Spark already does this with its Core and Streaming libraries. What are the advantages of using Flink over Spark or other big data processing frameworks?

Hueske: Due to its pipelining runtime, Flink can process streaming data in a similar way to Apache Storm. This is different from Spark in that data streams are not discretized into micro-batches. Put differently, Flink integrates batch and streaming by processing batch programs as special case of data streams (think finite streams) instead of treating data streams as a series of micro-batches. This has a few implications. First of all, Flink starts a data streaming program once and continuously processes data immediately as it arrives. Operators are scheduled just once and can maintain a state over the whole time when a stream is executed. Also, Flink features flexible window operators on real-time data streams. Consequently, Flink can address streaming and batch use-cases that other systems cannot easily support.

InfoQ: What type of monitoring capabilities are available when using Flink for data processing and analytics?

Hueske: Flink's web frontend shows detailed information about currently running jobs and the system in general. It also provides information that helps to debug the runtime behavior of jobs. The community is currently working on improving the web frontend to show more system- and job-related metrics.

InfoQ: What is the future roadmap of Flink in terms of 1.0 release as well as new features and enhancements?

Hueske: The Flink community aims to bundle a major release every 3 months and version 0.9 is just coming up. Although the community has not discussed a 1.0 release yet, I personally think that there will be a 1.0 release within year from now but this will be of course a community decision.

Fabian also mentioned that the Flink community just put out a preview release for the up-coming 0.9 major release. This release includes many improvements and new features, most importantly three new APIs and libraries. He also talked about the upcoming talks on Apache Flink at a few conferences this year including Strata in London, UK and HadoopSummit in San Jose, CA. In addition to that, there will be a dedicated Apache Flink conference called Flink Forward held in Berlin in October this year.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Enterprise Architecture topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter