Google Open Sources Cloud Dataflow Java SDK

Earlier this year, Google announced Cloud Dataflow, a service and SDK for processing large amounts of data in batches or in real time. Google has now open sourced the Dataflow Java SDK, enabling developers to see how it works and possibly use the SDK for services running on-premises or in other clouds.

Dataflow is a cloud service built on technology that evolved from FlumeJava (a Java library for creating data-parallel pipelines) and MillWheel (a framework for building fault-tolerant stream processing applications), both used internally by “hundreds of developers” at Google. The service is language agnostic, but Google is providing a Java SDK to make it easier to create applications for it.

The key concept in Dataflow is the pipeline, which consists of a “set of operations that can read a source of input data, transform that data, and write out the resulting output.” Data is organized in collections, which can be bounded or unbounded in size, and passed through a number of transformations, which are computations that operate on input collections and generate output collections. A pipeline runner is the environment in which the pipeline is executed. The SDK provides three types of runners: DirectPipelineRunner (runs on the local machine), DataflowPipelineRunner (runs on Google Cloud Platform), and BlockingDataflowPipelineRunner (also runs on Google Cloud Platform, but prints log messages about execution status).
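To make the read, transform, write idea concrete, here is a minimal sketch in plain Java. It is not the Dataflow SDK API: the class MiniPipeline and its methods read, apply, and write are hypothetical names invented for illustration, and the collection is a small bounded, in-memory list rather than a distributed one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a pipeline; this is NOT the Dataflow SDK API.
public class MiniPipeline<T> {
    // The current (bounded, in-memory) collection flowing through the pipeline.
    private final List<T> collection;

    private MiniPipeline(List<T> collection) {
        this.collection = collection;
    }

    // Read: copy a bounded source of input data into the pipeline.
    public static <T> MiniPipeline<T> read(List<T> source) {
        return new MiniPipeline<>(new ArrayList<>(source));
    }

    // Transform: run a computation over the input collection
    // and produce a new output collection.
    public <R> MiniPipeline<R> apply(Function<T, R> transform) {
        List<R> output = new ArrayList<>();
        for (T element : collection) {
            output.add(transform.apply(element));
        }
        return new MiniPipeline<>(output);
    }

    // Write: return a copy of the resulting output.
    public List<T> write() {
        return new ArrayList<>(collection);
    }

    public static void main(String[] args) {
        List<Integer> lengths = MiniPipeline.read(Arrays.asList("flume", "millwheel"))
                .apply(String::toUpperCase) // first transformation
                .apply(String::length)      // second transformation
                .write();
        System.out.println(lengths); // prints [5, 9]
    }
}
```

Each call to apply returns a new pipeline stage instead of modifying the previous one, which mirrors the idea that a transformation takes an input collection and generates an output collection.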

Pipelines can be simple, with transformations executed linearly one after another, or complex, forming a directed graph in which transformation paths branch and later merge. A pipeline cannot share data or transformations with another pipeline. Pipelines are executed asynchronously, and the Dataflow service may decide the order in which some transformations are executed, optimizing the whole process for efficiency.
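A branching pipeline can be pictured with another plain-Java sketch. Again, this does not use the Dataflow SDK: the class BranchingSketch and its methods are hypothetical, and the two branches run sequentially in memory, whereas a real service would be free to schedule them as it sees fit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of a pipeline graph that branches and merges.
// This is NOT the Dataflow SDK API.
public class BranchingSketch {

    // Branch A: keep only the words longer than four characters.
    static List<String> longWords(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 4)
                .collect(Collectors.toList());
    }

    // Branch B: keep the remaining short words and upper-case them.
    static List<String> shortWordsUpper(List<String> words) {
        return words.stream()
                .filter(w -> w.length() <= 4)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Merge: combine the outputs of both branches into one collection.
    static List<String> merge(List<String> a, List<String> b) {
        List<String> merged = new ArrayList<>(a);
        merged.addAll(b);
        return merged;
    }

    // The whole graph: one input feeds two branches, which merge back later.
    static List<String> run(List<String> input) {
        return merge(longWords(input), shortWordsUpper(input));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("java", "dataflow", "sdk", "pipeline")));
    }
}
```

Because neither branch depends on the other's output, the order in which they run does not affect the merged result apart from element ordering, which is the kind of freedom an executing service can use when optimizing.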

Dataflow apps can be deployed on Google Cloud Platform, which takes care of all the infrastructure needed, including VMs for running the code, storage for data, and BigQuery mechanisms for processing data. Developers can also deploy these applications on other runners, either locally or in other clouds, provided that a similar service is created.

The Dataflow SDK comes with examples. A Stack Overflow tag has been created to answer developers’ questions.