Google Open Sources Cloud Dataflow Java SDK

by Abel Avram on Dec 18, 2014

Earlier this year, Google announced Cloud Dataflow, a service and SDK for processing large amounts of data in batch or in real time. Google has now open sourced the Dataflow Java SDK, letting developers see how it works and possibly use the SDK with services running on-premises or in other clouds.

Dataflow is a cloud service built on technology that evolved from FlumeJava, a Java library for creating data-parallel pipelines, and MillWheel, a framework for building fault-tolerant stream processing applications. Both are used internally by “hundreds of developers” at Google. The service is language agnostic, but Google provides a Java SDK to make it easier to create applications for it.

The key concept in Dataflow is the pipeline, which consists of a “set of operations that can read a source of input data, transform that data, and write out the resulting output.” Data is organized in collections, which can be bounded or unbounded in size, and is passed through a number of transformations, which are computations that operate on input collections and generate output collections. A pipeline runner is the environment in which the pipeline is executed. The SDK provides three runners: DirectPipelineRunner, which runs on the local machine; DataflowPipelineRunner, which runs on Google Cloud Platform; and BlockingDataflowPipelineRunner, which also runs on Google Cloud Platform but prints log messages on execution status.
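
To make the read, transform, write idea concrete, here is a minimal sketch in plain Java. It does not use the Dataflow SDK: the class MiniPipeline, the Transform interface, and the mapEach and run helpers are hypothetical names invented for illustration, and the collections are ordinary bounded in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical plain-Java model of the pipeline idea. This is NOT the Dataflow SDK API.
public class MiniPipeline {

    // A transformation takes an input collection and produces a new output collection.
    interface Transform<I, O> extends Function<List<I>, List<O>> {}

    // Builds an element-wise transformation that applies fn to every input element.
    static <I, O> Transform<I, O> mapEach(Function<I, O> fn) {
        return input -> {
            List<O> output = new ArrayList<>();
            for (I element : input) {
                output.add(fn.apply(element));
            }
            return output;
        };
    }

    // A linear pipeline: read the source, trim each string, then compute its length.
    static List<Integer> run(List<String> source) {
        Transform<String, String> trim = mapEach(String::trim);
        Transform<String, Integer> length = mapEach(String::length);
        return length.apply(trim.apply(source));
    }

    public static void main(String[] args) {
        // "Write" step: here the output is simply printed.
        System.out.println(run(Arrays.asList(" a ", "bcd ")));
    }
}
```

In this sketch the "runner" is simply the local JVM calling run; in the real SDK the choice of runner decides whether the same pipeline executes locally or on Google Cloud Platform.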

Pipelines can be simple, with transformations executed linearly one after another, or complex, forming a directed graph in which the transformation path branches and merges back later. A pipeline cannot share data or transformations with another pipeline. Pipelines are executed asynchronously, and the Dataflow service may decide the order in which some transformations are executed, optimizing the whole process for efficiency.
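
A branching and merging pipeline can be pictured with the following plain-Java sketch. It is an illustration only and does not use the Dataflow SDK: BranchingPipeline and its run method are hypothetical names, and the two branches run sequentially here, whereas a real service would be free to schedule independent branches in whatever order it finds efficient.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical plain-Java illustration of a pipeline shaped as a directed graph.
// This is NOT the Dataflow SDK API.
public class BranchingPipeline {

    static List<String> run(List<Integer> input) {
        // Branch A: keep the even numbers and label them.
        List<String> evens = input.stream()
                .filter(n -> n % 2 == 0)
                .map(n -> "even:" + n)
                .collect(Collectors.toList());

        // Branch B: keep the odd numbers and label them.
        // The branches do not depend on each other, so their order does not matter.
        List<String> odds = input.stream()
                .filter(n -> n % 2 != 0)
                .map(n -> "odd:" + n)
                .collect(Collectors.toList());

        // Merge: both branches flow back into a single output collection.
        List<String> merged = new ArrayList<>(evens);
        merged.addAll(odds);
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4)));
    }
}
```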

Dataflow applications can be deployed on Google Cloud Platform, which takes care of all the infrastructure needed, including VMs for running the code, storage for data, and BigQuery for processing data. Developers can also deploy these applications on different runners, either locally or in other clouds, provided that a similar service is created.

The Dataflow SDK comes with examples. A Stack Overflow tag has been created to answer developers’ questions.
