Google Announces Cloud Dataflow Beta at Google I/O

| by Barry Burd on Jun 28, 2014. Estimated reading time: 2 minutes |

At its annual developer conference, Google announced a set of new initiatives for cloud computing. At the top of the list is Cloud Dataflow -- a way of managing complex data pipelines.

InfoQ spoke with Brian Goldfarb, the Head of Product Marketing for Google's cloud platform. He pointed out that Cloud Dataflow handles both batch and streaming data. Imagine analyzing millions of tweets posted during a worldwide event in real time. In one pipeline segment, you read the tweets. In the next segment you extract tags. In another segment, you classify tweets by sentiment (positive, negative, or other). In the next segment, you filter for keywords. And so on. Map/Reduce -- an older paradigm for handling large data sets -- doesn't readily deal with such real-time data, and doesn't easily apply to such long, complex pipelines.

Google's new paradigm uses the same API for both batch and real-time processing, and for both simple and complex pipelines. With the product, the developer concentrates exclusively on the data logic, leaving pipeline optimization details to the Google cloud. Rather than optimizing each pipeline segment separately, Cloud Dataflow takes into account the way segments interact with one another. That way, a single slow segment doesn't necessarily stall all the downstream segments. To handle the traffic among segments, Cloud Dataflow uses aggregation by key, sliding windows, parts of Map/Reduce, and many other techniques.
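The "aggregation by key over sliding windows" idea can be sketched in plain Java. This is an illustrative toy, not the Cloud Dataflow API: the Event type, window length, and slide interval here are all hypothetical, and a real pipeline would express the same computation declaratively.

```java
import java.util.*;

// Toy illustration of "aggregation by key" over sliding windows.
// Hypothetical types and parameters -- not the Cloud Dataflow API.
public class SlidingWindowDemo {

    // A keyed event with a timestamp, e.g. a sentiment-tagged tweet.
    record Event(String key, long timestampSec) {}

    // Counts events per key in every sliding window of length
    // windowSec whose start is a multiple of slideSec.
    static Map<Long, Map<String, Integer>> countByKey(
            List<Event> events, long windowSec, long slideSec) {
        Map<Long, Map<String, Integer>> counts = new TreeMap<>();
        for (Event e : events) {
            // The latest window start at or before this event.
            long lastStart = (e.timestampSec() / slideSec) * slideSec;
            // Walk backward over every window that still contains it.
            for (long start = lastStart;
                 start > e.timestampSec() - windowSec && start >= 0;
                 start -= slideSec) {
                counts.computeIfAbsent(start, s -> new TreeMap<>())
                      .merge(e.key(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> tweets = List.of(
                new Event("positive", 1),
                new Event("positive", 3),
                new Event("negative", 7));
        // 10-second windows, sliding every 5 seconds.
        System.out.println(countByKey(tweets, 10, 5));
        // {0={negative=1, positive=2}, 5={negative=1}}
    }
}
```

Note that each event lands in more than one window (here, every event belongs to two overlapping 10-second windows), which is exactly why the framework, rather than per-segment code, should manage this bookkeeping.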

With Cloud Dataflow, Google's cloud makes choices about the best way to optimize any particular application. The developer can accept these optimizations for most scenarios, and override the defaults for edge cases.

For the developer, most scenarios involve coding in relatively simple parts of the API. Here's an example from the conference keynote:

Pipeline pipeline = Pipeline.create();
PCollection tweets = pipeline.begin()
                    .apply(new InputFromPubSub())
                    .apply(new TweetTransformer());
tweets.apply(new CalculateSentiment());
tweets.apply(new CorrelateKeywords());

(For the time being, the API is available only for Java.)

The developer defines the pipeline, making sure that the code inside each segment (TweetTransformer, CalculateSentiment, and so on) is efficient and correct. Google's cloud then orchestrates the flow between and among the segments. Google's cloud also takes care of the low-level VM details. Operations such as deploying, scaling, spinning up and spinning down are all done behind the scenes.
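The division of labor described above can be sketched in a few lines of plain Java: each segment is a pure function over a collection, and composing segments mirrors the chained .apply(...) calls in the keynote example. This is a hedged sketch under stated assumptions; the segment names echo the keynote but are not real SDK classes.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: each pipeline segment is a pure function from
// one collection to the next. Because segments share no state, a
// runner such as Dataflow is free to fuse, reorder, or parallelize
// them. These names mimic the keynote example; they are not SDK types.
public class SegmentSketch {

    // Segment 1: keep only tweets that carry a hashtag.
    static Function<List<String>, List<String>> extractTagged =
            tweets ->
                           .filter(t -> t.contains("#"))
                           .toList();

    // Segment 2: normalize to lower case for keyword matching.
    static Function<List<String>, List<String>> normalize =
            tweets ->;

    public static void main(String[] args) {
        // Composing segments mirrors chaining .apply(...) calls.
        List<String> out = extractTagged.andThen(normalize)
                .apply(List.of("Loving #IO14", "no tags here", "#Dataflow beta"));
        System.out.println(out); // [loving #io14, #dataflow beta]
    }
}
```

Writing segments as side-effect-free functions is the design choice that lets the runner handle deployment, scaling, and scheduling without the developer's involvement.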

To accompany Cloud Dataflow, Google announced four new tools to make the developer's work easier and more productive: Cloud Save, Cloud Debug, Cloud Trace, and Cloud Monitor.

  • Cloud Save is a simple API for saving and retrieving user information in the cloud. This information can include application data, preferences and other things.

      GoogleAppClient client = . . .
      List infoToSave = . . .
      SaveResult saveResult =;
  • Cloud Debug is exactly what the name suggests: a browser-based debugger for cloud-based applications.

  • Cloud Trace presents a useful visualization of the timing of service calls for a database request.

  • Cloud Monitor provides service-level metrics and warnings, with custom alerts to catch problems before they affect users.

    Cloud Monitoring comes from Google's recent acquisition of Stackdriver.

Cloud Dataflow and its accompanying tools are currently in beta. A public release date has not been announced.


MapReduce is already deprecated by Muhammad Khojaye

New paradigms like Storm, Spark, and Giraph have already emerged in the Hadoop space and replaced MapReduce. But because MapReduce was already heavily used, it is still difficult to migrate existing systems.

