Google Announces Cloud Dataflow Beta at Google I/O
At its annual developer conference, Google announced a set of new initiatives for cloud computing. At the top of the list is Cloud Dataflow -- a way of managing complex data pipelines.
InfoQ spoke with Brian Goldfarb, the Head of Product Marketing for Google's cloud platform. He pointed out that Cloud Dataflow handles both batch and streaming data. Imagine analyzing, in real time, millions of tweets posted during a worldwide event. In one pipeline segment, you read the tweets. In the next segment, you extract tags. In another segment, you classify tweets by sentiment (positive, negative, or other). In the next segment, you filter for keywords. And so on. MapReduce, an older paradigm for handling large data sets, doesn't readily deal with such real-time data, and doesn't easily apply to such long, complex pipelines.
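To make the segments in that example concrete, here is a minimal in-memory sketch in plain Java. It is not Cloud Dataflow code: the class name, the method names, and the deliberately naive word-matching rules are all assumptions made for illustration, and a real service would spread this work across many machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Toy illustration of the pipeline segments described above.
// Every name here is hypothetical; none of it is part of Cloud Dataflow.
final class TweetStagesSketch {

    enum Sentiment { POSITIVE, NEGATIVE, OTHER }

    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Segment: extract hashtags (lowercased, without the '#') from a tweet.
    static List<String> extractTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    // Segment: classify sentiment with a deliberately naive word check.
    static Sentiment classify(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        if (lower.contains("great") || lower.contains("love")) {
            return Sentiment.POSITIVE;
        }
        if (lower.contains("awful") || lower.contains("hate")) {
            return Sentiment.NEGATIVE;
        }
        return Sentiment.OTHER;
    }

    // Segment: keep only the tweets that mention the keyword (case-insensitive).
    static List<String> filterByKeyword(List<String> tweets, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        return tweets.stream()
                .filter(t -> t.toLowerCase(Locale.ROOT).contains(needle))
                .collect(Collectors.toList());
    }
}
```

Each method stands in for one segment. Chaining them by hand works for a handful of tweets held in memory, which is exactly the part that stops scaling once the input becomes an unbounded real-time stream.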
Google's new paradigm uses the same API for both batch processing and real-time processing, and for both simple and complex pipelines. With Cloud Dataflow, the developer concentrates exclusively on the data logic, leaving pipeline optimization details to the Google cloud. Instead of treating each pipeline segment separately, Cloud Dataflow takes into account the way segments interact with one another. That way, a single slow segment doesn't necessarily stall all the downstream segments. To handle the traffic among segments, Cloud Dataflow uses aggregation by key, sliding windows, parts of MapReduce, and many other techniques.
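Two of the techniques named above, sliding windows and aggregation by key, can be sketched in a few lines of plain Java. This is only an illustration of the general idea, not how Cloud Dataflow implements it; the `Event` class, the `countByKey` method, and the choice of whole-number timestamps are assumptions. Windows are taken to start at multiples of the period and to cover the half-open range from start to start plus size.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of sliding-window counts grouped by key.
// All names are hypothetical; this is not the Cloud Dataflow API.
final class SlidingWindowSketch {

    // A keyed, timestamped event, e.g. a hashtag seen at a given second.
    static final class Event {
        final String key;
        final long timestamp;

        Event(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    // Returns windowStart -> (key -> count). Assumes size > 0 and period > 0.
    // Because windows overlap when size > period, one event can be counted
    // in several windows.
    static Map<Long, Map<String, Integer>> countByKey(
            List<Event> events, long size, long period) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            // Latest window start that is at or before the event.
            long lastStart = e.timestamp - Math.floorMod(e.timestamp, period);
            // Walk back through every earlier window that still covers the event.
            for (long start = lastStart; start > e.timestamp - size; start -= period) {
                result.computeIfAbsent(start, s -> new TreeMap<>())
                      .merge(e.key, 1, Integer::sum);
            }
        }
        return result;
    }
}
```

With a 60-second window that slides every 30 seconds, an event at second 40 is counted in the windows starting at 0 and at 30. An early event can also fall into a window whose start is negative, which this sketch does not filter out.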
With Cloud Dataflow, Google's cloud makes choices about the best way to optimize any particular application. The developer can accept these optimizations for most scenarios, and override the defaults for edge cases.
For the developer, most scenarios involve coding in relatively simple parts of the API. Here's an example from the conference keynote:
    Pipeline pipeline = Pipeline.create();
    PCollection tweets = pipeline.begin()
        .apply(new InputFromPubSub())
        .apply(new TweetTransformer());
    tweets.apply(new CalculateSentiment());
    tweets.apply(new CorrelateKeywords());
    pipeline.run();
(For the time being, the API is available only for Java.)
The developer defines the pipeline, making sure that the code inside each segment (TweetTransformer, CalculateSentiment, and so on) is efficient and correct. Google's cloud then orchestrates the flow between and among the segments. Google's cloud also takes care of the low-level VM details. Operations such as deploying, scaling, spinning up and spinning down are all done behind the scenes.
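The fluent `apply` style in the keynote snippet, where one collection feeds two independent segments, can be imitated with a toy class. The sketch below is a guess at the shape of the idea only: `ToyCollection` is a hypothetical name, it runs eagerly in a single JVM, and it makes no claim about how the real `PCollection` behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy stand-in for a collection that flows between pipeline segments.
// Hypothetical and eager; the real service would schedule this work itself.
final class ToyCollection<T> {
    private final List<T> items;

    ToyCollection(List<T> items) {
        this.items = new ArrayList<>(items); // defensive copy
    }

    // Run one segment over every element and return a new collection,
    // leaving this one untouched so it can feed several segments.
    <R> ToyCollection<R> apply(Function<? super T, ? extends R> segment) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            out.add(segment.apply(item));
        }
        return new ToyCollection<>(out);
    }

    // Copy of the current contents, for inspection.
    List<T> items() {
        return new ArrayList<>(items);
    }
}
```

Because `apply` returns a new collection instead of modifying the receiver, the same upstream collection can branch into two downstream segments, much as `tweets` does in the keynote example.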
To accompany Cloud Dataflow, Google announced four new tools to make the developer's work easier and more productive: Cloud Save, Cloud Debug, Cloud Trace, and Cloud Monitoring.
- Cloud Save is a simple API for saving and retrieving user information in the cloud. This information can include application data, preferences and other things.
      GoogleAppClient client = . . .
      List infoToSave = . . .
      SaveResult saveResult = CloudSaveManager.save(client, infoToSave);
- Cloud Debug is exactly what the name suggests. It's a debugger (whose interface is in a web browser) for cloud-based applications.
- Cloud Trace presents a useful visualization of the timing of service calls for a database request.
- Cloud Monitoring provides service-level metrics and warnings, with custom alerts to catch problems before they affect users.
Cloud Monitoring comes from Google's recent acquisition of Stackdriver.
Cloud Dataflow and its accompanying tools are currently in beta. A public release date has not been announced.