
Google's Cloud Dataflow Enters General Availability

by Kent Weare on Sep 17, 2015. Estimated reading time: 3 minutes

On August 12, Google announced that Cloud Dataflow, its big data processing service, has reached general availability. This managed service allows customers to build pipelines that manipulate data before it is processed by big data solutions. Cloud Dataflow supports both streaming and batch programming in a unified model.

The Cloud Dataflow service is the evolution of some internal projects at Google, including MapReduce, Flume and MillWheel. Google first released an early access preview of Cloud Dataflow at Google I/O in June 2014. This release was followed by alpha and beta releases in December 2014 and April 2015 respectively.

A common problem for organizations is processing large amounts of data before it is ingested into cloud or on-premises big data platforms. Data often needs to undergo ETL-like operations such as enrichment, shaping, filtering, consolidation, computation and composition. Google’s Cloud Dataflow platform has been designed to address these challenges for customers within the Google Cloud Platform. Organizations that prefer to run their big data workloads on-premises can build their own data pipelines using Google’s SDK or a third-party solution.

Customers may also use Cloud Dataflow in high-volume computation scenarios that must process more data than a cluster’s memory footprint can handle. With Cloud Dataflow, developers can break these jobs into perfectly parallel data processing tasks that execute concurrently and independently of each other.
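The idea of a perfectly parallel job can be illustrated with a small sketch in plain Python. This is not the Cloud Dataflow SDK: the function names (`chunks`, `process_chunk`, `parallel_sum_of_squares`) and the sum-of-squares workload are hypothetical, chosen only to show that each chunk is processed without reference to any other chunk and that only the small partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_chunk(batch):
    """Hypothetical per-chunk work: a sum of squares that needs no other chunk."""
    return sum(x * x for x in batch)


def parallel_sum_of_squares(values, chunk_size=1000, workers=4):
    """Process every chunk independently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits all chunks up front, so this toy version does
        # not bound memory; a managed service would schedule chunks on demand.
        partials = pool.map(process_chunk, chunks(values, chunk_size))
        return sum(partials)
```

Because `process_chunk` depends only on its own input, the chunks could in principle run on separate machines in any order, which is the property that lets a service distribute the work.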

Some of the benefits that Google claims include:

  • A NoOps model that can allocate resources on demand with intelligent auto-scaling and automated work optimization.
  • A unified, functional programming model that supports both batch and stream-based processing.
  • An extensible, open source SDK that enables custom scenarios for customers and allows for third-party integration.

Eric Schmidt, product manager at Google, describes the Dataflow offering as two things: “a collection of SDKs for building batch or streaming parallelized data processing pipelines and a fully managed service for executing optimized parallelized data processing pipelines”.

Image Source: http://googlecloudplatform.blogspot.ca/2015/08/Announcing-General-Availability-of-Google-Cloud-Dataflow-and-Cloud-Pub-Sub.html 

Organizations can use the Cloud Dataflow service to ingest, transform and analyze data before distributing it to other analytic services and platforms. These integration points include Google's BigQuery, Cloud Datastore and Cloud Pub/Sub messaging, as well as third-party analytic services.

Google has onboarded some partners, including springML, Cloudera, dataArtisans and Salesforce.com, to extend the platform offering. For example, customers using Cloud Dataflow and Salesforce.com Wave analytics will be able to analyze large amounts of data, regardless of its origin, using an end-to-end platform in order to optimize customer interactions.

For organizations preferring to build their own solutions, the open source SDK provides a specialized collection class called PCollection, which represents bounded and unbounded data collections. Google claims that a PCollection can reach a “virtually unlimited size”; combined with a PTransform, it can be used to transform data between source and destination systems. The I/O APIs support different sources, including text files, Avro files and BigQuery tables, which can be used to load data into a PCollection.
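To make the relationship between a collection and its transforms more concrete, here is a toy model in plain Python. It is only a sketch of the concept, not the Dataflow SDK: `ToyPCollection`, `map_elements`, `filter_elements` and `clean_lines` are hypothetical names invented for this illustration, and the real PCollection and PTransform classes have their own, richer APIs.

```python
class ToyPCollection:
    """Toy stand-in for a collection of elements; NOT the Dataflow SDK class."""

    def __init__(self, elements):
        # Any iterable works: a list (bounded) or a generator (possibly unbounded).
        self._elements = elements

    def apply(self, transform):
        """Return a new collection produced by running `transform` over this one."""
        return ToyPCollection(transform(self._elements))

    def __iter__(self):
        return iter(self._elements)


def map_elements(fn):
    """Build a toy transform that applies `fn` to every element lazily."""
    def transform(elements):
        return (fn(element) for element in elements)
    return transform


def filter_elements(predicate):
    """Build a toy transform that keeps elements for which `predicate` is true."""
    def transform(elements):
        return (element for element in elements if predicate(element))
    return transform


def clean_lines(lines):
    """Hypothetical pipeline: trim whitespace, drop empty lines, upper-case the rest."""
    return (
        ToyPCollection(lines)
        .apply(map_elements(str.strip))
        .apply(filter_elements(bool))
        .apply(map_elements(str.upper))
    )
```

Calling `list(clean_lines(["  a ", "", "b"]))` walks the chain once and yields the cleaned lines. Because each toy transform is a lazy generator, the same chain accepts a list or a generator as input, which loosely mirrors the bounded and unbounded cases described above.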

Cloud Dataflow will be billed on a per-job basis, which accounts for the graph of computations provided by the developer, service time, work time and shuffled bytes. Other Google services that a job consumes, such as BigQuery, will be billed separately.

Google has competition in this space, including Amazon and Microsoft. Amazon offers its Kinesis platform, while Microsoft addresses similar use cases with the Azure Data Factory platform.
