Beam Graduates to Top-Level Apache Project

Beam recently graduated to a top-level project at the Apache Software Foundation. Its goals include processing unbounded, out-of-order, global-scale data with portable, high-level data pipelines. Beam began as an internal Google project that was later moved to Apache, and it was in incubation from February 2016 through late last year. The project seeks to create a unified programming model for streaming and batch processing jobs, and to produce artifacts that can be consumed by a number of supported data processing engines. Beam seeks to:

provide the world with an easy-to-use, but powerful model for data-parallel processing, both streaming and batch, portable across a variety of runtime platforms... The Beam SDKs use the same classes to represent both bounded and unbounded data, and the same transforms to operate on that data.

The SDKs, available in Java and Python, provide an abstraction between the processing pipeline components and the background processing engine of choice. Supported processing engines include Apache Apex, Flink, Spark and Google's Cloud Dataflow engine.

The programming model for a Beam pipeline involves PCollections, Transforms, and Pipeline I/O, as well as a runner for each supported processing engine; if no runner is specified, Beam defaults to the local DirectRunner:

Pipeline
PCollection
Core SDK transform objects ParDo, GroupByKey, Combine, Flatten and Partition
Source / Sink Pipeline I/O
DirectRunner, DataflowRunner, SparkRunner, FlinkRunner and ApexRunner
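
To make the core transforms listed above more concrete, here is a toy, plain-Python sketch of what each one does to a collection. It is not the Beam SDK and does not use its API: the function names (`par_do`, `group_by_key`, `combine_per_key`, `flatten`, `partition`, `word_count`) are hypothetical stand-ins chosen for illustration, ordinary lists play the role of PCollections, and everything runs eagerly in one process rather than on a runner.

```python
from collections import defaultdict
from itertools import chain


def par_do(elements, fn):
    """ParDo-like step: fn maps one element to zero or more outputs."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs):
    """GroupByKey-like step: gather every value that shares a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(grouped, combiner):
    """Combine-like step: reduce each key's values to a single result."""
    return {key: combiner(values) for key, values in grouped.items()}


def flatten(*collections):
    """Flatten-like step: merge several collections into one."""
    return list(chain.from_iterable(collections))


def partition(elements, num_partitions, partition_fn):
    """Partition-like step: split one collection into a fixed number of parts."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        parts[partition_fn(element)].append(element)
    return parts


def word_count(lines_a, lines_b):
    """Chain the toy transforms into a small word-count pipeline."""
    lines = flatten(lines_a, lines_b)
    pairs = par_do(lines, lambda line: [(w.lower(), 1) for w in line.split()])
    return combine_per_key(group_by_key(pairs), sum)


counts = word_count(["beam unifies batch", "and streaming"], ["Beam runs on Flink"])
# Split the distinct words into "short" (4 letters or fewer) and "long".
short_words, long_words = partition(
    sorted(counts), 2, lambda word: 0 if len(word) <= 4 else 1
)
print(counts)
print(short_words, long_words)
```

In the real SDKs the same steps are declared as transforms on a Pipeline and executed by whichever runner is selected; the sketch only shows the data flow between them.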

Google's motivation for open-sourcing Beam is part of an emerging business model that supports integrating with, and contributing to, other open-source projects. The rationale is that doing so will increase adoption of the Beam project, in the hope of gaining more exposure for Google's Cloud Dataflow platform and seeing it emerge as the processing engine of choice among the supported engines. Google's comparison between Spark and Beam presents the Beam model as the correct model for stream and batch data processing because of its focus on the semantics enabled by event-time windowing, watermarks, and triggers. The open source community and the broader data science industry have yet to validate these claims empirically and independently of Google; doing so will require more use case analysis around architecture and benchmarking. Early signs indicate a growing Beam community and positive feedback around supporting multiple processing platforms.
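
As a rough illustration of why event-time windowing matters for out-of-order data, the following plain-Python sketch groups events by the time they occurred rather than the order in which they arrived. It is an assumption-laden toy, not Beam code: the helper names (`assign_fixed_window`, `sum_by_event_time_window`) and the sample events are hypothetical, timestamps are plain integers in seconds, and watermarks and triggers (which decide when a window's result is emitted) are not modeled at all.

```python
from collections import defaultdict


def assign_fixed_window(event_time, size):
    """Return the [start, end) fixed window that contains event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sum_by_event_time_window(events, size):
    """Sum values per event-time window, regardless of arrival order."""
    totals = defaultdict(int)
    for event_time, value in events:
        totals[assign_fixed_window(event_time, size)] += value
    return dict(totals)


# (event_time_seconds, value) pairs listed in arrival order; the event
# stamped 59 arrives after the one stamped 61, i.e. out of order.
events = [(12, 1), (3, 5), (61, 2), (59, 4), (7, 1)]
totals = sum_by_event_time_window(events, size=60)
print(totals)
```

Because each event is assigned to a window by its own timestamp, the late-arriving event stamped 59 still lands in the first one-minute window instead of being counted with whatever arrived around it.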
