Key Takeaways
- Beam is a "glue project" that expresses the concepts connecting multiple complex open-source projects. Beam's programming model allows a unified approach to batching and streaming
- Abstract API's allow for portability across runtimes like Spark, Flink, Apex, and Calcite.
- Beam unifies batching and streaming with configurable pipelines
- Beam supports Python and Java SDKs currently, for focus on what's widely adopted in the data engineering and analytics communities.
- Beam aims to generalize and influence other data processing systems.
Apache Beam, a unified batch and streaming programming model made its way to a top-level project with the Apache Software Foundation earlier this year. Beam initially started as a component of Google Cloud Platform's DataFlow service. Beam has been steadily gaining headway in search frequency since early last year and has steady growth and participation in its community email distribution lists. Searches for Beam highlight some recurring questions around Beam that frequently come up.
The impetus for using Beam are covered in a number of posts, as well as discussed in tech-talks and conferences. As part of the primary motivations for the project, the portability and future-proofing aspects of Beam make it a worthy consideration for any new data pipeline project involving increasingly ubiquitous streaming-data. Frances Perry, Beam Project Management Committee (PMC) guide and Google DataFlow engineer provided InfoQ with an in-depth overview of the Beam programming model and some compelling reasons for using it to build new data pipelines.
InfoQ: Conceptually, Beam feels like it takes an "analytics-first" approach to thinking about data pipelines, just based on its semantics and conceptual nomenclature. Can you describe how the thought-process and design for Beam, even in terms of its core concepts and semantics is different from its predecessors?
Frances Perry: I like that term! Beam is designed to separate information about your program and your data from details about the underlying runtime. That means that when you use Beam, you're focused on the properties of your data (is it bounded? unbounded? potentially out of order?) and the algorithms that you want to perform on it (sums, joins, filters, histograms, etc). You're not specifying information about the underlying runtime -- how it should shard data for good performance, what the current data delay is, etc. And that's really important because these things aren't fixed. Even the most careful hand tuning will fail as data, code, and environments shift.
InfoQ: The core project has a fully supported Python API a central focus. Even before open-sourcing Beam, was a python API always a focus? How did working with python influence the underlying Java API's?
Frances Perry: Yes -- we've always known that you need to meet developers where they are at. Language preferences run deep. One of the early lessons we learned is that it's a delicate balancing act when you want to manifest a single set of concepts across multiple languages. You want the abstractions to align cross-language, otherwise it's a complete nightmare to track the subtle differences, teach the concepts, etc. But on the other hand, each SDK needs to feel native to developers -- Python developers want pythonic APIs, not ones that smell like they were written by a Java developer.
InfoQ: How do you and the other project chairs' see the long term challenges of Beam maturing over time as it adds support for more processing backends?
Frances Perry: Beam is a glue project -- the abstractions at its core can express concepts that connect multiple complex projects in ways that make them work together. These multiple points of integration are both a key strength, but also present a real challenge. From the beginning, we've designed Beam to be extensible. We started with the idea of multiple runners, and soon realized that we'd also need options for multiple SDKs, multiple IO connectors, and multiple file systems. But as we all know, building an API to generalize one thing is a farce -- it takes multiple working implementations to get it right. And even though many runners do very similar things, they each do it a bit differently. So figuring out how to build the right abstraction that can balance portability and expressiveness is hard. And that's where the experience of folks across the community is so valuable.
InfoQ: Is there a risk of fan-out from an increasing number of supported backends that could make Beam's core abstractions challenging to make work well? For example, the overlapping feature set between processing engines at the core of Beam semantics.
Frances Perry: We hear this concern a lot -- that Beam is going to end up as the lowest common denominator and therefore be uninteresting. Our aim is for Beam to be neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles. And visa versa, we hope that Beam will influence the road-maps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née DataFlow) model.
InfoQ: Now that Beam has had its first stable release, what incentives do organizations with well established Spark, Flink or Hadoop pipelines have to migrate or port existing systems over to Beam?
Frances Perry: There are three points that make Beam a compelling choice over other options.
- Unifying batching and streaming: Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.
- APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next bullet) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid over-specification the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.
- Portability across runtimes: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-premise to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud.
InfoQ: Have you seen any use cases where in order to use Beam the underlying processing engine needs to be reimplemented a different way?
Frances Perry: Definitely -- and as discussed above, that's unsurprising and even encouraging. But at the same time, we're trying to be very practical about how to deal with these mismatches. We've got the capability matrix to track very clearly which parts of the Beam model each runners is able to support, and whenever possible we design additional functionality to be optional, so that adding it gives runners the option of improving performance but there are simpler, correct implementations.
InfoQ: What's next for the Beam project?
Frances Perry: In my opinion there's a three key areas: portability infrastructure, production readiness, and project integrations.
We have portability in place for the Java SDK -- the same pipeline can be run on multiple runners. But we still have to connect a few more pieces so that all Beam SDKs get this functionality. I'm eagerly awaiting the day when the Python SDK is equally portable.
There are folks running Beam in production now, but the more users it gets, the more we'll be able to harden things and smooth the awkward edges. The community interaction is key here to improving the experience for everyone.
And finally, we're continuing to find new ways to integrate Beam with other open source projects. Not only have we added new runners, like Apache Apex, we're also creating integrations with projects like Apache Calcite to open up Beam to new categories of users.
About the Interviewee
Frances Perry is a software engineer who likes to make big data processing easy, intuitive, and efficient. After many years working on Google's internal data processing stack, she joined the Cloud Dataflow team to make this technology available to external cloud customers. She led the early work on Dataflow's unified batch/streaming programming model and is now on the PMC for Apache Beam.