ApacheCon 2019 Keynote: Google Cloud Enhances Big-Data Processing with Kubernetes

At ApacheCon North America, Christopher Crosbie gave a keynote talk titled "Yet Another Resource Negotiator for Big Data? How Google Cloud is Enhancing Data Lake Processing with Kubernetes." He highlighted Google's efforts to make Apache big-data software "cloud native" by developing open-source Kubernetes operators that provide control planes for running Apache software in a Kubernetes cluster.

Crosbie, product manager for Open Data and Analytics at Google, began the talk by reviewing Google's history of contributing to the open-source big-data community, starting with the MapReduce paper that inspired Hadoop, and by describing how the Google Cloud Platform (GCP) has recently been offering managed versions of various Apache applications. In particular, GCP's Cloud Dataproc service provides managed versions of Apache Spark and Hadoop clusters. Crosbie's talk focused on the pain points associated with administering these clusters, and on how switching the cluster management to Kubernetes can solve many of these problems. The Dataproc team has built Kubernetes support into their product, but they have also open-sourced much of their work, so users are able to run Spark and Hadoop in their own Kubernetes clusters.

Because Spark and Hadoop are distributed systems that run on a cluster of machines, they require a cluster management system to handle various administrative tasks, such as checking the health of the cluster nodes, restarting or replacing failed nodes, and scheduling jobs. Both systems support multiple cluster managers, including YARN and Mesos. Much of the Dataproc team's previous work has focused on building control planes for YARN, integrating YARN controls into the GCP API.

However, Crosbie says using YARN has several pain points. YARN's dependencies on other components result in a complicated open-source software stack, and dependency management, versioning, and job tuning are hard. To reduce operational overhead, organizations often maintain only a single cluster for running all jobs, which means jobs must compete for resources and the cluster may be over-provisioned.

Crosbie claims that using Kubernetes as the cluster manager can solve these pain points. Because many organizations are now running Kubernetes as a cluster manager for their own containerized applications, using Kubernetes to manage Spark eliminates the need for a second resource-management interface for YARN or Mesos, simplifying operations. Because the applications are containerized, dependency and version management are "built in" to each container, and execution is more isolated. The overall system is more resilient, since security patches and OS updates are handled at the lower level of the Kubernetes cluster itself, and don't require changes to the Spark and Hadoop containers.
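To make the "dependencies built into the container" point concrete, here is a minimal sketch (not taken from the talk) of how a job might be pointed at a Kubernetes cluster using Spark's own spark-submit tool. The `k8s://` master prefix and the `spark.kubernetes.container.image` and `spark.executor.instances` properties are standard Spark settings; the helper function, API server address, image name, class name, and jar path are all hypothetical placeholders.

```python
from typing import Dict, List, Optional


def build_spark_submit_args(
    api_server: str,
    image: str,
    main_class: str,
    app_resource: str,
    executors: int = 2,
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit command line that targets Kubernetes.

    This helper is hypothetical; it only builds the argument list and
    does not run anything.
    """
    conf = {
        # The container image carries Spark plus the job's dependencies.
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }
    conf.update(extra_conf or {})

    args = [
        "spark-submit",
        # Spark selects its Kubernetes backend from the k8s:// prefix.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key in sorted(conf):  # sorted for a stable, reviewable command line
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_resource)
    return args


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or images.
    print(" ".join(build_spark_submit_args(
        api_server="https://k8s.example.invalid:6443",
        image="example/spark-job:1.0",
        main_class="org.example.Main",
        app_resource="local:///opt/app/app.jar",
        executors=3,
    )))
```

Changing a job's Spark version or library set then means building a different image, rather than changing software installed on the cluster nodes.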

The key to using Kubernetes to host Spark was to extend the Kubernetes API using custom resource definitions and operators. This allows the Kubernetes control-plane API to "speak the language" of the hosted app. Crosbie's team has open-sourced operators for both Spark and Apache Flink. Users can opt to run a hosted solution in Google Cloud, or they can download a Helm chart and run the operators on their own Kubernetes clusters.
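The operator pattern described above can be illustrated with a toy reconcile step: a custom resource declares the desired state in the application's own terms, and the operator computes what must change in the cluster to match it. The sketch below is a simplified illustration in plain Python, not the code or schema of Google's operators; the `SparkJobResource` type, its fields, and the pod-naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class SparkJobResource:
    """Hypothetical custom resource: the desired state a user declares."""
    name: str
    image: str
    executors: int


def desired_pods(resource: SparkJobResource) -> Set[str]:
    """Translate the app-level spec into the pods that should exist."""
    pods = {f"{resource.name}-driver"}
    pods.update(f"{resource.name}-exec-{i}" for i in range(resource.executors))
    return pods


def reconcile(resource: SparkJobResource, observed_pods: Set[str]) -> Dict[str, List[str]]:
    """Compare desired and observed state; return the actions to take.

    A real operator would call the Kubernetes API to apply these actions
    and would be re-triggered whenever the resource or its pods change.
    """
    wanted = desired_pods(resource)
    return {
        "create": sorted(wanted - observed_pods),
        "delete": sorted(observed_pods - wanted),
    }


if __name__ == "__main__":
    job = SparkJobResource(name="etl", image="example/spark-job:1.0", executors=2)
    # One executor is missing and one stale pod is left over.
    print(reconcile(job, {"etl-driver", "etl-exec-0", "etl-exec-5"}))
```

The user only states "a Spark job with two executors"; the mapping to individual pods lives in the operator, which is what lets the control plane speak in the application's vocabulary.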

Crosbie concluded his talk by listing some pros and cons of switching to Kubernetes over YARN. Many of the cons are the result of operational "inertia"; for example, existing jobs may be well-tuned for running on YARN, but would need re-tuning if migrated to Kubernetes. Likewise, YARN's log file conventions are well-understood and organizations may have built auditing and monitoring specifically targeted to it. Crosbie likened security on Kubernetes to a "Russian doll," with several layers:

Kerberos within [Kubernetes] RBAC controls, within VM service account, within cloud IAM, backed by Cloud Identity often synced to something else.

The Kubernetes operator for Spark was developed in collaboration with IBM and Microsoft. It was open-sourced as an alpha version last year, and reached beta earlier this year. The operator for Flink was open-sourced last week, but does not yet have an official tagged release.
