Spotify Open-Sources Terraform Module for Kubeflow ML Pipelines

Spotify has open-sourced their Terraform module for running Kubeflow, a machine-learning pipeline platform, on Google Kubernetes Engine (GKE). By switching their in-house ML platform to Kubeflow, Spotify engineers have achieved faster time to production and are producing 7x more experiments than on the previous platform.

In a recent blog post, Spotify's product manager Josh Baer and ML engineer Samuel Ngahane described Spotify's "Paved Road" for machine learning: "an opinionated set of products and configurations to deploy an end-to-end machine learning solution using our recommended infrastructure." By adopting these standards, Spotify's machine learning engineers no longer need to build or maintain infrastructure and instead can focus on their ML experiments. Since launching the platform in mid-2019, about 100 internal users have adopted it and run up to 18,000 experiments.

Spotify has long used machine learning to automatically build customized playlists such as "Discover Weekly," which recommends new music to users. The company's initial policy was to allow teams to choose their own tools and frameworks; much of their existing infrastructure used the Scala language and included several custom libraries that Spotify has since open-sourced, including Scio, a Scala data processing library for Apache Beam. This approach had several drawbacks: it did not scale well and required the team to support multiple frameworks, slowing iteration time from concept to production. Further, many engineers "would never consider adding Scala to their Python-based workflow." These challenges caused Spotify to rethink their framework choices and develop the "Paved Road" concept.

The Paved Road is intended to address the problems of the end-to-end ML workflow, the multi-step process of developing and deploying a model, which includes:

  • data pre-processing
  • feature transformation
  • model training
  • model evaluation
  • model serving

In particular, Spotify focused on the problem of the data interface between these steps, choosing the TFRecord and tf.Example formats defined by Google's TensorFlow Extended (TFX) ML platform. Building connectors from TFX to their existing tools gave the team a path to migrate toward a common framework. Soon they were able to take advantage of several other TFX components such as TensorFlow Data Validation (TFDV), which allows developers to detect problems with data, such as skewed distributions or erroneous values. However, there were still challenges; for example, their tooling lacked an end-to-end orchestration framework.
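
To make the data-validation idea concrete, the following is a minimal sketch in plain Python of the kind of check a tool like TFDV automates: derive expected value ranges from training data, then flag rows in a new batch that are missing a feature or fall outside those ranges. It is not the TFDV API and not Spotify's code; the function names (`infer_schema`, `find_anomalies`) and the feature names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]
Schema = Dict[str, Tuple[float, float]]  # feature name -> (min, max) seen in training


def infer_schema(rows: List[Row]) -> Schema:
    """Record the observed (min, max) of every numeric feature."""
    schema: Schema = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = schema.get(name, (value, value))
            schema[name] = (min(lo, value), max(hi, value))
    return schema


def find_anomalies(rows: List[Row], schema: Schema) -> List[str]:
    """Report features that are missing or outside the training range."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        for name, (lo, hi) in schema.items():
            if name not in row:
                problems.append(f"row {i}: missing feature '{name}'")
            elif not lo <= row[name] <= hi:
                problems.append(
                    f"row {i}: '{name}'={row[name]} outside [{lo}, {hi}]"
                )
    return problems


# Hypothetical feature names and values, for illustration only.
training = [
    {"track_length_s": 180.0, "skip_rate": 0.1},
    {"track_length_s": 240.0, "skip_rate": 0.4},
]
serving = [
    {"track_length_s": 200.0, "skip_rate": 0.2},  # within range
    {"track_length_s": -5.0, "skip_rate": 0.3},   # erroneous value
    {"skip_rate": 0.9},                           # missing feature, skewed value
]

schema = infer_schema(training)
anomalies = find_anomalies(serving, schema)
for line in anomalies:
    print(line)
```

A real validation tool works on distribution statistics rather than simple min/max bounds, but the workflow is the same: compute a schema once, then compare each new batch against it before the batch reaches training or serving.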

The final result was to move to Kubeflow Pipelines (KFP), an open-source ML workflow platform. In this platform, components of the workflow are packaged as Docker containers managed by Kubernetes. KFP supported TFX out-of-the-box, so teams did not need to learn a new framework. Another advantage is the KFP SDK, which pre-packages many common tasks, allowing for code sharing and re-use. Spotify deploys and manages shared Kubeflow clusters so that developers can focus on ML experiments instead of the details of managing infrastructure. The clusters are hosted on Google Kubernetes Engine (GKE) and are configured using Terraform.
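
The orchestration idea can be sketched without Kubeflow itself. The example below is plain Python using the standard-library `graphlib` module, not the KFP SDK: each workflow step declares which steps it depends on, and a small runner executes them in dependency order while passing artifacts through a shared dictionary. In KFP each step would run as its own container on Kubernetes; here the steps are ordinary functions, and all names (`run_pipeline`, the step functions, the `store` keys) are hypothetical stand-ins.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set

Store = Dict[str, object]  # stands in for artifacts passed between steps


def preprocess(store: Store) -> None:
    # Drop missing values from the raw input.
    store["clean"] = [x for x in store["raw"] if x is not None]


def transform(store: Store) -> None:
    # Scale features into [0, 1] by the maximum value.
    peak = max(store["clean"])
    store["features"] = [x / peak for x in store["clean"]]


def train(store: Store) -> None:
    # Toy "model": the mean of the features.
    feats = store["features"]
    store["model"] = sum(feats) / len(feats)


def evaluate(store: Store) -> None:
    # Worst-case absolute error of the toy model.
    store["error"] = max(abs(x - store["model"]) for x in store["features"])


STEPS: Dict[str, Callable[[Store], None]] = {
    "preprocess": preprocess,
    "transform": transform,
    "train": train,
    "evaluate": evaluate,
}

# Each step maps to the set of steps that must finish before it runs.
DEPENDENCIES: Dict[str, Set[str]] = {
    "preprocess": set(),
    "transform": {"preprocess"},
    "train": {"transform"},
    "evaluate": {"train"},
}


def run_pipeline(store: Store) -> List[str]:
    """Run every step in dependency order and return the execution order."""
    executed: List[str] = []
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        STEPS[name](store)
        executed.append(name)
    return executed


store: Store = {"raw": [2.0, None, 4.0]}
order = run_pipeline(store)
print(order, store["model"], store["error"])
```

Declaring dependencies separately from the step bodies is what lets an orchestrator schedule, retry, or cache individual steps, and it is also what makes a step reusable across pipelines.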

Kubeflow Pipelines is part of the growing space of tools for managing the full machine-learning lifecycle, many of them built, and in several cases open-sourced, by large companies: for example, MLflow from Databricks and Michelangelo from Uber. AWS recently released several new SageMaker services that attempt to unify the ML lifecycle stages into a single interface. Spotify ML engineers Ryan Clough and Keshi Dai spoke about Spotify's platform at KubeCon 2019, and Clough later tweeted:

If there's one takeaway I have from [KubeCon] it's that we're all dealing with similar problems with different variations due to the constraints of our organizations, and we're working hard to put our best offerings forward, and nobody has got it all figured out. Nobody.

Spotify hosts several open-source projects on GitHub, including their own TensorFlow helpers and other ML-related libraries.