Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Designing Better ML Systems: Learnings from Netflix

Designing Better ML Systems: Learnings from Netflix



Savin Goyal shares lessons learned by Netflix building their ML infrastructure, and some of the tradeoffs to consider when designing or buying a machine learning system.


Savin Goyal is an engineer on the ML Infrastructure team at Netflix. He focuses on building generalizable infrastructure to accelerate the impact of data science at Netflix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Goyal: I'm going to share some of my experience building and designing ML infrastructure at Netflix. For many people, ML at Netflix is synonymous with our recommendation systems. When you log into Netflix, every single element of the user interface is personalized to your taste. Our usage of ML actually goes much beyond that. We have significant investments in a diverse set of areas, from figuring out how to accurately value content for our service, fighting payment fraud, service abuse. As one of the biggest studios, we rely on data science for efficient content production. We have to figure out efficient shoot schedules. Do automated QA of raw footage. Figure out opportunities on how we can advertise efficiently to non-members, as well as making sure that our existing members never run into rebuffers.


I'm going to talk about some of the recent work we have done in supporting all of these broad and diverse use cases. I'm going to focus more on some of the higher level design principles with the hope that some of those might be useful to you in your day-to-day work. All of the work that I'm going to talk about is now open source and is broadly available on GitHub.

Data Science Stack

Since we are building infrastructure, specifically machine learning infrastructure, to cater to a wide variety of use cases, it is important to understand, what are some of the common concerns that cut across horizontally in each of these use cases? Let's start by looking at the day-to-day life of a data scientist who is working on any of these projects, and what does their stack look like? Any ML project starts with data. A data scientist needs to be able to reliably and efficiently query the data warehouse to find the data that they need. This data could be Amazon S3 that Netflix uses. It could be Google's GCS, Azure's Blob store. Could be some other distributed file system like HDFS. Usually, many organizations tend to keep their data in one of these data warehouses. Once as a data scientist you have access to this data, to perform any transformation on this data set you need access to compute resources. Oftentimes, that's the laptop, many times it could be the cloud as well as a Kubernetes cluster. Next comes the question of how this compute would be orchestrated on these compute resources. Are you simply going to submit these jobs one after the other calling some API? Are you going to rely on a workflow scheduler like Airflow to string together your compute? Once you have these layers sorted out as an end user, you need to understand how to essentially start writing your code against the APIs exposed to you, so that you can actually perform the compute, you can actually train the models that you intend to do. How would you like to architect your code? Up until now, there's very little ML specific stuff that we have spoken about. It's very much like cookie cutter software engineering, systems engineering. It's above this level that we start getting into the realm of machine learning through model operations. ML development is a highly iterative and experimental exercise. As an end user, you have to worry about keeping track of all your different experiments, all your hyperparameters, your models. Maybe you rely on some best practices. Maybe you bake these best practices into your coding standards. Maybe you use some off-the-shelf tooling. Maybe you work for a big company where you have a dedicated team that is building some of this specific tooling for you. Finally, at the very top of the stack is all the tooling for actually training your models. This includes your favorite IDE, your feature engineering code, your favorite ML frameworks, like TensorFlow, PyTorch. Maybe you might just want to roll your own algorithm implementation.

How Much Data Scientists Care

What's interesting is that from a data scientist's point of view, they deeply care about what tools they have at their disposal at the very top of their stack. They'll have strong opinions on whether they want to use TensorFlow or PyTorch. They'll have preferences about their favorite IDE. At the lower levels, they don't really have express opinions. They don't care if their GPU is coming from a Kubernetes cluster or a server rack hidden in somebody's closet, as long as they have quick and easy access to that GPU. Same for that data as well. They don't care if that data is stored in Parquet format, or ORC, as long as they can very quickly and reliably query that data.

How Much Infrastructure Is Needed

Unfortunately, in any organization, while it's easy to get off-the-shelf tooling for layers of the stack that are high up, there has been significant progress that has been made in the open source community. Still, a lot of effort is needed to set up and maintain the lower levels of the stack. How do you set up your data warehouse? What are some of the best practices in terms of writing data to that data warehouse? How do you set up and maintain your Kubernetes cluster? How do you orchestrate that compute? These are the details that a data scientist won't ideally want to get into.


This was indeed our observation at Netflix as well. We essentially decided to build a framework that will allow our data scientists to move across these layers easily, while affording them complete freedom and flexibility of tooling at the top layers of the stack. Metaflow is Netflix's ML framework geared towards increasing the productivity of data scientists by helping them focus on data science and not engineering. It's an open source project that's available on GitHub, that you can take a look at. I'll essentially talk about some of the higher level principles that we learned the hard way while building Metaflow. Each of the layers that I spoke of are huge, daunting pieces of infrastructure by themselves. ML infrastructure is such a broad field that we can talk about any of these layers for hours, and we would still have barely scratched the surface. I want to focus on three specific layers of the stack, namely, architecture, job scheduler, and compute resources. I'll talk about how we view these layers in our experience. How these layers should interact with one another to provide a productive experience to our data scientists.

ML Workflows as DAGs

A very natural paradigm for expressing data processing pipelines, machine learning in particular, is data for DAG, or Directed Acyclic Graph. In this example, we have a DAG, the first step, the user is fetching some data or maybe they are doing some feature engineering on it, then maybe they decided that they need to train two different models. Maybe they are just playing around with different hyperparameters, and they want to choose the best one to publish at the end of it. At this level, the DAG doesn't really say anything about what code gets executed or where it is executed. It's mostly about how the data scientist wants to structure their code. This concept of a DAG, its utility is predominantly in helping data scientists organize their work conceptually.

Unbundling the DAG

If we zoom in a bit into this notion of a DAG, there are three distinct layers that we'll see. There's this fundamental layer of architecture, where the user has defined what code needs to execute. Then there's this encompassing layer of a job scheduler, which dictates how the code will be executed. Then we have this layer of compute, which dictates where the code is going to execute. If you look around in many existing systems, they require a tight coupling between these layers, which was often necessitated by infrastructure limitations that predated the cloud. As an example, the user may have to specify their code using a custom DSL, which limits the work that they can do. This DSL may have to be executed with a built-in specific scheduler that is then again very tightly coupled with the compute layer. While this tight coupling may be justified for domain specific use cases, say, high performance computing, when you're building infrastructure to support hundreds of different use cases, from natural language processing to classical statistics, you'd ideally want these layers to be decoupled so that the user can choose which layer to use when. This was inherently the motivation for Metaflow.

Unbundling the DAG with Metaflow

With Metaflow, our data scientists can architect their modeling code in languages and libraries that they are familiar with. They can leverage the rich data science ecosystem in Python and R without any limitation. The user's code gets packaged for the compute layer by Metaflow, so that the user can focus on their code rather than writing Docker files. Finally, the scheduling layer takes care of executing the individual functions using the compute layer. From the data scientist's point of view, the infrastructure works exactly as it should. They can write idiomatic modeling code, use familiar abstractions, and the code gets executed without hassle, even at massive scale.

Architecting Flows

How does it work in practice? There are many ideas that make complete theoretical sense but might fall flat when they face reality. Let's look at Metaflow's programming model and see if it actually really works in practice. Data scientists when they use Metaflow, they can structure their workflow as a directed acyclic graph of steps as depicted on this diagram. The steps can be arbitrary Python code. In this hypothetical example, we have a variable x that is incremented by two steps, A and B, A incremented by two, B incremented by three. Both of them execute in parallel because if you notice, step A essentially transitions into two steps in parallel, step A and step B. In the join, we essentially take the maximum of those two concurrently existing values of x. In the case of Metaflow, defining a DAG is as simple as annotating your functions with the @step decorator for nodes of your graph. For edges, you can simply specify the transitions using the method calls. Given that any of these steps are just executing arbitrary Python code, you can literally use any library available to you in the Python universe. Metaflow by itself does not place any constraints on that. At Netflix, we even do have a significant number of users who are committed to Erlang. We do provide a similar package in R as well for them with exactly the same set of capabilities. Also, note how the value of x was available both to step A and B, and then two copies of x, one with the value of 2 and the other with the value of 3, are now both available concurrently in this join step. Metaflow is essentially taking care of state transfer out of the box so that the user doesn't have to worry about any of the data flow details. This paradigm by itself, this notion of arranging your work in this DAG paradigm is really powerful, since it very simply allows our users to properly organize and visualize their work that gets them a lot further than their imperative style of writing code.

Specifying Compute

Once the user has specified their workflow, there could be a scenario that they lack the ability to execute some or all of the steps of their workflow on their laptop. This is pretty common if you need to train a model that needs access to GPUs, or maybe it needs access to many more cores of CPUs than you have already available to you. Maybe you want to process a huge data frame that needs 200 gigs of RAM, and your laptop just is bottleneck that 16 gigs of RAM. In Metaflow, the user can easily declare the compute layer for their steps, and Metaflow will ensure that these steps get executed on that compute layer. In this example, specifying the @resources decorator is akin to the user saying that this step needs to execute on a cloud instance with four GPUs. In this example, the user never had to specify how their code is getting packaged. How are they baking a Docker container? If let's say, eventually, their code is running on top of Kubernetes cluster, they don't have to deal with a Kubernetes API, or even when we move a compute unit from their laptop to the cloud, they still need access to the raw data. Metaflow takes care of moving around data, dealing with the underlying compute layer behind the scenes for the user so that the user doesn't have to focus on that. Many times, it's not only the resources that might be a constraint, maybe you might want to process some sensitive data that you don't have access to on your laptop. At Netflix, this is a very common use case. At Netflix, some of the teams have built this amazing compute system called Archer, which provides access to all the media files, all the raw footage from our TV shows and movies in a secure environment. In this scenario, in this DAG that you have, maybe all you want to do is some summarization on some media files in your start step, but you don't want any of your other steps to execute in this secure environment that Archer provides. You can now very simply just annotate one of your steps with @archer and then Metaflow will figure out how to package your compute and execute the compute on Archer in a very secure and stable manner.

Executing Workflows

Once the user has specified a workflow, orchestrating the execution of the DAG belongs to the job scheduler layer. The scheduling layer doesn't need to care about what code is being executed. Its sole responsibility is to schedule the steps in the topological order, making sure that a step finishes successfully before its successors in the graph are executed. Metaflow ships with a local scheduler, which makes it easy to test workflows locally on laptop or on the cloud. While the built-in scheduler is fully functional, in the sense that it executes the steps in a topological order and it will handle workflows with tens and thousands of tasks. It lacks support for triggering workflows for alerting and monitoring on failures by design. Since we've been talking about the value of interchangeable layers so far, rather than build yet another production grade DAG scheduler by ourselves, we can very simply just combine this DAG into something that a production DAG scheduler understands. Most DAG schedulers, at the end of the day, they are executing a graph. We already have a representation of the graph from the user. It should be rather straightforward for us to write a compiler that can compile that down to something that the scheduler understands. Another benefit that I want to highlight here is that when we were building Metaflow, we wanted to provide strong guarantees about backward compatibility to the user-facing API, the API against the user's code. The user can now write their code confidently knowing that Metaflow will schedule and execute their code without changes, even if the underlying scheduling and computing layer evolves over time. For example, if you have specified @resources decorator, we might be launching that step on a Kubernetes cluster, and that Kubernetes cluster might evolve. The underlying APIs might change. The user doesn't have to take any explicit action on their end, because Metaflow will take care of evolving along with the underneath line layers while providing a consistently stable API to the end user. This eases the burden on part of the user where they don't have to go through a migration pain as the world evolves beneath them.

Executing Workflows Locally

Here's an example of executing the flow locally using Metaflow's built in scheduler. Metaflow will validate the workflow, making sure it's a well-defined workflow. It will assign the execution a unique ID so that you can inspect the state of the execution at any point of time in the future. Any steps marked with a specific compute environment, say @resources, @archer, or maybe you've written your own decorator, they will get executed in those environments, and Metaflow will pipe back the logs to their console. If the user specified 200 gigs of RAM with the @resources decorator for one step, to them, it would feel like suddenly somebody came in, swapped out their laptop with a bigger laptop that had 200 gigs of RAM. They didn't have to take any specific action besides specifying that decorate. This is incredibly productive for our users, because then they can squarely focus on their business logic on their code, their machine learning training code, and all the underlying systems engineering nitty-gritties are just abstracted away by the library.

Deploying to Production Schedulers

For integrating with production schedulers, currently Metaflow has two integrations. In open source version, we have an integration with AWS step functions, which is a managed offering by AWS. It's highly available. Scales out incredibly well to thousands of tasks in a given workflow, as well as thousands of concurrently running workflow. More importantly, it's a managed service by AWS, so it has zero operational burden. With one single CLI command, python step-functions create, you can export your Metaflow workflow to step functions. Metaflow will, behind the scenes, compile your workflow into a language that step functions understands. You as a user, you have literally no need to familiarize yourself with the documentation of the SDK that step function ships or understand any of the nitty-gritties. All of those concerns are abstracted away. The great thing about a scheduler like step functions is that it integrates further with the rest of the AWS infrastructure, so you don't have to take any explicit action. All of your logs would be made available in CloudWatch, all of the alerting and monitoring is available to you out of the box. We also have a similar internal integration with Netflix's meson, which is a production scheduler. It's internal to Netflix. We haven't been able to open source that yet. That's where most of our machine learning pipelines execute. Again, our users don't ever have to worry about the API that meson exposes or how it evolves behind the scenes. Just very recently, the team behind meson did a migration where they migrated all of their users from one version of their SDK to another. None of the users of Metaflow had to take any specific action to migrate all of their workflows, just migrated behind the scenes, because all we had to do was change the API integration that made up meson.


The primary takeaway is that problems are solved by humans and not tools. We should build our tooling with human centricity, keeping our end users at the center. Our users, they should focus on the details of their ML work and not on the details of compute and scheduling layer that sits beneath them. Decoupling the architecture, job scheduler, and the compute layer, that totally makes sense, at least for our use cases.

Cloud Sandboxes

If any of these ideas resonated with you, Metaflow is an open source project. We also have complimentary sandboxes available for you where you can try out Metaflow's cloud integrations. You can sign in with your GitHub ID, and we'll provision you an AWS account with all the infrastructure set up for you so that you can essentially figure out that everything I spoke about, actually really works. We have lots of documentation available online.


See more presentations with transcripts


Recorded at:

Jul 23, 2021