
Human-centric Machine Learning Infrastructure @Netflix


Summary

Ville Tuulos talks about the choices and reasoning behind the tooling the Netflix Machine Learning Infrastructure team built for data scientists, and some of the challenges and solutions involved in creating a paved road for getting machine learning models to production.

Bio

Ville Tuulos is a software architect in the Machine Learning Infrastructure team at Netflix. He has been building ML systems at various startups, including one that he founded, and large corporations for over 15 years.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thanks for joining me today. You actually made a great choice. So this is the first time we are talking about this piece of infrastructure, so this is like an unveiling of a new Netflix original. So I hope it will be exciting. So first, just to give you an idea, how many of you are data scientists or work with data scientists closely? So, hands up. Okay, so good number of people here. So I guess it goes without saying that data science is a very hot area right now. There's a lot of exciting stuff happening, a lot of job opportunities for data scientists. So it's a good time to be a data scientist today. And I think it's a great time to be building machine learning infrastructure as well. So I've been doing this for a long time. I am really passionate about building machine learning infrastructure.

At the same time, I do have this nagging feeling that eventually I will go unemployed, and this won't be needed anymore. This talk should have a best-before date, maybe over the next 5 to 10 years. So maybe by then it will be irrelevant; I hope it will still be relevant in 2027, 2028, something like that. And the reason for that is that there is the pendulum of computing, and we start with something that initially is a technical problem. If you go back to, let's say, '97, and you wanted to set up an ecommerce shop, it was a technical problem: setting up all the Apache servers, and racking and stacking the machines, because the cloud didn't exist. And it was really a technical problem making it scale. And then over time, it became more and more of a human problem. So today, if you wanted to set up an ecommerce shop and you pitched your idea to a VC, you wouldn't be pitching it based on the idea that, "Oh, I can build scalable web services." Because, yes, I mean, you can use Google App Engine to do that these days. It would be more about, "Well, do you actually understand your customers, and how do you compete against Amazon?" And it's the same deal with machine learning infrastructure.

This is the sweet spot in history, where everybody agrees that we have a problem. And I guess it's always the first step to admit that we have a problem. And the problem is that we want to make data scientists more productive, and we want to make it easier to apply machine learning to different business problems. So those two things. And given that I know that this is something that the industry will solve over the next 5 to 10 years, we wanted to be a bit forward leaning at Netflix, really focusing on the problems that we need to be addressing over the next five years. And I feel that those are more on the human side of things.

So of course, your mileage may vary. I mean, at your company, it might be more technical in nature. It might be about scalability of machine learning models. It might be about lowering the latency for real time predictions. And those are all good and valid problems, and I'm actually happy that there are many companies solving these problems, for us as well. But by and large, for most companies, these issues are becoming human issues. And hence the title "Human Centric Machine Learning Infrastructure."

Caveman Cupcakes

Before coming to Netflix, I used to work at startups for a long time here in San Francisco, building machine learning infrastructure. And hence I want to start with a startup story. This one is called Caveman Cupcakes. Imagine, it's an amazing startup. They make these gluten-free, paleo, vegan, sugar-free cupcakes. Really good for you. And then here's the founder. And the proud founder of Caveman Cupcakes has learned that data science is really, really important for the success of any startup, any company these days. So the founder wants to go ahead and hire a data scientist. And now imagine that he gets a referral for this amazing person, this amazing individual, Alex. And Alex is this full-stack, old-school data scientist who can do everything from low-level coding all the way to writing academic papers, and anything and everything in between. So he gets hired. And the first problem that the founder wants Alex to solve is to come up with a dynamic pricing model. It happens that the market for cupcakes is really competitive and the margins are thin. So they really need to optimize the margin, so come up with a pricing model.

And now, well, as always in data science, we can make some silly assumptions to make the math easier. So let's just say that the optimal pricing model can be represented by a single Gaussian. It's a silly assumption, but let's go with that. And now Alex starts thinking about the problem, and asking around and collecting some data and doing some prototyping in a notebook. And he comes up with an approximation, the black line here, that's pretty close to the optimal pricing model. And now he's able to come up with the [inaudible 00:04:49] solution and he makes an implementation in C++.

So he also is a software engineer. So he can do this in C++. And they deploy this in production and the results are great. So they are able to increase revenue by 15% and the founder is very happy. And now, as it oftentimes happens when you make something really successful, people come to you asking for more of the same good stuff, and now the founder comes and asks, "Can you predict churn as well?" So at some point, people stop eating cupcakes, and they want to prevent that: "Okay, people shouldn't churn."

So now Alex thinks about the problem and he realizes that the churn problem is pretty close to the original pricing problem. So not exactly the same, but close enough that he is able to take his existing solution, the solution he wrote in C++, and generalize it so that it covers both the use cases. So he's a good software engineer so he knows that he wants to build this generalized solution. So he builds something that covers both the use cases pretty well. So he extends his code and deploys it to production and the founder is, again, very happy, now able to prevent churn really effectively as well.

And now, again, there are more and more requirements. So the next thing the founder asks Alex is: "Now that our foundational business is really in a good shape, we want to expand, so we want to spend more on marketing. So can you come up with the attribution model for marketing?" And now this is challenging. Well, first, Alex isn't an expert in marketing, so he needs to read some literature and understand what people have done in this field. But also the problem is quite different from the other problems that he solved previously. So now he really, really wants to generalize his really awesome piece of C++ to all these problems, and he tries to come up with a solution, but it's kind of a watered-down version. And also, he just didn't understand the problem well enough. So now when they deploy this in production, the results maybe aren't that great.

So now imagine that instead of hiring Alex, the founder would have done something different. And in this case, instead of hiring a single individual, he would have hired a team of specialists. Maybe one of them has a PhD in economics, and maybe she's really great with pricing models, and another person is really good with marketing. And the third person is really good with, I don't know, churn models. Now, when the team of data scientists sees these problems, each one of them has the domain expertise to come up with a really good solution for each of these problems. So they get really a good version for the attribution problem, they get a good model for the pricing, and so forth.

But now, what is likely to happen, is when you hire specialists, data scientists who really know these fields, they might not be experts in software engineering. So it might be that they have used R. Like we heard in the previous presentation, R is an amazing language for doing data science. So it's a great choice. And by all means, you should be using R, let's say, for the pricing model. And the same thing with Python. Python has an amazing ecosystem. So it would be silly to say that you can't do this in Python. But now, it is of course possible that these solutions, when it comes to software engineering and DevOps best practices, aren't quite as great as what Alex wrote, who had a lot of experience with production systems and so forth.

So now, if we compare these two different approaches, we compare Alex to the team. With Alex, the code was great and the models were actually pretty great; well, he didn't quite understand the marketing domain well enough, but it was all-around great. But it was obvious that as the business scaled, and as the problems became more diverse, his limitations started to show. So you can't scale a large organization based on these unicorn, amazing individuals. It's just not a good idea in the long term.

And then when you have a team, I mean, you can afford to hire a team of specialists who really know these fields and overall are really strong data scientists. But then it might be that they don't have the software engineering background to make the solutions production-ready. And this is probably something that most of you hopefully feel is pretty natural and happens oftentimes. And, you know, this has happened many times. Again, going back to the example of web services, this has definitely happened before: we start with some technical field that requires really deep understanding and low-level coding, and then over time, it becomes a higher and higher level activity. And now data science is definitely moving higher and higher up in the stack.

Well, we do have this operational issue: how do we operate these things in production? And one way we have historically solved problems like this is that we just provide a high-level abstraction. So we just provide some amazing machine, some amazing, let's say, data science machine that hides all the nitty-gritty details of how you do distributed computing and how you optimize your models for GPUs and SIMD and so forth. And then as long as you are operating this machine, it takes care of everything, and all is good. And now your data scientists, all they have to do is to work within the constraints of the machine, and it takes care of everything for you.

And now in the context of data science, there are attempts to make such machines. So it could be Spark, it could be TensorFlow, it could be RStudio; many attempts to do that. And of course, if you ask them, it might be that they are not like trying to solve everything. But the mindset is that, well, as long as you are working within the constraints of the machine, everything is good.

But at Netflix, one challenge that we have seen is that when you have a very, very diverse set of problems, we are not just talking about one or two, or maybe not even three different data science problems, but you want to apply machine learning and data science in every single area of your business, it's very, very hard to have a single abstraction that works for everything. And as software engineers, we have, of course all seen this happening many times in different areas of software engineering, not only data science- that there are many, many promises, that if you use this framework or if you use this system, it just magically solves all your problems. And it never works.

Build

And hence, if you take a different mindset, instead of putting your data scientists to work on this data science machine and saying, "well, this is all you can do," you can take the opposite viewpoint: what if you have total freedom, you can do whatever you want. Absolutely whatever you want. Total freedom. And I don't know if you saw, yesterday, the talk about the microservice lessons learned at Google; the presenter, Ben Sigelman, I guess was his name, had an excellent example about ants and hippies. So I guess this would be the hippie version of things. So you do whatever you want. And it sounds great, although, as the presenter yesterday pointed out, there's a heavy tax. And especially as the organization grows, there's a heavy tax when you let people do whatever they want.

So, of course, at Netflix there are a few things that are given. So for instance, if you are a data scientist at Netflix, we think that you should be probably using Netflix data at least. You don't come with your own data and say "Oh, yes. I know, I mean, I'm here and this is my first day, and by the way, I am only analyzing my own data." So you get Netflix data from the Netflix data warehouse. I haven't met a single data scientist who wants to design their own data warehouse. So the data is in S3. I mean, "Yes, I'll be using that."

In the same way, actually, there have been two data scientists who have said that, well, they would like to build their own machines so they get exactly the GPUs they want. But most people are just happy to get the compute resources off the shelf. So you say, "I want 400 gigs of RAM. I want two GPUs, and where they come from, I really couldn't care less. I mean, it comes from the cloud, whatever." So compute resources are a given. Well, now, when you have data and you have compute resources, the next problem is that you can't just push stuff to, let's say, some magical cloud and expect that everything works. So you need some kind of a system to orchestrate the jobs and maybe execute the thing every day, and so forth. So you need some kind of a job scheduler. Not very interesting, maybe, from the data scientist's point of view. So again, most people say "sure." It's like, that's easy. Let's just assume that we have a job scheduler.

Now, versioning is a funny topic. So it's like you know that they say that you should be flossing every day and you should be eating properly, and all the vitamins and so forth. Everybody knows that you should be doing it and still it's actually not that exciting. In the same way, all responsible data scientists know that they should be versioning their code. They should be versioning their notebooks. They should be versioning their experiments. And maybe they even should be versioning their data. And again, no one does it since it's just not easy enough. And again, if someone, if a magical versioning fairy came to you and said that, "Yes, we will just version everything for you automatically," yes, people would be very happy. No doubts about that.

Now, again, this is another thing that people just take as given: "yes, sure, you are not an island, you are not a single data scientist. You work as a part of a team." There are collaboration tools like Slack and maybe Bitbucket, GitHub, and so forth. So again, sure. People very much agree that these are good ideas and don't challenge that we should be using them.

Now getting to a more sensitive area. You know that at some point, you need to deploy your models. And if you are a data scientist coming straight from academia, it might be that you have never deployed a model in production. You have written papers, you have created prototypes, you have run experiments, but you have really not deployed models in production. And if you imagine a large company like Netflix, and it probably applies to many of your companies as well, deploying anything to production is a non-trivial undertaking. So again, if someone says that we have this CI/CD system that helps you to deploy something to production, people generally are pretty happy with that. Although how exactly you deploy and how you monitor your models is somewhat a data-science- or model-specific question as well.

And now moving higher up in the stack, we come to the question of feature engineering. Many of you may have followed news and articles about data science; every now and then there are these articles that say that 80% of data scientists' time is spent on data monitoring or data transformations. And somehow, it's usually stated in a pretty negative tone, as if it were somehow a secondary activity, in contrast to something that's supposedly more awesome, I don't know, maybe modeling. I feel personally that this is really an area where data scientists can give a lot of value-add to businesses, understanding both the business domain as well as the model. So I think it's valuable. But there are good examples of infrastructure out there, like Uber's Michelangelo (I believe there's a talk about that later today), that show that providing features as a service can be pretty useful as well.

And then finally, we come to the question of the actual modeling activity. So ML libraries. So is it so, that the company should provide you, let's say, five different models, and as long as you are using one of these deep learning models, or maybe one of our implementation of random forest, then you're fine? Or should it be so that you can choose to use whatever you want? And now, how we feel about this, is that if you are a data scientist, you probably should care quite a lot about the ML libraries. After all, I mean, that is your domain of expertise. And maybe you should care about feature engineering and then, to a degree, maybe you should care about model deployment.

But as we move downwards in the stack, it's understandable that, well, maybe you shouldn't care so much about it. After all, you have a limited amount of cognitive bandwidth, so you should focus your creative energies on the top of the stack, rather than the bottom of the stack. And my team is an infrastructure team. So we view ourselves as a very complementary team to data scientists. So we should be caring about the things that the data scientists don't care about. And correspondingly, although the name says machine learning infrastructure, which is a bit paradoxical, we actually want to avoid doing machine learning, because that's what data scientists do. We focus on infrastructure; that's why we want to care about all these other things. And then, of course, over time, we do want to provide good tools, good opinions, paved paths for doing things like model deployment and feature engineering. But by and large, those should be the domain of expertise for data scientists.

Deploy

So most of the things that you see here on this slide are related to building models. And then there is the question of deployment. After you have built something, maybe leveraging the existing platforms to build it, how does it actually start producing business value? And that is really the hard part: although everybody says that machine learning is amazing and it is the new oil or whatever, the fact is that it doesn't have any value at all unless it is connected to some existing business processes, and the business is ready to consume the results.

I mean, when we say deploy, we don't only mean deploy in the sense of just running it on a daily basis, but really: is there actually someone in the business who can change their business decisions based on the model? And that is hard. And it's hard because of technical reasons. As I mentioned, just operating things in production is nontrivial. But it's also hard psychologically. There's the old saying that no plan survives contact with the enemy. And we can extend this to data science by saying that no model survives contact with reality.

So you're a data scientist, you have been building your amazing model, like in a notebook for two weeks, and you're super happy with its beautiful math, and the test data works great, and the cross-validation gives really promising results. And that's the happy face, that's the happy part, that's the part that you really enjoy. And you have the nagging feeling that there's a chance that once you deploy to production, the results aren't that great. And if you have any kind of technical excuses for not pushing it to production, there are many convenient reasons just to stay in the happy la-la land, and keep iterating on the theoretical model, and avoid that humbling moment when your baby just doesn't survive in reality.

So what we want to do is remove those roadblocks. We want to remove the excuses for not deploying. So putting these two things together, what we want to do, as the machine learning infrastructure team, is help data scientists build models, especially focusing on the parts of the stack that are not that interesting for data scientists, and that are not in their core sphere of expertise. And then we want to help them deploy, so that we actually get some business value out of the work that they are doing. And also, they get to actually start working on that exciting part, which is that, well, it really, really works. It doesn't just work theoretically, in my notebook, but it really works in production.

Netflix Research

So that gives you the broader philosophical context of why we are doing this and why we are thinking about this in terms of being human-centric. And in the Caveman Cupcakes example, I told you that there were three different business problems. There was the pricing, and there was the churn, and there was attribution. Now you can imagine that as the company grows, as the company becomes more like Netflix, this really ambitious global company that's growing fast, the appetite to apply machine learning goes through the roof.

So on Monday, you may have seen Justin's talk about multi-armed bandits, and how we can use them to optimize artwork. When you log into Netflix and you see those titles, you probably know that it's driven by a machine learning system. They are personalized, and we have a really mature recommendation engine for doing that. And that's great. That's the crown jewel of Netflix. But at the same time, it's the tip of the iceberg. There's so much that's happening behind the scenes that not many people know about. Many of these things, and this is only for you, my friends, involve an amazing amount of machine learning happening behind the scenes.

So there are super exciting topics for machine learning, like analyzing the screenplays, the textual content of the screenplays, to understand if there will be demand for a movie or a series like this. Or, when Netflix is producing originals, how do we optimize the production schedule so that everything happens at the right time? It's really a complex operation when a movie is being made. Or overall, when we are licensing content, how much should we be paying? All those decisions are driven with help from machine learning models.

And then there are things that are more operational in nature. You may know that Netflix operates one of the biggest content distribution networks on the planet, the Open Connect devices. And how we place titles on the caches of these devices can also be determined with a machine learning model, as can how we actually operate our platforms. We can make intelligent platforms by really understanding the usage patterns, or, when you are streaming something, what the optimal bitrate should be depending on your network conditions, which are of course highly volatile. Again, it can be a machine learning model. And obviously, there are things like fraud detection and analyzing marketing. There are literally hundreds and hundreds of problems that we're trying to solve using machine learning.

And now, this may give you some understanding why we feel so strongly about giving our data scientists freedom to choose the models that work best for these different problems. Because, as great, let's say, as deep learning has become over the past years, it is a stretch to say that we could solve everything using TensorFlow. Or it is a stretch to say that we could solve these problems using any single model. So it very much makes sense to give people freedom. And it very much makes sense then to have infrastructure that can support that freedom.

So now, besides all those different data science problems and interesting challenges that we have, Netflix, of course, especially at this conference (I am sure some of you have a drinking game about Netflix presentations and how often these different things are mentioned), is very well known for having very mature infrastructure and mature platforms for many different use cases. My team works as a part of the data platform organization. So this is a bit of a data-platform-centric point of view. And for instance, Netflix probably was one of the first companies to really start leveraging this idea of using Amazon S3 as a data lake. So all the data goes to Amazon S3, and then there are different systems to query data from S3. It works really well for us. And on top of S3, we use different query engines like Spark and Presto, and so forth.

And then as our container management and orchestration system, we have this project called Titus. I believe that there was a talk about that yesterday. It's open source, but, I mean, you may have heard of Kubernetes; it's in the same space and works in a similar way. So if you think of this stack in terms of your company, probably you have some kind of a data lake or data warehouse, probably you are using some query engine like Spark [inaudible 00:24:05] Presto. You may be using some container orchestration platform like Kubernetes. And then on top of that, you most likely have some kind of a job scheduler, to manage, say, ETL pipelines or something of that sort. And Airflow is a really popular tool for that use case. We are using a system called Meson. There have been a few presentations about that. If you're interested, there was one at AWS re:Invent last year, so you can find it online.

And then finally, on top of these foundational pieces of infrastructure, we have our notebook infrastructure. And there were actually recently two really great blog articles in our tech blog about how we run all the ETL pipelines, really key pieces of infrastructure on top of Jupyter notebooks. And we are contributing to this open source project called Nteract, that's a nicer version of Jupyter notebooks. And now, inside these notebooks, if you are a data scientist at Netflix, you can then like do whatever you want. So you can use any ML libraries, so you can use TensorFlow, XGBoost, and so forth.

So I joined Netflix a little bit more than a year ago. And this was the situation. This is what I saw when I came. And looking at this, I was thinking that there is nothing to be done. I mean, they made a mistake hiring me, since technically, everything is possible already. So if you wanted to run distributed TensorFlow, it's actually quite doable using Titus. Or if you want to use Spark MLeap, there's really mature Spark infrastructure. So there's really nothing that's needed. Technically, everything is possible already. But the challenge was that nothing was easy enough for data scientists.

Data Scientist Built an NLP Model in Python

So there was data, there was compute, there was prototyping. You could build models. But doing this wasn't really as easy as it could be. And I can tell you a funny story; well, I don't know, maybe it is not the funniest story. Maybe it's a sad story. But I was involved in this project. And it was really a great, great project. It's still running. So we had a data scientist who had this idea that we want to analyze the tweets, what people are tweeting about Netflix. It happens that Netflix has really active Twitter accounts, and understanding how people react to different shows, and what the sentiment is about different shows, is really useful knowledge for us, so that we can make sure that we are not producing stuff that people don't like.

Also, a nice thing about tweets is that they are pretty short. So I guess it's like 160 characters or something of that nature. So even if you had a million of them, it would be just maybe 160 megabytes. So you can have it in a dataframe, in a pandas dataframe, in a notebook. And then you can take an off-the-shelf NLP library. There are many of them these days. I think we used Gensim for doing topic modeling. And we applied this in a notebook. And it's really like a schoolbook example of really awesome and fun data science. A nice dataframe of real-world data, and really nice models; all looks good and fine. So this was the really fun, easy part.
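To make this concrete, here is a rough sketch of the kind of notebook prototype described: tweets in a pandas dataframe plus an off-the-shelf Gensim topic model. The column name, the toy tweets, and the parameters are illustrative assumptions, not the actual project.

```python
# Hypothetical sketch of the notebook prototype described above; column name,
# toy tweets, and parameters are illustrative, not the real project.
import pandas as pd
from gensim import corpora, models

df = pd.DataFrame({"tweet": [
    "loved the new season, binged it in one night",
    "the finale was such a letdown",
    "great soundtrack and beautiful cinematography",
]})

# Minimal tokenization; a real version would clean the text properly.
texts = [t.lower().split() for t in df["tweet"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Off-the-shelf LDA topic model from Gensim.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```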

Well now, the first realization after building this notebook was that, well, actually, we don't need only a single model, but we do want to have slightly different models for different cohorts of titles. So we needed multiple models. And instead of trying to train all of them in a single notebook, we want to farm it out to Titus, the container management platform. And now, the thing is, that this shouldn't be too hard. So there's actually a Python client for Titus. So pushing stuff, running these containers on Titus shouldn't be really too hard. And it wasn't, so the data scientist was able to do this maybe in a week or so.

Well, now, a challenge was that the original notebook version used Presto to fetch the data, like from the data warehouse. So there was a single SQL query that got executed. And now, when you had 100 of these models being built in parallel, a challenge is that you have now 100 queries going to Presto, and obviously, they get queued up, and then you won't be getting the results as fast as you got in the single model case. So it was a bit slower than what we wanted. And now, working with the data engineer we were able to organize the code a bit better. So we were able to ingest the data a bit faster, even with these parallelized models.

Well, now we had the models being built, the data was flowing. So the next question was, how do we update this thing daily? Obviously, we want fresh results every day. Tweets come out in real time, so the faster we can update this, the better. Now, we do have this job scheduler. So we want to use the existing job scheduler. And there's plenty of documentation and examples of how to do this, so the data scientist was self-sufficient enough to learn how to do this and push the thing to the job scheduler, right? It maybe took two weeks and a few iterations to figure it out. But again, I mean, quite doable, thanks to all the existing infrastructure.

And then, interestingly, this was maybe the hardest part of the process: once we had the models updating daily and everything was good, you now had these model files somewhere in S3, a new one every day, but they're really not doing anything good for anyone. So we wanted to start exposing the results to the business stakeholders. And we realized that in this case, the easiest way to do this is through a custom UI. And in order to create a custom UI, you need a front-end engineer. You need someone to build the backend for the web application, and, well, you still need the data scientist to help with the model. And we also had the data engineer to help with the data pipelines.

So we had four people plus the business stakeholder now to eyeball the results and say if it's useful at all. So we had five people in the meeting going back and forth, whether it should be modified. And we kept iterating on this. And now all together, taking this project from start to finish took about four months. And now, if you think about it, this is actually like, if I had presented you this maybe 10 years ago, this is pretty fancy stuff. We're taking real time tweets. We are building natural language processing models, understanding the sentiment of these tweets. We have a custom UI. The models update daily. It's a pretty non-trivial thing.

So in that context, in that light, four months isn't that bad. If this was a startup and you did it in four months, well, it's really not too shabby. But still there was the nagging feeling that this could have been easier. Each one of these things required the data scientist to learn about all these tools. And as an individual thing, nothing was too hard. But holistically, the whole project took a bit too long. And it wasn't only about time: if you look at the code base, 60% of the code was related to infrastructure, and 40% was about data science. And now, as software engineers, of course, you think that, "well, if the code is architected well enough, you have the infrastructure part there and you have the data science part here, and it's still pretty readable." But remember that the data scientist who did this might not have been really a professional software engineer. So it maybe wasn't as well architected. Also, it was a pretty organic process, so it wasn't perfectly designed up front. So now you have this code where 60% is something that's not data science, it took four months, and there was the feeling that this could have been easier. And actually, the worst part was what happened when we started actually using this. And that was really when the hard questions started coming in.

So now suddenly, we were wondering, well, I mean, the business stakeholder was saying, "Do these results make any sense? These look pretty strange." And now we obviously need to monitor this model in production: is it actually converging? Is the model even working correctly? And then one day, it failed and the data scientist was like, "Well, yes. No, I mean, I have the red light in the scheduler. So what should I do? And how do I test this locally? And how do we debug it?" And then when we had another data scientist who wanted to contribute something to this project, there was the question that, "Yes, you can take this project, but be careful not to run it because it might overlap with the production." And well, I mean, it's hard to experiment with that. And then once we fixed the model, there was the question, "How do we backfill it?", like backfilling the historical data based on the fixed model.

ML Wrapping: Metaflow

And after having a few of these experiences, we realized that, yes, we are indeed missing a piece of infrastructure. And this piece of infrastructure is a bit meta, and hence the name "Metaflow," in the sense that it doesn't do much by itself in a way. It just glues together all these existing pieces of infrastructure, all these solid foundational components that Netflix already has. And probably your company has them as well. You probably have S3, you probably have Spark, you may be using Kubernetes, maybe you're using Airflow. So it is actually a pattern that you can replicate at home.

And now, if you think about this, we didn't want to build a data science machine. We didn't want to say that "we are building this machine learning infrastructure, and as long as you are just using the Metaflow API, you don't have to worry about anything else, and we just magically make everything work." We didn't want to take that mindset, since we knew that given the diverse set of problems that we have, it's not going to work. In the previous presentation, we had really good examples of design thinking. So we really took this very user-centric design approach: well, we don't want to hide these pieces of infrastructure per se, but it's more like, think of them as ingredients.

So if you go to a fancy restaurant, you get the plate. On the plate, you may see that you have a piece of asparagus and a nice chunk of fish, and maybe some squash, or whatever. And you can recognize these pieces. But somehow the kitchen, the two-Michelin-star kitchen, did some amazing value add on top of it, so that now, suddenly, it's worth 100 times more than each of these ingredients by themselves. And that's the mindset: we don't want to hide the details, we want to make it easier to navigate this complex landscape of different pieces of infrastructure, and just do a bit of strategic value add here and there, so it's easier to run data science on this infrastructure.

How to Get Started?

So I want to spend the rest of this talk talking about what we actually did. So you probably are interested: how does it actually work in practice? As I mentioned before, there are two main activities that we want to support. One is build, the other one is deploy. So let's start with build, taking the very high-level view at first. This is like a good starting point for data science: you get some input data, you perform some computation, you produce some output data. And you can trivially model this as a function, let's say in this case, in Python. So you have a function. It takes some input, it performs some computation, it produces an output. And now, obviously, you can run this piece of Python code on the command line. It's simple enough. And we assume that if you are a data scientist, you can do this. It's not too hard. There are plenty of examples out there. And now, when you actually extend this example here to do something practical, what happens is that instead of having one line of code, you may have 500 lines of code, or maybe you get a thousand lines of code.
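As a minimal sketch of that starting point, before it grows to 500 lines, the whole thing can be just one plain Python function run from the command line; everything here is illustrative.

```python
# A plain Python function: input data in, computation, output data out.
def train(input_data):
    # Stand-in "computation": in reality this would be your modeling code.
    return sum(input_data) / len(input_data)

if __name__ == "__main__":
    # Run it from the command line: python train.py
    print(train([1.0, 2.0, 3.0]))
```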

How to Structure My Code?

So now you have your maybe train function, or maybe your score function. And it's 500 lines of code doing its thing, maybe in a notebook. And now when you have one function that's 500 lines of code, there's the inevitable question that, well, probably it shouldn't be a single function. And then there's the question that, "Well, I mean, should it be multiple modules? Should it be multiple packages? Should it be classes? Should it be something else?” And what we say is that, "Well, let's make it a DAG. Let's make it a graph." And modeling data science workflows as a graph is nothing controversial. So this is what TensorFlow does. This is what many other frameworks do. But this is a bit at the higher level.

So the idea is that you take your 500-line function, and you split it into parts so that each of these steps in the graph does one logical piece of your computation. Like here, we could have a start step where we maybe fetch some data; then maybe A trains one version of the model and B another one; and then we compare, we choose the best model, and so forth. And how we do this is that we want to stay really out of your way, and hence, you can just take your existing Python code and just add a few annotations. So you add the step decorator and it becomes a node in the graph, and then you specify the transitions using self.next(). So, simple enough. Going from your original notebook version to this is really not too hard. And then you can run it locally in the same way as you did before. So if you have ever run Python code before, going from your old way to this is really, really easy. Well, that's with the caveat that you actually use Python. We do have a good number of people who use R. So what we did is that we provided the same functionality for R as well.
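As a sketch of what this looks like in code, here is a small flow written against the Metaflow API as it appears in the later open-source release; the internal version at the time of this talk may have differed in details, and the step bodies are placeholders. Running it locally is the same kind of command-line invocation as any script (for example, `python cupcake_flow.py run` in the open-source release).

```python
from metaflow import FlowSpec, step

class CupcakeFlow(FlowSpec):

    @step
    def start(self):
        # Fetch or prepare some data.
        self.data = [1.0, 2.0, 3.0]
        # Branch: train two alternative versions of the model in parallel.
        self.next(self.model_a, self.model_b)

    @step
    def model_a(self):
        self.score = sum(self.data)      # placeholder "model"
        self.next(self.choose_best)

    @step
    def model_b(self):
        self.score = max(self.data)      # placeholder "model"
        self.next(self.choose_best)

    @step
    def choose_best(self, inputs):
        # Join step: compare the branches and keep the better score.
        self.best = max(i.score for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best)

if __name__ == "__main__":
    CupcakeFlow()
```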

How to Deal with Models Written in R?

So now, if you are in RStudio, you can define your graph in RStudio, and you can run it in RStudio, or you can run it on the command line using Rscript. And the nice thing about this is that we didn't have to re-implement the whole thing in R; R is sitting on top of Python, thanks to this amazing library called reticulate. So R calls Python. It's high-level languages stacked on top of each other. From a performance point of view, you might think that it's a nightmare. But from a productivity point of view, it's actually pretty useful.

And now just to motivate you that if by now you think that, "Well, I mean, this seems so trivial that I'm not interested anymore." I'm just trying to keep you awake a bit. People actually do use this. And since we started, we have now 134 projects using this framework at Netflix, and people are really happy using it and the adoption seems to be pretty healthy. And it's not because you can write DAGs. That's table stakes. But there are a few other features that are really useful.

How to Prototype and Test My Code Locally?

So here's another one. It's really one of the features, small thing, but I'm really proud of it. So I used to use Luigi. I don't know if anyone has used Luigi before. It's another one of these open source [inaudible 00:38:23] by Spotify. I really like it. It is similar to Airflow. And what we ended up doing at my previous place with Luigi is that we ran some computation and then in the end of the computation, we stored the result in some data store like S3 or a database. And then in the next step, we fetched the result from S3 or database, and then performed some additional computation and stored the results again. So all these steps ended up having some amount of boilerplate related to persisting and loading results.

So here, we wanted to make it easy enough, just as a convenience feature. What we do is that we persist everything. So when you assign something to an instance variable in your Metaflow Python code, like here, self.x, we persist it automatically after the step has finished. And there are a couple of nice side effects when we do this. One is that this in fact becomes a snapshotting feature, or a checkpoint. What that means is that now we can provide this resume function. So imagine that your start step is a data loading step and takes two hours to run, and in B you're actually training your model. And now B fails, and you want to iterate on B. So you don't want to wait on start for two hours. So you can just resume from B. And this works because we have persisted the whole state of start. So we can just continue from B without having to re-execute start. And another surprising benefit of this small feature is that there are many frameworks out there that allow people to choose what to persist. And the hard part is the choice. What we saw before, when we gave data scientists the freedom to choose what to persist, is that people tend to be very conservative. They're conservative, thinking that, "Oh, I don't need this piece of data. I only need my model, and I don't want to do this because I don't want to add another 100 bytes to Netflix's S3 bucket."

And the challenge with this mindset is that the next day when your model fails, and you really don't want to spend any time debugging it, but you knew that, “if only I had that one piece of information, if only I knew the intermediate state, the debugging would be so much easier.” And that's not possible because you decided not to persist it. And we make your life easier by saying, "Don't worry about it, it's a drop in the bucket." And by the way, it's all content addressed, so it's all deduplicated, so it's actually not adding that much overhead at all, and we just persist everything. So we get this [inaudible 00:40:43], and also we get full visibility to the internal state of the modeling pipeline. Super useful.
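A small sketch of what that buys you in practice, again using the open-source Metaflow API as a stand-in; the data and the "model" are placeholders.

```python
from metaflow import FlowSpec, step

class TweetTopicsFlow(FlowSpec):

    @step
    def start(self):
        # Stand-in for the slow, two-hour data loading step.
        self.raw_tweets = ["tweet one", "tweet two", "tweet three"]
        self.next(self.train)

    @step
    def train(self):
        # Everything assigned to self.* in start is already persisted, so if
        # this step fails you can fix the code and pick up from here with
        #   python tweet_topics_flow.py resume train
        # without re-running start.
        self.model = {"n_tweets": len(self.raw_tweets)}   # placeholder "model"
        self.next(self.end)

    @step
    def end(self):
        print(self.model)

if __name__ == "__main__":
    TweetTopicsFlow()
```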

How to Get Access to More CPUs, GPUs, or Memory?

Now, the next question is about scalability. So maybe many of you are looking at Python and R, thinking, I mean, it doesn't scale. Well, yes, Python and R are pretty high-level languages; by default, they are oftentimes super inefficient. You could always implement it, like Alex in the beginning, in C++, and it would be a thousand times more efficient. But then on the other hand, if you look at what AWS is providing these days, or Google Compute Engine, you can get boxes with 400 gigs of RAM, you can get boxes with 200 gigs of RAM.

So instead of saying to people, "Yes, what you did is maybe working, but so slow that we can't execute it", we said, "Whatever. Let's just take a box, a big box with 200 gigs of RAM, or maybe 16 cores, maybe a good number of GPUs, and let's just make it work. Let's rely on vertical scalability as much as we can." And all you have to do is decorate your step by saying that this function requires this much memory, or maybe this TensorFlow training needs this many GPUs, and you just run it locally as before. And we then farm these functions out to Titus, and Titus gives us machines with these requirements. And it's very convenient, because now, just by adding a single line of code, you can run your idiomatic R training, or you can run your idiomatic Python training, which probably isn't optimal at all, but it's super easy and productive.
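In code, that single line is a decorator on the step. This sketch uses the `@resources` decorator from the later open-source Metaflow; internally the work is farmed out to Titus, and the numbers here are illustrative.

```python
from metaflow import FlowSpec, step, resources

class BigTrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Declare what the step needs; when the run targets a remote backend
    # (Titus internally, or a cloud batch service in the open-source release),
    # the step lands on a machine that satisfies these requirements.
    @resources(memory=200000, cpu=16, gpu=1)   # memory in MB, illustrative
    @step
    def train(self):
        self.result = "trained on a big box"
        self.next(self.end)

    @step
    def end(self):
        print(self.result)

if __name__ == "__main__":
    BigTrainingFlow()
```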

How to Distribute Work over Many Parallel Jobs?

Another really nice use case is that oftentimes, you don't know the graph in advance, but you want to do some kind of a fanout, a dynamic fanout. The typical use case in data science is hyperparameter search. So you have different parameterizations for your model, and you want to run all the models in parallel with different parameters, and then you just want it to happen for any number of models. It can be a really large number of models. So we provide a simple Pythonic construct: you take the grid, and we say, "Okay, do a foreach over the grid." And then it does the fanout. Easy enough. And one of the best proofs that people actually find this useful is that 40% of the people who use the system, or projects that use the system, start using this feature right away. And others first test something locally, and then when they know that they have run out of resources, maybe in their sandbox environment, they decide, "Okay, now I want to start farming out computation elsewhere."
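Here is a sketch of the foreach fanout for a hyperparameter grid, in the same open-source Metaflow style; the grid values and the scoring are placeholders.

```python
from metaflow import FlowSpec, step

class HyperparamSearchFlow(FlowSpec):

    @step
    def start(self):
        # The grid of candidate parameterizations; illustrative values.
        self.grid = [0.01, 0.1, 1.0]
        # Fan out: one parallel task per element of the grid.
        self.next(self.train, foreach="grid")

    @step
    def train(self):
        # self.input is this task's element of the grid.
        self.learning_rate = self.input
        self.score = 1.0 / self.input        # placeholder "training"
        self.next(self.join)

    @step
    def join(self, inputs):
        # Compare all the fanned-out tasks and keep the best one.
        self.best = max(inputs, key=lambda i: i.score).learning_rate
        self.next(self.end)

    @step
    def end(self):
        print("best learning rate:", self.best)

if __name__ == "__main__":
    HyperparamSearchFlow()
```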

How to Access Large Amounts of Input Data?

Then, as I mentioned in the NLP use case, the challenge is that when you start doing these fanouts, you need to get data somehow. And I mentioned that we used to use Presto, but the challenge is that running these big SQL queries many times in parallel is very slow. So instead, we provided this really fast data pipeline that goes directly from S3 to EC2. And if there's anyone, by the way, from Amazon here, they should fix their S3 transfer class in boto, which is way too inefficient. So we wrote our own, and now you can actually get 10 gigabits per second between S3 and EC2. And you can imagine that this is a game changer for data scientists, who can now do a select star on a big table, get all the data into their model, and do feature engineering in the pipeline. It's faster than, let's say, Spark, even though you're doing dumb, simple Python.
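As a sketch of that S3-to-EC2 data path, the later open-source Metaflow ships a parallel S3 client that plays this role; the bucket and prefix below are hypothetical.

```python
from metaflow import S3

# Download everything under a prefix with many parallel connections, which is
# what makes the "select star on a big table, then do feature engineering in
# plain Python" pattern feasible on a single big box.
with S3(s3root="s3://example-bucket/tweets/2018-11-01/") as s3:
    for obj in s3.get_all():
        print(obj.key, len(obj.blob), "bytes")
```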

And I think, in the interest of time, I think I will skip this slide. But the idea is that we had a really successful project that was using R, and using these features, parallel foreach, we relied on vertical scalability, and it worked really nicely.

How to Version My Results and Access Results by Others?

And more about deployment, since I still have a few minutes. So this is about versioning. Now, versioning is one of those things that's super important. No one wants to do it. We do it by default. As I mentioned, we are snapshotting data. We also snapshot the code. And largely, we want to start snapshotting the whole environment where your code runs. All these artifacts are immutable, so you can refer to anyone else's artifacts, or your own. And there are two big benefits when you do this. First, you can experiment freely. You can take any existing piece of data science code, and you can start running it, because you know that it runs in isolation. You can't step on anyone else's toes. You can reuse someone else's previous results and you can start building on top of that. And the second big benefit is reproducibility. Now that we are snapshotting all these different things, we can take an existing version of the model, re-run it and see that we get the same results.
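Here is a sketch of how those immutable, versioned artifacts can be read back and built on, using the client API of the later open-source Metaflow as a stand-in; the flow and artifact names are hypothetical.

```python
from metaflow import Flow

# Look up the latest successful run of a colleague's flow...
run = Flow("HyperparamSearchFlow").latest_successful_run
print("run id:", run.id)

# ...and read its persisted artifacts, either to build on top of them or to
# re-examine exactly what that versioned run produced.
print("best learning rate:", run.data.best)
```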

How to Deploy My Workflow to Production?

For operational stuff, Netflix has this amazing hairball of ETL pipelines, and it used to be, as I mentioned, that when you wanted to get something into this official production scheduler, it required quite some work. And we wanted to take that pain away, so that there are no excuses for not deploying to production. And now there's a single command that can take your local DAG and deploy it to production with this scheduler, which then provides you operational support, like nice reports for SLAs, and alerting if something fails, and so forth.

And as a result, now, based on our statistics, 26% of projects deploy to production and these are good numbers. It's supposed to be a funnel. This is experimentation after all, so that number should never be a hundred percent. So we think that it's a good number, but even more importantly, now, the median time to first deployment is seven days. Compare that to the four months that it used to be. And this is so important because, as I mentioned, most of these models, if not all of these models, initially fail in production. So it's important that you see that first failure as soon as possible, so that then you can go back to the drawing board and start fixing it and actually make it work in real life.

How to Monitor Models and Examine Results?

And then once you have the thing running in production, you need to know that it actually works. We could provide a UI that shows the model metrics and whatnot. Again, the challenge is that it's very hard to have a universal UI. So instead of having a universal UI, we decided that data scientists should be able to build their own UIs. Well, now, data scientists don't want to use React. They don't want to use JavaScript. They want to use the tools they know. So we allow them to do this in notebooks. Thanks to the fact that we are persisting all the intermediate state, you can, for instance, peek into, let's say, the RMSE of a model, and you can plot how well the model converged, or you can plot some other metrics that make sense for your model. You can have the notebook run every time your model runs, so you always have a good report, a good understanding of what the model did and how well it worked.
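A sketch of such a notebook cell: pull the persisted state of a production run and plot a convergence metric. The flow name and the `training_losses` artifact are assumptions for illustration, again written against the later open-source Metaflow client API.

```python
import matplotlib.pyplot as plt
from metaflow import Flow

# Fetch the persisted artifacts of the latest successful production run.
run = Flow("TweetTopicsFlow").latest_successful_run
losses = run.data.training_losses   # assumed to be persisted by the flow

# Plot how the model converged; any metric the flow persisted works the same way.
plt.plot(losses)
plt.xlabel("iteration")
plt.ylabel("loss")
plt.title("Convergence of run %s" % run.id)
plt.show()
```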

How to Deploy Results as a Microservice?

And then lastly, the previous examples I showed you were mostly about the batch scoring and batch training use cases. But obviously, there are many real time use cases as well. And for that, we provide the hosting platform. So you can run your models as usual, using this platform. And then you can deploy to our hosting service. And instead of you having to write a custom back end and a custom microservice, you just define a simple Python function, so almost like a function as a service, that takes your model, takes the request, and you decide how you want to produce the response. Oftentimes, it might be as simple as a dot product, but in other cases, it might be something more complicated. And then you get a first class microservice that you can call from other services. And one of the nicest examples of this was a project that optimizes launch schedules for our titles. So on Friday, when you go to Netflix and you see that there are many new titles, what to launch and when is actually a surprisingly hard problem.

So our data scientists wrote an optimizer that optimizes the schedule. And not only that, but we have a custom UI that allows business stakeholders to ask these what-if questions: what if we launch this before that? And now, thanks to the hosting platform, we run this mini optimizer on the fly, answering these what-if questions: how would the schedule change if we changed this knob?
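The hosting platform itself is internal to Netflix, so the following only sketches the shape of the function-as-a-service endpoint described above; the signature and request format are assumptions.

```python
import numpy as np

def predict(model, request):
    """Given the loaded model and an incoming request, produce the response.

    Here the "model" is just a weight vector and scoring is a dot product,
    which the talk mentions as a common case; anything more complicated
    would go in the same place.
    """
    features = np.asarray(request["features"], dtype=float)
    return {"score": float(np.dot(model, features))}

if __name__ == "__main__":
    # Local usage example with a toy model and request.
    model = np.array([0.2, -0.1, 0.7])
    print(predict(model, {"features": [1.0, 2.0, 3.0]}))
```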

So that's Metaflow. And just to recap, before I run out of time: we started with the idea that we have a very diverse set of problems. And we believe that the best way to address a diverse set of problems is by having a diverse group of people with different backgrounds, with different domain expertise. And we don't require that they necessarily have a super deep software engineering background. That's fine. And with diverse people, we want to embrace the fact that we get diverse models. It's fine. You should use the best tool for the job. And then, since we have this diversity, we want, with our infrastructure, to help people build these things, especially focusing on the parts that the data scientists don't care so much about. And then we want to remove any excuses for not deploying. And the result is that people can focus on the things they care about, and our business gets results faster than ever before. That's it. Thank you.


Recorded at:

Dec 19, 2018
