Aish Fenton on Machine Learning at Netflix


1. I’m Barry Burd, Professor of Computer Science and Mathematics at Drew University in Madison, New Jersey. I’m at QCon New York and I’m speaking with Aish Fenton, who is manager of research and engineering at Netflix. His talk at QCon this time was about machine learning at Netflix scale. Aish, what kind of things do Netflix machines have to learn?

Yes, actually we use machine learning across a lot of places in the product. There are some obvious things where we use machine learning, such as our recommendations, but what a lot of people don’t realize is that just about everything they see in Netflix has some level of personalization, and that’s all backed by machine-learned models.


2. Can you give us an example of some application of machine learning that isn’t involved with preferences?

Yes, it’s all personalized to some degree, so it has some aspect of the user preference in there, but an example that I don’t think people might consider is the genre rows we have at Netflix. A row is a collection of shows you can select from with a particular theme, like horror films or action films or movies based on books, or some quite quirky things like that. What people don’t realize is that even those themes are personalized to your tastes. Take movies that are based on books: when you think about it, that can mean a lot of different things to different people. If you are in the finance industry, you'll probably be interested in books about Wall Street, and therefore movies that are based on those, and things about the GFC [global financial crisis], whereas for a lot of people it would mean young adult fiction, which is quite a different interpretation of the genre.

Barry: So to what extent have you solved machine learning and preferences problems? It’s always a tough problem.

We have plenty of room for improvement. We’ve obviously done a lot of work at Netflix over many years and I think we're getting some great results; we are at the point where 75% of what people play when they come to Netflix is based on the recommendations that we put up on the homepage. So we are doing very well, but we still think there is a lot of room for improvement.


3. Machine learning and learning in general is what I consider some of the toughest programs to write in computing because of the heuristics involved, so tell us about scale, what’s the difference between a small machine learning problem and a machine learning problem on a very large scale?

Yes, in one way, as the scale increases it makes your problem easier, because obviously we have a lot more data to work with. We have 44 million subscribers now, and they generate a lot of data that we can then base our algorithms on. But on the other hand, it means that our algorithms have to work at a much bigger scale in terms of serving this volume of customers.


4. And what kind of solutions do you apply?

So there are actually a number of different solutions we apply to make this stuff scale; I don’t think there is any one solution. Obviously some of the big data solutions, such as Hadoop and HDFS-type technologies, are really useful and we lean on them a lot. But the thing with a machine learning model is that the part that’s often hard is training the model. Once you have the trained model, sometimes the model itself isn’t actually that complex.


5. Can you give me an example of the sort of situation you are talking about?

Yes, so take something like neural networks. They are computationally expensive to train; in fact, the reason deep learning and neural networks have come much more to the forefront recently is that it’s only recently that we’ve had the computing power to actually be able to train these things. But the funny thing is that the end result you get out after you’ve trained is a relatively small, contained computational unit that you can then embed in your online applications. So in terms of scaling, I actually think the toughest problem isn’t so much around the performance side of it; it's around being able to scale teams that can do that sort of work. Obviously there is a lot of focus on the performance side at Netflix too, but a lot of what we’ve done is build up tools that make it easy for us to do this sort of experiment and have that rapid cycle of being able to try out the different areas we're researching very quickly. Unless you have an infrastructure in place that lets you do that kind of rapid experimentation, it’s very hard to actually scale a team that can do independent research on this stuff.


6. So when we are talking about the scale of teams, what kinds of scale are we talking about?

Well, Netflix I think is actually smaller than most people would assume. Netflix is around 1000 people, of whom I think 450 are now engineers. Working directly on machine learning algorithms, my team is 7, but I work with a number of other teams at Netflix that are also around that size, so in total it might be 30-50 people working on the machine learning research side of it at Netflix.


7. And what kinds of things do you do to make the infrastructure of teams, large-scale teams for machine learning, work for you?

I think one of the things that Netflix has done very well is designing an in-house A/B testing framework. How the process typically goes with our teams is that we come up with a problem that we want to solve in the product, or a hypothesis as to how we might improve something. Usually some researchers work on that; they come up with some idea of how they may be able to solve it and design an algorithm that they think will address the problem. That’s normally all done offline, and then they put it out in production. The real question, of course, is whether it actually solves the problem when it's deployed in the real system used by real people, and that’s normally where things slow down in this pipeline. If you get a bunch of bright PhDs in the room, normally they can come up with some very interesting ideas quite quickly, and they can prototype stuff quite quickly in tools like R or IPython these days. Where it all slows down a lot is when it comes to putting it out in production. Normally you go from a cycle of a few days to suddenly taking months or years to get these things out. So we’ve built a lot of in-house tools that let us go very rapidly from a prototype that we’ve built offline to productionized code, and deploy it to a segment of our user base that we can then track a lot of metrics against to see if it’s really doing what we think it’s doing.
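
For readers unfamiliar with the idea, the step of deploying to a segment of the user base and tracking metrics against it can be sketched roughly as follows. This is a minimal illustration only, not Netflix's in-house framework, which the interview does not describe in detail. The function names, the experiment name `row_ranking_test`, the variant labels, and the "play rate" metric are all assumptions made up for this example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def assign_variant(user_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hashing the experiment name together with the user id keeps a user in
    the same variant on every request, and makes the assignment for one
    experiment independent of the assignment for another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def play_rate_by_variant(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute a hypothetical metric: fraction of sessions that led to a play.

    Each event is (variant, did_play).
    """
    shown = defaultdict(int)
    played = defaultdict(int)
    for variant, did_play in events:
        shown[variant] += 1
        if did_play:
            played[variant] += 1
    return {variant: played[variant] / shown[variant] for variant in shown}


if __name__ == "__main__":
    variants = ["control", "new_ranker"]  # hypothetical variant labels
    print(assign_variant("user-42", "row_ranking_test", variants))
    print(play_rate_by_variant([("control", True), ("control", False),
                                ("new_ranker", True)]))
```

Because the assignment depends only on the experiment name and the user id, the same user can sit in many experiments at once, each with its own independent split.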


8. In other words, the Netflix interface that I see may not be the same Netflix interface that they are seeing in another state, another part of the country, or maybe even my next door neighbor?

Yes, probably more likely your next door neighbor; it would be unusual for us to segment things by state. Different users are all in different experiments, and we run so many experiments at any one time at Netflix that there isn’t really one Netflix production system now, because everybody is in some sort of experiment or another. So really the Netflix production system is a collection of experiments that are ongoing at any given time.

Barry: So now, are there strategies for solving the problem that you found don’t work, either strategies for machine learning or strategies for managing the scale?

So I think one of the problems that is very hard for companies to address when they are getting into this is that they hire PhDs who are very bright around the mathematics, and you do need those people, but it’s very different from the sort of culture that’s often embedded in an engineering company, where the engineers work on the engineering code. You have a bunch of people, often with a hard science background, doing research using very different tool sets, so R and Python and things like that. Where it all slows down is in getting from the prototypes that they make into production. It often breaks down because that process of productionizing the research takes a long time, and I think that’s a real hurdle for most companies trying to do this.


9. What advice can you give to an engineer who wants to break into machine learning?

Yes, good question. There are a lot of courses online now; there are the Coursera courses and other MOOCs that have popped up, and they have some very interesting courses. I’m actually quite a fan of some of the old-fashioned MOOCs from before they were called MOOCs. MIT put a lot of their undergraduate mathematics program online many, many years ago, and I usually advise people that if they didn’t do the mathematics at university, or they were not paying much attention, they should go back to the MIT courses and make their way through those. They give a good grounding in the mathematics side of it. Then on the applied side we’ve got these great things now like Kaggle, where you can get your hands dirty very quickly, which is a very important part of the learning process with this.

Barry: Aish thank you so much for being here today!

My pleasure!

Oct 15, 2014