Bio: Xavier Amatriain (PhD) is Director of Algorithms Engineering at Netflix. He leads a team of researchers and engineers designing the next wave of machine learning approaches to power the Netflix product. He works at the crossroads of machine learning research, large-scale software engineering, and product innovation.
I’m the director of Algorithms Engineering at Netflix. I lead a team of Machine Learning Engineers, and we work on different aspects of the product related to the algorithms: things like personalization, search and so on. I’ve been at Netflix for around 3 years now, and before that I had an academic background; I’ve been doing research in Machine Learning, Recommendations and Software Engineering in general for several years.
As I said, my background is mostly academic. I did my PhD, and I was doing mostly Signal Processing at that time. Interestingly, many of us who are now doing more Machine Learning Algorithms have a background in Signal Processing because at that time, I’m talking about ten years ago, people were talking much more about Multimedia and Media Processing and Signal Processing, and that was sort of the hype of the day. I got into Machine Learning through the Signal Processing route, which is not uncommon. But then at one point in my career I started getting more and more into studying human-generated signals, so to speak: instead of looking at the signals that come from audio or video, signals that come from people and what they do, analyzing what they do and coming up with patterns that respond to what people do. That’s when I got interested in general Machine Learning approaches, and specifically in applying them to personalization, to understanding user profiles and what people do, and how to better adapt products to what people do.
So that was one of the things that got me into Netflix. Netflix is well known for being one of the companies that has tailored its product around this idea of Personalization and Recommendation. You think of Google as a search company; you can think of Netflix as an entertainment company, but we are also a Personalization and Recommendation company, in the sense that a lot of our product is based around this idea of trying to surface whatever is best for each person at any given time. That is one of the key things that goes into the product and one of the key things we work on in our algorithms: how to better identify the user’s preferences and tastes, and how to better surface whatever is best for that user in a given context.
So you’ve seen it all around. One of the things that people don’t realize is that 75% of what people watch on Netflix comes from some Recommendation. Anything you see when you go into your regular Netflix experience is going to be a Recommendation, except when you actually type into search and search for something, and even when you search, what comes out is a form of Recommendation. But maybe the clearest example of a Netflix Recommendation is the rows that we have in our interface. The top ten row, which is something that you usually see at the top of your user experience, is specifically tailored for you: it’s looking at your past history of what you’ve consumed, what you’ve been watching, what you’ve liked, what you’ve told us you liked. All of that goes into the Big Data pipeline, and we analyze it and come up with smart algorithms that are able to identify those ten items that will then go in your top ten list.
I mean, Hadoop is going to be helpful for one set of problems, and of course at Netflix we do use Hadoop; some of our solutions are based on Hive or Pig scripts that run on Hadoop. But one important thing to remember about Hadoop is that it provides solutions for the data distribution problem, or distributed data computing, in an offline or batch-mode setting, and that is just one part of the problem. It is an interesting one, because some of your Big Data problems can be addressed that way. Think about the kind of processes that you can run overnight. I like to use the metaphor that when people are sleeping you can crunch some numbers and run a MapReduce job on Hadoop, and the next day when they wake up you have something ready for them. That is a good thing to do on the Hadoop side of things. But there is much more. I usually talk about having three different layers of computation going on. The first is the offline one, which is addressed by either Hadoop or any other sort of offline batch computation system that you can develop yourself.
But then there are two others. At the other extreme there is the online computation side of things: that is what happens when you want to compute something quickly in response to something that the user is doing. Think about the user giving you some feedback on a given show or movie, and you want to recompute your Recommendations in, say, less than a few hundred milliseconds, because you want to refresh that list right away. That is obviously not addressed by the Hadoop infrastructure. And then in between the two, between the offline and the online, there is a very interesting layer, the nearline computation layer, where a lot of very interesting things happen, usually also in response to user events. So, same thing: the user starts watching a movie or TV show, and we usually have half an hour or two hours in which you could update things. They don’t need to happen online in a few milliseconds; they could happen in a few minutes. But you can recompute and rebuild your models, and you can recompute your Recommendations, in a different way from what would happen through the Big Data offline Hadoop pipeline, which is going to be happening overnight and is going to be Big Data crunching.
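As an editorial illustration of the three layers described above, here is a minimal Python sketch. It is not Netflix's implementation: the class `LayeredRecommender`, the toy `CATALOG`, the method names and the scoring weights are all hypothetical, chosen only to show which layer writes which piece of state and how the final ranking combines them.

```python
from collections import Counter, defaultdict

# Hypothetical toy catalog (title -> genre); not real Netflix data.
CATALOG = {"A": "drama", "B": "drama", "C": "comedy", "D": "comedy"}


class LayeredRecommender:
    """Toy model of offline, nearline and online computation layers."""

    def __init__(self):
        self.popularity = Counter()                 # written by the offline layer
        self.genre_affinity = defaultdict(Counter)  # written by the nearline layer
        self.disliked = defaultdict(set)            # written by the online layer

    def offline_batch(self, play_log):
        """Offline: recompute global popularity from the full play log.

        Stands in for an overnight batch job (for example, MapReduce).
        """
        self.popularity = Counter(title for _user, title in play_log)

    def nearline_on_play(self, user, title):
        """Nearline: a play event updates the user's genre affinity.

        Runs in response to an event, but may take minutes rather than milliseconds.
        """
        self.genre_affinity[user][CATALOG[title]] += 1

    def online_on_feedback(self, user, title, liked):
        """Online: apply explicit feedback and return a refreshed list right away."""
        if liked:
            self.disliked[user].discard(title)
        else:
            self.disliked[user].add(title)
        return self.top_n(user)

    def top_n(self, user, n=10):
        """Rank by batch popularity plus an assumed 2x weight on genre affinity."""
        def score(title):
            return self.popularity[title] + 2 * self.genre_affinity[user][CATALOG[title]]

        candidates = [t for t in CATALOG if t not in self.disliked[user]]
        # Sort by descending score, breaking ties alphabetically for determinism.
        return sorted(candidates, key=lambda t: (-score(t), t))[:n]


rec = LayeredRecommender()
rec.offline_batch([("u1", "A"), ("u2", "A"), ("u2", "C")])
before = rec.top_n("u3")                                   # popularity only
rec.nearline_on_play("u3", "D")                            # u3 starts a comedy
after_play = rec.top_n("u3")
after_feedback = rec.online_on_feedback("u3", "C", liked=False)
```

The point of the split is that each layer has its own latency budget: the batch job can be expensive because nobody is waiting for it, while the online path only does a cheap filter and sort over state the other layers have already prepared.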
6. So that sounds like there is a lot of intention about not only what data is coming back, but also how the solutions are deployed and how the solutions behave. Does that imply a team structure that is equally segregated or specialized?
That is a good point. I think in order to implement a solution like the one I was mentioning, where you have not only a lot of data coming in but also events that travel quickly and interact with your responses, and you have different levels of latency that you can adopt, you need different skill sets that go all the way from the data scientist to the pure systems engineer and anything in between. We usually think of having all of those different profiles being able to interact with each other, which also gives a lot of value to people whose skill sets are combined in some way. The people that I have in my team I mostly consider Machine Learning Engineers, and that means they are not pure data scientists, who come with a pure scientific background and may be able to just barely write some R code or scripts in Pig or something like that. The people that I have in my team come more from a Computer Science background, but they do have skills in Machine Learning and Algorithms. However, it’s important for them to be able to interact with people at both ends: data scientists at one end, and pure systems engineers at the other end, who deal with Data Engineering issues like pushing data through the pipeline and making sure that it is QA’d, that it’s not stale and that it performs appropriately. So there is a whole range that goes from the pure data scientist to the pure systems engineer. I think most importantly the people in the middle, the Machine Learning Engineers, are the ones who are able to connect all the dots between the two extremes.
It usually requires some time and some experience, but there are very easy ways to get jump-started. I think of things like enrolling in a Coursera online course: the famous Andrew Ng Machine Learning course is a very good introduction to Machine Learning, and at Coursera or Udacity there are several follow-ups or more specialized courses on graphical models or even deep learning. So there are many things out there that you can get started with, not to mention a ton of books on Machine Learning that I think are very interesting to read. Now the other important thing is how you acquire practice with some real-life problems. I think an interesting avenue is the contests that are happening now; I think of Kaggle or similar initiatives where you can actually compete and look at some data problems. If you are lucky and you either win or come close to winning, I think that is going to be not only a great learning experience but also something that can go into your resume, and you can start calling yourself a Machine Learning Engineer once you’ve acquired that on your own. I think it’s very valuable to have people who have this sort of strong Software Engineering background but are able to grow into the algorithmic space by acquiring knowledge and some practice.
8. So you also mentioned that maybe ten years ago people were studying signals from machinery and automated systems, and that eventually it became more human-based, with Machine Learning and Recommendation Systems. Where do you see this going in another 5 to 10 years?
That is a very good question. I think there is still a lot to be done in the human space. There are many signals generated by what humans do and what they want to do that we are just barely tapping into. Some of the applications of that are things like the Recommendation Systems that Netflix does, or think about advertising. Advertising is another very good example where people are investing a lot into understanding human behavior. But there are so many other applications that maybe right now are not finding their commercial avenue so easily. I’m thinking of things like traffic: it took me two hours to drive into San Francisco today. Studying patterns of human mobility, how that can be solved and optimized, and how we can better understand what people do in order to be better at optimizing resources, for example energy and so on: I think there are a lot of interesting unsolved issues that are going to leverage a very similar set of tools, related to understanding human patterns, what people do and how they do it, and then running algorithms over that to optimize some resources and find the best solution for each context and each person. I think the idea of using the past history of what people do in a given context, and then coming up with predictions of what is going to happen and how to better adapt to it, has a lot of different applications that are going to be coming up in the next few years for sure.
Harry: Xavier thank you very much for the interview and I hope you have a great QCon!