
Hien Luu on ML Principles at DoorDash

Live from the venue of the QCon London Conference, we are talking with Hien Luu, head of ML Platform at DoorDash. In this podcast, Hien discusses the main principles and strategies that DoorDash uses to scale and evolve MLOps, as well as some tips for those who want to get started with MLOps.

Key Takeaways

  • The ML Principles at DoorDash include thinking big and starting small, improving by 1% every day, and being customer-obsessed. 
  • These principles can help organizations be more effective in their machine learning projects, which historically have had an 85% failure rate.
  • One of the main reasons behind this high failure rate is the lack of proper planning, infrastructure, and engineering mentality when it comes to productionalizing machine learning models. 
  • Organizations need to invest in the right tools, talent, and data to make their ML projects successful. It is important to set realistic expectations and to iterate on models, as machine learning is a constantly evolving field.
  • Hien recommends using Redis as a feature store for low-latency use cases and considering other options like CockroachDB for more cost-effective storage.


Transcript

Introduction [00:17]

Roland Meertens: Welcome everyone to the InfoQ podcast. My name is Roland Meertens and I'm your host for today. I am interviewing Hien Luu, the head of ML Platform at DoorDash. We are talking to each other in person at the QCon London conference, just after he gave the presentation Strategy and Principles to Scale and Evolve MLOps at DoorDash. Make sure to watch his presentation, as it contains many insights into the principles Hien Luu uses to improve the machine learning capabilities at DoorDash. During today's interview we will dive deeper into these principles, and I hope you enjoy it and can learn from it. Hien Luu, welcome to the InfoQ podcast.

Hien Luu: Thank you for inviting me.

Roland Meertens: We are here at QCon London, and you just gave your talk about Strategy and Principles to Scale and Evolve MLOps at DoorDash. Welcome. Could you maybe give a three-minute summary of your talk?

Hien Luu: Sure. The summary is essentially about how any company or organization that is adopting machine learning operations and building out infrastructure should think about which strategies to use, and which principles might be helpful as they build out that infrastructure. The strategies are a little bit biased and opinionated, coming from my perspective and experience, but I think at a high level most companies can consider leveraging them, because each organization has its own specific ML use cases, its own culture, the people it works with, and its own level of technology maturity. So hopefully it's a framework that's broad enough that they can apply it to their own organization's needs. Similarly, the principles are also very high level: how to go about supporting your customers, how to make sure you're making progress, and how to align with stakeholders, decision-makers, and so on.

Roland Meertens: What are the principles you have, then? I think you had three principles, right?

ML Principles at DoorDash [02:15]

Hien Luu: Yeah. The first one is think big, start small; the second is getting 1% better every day; and the third one is customer focus, or being customer-obsessed, essentially.

Roland Meertens: One of the things I took away as an interesting fact from the start of your presentation was the 85% failure rate of machine learning projects: 85% of machine learning projects fail. What's the main reason behind this? Why is this field so bad at delivering successful projects?

Hien Luu: First, that statistic was called out, I believe, two or three years ago. That was at the time when a lot of companies and enterprises were jumping into leveraging ML to add value to their products. Internet companies like Google, Facebook (or Meta), or LinkedIn had done this for a long, long time; but for enterprises this was something new, and they jumped in without proper planning or thinking, and the infrastructure wasn't there. I think that's part of the reason why the failure rate is so high: because it's new. It's not something that you just do. You need to have the proper team, the right approach to applying ML, the right data, and you need to integrate that into online products or services.

So it's a whole set of things that need to come together. It's not something you can just do. That's my guess as to why the failure rate is so high, but as an industry, the MLOps space has matured quite a bit in the last three years. There are a lot of startups and a lot of money that went into that, the products are maturing, and best practices are being shared and developed. So my hope is that the next time somebody does the same survey, the number will be much lower.

What makes ML difficult? [04:11]

Roland Meertens: And if you want to point fingers, do you think it's mostly an organizational problem, is it simply that people don't have an MLOps team, or are the data scientists not good enough at engineering?

Hien Luu: I would say it's a combination of those things. It's not just one, because each organization is very, very different: where they start from, what talent they have, whether they have enough data for their use cases or not. So each one is a little bit unique, but the number one thing I would say is the engineering mentality that was not there to productionalize machine learning models. Data scientists might build a model and just throw it over the wall, for example, and there's no reproducibility, right? There's no way of treating ML artifacts as code. There was no such mentality until the whole MLOps discipline was codified and formalized in the last few years.
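To make the idea of treating ML artifacts as code concrete, here is a minimal, hypothetical sketch of what capturing reproducibility metadata alongside a model might look like. The function name and file layout are illustrative assumptions for this example, not DoorDash's actual tooling.

```python
import hashlib
import json
import pickle
from datetime import datetime, timezone

def save_model_artifact(model, train_data_path: str, params: dict, out_dir: str = ".") -> None:
    """Persist a model together with the metadata needed to rebuild it later."""
    # Hash the training data so the exact dataset can be identified later.
    with open(train_data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    metadata = {
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "training_data_sha256": data_hash,  # pins the exact dataset
        "hyperparameters": params,          # pins the exact configuration
    }

    with open(f"{out_dir}/model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(f"{out_dir}/model_metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
```

With metadata like this stored next to the artifact, a second engineer can at least reconstruct which data and configuration produced a given model, even if the original author has left the company.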

Roland Meertens: So it's really also a bit of getting the discipline, getting more structure, maybe.

Hien Luu: Because developing a model is not a one-time thing, unlike maybe other applications. It's continuous: you have version 1.0, but data changes, your use case might change, requirements might change. So you might have to go and develop a second version of that model, for example. And if the engineering disciplines are not followed, then building that model a second time would be really challenging. The data scientist might have left the company and nobody knows how to rebuild that model, for example.

Roland Meertens: Normally the second time you build something, it should be easier and faster to do, and not equally difficult.

Hien Luu: Right, exactly. And it might not be the case for ML, and that's why it takes so long to get a model to production, and in some cases you don't even get to production because of those reasons.

Concrete tips for improving [05:58]

Roland Meertens: So if we take your principle of getting 1% better every day, do you have some tips on how people should get started? What are some concrete, simple things data scientists can do to make their output way better?

Hien Luu: I think on the MLOps infrastructure side there's a lot that can be done, but you don't have to build everything at once. You want to build just enough, in the current state, to unblock whatever is blocking that data scientist or that organization from going from idea to production. And then incrementally, over the following months and quarters, you build out the bells and whistles and the things that they need. But the initial point is just: how do we get from idea to production quickly, building just enough of those tools? It might look ugly initially, but it actually unblocks somebody to get to the next stage. So it is all about not aiming for perfection, but building incrementally, just enough so the data scientist can be unblocked and do whatever they need to do.

Biggest blockers [06:58]

Roland Meertens: And where do you see most problems, then? For example, with data versioning or logging or compute, what are some of the main blockers you mostly see at DoorDash or at other companies?

Hien Luu: I can talk about DoorDash, and then we can talk about other companies as well. I think the DoorDash story is quite interesting because of the high growth we have gone through in the last few years. That high growth creates a lot of interesting challenges in terms of the large volume of data we need to be able to ingest, move, and compute on, and doing that reliably and with high quality. That's what we've been learning to do better and better, quarter over quarter. Initially the infrastructure wasn't quite there to do that easily and in a cost-effective way. It's not that we didn't know how to do it; the growth was just so fast, and sometimes we didn't have enough manpower, because the growth was tremendous in the last three years.

For other companies, obviously each organization is at a different stage in terms of adopting the cloud. Do they have a big enough data engineering team? Do they have a solid experimentation platform? Do they have an easy way to get access to compute resources, without requiring data scientists to do all the low-level work needed to get machines? Otherwise, data scientists cannot focus on their main task because they're doing all this engineering work. So it depends on the size of the organization and where they are in terms of maturity on each of those pieces of infrastructure that are commonly needed for doing machine learning development.

Roland Meertens: And would you say this is some kind of hidden tech debt that shows up if people didn't set up their MLOps stack in the right way?

Hien Luu: I wouldn't say tech debt yet, because a lot of these enterprises are just starting their journey, so it's more about how to get there, because they don't have the right infrastructure, tooling, data, or talent, for example. I think in the coming years the bigger story will be about tech debt, but in the last two years it's been more about how to get from A to B than anything else.

How to improve as a company [09:07]

Roland Meertens: And is that something where the company culture maybe has to change, where the company has to say, okay, we'll focus on hiring more of these people, or is it something else?

Hien Luu: Yeah, I think partially it's that organizations that are new to machine learning don't quite know how it works. They don't quite know what the process looks like. They don't quite know how machine learning can bring value. They read and understand the reports, but they haven't gone through that experience. They might have high expectations initially, but then the model turns out not to do quite what they want or expect, and they treat it as a failure, for example. Machine learning is about iterations, getting better and better; once you have more data you can train at larger scale, and so on. So expecting the first model to hit the ball out of the park, I think that may be one of the reasons.

Roland Meertens: Yeah, I think that's one of the things I took away from your talk, which I found quite interesting. When it comes to identifying and aligning the needs of customers and the different stakeholders, getting all the decision-makers together, and setting the right expectations, do you have any tips, for example for people who expect a massive improvement from a new model? How do you go about this?

Hien Luu: I think the first thing is aligning on what the business problem is and whether applying machine learning to solve that problem makes sense, going through the process of determining whether ML is really the right solution or not. The second part is setting expectations around the success metrics, the business metrics we want to evaluate the ML against, and then going through iterations of building a first version of the model, seeing if it actually meets those metrics, and iterating from there. And also setting expectations about how long it will take, because if it's something new to an organization, they might not know. Their expectations might come from building online applications, which is a very different process from machine learning, because there are a lot of unknowns in machine learning until you get down to it.

Roland Meertens: I think also one thing in machine learning is that you are not just dealing with an algorithm which gives you the right output, but you will also have to deal with false positives and false negatives.

Hien Luu: Yes.

Roland Meertens: How do you work with stakeholders to get them to decide what to do with these cases, for example?

Hien Luu: It comes back to educating the stakeholders about machine learning itself and the uncertainty involved in it. The next step, I would say, is setting expectations about iteration. I think iteration is probably one of the key things: setting the right expectation up front that the first model might not work out well and that it will take time. That's maybe hard for the business to understand initially, but it is something that needs to be expressed and shared. Otherwise the expectation is going to be there, and frustration will come if the first model doesn't work out the way it was expected.
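To make the false positive/false negative trade-off concrete, here is a small self-contained sketch using scikit-learn on synthetic data (not a DoorDash model): moving the decision threshold trades precision against recall, and choosing the operating point is ultimately a business decision, not a modeling one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, imbalanced binary classification data (about 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Each threshold trades false positives against false negatives.
for threshold in (0.3, 0.5, 0.7):
    preds = (scores >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, preds):.2f}, "
          f"recall={recall_score(y, preds):.2f}")
```

A stakeholder who sees that raising the threshold cuts false positives but misses more true cases is in a much better position to set realistic expectations for the first model.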

Roland Meertens: I think the other difficulty in machine learning is always that the boundary between what's possible and what's impossible is razor-thin. Do you have any way to set the right expectations, then?

Hien Luu: If you hire experienced data scientists, they can bring their stories and their experience, or you can look at use cases reported out there in the industry and how those went. I think using those data points can help these organizations lower their expectations, or at least be educated about scenarios where things might not work out initially.

Roland Meertens: The other thing which I think was funny in your talk was this customer obsession, and you were talking about the french fry moment. What is a french fry moment?

Hien Luu: This phrase was coined by the folks from Google; it's essentially about delighting your customers with solutions that they don't expect. The french fry moment is: if somebody orders a burger and doesn't order fries, you delight them by delivering the fries with it, because burgers and fries normally go together. So it's about delighting your customers with solutions they didn't prompt for, or might not know they need. But once you've built it and show it to them, they say, wow, this is great. That's what I call a french fry moment. It's not something they come and ask you to build. It's based on what you observe and what you know: you come up with the idea, build it, and then show it to them.

Feature stores [13:48]

Roland Meertens: If we get a bit more technical: you mentioned that you're using Redis as a feature store, and you also have some tips to improve it. So how did you end up using Redis as a feature store, and did you consider any other alternatives?

Hien Luu: The journey of our feature store is quite fascinating. For our online prediction use cases, low latency is one of the key requirements. There are solutions out there besides Redis, like Cassandra, but for us Redis has been working really well, and our infrastructure team supports it natively, so it's easy to get going. Redis is mostly in-memory, so it's very suitable for low-latency use cases. But do all use cases need low latency? Maybe not.

There are certain sets of use cases that still need low latency, but not very, very low latency. For those we can consider other options that are more cost-effective. In the last few years we have evolved our feature store to also support another storage engine, CockroachDB. So we have a mix of backend storage engines, Redis and CockroachDB, and depending on the use case and its needs, we might send features to the right storage engine, which can still meet the latency requirements while being more cost-effective as well.
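As a rough illustration of what reading and writing features against a Redis-backed store can look like, here is a minimal sketch using the redis-py client. The hash-per-entity key layout and the feature names are assumptions made for the example, not DoorDash's actual schema.

```python
import redis  # the redis-py package

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_features(entity_id: str, features: dict) -> None:
    """Store the latest feature values for one entity as a Redis hash."""
    r.hset(f"features:{entity_id}", mapping=features)

def read_features(entity_id: str) -> dict:
    """Fetch all features for one entity in a single round trip."""
    return r.hgetall(f"features:{entity_id}")

write_features("store_42", {"avg_delivery_time_min": 27.5, "orders_last_7d": 812})
print(read_features("store_42"))
```

Because the data lives in memory, a lookup like this typically completes in well under a millisecond, which is what makes Redis attractive for online prediction.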

Roland Meertens: Interesting. And so at the end of the talk, you were talking about some future-oriented things. What do you think are the key trends and advances that companies or MLOps teams should anticipate in the near future?

Hien Luu: It depends on the needs of the company as well. For us, obviously, we have more and more data scientists, we have more and more data, and there's a need for us to increase the performance of our models, and deep learning is one of the techniques to do that. So from our perspective it's: how do we support model training at scale, using GPUs for example? And similarly on the prediction side: how do we support low-latency prediction for those large, complex deep learning models? There are many, many techniques for that, but one of the easiest ways is, again, leveraging GPUs. So that is going to be our focus in 2023: how to support that across training and model serving for deep learning models.

GPU Computing [16:08]

Roland Meertens: And to end with, do you have any tips for people who want to get started? Are there any specific tools you like to use to get these models into production or to leverage GPUs?

Hien Luu: GPUs are not cheap, as I'm sure you are aware. So it comes back to the ROI of a use case: whether it makes sense to leverage GPUs or not, based on the impact that use case might have on the company. Obviously money is something all companies worry about. So it's all about making sure you're using the right GPUs for the right use cases. There's going to be a learning process of testing a model out on GPUs, seeing what the performance differences are like, and using tooling to understand how efficiently those GPUs are being used, so they're not sitting idle for that use case. We are going to be using a lot of tooling to help us understand what the right use cases for GPUs are, and how efficiently we are using the GPUs for each use case. That will be a very key aspect as we move forward.
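As one example of the kind of utilization tooling Hien describes, here is a short sketch using NVIDIA's NVML Python bindings (the nvidia-ml-py/pynvml package) to check whether GPUs are sitting idle. The choice of library is an assumption for illustration, not necessarily what DoorDash uses.

```python
import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {util.gpu}% busy, "
              f"{mem.used / mem.total:.0%} memory in use")
finally:
    pynvml.nvmlShutdown()
```

Sampling numbers like these over time makes it easy to spot a serving workload that pays for a GPU but rarely keeps it busy.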

Roland Meertens: So do you actively try to deploy the same model, then, maybe on a CPU and a GPU, and see what the impact is on your business?

Hien Luu: Correct, exactly. That's going to be one of the steps as we move into this world of leveraging more and more GPUs: understanding whether they reduce latency, how much business impact that brings to the company, and whether that justifies the additional cost of GPUs.
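Here is a minimal sketch of that experiment in PyTorch: time the same model on CPU and GPU and weigh the latency difference against the hardware cost. The toy model and batch size are arbitrary stand-ins for a real workload.

```python
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1)
).eval()
batch = torch.randn(64, 512)

def mean_latency_ms(device: str, iters: int = 100) -> float:
    m, x = model.to(device), batch.to(device)
    with torch.no_grad():
        m(x)  # warm-up run (CUDA init, kernel selection)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued GPU work to finish
    return (time.perf_counter() - start) / iters * 1000

print(f"CPU: {mean_latency_ms('cpu'):.3f} ms/batch")
if torch.cuda.is_available():
    print(f"GPU: {mean_latency_ms('cuda'):.3f} ms/batch")
```

For a model this small the GPU may not win at all; the point of the measurement is exactly to find out whether the speedup justifies the cost.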

Roland Meertens: Interesting. Do you have any tips for specific tooling here, which I can try out or which people can try out?

Hien Luu: Yeah, again, we're still early in this journey, but there is tooling available from NVIDIA for profiling GPUs and such, which you can find with a quick search. For us, we are at the beginning of this journey, so we're still in a learning phase there.

Roland Meertens: Thank you very much for joining the podcast. Enjoy the rest of your conference.

Hien Luu: Thank you.
