
Strategy & Principles to Scale and Evolve MLOps @DoorDash



Hien Luu shares their approach to MLOps, and the strategy and principles that have helped them to scale and evolve their platform to support hundreds of models and billions of predictions per day.


Hien Luu is a Sr. Engineering Manager at DoorDash, leading the Machine Learning Platform team. He is particularly passionate about the intersection between Big Data and Artificial Intelligence. He is the author of the book Beginning Apache Spark 3. He has given presentations at various conferences such as Data+AI Summit, XAI 21 Summit, MLOps World, YOW Data!, appy(), and QCon (SF, NY, London).

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Luu: My name is Hien Luu. I'm currently leading the Machine Learning Platform team at DoorDash. Before this, I spent quite a bit of time at LinkedIn, and also a little bit of time at Uber. First, I will discuss the strategies for successful MLOps adoption. Then I'll share a few specific principles that have been helpful for us in our journey of building our MLOps infrastructure at DoorDash. Then I'll wrap up with some details about the current state of our infrastructure, and a look at what's ahead.

Strategies for Successful MLOps

Let's start with the strategies. If someone was asked to put together an MLOps infrastructure from scratch for a particular organization, the question is, what strategies would one follow or adopt to increase the chances of success? In this session, I will share a biased and opinionated set of strategies that have helped us in our journey of building our infrastructure over the last 2-plus years at DoorDash. Let's make sure we are on the same page about what MLOps is. For me, this one sentence, "ML models have zero ROI until they are in production," really captures the essence of MLOps at a high level.

It consists of two parts: the ROI and production. For the first part, ROI, organizations believe that AI/ML can bring value to their business, so they want to invest in it. If they're not able to achieve the ROI, then MLOps is not a topic that business leaders will be interested in. The second part is about production, which is about operationalizing ML. The desire here is to successfully get ML models to production quickly, efficiently, and reliably. Achieving these two goals requires an understanding of what MLOps is at an engineering level.

MLOps has been described as an engineering discipline that combines the best practices and processes from other disciplines, the three shown here: DevOps, data engineering, and ML. The central focus is to provide the automation and infrastructure to move models to production faster. From the DevOps discipline, CI/CD is definitely a key part, then treating ML artifacts as code to enable reproducibility.

From the data engineering discipline, it's about reliable data pipelines with high quality. From the ML discipline, it's about the ML development lifecycle, experimentation, modeling, and so on. The main goal of MLOps from the engineering discipline is about reducing the technical frictions as much as possible, while applying ML from conception to production in the shortest amount of time.

While doing research on the MLOps topic, I learned about the AI Infrastructure Alliance. This organization aims to help the ML and AI practitioner industry converge on a common understanding of the essential building blocks for AI applications, which include the common workflows and the canonical stack. This particular diagram shows the common ML workflow for the various stages of the ML development lifecycle: the data stage as the first part, then training in the middle, and deployment as the last stage.

These workflows are fairly clear. There's not a whole lot of mystery that we need to figure out. However, what each company needs to figure out is how best to make them a reality at their company. This is where a set of strategies will be helpful. The strategy I'm advocating for successful MLOps adoption is in the form of an objective function, essentially treating this as an optimization problem, as well as having fun with ML terminology. The four inputs are: use case, culture, technology, and people.

I believe these dimensions have a large influence on the outcome of the MLOps adoption. Why is that? Because the information about these areas will help us better understand which parts of the MLOps infrastructure are more important than others. We can use those details to prioritize what to focus on and by when. It's also useful to figure out how best to align the approach with an organization's culture and technology maturity. At the very least, it's helpful to arrive at a decent starting point. In my view, it is extremely important to clearly identify and understand the details of each of these dimensions.

Let's start with the use case. Understanding the business domain an organization is in, and identifying a core set of ML use cases, is pretty important, but it should only require a small amount of effort. How is this useful? Because we can use this information to determine which parts of the infrastructure are must-have versus nice-to-have. For example, say a company is in the banking, insurance, or healthcare business domain.

For the common sets of ML use cases there, it makes sense to double down on the governance part of the infrastructure, because understanding fairness, bias, and related ethical topics is really important. Also, more emphasis on the explainability of ML predictions is quite important.

For certain types of ML use cases in a certain business domain, they have specific needs. It is important to understand what those are. Company culture, another very important dimension that might have a large influence on the pace of the MLOps adoption. Company culture is what makes an organization different and unique from others. It is a large topic.

In the context of MLOps adoption, there are a few areas to consider, as listed here. I'll talk about three of those. First is the innovation part, which encourages creativity, experimentation, and risk taking. If a company is innovative, there's a tendency for it to have a higher level of risk tolerance. You can use this information to help determine the amount of effort and time you will need to put in to vet a decision or a technology adoption, for example. The next is collaboration.

If a company is highly collaborative, emphasizing teamwork, cooperation, and support, it is a lot easier. If not, then you need to determine how much effort to put in to collaborate with other teams, especially the customers, stakeholders, and decision makers. The specific area you want to cooperate on is their involvement in adopting MLOps. The third one is about being results driven.

This will help in determining whether a company is fast moving or slow moving. Results-driven companies tend to be fast moving rather than slow moving, so velocity matters. The pace will determine the expectations around how fast your MLOps adoption will need to be, and how soon it needs to demonstrate results. Effectively, it's best to operate at the pace that your customers operate at. Along with a results-driven culture, you need to show incremental progress at a certain pace. Use information about a company's culture to guide the decisions you make, what pace you need to operate at, and how much time is needed to allocate for decision making and collaboration.

Technology. The surrounding infrastructures are the key dependencies that an MLOps infrastructure has. Without these, it will be quite challenging: data infrastructure, A/B testing, logging, compute infrastructure, CI/CD, monitoring, and such. The suggestion here is to assess their maturity, assess the gaps, and advocate for filling in those gaps to support the ML development lifecycle and velocity. Let's pick one, for example, the data infrastructure. Data is a critical piece of ML. The ML infrastructure and data infrastructure are joined at the hip, feature engineering essentially. If the data infrastructure maturity is low, this is going to slow down the ML development velocity.

Understand what's needed to drive ML development velocity, and advocate for that. For example, how easy or how quickly is it to access the needed data? How smooth is it to move data around, and such? The next part is the A/B infrastructure. This enables experimentation. ML development is a highly iterative process that requires this. Understand how easy it is to do this, so that data scientists can iterate on their models quickly and test them quickly.

If it is not easy, then you might need to advocate for your customers. The next one I'm going to touch on is the compute infrastructure. If the ML use cases require performing feature engineering or model training at scale, then it's important to figure this out sooner rather than later, because access to the infrastructure for supporting those kinds of activities is really important, whether access to GPUs is needed or not.

People. According to a recent ML survey, organizational alignment is one of the biggest gaps in achieving AI/ML maturity. It's important to identify and align with customers, stakeholders, and key decision makers, like data scientists, ML engineers, PMs, and business owners. Involve them in your strategy and provide them constant communication. More importantly, make sure you're aligning on their needs and the impact of the MLOps infrastructure.

For this, I want to share a quick story about organizational alignment, from talking to one of the candidates. He was sharing that he was looking for a change because his company did not value the MLOps team as much as the data scientist team. There might be a lot of reasons behind this situation, maybe a lack of understanding or information, or the provided infrastructure is not that great. Aligning on these aspects is pretty important.

In terms of DoorDash, what does that look like across these four dimensions that we just went through? DoorDash is essentially a logistics and e-commerce company. The kinds of ML use cases that we have are around logistics, search recommendations, ads and promotions, and fraud. In terms of the logistics use cases, it's about how best to assign orders to Dashers, predicting estimated times of arrival, food preparation times, and others. Most of these use cases require online predictions.

We want to make sure to provide the necessary infrastructure to support that. Governance is important, but less so compared to velocity, because our culture is results or impact driven, and fast moving. For us, teaming up with highly visible teams and collaborating with them to drive impact together is definitely a good thing to do. Then aligning with our customers' planning processes, operating at the same pace as the customers, as well as demonstrating incremental progress and impact, is pretty important in this fast-moving and results-driven culture.

Technology-wise, I would assess that we're in the early adult phase. It is improving every quarter. Recognizing the state of the data infrastructure and advocating for a data lake, for example, is what we did at the beginning, and we planned the way we built our infrastructure based on that information. For people, we have young data scientist teams with decent MLOps experience. One of the things we did that was really helpful was to form a Machine Learning Council to help drive the direction of the MLOps infrastructure, so everybody is on the same page.

In terms of our journey over the last 2-plus years, this has been the progression. We actually started out by building the model deployment and prediction service first. We tackled this first because we wanted to align with customers' needs. We built online prediction that can support low latency for the logistics use cases. Then we tackled the model training infrastructure next, as more data scientists joined the company and more models were deployed to production.

The need for a single source of truth became more important, as well as the ability to reproduce models. Code versioning is part of that. We built a centralized model training service to handle this, which also enables us to do some continuous training. Next, we tackled model insight, as there were more models in production, especially the high-profile ones.

For that, protecting the downside when something goes wrong became more important. We built our feature and model quality monitoring on top of the monitoring infrastructure that we have in the company. The last part is about feature engineering. As our data infrastructure matured, and with the availability of the data lake, we built out our declarative feature engineering framework that can create and generate features efficiently at scale.

Principles for Scaling and Evolving

We're going to go through a few principles that have helped us in scaling and evolving our infrastructure, as well as some details about our prediction service and feature store. The details will be more technical and less abstract than the previous section. The three principles that I will touch on are: dream big, start small; 1% better every day; and customer focus.

For us, these three principles work fairly well because they are aligned with our culture values. You might detect some similarities between these and Amazon's leadership principles. That's because our CEO is a big fan of Amazon and Jeff Bezos.

Let's talk about the first principle, dream big, start small. This one has two parts, the dream big part and start small. The dream big in the context of building an MLOps infrastructure is about coming up with a vision and strategic bets for how you think about building out MLOps infrastructure. This part is fairly straightforward.

The start small part is a very important aspect, especially for a young and high-growth company like DoorDash. This part encourages us to make meaningful progress and impact incrementally. What does it look like in action? Like I mentioned before, the components that we built first were the prediction service and the feature store.

You might be wondering why. I also wouldn't have guessed this order when we started out. Certainly, the logistics team was one of the first teams at DoorDash to leverage ML. Their prediction service had challenges keeping up with DoorDash's growth at the beginning of the pandemic. At the same time, the company had decided to build a centralized MLOps infrastructure. We partnered with them to build a prediction service and a feature store.

In terms of the start small mentality, we recognized this as an opportunity to make real impact. We knew the adoption would be there once we finished building these first two pieces, so we jumped at the chance of building them out. A few key design decisions have served us well for the prediction service. First is the centralized prediction service. There are a couple of options, in terms of building one endpoint per model versus a single endpoint for all models.

There are certainly pros and cons for each one of these options. We chose the latter, because we believe that gives us better CPU utilization and is easier to manage. We recognize we can also move toward the other approach later on. The second design decision is about supporting model shadowing. You can have models that are performing predictions, as well as models that are in a testing phase to which you shadow all the traffic.
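As a rough illustration, the shadowing idea can be sketched like this (a minimal Python sketch with hypothetical names, not DoorDash's actual service):

```python
class PredictionService:
    """Minimal sketch of model shadowing: the primary model serves the
    caller's response, while a shadow model sees the same traffic but
    its output is only logged (all names here are hypothetical)."""

    def __init__(self, primary, shadow=None):
        self.primary = primary
        self.shadow = shadow
        self.shadow_log = []  # stand-in for a prediction log pipeline

    def predict(self, features):
        result = self.primary(features)
        if self.shadow is not None:
            # The shadow prediction never affects the caller's response;
            # it is only recorded for comparison and training data.
            self.shadow_log.append((features, self.shadow(features)))
        return result

# Compare a candidate model against live traffic without serving it.
svc = PredictionService(primary=lambda f: f["x"] * 2,
                        shadow=lambda f: f["x"] * 2.1)
result = svc.predict({"x": 10})  # result comes from the primary model
```

In a real service the shadow call would typically run asynchronously so it cannot add latency to the primary path.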

This has been tremendously helpful for collecting prediction logs for training purposes, as well as for quickly testing a model to ensure it works as expected. The third design decision is about latency. The first part of supporting that is batch prediction. Oftentimes, you only think about individual predictions, but there are use cases that will leverage this, such as recommendation use cases or search ranking use cases.

The clients can send a batch of predictions in one request rather than separate network requests. To help with latency, we leverage a C++ layer to help speed up the predictions. With this set of features, our prediction service was able to handle the QPS of the logistics team's ML use cases. As a result, we were able to demonstrate progress and make meaningful impact in a short amount of time.
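The batch pattern can be sketched as follows (an illustrative Python sketch; `predict_batch`, the field names, and the scoring function are made up, not DoorDash's actual API):

```python
def predict_batch(model, feature_rows):
    """Score many candidates in one request instead of issuing one
    network call per candidate -- the pattern that ranking and
    recommendation use cases rely on."""
    return [model(row) for row in feature_rows]

# A ranking client sends all candidates in a single request,
# then orders them by the returned scores.
candidates = [{"store_id": 1, "relevance": 0.2},
              {"store_id": 2, "relevance": 0.9},
              {"store_id": 3, "relevance": 0.5}]
scores = predict_batch(lambda row: row["relevance"], candidates)
ranked = [c for _, c in sorted(zip(scores, candidates),
                               key=lambda pair: pair[0], reverse=True)]
```

One request carrying N candidates amortizes the network round trip and any per-request overhead across all N predictions.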

Once we successfully onboarded those use cases from the logistics team, the next team that reached out to us was the search team. Their use case is about search and ranking. It is a very important one, as well as a highly visible one. It has a much higher QPS and a lower latency requirement than the previous one. It's in the path of user production traffic, so it's a very critical one. Our initial approach was to scale our prediction service horizontally, and up, as much as possible.

As we onboarded their use case, we discovered a challenge that typically happens in a multi-tenancy approach: the noisy neighbor. Meaning that those models have much higher traffic, and it impacted the other use cases. We moved these big noisy neighbors out into their own clusters, and scaled those clusters out independently, essentially isolating them. With just that simple approach, our prediction service was able to support up to 10x peak predictions.
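The isolation step amounts to routing by model rather than sending everything to one shared cluster. A toy sketch (the cluster and model names are made up for illustration):

```python
# Hypothetical routing table: high-traffic "noisy neighbor" models get
# dedicated clusters that scale independently; everything else shares
# the default multi-tenant cluster.
DEDICATED_CLUSTERS = {
    "search_ranking": "prediction-search",  # isolated noisy neighbor
}
SHARED_CLUSTER = "prediction-shared"

def cluster_for(model_id: str) -> str:
    """Pick the cluster that should serve predictions for this model."""
    return DEDICATED_CLUSTERS.get(model_id, SHARED_CLUSTER)
```

With this split, a traffic spike on the search models can no longer starve the models left on the shared cluster.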

This satisfies the search use case. Then the next larger and more complex use case came in, which is about recommendations on the homepage. This use case is quite challenging from the following perspective. As you can see on the app here, the recommendations are displayed on the homepage whenever someone visits DoorDash. It's done automatically.

It is one of the highest-traffic pages on the DoorDash website or marketplace. The number of recommendation candidates is pretty sizable. After they are fetched, they need to be ranked. Therefore, the number of predictions is in the thousands at one time. Each of these predictions requires some feature lookups. Scaling out is not an option anymore, and it requires a different approach.

This leads to the next principle, which is, 1% better. With this principle, the goal is not perfection; it's constant and never-ending improvement. That's the mentality behind this principle. The question is, how do you know what to improve and when to improve it? One way is to track and understand what's going on, to see what's changing and evolving. From there, we can figure out what areas need improvement, or anticipate potential problems. For us, on a weekly basis, we review the MLOps infrastructure metrics.

We noticed one of them, which is about feature value volume. The volume was increasing at a rapid rate over recent weeks and months. Also, the cost increased as we kept scaling up the cluster. The team decided to investigate this increased feature volume, and to see what efficiencies we needed to address to scale up our feature store infrastructure.

The team stepped back and made sure to really understand the requirements, based also on our experience of working with the feature store. Some of the requirements are listed here, in terms of how quickly features need to be refreshed. We need to support batch random reads and different data types. Latency is pretty key because this is for model serving.

The two things the team focused on were improving CPU utilization and reducing memory usage. The team did some research and some benchmarking to figure out what options we had, and what optimizations would be necessary. As a result of that effort, the team came up with a few optimizations. The first one is about using the right data structure, essentially. We designed a data structure for storing features.

Previously, we were storing features as a flat list of key-value pairs. Instead of that, we switched to a map data type, which means grouping related feature values into a single key. That immediately reduces the number of top-level key lookups. The added benefit is the colocation of an object's fields in the same Redis node, because it's a map, versus in the past, where we had to query these fields and they might be scattered across multiple nodes. This different approach of storing related features under a single top-level key helps improve CPU utilization.
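The layout change can be illustrated like this (plain dicts standing in for Redis `SET`/`GET` versus `HSET`/`HGETALL`; the keys and feature names are hypothetical):

```python
# Before: one top-level key per feature value -> one lookup each, and
# the keys may hash to different Redis cluster nodes.
flat_store = {
    "entity:42:feature:avg_prep_time": 11.5,
    "entity:42:feature:order_count": 7,
}

# After: one hash (map) per entity, so related feature values live
# under a single top-level key and colocate on one node.
hash_store = {
    "entity:42": {"avg_prep_time": 11.5, "order_count": 7},
}

def get_features_flat(entity_id, names):
    # N top-level key lookups
    return {n: flat_store[f"entity:{entity_id}:feature:{n}"]
            for n in names}

def get_features_hash(entity_id, names):
    # a single top-level key lookup, then field access within the map
    entity = hash_store[f"entity:{entity_id}"]
    return {n: entity[n] for n in names}
```

Both return the same feature values; the hash layout just trades many top-level lookups for one.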

The second optimization, to reduce memory usage, has two parts. One is that, previously, our feature names were quite verbose, as you can see from the example: 20-plus bytes, up to 30 bytes. The team decided to encode a feature name from that number of bytes into a 32-bit integer, making sure to use a hash function with minimal computational overhead.

With millions of features, the savings add up. The second part of reducing memory usage is about leveraging compression to compress some of the complex feature values, such as embeddings and list values.
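Both ideas can be sketched in a few lines (using `zlib.crc32` as a stand-in for whatever minimal-overhead 32-bit hash is actually used, and `zlib` compression for list-valued features; the feature name is made up):

```python
import struct
import zlib

def encode_feature_name(name: str) -> int:
    """Map a verbose feature name (20-30 bytes) to a 32-bit integer key.
    crc32 is a stand-in; the point is a cheap fixed-width encoding."""
    return zlib.crc32(name.encode("utf-8"))

def compress_embedding(values) -> bytes:
    """Pack a list-valued feature (e.g. an embedding) into raw floats,
    then compress the bytes before storing them."""
    raw = struct.pack(f"{len(values)}f", *values)
    return zlib.compress(raw)

# A ~40-byte name becomes a 4-byte key; over millions of feature values
# the savings add up.
key = encode_feature_name("store_daily_average_food_preparation_time")
```

How much compression actually saves depends on the value distribution; per the talk, it was applied to complex values such as embeddings and lists.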

Stepping back, with these optimizations that I shared, we were able to achieve both of the goals mentioned before: improved CPU utilization and reduced memory usage. That led to latency reductions as well as cost reductions, which was pretty awesome. With that, we were able to onboard the recommendation use case that I discussed earlier.

The last principle I'm going to talk about is customer focus. This is one of my favorite principles, because it leads to many positive outcomes, some of them unexpected. This quote, "With customer obsession, you're not just listening to your customers, you're also inventing on their behalf," came from one of the blogs about Amazon's Day One mentality. It also talked about how, even when customers don't yet know it, they always want something better. The desire to delight customers will drive us to invent on their behalf.

The provided example was about the Prime membership program. No customer has ever asked Amazon to do that. For us, this is what we go by in terms of customer obsession, applying the golden rule to customer support. We really believe that good customer support is one of the key ingredients of a successful MLOps infrastructure. We want to support our customers in a way that we would like to be supported, which means supporting our customers properly with respect and fairness.

We also need to strike the right balance between unblocking our customers and being overwhelmed with a high volume of support requests. A good practice is to track and evaluate the support load and the nature of the support, and invest in building tooling and automation at the right time. As an example, in the early days, model deployment used to require a pull request.

It had some friction, but it worked. As the number of data scientists increased, however, it became a support burden, as well as slowing down our customers. We invested in automating the deployment process with a simple one-click button, while keeping similar safeguards in place. It's a win-win for both sides.

"Delight customers with French Fries moment," I love this concept. This came out of Google. One of the things I like about this concept is it uses a playful name to capture an important concept, which encourages us to tap into our creative thinking, to delight our customers with solutions that don't require prompting from them. That's the key, don't require prompting from them.

Sometimes we end up benefiting from the solutions ourselves. As an example of this, in the early days, testing model predictions was a manual process using a script. Data scientists were ok with this because the documentation was fairly straightforward. As more data scientists joined the company, some of them were not fluent in scripting.

Without any prompting from our customers, we built a simple web-based application to help with this. It was a huge win. This became the genesis of an internal tool that we call ML Portal. We have been doubling down on investing in building this tool out with more capabilities since the second half of last year.

Current State

Moving on to the next section, the current state: where are we? This graph shows the adoption trend of the number of models in production and the number of online predictions. The trends are pretty good, up and to the right. We're pretty happy with these two trends. Another thing that we're proud of is the wide adoption of our infrastructure across many teams in the company, like the search recommendations team, the ads and promotions team, the logistics team, fraud, and so on.

Future Looking

What are we looking to do next? What does that look like for us? We recognize there's still a lot more to be done. There's still a lot more opportunities to delight our customers and to anticipate their needs. We have a few areas that are on our radar. First one is supporting complex and large deep learning models.

Our data scientist teams are looking to adopt more neural networks to leverage their power for large use cases, especially around search recommendations, natural language processing, and computer vision. We want to provide a scalable, distributed model training platform. We are revamping our prediction service to make it more generalized and flexible, so we can host and serve large models. Next is ML observability. We want to double down on this in 2023.

We want to provide a cohesive story for end-to-end understanding of feature quality, model performance, and drift detection, as well as making it easier to debug model performance issues. Now that we have a reasonable number of models in production, maintaining and iterating on existing models has become a higher priority.

We want to build a continuous model training infrastructure that is safe and easy for data scientists to use, with minimal human involvement in the loop. The last piece is efficiency. This is a constant area that we need to think about. When operating at scale, small efficiency improvements matter a lot in terms of cost. Two areas we're focusing on are the feature store and our prediction service, for example, leveraging autoscaling on our prediction service based on traffic patterns.




Recorded at:

Nov 15, 2023