
MLOps: the Most Important Piece in the Enterprise AI Puzzle


Summary

Francesca Lazzeri overviews the latest MLOps technologies and principles that data scientists and ML engineers can apply to their machine learning processes.

Bio

Francesca Lazzeri is an experienced scientist and machine learning practitioner with over 12 years of both academic and industry experience. She is author of the book “Machine Learning for Time Series Forecasting with Python” (Wiley) and many other publications, including technology journals and conferences.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Transcript

Lazzeri: My name is Francesca Lazzeri. I'm a Principal Cloud Advocate Manager at Microsoft. I'm here to talk about ML Ops, and why I think that ML Ops is the most important piece in the enterprise AI puzzle.

When does any machine learning algorithm actually become AI? AI offers companies the possibility to transform their operations. However, in order to leverage these opportunities, companies have to learn how to successfully build, train, test, and push hundreds of machine learning models into production, and to move models from development to their production environment in ways that are simple, robust, fast, and most importantly, repeatable. Nowadays, data scientists and developers have a much easier experience when building AI-based solutions, because data and open source machine learning frameworks are more available and accessible. However, this process becomes a lot more complex when they need to think about model deployment and pick the best strategy to scale up to a production-grade system. Just to clarify, model deployment is just one part of machine learning Ops, of ML Ops. Model deployment is the method by which you integrate a machine learning model into an existing production environment in order to start using it to make practical decisions for your business, based on the data and on the results that you get from your models. It is only once models are deployed to production that they start adding value, making deployment a very important step of the ML Ops experience.

ML Ops - How to Bring ML to Production

Let's try to understand better, what is ML Ops? ML Ops empowers data scientists and app developers to bring machine learning models to production. ML Ops enables you to track, version, audit, certify, and reuse every asset in your machine learning lifecycle, and provides orchestration services to streamline managing this lifecycle. ML Ops is really about bringing together people, processes, and platforms to automate machine learning-infused software delivery and provide continuous value to your users.

How Is ML Ops Different from DevOps?

How is ML Ops different from DevOps? There are four main aspects that you can look at in order to understand the differences. First, data and model versioning is different from code versioning: you have to version datasets as the schema and origin of the data change. Then there is model reuse, which is different from software reuse, as models must be tuned based on input data and scenario. Another aspect is the digital audit trail, whose requirements change when dealing with code plus data. Finally, model performance tends to decay over time, and you need the ability to retrain models on demand to ensure that they remain useful in a production context.

Traditional vs. ML Infused Systems

You can see how a traditional system is different from a machine learning infused system. Machine learning introduces two new assets into the software development lifecycle: data and models. There are many different assets and processes that you need to manage in a real-world scenario, for example, configuration, data collection, feature extraction, data verification, machine resource management, analysis tools, process management tools, and monitoring. The machine learning code is actually a very tiny piece of this big puzzle.

Customer Pain Points

These are some of the most common pain points that we try to summarize in this table. It's very hard to deploy a model for inference after I have trained it; it would be great to have a no-code deployment for models of common languages and frameworks. It is also very hard to integrate the machine learning lifecycle into my application lifecycle; it would be great to have a production-grade model release with model validation, multi-stage deployment, and controlled rollout. It is hard to know how and when to retrain a machine learning model; a model feedback loop with A/B scorecards and drift analysis integrated with machine learning pipelines for retraining can be very helpful here. Finally, it's hard to figure out where my model came from and how it's being used; here it would be great to have enterprise asset management with audit trail, policy, and quota management.

How to Integrate ML Ops in the Real World

How do we implement ML Ops in the real world? There are many jobs, roles, and tools that are involved in production machine learning. Let's start with the most famous one, the data scientist. Most of the time, data scientists know how to use machine learning tools such as Azure Machine Learning on the cloud. Of course, they are familiar with GitHub. They use deep learning and machine learning frameworks such as TensorFlow, PyTorch, and Scikit-learn. They also know how to use Azure Compute: CPU, GPU, FPGA. Then we have the machine learning engineer, who is usually very familiar with DevOps and GitHub, uses different Kubernetes services, and is also able to use Azure IoT Edge and Azure Monitor. Then there is a third role that is also very common, the data engineer. This person knows how to use Data Lake, Data Factory, Databricks, and of course SQL, and uses these services and tools in order to manage the data pipeline. There are also some additional roles: the IoT Ops person, the data analyst, the business owner, the data expert, the data visualization person, and so on. All these roles are very important in order to make sure that you develop an end-to-end machine learning solution.

Pipelines to Manage the End-To-End Process

However, there is rarely one pipeline to manage the end-to-end process. We have different roles, and as a consequence, we have different pipelines. These pipelines do not really talk well to each other. The first pipeline belongs to the data engineer. Usually, this person is very familiar with data preparation and knows how to use the Data Lake and Data Catalog. Then there is the data scientist, who is very familiar with the machine learning pipeline. They know how to train the model, how to do feature engineering and feature extraction, and how to handle training and evaluation of the model. Then they know how to register the model. Then there is the machine learning engineer, who is an expert in the release of the model. They know how to package, validate, approve, and finally deploy the model.

Process Maturity Model: Level 1 - No ML Ops

There are different levels of maturity in this process. We like to see four different levels, and I'm going to show you all four levels in more detail. There is level one, which is no ML Ops at all. Probably, it's a very common scenario for all of you. This is a very interactive, exploratory level, where you do some exploration and try to get something useful out of machine learning. Most of the time we have a data scientist, who is the expert at this first level. They do the data preparation and the selection of algorithms. Finally, they select the best model based on their scenario and on their own data flows.

Level 2 - Reproducible Model Training

At the second level, we again have a data scientist, who is the expert of the model training part. Here we have a machine learning pipeline: we have the data preparation and then the selection of the algorithm. Then we pick the best, most useful model, and we register it to a model registry. There is also a run history service that is great at capturing information such as datasets, environments, code, logs, metrics, and outputs.

Level 3 - Automated Model Deployment

The third level is about automated model deployment. This level is great because we can automate the packaging, but also the certification and the deployment of a machine learning model. From the model registry, you can package the model, certify the model, and then release the model. You can see that when packaging the model, you can package environments and code. You certify the models in terms of data and also explanations of your machine learning algorithms, in order to make them more interpretable. Then, of course, there is the release of the model in terms of eventing, notification, and DevOps integration. Most of the time, a machine learning engineer takes care of this level.

Level 4 - Automated E2E ML Lifecycle

Finally, we have level four. This is about the automated end-to-end machine learning lifecycle. In this level, what is nice is that we have all the three roles working together. We have the data scientist, the machine learning engineer, and also the data engineer.

Real World Examples - Leveraging ML Ops to Ship Recommender System

Let's now see some real-world examples. For this presentation, I took an example from a recommendation project that we worked on. I put the link to the GitHub repo; it is github.com/microsoft/recommenders. This repository contains examples and best practices for building recommendation systems, and also provides a lot of Jupyter Notebooks that you can leverage if you want to build end-to-end machine learning solutions for recommender systems. There is a lot of information about preparing and loading the data for each recommender algorithm. Then you can build the models using various classical and also deep learning recommender algorithms. Then you can evaluate your model: you can evaluate the algorithms with offline metrics. Next, there is model selection and optimization: at this point you can tune and optimize the hyperparameters for recommender models. Finally, there is operationalization, which is about operationalizing models in a production-ready environment on the cloud.

Generalized ML Ops Process

We were also able to build this architecture, which is more of a generalized ML Ops process. Here, a developer works on the application code in the IDE of their choice, then commits the code to the source control of their choice; VSTS has good support for different source controls. On the other side, there is a data scientist who works on developing their model. Once they're happy, they publish the model to a model repo. Then a release build is kicked off in VSTS, based on the commit in GitHub. The VSTS build pipeline pulls the latest model from a Blob container and creates a container image. After release, VSTS pushes the image to a private image repo in the Azure Container Registry, and on a set schedule, most of the time overnight, a release pipeline is kicked off. Finally, the latest image from ACR is pulled and deployed across a Kubernetes cluster on ACS. The user's request for the app goes through a DNS server, which passes the request to a load balancer, and the response is sent back to the user. Again, this is a generalized ML Ops process. As you can see here, we have the three roles that I was referring to before: the machine learning engineer, the data engineer, and the data scientist.

Azure Machine Learning ML Ops Features

Let's now look quickly at the Azure Machine Learning ML Ops features that you can leverage. How does machine learning on Azure help with ML Ops? Azure ML contains a number of asset management and orchestration services to help you manage the lifecycle of your model training and deployment workflows. With Azure ML and Azure DevOps, you can manage your datasets, but also your experiments, models, and any ML-infused applications. Azure Machine Learning is a cloud-based environment that you can use to train, deploy, automate, manage, and track machine learning models. Azure Machine Learning can be used for any machine learning algorithm, from classical ML to deep learning, supervised but also unsupervised learning. You can write in Python, but R is another option as well; with the SDK, you can use both programming languages.

Dataset Management and Versioning

One of the most important capabilities on Azure is dataset management and versioning. It's not only one of the most important capabilities, it's also the first step. Dataset versioning is a way to bookmark the state of your data, which is very important so that you can apply a specific version of the dataset to future experiments. Typical versioning scenarios are when new data is available, for example for training, or when you are applying different data preparation or feature engineering approaches to your data. By registering the dataset, you can version, reuse, and share it across experiments, and also with your peers, with your colleagues. You can register multiple datasets under the same name and retrieve a specific version by name and version number. It is very helpful. It's also important to understand that when you create a dataset version, you are not creating an extra copy of the data with the workspace. Because datasets are references to the data in your storage service, you have a single source of truth that is managed by your storage service.
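To make the idea concrete, here is a minimal, hypothetical sketch of a versioned dataset registry in plain Python (this is not the Azure ML SDK, just the concept): each version is a numbered reference to data in storage, not a copy, which mirrors the single-source-of-truth point above.

```python
class DatasetRegistry:
    """Toy stand-in for a dataset registry: each version is a
    reference to data in a storage service, never a copy of it."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of storage references

    def register(self, name, storage_ref):
        """Register a new version under `name`; returns the version number."""
        self._versions.setdefault(name, []).append(storage_ref)
        return len(self._versions[name])  # versions are 1-based

    def get(self, name, version=None):
        """Retrieve a specific version, or the latest if none is given."""
        refs = self._versions[name]
        return refs[-1] if version is None else refs[version - 1]


registry = DatasetRegistry()
registry.register("sales-data", "blob://container/sales/2021-q4.csv")
v2 = registry.register("sales-data", "blob://container/sales/2022-q1.csv")
```

Registering under the same name bumps the version, and older versions stay retrievable by number, exactly the bookmarking behavior described above.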

Declarative ML Pipelines

Then we have machine learning pipelines. An Azure Machine Learning pipeline is a workflow of a complete machine learning task. Subtasks are encapsulated as a series of steps within the pipeline. An Azure Machine Learning pipeline can be as simple as one that, for example, calls a single Python script. Pipelines should focus on machine learning tasks such as data preparation, including importing, validating, and cleaning your data, but also normalization and staging. Then there is training configuration, including parameterizing arguments, file paths, logging, and reporting. It's also important to train and validate your machine learning algorithms. Finally, pipelines are also about the deployment of your models. This is a very important step because it also includes versioning, scaling, provisioning, and access control.
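Purely as an illustration (plain Python, not the Azure ML pipeline API), the core idea of encapsulating subtasks as a series of steps can be sketched in a few lines:

```python
def make_pipeline(*steps):
    """Compose data-prep / training / validation callables into one
    workflow: each step's output feeds the next step's input."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run


def drop_missing(rows):
    """Data preparation step: remove records with missing values."""
    return [r for r in rows if r is not None]


def normalize(rows):
    """Data preparation step: rescale values to the [0, 1] range."""
    lo, hi = min(rows), max(rows)
    return [(r - lo) / (hi - lo) for r in rows]


pipeline = make_pipeline(drop_missing, normalize)
```

Because each subtask is an isolated step, steps can be swapped, reordered, or reused across pipelines, which is the property that makes the declarative pipeline style valuable.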

Model Management, Packaging, and Deployment

Another important capability is model management, packaging, and deployment. Machine learning operations, that is ML Ops, is based on DevOps principles and practices that increase the efficiency of your workflows, for example, continuous integration, delivery, and deployment of your machine learning workflow. Specifically, ML Ops applies these principles to the machine learning process with the goal of faster experimentation and development of models, faster deployment of models into production, and quality assurance. Azure Machine Learning provides ML Ops capabilities such as creating reproducible ML pipelines; registering, packaging, and deploying models from anywhere; capturing the governance data for the end-to-end ML lifecycle; and monitoring the machine learning application for operational and machine learning related issues.
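As an illustration only (a toy in plain Python, not an Azure ML API), "capturing governance data" essentially means recording an audit trail alongside every model action, who registered or deployed which version, and when:

```python
import datetime


class ModelRegistry:
    """Toy model registry with an audit trail: every register and
    deploy action is recorded with the user and a timestamp."""

    def __init__(self):
        self.models = {}     # model name -> latest version number
        self.audit_log = []  # ordered governance records

    def _record(self, action, name, version, user):
        self.audit_log.append({
            "action": action, "model": name, "version": version,
            "user": user, "at": datetime.datetime.utcnow().isoformat(),
        })

    def register(self, name, user):
        version = self.models.get(name, 0) + 1
        self.models[name] = version
        self._record("register", name, version, user)
        return version

    def deploy(self, name, user):
        version = self.models[name]  # deploy the latest registered version
        self._record("deploy", name, version, user)
        return version
```

The audit log is what lets you answer the earlier pain point, "where did my model come from and how is it being used?"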

Azure DevOps and Event Grid Integration, and Data Drift Monitor

It's also important to mention that there is an Azure DevOps integration to automate training and deployment into existing release and management processes, which is again a great capability. Then there is also the Azure ML Event Grid integration, a fully managed event routing service for all activities in the machine learning lifecycle. You can also set up a data drift monitor that compares datasets over time and determines when to take a closer look at a dataset. This is again another important capability.
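The idea behind a drift monitor can be sketched with a deliberately simple statistic (a mean-shift check; real drift monitors use richer distribution distances): compare incoming data against the baseline the model was trained on, and flag when the gap crosses a threshold.

```python
def mean_shift_drift(baseline, current, threshold=0.2):
    """Flag drift when the mean of the current data moves more than
    `threshold` (as a fraction of the baseline mean) away from the
    baseline dataset. Returns (drift_detected, relative_shift)."""
    base_mean = sum(baseline) / len(baseline)
    curr_mean = sum(current) / len(current)
    shift = abs(curr_mean - base_mean) / abs(base_mean)
    return shift > threshold, shift
```

A drift flag like this is typically what triggers the retraining pipelines mentioned earlier, closing the feedback loop.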

Key Takeaways

It's important to have a machine learning plus DevOps mindset. ML Ops provides the structure for building, deploying, and managing an enterprise-ready AI application lifecycle, and it enhances lean delivery, which is again a very important step. Adoption will increase the agility, quality, and delivery of AI project teams. More than technology, ML Ops is a conversation about people, processes, and technology. AI principles and practices need to be understood by all the different roles that we were talking about.

Resources

If you want to learn more, I put a few links for you. There is a lot of documentation that you can look at; you can find more at aka.ms/azuremldocs. There is the GitHub repo full of samples and tutorials at github.com/microsoft/mlops. If you have feedback, tell us what you think and a little bit more about your scenario and what you're trying to achieve; you can do this at aka.ms/azureml_feedback.

I also added a few additional resources that you can check offline if you want to learn more about machine learning in general, Azure Machine Learning, and most importantly, ML Ops. ML Ops is more of a practice than an actual tool. I want to make sure that all of us learn how to deploy our machine learning models so that we can operationalize them, and allow other people and other companies to consume them and really operationalize the answers, the results that they need in order to improve any business process.

Questions and Answers

Jördening: We had one question on how models are deployed on Azure if they are packaged as Docker OCI images.

Lazzeri: Most of the time, there are two functions. The answer that I would give you now is based on what we are observing in the industry from our customers and also from the data science team that I've been leading. Most of the time we use Python for the deployment of models. There are two different functions that you need to write in Python, and they are pretty simple, straightforward functions if you know how the models work. There is the init function. This is the function that basically decides and defines how the data has to be prepared in order to be consumed by the model. You need, of course, to feed your algorithm, your model, with the data, and this first function does that for you. For example, if you have a time-series dataset, you need to find the specific column, your index column, your timestamps column, that you're going to use for time series forecasting. All these types of data preparation have to be in this init function.

Then after the init function, there is the run function, which is another very simple function in Python. You just need to make sure that after you ingest the data, and the data is processed in the right way, this function is going to provide the model for you, the model that you decided would be the best one to operationalize. The init function selects the data, and the run function gives you the result that you need. Once you write these two functions in Python, you are going to deploy the model, and the result of the deployment is actually what you call the [inaudible 00:25:05], where you're going to have these two functions. Most of the time the serialization process is done in the [inaudible 00:25:20]. You do that adjustment for the data preparation and also for the model itself, in order for it to be consumed [inaudible 00:25:32]. These are the two different steps that you need to follow in order to deploy your ML models in Azure, [inaudible 00:25:45]. This is, 99% of the time, the process I see the data science community, but also the ML engineer community, follow.
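The init/run pattern described here is the shape of a typical scoring script. Below is a minimal, self-contained sketch; the trivial doubling "model" is a stand-in for a real deserialized model, which init() would normally load from the model directory in an actual deployment:

```python
import json

model = None  # populated once per service instance, in init()


def init():
    """Runs once when the service starts. In a real scoring script
    this would deserialize the registered model from disk; here we
    substitute a trivial stand-in so the sketch runs anywhere."""
    global model
    model = lambda values: [v * 2 for v in values]


def run(raw_data):
    """Runs on every request: parse the incoming JSON payload, feed
    it to the model, and return the predictions as JSON."""
    values = json.loads(raw_data)["data"]
    predictions = model(values)
    return json.dumps({"result": predictions})
```

The serving infrastructure calls init() once at startup and run() per request, so expensive setup stays out of the request path.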

We do have other tools that we can use for deployment. Another tool that is now becoming pretty popular is the ML designer, the drag-and-drop tool, which is nice if you want to see the end-to-end machine learning lifecycle. For example, you want to see what other data scientists have done in terms of data preparation, training, and validation, with a very nice visual flow, all put together. Then the deployment is going to be just a click; you don't actually need to write Python or any function. It's more of a low-code type of experience. Some customers prefer that, and so [inaudible 00:26:42] the designer, this tool is going to create the REST API for your model; these are the endpoints that you can use in order to call and consume the model that you have. This is really the operationalization part that you've done in ML.

Jördening: Should data scientists who spend their day in Jupyter Notebooks be encouraged to learn these practices? I would definitely say yes.

Lazzeri: I totally agree with you. Right now, data scientists are still a crucial part of the end-to-end machine learning process. I really like to hire data scientists. Sometimes, when they are at the beginning of their career, it's normal that they only focus on the model that [inaudible 00:28:06] on the Jupyter Notebook. They're going to spend time trying to do parameter tuning, data preparation, [inaudible 00:28:17]. As soon as they join, I really help them understand the end-to-end process of operationalizing a machine learning solution.

We all like to talk about AI; artificial intelligence sounds very fancy, but actually, are we really deploying machine learning models into production so that we can then build the AI application on top of that? The answer is yes and no. We have been seeing a few. There are successful use cases where, of course, we are leveraging the AI applications, but most of the time, it's very hard to deploy machine learning models into production, and most importantly, it's very hard to make sure that we have these ML Ops best practices in place so that the solution can actually work over time and can produce even better results as time goes on.

Yes, the more you grow in your area as a data scientist, the more you will need to understand about operationalizing models and ML Ops best practices, and also understand what your peers in the team are doing. If you're working with machine learning engineers and they are taking care of the operationalization part, that is totally fine; you're not really going to work on that part. But you need to be ready to at least understand what they are doing, so that if you find yourself in front of a customer, and you need to present the solution that you are building, you are able to tell the end-to-end story.

It's more of a career recommendation of mine, and of course, as a consequence, I hope that it will also bring quality to the solutions that we're going to push into production, because I think that if you know how the solution could be deployed, that is going to improve its quality as well. That is the advice I would give to every data scientist that talks to me and asks me this question. Of course, most of the time, especially at Microsoft, we love to have that type of expectation, so when we hire junior data scientists, the first role of the data scientist is really about data preparation and machine learning models. That is the first step into this fantastic world of machine learning and AI.

Jördening: Where would you recommend small teams to start? In which area?

Lazzeri: There are two areas that are extremely popular right now. One is data preparation, because you can be an expert in the deployment of your model and things like that, but if you don't know the data, if you don't know how that data and the data flows can contribute to the accuracy of your model, I think that you cannot even start. Any end-to-end machine learning solution that you are planning to push into production always starts with the data. Make sure that you understand not only how that data can help you answer the exact problems, the exact questions that you're trying to solve, but also how you can prepare that data in order to feed your machine learning model. This is always the first part.

Then for your machine learning model itself, I think that we are so lucky at this point in the history of machine learning, because we have access to many different open source machine learning and deep learning frameworks. I've just mentioned PyTorch and TensorFlow. Of course, you need to know how to leverage those. I think that the Python community has done a brilliant job, and you can always leverage that work.

Then the second area, in my opinion, which is extremely important for small teams and small companies, is the model deployment part. When I was answering the question about deployment, I touched on two points. One is the init function, which is about data preparation. Knowing your data and knowing your data preparation process well is going to help you with the deployment of the model. Then, there is the model itself. Again, if you are ready with the data preparation, and then with the best practices to deploy your model, I think 60% to 70% of your work is already done. These are really the two most important points: data preparation and model deployment.

Jördening: Totally agree on that. I'm really curious to see where ML Ops is going, and what different things will appear.

 


 

Recorded at:

May 06, 2022

