
Francesca Lazzeri on What You Should Know before Deploying ML in Production


At the recent QCon Plus online conference, Dr. Francesca Lazzeri gave a talk on machine learning operations (MLOps) titled "What You Should Know before Deploying ML in Production." She covered four key topics: MLOps capabilities, open-source integrations, machine-learning pipelines, and the MLflow platform.

Dr. Lazzeri, a principal data scientist manager at Microsoft and adjunct professor of AI and Machine Learning at Columbia, began by discussing several challenges encountered in the lifecycle of an ML project, from collecting and cleaning large datasets, to tracking multiple experiments, to deploying and monitoring models in production. She then covered the four main areas data scientists and engineers should consider. First, she outlined several MLOps capabilities for managing models, deployments, and monitoring. Next, she surveyed open-source tools for deep learning, followed by frameworks for managing machine-learning pipelines. Finally, she gave an overview of MLflow, an open-source machine-learning platform. In the post-presentation Q&A session, Dr. Lazzeri cautioned against thinking of MLOps as a "static tool." Instead, she said:

MLOps is more about culture and thinking on how you can connect different tools in your end-to-end development experience...and how you can optimize some of these opportunities that you have.

The talk began with a discussion of some of the challenges of developing and deploying ML applications. ML models require large amounts of training data, and tracking and managing these datasets can be difficult. There is also the challenge of feature engineering: extracting and cataloging the features in the datasets. Training an accurate model can require many experiments with different model architectures and hyperparameter values, which must also be tracked. Finally, after the model is deployed to production, it must be monitored. This differs from monitoring conventional web apps: in addition to standard performance data such as response latency and exceptions, model predictions must be measured against ground truth, and the entire lifecycle must be repeated if the real-world data drifts away from the originally collected data.
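
Dr. Lazzeri did not prescribe a specific drift-detection technique, but a common approach is to compare per-feature distributions between the training data and what the model sees in production. The sketch below is illustrative only (the feature names and significance threshold are placeholders, not from the talk); it flags drifted features with a two-sample Kolmogorov-Smirnov test from SciPy.

    import numpy as np
    from scipy.stats import ks_2samp

    def feature_drift_report(train_features, prod_features, threshold=0.05):
        """Flag features whose production distribution has drifted from training.

        Runs a two-sample Kolmogorov-Smirnov test per feature; a p-value below
        `threshold` suggests production data no longer matches the training data.
        """
        drifted = {}
        for name in train_features:
            stat, p_value = ks_2samp(train_features[name], prod_features[name])
            if p_value < threshold:
                drifted[name] = {"ks_statistic": stat, "p_value": p_value}
        return drifted

    # Hypothetical example: "age" is stable, "income" has shifted in production
    rng = np.random.default_rng(0)
    train = {"age": rng.normal(40, 10, 5000), "income": rng.normal(50_000, 8_000, 5000)}
    prod = {"age": rng.normal(40, 10, 1000), "income": rng.normal(62_000, 8_000, 1000)}
    print(feature_drift_report(train, prod))  # expect only "income" to be flagged

A drift report like this can feed the "notify and alert on events" capability listed below, triggering a retraining run when the flagged set is non-empty.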

MLOps can help data scientists and engineers see these challenges as opportunities. Dr. Lazzeri listed seven important MLOps capabilities (a short pipeline sketch follows the list):

  • Create reproducible ML pipelines
  • Create reusable software environments for training and deploying
  • Register, package, and deploy models from anywhere
  • Track governance data for the end-to-end ML lifecycle
  • Notify and alert on events in the lifecycle
  • Monitor the app for operational and model issues
  • Automate the end-to-end lifecycle with different pipelines
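
The talk itself did not include code, but the first capability above, reproducible pipelines, can be sketched briefly: every preprocessing and training step lives in one versionable pipeline object with pinned random seeds, so a run can be repeated exactly. The library choice (scikit-learn) and the column names are illustrative assumptions, not from the talk.

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_features = ["age", "income"]      # hypothetical columns
    categorical_features = ["country"]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ])

    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

    # pipeline.fit(X_train, y_train) produces a single artifact that can be
    # serialized, registered, and re-run later to reproduce the same result.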

She then discussed several open-source packages that can help with these capabilities. First, she mentioned three popular frameworks for training models: PyTorch, TensorFlow, and Ray. Dr. Lazzeri noted that in a survey of commercial users, TensorFlow was used by about 60% and PyTorch around 30%. She mentioned that Ray has many features specialized for reinforcement learning, although some are in beta or even alpha status. She also mentioned two frameworks for interpretable and fair models: InterpretML, which can train explainable "glass box" models or explain black box ones, and Fairlearn, a Python package for detecting and mitigating unfairness in models. Dr. Lazzeri also recommended Open Neural Network Exchange (ONNX), an interoperability framework that allows models trained in various frameworks to be deployed on a wide variety of hardware platforms.
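
As a rough illustration of that interoperability (this sketch is not from the talk, and the toy model and tensor names are placeholders), a model trained in PyTorch can be exported to the ONNX format with torch.onnx.export and then served through ONNX Runtime, independent of the framework that produced it:

    import numpy as np
    import onnxruntime as ort
    import torch
    import torch.nn as nn

    # Toy PyTorch model standing in for a real trained network
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    model.eval()

    # Export to the framework-neutral ONNX format
    dummy_input = torch.randn(1, 4)
    torch.onnx.export(model, dummy_input, "model.onnx",
                      input_names=["features"], output_names=["scores"])

    # Run the exported model with ONNX Runtime, with no PyTorch dependency
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    scores = session.run(None, {"features": np.random.randn(1, 4).astype(np.float32)})
    print(scores)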

Next, Dr. Lazzeri discussed ML pipelines, which manage data preparation, training and validating models, and deployment. She outlined three pipeline scenarios and the recommended open-source framework for each: Kubeflow for managing a data-to-model pipeline, Apache Airflow for managing a data-to-data pipeline, and Jenkins for managing a code-to-service pipeline. Each scenario has different strengths and appeals to a different persona: Kubeflow for data scientists, Airflow for data engineers, and Jenkins for developers or DevOps engineers. Finally, she gave an overview of MLflow, an open-source platform for managing the end-to-end ML lifecycle. MLflow has components for tracking experiments, packaging code for reproducible runs, deploying models to production, and managing models and associated metadata.
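
A minimal sketch of MLflow's experiment-tracking component is shown below, assuming a local tracking store; the experiment name, hyperparameter, and model are placeholders rather than examples from the talk.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    mlflow.set_experiment("qcon-mlops-demo")   # hypothetical experiment name
    with mlflow.start_run():
        alpha = 0.5
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))

        mlflow.log_param("alpha", alpha)           # hyperparameters of this run
        mlflow.log_metric("mse", mse)              # evaluation metrics
        mlflow.sklearn.log_model(model, "model")   # packaged model artifact

Each run logged this way appears in the MLflow tracking UI, where experiments can be compared and a chosen model promoted into the model registry.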

The session concluded with Dr. Lazzeri answering questions from the audience. Several attendees asked about ONNX. Dr. Lazzeri noted that in her survey, about 27% of respondents were using ONNX; she also noted that models from Ray and PyTorch both perform well on ONNX. She recommended automated machine learning (AutoML) as a good solution for helping developers scale their model training. She concluded by noting that although there are tools that can help, monitoring the accuracy of ML models in production is still a somewhat manual process.
