Challenges and Solutions for Building Machine Learning Systems

According to Camilla Montonen, the challenges of building machine learning systems have to do mostly with creating and maintaining the model. MLOps platforms and solutions contain components needed to build machine systems, but MLOps is not about the tools; it is a culture and a set of practices. Montonen suggests that we should bridge the divide between practices of data science and machine learning engineering.

Camilla Montonen spoke about building machine learning systems at NDC Oslo 2023.

Challenges that come with deploying machine learning systems to production include how to clean, curate and manage model training data, how to efficiently train and evaluate the model, and how to measure whether or not the model continues to perform well in production, Montonen said. Other challenges are how to calculate and serve the predictions the model makes on new data, how to handle missing and corrupted data and edge cases, how and when to efficiently re-train this model, and how to version control and store these different versions, she added.

There is a set of common components that are usually part of a machine learning system, Montonen explained: a feature store, an experiment tracking system so that data scientists can easily version the various models that they produce, a model registry or model versioning system to keep track of which model is currently deployed to production, and a data quality monitoring system to detect when some issues with data quality might arise. These components are now part of many MLOps platforms and solutions that are available on the market, she added.

Montonen argued that the tools and components do solve the problems for the systems they were designed for, but often fail to account for the fact that in a typical company, the evolution of a machine learning system is governed by factors that are often far outside of the realm of technical issues.

MLOps is not about the tools, it’s about the culture, Montonen claimed. It is not about just adding a model registry or a feature store to your stack, but about how the people who build and maintain your system interact with it, and reducing any and all friction points to a minimum, as she explained:

This can involve everything from thinking about git hygiene in your ML code repositories, designing how individual components of pipelines should be tested, thinking about how to keep feedback loops between data science experimentation environments and production environments, and maintaining a high standard of engineering throughout the code base.

We should strive towards bridging the divide between the practice of data science, which prioritizes rapid experimentation and iteration over robust production quality code, and the practice of machine learning engineering, which prioritizes version control, controlled delivery and deployment to production via CI/CD pipelines, automated testing and more thoughtfully crafted production code that is designed to be maintained over a longer period of time, Montonen said.

Instead of immediately adopting a bunch of MLOps tools that are more likely to complicate your problems instead of solving them, Montonen suggested going back to basics:

Begin with an honest diagnosis of why your machine learning team is struggling.

The largest gains in terms of data scientists’ development velocity and production reliability can be gained with a few surprisingly basic and simple investments into testing, CI/CD, and git hygiene, Montonen concluded.

InfoQ interviewed Camilla Montonen about building machine learning systems.

InfoQ: How well do the currently available MLOps tools and components solve the problem that software engineers are facing?

Camilla Montonen: Most big MLOps tooling providers grew out of projects started by engineers working on large language model training or computer vision model training, and are great for those use cases. They fail to account for the fact that in most small and medium sized companies that are not Big Tech, we are not training SOTA computer vision models; we’re building models to predict customer churn or help our users find interesting items.

In these particular cases, these ready-made components are often not flexible enough to account for the many idiosyncrasies that accumulate in ML systems as time goes on.

InfoQ: What’s your advice to companies that are struggling with deploying their machine learning systems?

Montonen: Find out what your machine learning team is struggling with before introducing any tools or solutions.

Is the code base complex? Are data scientists deploying ML pipeline code into production from their local machines, making it hard to keep track of which code changes are running in production? Is it hard to pinpoint what code changes are responsible for bugs that arise in production? Perhaps you need to invest in some refactoring and a proper CI/CD process and tooling.

Are your new models performing worse in online A/B tests compared to your production models, but you have no insight into why this happens? Perhaps you need to invest in a simple dashboard that tracks key metrics.

Having a diagnosis of your current problems will help you identify what tools will actually solve them and help you reason about tradeoffs. Most MLOps tools require some learning/maintenance/integration efforts so it is good to know that the problem you are solving with them is worth these tradeoffs.

About the Author

Ben Linders

Show moreShow less

InfoQ Software Architects' Newsletter

Follow us on

About the Author

Ben Linders

Rate this Article

This content is in the Culture & Methods topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter