Why Most Machine Learning Projects Fail to Reach Production


Key Takeaways

  • Most ML projects fail to reach production. Five recurring pitfalls drive failures in ML projects: choosing the wrong problem, data quality/labeling issues, the model-to-product gap, offline-online mismatch, and non-technical blockers.
  • Define a clear business goal before starting, and validate that it truly needs ML. Translating business goals into ML requires heavy data engineering, objective‑function design, and sometimes expensive infrastructure, making late pivots costly.
  • Treat data as a product: prevent leakage and bias, invest in labeling and golden sets, and build evaluation pipelines early to avoid brittle releases.
  • Manage uncertainty with a balanced portfolio: ship low‑risk/high‑impact wins to justify investment, while incubating riskier bets that can be game‑changing.
  • Encourage early collaboration and active engagement of cross-functional teams. Successful ML teams align stakeholders, scope an MVP, build end‑to‑end early for A/B testing, and iterate based on monitoring.

This article is a summary of my talk at QCon San Francisco 2024. For the past decade, I have worked as an applied scientist and machine learning engineer across multiple domains, including social media, fintech, and productivity tools. Over the years, I have seen many projects succeed, but just as many also fail to make it to production. This article reflects on why success or failure happens and what we can learn from it.

In this article, I discuss common pitfalls that cause machine learning projects to fail, such as the inherent uncertainty of machine learning, misaligned optimization objectives, and skill gaps among practitioners. First, I outline the ML project lifecycle and what makes it different from ordinary software projects. Then I dive into five common pitfalls (with examples) and explain how we can reduce the chance of encountering them.

Failure Rates of Machine Learning Projects

I have recently worked on machine learning projects across various domains, including social media platforms, fintech solutions, and productivity tools. Some of these projects reached production, while many did not. Each effort taught me something interesting and introduced me to new technologies, but it is not a great feeling when a project you poured your heart into does not generate the impact you had hoped for. I wondered: am I alone? How severe is this problem?

Older studies reported failure rates in ML projects as high as eighty-five percent. More recent surveys tell a similar story. In a 2023 Rexer Analytics study [1] of over three hundred ML practitioners, only thirty-two percent said their projects reached production. Rates vary by industry: big tech has years of AI adoption behind it, while traditional enterprises and startups are still navigating their path to effective, seamless adoption.

Not all failures are bad. ML projects are inherently uncertain and experimental. Often, you cannot know whether ML will help until you explore the data and try some baseline models. If, based on these preliminary studies, you decide to pivot or kill the project quickly, that should count as a success: this is the classic fail-fast principle that encourages innovation.

In this article, I focus on bad failures: projects that drag on without a clear definition, models that are not deployed despite good offline performance, or solutions that are not adopted even after deployment.

The Lifecycle of ML Projects

From a high-level, simplified perspective, a typical ML project lifecycle has six steps. It starts with identifying a business goal to be optimized using ML, which then needs to be framed as an ML problem. Framing the problem involves exploring and processing the related data to train various models. The best-performing models are deployed and monitored, and the feedback from monitoring is used to refine the entire system.

A simplified high-level diagram illustrating the machine learning project lifecycle. (Source: author)

There are two important points to this iterative lifecycle:

  • It is a lengthy, multi-step process with numerous handovers across teams, which increases the risk of failure due to the inherent complexity involved.
  • ML projects are data‑centric optimization problems. Feedback signals from data, models, and monitoring are equally essential for a successful outcome.

Pitfall 1: Tackling the Wrong Problem

Of the five common pitfalls we want to cover, one of the most critical is optimizing the wrong problem. In Rexer Analytics' survey, when ML practitioners were asked whether objectives are clearly defined before a project starts, twenty-nine percent answered "most of the time", and twenty-six percent said this rarely happens. That lack of clarity is a battle ML engineers commonly fight even before committing to a project.

Starting a project with some ambiguity and iterating over the business goals was common in the past. For machine learning projects, however, that ambiguity has become a more severe obstacle. To understand why, we need to look at how business goals are transformed into a machine learning solution, and we do not want to skip the first step: identifying whether the problem actually calls for machine learning at all.

Once we confirm the problem calls for machine learning, we need to frame it as an ML problem, which involves identifying the specific data from which to extract signals (data engineering) and training multiple models, architectures, and hyperparameter settings. Depending on the model's sophistication, the training step may also require expensive infrastructure (e.g., GPUs).

At the end of the day, we are trying to optimize a mathematically defined objective function, and that objective function depends heavily on the type of business goal we are trying to solve. Late changes to business goals require adjustments to the data, the objective function, and the pipelines, which can mean throwing away completed work. That is why, before committing to a machine learning problem, we usually want to ask the business team many questions.
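To make that coupling concrete, here is a minimal sketch of my own (not from any specific project) showing how a shift in business goal changes the objective function itself: moving from "predict clicks" to "predict revenue-generating clicks" turns a plain log loss into a revenue-weighted one, which in turn demands new data from the pipeline.

```python
import numpy as np

def log_loss(y_true, y_pred, sample_weight=None):
    """Binary cross-entropy; sample_weight lets business value reshape the objective."""
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.average(losses, weights=sample_weight)

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.2, 0.4, 0.1])

# Goal A: "predict clicks" -> every example counts equally.
print(log_loss(y_true, y_pred))

# Goal B: "predict revenue-generating clicks" -> the same model now needs
# per-example revenue (new data engineering) and a reweighted objective.
revenue = np.array([5.0, 1.0, 50.0, 1.0])
print(log_loss(y_true, y_pred, sample_weight=revenue))
```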

Here is an example that I want to share showing how we can increase our chance of picking a winning project. A few years ago, I was part of a centralized AI fintech team that supported multiple business lines. Every line pitched its project as "the most important", often with jargon we did not share. Our team had to navigate the noise and prioritize investments with the best chance of success. 

Throughout the year, I was able to work on many different types of projects across various business lines. The biggest success – for the company and for me – was a self-explanatory predictive model for the personal and commercial banking (P&CB) space. Three factors led to the success of this project:

  • Direct revenue relevance: P&CB is a major profit center, so there was a strong top‑level drive.
  • Fit with an existing system: Our model slotted into a long‑running end‑to‑end system with monitoring/reporting; we only needed to swap the model.
  • ML feasibility: The incumbent model was simple, so we were confident we could outperform it with modern architectures and features.

Starting a machine learning project just because everybody else is doing it, or because it is technically feasible, is not enough. The best ML projects hit the sweet spot of desirable (stakeholder pull), profitable (business impact justifies cost), and feasible (technically solvable). Ask hard questions up front: Is the goal clearly defined? Do projected profits justify the costs? Which assumptions are realistic? What risks might the model expose?

Portfolio balance matters too. Low‑risk/high‑impact projects are "low‑hanging fruit". High‑impact/high‑risk projects are worth pursuing if you are aware of the risks and maintain a balanced portfolio. Wins justify investment in AI infrastructure and talent, while riskier bets can be game‑changing.

Pitfall 2: Data Pitfalls

The second pitfall is the challenge posed by data. This is one of the most common pitfalls in ML projects, and perhaps the one your team complains about the most. Where is the data? Can we handle this volume of data processing? Even if the company has already invested heavily in solving those problems, that does not mean there are no hidden challenges that will hurt your project in the end. There is a famous saying in the machine learning world: "garbage in, garbage out". ML projects depend entirely on recognizing patterns in the data. If the data is flawed, the conclusions you draw from it are unlikely to be trustworthy.

Over the years, the machine learning community has established a standard structure for data pipelines, encompassing data collection, processing, and feature engineering. There is also a list of common tasks that teams typically perform during data preparation: filtering out duplicates and outliers, filling in missing data, and resampling to handle class imbalance. Every proper machine learning team uses or adapts these standard procedures, but they are far from sufficient; a minimal sketch follows below.
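As a hedged illustration (the column handling is generic, not tied to any particular dataset), those standard steps often boil down to a pass like this, which is exactly why they are necessary but not sufficient: nothing here catches leakage, silent bias, or label problems.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Standard, necessary-but-not-sufficient data preparation."""
    df = df.drop_duplicates()                              # remove duplicate rows
    for col in df.select_dtypes(include="number").columns:
        lo, hi = df[col].quantile([0.01, 0.99])            # clip extreme outliers
        df[col] = df[col].clip(lo, hi)
        df[col] = df[col].fillna(df[col].median())         # impute missing numerics
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].fillna("missing")                # sentinel for missing categoricals
    return df

# Resampling for class imbalance (e.g., upsampling the minority class or using
# class weights at training time) would be a separate, equally standard step.
```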

If you are interested in how ML projects can fail, this GitHub repository contains a long list of failed machine learning solutions identified over the years. The majority of these failures are data-related. Even solutions built by big tech companies and university researchers are not immune to data mistakes.

A 2022 Princeton University review [2] found critical pitfalls in twenty-two peer-reviewed papers; those results propagated into over 290 follow-ups across seventeen fields. One of the key issues was data leakage. The study categorized data leakage into eight different types, from mixing training and testing data to sampling biases. This variety makes leakage challenging to identify and prevent early on.
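One of the most common leakage patterns in that taxonomy is preprocessing on the full dataset before splitting. A minimal scikit-learn sketch of the mistake and a safer alternative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, random_state=0)

# Leaky: the scaler sees test-set statistics before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Safer: fit all preprocessing inside a pipeline, on the training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```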

Data preparation work usually feels like exploring an iceberg: what you see on the surface is just the beginning, and many problems, especially those specific to your own data, are hidden beneath. Large organizations also face the problem of data silos: teams may not know all the available features, leading to false "unsolvable" conclusions. Labeling is another major challenge. Typically, we need to collect and label a golden dataset for evaluation, provide detailed annotation guidelines, and perform quality checks. Even then, you may find that the labels still lack consensus and cannot be used for model training.
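One lightweight way to check whether labels have enough consensus to be trusted is inter-annotator agreement. A small sketch with scikit-learn's Cohen's kappa (the labels here are toy data):

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators for the same ten items (toy data).
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below ~0.6 suggest weak consensus
```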

With Model‑as‑a‑Service and pre‑trained models, teams can skip training via external APIs, but evaluation remains hard. In early GenAI work, many teams relied on human eyeballing or tiny example sets. That is OK at the very beginning, but without robust evaluation pipelines, you end up with reactive patches and an unknown blast radius. Ultimately, we still need to invest heavily in the evaluation of LLMs and GenAI.
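What "more than eyeballing" can look like in practice is a golden set that is re-scored on every change. The sketch below is a generic illustration: call_model, the JSONL file name, and the exact-match metric are placeholders for whatever client, storage, and scorer your team actually uses.

```python
import json

def exact_match(prediction: str, reference: str) -> float:
    """Toy metric; real pipelines use task-specific scorers or LLM judges."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(call_model, golden_path: str) -> float:
    """Score a model callable against a JSONL golden set of {"prompt", "reference"} records."""
    scores = []
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)
            prediction = call_model(case["prompt"])
            scores.append(exact_match(prediction, case["reference"]))
    return sum(scores) / len(scores)

# Run before every release, e.g.: evaluate(my_llm_client, "golden_set.jsonl")
```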

While it is impossible to fix everything perfectly from the beginning, this still cannot be emphasized enough: take the time to explore and understand your data. Look for new features and clean the data based on your observations, rather than blindly applying a standard process. Invest in collecting high-quality labels. In the end, machine learning success depends heavily on the data.

Pitfall 3: From Model to Product

The third problem I want to highlight is the challenge of turning a machine learning model into a functional product that serves large-scale, real-time users. This transition is not just about deploying the code. It takes immense effort to address production constraints and additional requirements.

Engineering system overview of a real-world ML system. (Source: [3])

Google's famous diagram shows how small the ML code is compared to the surrounding infrastructure: the majority of the code lies in the supporting infrastructure, including resource management, serving systems, monitoring tools, logging, and other related components. The MLOps landscape has matured, and there are many resources available to help. For first-time adopters, this sounds like heavy lifting, but once you build the foundational pipelines, you can support multiple ML solutions and deploy them far more seamlessly.

To get a sense of the gap between a machine learning model and a full machine-learning-powered solution (beyond what MLOps alone can cover), let's take retrieval-augmented generation (RAG) as an example. In essence, RAG retrieves relevant information from your own data and provides it to a large language model (LLM), allowing the model to answer questions with that extra context in mind. A quick demo can look deceptively simple (an LLM API, a vector database, and a bit of orchestration). But turning that into a production-ready RAG system (for example, one that powers customer support) is a completely different story. You need ways to evaluate performance and control quality. You may also need a more advanced or agentic RAG setup, rather than a basic one, along with explainability features so customers can trust the answers. On top of that, there is the engineering side: reducing latency with caching or inference tweaks, and keeping privacy, fairness, and security in check (including defenses against hallucinations and jailbreaks).
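For illustration, the "deceptively simple" demo is roughly the skeleton below; embed and generate stand in for whatever embedding model and LLM client you use, and the brute-force similarity search stands in for a real vector database. Everything listed above (evaluation, agentic orchestration, explainability, caching, safety) has to be built on top of this.

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, docs, k=3):
    """Naive cosine-similarity retrieval; a production system uses a vector database."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [docs[i] for i in np.argsort(-sims)[:k]]

def answer(question, docs, embed, generate):
    """Minimal RAG loop: embed the question, retrieve context, prompt the LLM."""
    doc_vecs = np.stack([embed(d) for d in docs])
    context = "\n".join(retrieve_top_k(embed(question), doc_vecs, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)  # generate() is whatever LLM client the team uses
```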

Beyond technical metrics, it is also important to monitor business-oriented indicators. These indicators can include measures of product quality (e.g., the frequency of user adoption or engagement with a new feature), customer experience (e.g., satisfaction scores or retention rates), and overall platform health (e.g., revenue growth, active usage, or churn rates). The key principle is that optimization efforts on the technical side should not undermine the broader success and sustainability of the business.

In short, winning teams are cross‑functional and align early on requirements, quality gates, and production constraints instead of working in silos and hoping issues will be fixed later. 

Pitfall 4: Offline vs. Online

The fourth pitfall is offline success followed by online failure. This is probably the pitfall that causes the most emotional waves within a team. Why do solid offline models fail online? Because the data, the solution, and the metrics all differ between the two stages. Offline, models use historical (often cleaned and sampled) data and ML-centric metrics. Online, they face real-time data, an end-to-end system, and business-aligned metrics [4].


Let me share an example from my first production launch, many years ago. At the time, we were developing a photo recommender to promote photographers' work within a creative community. The business team raised the concern that new users would register, post one or two photos, and never return. With that in mind, our data science team started exploring the data to find out what was going on. They found a high correlation between early likes and retention. With that insight, the task became to surface new users' work faster, promoting their photos as soon as possible so they would get the reaction they wanted and come back to the site.

At the time, most recommenders relied on collaborative filtering: if User A and B liked many of the same items, we recommended items liked by A but unseen by B. This also explained why our new users did not receive many likes, a phenomenon known as the cold-start problem: because the new items had not had any interactions before, they also had little to no visibility.

To solve the cold-start problem, we built a content-based recommender to predict a photo's popularity from the image itself. The pipeline worked in two stages (a sketch follows below): for a given user, find photos similar to their past likes, then filter the candidates with the popularity classifier to weed out low-quality recommendations.
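A hedged sketch of that two-stage idea (the similarity search and the popularity classifier are stand-ins, not the actual models we used):

```python
def recommend(user_likes, similar_to, popularity_score, threshold=0.5, k=20):
    """Two-stage content-based pipeline: candidate generation, then popularity filtering."""
    # Stage 1: candidate generation from content similarity to past likes.
    candidates = set()
    for photo_id in user_likes:
        candidates.update(similar_to(photo_id))
    # Stage 2: keep only candidates the classifier predicts will be well received.
    scored = [(popularity_score(p), p) for p in candidates]
    scored = [sp for sp in scored if sp[0] >= threshold]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:k]]
```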

Once the classification model was optimized, we integrated it into the production pipeline. Online A/B tests, however, were mixed: likes increased, but session length dropped, which was a negative signal for engagement. For some reason, our recommender was disrupting the browsing experience; users no longer wanted to keep scrolling through the site.

After many iterations, we eventually found a better way to incorporate the popularity signal. As a result, the recommender system became far more complicated. This was a long road, paved with multiple steps involving not only the difference between offline evaluation and online evaluation, but also monitoring a set of business metrics instead of just one primary business metric.

Once the model is integrated into production, it becomes part of a much larger system, encompassing the entire solution. Recommenders often merge multiple models whose outputs are not orthogonal, so that a strong offline model may have diminishing impact in the merged system. The lesson here is: do not over‑optimize offline. Push to A/B testing quickly to validate alignment with business goals.
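To illustrate "validate online, across more than one metric", here is a toy sketch (simulated data, made-up metric names) of checking both the target metric and a guardrail metric in an A/B test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated per-user metrics for control vs. treatment.
metrics = {
    "likes_per_session": (rng.poisson(3.0, 5000), rng.poisson(3.3, 5000)),
    "session_minutes":   (rng.exponential(8.0, 5000), rng.exponential(7.4, 5000)),
}

# Check every guardrail metric, not just the one the model optimizes offline.
for name, (control, treatment) in metrics.items():
    _, p = stats.ttest_ind(treatment, control, equal_var=False)
    direction = "up" if treatment.mean() > control.mean() else "down"
    print(f"{name}: {direction} (p = {p:.3f})")
```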

Pitfall 5: Non‑Technical Obstacles

The last pitfall, often overlooked, is related to unseen non-technical obstacles. Going back to the survey we examined earlier, when people were asked about the main obstacles they faced when deploying models, the two most common answers were not related to technology at all: a lack of stakeholder support and inadequate proactive planning.

According to Rexer Analytics' survey, these are the top ten answers to the question: "What are the main impediments to models being deployed at your organization and/or at your client organizations?"

  • Decision makers unwilling to approve the change to existing operations
  • Lack of sufficient, proactive planning
  • Lack of understanding of proper way to execute deployment
  • Problems with the availability of the data required for scoring the model
  • No assigned person to steward deployment
  • Staff unwilling or unable to work with model output effectively
  • Technical hurdles in calculating scores or implementing / integrating the model or its scores into existing systems
  • Privacy/legal issue
  • Model performance not considered strong enough by decision makers
  • Unable to provide the degree of model transparency decision makers require

Managing stakeholders is tricky because many decision-makers do not have AI backgrounds; they may be swayed by headlines or by prior software experience, underestimating ML risk and uncertainty. This is where AI experts play an important role: the job is not just building the model, it is also making sure stakeholders hold the right expectations for AI projects.

That means our job includes education. Stakeholders need to understand how ML learns (and why data pipelines matter), why ML projects are inherently uncertain, model limitations (reputation and safety risks), and the realistic costs of building and deploying. 

There is also the matter of managing and planning ML projects. Three guiding principles stand out:

  • Define a clear MVP with a simple optimization goal. Starting simple usually outperforms starting complex.
  • Build end‑to‑end early, enabling A/B testing and getting production feedback as soon as possible.
  • Iterate rapidly based on feedback, revisiting objectives and expanding data as needed.


A strategy I have seen work is to separate a project incubator (for early, high-risk bets) from the product line (for scaling proven solutions). This enables innovation while managing risk.

The key takeaway here is that managing machine learning projects is different from managing a traditional software engineering project. We need to adapt to these challenges to ensure the team also receives support on the non-technical front.

Conclusion

While there is no way to guarantee we will avoid every mistake, there are principles and best practices we can apply. We want to choose a project that is feasible, desirable, and profitable. We want to be data-centric. We want to encourage early collaboration and active management of cross-functional teams. We want to build an end-to-end solution quickly for testing purposes. And we want to adapt our project management plan to the nature of machine learning projects.

This is only a partial list of the many reasons a machine learning project may fail, but it can serve as a good starting point for the discussion. I want to leave you with a favorite quote of mine, from Charlie Munger: "Learn everything you possibly can from your own personal experience, minimizing what you learn vicariously from the good and bad experiences of others, living and dead".

References

[1] "Rexer Analytics – Data Science Survey". Rexer Analytics, 2023. rexeranalytics.com.
[2] Kapoor, Sayash, and Arvind Narayanan. 2023. "Leakage and the Reproducibility Crisis in Machine-Learning-Based Science". Patterns 4 (9): 100804.
[3] Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. "Hidden Technical Debt in Machine Learning Systems". Advances in Neural Information Processing Systems 28 (NIPS 2015).
[4] Yi, Jeonghee, Ye Chen, Jie Li, Swaraj Sett, and Tak W. Yan. 2013. "Predictive Model Performance". Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2013, 1294–1302.
