Why Do Machine Learning Projects Fail?

Here’s a common scenario:

Boss: I have an idea. It’s going to change our company!


Boss: We’re going to implement AI! Everywhere! In the whole company!


This same scenario is playing itself out across organizations all over the world. It’s a symptom of the fact that machine learning is a seemingly permanent fixture in Gartner’s Hype Cycle for Emerging Technologies. It’s all over the place. Every business has to have it and they have to have it now.

This is great and all, but the reality is that most organizations are poorly equipped to implement machine learning solutions themselves. In fact, most organizations don’t even know the prerequisite questions to ask ahead of attempting to solve a problem with machine learning.

Before looking to solve your organization’s first problem with machine learning, there is one key question to consider:

Why do machine learning projects typically fail?

At CompassRed, we know that there are a lot of answers to this question. We’re going to concentrate on four of them and, most importantly, discuss how to address each one. In no particular order, they are:

  1. No one understands what machine learning is (and is not).
  2. We fail to begin with the end in mind.
  3. A failure to trust the process.
  4. There’s no measurement for success.

1. No one understands what machine learning is (and is not).

Solution: Understand the types of questions you can answer.

Machine learning has built up a bit (read: a lot) of mysticism around it. People who are less familiar with machine learning tend to believe that it can solve anything. It can’t. One of the best ways to make sure your machine learning project does not fail is simply to understand what machine learning is and is not.

At the highest level, machine learning is very simple. Machine learning labels stuff. More specifically, machine learning labels stuff probabilistically. There are two really broad ways that this occurs. In the first, we already know some of our labels ahead of time. We call this supervised learning. In the second, we don’t know any of our labels yet. We call this unsupervised learning. Each of these methods is good at answering different types of questions.

Supervised learning is great at answering questions like:

  • Will my customer churn?
  • Will a visitor to my website make a purchase?
  • How likely is the photocopier to jam in the next 10 minutes?
  • How long will the customer stay on my website?
  • What is the likelihood of the customer to click on this link?
  • Is this purchase fraudulent?

Unsupervised learning is great at answering questions like:

  • What are my customer segments?
  • Which customers interact with my website in the same way?
  • Which photocopiers fail the same way?
  • What papers are likely written by the same person?
  • Which purchases are similar?
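As a rough illustration of the difference, here is a minimal sketch in plain Python. The data (minutes spent on a website and whether the visitor bought something) is made up, and `fit_threshold` and `two_means` are hypothetical helpers written for this example, not functions from any library. The supervised helper learns from known labels; the unsupervised helper only sees the raw values and groups them.

```python
from statistics import mean

def fit_threshold(values, labels):
    """Supervised: learn a cut-off from examples whose labels are known."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    # Deliberately naive rule: the midpoint between the two class averages.
    return (mean(positives) + mean(negatives)) / 2

def two_means(values, iterations=10):
    """Unsupervised: split values into two groups without any labels."""
    low, high = min(values), max(values)  # starting guesses for group centers
    groups = []
    for _ in range(iterations):
        # Assign each value to the nearer center, then recompute the centers.
        groups = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        low = mean(v for v, g in zip(values, groups) if g == 0)
        high = mean(v for v, g in zip(values, groups) if g == 1)
    return groups

# Made-up data: minutes on site, and whether that visitor made a purchase.
minutes_on_site = [1, 2, 3, 10, 11, 12]
made_purchase = [0, 0, 0, 1, 1, 1]

# Supervised question: "Will a visitor who stays 8 minutes make a purchase?"
threshold = fit_threshold(minutes_on_site, made_purchase)
will_buy = 8 > threshold

# Unsupervised question: "Which visitors behave in the same way?"
segments = two_means(minutes_on_site)
```

Real projects would use far richer data and a proper library, but the shape of the two questions stays the same: one needs known answers to learn from, the other does not.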

You’ll notice that one question stem is missing from all of these questions: why. Machine learning models cannot answer “why” questions. An analyst may be able to build a case for an answer to a specific “why” question, but the algorithm itself will not be able to provide that answer.

2. We fail to begin with the end in mind.

Solution: Focus on outputs.

Stephen Covey wrote The 7 Habits of Highly Effective People with the average person in mind. As it turns out, many of his habits, including the second (“begin with the end in mind”), apply directly to companies trying to implement machine learning models as well.

Often when implementing a machine learning solution for the first time, people and companies tend to get caught up in the technology being used. After all, machine learning is a highly technical topic, so it shouldn’t come as a surprise that the people looking to implement the solution care deeply about the technical portions of the implementation.

The reality, though, is that the machine learning problems that we select to solve should be deeply driven by the value that the solution could create for the organization. It’s important to grapple with that value outside of a technical context. In particular, it’s worth spending time focusing on the end user of the solution and the actionability of the output.

Selecting the right problem with the right output will help drive significantly more change in an organization than selecting the perfect model to solve that problem.¹ One way to start thinking about this is to use something like the machine learning canvas. This canvas helps you to think through the entire problem — from the value proposition of the task to how the solution will be evaluated and modeled.

3. A failure to trust the process.

Solution: Iterate. Then iterate again.

Machine learning problems, and data science problems more broadly, are solved by following a very specific process. There are a few different ways to think about this. A very popular process is CRISP-DM (Cross-Industry Standard Process for Data Mining). A slightly simplified version of this is the data exploration model outlined by Hadley Wickham and Garrett Grolemund in their book R for Data Science.

The data exploration process

Organizations that are excellent at implementing solutions to machine learning problems understand this process well and know how to manage it. The typical rule of thumb for data science projects is that 80% of the time on a project is dedicated to data preparation (the importing, tidying, transforming, and visualizing steps) and only 20% is dedicated to modeling and communication. Before any of this happens, though, a significant portion of time will be dedicated to understanding the business needs of the problem.

For a first iteration through the data science process, though, a significantly larger proportion of time should be dedicated to building the business understanding around the problem. This will help to ensure that you are truly solving the right problem and building a solution that will be meaningful for your organization.

4. There’s no measurement for success.

Solution: Build one. Or two. Or three.

Machine learning is inherently an exercise in optimization. In order to optimize anything, there has to be a measurement of what is being optimized. For a machine learning project, there must be at least two measurements for success: one from a technical perspective and one from a business perspective. The technical measurement will help you to understand whether a model is worth deploying in the first place. The business measurement will help you to understand whether the deployed model is actually helping to solve the problem.

From a technical perspective, it is important to consider all of the different metrics available to you. Accuracy is rarely the best measurement to utilize, even though it is typically the easiest measurement to understand. In particular, consider the case of an imbalanced dataset.

Here’s a hypothetical machine learning problem: can we classify whether an incoming email is spam or not? The vast majority (hopefully for your sake) of emails that you receive are not spam, and only a small minority should be classified as spam. If we were to simply classify every single incoming email as not being spam, then our accuracy would be extremely high, perhaps 99% or above. However, a model that only ever outputs a single answer (every email is legitimate) would not be a particularly effective model, because it would never catch any spam. Instead, we should consult a confusion matrix and select an appropriate metric based upon what we care the most about.
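To make the problem with accuracy concrete, here is a small sketch in plain Python. The inbox is hypothetical (990 legitimate emails and 10 spam emails are assumed numbers chosen for illustration), and the “model” is simply a rule that labels everything as legitimate.

```python
# Hypothetical inbox: 990 legitimate emails (0) and 10 spam emails (1).
actual = [0] * 990 + [1] * 10

# A "model" that labels every single email as legitimate.
predicted = [0] * len(actual)

# Accuracy: the share of predictions that match the true label.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# How many of the spam emails did the model actually catch?
spam_caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy: {accuracy:.2%}")                      # accuracy: 99.00%
print(f"spam caught: {spam_caught} of {sum(actual)}")   # spam caught: 0 of 10
```

The accuracy looks excellent, yet the model does nothing useful.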

Based upon the confusion matrix, every one of our predictions can be classified into one of four buckets:

True Positives: I predicted that an email would be spam and it actually was.

False Positives: I predicted that an email would be spam and it was not.

False Negatives: I predicted that an email would not be spam and it was.

True Negatives: I predicted that an email would not be spam and it was not.
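Counting the four buckets is straightforward. The sketch below is one way to do it in plain Python, assuming labels are encoded as 1 for spam and 0 for not spam. The function name `confusion_counts` and the ten example emails are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four buckets, treating 1 (spam) as the positive class."""
    # Map each (actual, predicted) pair to the name of its bucket.
    bucket_for = {
        (1, 1): "true_positive",
        (0, 1): "false_positive",
        (1, 0): "false_negative",
        (0, 0): "true_negative",
    }
    counts = {name: 0 for name in bucket_for.values()}
    for a, p in zip(actual, predicted):
        counts[bucket_for[(a, p)]] += 1
    return counts

# Ten hypothetical emails: 3 are spam, 7 are legitimate.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
# 2 true positives, 1 false positive, 1 false negative, 6 true negatives
```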

There are quite a few metrics that we can derive from these four buckets: precision, recall (also called sensitivity), F-measure, and specificity, to name some. Which of these you choose should match the context of the business question that you are answering. For example, in our spam email classification example, you might care most about catching as much of the actual spam as possible, that is, having many true positives and few false negatives. In this case, sensitivity would be a great metric to utilize².
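For reference, these metrics can be computed directly from the four bucket counts. The sketch below uses the standard definitions. The function name `classification_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from the four confusion-matrix counts."""
    # Precision: of everything flagged as spam, how much really was spam?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): of all the real spam, how much did we flag?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: of all legitimate email, how much did we leave alone?
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F-measure (F1): the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for a spam filter evaluated on 1,000 emails.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=988)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the filter catches 8 of the 10 spam emails (recall of 0.8) while wrongly flagging 2 of the 990 legitimate ones.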

From a business perspective, we should be measuring something that will actually move the needle forward for our company. There are a lot of different metrics like this to consider, all of which will be dependent upon your own context. They could include questions such as:

After implementing our model, have we:

  • increased revenue?
  • increased the number of customers?
  • decreased customer churn?
  • reduced the amount of spam our email users receive?

Understanding the business metric that you care most about will help you to evaluate whether a model is actually making a difference for your organization. This is likely the most important metric for the project overall. Especially if this is the first (or one of the first) machine learning projects that your organization is undertaking, it will help to demonstrate the value of machine learning for future projects.

¹ A great example of this is a paper by Randy Olson and a team from the Institute for Biomedical Informatics at the University of Pennsylvania. The team sought to understand how “13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems” performed, allowing them to “quantify the effect of model selection and algorithm tuning for each algorithm and dataset”. The paper reports that five algorithms are likely to provide coverage for the 165 classification problems (GradientBoostingClassifier, RandomForestClassifier, SVC, ExtraTreesClassifier, and LogisticRegression). In addition, the authors demonstrate that selecting the correct algorithm and the correct hyperparameters can typically improve accuracy by approximately 20% for a given classification problem. Clearly a 20% improvement is vitally important. However, it doesn’t matter how great an improvement you can get if the problem you’re solving isn’t the right one in the first place.

² Another key metric to consider is “Area Under the Curve” (AUC), where the curve is the ROC curve. That curve plots sensitivity against the false positive rate (1 minus specificity) across classification thresholds. AUC is essentially the standard-bearer for classification modeling.