5 Ways to Fail at Building Machine Learning Applications

and your guide to avoid them while you are at it!

Akhil Anurag
Analytics Vidhya
6 min read · Mar 4, 2024


Photo by Zac Durant on Unsplash

Recent industry reports suggest that the ML project failure rate is still over 75%. Gartner predicted in 2019 that 85% of all AI implementations would fail by 2022. More than 50% of ML solutions still never make it from the pilot to the production stage. This leaves organizations seeking increasing investment in ML/AI in a tight spot, as justifying the Return on Investment (ROI) becomes increasingly difficult.

Today, be it the psychological want of an organization driven by FOMO (fear of missing out) or a genuine need to use AI/ML as a differentiator and a value driver for the business, one thing is certain: you can't let ML projects (and thus, ML products) fail at scale.

Where does the failure happen across the ML value stream?

There are multiple stages in building an ML-based application, and none of them is certain of success. Keeping things high level rather than diving into the details, these stages can be split into four phases, as depicted in the figure below.

You can read my article covering the ML value stream in more detail here.

An ML project can fail at any of these stages

Though an ML project can fail at any of these stages, the cost of failure can differ drastically. Think of an ML use case that was not defined properly but was still built, deployed and operationalized. In all likelihood, the customer will flag the issues, leading to a loss of credibility and reputation. The wasted effort, time and cost will only be the secondary impact.

So, why do projects fail at these stages, and what can be done to avoid those failures?

Let us start by identifying the reasons for failure; once we discover the reasons, hopefully, avoiding them becomes easier.

1. Lack of User Research — In my experience, this is the #1 reason for ML failure. It happens at the Define stage and is generally associated with one or more factors missing from the table below:

Why we need User Research in ML Product Development

2. Lack of Governance — Getting an ML solution live in an application is a multi-team collaboration, and hence governance and oversight become critical to monitor the quality of implementation and adherence to success criteria.

Product development is a relay race, not a hurdle race.

An ML product is either a collective success or a collective failure story

The absence of governance thus results in problems manifesting across the entire value stream (mostly in the Design and Build phases); these often go unnoticed until the issues are large enough to sink the entire boat. Let us look at some of the common problems created by a lack of governance:

Why Governance is Required in a ML Product Development

3. Lack of Testing Strategy — How do you know your ML product is going to pass the acceptance criteria and meet customer expectations? Are you going to wait till the end? I hope not. That's calling for doomsday.

This is why your ML testing strategy should be well in place, covering all the quintessential elements at all the important milestones through the product journey.

Your testing strategy should include, at a minimum, the tests below.

Example of Continuous Tests Needed in ML Product Development
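To make this concrete, here is a minimal sketch of what continuous ML tests might look like in practice, using a synthetic dataset and a simple scikit-learn model as stand-ins for a real pipeline. The data checks, schema checks and quality gate shown are illustrative thresholds, not prescriptions; in a real project you would wire functions like these into a test runner such as pytest and run them in CI.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for a real pipeline: synthetic data, simple model
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

def test_no_missing_values():
    # Data test: training data should contain no NaNs
    assert not np.isnan(X_train).any()

def test_feature_count():
    # Schema test: the model expects exactly 5 features
    assert X_train.shape[1] == 5

def test_minimum_accuracy():
    # Model quality gate: fail the build if accuracy drops below
    # an agreed threshold (0.7 here is an arbitrary example)
    assert model.score(X_test, y_test) >= 0.7

# Normally a test runner would collect and run these automatically
test_no_missing_values()
test_feature_count()
test_minimum_accuracy()
print("all tests passed")
```

The key idea is that model quality becomes a pass/fail gate like any other software test, rather than a number someone eyeballs in a notebook.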

One important thing to bear in mind is that these tests and feedback loops belong to different stages of product development and hence have varying cycle times (the amount of time it takes to implement them and get the results). These are represented as concentric circles below, where the shortest loop happens at the engineer's desk running development tests, and the longest loop happens with model performance monitoring or user feedback.

Inspired by Elisabeth Hendrickson's representation for traditional software development

4. Too Complex — In an ML-driven product, complexity may mean different things to different stakeholders.

For Data Scientists — Complexity is around the choice of algorithm

For ML Engineers — Complexity revolves around design/architecture choices which can render the ML system unusable and unmaintainable later

For Product/Users/Customers — Complexity is around the ease of understanding ML decisions and building trust that the results are correct, fair and unbiased

The complexity of ML has become a selling point for these products today, but be wary of the consequences it might have for the organization, shareholders and, most importantly, their customers if it goes south.

These choices are not straightforward, and you might have to find the best trade-off between performance and explainability.

Trade-off between performance and explainability

Goal — Make the ML model more interpretable and explainable by answering these 3 questions.

  1. How does the ML model work?
  2. How does it make its decisions?
  3. How can we trust the model?
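One way to start answering these questions for any black-box model is permutation importance, a simple model-agnostic technique available in scikit-learn (SHAP, LIME and similar tools, mentioned below, go further). This is a sketch on synthetic data, where by construction only the first two features carry signal, so we can see the explanation recover what the model actually relies on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: with shuffle=False, the 2 informative features
# are columns 0 and 1; columns 2 and 3 are pure noise
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic explanation: shuffle each feature in turn and measure
# how much the score drops -- a large drop means the model relies on it
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```

Running this, the noise features should show near-zero importance, which is exactly the kind of evidence that lets stakeholders start trusting (or challenging) a model's decisions.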

As we head in the direction of selecting more complex models, it is important to consider:

  • Is explainability more important for my use case than performance? — This can be domain- and industry-specific. Banking, insurance and healthcare are some sectors where models need to be more explainable and fair. That is not to say they don't have to be in other domains, but the cost of opacity is sometimes lower elsewhere. A price elasticity model which determines the rate of interest for a customer needs to be more explainable and unbiased than a model used to allocate personalized discount coupons.
  • Are we prioritizing model explainability work for data scientists? — Even with complex models, we can reach a level of explainability by applying techniques available today (SHAP, LIME, Anchors, counterfactuals, etc.). These often go unused because teams never prioritize them.
  • Do we have model risk governance in place? — As organizations add ML products at scale, it becomes paramount to have model risk governance to assess whether models are specified correctly, are operating correctly, and are making decisions for sound reasons.
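The performance-versus-explainability question can often be settled empirically: measure how much accuracy the complex model actually buys you over an interpretable baseline, and only pay the explainability cost if the gap justifies it. A minimal sketch, on synthetic data with two hypothetical candidates:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real problem data
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=1)

# Interpretable baseline: coefficients map directly to feature effects
simple_model = LogisticRegression(max_iter=1000)
# Higher-capacity model: often more accurate, but harder to explain
complex_model = GradientBoostingClassifier(random_state=1)

for name, est in [("logistic regression", simple_model),
                  ("gradient boosting", complex_model)]:
    score = cross_val_score(est, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {score:.3f}")
```

If the gap is a fraction of a percentage point, the interpretable model is usually the better product decision; if it is large, the explainability techniques above become a prerequisite for shipping the complex one.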

5. Operational Issues — If I had to decide which step in the whole flow of ML product development, from inception to release, carries the maximum risk of failure, it has to be the point where the ML solution is deployed.

More than 50% of all ML models are never released.

Failures also happen post-deployment because of operational issues. These are the worst kind: by this time the ML solution is already being used by your customers, resulting in a bad experience and a loss of trust.

Here is a quick summary of reasons why we either fail to deploy or operationalize our ML solutions.

Issues at deployment and post deployment that lead to ML product failure

If deploying and operationalizing an ML product is such a pain, then what can be done differently to mitigate this risk of failure?

This figure below tries to map possible solutions to each identified problem:

Possible improvements to reduce deployment and operational failures
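One concrete mitigation for post-deployment failures is monitoring for data drift: comparing the live distribution of a feature (or score) against the distribution the model was trained on. A common metric for this is the Population Stability Index (PSI); the sketch below is a self-contained implementation on simulated data, with the usual rule of thumb that PSI above roughly 0.2 warrants investigation.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a live distribution against the training-time one.
    A PSI above ~0.2 is a common trigger for investigating drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero in empty buckets
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)  # distribution at training time
live_scores = rng.normal(0.5, 1.0, 5000)   # shifted distribution in production

print(f"PSI: {population_stability_index(train_scores, live_scores):.3f}")
```

Wired into a scheduled job with alerting, a check like this turns silent model decay into an actionable operational signal instead of a customer-reported incident.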

Conclusion

After going through these top 5 reasons for ML product failure, you might develop a sense that it's just too much to take care of. And you are right!

Successfully building an ML-powered product is no easy job; it requires skilled teams, deep customer obsession and effective collaboration among them.

Hopefully this guide can serve as a map toward your north star while you build your next ML product, bringing some order to the chaos.

