Dear AI startups: Your ML models are dying quietly

Ayazhan Zhakhan
Published in Castodia
7 min read · May 8, 2019

Despite all the recent advancements in AI, the industry is still in its early stages. There are many misconceptions about the field, and best practices for machine learning (ML) deployment have not yet been established. In this article, we’ll walk through a scenario where a small change to the front-facing interface quietly reduces the accuracy of the ML model powering the application.

Let’s say you have an e-commerce website. You have user logs you’ve been collecting over the last couple of years, and you have now decided to start using them to optimize your shoppers’ experience. You are interested in building a custom recommendation engine using some of the latest AI advancements.

Example of an ML workflow in production

Building and deploying your model. Check!

After some extensive research, you decide to hire a data scientist, Sara, and pair her with your back-end developer, Jose.

Sara spends about 2 months cleaning the data from your logs. Next, she spends another 3 months developing and testing different types of ML models. Eventually, she comes up with an amazing recommendation system that employs a state-of-the-art deep learning (DL) algorithm: Semi-Supervised Classification with Graph Convolutional Networks. This algorithm achieves 97% accuracy on the historical dataset.

Now it’s time to integrate this model into the website! Sara says it might be a bit challenging since she has limited experience with ML deployment, but Jose is confident he can integrate it into the website as long as the model can be deployed as an endpoint. Because the model was built with PyTorch, Sara decides to use AWS SageMaker.

She then spends the next 2 months trying to deploy the model. That seems a bit long, but there isn’t much you can do: deployment of ML models is a rather complicated process, most data scientists are not trained for it, and there is a steep learning curve involved. Luckily, you hired a great data scientist. Sara eventually gets the model deployed, and Jose gets his endpoint. In no time, the ML solution is integrated into the site. You can now focus on monitoring the user experience with Google Analytics or other analytics solutions.
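
For context, deploying a trained PyTorch model as a SageMaker endpoint can look roughly like the sketch below. This is a minimal illustration using the SageMaker Python SDK; the S3 path, IAM role, and entry-point script are placeholders, not the actual setup from this story.

```python
# Rough sketch of deploying a trained PyTorch model as a real-time SageMaker
# endpoint. The S3 path, IAM role, and entry-point script are placeholders.
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/recommender/model.tar.gz",  # trained model artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # execution role (placeholder)
    entry_point="inference.py",   # script defining model_fn / input_fn / predict_fn
    framework_version="1.0",
    py_version="py3",
)

# The resulting HTTPS endpoint is what the back end calls for recommendations.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

The deployment itself is only a few lines; most of the two months goes into packaging the model, writing the inference script, and wiring up permissions and monitoring.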

It’s hard to concretely measure whether the ML solution improved the user experience, since you were not tracking the performance of recommendations beforehand, but it seems to be working and users click on the recommended items. Success! Nine months after you decided to become a data-driven company, you have a functioning AI solution. Good job!

Website redesign: Check!

Suddenly, your front-end developer, John, has to quit because he is moving back home. So you hire a new front-end developer, May. She comes from a design background and just completed a front-end development bootcamp. She uses React and has an impressive GitHub portfolio. John quickly brings her up to speed, and off he goes.

May is full of new ideas and proposes some changes to the website. First, she proposes customizing the website for different markets. Next, she wants to update the input forms to improve the user experience; for example, there is no need to capture “first name” and “last name” at signup, it’s 2019! You approve the changes but instruct her to check with Sara to avoid any problems with your ML solution. After a couple of months, the new website is ready, launched, and looking great!

But wait…

A couple of months later, while analyzing the performance of your website, you see no improvement. You updated the look, you have a sophisticated recommendation engine, you even ran a new marketing campaign, so why don’t you see results? After digging through your website logs, you learn that customers are not buying more than one product and the bounce rate is high. Furthermore, the new markets you expanded into have a particularly high bounce rate.

What’s wrong?

You spend the next week trying to figure out if the marketing campaign was a failure. No, the reports indicate that people actually followed up and visited the website after seeing your ads.

You then spend another week checking whether the new design changes are at fault. No, visitors seem to love the design.

You check whether the generated product recommendations are poor. But how do you even test this? You ask Sara to check the performance of the recommendations. She writes a script to measure it and comes back a week later with an astonishing result: the accuracy of the recommendations has dropped to 40%! How did this happen?

How did we get here?

Sara built a proper model using the historical log files you provided. Those log files were generated by John, who didn’t have to worry about the structure of the data as long as it was logged and could be retrieved later. In full-stack development, logging has mainly been used to track errors, not to inform decisions; decision-making was left to Google Analytics. This is also why it took Sara so long to clean the data in the first place.

However, when May updated the website, she made some changes that affected the data preprocessing pipeline downstream. For example, she changed the input field name from “email” to “user_email”. She also merged the two input fields “first name” and “last name” into “your name”, removing an input field the model relied on. Finally, she introduced a new measuring unit, mixing “kg” and “lbs” across different countries, so the model now gets 2.2 for a weight instead of the expected 1.0.
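
To make this concrete, here is a deliberately simplified, hypothetical preprocessing step showing how such changes fail silently. The field names mirror the example above and are not from a real codebase.

```python
# Hypothetical preprocessing step that breaks silently after the redesign.
def preprocess(event: dict) -> dict:
    features = {}

    # The model was trained on an "email" field; the redesign renamed it to
    # "user_email", so this now silently falls back to an empty string.
    features["email_domain"] = event.get("email", "").split("@")[-1]

    # "first name" and "last name" were merged into "your name", so a feature
    # the model relied on is simply gone. Again, no error is raised.
    features["first_name"] = event.get("first_name", "")

    # Weight used to arrive in kg only; some markets now send lbs with no unit
    # field, so 1.0 kg quietly becomes 2.2.
    features["weight_kg"] = float(event.get("weight", 0))

    return features
```

Nothing in this function crashes; it just keeps producing plausible-looking but wrong features.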

No one is at fault here. These changes happen all the time; they are very common, and the best practices have not been defined yet. The expectation that a front-end developer has to be aware of the data, and of the consequences of changing the way it’s captured, is itself new. And the practice of data scientists providing front-end developers with unit tests is not common at all.
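
As a rough illustration, a “data contract” test that Sara could hand to May might look something like the snippet below. The field names and the weight range are made up for this story.

```python
# Hypothetical contract test for the signup form payload. Field names and the
# range check are illustrative, not a real schema.
EXPECTED_FIELDS = {"email", "first_name", "last_name", "weight"}

def validate_signup_payload(payload: dict) -> None:
    missing = EXPECTED_FIELDS - payload.keys()
    assert not missing, f"Missing fields the model depends on: {missing}"
    # Crude range check: weights in the training data never exceeded 2.0 kg,
    # so a sudden 2.2 likely means the front end started sending lbs.
    assert 0 < payload["weight"] <= 2.0, "Weight outside training range; unit change?"

def test_signup_payload_matches_model_contract():
    payload = {"email": "a@b.co", "first_name": "Ana", "last_name": "Lee", "weight": 1.0}
    validate_signup_payload(payload)
```

If tests like these ran in the front-end CI pipeline, the field renames and the unit switch could have been caught before they ever reached production.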

Example of an ML workflow failure

Let’s dig a bit deeper into what happens when the format of the data input is changed.

If the data preprocessing pipeline does not account for the changes at the front end, the altered data is saved to the database, at the risk of corrupting it. Next, the altered data is fed into the model, which produces wrong recommendations. These wrong recommendations are then returned to the end user. Because the recommendations are “garbage”, the user has a bad experience. Not only do they not click on the recommendations, they also get upset that you recommended an utterly oversized dress when they were just buying a pair of socks.

DL models die quietly!

Be cautious: DL models die quietly. There is no error message, no 404 page, no notification. It might take months to detect that a model is quietly failing on you. Why is that? There are a few reasons:

  • First, in many cases it is hard to tell whether the model is performing as expected. How do you confirm it is 97% accurate in the real world? It is possible, but not trivial.
  • Second, there is no immediate feedback loop. Even if you have proper measurements in place, it might take months before an error is traced back to the DL model (see the rough sketch after this list for one cheap proxy).
  • Finally, DL models are often far better than humans at their narrow task. When developing them, we frequently benchmark against other models rather than against people (AlphaGo Zero, for example). So if your model is much better than humans at its job, who can tell when it makes a mistake?
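
One cheap way to create at least a partial feedback loop is to log whenever a recommendation is shown or clicked and watch the resulting click-through rate. Below is a minimal sketch; the event names are invented for this example, and in practice the events would come from your application or analytics logs.

```python
# Minimal sketch of an online feedback proxy: track the click-through rate of
# recommendations. Event names are invented for illustration.
from collections import Counter

events = [
    {"type": "recommendation_shown", "item_id": "sku-1"},
    {"type": "recommendation_clicked", "item_id": "sku-1"},
    {"type": "recommendation_shown", "item_id": "sku-2"},
]

counts = Counter(e["type"] for e in events)
ctr = counts["recommendation_clicked"] / max(counts["recommendation_shown"], 1)
print(f"Recommendation click-through rate: {ctr:.1%}")  # a sudden drop is an early warning
```

A click-through rate is not accuracy, but a sudden drop in it surfaces problems months earlier than waiting for revenue reports.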

How do I prevent this?

You have to be very careful when developing your DL model. There are a few actions you can take to avoid this:

  • Hire a data engineer. If you’re serious about becoming a data-driven company, make sure your data game is on point. You need someone on the team who “owns” the data pipeline and knows its corner cases.
  • Make sure the entire team is aware of the data: its properties, its purpose, and so on. Everyone in the path of the data flow must communicate clearly.
  • Monitor your data and its key metrics before and after the ML integration. You need benchmarks. Keep monitoring during updates and across different versions.
  • Watch out for outliers. Track the averages of your data and flag deviations (a minimal sketch of such a check follows this list).
  • Have unit tests! Yes, this is still uncommon, but your front-end developer must have unit tests for the input data, as sketched earlier.
  • Deploy new models gradually: run the old and new versions simultaneously and split traffic between them while testing.
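
Here is a minimal sketch of such an input-drift check, assuming you saved per-feature means and standard deviations from the training set. The file name and alert threshold are arbitrary choices for this example.

```python
# Minimal sketch of an input-drift check, assuming per-feature means and
# standard deviations were saved when the model was trained.
import json
import numpy as np

with open("training_stats.json") as f:   # e.g. {"weight": {"mean": 1.0, "std": 0.2}}
    TRAINING_STATS = json.load(f)

def check_drift(feature, recent_values, threshold=3.0):
    """Alert if the recent mean drifts too many standard deviations from training."""
    stats = TRAINING_STATS[feature]
    z = abs(np.mean(recent_values) - stats["mean"]) / max(stats["std"], 1e-9)
    if z > threshold:
        print(f"ALERT: '{feature}' drifted (z = {z:.1f}); check the front end and pipeline.")
        return True
    return False

# A silent kg -> lbs switch (1.0 -> 2.2) trips this alert immediately.
check_drift("weight", [2.2, 2.1, 2.3, 2.2])
```

Even a check this crude, run daily against fresh logs, would have flagged the unit change within a day instead of months.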

These are just some of the techniques we employ internally at Sanau and offer to our clients. Becoming a data-driven company is not as simple as it might seem, but the benefits compound over time. It is the future of business, and we encourage you to make the transition. If you have any questions, feel free to reach out to us at Sanau: hello@sanau.co

TL;DR:

Changes to the front end might “quietly” drop the accuracy of your ML model if they affect the preprocessing pipeline. For example, if you switch from kg to lbs in the front end but do not account for it at the preprocessing step, your model suddenly receives values more than twice as large as those it was trained on. These errors propagate further, polluting your database and damaging the user experience. There is no easy way to catch such errors: it is hard to measure true accuracy in production, there is no immediate feedback, and there are no experts who can easily catch machine errors. To avoid this, your organization must be deliberate about its data strategy, hire the right people, and continuously monitor the data.

Note:

ML models are rarely designed to make recommendations based on a user’s name or email. We used these fields for illustrative purposes only.
