The Unlearning Machines

Jarno Kartela
The Hands-on Advisors
5 min read · Feb 17, 2020

A firmly rooted belief within applied machine learning is that machine learning is, in fact, continuous learning: that our models improve over time. This is logical in itself, since new data is usually being generated all the time and so the amount of usable data grows as time passes. Hence improvement.

More often than not, the opposite is true. The problem lies in how we apply machine learning: we use it to predict whether something will happen, not to actively do something about it.

Take predictive maintenance. The field has been studied for quite some time and almost every big industrial player is approaching it with machine learning in one way or another. In predictive maintenance, however, we are not doing any maintenance with our machine learning systems; we are just trying to predict whether maintenance should be done. The sophisticated models even give hints about what may be wrong, but applying that information is left to experts.

Therein lies a problem in (supervised) machine learning. We feed in data about a system the machine knows nothing about and give it the task of predicting whether something might happen, while it has nothing to do with the actual remedy. Let’s take two examples. Say you are running a forklift company that has its first predictive maintenance model in production to prevent lifts breaking down at customer sites. The situation is as follows:

  • Forklift A has been running for 3000 hours and is experiencing lowered pressures in its hydraulics.
  • Forklift B has been running for 6000 hours and is experiencing lowered pressures in its hydraulics.
  • You know from previous experience (in machine learning lingo we call this data but we should call it randomly-gathered-stuff-from-random-contexts to avoid problems) that when forklifts experience lowered pressures, especially with large uptimes, you are going to face trouble. So your machine learning system predicts that B will fail with 90% certainty and A with 80% certainty.
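
To make the setup concrete, here is a minimal sketch of what such a naive state-prediction model typically looks like. It assumes scikit-learn and pandas; the file names and feature columns are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Historical telemetry: one row per forklift per day, plus whether the
# machine failed within the following week (the "randomly-gathered-stuff").
history = pd.read_csv("forklift_history.csv")  # hypothetical file
features = ["running_hours", "hydraulic_pressure", "oil_temperature"]

model = GradientBoostingClassifier()
model.fit(history[features], history["failed_within_7_days"])

# Score the current fleet and flag the riskiest machines for a check-up.
fleet = pd.read_csv("fleet_today.csv")  # hypothetical file
fleet["failure_probability"] = model.predict_proba(fleet[features])[:, 1]
alerts = fleet.sort_values("failure_probability", ascending=False).head(10)
```

Note that nothing here records what is done after an alert is raised, and that gap is what the rest of this story is about.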

For both machines, a request to check whether they are okay is sent to the customer site for the manager on duty. For forklift A, some routine checks are made, oil is added and the machine is considered fine. For forklift B, an extremely experienced engineer mends the forklift with care and it keeps running smoothly for a long time. Forklift A fails horribly the next morning, causing downtime to operations. Back at headquarters, a data scientist checks that the newly implemented machine learning model has updated itself with the new data and gets applause from fellow engineers. We have predictive maintenance working in production.

What’s hidden is that once the model was actually used and then retrained, we unlearned quite a lot of what we had learned about the problem while studying it carefully before going live. Our data scientist might have spent weeks tuning and tweaking the model before it went to production, until a sufficient accuracy was achieved. While that is all good and fine, we forgot to account for what happens when predictions are actually acted upon. Now that the model is live, experts are acting on its predictions, but the model usually knows next to nothing about this.

Think of most machine learning models as isolated boxes whose contents vary with each iteration.

Since the model is retrained, say, daily, it always learns things anew. Before we went live, we learned what might trigger failures in our fleet of forklifts. Now that we have predicted failures and experts have acted on some of those predictions, our training data has effectively changed, and drastically. Some of our machinery became better, some maybe even worse. Maybe the top 10% of most probable failures have been acted on, some better than others, and that 10% will now behave differently and end up back in our training data (randomly-gathered-stuff-from-random-contexts), which will cause our model to drift away from the original signal towards irrelevant noise.
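
A toy simulation makes the effect visible. Everything below is illustrative only, assuming numpy and scikit-learn and a made-up relationship between hydraulic pressure drop and failure: we train on untouched history, act on the top 10% of predictions, let those interventions change the outcomes, retrain, and watch the learned effect of the original risk factor shrink.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate_outcomes(pressure_drop):
    # Made-up ground truth: larger pressure drop means higher failure risk.
    return rng.binomial(1, 1 / (1 + np.exp(-(3 * pressure_drop - 1.5))))

# Before going live: untouched history, the model learns the true signal.
X = rng.uniform(0, 1, size=(5000, 1))            # pressure drop, scaled 0..1
y = simulate_outcomes(X[:, 0])
model = LogisticRegression().fit(X, y)
print("coefficient before going live:", model.coef_[0][0])

# After going live: we act on the top 10% of predicted risks; maintenance
# roughly halves their failure rate, but the model never sees that fact.
X_new = rng.uniform(0, 1, size=(5000, 1))
risk = model.predict_proba(X_new)[:, 1]
treated = risk >= np.quantile(risk, 0.9)
y_new = simulate_outcomes(X_new[:, 0])
y_new[treated] = rng.binomial(1, 0.5 * y_new[treated].mean(), treated.sum())

# Retrain on the contaminated data: the learned effect of pressure drop shrinks.
model = LogisticRegression().fit(X_new, y_new)
print("coefficient after one live iteration:", model.coef_[0][0])
```

The exact numbers do not matter; what matters is that the second coefficient comes out systematically smaller, because the intervention, not the underlying physics, now explains part of the data.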

This is because we are predicting whether something will happen, not predicting what to do about it and learning from the feedback of that action.

Now, some might think that “we should use time-to-event models to mitigate this” or “perform bespoke causality analysis instead of simple machine learning”. There is merit to both, but we are still not shifting from predicting whether something will happen (a state) to predicting the effectiveness of an actual treatment (taking an action and learning from it).

It gets worse.

Take churn management in a telco. You work in a similar fashion to our forklift data scientist and craft a model that predicts churn. You go live with it and the company starts acting on that information. The top 10% of likely churners are contacted, and some are won back (maybe customers who were actually nearly impossible to win back, but you put a call centre of highly skilled people on them) and some are not (maybe customers who would have been easy to win back with an email, but they happen to hate call centres and you called them). Again we’ve unlearned our original signal. The next iteration of that model will predict something entirely different, and it only gets worse as time goes on.

This is not to say that this type of machine learning does not work at all. There are situations where models do improve over time simply by refitting them as new data is gathered. Most image recognition and NLP tasks work like this, for example, and even some recommendation systems can be refitted daily without dramatic issues. But even for those tasks, we should be really careful about what we are trying to learn and from what data. The sad fact is that most of the machine learning models I’ve seen live in production are actually unlearning machines. They are not getting better over time but worse.

To remedy this, we should be paying closer attention to predicting treatments and learning from those instead of naively predicting states. Better yet, we should tie our cost structure to those treatments and optimize finances while we are at it.
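
One hedged sketch of what that shift could look like is a simple two-model uplift estimate: predict the outcome with and without the treatment, and only spend money where the expected saving exceeds the cost of acting. It assumes you have started logging which machines were actually serviced; the file names, columns and cost figures are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Requires logging the action, not just the state: which machines were
# serviced (treated) and what happened to them afterwards.
log = pd.read_csv("maintenance_log.csv")        # hypothetical file
features = ["running_hours", "hydraulic_pressure", "oil_temperature"]

treated = log[log["serviced"] == 1]
control = log[log["serviced"] == 0]
m_treated = GradientBoostingClassifier().fit(treated[features], treated["failed"])
m_control = GradientBoostingClassifier().fit(control[features], control["failed"])

# Uplift: how much does servicing reduce this machine's failure probability?
fleet = pd.read_csv("fleet_today.csv")          # hypothetical file
uplift = (m_control.predict_proba(fleet[features])[:, 1]
          - m_treated.predict_proba(fleet[features])[:, 1])

# Act only where the expected avoided downtime is worth more than the visit.
COST_OF_VISIT, COST_OF_FAILURE = 500, 20000     # made-up figures
fleet["expected_value"] = uplift * COST_OF_FAILURE - COST_OF_VISIT
to_service = fleet[fleet["expected_value"] > 0]
```

This is the crudest possible formulation of uplift modelling, and in practice you would want randomised or at least carefully de-biased treatment assignment, but even this sketch closes the loop: the action and its outcome flow back into the training data instead of silently corrupting it.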

In most cases that type of approach would require a lot of silo-breaking and quite a bit of agility from the organization itself. But that’s another story. In the meantime, let’s be smart about what we are predicting in the first place and make machines learn something, even over time.
