Set and Don’t Forget

How machines and humans working together improves the outcomes of deployed models.

Adriana Beal
Slalom Data & AI
Nov 22, 2022


It’s no longer rare to see machine learning (ML) models being used to support a variety of business decisions, from whether a medical claim should be paid or sent to the fraud investigation team, to what route will be more efficient for a delivery truck or what discount should be offered to a distributor.

But while these ML-based solutions can be powerful tools to improve core operations, a significant number of organizations still consider the job done when the model is installed and running, unaware of the risks that this “set and forget” approach creates for the business.

Consistent monitoring is key

In practice, without robust post-deployment monitoring, the gains extracted from predictive models are likely to be short-lived.

To understand why that happens, imagine a company using an ML model to recommend personalized retention offers to individual customers. The model goes through rigorous testing that shows an increase in the customer retention rate from 75% to 90%. The new model is deployed, the promised results are achieved, and everybody is happy.

Yet, there is no guarantee that the model will preserve its performance level as time passes. It is possible that within a few months of deployment the company will see a steep decline in the number of customers who accept its personalized retention offers.

This risk exists because the quality of algorithmic decisions can degrade over time for various reasons. Perhaps a competitor is now running a promotion that makes previously successful renewal terms less appealing. Or post-pandemic changes in behavior have shifted what customers value.

Because of the dynamic environment in which ML-based decisions are made, organizations interested in protecting their investments need mechanisms in place to monitor model inputs, outputs and outcomes.

Original image from Google Cloud adapted from Hidden Technical Debt in Machine Learning Systems.
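As a minimal illustration of what monitoring a single model input could look like in code (the feature, data, and threshold below are hypothetical), a simple population stability index (PSI) check compares the production distribution of a variable against the distribution seen at training time:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Measure how far the production distribution of a variable has shifted
    from the distribution observed at training time."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip production values into the training range so every value lands in a bin.
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Hypothetical example: customer tenure (in months) at training time vs. this week.
rng = np.random.default_rng(42)
training_tenure = rng.normal(loc=24, scale=6, size=10_000)
recent_tenure = rng.normal(loc=18, scale=6, size=1_000)

psi = population_stability_index(training_tenure, recent_tenure)
if psi > 0.2:  # 0.2 is a common rule-of-thumb threshold for meaningful drift
    print(f"Input drift detected (PSI={psi:.2f}); flag the model for review.")
```

The same idea extends to outputs (the distribution of recommended offers) and outcomes (whether customers actually accepted them).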

Automated monitoring is not a panacea

Nowadays there is no shortage of tools to simplify model monitoring, including Amazon SageMaker Model Monitor:

With Model Monitor, you can set alerts that notify you when there are deviations in the model quality. Early and proactive detection of these deviations enables you to take corrective actions, such as retraining models, auditing upstream systems, or fixing quality issues without having to monitor models manually or build additional tooling.
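As a rough sketch of how such a monitoring schedule might be wired up with the SageMaker Python SDK (the role ARN, S3 paths, endpoint name, and instance settings below are placeholders, and exact parameters can vary across SDK versions):

```python
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Placeholder role and infrastructure settings for the monitoring jobs.
monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerMonitoringRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Build baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/retention/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/retention/baseline",
)

# Check captured endpoint traffic against the baseline every hour;
# violations surface as reports and metrics that alerts can be attached to.
monitor.create_monitoring_schedule(
    monitor_schedule_name="retention-offer-data-quality",
    endpoint_input="retention-offer-endpoint",
    output_s3_uri="s3://my-bucket/retention/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```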

It’s true that many ML tasks no longer require active human involvement, including many aspects of data cleaning and formatting, algorithm training, tuning, and monitoring. Still, while automated tools can make a big difference in the quality and scalability of model monitoring routines, they can’t entirely replace human judgement, at least for now.

We still need the human touch

To understand why human intervention can be a critical piece of an effective ML monitoring process, let’s go back to our example of a retention offer recommendation system.

Imagine that the company decided to protect its investments in ML by implementing a powerful monitoring tool that not only tracks and evaluates model performance, but also investigates and debugs issues and triggers processes to improve accuracy in production.

Over time, the monitoring logs start to show a decline in the retention rate for customers receiving the ML-based retention offers: from 90% to 85%, then 82%. Once the indicator reaches 80%, an alert automatically triggers a retraining process. The ten percentage points lost are recovered, and everyone is pleased. But later the business realizes that the profit generated by the original model, even at an 80% retention rate, was greater than the profit of the retrained model at 90% retention. The fix improved model accuracy, but at the expense of profitability.

What happened here?

Something we often see in real-life applications of machine learning: concept drift.

Concept drift means that the statistical properties of some relevant variable change over time. We may see drift happen in the target variable (e.g., a variation in which offer is predicted to retain more customers of a certain profile) but also in other relevant variables, such as the contribution to profitability of each customer segment.

To see how this might unfold, let’s use a simplified example where the business has ten customers approaching the end of their contract each month.

When the model was first developed, the company only dealt with high-margin customers. To achieve the best business results, the model only had to maximize its overall accuracy: the more customers are retained, the higher the business profit.

In month one, the real-life conditions in which the model operates reflect the training data: all customers due to renew are high-margin, and the model delivers the expected performance in terms of retention and profitability.

In month nine, the situation has changed. Now only five of the customers due to renew remain in the high-margin segment, while the other five belong to a low-margin segment. The desired outcome is no longer simply to retain as many customers as possible, but rather to maximize customer retention while minimizing the loss of customers in the high-margin segment.

In month nine, which is better: a model that makes two mistakes in ten but retains all high-margin customers, or a model that makes only one mistake but loses a high-margin customer and reduces overall profitability?
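To put hypothetical numbers on that question (the per-customer margins below are invented for illustration; the scenario doesn’t specify actual figures), a quick calculation shows how the higher-accuracy model can generate less profit:

```python
# Hypothetical per-customer margins for the two segments.
HIGH_MARGIN = 1_000  # profit from retaining one high-margin customer
LOW_MARGIN = 100     # profit from retaining one low-margin customer

def retained_profit(high_retained: int, low_retained: int) -> int:
    """Profit from the customers a model manages to retain in the month."""
    return high_retained * HIGH_MARGIN + low_retained * LOW_MARGIN

# Month nine: five high-margin and five low-margin customers up for renewal.
# Model A: 80% retention (two mistakes), but keeps every high-margin customer.
model_a_profit = retained_profit(high_retained=5, low_retained=3)
# Model B: 90% retention (one mistake), but the mistake is a high-margin customer.
model_b_profit = retained_profit(high_retained=4, low_retained=5)

print(model_a_profit, model_b_profit)  # 5300 vs. 4500: higher accuracy, lower profit
```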

A fact of life in ML is that optimizing for one goal (maximizing customer retention) may force us to do worse than we could have on another (minimizing the loss of customers in the high-margin segment).

In a dynamic business environment, one cannot expect the algorithm itself to discover and correct situations like that. There is a central role that people still have to play in defining what constitutes a “good” solution when this kind of trade-off needs to be addressed.

This is why, in model monitoring, the best outcomes come from combining the complementary strengths of humans and machines.

Combining the best of both worlds

When algorithms are used to support a task like approving loans, pricing insurance, or selecting the best retention offer, machines bring to the table the speed, scalability, and quantitative capability to analyze terabytes of data and proactively detect model quality issues before they become an emergency. People, on the other hand, are still better at combining facts and reasoning about new cases from far fewer data points than machines require to learn.

An example by Ben Dickson illustrates this well. If we step out of our home and notice that the street is wet, the sidewalk is dry, it’s sunny, and there is a road wash tanker parked down the street, we can easily make the connection “the road is wet because the tanker washed it, not because it just rained.” We can go from our observations to the right conclusion even if we have never seen a tanker washing a street before. An ML model would be incapable of making the same connection unless it had been trained on multiple relevant examples involving both rain and street washing.

Of course, this doesn’t diminish the value of automated model monitoring tools. Such tools can be an invaluable source of information about how well (or poorly) your algorithmic decision-making is doing, and they play an important role in mitigating risks by raising early alerts when anything seems out of place, for example notifying a business stakeholder when the monitoring tool detects an unusual change that requires investigation (e.g., all customers are suddenly being recommended retention offer A, when in the past only one third of customers typically received this recommendation).
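As a sketch of that kind of guardrail (the offer names, historical shares, and tolerance are hypothetical), a lightweight check might compare this week’s share of each recommended offer against its historical share and flag large swings for a person to investigate:

```python
from collections import Counter

# Hypothetical historical mix of recommendations across the three offers.
HISTORICAL_SHARE = {"offer_a": 0.33, "offer_b": 0.34, "offer_c": 0.33}

def offers_needing_review(recent_recommendations: list[str], tolerance: float = 0.15):
    """Return offers whose recommendation share has swung far from its historical level."""
    counts = Counter(recent_recommendations)
    total = sum(counts.values())
    flagged = []
    for offer, expected_share in HISTORICAL_SHARE.items():
        share = counts.get(offer, 0) / total
        if abs(share - expected_share) > tolerance:
            flagged.append((offer, round(share, 2)))
    return flagged

# If nearly every customer is suddenly recommended offer A, a person gets notified.
print(offers_needing_review(["offer_a"] * 95 + ["offer_b"] * 5))
```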

Yet, while automated monitoring tools play a key role in ensuring a viable afterlife for any ML-based model, in most real-life scenarios human reasoning is still essential. People, not machines, are the agents best equipped to look at the downstream and upstream effects of model decisions in order to understand and respond effectively to underlying changes in market trends, consumer preferences, or competitive threats.

In companies achieving end-to-end success with their deployed ML models, it’s the combination of human and machine capabilities that gives rise to the best results.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.


Adriana Beal
Slalom Data & AI

Adriana works at Slalom designing machine learning models to improve operational and decision processes in IoT, mobility, healthcare, human services, and more.