Model monitoring

By Dana Tokmurzina

ABN AMRO Developer Blog
14 min read · Jun 10, 2024


Titanic. Image source: https://www.pigeonforgetncabins.com/a-hands-on-experience-at-the-titanic/

Do you know what connects Titanic and data science? If you work in the data field, you may immediately think of the famous Titanic dataset that aspiring data scientists often use as their first project. However, I want to draw another similarity between the two.

Titanic, the most innovative passenger ship of its time, tragically sank when it met an unexpected obstacle. This can serve as an analogy for data science models: while they may be incredibly accurate at the time of their creation, what truly matters for their success is how they perform in real-world scenarios.

In this article, I want to share a solution that can help deployed data science models avoid the Titanic’s fate. Based on an example business case, we will go step by step through implementing a model monitoring system for a data science project at the bank.

Why model monitoring?

Before we go deeper, let’s review the process of creating a data science model. The typical workflow involves gathering requirements, collecting data, developing a model, and facilitating its deployment. However, deploying a model does not mark the end of the process. There may be various issues that arise post-deployment, which can prevent deployed machine learning (ML) models from delivering the expected business value. To illustrate this, consider an example where a loan approval model suddenly starts rejecting every customer request. This can result in many negative outcomes: customer dissatisfaction, potential monetary loss, and a negative NPS score. Hence, monitoring a model and proactively detecting issues to deploy updates early is crucial!

What are ML performance metrics?

Given my interest in this subject, I came across several resources, but the one I found the most insightful and comprehensive read on post-deployment monitoring was Chip Huyen’s book, “Designing Machine Learning Systems”. Huyen emphasizes the significance of post-deployment monitoring and categorizes related issues into two primary groups: operational metrics and machine learning (ML) performance metrics. Let’s explore these categories.

Operational metrics cover issues such as a failure to load predictions into a table, a server or cluster that fails to start, or a breakdown in the data pipeline. These issues are usually easier to detect and are also encountered in other software engineering projects.

Machine learning performance metrics relate to a model’s performance degradation over time. These metrics depend on both the data and the model that has been built. In our case, as we work at a bank, our data consists of dynamic customer behavior features and changing products and prices, and it is affected by external factors such as geopolitical situations, pandemics, the economy, and legal regulations.

As the data and the relationships between features may change, our model may fail to perform at the expected level. This happens because the model captures the relationship between input and output for a specific period in the past and cannot adapt to a constantly changing world (unless we explicitly program it to do so).

In this article, we will focus on ML performance metrics.

Regularly updated model (source)

How is model monitoring done?

Manually checking for all the changes in data and models is not a scalable solution. Moreover, it requires a solid understanding of data science and of the business context to accurately detect an issue; for instance, determining how much variation in the data constitutes a real change requires statistical tests and domain expert knowledge.

To address these challenges, we need a tool that automatically captures such issues, provides a comprehensive overview of ML performance metrics, and alerts us if any action is needed.

Post deployment monitoring

In order to better understand how model monitoring works, it can be helpful to go through a practical example of the steps involved in the post-deployment phase of a machine learning project. For instance, let’s consider a scenario where a commercial team requests a prediction of which customers are likely to churn on their mortgage product. The data science team then runs an exploratory analysis and, if the results are positive, develops a predictive model that aligns with the business requirements. The model must pass performance and robustness checks from a data science point of view before it can be put into production. The expected business value of this model is to predict in time which customers are most likely to churn.

To make sure that the model is robust enough to handle real-world scenarios, we have implemented a monitoring process (note that we may also encounter operational issues, but they are not within the scope of this article).

What are the steps involved in designing and implementing an effective post-deployment monitoring system for ML performance metrics?

Brainstorm and list all the requirements

First, the data scientists and business experts involved in the project discuss and write down a list of requirements that includes the crucial metrics about the data and the model used. For instance, you might track metrics like recall and lift scores across different model runs.

Some metrics can be evaluated through unit tests. If the input data has a fixed size, it is possible to write unit tests that check the size of the incoming data. For variable-length data, however, continuously monitoring the data size is more effective. Therefore, it is advisable to divide metric checks into two categories: unit tests for predefined schemas, rules, and immediate feedback, and ongoing monitoring for continuously evaluating variable metrics.

Some metrics may not be readily available. For instance, in a loan approval use case, it may take years to confirm whether a loan has been successfully repaid. This makes it impossible to assess model predictions by simply comparing actual outcomes with predicted values, so traditional metrics like accuracy and recall are impractical to use. Instead, you might consider monitoring prediction drift: tracking the change in model predictions over time and ensuring they do not deviate too much from historical values.
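For example, prediction drift can be tracked with the Population Stability Index (PSI). Below is a minimal sketch in plain NumPy; the prediction arrays are stand-ins, the scores are assumed to be probabilities in [0, 1], and the thresholds in the comment are common rules of thumb rather than hard rules.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-4):
    """Compare two score distributions with the Population Stability Index.

    Assumes the scores are probabilities in [0, 1]; eps avoids division by
    zero for empty bins.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ref_pct = np.clip(np.histogram(reference, bins=edges)[0] / len(reference), eps, None)
    cur_pct = np.clip(np.histogram(current, bins=edges)[0] / len(current), eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


# Stand-ins for last month's and this month's churn probabilities
last_month = np.random.beta(2, 8, size=5_000)
this_month = np.random.beta(3, 7, size=5_000)

psi = population_stability_index(last_month, this_month)
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 large shift
print(f"Prediction PSI: {psi:.3f}")
```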

Specifically for our mortgage churn project, we differentiated the metrics into those that can be verified by unit tests and those that require continuous monitoring. Additionally, we categorized the metrics into those related to the data and those related to the model itself.

Data issues

The dynamic nature of the world means that data distributions can change over time. For instance, after a marketing campaign, you may get more users of a certain demographic, which changes the input distribution over time and leads to what is known as data distribution drift. This drift typically comes in three main forms: covariate shift, label shift, and concept drift, which are the primary focus here.

Understanding drifts with mathematical notations

Let’s consider the data as X (input) and the output as Y (predictions).

In the supervised case, our machine learning model aims to predict Y given X, denoted as P(Y|X). The training data comes from the joint distribution P(X, Y), which, thanks to Bayes’ theorem, breaks down into P(Y|X) * P(X) or P(X|Y) * P(Y).

Covariate drift occurs when P(X) changes while P(Y|X) remains constant.

Label drift is observed when P(Y) changes, while P(X|Y) is constant.

Concept drift happens when P(Y|X) changes, while P(X) remains constant.
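To make these definitions concrete, here is a small toy simulation (a single, hypothetical one-dimensional “interest rate” feature) that contrasts covariate shift with concept drift:

```python
import numpy as np

rng = np.random.default_rng(0)


def p_y_given_x(x):
    """Toy conditional P(Y=1|X): probability of churn rises with x (e.g. interest rate)."""
    return 1 / (1 + np.exp(-(x - 2.0)))


# Reference period: P(X) and P(Y|X) as seen during training
x_ref = rng.normal(loc=2.0, scale=1.0, size=10_000)
y_ref = rng.binomial(1, p_y_given_x(x_ref))

# Covariate shift: P(X) changes (higher rates), P(Y|X) unchanged
x_cov = rng.normal(loc=3.0, scale=1.0, size=10_000)
y_cov = rng.binomial(1, p_y_given_x(x_cov))

# Concept drift: P(X) unchanged, but the relationship P(Y|X) itself changes
x_con = rng.normal(loc=2.0, scale=1.0, size=10_000)
y_con = rng.binomial(1, 1 / (1 + np.exp(-(x_con - 3.0))))

print("P(Y=1):", y_ref.mean(), y_cov.mean(), y_con.mean())
```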

Drift examples

Explaining drift types with examples

Covariate drift is a phenomenon where the distribution of the input variables changes over time, while the conditional distribution of the target given the input remains constant (i.e., P(Y|X) does not change). This makes the drift difficult to detect, as the output distribution appears to be consistent. For instance, consider a scenario where the data for training a model was collected by surveying individuals at multiple universities. As a result, the majority of respondents happen to be students aged 20–40.

However, if the model is intended to be used by a broader population (including those over 40), the skewed data may lead to inaccurate predictions due to covariate drift. To detect covariate shift, one can compare the input data distribution between the training and test datasets. One way to tackle the issue is importance weighting: estimate the density ratio between the real-world input data and the training data, and reweight the training data based on this ratio so that it better represents the broader population. This allows training a more accurate ML model. In deep learning, a popular technique for adapting a model to a new input distribution is fine-tuning.
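One common way to implement this importance weighting is to estimate the density ratio with a domain classifier that tries to tell training data apart from real-world data. The sketch below uses scikit-learn; X_train and X_production are hypothetical feature matrices, and the resulting weights are only estimates, up to a constant factor when the two samples differ in size.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def importance_weights(X_train, X_target):
    """Estimate density-ratio weights ~ p_target(x) / p_train(x) for each
    training row via a classifier that distinguishes the two samples."""
    X = np.vstack([X_train, X_target])
    # 0 = row came from the training sample, 1 = from the target (real-world) sample
    domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_target))])

    clf = LogisticRegression(max_iter=1000).fit(X, domain)
    p_target = np.clip(clf.predict_proba(X_train)[:, 1], 0.01, 0.99)
    # Clipping avoids extreme weights from near-0/1 probabilities
    return p_target / (1.0 - p_target)


# Hypothetical usage: make the training data mimic production inputs
# weights = importance_weights(X_train, X_production)
# churn_model.fit(X_train, y_train, sample_weight=weights)
```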

In target/label drift, the output distribution P(Y) changes while the class-conditional input distribution P(X|Y) remains the same. For instance, if historical data shows that people aged 55+ are more interested in pension-related banners, but a bank app malfunction prevents clicks on these banners, the click rate P(Y) will be affected. However, it would still be true that most people who manage to click are 55+ (P(X = age 55+ | Y = click)), assuming the app fails randomly across all ages. Under label shift the model may still be somewhat effective, but its performance metrics, such as accuracy, can be skewed because the base rates of the target classes have changed. Similar to handling covariate shift, you can adjust the weights of the training samples based on how representative they are of the new target distribution.

Concept drift occurs when the relationship between the inputs and targets changes over time. This means that the patterns or associations the model learned during training, P(Y|X), no longer hold in the same way, even though the input distribution P(X) stays the same. The Netherlands provides a good example: changes in the housing market can affect the probability of buying a house, P(Y|X), this year compared with, say, two years ago. Factors like rising interest rates and prices, changes in market trends, and consumer behavior can alter the relationship between input and output.

For example, due to rising prices, younger customers may prefer to stay with their parents for more extended periods before moving to their own homes. If the model relies on outdated associations, such as targeting younger demographics for mortgage campaigns, its predictions will become less accurate because the underlying concept has changed. Thus, it is crucial to update the model regularly to account for changes in market trends, consumer behavior, and other relevant factors that may impact P(Y|X).

In our mortgage churn project, we encountered changes in the housing market that affected the performance of our model: the generated predictions were no longer consistent with the actual churners. After retraining the model, we observed that new features were now contributing significantly to the model’s predictions.

Note, it is crucial to validate such changes with domain knowledge experts to determine whether they are market-driven or due to data quality issues.

Solution design

Data-related checks

Earlier, we introduced possible data-related issues that may occur after deploying a model. To start with, it is good to establish basic data quality checks, such as verifying data schema consistency:

  • Data types;
  • Column names;
  • Percentage of missing values in the data;
  • Number of unique categories, label schema consistency.
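As a minimal sketch of such schema checks (assuming a pandas DataFrame; the column names, allowed categories, and thresholds are hypothetical and should be replaced with the ones agreed with your team):

```python
import pandas as pd

# Hypothetical expectations agreed with the team
EXPECTED_SCHEMA = {
    "age": "int64",
    "interest_rate": "float64",
    "product_type": "object",
}
MAX_MISSING_FRACTION = 0.05
EXPECTED_CATEGORIES = {"product_type": {"fixed", "variable", "hybrid"}}


def basic_data_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable data-quality issues (empty = all good)."""
    issues = []

    # Column names and data types
    missing_cols = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing_cols:
        issues.append(f"missing columns: {sorted(missing_cols)}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            issues.append(f"{col}: expected {dtype}, got {df[col].dtype}")

    # Percentage of missing values
    for col in df.columns:
        frac = df[col].isna().mean()
        if frac > MAX_MISSING_FRACTION:
            issues.append(f"{col}: {frac:.1%} missing values")

    # Unexpected categories
    for col, allowed in EXPECTED_CATEGORIES.items():
        if col in df.columns:
            unexpected = set(df[col].dropna().unique()) - allowed
            if unexpected:
                issues.append(f"{col}: unexpected categories {sorted(unexpected)}")

    return issues
```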

To detect differences in distributions, a simple method is to compare their statistical properties, such as:

  • Averages: mean, median;
  • Outliers: max, min;
  • Quantile or percentile distributions;
  • Skewness, kurtosis, etc.

For example, if you compare the training data from the current year to that of the previous year and observe a difference in the mean values of some features, that can indicate a change in the distribution. To make sure the change is statistically significant and not the result of random fluctuation, you need to run a two-sample hypothesis test. There are several common statistical tests that can be used to compare distributions; a list is provided below. More detailed information on statistical tests can be found here.

To evaluate these types of statistical drift, you can use:

  • Kolmogorov-Smirnov (KS) test (suitable for any distribution, but mostly used for one-dimensional data; expensive to run and can produce many false-positive alerts);
  • Kullback-Leibler divergence;
  • Wasserstein distance (used with larger datasets, >1,000 observations, and numerical data with more than 5 unique values);
  • Population Stability Index (PSI);
  • Chi-squared test, Jensen-Shannon divergence (for categorical data and numerical data with fewer than 5 unique values);
  • Proportion difference test for independent samples based on Z-score (for binary categorical data).

What to compare with statistical tests:

  • Train and test datasets (as in the case of covariate drift);
  • Datasets after feature transformations and processing, to make sure processing steps do not introduce any unintended bias or significant change in the data distribution;
  • Prediction distributions (as in the case of concept drift or target drift);
  • Prediction distributions between different models;
  • Between different experimental conditions, time periods, batches, customer segments (age groups, geographic regions), and campaign variants.
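Below is a minimal sketch of how some of these two-sample tests could be run per feature with SciPy; the significance level and the rule for switching between the KS and chi-squared tests are illustrative choices, and the dataset names in the usage comment are hypothetical.

```python
import pandas as pd
from scipy import stats

ALPHA = 0.05  # illustrative significance level, to be agreed with the team


def drift_tests(reference: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Run a simple two-sample test per column: KS for numeric features with
    many unique values, chi-squared on category counts otherwise."""
    rows = []
    for col in reference.columns.intersection(current.columns):
        ref, cur = reference[col].dropna(), current[col].dropna()
        if pd.api.types.is_numeric_dtype(ref) and ref.nunique() > 5:
            stat, p_value = stats.ks_2samp(ref, cur)
            test = "ks"
        else:
            counts = pd.concat(
                [ref.value_counts(), cur.value_counts()], axis=1
            ).fillna(0)
            stat, p_value, _, _ = stats.chi2_contingency(counts)
            test = "chi2"
        rows.append({"feature": col, "test": test,
                     "p_value": p_value, "drift": p_value < ALPHA})
    return pd.DataFrame(rows)


# Hypothetical usage: compare last year's training features with this year's batch
# report = drift_tests(train_features_2023, scoring_features_2024)
# print(report[report["drift"]])
```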

Model performance-related checks

As mentioned earlier, a deployed model may be impacted by changes in the input data or in the relationship between input and output. To ensure your model performs as intended, the following guidelines are advisable:

  • It is recommended to monitor model performance metrics on a regular basis to identify any deviation from the established baseline. This involves assessing metrics like precision, recall, F1 score, and lift at different percentiles. For instance, in the mortgage churn project, we assessed lift at various percentiles and examined precision/recall because of the highly imbalanced data (see the lift sketch after this list). If performance metrics fall below a threshold, an alert is sent to the team.
  • Ensure that the model can generalize well on unseen data by comparing its performance between training and testing datasets. If the model performs significantly better on the training data than on the testing data, it is a clear indication of overfitting.
  • Monitor for data leakage, which occurs when information that was not available at prediction time is unintentionally included in the training data, resulting in a false performance boost. In this post, Eugene Yan describes data leakage in a project aimed at estimating a patient’s hospital bill. One of the features used to train the model was the estimated number of hospitalization days. Imagine that the estimated days are replaced with the actual number of days from past data and this feature is still used for training. That means future information is fed to the model, leading to much higher performance on the training data than on unseen data.
  • Analyze model outputs for bias by splitting the data into segments and examining the model’s performance across these segments to ensure fairness and compliance with ethical standards.
  • The model’s predictions can also impact your future data. For example, if the model only recommends investment products to customers, these products will appear to be continuously popular, while other products would be less in demand. This phenomenon is known as a degenerate feedback loop.
  • Checking SHAP (SHapley Additive exPlanations) values, which explain each feature’s contribution to the model’s prediction for a particular instance. If a feature’s SHAP values change significantly over time, this may indicate that the feature’s importance has changed or that the feature has been influenced by external factors. This could be a sign that the model needs to be retrained or that the feature needs to be updated or removed.

Example of a data drift monitoring dashboard (source)
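As a sketch of the lift check mentioned in the first bullet above, the snippet below computes lift at a few percentiles from arrays of realized churn labels and model scores; the variable names are hypothetical, the percentile cut-offs are illustrative, and at least some positive labels are assumed to exist.

```python
import pandas as pd


def lift_at_percentiles(y_true, y_score, percentiles=(1, 5, 10, 20)):
    """Lift = churn rate among the top-scored X% of customers divided by the
    overall churn rate. Values well above 1 mean the model still ranks well."""
    df = pd.DataFrame({"y": y_true, "score": y_score}).sort_values(
        "score", ascending=False
    )
    base_rate = df["y"].mean()  # assumes at least some positive labels
    rows = []
    for pct in percentiles:
        top = df.head(max(1, int(len(df) * pct / 100)))
        rows.append({"top_%": pct, "lift": top["y"].mean() / base_rate})
    return pd.DataFrame(rows)


# Hypothetical usage with last month's scored customers and the churn labels
# that have since become available:
# print(lift_at_percentiles(actual_churn, predicted_churn_probability))
```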

Tools

As previously mentioned, manually reviewing all changes in data and models is not a scalable approach. There are various ways and tools to establish a monitoring system depending on the needs. When choosing a monitoring tool, it’s crucial to consider several key factors, such as cost, time, existing IT infrastructure, and legal requirements for industries like healthcare and banking. Based on these factors, you can decide whether to use a separate monitoring platform, leverage the built-in functionality of your current IT ecosystem, or develop a custom solution.

There are currently several options available on the market designed to assist data scientists in monitoring and evaluating the performance of their models in the post-production phase. The choice can be based on the platform or ecosystem of tools your team already uses: for example, AWS has built-in monitoring capabilities like Amazon SageMaker Model Monitor, and Databricks users can use Databricks Lakehouse Monitoring. External monitoring tools range from simple data quality checks to fully fledged MLOps platforms. A great overview of tools can be found here and here.

Unit testing

Unit testing enables the early detection of errors during the development phase, ensuring that everything works as intended and that the code remains stable as the project expands. Unit tests are great for ensuring expected quality, for instance checking whether input data is consistent with a predefined data schema. Commonly used options for unit testing in Python include Pytest and Unittest.
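For illustration, here is a small Pytest sketch that checks an incoming feature batch against a predefined schema; the features module, column names, and value ranges are hypothetical placeholders.

```python
import pandas as pd
import pytest

# Hypothetical loader for the latest feature batch; replace with your own.
from features import load_feature_batch

EXPECTED_COLUMNS = {"customer_id", "age", "interest_rate", "product_type"}


@pytest.fixture(scope="module")
def batch() -> pd.DataFrame:
    return load_feature_batch()


def test_schema_has_expected_columns(batch):
    assert EXPECTED_COLUMNS.issubset(batch.columns)


def test_customer_id_is_unique(batch):
    assert batch["customer_id"].is_unique


def test_interest_rate_within_plausible_range(batch):
    # Illustrative plausibility bounds in percent
    assert batch["interest_rate"].between(0, 20).all()
```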

Model monitoring tools

Several open-source libraries are available to aid in data and model change analysis, including Evidently, Great Expectations, and Alibi Detect. As we have experience with Evidently, let me provide an overview of its functionality.

Evidently assists data scientists in monitoring their model’s performance over time, making it easier to detect issues. The tool includes capabilities for detecting various kinds of data drift using a wide range of statistical tests, the ability to set custom thresholds for alerts, interactive visualizations for monitoring, and saving results as an HTML report or JSON file. Additionally, dashboards can be hosted in Evidently Cloud, which facilitates collaboration. Evidently supports tabular data, text, and embeddings.
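As a minimal sketch of how a data drift check could look with Evidently (based on the Report / DataDriftPreset API of the Evidently versions we have worked with; exact imports and result structure may differ in newer releases, and the file names are hypothetical):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical inputs: the data the model was trained on and the latest scored batch
reference_df = pd.read_parquet("features_reference.parquet")
current_df = pd.read_parquet("features_current.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)

report.save_html("data_drift_report.html")  # shareable, interactive report
drift_summary = report.as_dict()            # same results as a dict, usable for alerting logic
```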

Example of data drift detection with Evidently.

We recommend reading this blog post to gain a better understanding of all Evidently capabilities.

Alerting

It is equally important to set up an alerting system so your team won’t miss any issues. However, it is inconvenient if the alerts are too sensitive and trigger too frequently, creating unnecessary workload and diverting attention from more critical tasks. Therefore, it is essential to agree on optimal thresholds and an alerting frequency beforehand. Additionally, alerts should be descriptive, giving the alerted individuals a clear understanding of the issue and the ability to trace it back.

In our team, we use Evidently to monitor data and model drift. Outputs from Evidently are logged in MLflow and Azure Application Insights. These logs can be seamlessly transferred to Azure dashboards, where customized views can be created and shared with the team. Alerts can be generated from the same logs with Azure Monitor. Additionally, we leverage Databricks alerts to monitor data ETL issues.
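As a minimal sketch of this logging-plus-alerting idea (the metric names, threshold, and run name are hypothetical; in our setup the actual alert rule would live in Azure Monitor and pick up the warning from the logs):

```python
import logging

import mlflow

logger = logging.getLogger("model_monitoring")


def log_drift_metrics(drift_metrics: dict, psi_alert_threshold: float = 0.25):
    """Log monitoring metrics to MLflow and emit a descriptive warning that
    downstream alert rules can pick up from the logs."""
    with mlflow.start_run(run_name="mortgage-churn-monitoring"):
        for name, value in drift_metrics.items():
            mlflow.log_metric(name, value)

    psi = drift_metrics.get("prediction_psi", 0.0)
    if psi > psi_alert_threshold:
        # Descriptive message so the alerted person can trace the issue back
        logger.warning(
            "Prediction PSI %.3f exceeded threshold %.2f for the mortgage churn model",
            psi, psi_alert_threshold,
        )


# Hypothetical usage after the monitoring step has produced its numbers:
# log_drift_metrics({"prediction_psi": 0.31, "share_of_drifted_features": 0.2})
```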

Example of creating alerts in Azure Monitor (source).

Conclusion

I hope you agree with us that deploying a model is not the end of the process. Investing in proper monitoring can prevent losses, missed opportunities, and customer dissatisfaction by ensuring that the model performs as expected in real-world conditions. Numerous issues can arise post-deployment, and being prepared to detect and address them helps maintain the product’s quality.

While we’ve focused on common post-deployment issues, it’s important to recognize that more advanced models, such as neural networks or hierarchical models, can present their own unique challenges. Machine learning monitoring is an iterative process that requires ongoing refinement and adaptation. As the field evolves, new tools and techniques emerge, enhancing our ability to monitor and maintain models effectively. We hope this article has given you an idea of what the model monitoring process looks like. With robust monitoring practices, your model can withstand the turbulent currents of the real world, ensuring its long-term success and reliability.


About the author

Dana Tokmurzina is a data scientist at ABN AMRO. Passionate about human decision-making, the brain, robots, baking, and running.

LinkedIn: https://nl.linkedin.com/in/dana-tokmurzina-107378126

Medium profile: https://medium.com/@danatokmurzina
