Model Monitoring — Essential Concept for Beginners

Sze Zhong LIM · Published in Data And Beyond · Feb 3, 2024

As data scientists, building models is just the beginning of our journey. Ensuring that these models perform effectively over time is equally crucial. Think of a car you just bought: when the season changes to winter, you might have to fit winter tires so the vehicle can still be driven safely and smoothly. Similarly, models are trained on past data, so we need to monitor whether the conditions in production still resemble our training conditions, and whether the models continue to behave as intended.

This is where Model Monitoring comes into play. Model monitoring is the process of continuously evaluating a model’s performance, fairness, and stability in production environments. In this article, we delve into the key aspects of model monitoring:

  1. Data Drift
  2. Model Performance
  3. Model Fairness
  4. Model Explainability

Data Drift

Data drift refers to the phenomenon where the statistical properties of the target variable or input features change over time. It is commonly encountered in real-world applications due to evolving trends, seasonality, changes in user behavior, or even something political such as a change of policy.

For example, consider a credit scoring model used by a lending institution to assess loan applications. Data drift may manifest as variations in the distribution of credit attributes, such as income levels or debt-to-income ratios, over time. By scrutinizing incoming loan application data against historical patterns, the institution can identify deviations indicative of data drift and recalibrate the model accordingly.

An illustration I made to show how data drift affects the model, and how recalibrating the model will affect the result.

Monitoring data drift involves comparing the distributions of incoming data with the training data. Various statistical methods, such as the Kolmogorov-Smirnov test or Kullback-Leibler divergence, can be employed. We may also use a stability index, such as the Characteristic Stability Index (CSI) or the Population Stability Index (PSI), to monitor the change in the population distribution.

Image from mwburke's GitHub post
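To make the idea concrete, below is a minimal PSI sketch in plain NumPy. The bin count, the clipping constant, and the income figures are my own illustrative choices rather than something taken from a particular library; a common rule of thumb treats PSI above 0.25 as significant drift.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """Compare bin proportions of a production sample against a baseline sample."""
    # Bin edges are derived from the baseline (training) data
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Guard against log(0) for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Toy example: the income distribution shifts between training and production
rng = np.random.default_rng(0)
train_income = rng.normal(60_000, 15_000, 10_000)
prod_income = rng.normal(66_000, 15_000, 10_000)

# Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is moderate drift, > 0.25 is major drift
print(population_stability_index(train_income, prod_income))
```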

For continuous data, we can use kernel density estimation, statistical tests like the Kolmogorov-Smirnov test, and even distance-based methods like the Wasserstein distance, which gives us a quantitative measure of drift between continuous distributions. For categorical data, we can use a chi-squared test to check whether the category frequencies in production differ from those in the training data.
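Here is a small sketch of what these checks could look like with SciPy; the feature values and category counts below are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
train_feature = rng.normal(0.0, 1.0, 5_000)   # baseline continuous feature
prod_feature = rng.normal(0.3, 1.0, 5_000)    # production window with a shifted mean

# Kolmogorov-Smirnov test: a small p-value suggests the two samples differ
ks_stat, ks_pvalue = stats.ks_2samp(train_feature, prod_feature)

# Wasserstein distance: a quantitative drift measure, in the same units as the feature
w_distance = stats.wasserstein_distance(train_feature, prod_feature)

# Chi-squared test for a categorical feature: compare category counts across the two windows
train_counts = np.array([700, 200, 100])
prod_counts = np.array([550, 300, 150])
chi2, chi_pvalue, dof, expected = stats.chi2_contingency(np.vstack([train_counts, prod_counts]))

print(ks_stat, ks_pvalue, w_distance, chi2, chi_pvalue)
```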

Model Performance

In the model training stage, we evaluate our model using the validation dataset and the test dataset, both of which the model has not seen before. We normally choose the evaluation metrics based on the nature of the problem. For instance, on a class-imbalanced dataset, we might focus more on recall than on accuracy.

As we put our model into production, we want to continue monitoring the performance of the model to ensure that it is working as intended. Continuous monitoring can help us identify subtle shifts in model behavior and performance degradation over time.

While it is essential to focus on the metrics most relevant to the model’s objectives during model monitoring, maintaining a holistic view of model performance is equally important. By monitoring a comprehensive set of performance metrics and considering the perspectives of various stakeholders, data scientists can ensure that their models remain effective, reliable, and aligned with the evolving needs of the application domain.
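As a rough sketch of what this could look like in code, the snippet below re-computes a handful of scikit-learn metrics over fixed-size windows of a hypothetical production log. The labels, scores, window size, and thresholds are all invented for illustration; in a real system, ground-truth labels typically arrive with some delay.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical production log: predicted scores logged at inference time, labels collected later
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 3_000)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, 3_000), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

# Re-compute a holistic set of metrics over fixed-size windows (e.g. one window per week)
window = 1_000
for i, start in enumerate(range(0, len(y_true), window)):
    sl = slice(start, start + window)
    print(
        f"window {i}: "
        f"accuracy={accuracy_score(y_true[sl], y_pred[sl]):.3f}, "
        f"precision={precision_score(y_true[sl], y_pred[sl]):.3f}, "
        f"recall={recall_score(y_true[sl], y_pred[sl]):.3f}, "
        f"auc={roc_auc_score(y_true[sl], y_score[sl]):.3f}"
    )
# In practice, an alert would fire when a metric drops below an agreed threshold
```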

Below is a link to some metrics that are commonly used to evaluate models.

Model Fairness

Model fairness is a critical aspect of model monitoring, ensuring that machine learning models make predictions without perpetuating biases or discrimination against certain groups. Achieving fairness involves assessing and mitigating the potential impact of sensitive attributes such as race, gender, or age on model predictions.

According to MAS (Monetary Authority of Singapore), which released the Principles to Promote Fairness, Ethics, Accountability, and Transparency (FEAT) in the use of AI and Data Analytics in Singapore's Financial Sector, fairness is about ensuring that:
1) individuals / groups are not systematically disadvantaged
2) the model minimizes unintentional bias

You may find the link to the document here.
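As a tiny illustration of how a fairness check might be tracked in production, the sketch below computes per-group approval rates and two common group fairness measures. The data and group labels are invented, and libraries such as Fairlearn or AIF360 provide more complete implementations.

```python
import pandas as pd

# Hypothetical monitoring log: model decisions alongside a sensitive attribute
log = pd.DataFrame({
    "approved": [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
    "group":    ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
})

# Selection (approval) rate per group
rates = log.groupby("group")["approved"].mean()

# Demographic parity difference and disparate impact ratio between groups
dp_difference = rates.max() - rates.min()
di_ratio = rates.min() / rates.max()

print(rates.to_dict())
print(dp_difference, di_ratio)  # a commonly cited (and debated) rule of thumb flags a ratio below 0.8
```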

For a real-world example of how monitoring model fairness played out in practice, you can read more about COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) and racial bias.

Image from the American Council on Science and Health.

COMPAS is a machine learning tool used in the US criminal justice system to assess the risk of recidivism among individuals involved in the justice system. It analyzes various factors such as criminal history, socioeconomic status, and demographic information to generate risk scores, which inform sentencing, parole, and other decisions. Although race was not explicitly included as a feature in the model's design, rigorous model monitoring revealed that the COMPAS system exhibited biases concerning race. The issue stemmed from indirect correlations between certain input features and race, such as zip code or socioeconomic status, leading to disparate outcomes in risk scores and sentencing decisions among different racial groups, notably African American and Caucasian individuals. Even in the absence of direct race-based features, these proxy variables inadvertently introduced bias into the model's predictions, highlighting the complexity of addressing fairness and equity in algorithmic decision-making within the criminal justice system.

Model Explainability

Model explainability refers to the ability to understand and interpret how a machine learning model arrives at its predictions or decisions. I gave some explanation of XAI (eXplainable AI) in my article on LIME.

But how is it relevant to model monitoring? Essentially, we can monitor the consistency and stability of a model's explanations over time to detect any drift in model behavior. In my opinion, this is similar to data drift, except that instead of the raw data distribution, we look at drift in the feature importance ranking, SHAP values, or other explanation outputs.
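One simple way to do this, sketched below with made-up importance values, is to compare the feature importance ranking (for example, mean absolute SHAP values) between the training period and a recent production window using rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

features = ["income", "debt_ratio", "age", "tenure", "num_accounts"]

# Hypothetical mean |SHAP| values (or feature importances) from two time windows
train_importance = np.array([0.42, 0.31, 0.12, 0.09, 0.06])  # at training time
prod_importance = np.array([0.18, 0.40, 0.25, 0.10, 0.07])   # recent production window

# A rank correlation close to 1 means the explanation ranking is stable;
# a noticeable drop suggests the model now relies on features differently
rho, _ = spearmanr(train_importance, prod_importance)
print(dict(zip(features, prod_importance)), f"rank correlation = {rho:.2f}")
```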

Conclusion

In conclusion, model monitoring is a critical component of ensuring the reliability, effectiveness, and fairness of machine learning models in real-world applications. By continuously assessing model performance, fairness, stability, and explainability, organizations can detect and mitigate issues such as concept drift, data drift, bias, and model degradation.

Lastly, there is a video by WhyLabs that provides an intro to ML monitoring using their software; I found the simple live examples they walk through helpful for building an understanding of the topic.

Some additional good links:

Data Drift
1) https://towardsdatascience.com/dont-let-your-model-s-quality-drift-away-53d2f7899c09
2) https://medium.com/meliopayments/data-drift-detection-46bfaa9f743c
3) https://www.listendata.com/2015/05/population-stability-index.html

Model Fairness
1) https://medium.com/thoughts-and-reflections/racial-bias-and-gender-bias-examples-in-ai-systems-7211e4c166a1
2) https://medium.com/data-science-at-microsoft/measuring-fairness-in-machine-learning-3211b62340b
3) https://www.mas.gov.sg/schemes-and-initiatives/veritas
