Monitoring AI/ML model drift for successful deployments

Using Teradata Vantage ModelOps for model monitoring

Pablo Escobar de la Oliva
Teradata
10 min read · Jan 11, 2023


Image source — Teradata

This post is co-authored with Will Fleury, ModelOps Director of Engineering at Teradata.

Monitoring AI/ML models refers to tracking the performance of models in production to ensure they behave as expected. It has become one of the critical model operations (ModelOps) capabilities that organizations implement to scale efficiently on their AI/ML journeys. Monitoring is part of ModelOps lifecycle management and sits after model deployment, but, as we will see in this post, it requires additional considerations throughout the whole model lifecycle.

When tracking the performance of a model, we look for drift in that performance, or more directly for “model drift”: the degradation of model performance over time due to changes in the data the model uses and the targets it predicts.

Data is expected to evolve over time, especially in dynamically changing environments where non-stationarity is typical. Model drift can stem from two data drift causes: either the characteristics of the target variable change (“concept” drift) or the characteristics of the features (input variables) change (“virtual” drift).

Concept Drift

“Concept” is used in AI/ML analytics to refer to the functional relationship between the model features and the target; put simply, it is the meaning of the target variable. When concept drift happens, the statistics of whatever you are trying to predict or estimate have changed. Such changes can happen with or without changes in the features. A new labeled feature set and a new model are needed, since the existing model will not work for the updated definition. Classic examples of concept drift are:

  • Mail spam classification that initially depended only on the number of emails but now depends on the nature of the content and additional features.
  • Financial transactions fraud on new transaction channels.
  • A new regulation requires new calculations on the model.

Virtual Drift

“Virtual” drift, also called data drift or feature drift in other articles, happens when the characteristics or statistics of the features change but not necessarily the concept. The interesting aspect is that we do not need to know the true labels to alert on these feature drifts; we can track them easily using the independent distribution of each feature. Feature drift may require a new model with an updated, representative training set, unless the drift comes from data integration or quality issues. Classic examples of data drift are:

  • Data pattern changes due to seasonality — for example, a specific sales season
  • Customer pattern changes — for example, the pandemic-driven increase in the use of mobile apps for banking

The following diagrams [1] help to visualize concept vs virtual model drift. As they show, only concept drift makes the previous decision model obsolete, and that is ultimately what we want to identify.

Concept vs Virtual Drift

In practice, concept drift identification requires feedback (newly labeled data) to confirm the performance degradation, so we can call this “reactive” monitoring of our models.

When virtual drift appears, it may not change the concept, but it is an early indicator (“proactive” monitoring) of potential model drift.

Once we have identified and explained the drift types, applying them to the data that models use lets us name, calculate, and monitor the following:

  • Performance Drift (Predicted Data vs Actual Data)
  • Feature (Input Data) Drift
  • Prediction (Label Data) Drift

Performance Drift (Predicted Data vs Actual Data)

The simplest form of model drift monitoring is to regularly compare the predicted results to the ground truth. To do this, you need access to the actual results, which is what we called “reactive” monitoring above. This information is not always available promptly (email campaigns, churn, etc.), and sometimes not at all. When it is, the problem becomes an evaluation problem: the performance metrics chosen by the data scientist are calculated on the new ground truth datasets and can be compared.

To monitor model quality, data scientists need to identify the metrics they want to track in the evaluation stage of the lifecycle, before deployment, so that after deployment the same evaluation metrics can be recomputed and compared periodically whenever ground truth becomes available. For alerting, they need to define thresholds for each metric, or the monitoring system can determine whether the trend of a metric over time requires attention.
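As a minimal, tool-agnostic sketch of such a threshold check (the metric names, baseline values, and allowed drop below are hypothetical, not ModelOps defaults):

# Baseline metrics captured at evaluation time vs metrics recomputed on new ground truth
baseline_metrics = {"Accuracy": 0.91, "Recall": 0.88}
current_metrics = {"Accuracy": 0.84, "Recall": 0.86}
max_drop = 0.05  # maximum allowed absolute degradation per metric

# Collect the metrics that degraded beyond the threshold
alerts = {name: round(baseline_metrics[name] - value, 3)
          for name, value in current_metrics.items()
          if baseline_metrics[name] - value > max_drop}
if alerts:
    print(f"Performance drift detected: {alerts}")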

Model quality metrics monitoring

In Teradata Vantage ModelOps, identifying these metrics is straightforward using the evaluation template: data scientists only need to name each metric in a JSON format. Note that there is no limit here; users can use any custom or library-generated metric.

In this example, we use an XGBoost classification model (through its scikit-learn API) and metrics from scikit-learn:

from sklearn import metrics

# y_test: ground truth labels, y_pred: model predictions
evaluation = {
    'Accuracy': '{:.2f}'.format(metrics.accuracy_score(y_test, y_pred)),
    'Recall': '{:.2f}'.format(metrics.recall_score(y_test, y_pred)),
    'Precision': '{:.2f}'.format(metrics.precision_score(y_test, y_pred)),
    'f1-score': '{:.2f}'.format(metrics.f1_score(y_test, y_pred))
}
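These metrics can then be persisted as a JSON artifact so they can be compared across evaluations; a minimal sketch (in ModelOps the actual output location is determined by the evaluation template and SDK context, the file name below is illustrative):

import json

# Hypothetical sketch: write the evaluation metrics to a JSON file for later comparison
with open("evaluation.json", "w") as f:
    json.dump(evaluation, f, indent=2)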

In the absence of timely ground truth, drift in the prediction and feature distributions is often indicative of meaningful changes affecting the model (virtual drift may signal concept drift). There are also cases, such as data integration issues, where we see virtual drift but no concept drift.

Feature (Input Data) Drift

Data drift monitoring is based on understanding and tracking changes between the statistics of the dataset the model was trained on and the statistics of the data the model is currently scoring. As noted already, data is expected to evolve over time! Therefore, the monitoring needs to capture this evolution and know when the data has evolved past a certain “divergence” threshold or has simply changed completely. The following diagram shows some of the types of data drift that are possible over time.

Patterns of changes over time

The process of analyzing this can be offline, where we simply compute the statistics of the datasets fed to the model at some scheduled interval, or online, where the statistics are computed on the fly as the data is fed into the model. Note that we only need to record distribution statistics (histograms and frequencies) about the dataset features; we do not need access to, or to keep, every data point. In many ways, this makes the online capture and storage of dataset statistics more attractive, as it allows us to decouple the dataset storage systems from the actual monitoring.
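As a minimal, tool-agnostic sketch of this idea (the DataFrame and column name are hypothetical), recording only per-feature histogram counts rather than raw rows might look like this:

import numpy as np
import pandas as pd

def capture_feature_histograms(df: pd.DataFrame, bins: int = 10) -> dict:
    """Record per-feature distribution statistics (bin edges and counts)
    instead of keeping the raw data points."""
    stats = {}
    for column in df.select_dtypes(include=np.number).columns:
        counts, edges = np.histogram(df[column].dropna(), bins=bins)
        stats[column] = {"edges": edges.tolist(), "counts": counts.tolist()}
    return stats

# Example: a scheduled (offline) run over the latest scoring dataset
scoring_df = pd.DataFrame({"tx_amount": np.random.default_rng(0).gamma(2.0, 50.0, 1000)})
scoring_stats = capture_feature_histograms(scoring_df)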

An additional benefit of capturing only dataset statistics is that we do not need to worry about personally identifiable information, which is very helpful when the data crosses several different systems or there is a lag before it is unified. Instead of blindly comparing the column statistics of the training dataset vs the current one, we can zoom in on important changes by analyzing the column statistics relative to the importance of the given column. This combination of feature importance with data drift is very powerful.
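To illustrate that combination (a sketch with hypothetical per-feature divergence scores and importances, not the ModelOps implementation), a drift score can be weighted by feature importance before deciding what to review first:

# Hypothetical per-feature divergence scores (e.g. PSI) and feature importances
divergence = {"tx_amount": 0.28, "age": 0.05, "n_logins": 0.31}
importance = {"tx_amount": 0.55, "age": 0.35, "n_logins": 0.10}

# Weight drift by how much the model actually relies on each feature
weighted_drift = {f: divergence[f] * importance[f] for f in divergence}
to_review = sorted(weighted_drift, key=weighted_drift.get, reverse=True)
print(to_review)  # most drifted among the most important features first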

Prediction (Label Data) Drift

Prediction drift is specific to monitoring the statistics of the model output. While we could include this in the data drift definition, I think it deserves its own treatment, as we cannot use things like feature importance to make additional decisions here. What we can infer, though, is whether the model’s predictions have suddenly deviated from what it “normally” predicts. A simple example for a regression model is that the mean of the predicted values has deviated from the mean over some previous time interval. A classification example might be that the frequency of predicted classes suddenly changes relative to its historical values.
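A minimal sketch of both checks (the predictions, baseline values, and tolerances below are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical recent model outputs
current_predictions = np.array([131.2, 140.5, 138.9, 129.7])   # regression scores
current_labels = ["no_churn", "no_churn", "churn", "no_churn"]  # classification labels

# Regression: has the mean prediction moved away from a baseline window?
baseline_mean, tolerance = 125.0, 10.0
mean_shifted = abs(current_predictions.mean() - baseline_mean) > tolerance

# Classification: have the predicted class frequencies changed vs their history?
baseline_freq = pd.Series({"churn": 0.12, "no_churn": 0.88})
current_freq = pd.Series(current_labels).value_counts(normalize=True)
max_shift = (current_freq.reindex(baseline_freq.index, fill_value=0.0) - baseline_freq).abs().max()
class_mix_shifted = max_shift > 0.05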

To monitor both feature and prediction statistics and distributions, Teradata Vantage ModelOps requires identifying the features and the target before training, or, if the model is trained externally and imported, identifying the data that was used for model training. For this, we provide simple mechanisms to data scientists in our UI.

Features and target are identified in the ModelOps UI using ModelOps datasets, by defining a SQL query that selects the features we want to use in the model:

Features identification from SQL query in Teradata Vantage ModelOps

After identifying the features, we need to gather their importance for the model. This happens in our training code template, or, if the model was trained externally, the importances can be entered manually in the ModelOps UI.

In this example, a data scientist is using the XGBoost feature importance, but note that any other library can be used to extract importance (SHAP, LIME, etc.):

# Weight-based importance from the trained XGBoost booster inside the pipeline
feature_importance = model["xgb"].get_booster().get_score(importance_type="weight")

The user needs to compute and record these statistics. Note that this is done at training time and then in every evaluation and scoring run. For BYOM models this is not needed, as ModelOps automates the monitoring. For Git-based models, the ModelOps SDK provides these simple calls to record training, evaluation, and scoring statistics:

# The record_*_stats functions are provided by the ModelOps SDK
# and are imported in the code templates.

# training
record_training_stats(train_df,
                      features=feature_names,
                      targets=[target_name],
                      categorical=[target_name],
                      importance=feature_importance,
                      context=context)

# evaluation
record_evaluation_stats(features_df=test_df,
                        predicted_df=predicted_df,
                        importance=feature_importance,
                        context=context)

# scoring
record_scoring_stats(features_df=features_tdf,
                     predicted_df=predictions_df,
                     context=context)

With this done, Teradata Vantage ModelOps will compute and show the distribution and statistics of each model feature after training, and will then show in the model drift panel the overlay of training vs evaluation/scoring statistics.

Features and Target distribution and other statistics computed in Teradata Vantage ModelOps

The dataset statistics display the following measures for the selected feature.

  • count (cnt)
  • minimum (min)
  • maximum (max)
  • mean
  • standard deviation (std)
  • skewness (skew)
  • kurtosis (kurt)
  • standard error (ste)
  • coefficient of variance (cv)
  • variance (var)
  • sum
  • uncorrected sum of squares (uss)
  • corrected sum of squares (css)
  • missing values (nulls)

How do we calculate and monitor this drift information?

Proper ModelOps capabilities involve robust alerting and monitoring methods so organizations can take action (preferably pre-emptive) to reduce business impact. Ideally, this monitoring, alerting, and action-taking is automated.

To calculate and monitor the data drift in the feature and/or target statistics, we can follow a statistical or a model-based approach to determine whether the differences between training and scoring data are enough to raise alerts.

The statistical approach uses metrics that are relatively simple to compute, and data scientists are used to dealing with them. Here are the most commonly used (a short computation sketch follows the list):

  • Population Stability Index (PSI) is a measure of population stability between two continuous population samples (training and evaluation/scoring).
  • Kullback–Leibler divergence (or KL divergence, also called relative entropy and I-divergence) is a type of statistical distance measuring how one probability distribution differs from a reference probability distribution.
  • Kolmogorov-Smirnov test (or KS test) is a nonparametric test of the equality of one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
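As a minimal sketch of how these could be computed for a single feature from training and scoring samples (the data, bin setup, and threshold below are illustrative; ModelOps computes these automatically from the recorded statistics):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_values = rng.normal(0.0, 1.0, 5000)   # feature values at training time
score_values = rng.normal(0.3, 1.1, 5000)   # feature values at scoring time (shifted)

# Shared bins so the two histograms are comparable; out-of-range scoring values
# are clipped into the edge bins
edges = np.histogram_bin_edges(train_values, bins=10)
p = np.histogram(train_values, bins=edges)[0] / len(train_values)
q = np.histogram(np.clip(score_values, edges[0], edges[-1]), bins=edges)[0] / len(score_values)
p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)   # avoid log(0) and division by zero

psi = float(np.sum((q - p) * np.log(q / p)))                      # Population Stability Index
kl = float(stats.entropy(q, p))                                   # KL divergence D(q || p)
ks_stat, ks_pvalue = stats.ks_2samp(train_values, score_values)   # two-sample KS test

print(f"PSI={psi:.3f}  KL={kl:.3f}  KS={ks_stat:.3f} (p={ks_pvalue:.3g})")
# A common rule of thumb is to treat PSI > 0.2 as a significant shift worth alerting on.

Note that PSI and KL operate on the binned distributions, which is exactly the histogram information recorded earlier, while the two-sample KS test here uses the raw samples.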

With Teradata Vantage ModelOps, once we follow the lifecycle steps to identify the datasets the model uses, we provide automated, proactive data drift (features and target) monitoring every time a model is scored.

Vantage ModelOps uses the stored independent feature distributions and their importance to calculate the three statistical methods explained above (PSI, KL, and KS), generating alerts based on the values most commonly used to identify sufficient divergence. These alert thresholds can be customized, but the defaults work well for most use cases without any additional configuration or coding.

Feature drift monitoring in Teradata Vantage ModelOps

Vantage ModelOps uses Application Performance Monitoring (APM) to record and monitor the above scenarios. We capture these metrics across batch, streaming, and RESTful applications supporting Python and SQL languages.

Vantage ModelOps uses the Prometheus alert manager as an APM to provide alerting support in the UI. Prometheus is an easy-to-use yet powerful tool that allows very sophisticated alerting rules while keeping the rule complexity encapsulated from the end user. It also enables integration with other metrics systems from on-premises or cloud providers if needed.

Feature drift alert in Teradata Vantage ModelOps

In conclusion, monitoring AI/ML models is a critical model operations capability and key to the success of an organization's AI/ML initiatives. Monitoring starts before models are deployed and continues after deployment until the model is eventually replaced or retired. Robust alerting and monitoring methods that track performance and data drift are key to systematically flagging any divergence or suspicious behavior and taking action to adapt production models to a new context.

References

  • [1] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A Survey on Concept Drift Adaptation,” ACM Computing Surveys, 2014. https://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf

  • Teradata Vantage ModelOps user guide
