Monitoring Model Drift with Python

Jeanine Schoonemann
Cmotions
Apr 16, 2022

A practical guide to using the popmon package

[Image: part of the output of popmon]

Introduction

Our clients often ask us what the expiration date of a predictive model is. Unfortunately, this is not something where you can apply a general rule of thumb. The expiration date of a model completely depends on the changes in the world around the model, a.k.a. the model drift, which can be split into concept drift and data drift, as my colleague Jurriaan Nagelkerke explains in detail in this very interesting article. There you can not only learn about model drift, concept drift and data drift, but also why (automatic) retraining isn't always the best solution and, even better, what you should do to keep your models alive and kicking! I highly recommend reading that article before this one. Here I assume you already know about model drift and are looking for a practical example of how to implement monitoring of models in production.

To be able to monitor model drift, ING has created their own Python package: popmon. We love (their work on) this package, but it did take us a little while to comprehend how to use it in a way that would help us get a sense of our model drift. That's why we decided to help all of you popmon newbies who are also interested in monitoring model drift. The beauty of this package lies in the fact that ING definitely applied the KISS (keep it simple, stupid) principle when developing it, meaning we only need a couple of smart functions to reach our goals. Unfortunately for us Data Scientists, we always want to grasp exactly what we are doing, so this also meant a deep dive into the package and into what these functions do. So, let us take you on a (short and simple) ride through popmon.

Our use case

Since there are multiple use cases for popmon, we want to start by describing the use case we will focus on in this article: a Data Scientist/Analyst created a predictive model, which will be taken into production, and therefore model drift monitoring needs to be set up for this model.

To be more precise: we want to keep track of the data that is used to score our model and constantly check whether this data is not too different from the data the model was trained on. This is a very common use case, although this monitoring is often not set up.

Installation of the package

Before we can really start, we need to install popmon, which is as simple as:

!pip install popmon

Initialize the notebook

# import packages
import pandas as pd
import pickle
import popmon
from popmon import resources
# make sure all output in each cell shows, without explicitly printing it
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# adjust default display settings of pandas dataframes
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

The data

We use the example dataset that ships with popmon. It contains multiple features; for the sake of simplicity, we only keep a couple of them and pretend we used these features to predict a column with the name isActive. Since this is the dependent variable, which we can't monitor in production because it is unknown there, we've removed it from our training set as well.

Import our example data

# load the example dataset that ships with popmon
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
# keep only the features we (pretend to) have used in our model
df = df[["age", "balance", "eyeColor", "favoriteFruit", "gender", "latitude", "transaction", "currency", "date"]]
# convert the date to a monthly period, since we will monitor per month
df["date"] = df["date"].dt.to_period("M")
df.head()
# plot the number of observations per month
_ = df['date'].value_counts().sort_index().plot(kind='bar')

Create the different batches we would have when this model is in production

Because we're using a single example dataset for now, we need to split it into a couple of different batches to recreate our example of taking a model into production and scoring new data regularly:

  • our training data
  • the first batch of scoring data when the model is in production
  • the second batch of scoring data when the model is in production
  • the third batch of scoring data when the model is in production

Let’s say we want to check our model scoring data for changes on a monthly basis.

# our training data consists of the first 8 months of 2015
df_train = df[df.date < "2015-09"]
# batch_1 would be the first batch to be fed to the model after this has been put into production
batch_1 = df[df.date == "2015-09"]
# batch_2 would be the second batch to be fed to the model after this has been put into production
# we won't use this dataset in this example, but you can use it when playing around with this notebook yourself
batch_2 = df[df.date == "2015-10"]
# batch_3 would be the third batch to be fed to the model after this has been put into production
# we won't use this dataset in this example, but you can use it when playing around with this notebook yourself
batch_3 = df[df.date == "2015-11"]

Initializing the monitoring

We will use the training data to teach popmon what our data looks like, already dividing the data into the different months, so popmon knows what it should expect from a new month. This function uses another package built by ING: histogrammar. A key strength of popmon (using histogrammar) is that everything needed for monitoring each feature is saved in a histogram object. Histograms are small in terms of storage and safe in terms of privacy risks, as no specific values need to be stored. Also, histograms can easily be compared to other histogram objects, resulting in fast processing once the histograms are created. These histograms are used to calculate all the metrics used to create the alerts. Popmon generates a whole load of metrics; we've given a short overview and explanation of all of them in this article.

# create the histograms per feature, split by month (the time axis)
hists = df_train.pm_make_histograms(time_axis="date")
hists

We can get the binning specifications from the histogrammar objects; this way we can make sure that the histograms created on a new batch will be binned in exactly the same way.

bin_specs = popmon.get_bin_specs(hists)
bin_specs

Now we're ready to compare new periods to the reference (train) period. This comparison is done using the histograms we've created; the output consists of other metrics calculated from the histogram metrics, so you won't see any histograms in the next piece of code. Popmon provides us with a lot of useful output and will guide us in evaluating whether new data is similar (enough) to the reference data. We will use that information without looking at the histograms.

Before we look at more popmon output, it’s good to briefly introduce some key concepts of popmon. For more detail, we refer to the package documentation, but these concepts are essential to understand:

  • profiles: per batch (time period), popmon calculates several profile statistics for each feature, like the mean, min, max, number of missings and distinct value count.
  • comparisons: each new batch is compared to the specified reference (see the next concept) and several comparison statistics are calculated, mainly significance tests (chi-square, Kolmogorov-Smirnov) to evaluate the difference between the histogram distribution of the new batch and that of the reference.
  • references: what to compare with? popmon offers four options: reference data (a specified external source, such as the training data), a rolling window (over preceding time periods), prev1 (just the preceding time period) and expanding (all preceding time periods); see the sketch after this list.
  • alerts: traffic light style alerting based on the profiles and comparisons, to only warn when something important (enough) is different in the data compared to the reference. With customizable thresholds, green, yellow and red traffic lights are defined for each feature and each metric.
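
To make the reference options concrete, here is a minimal sketch of our own showing how they can be selected when generating a report straight from a DataFrame. The parameter names (reference_type, reference, window) follow the popmon documentation, but double-check them against the version you have installed:

# external: compare each period against a fixed reference, e.g. the training data
report = batch_1.pm_stability_report(time_axis="date", reference_type="external", reference=df_train)
# rolling: compare each period against a window of preceding periods
# (a window of 1 gives the prev1 behaviour: compare to the previous period only)
report = df_train.pm_stability_report(time_axis="date", reference_type="rolling", window=3)
# expanding: compare each period against all preceding periods
report = df_train.pm_stability_report(time_axis="date", reference_type="expanding")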

We won't go into detail for now, but it might be useful to know that it is possible to change the monitoring rules popmon uses to create the alerts. We can change the rules for a single metric (which is then applied to all features) or for a (group of) feature(s). Below you can find a code example of how to change the monitoring rules, where we for instance set the boundaries for age to 18 and 100 for a "yellow" alert and to 0 and 120 for a "red" alert, meaning that if a value of the feature age falls outside these boundaries, an alert is triggered. Popmon uses very intelligent and useful monitoring rules by default, so most of the time it probably won't be necessary to change this at all.

monitoring_rules = {
    "*_pull": [9, 3, -3, -9],
    "*_zscore": [7, 4, -4, -7],
    "[!p]*_unknown_labels": [0.5, 0.5, 0, 0],
    "age:min": [120, 100, 18, 0],
    "age:max": [120, 100, 18, 0],
}

Oh, and one more thing, because there is one term you will often see in the output that might not immediately be clear to you: the pull. Pull refers to 'the normalized residual of a value of interest with respect to the selected reference'. Not sure if that made things clearer yet… In our own words, the pull is the standardized version of each statistic, so that all statistics are on the same scale and we can use the same thresholds for different statistics. The pull indicates to what extent a value of a statistic should be interpreted as an actual difference in the new data compared to the reference data. It's calculated for every profile of every feature like this:

pull = (value − reference mean) / reference standard deviation
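
As a quick illustration with made-up numbers (not from the example dataset): suppose the mean age across the reference periods is 40 with a standard deviation of 5, and a new batch has a mean age of 48.

# toy example with made-up numbers: the pull of the mean of age in a new batch
reference_mean = 40.0  # mean age across the reference (training) periods
reference_std = 5.0    # standard deviation of that mean across the reference periods
batch_value = 48.0     # mean age in the new batch

pull = (batch_value - reference_mean) / reference_std
print(pull)  # 1.6: well within the [-3, 3] green range of the *_pull rule above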

Ready to calculate the monitoring metrics and alerts

To illustrate the type of objects popmon creates, we start with calculating all the metrics (and alerts) on the training data. Do note that this is not necessary when we only want to see if new scoring data is similar (enough) to the training data to be used for scoring.

The training data contains multiple periods/batches of data (the time_axis parameter we used to make the histograms). The datastore object we're creating here is essentially a dictionary holding all calculated metrics and, more importantly, the alerts that were generated based on these metrics. It is recommended to explore this dictionary to see what you can expect and, even better, to decide how you want to use this information to create your own alerts. You can use these values to trigger the messages (for instance via Slack or email) you want to receive, making this completely and easily adjustable to the preferred way of working in your company; see the sketch after the exploration code below.

# calculate the metrics based on all histograms and the monitoring rules we adjusted
# (when no adjustments were made you can omit this parameter, popmon will use its own defaults)
datastore = popmon.stability_metrics(hists=hists, monitoring_rules=monitoring_rules)

Since we’ve asked popmon to calculate stability_metrics on the set of training period histograms without specifying a reference, it will check whether all periods within the training period have similar values and distributions, as compared to all other periods within the training period.
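
If you prefer to be explicit, this should be equivalent to spelling out the default reference type yourself (per the popmon documentation the default is the self reference, but verify this against your installed version):

# equivalent to the call above, with the (default) self reference made explicit
datastore = popmon.stability_metrics(hists=hists, monitoring_rules=monitoring_rules, reference_type="self")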

# let's look at what content can be found in the datastore
datastore.keys()
# you can access the datastore like you would any other dictionary
# we will just show you one random example here
datastore['comparisons']['age']
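
As an example of rolling your own alerting on top of the datastore, here is a minimal sketch of our own (not popmon functionality). It assumes the traffic_lights section described later in this article, where a red traffic light is encoded as 2:

# a sketch of our own (not popmon functionality): collect every feature that
# triggered at least one red traffic light (encoded as 2) in the datastore
def features_with_red_lights(datastore):
    red_features = []
    for feature, lights in datastore["traffic_lights"].items():
        # lights is a DataFrame with one traffic light value per metric and period
        if (lights == 2).any().any():
            red_features.append(feature)
    return red_features

# hook this up to your preferred channel (Slack, email, ...) instead of print
for feature in features_with_red_lights(datastore):
    print(f"red traffic light for feature: {feature}")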

Depending on the way your company prefers to work, you can store these objects as (pickle) files, or write them to a database. This way you can retrieve and expand them every time you use the model on a new batch of data.

# save the objects as pickles (we could also create JSONs and/or store them in a database)
pickle.dump(hists, open("all_hist.pkl", "wb"))
pickle.dump(monitoring_rules, open("monitoring_rules.pkl", "wb"))
pickle.dump(datastore, open("datastore.pkl", "wb"))
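
When the next batch arrives, for instance in a scheduled scoring job, loading these objects again is just the reverse:

# load the stored objects again when a new batch of data comes in
hists = pickle.load(open("all_hist.pkl", "rb"))
monitoring_rules = pickle.load(open("monitoring_rules.pkl", "rb"))
datastore = pickle.load(open("datastore.pkl", "rb"))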

Now let’s try to use the monitoring on our first new batch of data

Earlier, we prepared the scoring batches. Using those batches of data, we are ready to check for drift relative to the histograms we made from the training data. So let's start doing that now.

Create the histograms of the new batch, using the same bin specifications

# start by generating the histograms on the new dataset, using the same bin specifications to make the histograms comparable
new_hists = batch_1.pm_make_histograms(time_axis="date", bin_specs=bin_specs)
new_hists

Now that we have histogram objects for the new scoring batch, and we had already created those for the training data, we can calculate the metrics of the new batch using our training data as the reference data.

batch_datastore = popmon.stability_metrics(hists=new_hists, monitoring_rules=monitoring_rules, reference_type='external', reference=hists)

To get a deeper understanding of what we end up with now, let's have a closer look at four objects in the batch_datastore we just created:

  • profiles We introduced these earlier on: this contains a lot of statistics on the distribution of the features in the new batch: the number of values, mean, min, max, etc. And since the batch has multiple observations, it also contains information on the deviation of each of these statistics: the standard deviation of the mean, of the number of values, of the min, … And finally, it contains a comparison of these statistics with those in the reference data, resulting in all the _pull metrics in the profiles. These are used to specify which profile statistics seem really different from the reference data and get a yellow or red traffic light.
  • comparisons We also introduced these earlier: this contains actual comparisons between the distributions in the new batch and the reference data. It holds many test statistics, like Pearson, chi-square and Kolmogorov-Smirnov, and the statistical significance of their values. For alerting, _pull metrics are added to the comparisons as well, to specify which differences are big enough to get a yellow or red traffic light.
  • alerts This object summarizes and guides us if there are significant issues with the new data. The monitoring_rules we specified earlier are used to evaluate whether thresholds for pull, z-scores or custom threshold values, like the minimum or maximum age, are exceeded.
  • traffic lights These indicate which of the pull, z-score or custom thresholds for feature statistics triggered the traffic light. In this overview of all values considered for alerting, a 0 indicates no alert (a 'green' traffic light), a 1 indicates a yellow traffic light and a 2 is most severe: a red traffic light.

Let’s explore the profiles, comparisons and alerts now for one feature, age:

# let's check out the same element in the datastore as we saw before
batch_datastore['profiles']['age']
# let's check out the same element in the datastore as we saw before
batch_datastore['comparisons']['age']
# let's look at the alerting summary: age has 2 yellow traffic lights
# since green is coded as 0, yellow as 1 and red as 2, the worst value is 1, hence yellow
batch_datastore['alerts']['age']
# the traffic_lights object tells us what triggered the yellow traffic lights:
# - the min value was in the 0-18 yellow range we specified ourselves
# - the pull value of the standard deviation indicates a difference in spread in the new data
batch_datastore['traffic_lights']['age']

In the same way, we can create the monitoring report to get more insightful information

Since the datastore can be a bit overwhelming, you can also create a report showing all the information in a visual way, making it possible to get a better overview and to click around the features to get a sense of all the possibilities.

batch_monitoring_report = popmon.stability_report(hists=new_hists, monitoring_rules=monitoring_rules, reference_type='external', reference=hists)
batch_monitoring_report
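
The report renders directly in the notebook, but according to the popmon documentation you can also write it to a standalone HTML file, which is handy for sharing with colleagues who don't work in notebooks:

# write the report to a standalone HTML file
batch_monitoring_report.to_file("batch_monitoring_report.html")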

Next steps

We know we've only scratched the surface of all the beautiful and smart things popmon has to offer, but this was just meant to give you a quick start in using the package. The possible next steps, as we see them, would be:

  • finding out which metrics are used and what they do/mean, which we explained here
  • finding out if the default settings of these metrics apply to your model

This last step is a very difficult one; in our experience we haven't found a reason yet to deviate from the default values used by popmon, other than setting fixed minimum and maximum values for a feature like age. But please let us know if you do find one, and why!

We hope this article helped you get started with using popmon for monitoring model drift. If you are looking for more example notebooks, ING created some really nice ones as well.

You can also check out our other two articles in this series.

Good luck, and more importantly, have fun!
