Monitoring Feature Drift in Models

Methods to keep track of changes (drift) in data after model deployment

Seungjun (Josh) Kim
Geek Culture
Dec 20, 2021



What is Model Drift?

You are done building a fantastic model that performs well even on various validation sets. You are ready to deploy the model on real-world data. After deployment, however, you discover that the performance of the model is very poor. What happened? What could have gone wrong?

Model deployment gives rise to numerous issues that were unanticipated while the model was being built. One such issue is “drift”. Simply put, drift means a change in certain components of a model (e.g. concept, predictions, labels of the target variable, features) over time. It can be due to various factors, ranging from changes in the business’s Key Performance Indicators to discrepancies between the feature distributions of the baseline data and the inference data. Here, baseline data means the data used to build the model (e.g. training data), while inference data is the real-world data fed into the model after deployment.

There are several types of drift, including concept, prediction, feature, and label drift. These types are well explained in this article [1]. In this post, I focus on “feature drift”: changes in the distributions of features.

Ways to Detect Feature Drift

Then, what are some methodologies that enable us to monitor feature drift?

1. KS (Kolmogorov–Smirnov) Test

It is a non-parametric test of the equality of continuous probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). In essence, the test answers the question “What is the probability that this collection of samples could have been drawn from that probability distribution?” [2] Non-parametric, in this context, means the test does not assume the data follow any particular distribution.

You can perform either a one-sample test or a two-sample test. In the one-sample setting, the null and alternative hypotheses are:

Null Hypothesis: The samples come from the given distribution P.

Alternative Hypothesis: The samples do not come from the given distribution P.

In the two-sample setting, you compare the underlying distributions of two independent samples.

Both one-sample and two-sample tests can be performed with the Python SciPy library’s stats module [3].
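Here is a minimal sketch using scipy.stats.kstest and scipy.stats.ks_2samp; the data is simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# One-sample test: could this sample have been drawn from a standard normal?
sample = rng.uniform(low=-1, high=1, size=500)
statistic, p_value = stats.kstest(sample, "norm")
print(f"one-sample KS: statistic={statistic:.4f}, p-value={p_value:.4f}")

# Two-sample test: were these two samples drawn from the same distribution?
baseline = rng.normal(loc=0.0, scale=1.0, size=500)
inference = rng.normal(loc=0.0, scale=1.0, size=500)
statistic, p_value = stats.ks_2samp(baseline, inference)
print(f"two-sample KS: statistic={statistic:.4f}, p-value={p_value:.4f}")
```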

In the first example above (the one-sample test), the p-value is smaller than 0.05 (assuming we set the significance threshold at 0.05), so we reject the null hypothesis: we have sufficient evidence to say that the sample does not come from a normal distribution. In the second example (the two-sample test), the p-value is greater than 0.05, so we do not have sufficient evidence to conclude that the two samples were drawn from different distributions.

2. KL (Kullback–Leibler) Divergence Test

Divergence can be understood as a distance metric that quantifies the difference between two probability distributions. But it is not quite the same as other distance metrics because it is not symmetric: the divergence of P from Q, written KL(P || Q), generally gives a different score from the divergence of Q from P, KL(Q || P). KL divergence is also a key component of Gaussian Mixture Models and t-SNE. Divergence is usually denoted with the || operator.

The formula for the KL test can be written in two ways depending on whether the variable is a continuous or discrete random variable.

For a continuous random variable:

KL(P || Q) = ∫ p(x) * log( p(x) / q(x) ) dx

For a discrete random variable:

KL(P || Q) = Σ p(x) * log( p(x) / q(x) ), summing over all values x

It is pretty straightforward to implement in Python.
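Here is a minimal sketch for discrete distributions; the two distributions are made up, and the base-2 logarithm is used so that the Jensen-Shannon scores in the next section fall between 0 and 1:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(kl_divergence(p, q))  # KL(P || Q)
print(kl_divergence(q, p))  # a different score: KL divergence is not symmetric
```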

3. Jensen-Shannon Divergence Test

This test is another way to quantify the difference (or similarity) between two probability distributions. It is also known as the information radius (IRad) or the total divergence to the average. It can be understood as an extension of the KL divergence test from the previous section.

It uses the KL divergence to calculate a normalized score that is symmetric: the divergence of P from Q is the same as that of Q from P. It also always has a finite value. It is often the more useful measure because it provides a smoothed and normalized version of KL divergence, with scores between 0 (identical) and 1 (maximally different) when using the base-2 logarithm. [4]

The formula is as follows:

JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

where M = 0.5 * (P + Q)

Using the kl_divergence function already defined in the previous section, we can implement the JS divergence test as follows:
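Here is a sketch, reusing kl_divergence and the made-up distributions p and q from the previous section:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence JS(P || Q): symmetric and always finite."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # M is the element-wise average of the two distributions
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(js_divergence(p, q))  # between 0 and 1 with the base-2 logarithm
print(js_divergence(q, p))  # identical to the line above: JS is symmetric
```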

4. Chi-square Goodness of Fit Test

Remember that this test is different from the Chi-Square Test of Independence (often performed with SciPy’s scipy.stats.chi2_contingency function). The Test of Independence checks whether two categorical variables are independent of each other, whereas the Goodness of Fit Test checks whether a single categorical variable follows a hypothesized distribution. They are two separate tests!

The null and alternative hypotheses of the Chi-Square Goodness of Fit Test are:

Null Hypothesis: A variable follows a hypothesized distribution.

Alternative Hypothesis: A variable does not follow a hypothesized distribution.

This tutorial [5] explains how to perform this test.
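Following that approach, here is a minimal sketch with scipy.stats.chisquare; the category counts are hypothetical, and note that the observed and expected counts must sum to the same total:

```python
from scipy import stats

# Observed category counts in the inference data vs. counts expected
# under the baseline (training) distribution; both sum to 295.
observed = [50, 110, 90, 45]
expected = [60, 100, 80, 55]

statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square: statistic={statistic:.4f}, p-value={p_value:.4f}")
# A p-value below the significance threshold (e.g. 0.05) suggests the
# variable no longer follows the hypothesized (baseline) distribution.
```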

5. Population Stability Index

Population Stability Index, PSI for short, is a popular metric in the field of finance that measures how stable a population’s distribution is between two samples (e.g. baseline versus inference data).

The formula is as follows:

PSI = Σ (actual% − expected%) * ln( actual% / expected% )

where the sum runs over the bins of the feature, expected% is the share of baseline observations falling in a bin, and actual% is the share of inference observations in the same bin.

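A minimal Python implementation might look like the sketch below; the equal-width binning derived from the baseline data and the small clipping constant (to avoid dividing by, or taking the log of, zero) are implementation assumptions:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample ('expected')
    and an inference sample ('actual')."""
    # Derive bin edges from the baseline data; inference values that fall
    # outside these edges are simply not counted in any bin.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty bins to avoid division by zero / log of zero
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)
inference = rng.normal(loc=0.2, scale=1.0, size=10_000)
print(psi(baseline, inference))
```

A common rule of thumb in credit scoring treats a PSI below 0.1 as no significant shift, 0.1 to 0.2 as a moderate shift, and above 0.2 as a significant shift.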

6. Observing Statistical Features

We can also compare statistical features of the baseline and inference data to detect feature drift. Statistical features include, but are not limited to:

  • Mean, Max, Min, Median, Mode
  • Missing Value Frequencies
  • Number of Distinct Values in Categorical Variables
  • The Set of Distinct Values a Categorical Feature Takes
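As a quick hypothetical sketch, such summaries can be computed side by side with pandas (the tiny data frames are made up):

```python
import pandas as pd

def feature_summary(series: pd.Series) -> pd.Series:
    """Summary statistics worth comparing between baseline and inference data."""
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "max": series.max(),
        "missing_rate": series.isna().mean(),
        "n_distinct": series.nunique(),
    })

baseline_df = pd.DataFrame({"age": [23, 45, 31, None, 52]})
inference_df = pd.DataFrame({"age": [61, 58, None, None, 70]})

comparison = pd.concat(
    {"baseline": feature_summary(baseline_df["age"]),
     "inference": feature_summary(inference_df["age"])},
    axis=1,
)
print(comparison)
```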

7. Special Algorithms

There are algorithms developed specifically to detect drift; Adaptive Windowing (ADWIN) and the Page-Hinkley method are two examples.

The Adaptive Windowing (ADWIN) algorithm uses a sliding-window approach: it maintains a window over the new data fed into the deployed model and checks it for changes. When two consecutive sub-windows display a discrepancy (e.g. significantly different means), the older sub-window is dropped. The user can define a confidence threshold that controls how readily drift is flagged.

There are multiple libraries that implement the ADWIN algorithm: the skmultiflow library’s drift_detection module is one, and the river library’s drift module is another.
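Here is a sketch using skmultiflow’s ADWIN detector; the simulated stream and the delta value are assumptions for illustration (river exposes a very similar detector):

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

rng = np.random.default_rng(1)
# Simulated stream whose mean jumps from 0 to 0.5 halfway through
stream = np.concatenate([rng.normal(0.0, 0.1, 1000), rng.normal(0.5, 0.1, 1000)])

adwin = ADWIN(delta=0.002)  # delta: confidence parameter for detection
for i, value in enumerate(stream):
    adwin.add_element(value)
    if adwin.detected_change():
        print(f"change detected at index {i}")
```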

The Page-Hinkley method keeps a running mean of the observed values, updating it as new data arrives, and accumulates how far the observations deviate from that mean. Drift is flagged when this cumulative deviation exceeds a threshold value defined by the user. This algorithm is implemented in the river library’s drift module as well.
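A similar sketch with skmultiflow’s PageHinkley detector, on the same kind of simulated stream (the parameter values shown are the library defaults):

```python
import numpy as np
from skmultiflow.drift_detection import PageHinkley

rng = np.random.default_rng(2)
stream = np.concatenate([rng.normal(0.0, 0.1, 1000), rng.normal(0.5, 0.1, 1000)])

ph = PageHinkley(min_instances=30, delta=0.005, threshold=50)
for i, value in enumerate(stream):
    ph.add_element(value)
    if ph.detected_change():
        print(f"drift detected at index {i}")
```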

How to address drift?

Weight Data

We can weight multiple vintages of the data so that the weights are inversely proportional to the age of the data. This ensures that the most recent data has the most influence when the model is updated.
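Here is a hypothetical sketch with scikit-learn; the ages, the tiny data set, and the exact weighting scheme are made up, and most estimators accept such weights through the sample_weight argument of fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical ages (in days) of each training row; weights are inversely
# proportional to age, so the newest rows influence the fit the most.
age_in_days = np.array([1, 30, 120, 365, 700])
weights = 1.0 / (1.0 + age_in_days)

X = np.array([[0.2], [0.5], [0.1], [0.9], [0.4]])
y = np.array([0, 1, 0, 1, 0])
model = LogisticRegression().fit(X, y, sample_weight=weights)
```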

Online Learning

We can set up a scheduler (various schedulers are available in Python, and in pretty much any programming language) so that the model updates itself at every set interval. Even better, retraining can be triggered every time new data arrives. This keeps the model up to date with changes in the data distribution and is usually referred to as the “incremental learning” approach.
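Here is a sketch of the incremental approach using scikit-learn’s partial_fit; the batches are simulated, whereas in practice each one would come from the scheduler or the live data stream:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
model = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn versions

# First batch: the full set of classes must be declared up front
X_batch = rng.normal(size=(32, 4))
y_batch = rng.integers(0, 2, size=32)
model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

# Later batches, e.g. one per scheduler tick or per new data arrival
for _ in range(9):
    X_batch = rng.normal(size=(32, 4))
    y_batch = rng.integers(0, 2, size=32)
    model.partial_fit(X_batch, y_batch)
```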

Brute Force

We can also manually re-train the model every time we detect drift. However, this is highly inefficient and is usually not recommended.

References

[1] A. Paka, How to Detect Model Drift in MLOps Monitoring (2020), Towards Data Science

[2] Kolmogorov–Smirnov test, Wikipedia

[3] SciPy Stats kstest, SciPy Documentation

[4] Jensen–Shannon divergence, Wikipedia

[5] Zach, How to Perform a Chi-Square Goodness of Fit Test in Python (2020), Statology.org
