Data Drift

Sowmithra
Let’s Deploy Data.
3 min read · Jul 28, 2020

“Change is the only constant in life.” This also holds true for machine learning models: over time, the input data for a model may change. This phenomenon is referred to as data drift.

We may not always run our model in a static environment (i.e. on static data). If the data we predict on comes from a static environment (i.e. data similar to what we used to train the model), the model should not lose any of its performance. But what if our model lives in a dynamic, changing environment? This is where the problem of data drift arises.

Data drift can be troublesome for building a well-performing machine learning model. It causes degradation in the model’s performance as the input data drifts farther and farther from the data on which the model was trained. The features used to train a model are drawn from the input data, so when the statistical properties of that input data change, there is a downstream impact on the model’s quality.
For example, changes in the data due to seasonality, shifting personal preferences, emerging trends, etc. will cause the incoming data to drift.

Therefore, monitoring for and identifying data drift not only helps detect model performance issues but also lets us trigger the retraining process more often to avoid them.

CAUSES OF DATA DRIFT:

  1. Changes in the upstream process
  2. Data quality issues
  3. Natural data drift (since data is not static)
  4. Changes in the relationships between features

How can we predict these drifts?

Since the major issue is the dynamic behavior of the data, the best approach to predicting these drifts is to monitor the statistical properties of the data, the model’s predictions, and their correlation with other feature variables.
For example, you could deploy dashboards that plot these statistical properties and show how they change over time. You could also monitor the outcome of the predictions alongside other data, such as their correlation with the number of active users. For example, if the number of spammers increases or decreases at a rate very different from that of active users, something might be going on. Note that an issue like this doesn’t necessarily mean drift: other phenomena, like spam waves or seasonality changes (spammers celebrate holidays, too), could cause such variation in the data.
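As a minimal sketch of the dashboard idea (assuming incoming data arrives as a pandas DataFrame with a timestamp column; the file name and column names here are hypothetical), one could aggregate a feature’s statistics per day and plot them:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical incoming data: one row per prediction request, with a
# timestamp and a numeric feature whose distribution we want to watch.
df = pd.read_csv("incoming_data.csv", parse_dates=["timestamp"])

# Aggregate the feature's statistical properties per day.
daily_stats = (
    df.set_index("timestamp")["feature"]
      .resample("D")
      .agg(["mean", "std", "min", "max"])
)

# A sustained shift in these series over time is the kind of signal
# a drift dashboard is meant to surface.
daily_stats.plot(subplots=True, title="Daily statistics of 'feature'")
plt.show()
```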

Adaptive windowing (ADWIN) is an algorithm that detects data drift over a stream of data, and the scikit-multiflow package implements it. ADWIN works by keeping track of several statistical properties of the data within an adaptive window that automatically grows and shrinks. Below is a snippet adapted from the official documentation.
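(The stream values here are simulated, with a change in the underlying distribution injected halfway through so that ADWIN has something to detect.)

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN()

# Simulated stream: 1000 draws from one distribution, then 1000 from
# another, creating an abrupt change at the halfway point.
data_stream = np.concatenate(
    (np.random.randint(2, size=1000), np.random.randint(4, 8, size=1000))
)

# Feed the stream to ADWIN one element at a time; ADWIN maintains an
# adaptive window and flags a change when its statistics shift.
for i, value in enumerate(data_stream):
    adwin.add_element(value)
    if adwin.detected_change():
        print(f"Change detected in data: {value} - at index: {i}")
```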

Monitoring data drift:

Monitoring for data drift involves specifying both a baseline data set (i.e. the training data set) and a target data set (i.e. the model’s input data); comparing these two data sets over time reveals differences. A minimal sketch of such a comparison follows the list below.

  • Comparing input data vs. training data. This is a proxy for model accuracy: a growing difference between the input and training data is likely to result in a decrease in model accuracy.
  • Comparing different samples of time series data. In this case, you are checking for a difference between one time period and another. For example, a model trained on data collected during one season may perform differently when given data from another time of year. Detecting this seasonal drift in the data will alert you to potential issues with your model’s accuracy.
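One common way to quantify the difference between a baseline and a target distribution is the two-sample Kolmogorov–Smirnov test from SciPy. Here is an illustrative sketch (the arrays below are simulated placeholders for one feature from each data set):

```python
import numpy as np
from scipy.stats import ks_2samp

# Baseline: one feature column from the training data (simulated here).
baseline = np.random.normal(loc=0.0, scale=1.0, size=5000)

# Target: the same feature from recent production inputs, simulated
# with a shifted mean to mimic drift.
target = np.random.normal(loc=0.5, scale=1.0, size=5000)

# The two-sample KS test compares the two empirical distributions.
statistic, p_value = ks_2samp(baseline, target)
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3g}")

# A small p-value means the target data no longer looks like the
# baseline -- a signal to investigate, or to trigger retraining.
if p_value < 0.01:
    print("Drift detected: the distributions differ significantly.")
```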
