Data Drift Detection through Effective Statistical Testing

Michael Levinger
Melio’s R&D blog

--

As data-driven decision-making becomes increasingly popular across industries, monitoring the performance of machine learning models is more important than ever.

A critical aspect of this monitoring process is detecting performance drift, which is when a model’s accuracy or other performance metrics change over time.

However, detecting performance drift in real-time isn’t always possible because true labels aren’t always available.

This is where monitoring data drift comes to the rescue.

Data drift detection is the process of identifying changes in the statistical properties of the input data compared to a reference dataset (e.g. the training set), to make sure the inputs coming into the model still resemble the data it was trained on.

To take appropriate actions (e.g. retrain the model, replace the model, make changes in the preprocessing layer, make changes in the ETL, etc.), you must identify the source of the drift and determine whether the changes are not only statistically significant but also worth acting upon. To do this, you need to use the proper statistical tests.
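
As a minimal sketch of what such a check can look like for a single numerical feature (assuming pandas Series for the reference and current windows, and using the KS test, one of the tests discussed below, as a placeholder):

```python
# Minimal sketch of a drift check for a single numerical feature.
# `reference` and `current` are assumed to be pandas Series of the same feature,
# e.g. a column from the training set vs. a recent production window.
from scipy import stats

def numeric_drift_check(reference, current, alpha=0.05):
    """Return (drifted, p_value) for the feature at significance level alpha."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return p_value < alpha, p_value

# Hypothetical usage:
# drifted, p = numeric_drift_check(train_df["amount"], prod_df["amount"])
```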

In this post, we’ll explore why it’s crucial to use the proper statistical tests when monitoring data drift and discuss some commonly used statistical tests.

Without the proper statistical tests, detecting data drift is hard!

When monitoring data drift, it’s essential to distinguish between “real” data changes and “random” fluctuations that naturally occur in the data. Statistical tests can help you determine whether the observed changes are likely to be due to chance or are significant enough to warrant further investigation.

If you don’t use statistical tests to determine the significance of data changes, you might misinterpret random fluctuations as actual performance changes, leading you to take unnecessary actions that could harm your model’s overall performance.

To develop a useful monitoring layer, you’ll need to focus on applying the statistical tests to the features properly. This helps prevent a high false positive/false negative rate, that is, alerting too often when there’s no threat or failing to alert when there is one.

Commonly used statistical tests

Several statistical tests are commonly used to monitor data drift. Each test has particular properties and built-in assumptions. Here are a few of them, with a short sketch of how to apply each one after the list:

  • The Mann-Whitney test is a non-parametric statistical test used to compare two independent samples. It’s useful for data drift detection because it can detect differences in the distribution of a metric between two time periods or groups, even if the data isn’t normally distributed or contains outliers. However, it has lower power than parametric tests, can only detect differences in the median, cannot determine the direction or magnitude of changes, and requires careful consideration of significance level and effect size.
  • The Chi-square test is a statistical test used to compare categorical data between two or more groups. It’s useful for data drift detection because it can identify changes in the distribution of categorical data over time and doesn’t make assumptions about the data distribution. However, it requires a relatively large sample size, it becomes difficult to interpret when there are many categories (20 or more) in the independent or dependent variables, and it cannot determine the direction or magnitude of changes. It’s also designed for categorical data only, and may not be appropriate for other types of data.
  • ANOVA is a statistical test used to compare the means of three or more groups. It’s useful for detecting changes in the mean of a continuous variable over time or between groups. ANOVA can handle multiple groups and is a parametric test, providing greater statistical power than non-parametric tests when assumptions are met. However, it assumes normality and equal variances of the groups, and can’t determine the direction or magnitude of changes. It may be less suitable for small sample sizes or data with outliers.
  • The t-test is a statistical test used to compare the means of two groups. It’s useful for detecting changes in the mean of a continuous variable between two time periods or groups. One advantage is its simplicity and ability to handle normally distributed data. However, it assumes normality and can be sensitive to outliers, and can only compare two groups.
  • The KS (Kolmogorov-Smirnov) test is a non-parametric statistical test used to compare the distribution of a continuous variable between two populations or time periods. It’s versatile and can compare distributions of any shape or size. It’s often a default choice for detecting a distributional change in numerical features. While it does the job in many cases, the test can be “too sensitive” for larger datasets. It would fire alarms for many real-world use cases all the time, just because you have a lot of data and small changes add up. You need to account for such test behavior when picking your drift metric.
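
As a rough sketch, here is how each of these tests maps to a scipy.stats call when comparing a reference window with a current window. The data and column names are illustrative; in practice you would pick one suitable test per feature rather than running all of them:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative reference (e.g. training set) and current (e.g. last week) windows.
rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "amount": rng.normal(100, 20, 5000),                    # numerical feature
    "channel": rng.choice(["web", "mobile", "api"], 5000),  # categorical feature
})
current = pd.DataFrame({
    "amount": rng.normal(105, 20, 5000),
    "channel": rng.choice(["web", "mobile", "api"], 5000, p=[0.5, 0.3, 0.2]),
})

# Mann-Whitney: non-parametric comparison of two numerical samples
mw_stat, mw_p = stats.mannwhitneyu(
    reference["amount"], current["amount"], alternative="two-sided"
)

# Chi-square: compare the category frequencies of the two windows
contingency = pd.DataFrame({
    "reference": reference["channel"].value_counts(),
    "current": current["channel"].value_counts(),
}).fillna(0)
chi2_stat, chi2_p, _, _ = stats.chi2_contingency(contingency)

# ANOVA: compare means across three or more groups (e.g. three weekly batches)
week1, week2, week3 = np.array_split(current["amount"].to_numpy(), 3)
f_stat, anova_p = stats.f_oneway(week1, week2, week3)

# t-test: compare the means of two (approximately normal) numerical samples
t_stat, t_p = stats.ttest_ind(reference["amount"], current["amount"], equal_var=False)

# KS: compare the full distributions of two numerical samples
ks_stat, ks_p = stats.ks_2samp(reference["amount"], current["amount"])
```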

Remember that there’s no such thing as “objective” data drift or a “perfect” test for it; the right choice depends on the use case and the data. For example, some tests are more sensitive than others, so they might intuitively be a good fit for features with high importance in the model.

When fitting the statistical test to the features, consider the following (a small test-selection sketch follows the list):

  • The correlation between the feature and the model performance (using SHAP values can be useful)
  • The feature type and distribution
  • The size of the samples you compare
  • The size of drift you want to detect
  • The cost of the model performance drop
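
To make this concrete, here is one way such a mapping could look in code. The rules and thresholds below are assumptions for the sketch, not a prescription; in practice you would weigh all of the points above:

```python
def pick_drift_test(feature_is_categorical: bool, sample_size: int) -> str:
    """Illustrative mapping from feature properties to a drift test.

    These rules are assumptions for the sketch: feature importance
    (e.g. SHAP values), the drift size you care about, and the cost of a
    performance drop should also shape the choice.
    """
    if feature_is_categorical:
        return "chi-square"          # categorical data
    if sample_size < 1000:           # illustrative threshold
        return "mann-whitney"        # fewer distributional assumptions
    # Full-distribution comparison; mind its sensitivity on very large samples
    return "kolmogorov-smirnov"
```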

To sum up

Monitoring performance and data drift is crucial for ensuring the long-term success of machine learning models.

However, detecting drift is only the first step. To make informed decisions about model performance, you need the proper statistical tests to determine whether observed changes in the data are statistically significant.

Statistical tests help you differentiate between real data changes and random fluctuations. By choosing the appropriate ones, you can take data-driven actions that will improve your model’s performance and drive better outcomes for your business.
