Data Drift Monitoring and Its Importance in MLOps

Learn about data drift and how to mitigate it

Sage Elliott
WhyLabs


Machine Learning (ML) is now an essential tool in most modern businesses, driving everything from predictive analytics to AI-enhanced applications. However, to ensure the effectiveness of your models, it’s important to continuously monitor and manage ML performance; this process is known as Machine Learning Operations (MLOps). One crucial aspect of MLOps is managing “data drift.” But what is data drift, and why is it so important to monitor it in your MLOps pipeline?

This post covers:

  • What is data drift
  • The consequences of ignoring data drift
  • Data drift monitoring in MLOps
  • Mitigating data drift
  • How to detect data drift with whylogs
  • Conclusion

What is data drift?

Data drift refers to the change or variation in the input data of your ML model over time. It can occur for a variety of reasons: the data might change naturally with the seasons, the patterns and behaviors of users might evolve, or the business environment itself might shift, altering the data being fed into the model.

Simply put, the model’s predictions are only as good as the data it is trained on. If the data that the model is seeing in the production environment starts to drift outside the distribution of the data it was trained on, the model’s performance could decrease substantially.

In this blog, we’ll focus on covariate drift. This form of data drift occurs when the statistical properties of the input features in production change over time. We’ll cover other types of model drift in future blog posts.

Example of data drift occurring

The consequences of ignoring data drift

Depending on your ML application, ignoring data drift can have serious consequences. The performance of your ML models can decline without your knowledge, leading to inaccurate predictions and suboptimal decisions. This could also lead to a loss of trust in the models or product, making stakeholders and customers reluctant to rely on them.

For example, consider a credit card fraud detection model. The patterns of fraudulent transactions may change over time as fraudsters adapt their strategies. If the model is not adjusted to reflect these changing patterns, the number of false positives and false negatives can increase, potentially resulting in financial loss or even damage to the company’s reputation.

Data drift monitoring in MLOps

Given the potential consequences, integrating data drift monitoring into your MLOps pipeline is important. Continuous monitoring can help you detect and address any data drift to maintain your ML models’ performance and reliability.

To implement data drift detection, you first need to define what constitutes a significant drift for each feature in your model. Then, by continuously comparing the distribution of the training data with that of the data in production, you can detect any significant drifts.

Different statistical tests can be used for comparison, such as the Kullback-Leibler (KL) divergence or the Kolmogorov-Smirnov (KS) test. These tests give you a measure of how much the data distributions differ, which can be used to trigger alerts if the drift exceeds a certain threshold.

Example of using statistical tests for data drift detection
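
For example, here is a minimal sketch of this kind of check using scipy’s two-sample KS test; the synthetic feature data and the 0.05 alert threshold are illustrative assumptions, not recommendations for your pipeline.

# Minimal sketch: compare a training feature against production data with the KS test.
# The synthetic arrays and the alert threshold below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training distribution
prod_feature = rng.normal(loc=0.5, scale=1.2, size=1000)   # shifted production data

statistic, p_value = ks_2samp(train_feature, prod_feature)

ALERT_THRESHOLD = 0.05  # illustrative p-value threshold for triggering an alert
if p_value < ALERT_THRESHOLD:
    print(f"Possible drift: KS statistic={statistic:.3f}, p-value={p_value:.3g}")
else:
    print(f"No significant drift: KS statistic={statistic:.3f}, p-value={p_value:.3g}")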

Mitigating data drift

Once data drift is detected, the next step is mitigation. A common approach is to annotate the new data and retrain the model. Before deploying to production, you may want to compare the retrained model’s performance against the current model.

A well-structured MLOps pipeline can help automate these steps, minimizing the manual effort required to retrain models and ensuring faster response times by triggering workflows when data drift is detected. At a minimum, ML monitoring should be configured to send an alert when data drift occurs so you can take action.

Example of an ML pipeline with AI Observability
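
As a rough illustration, here is a hedged sketch of a drift-triggered retrain-and-compare step; the synthetic data, the scikit-learn logistic regression, and the 0.05 drift threshold are stand-ins for whatever your pipeline actually uses.

# Hedged sketch of a drift-triggered retrain-and-compare step.
# The synthetic data, model choice, and thresholds are illustrative stand-ins.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Original training data and a drifted batch of newly annotated production data.
X_train = rng.normal(0.0, 1.0, size=(500, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_prod = rng.normal(0.7, 1.2, size=(500, 3))
y_prod = (X_prod[:, 0] + X_prod[:, 1] > 0).astype(int)

current_model = LogisticRegression().fit(X_train, y_train)

# 1. Check each feature for drift with the KS test.
drifted = [i for i in range(X_train.shape[1])
           if ks_2samp(X_train[:, i], X_prod[:, i]).pvalue < 0.05]

if drifted:
    # 2. Retrain on the annotated production data and compare before deploying.
    candidate_model = LogisticRegression().fit(X_prod, y_prod)
    current_acc = accuracy_score(y_prod, current_model.predict(X_prod))
    candidate_acc = accuracy_score(y_prod, candidate_model.predict(X_prod))
    if candidate_acc >= current_acc:
        print(f"Deploying retrained model ({candidate_acc:.2f} >= {current_acc:.2f})")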

How to detect data drift

Fortunately, MLOps is a quickly maturing field and many tools now exist to help make ML pipelines robust and responsible! We’ll take a quick look at how you can use the open source library, whylogs.

Once you install whylogs in any Python environment using `pip install whylogs`, profiles of your dataset can be created with just a few lines of code! These data profiles only contain summary statistics about your dataset and can be used to monitor for data drift and data quality issues without compromising your raw data.

import whylogs as why
import pandas as pd

# Profile a pandas dataframe (for example, your reference/training data)
df = pd.read_csv("path/to/file.csv")
profile1 = why.log(df)
profile_view1 = profile1.view()

# Profile your production data the same way to get a second view, profile_view2

Next, we can get a data drift report between profiles using the built-in `NotebookProfileVisualizer`. By default, whylogs will use the KS test to calculate the drift distance between the profiles, but other popular drift metrics can be configured instead.

# Measure data drift with whylogs
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view1, reference_profile_view=profile_view2)

# Display the drift report in the notebook
visualization.summary_drift_report()

In the example below, we can see that data drift has been detected for the “petal length” feature in the iris dataset, and a drift score has been calculated using the KS test.

Data drift report from whylogs

To get a better visualization of the data drift for an individual feature, we can use the `double_histogram` method to overlay the histograms of the petal length feature from each profile.

visualization.double_histogram(feature_name="petal length (cm)")
Data drift visualized on the individual feature with whylogs

In this example, we can see that the distributions of the two profiles hardly overlap, indicating a very large distribution drift.

To return the data drift metrics, use `calculate_drift_scores` from whylogs. This will return a Python dictionary containing the data drift metric, scores, and thresholds for each feature. Learn more about adjusting these parameters in the whylogs examples.

from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores

scores = calculate_drift_scores(target_view=profile_view1, reference_view=profile_view2, with_thresholds=True)

print(scores)

Returned data drift metrics in a Python dictionary.

{'sepal length (cm)': {'algorithm': 'ks',
                       'pvalue': 0.2694519362228452,
                       'statistic': 0.11333333333333329,
                       'thresholds': {'NO_DRIFT': (0.15, 1),
                                      'POSSIBLE_DRIFT': (0.05, 0.15),
                                      'DRIFT': (0, 0.05)},
                       'drift_category': 'NO_DRIFT'},
 'sepal width (cm)': {'algorithm': 'ks',
                      'pvalue': 0.9756502052466759,
                      'statistic': 0.05333333333333334,
                      'thresholds': {'NO_DRIFT': (0.15, 1),
                                     'POSSIBLE_DRIFT': (0.05, 0.15),
                                     'DRIFT': (0, 0.05)},
                      'drift_category': 'NO_DRIFT'},
 'petal length (cm)': {'algorithm': 'ks',
                       'pvalue': 0.9993989748100714,
                       'statistic': 0.04000000000000001,
                       'thresholds': {'NO_DRIFT': (0.15, 1),
                                      'POSSIBLE_DRIFT': (0.05, 0.15),
                                      'DRIFT': (0, 0.05)},
                       'drift_category': 'NO_DRIFT'},
 'petal width (cm)': {'algorithm': 'ks',
                      'pvalue': 0.9756502052466759,
                      'statistic': 0.053333333333333344,
                      'thresholds': {'NO_DRIFT': (0.15, 1),
                                     'POSSIBLE_DRIFT': (0.05, 0.15),
                                     'DRIFT': (0, 0.05)},
                      'drift_category': 'NO_DRIFT'}}

You can use these values to monitor for data drift between two profiles directly in your Python environment.
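
For instance, a small helper along these lines could flag drifting features and feed your own alerting logic; it only assumes the `scores` dictionary returned by `calculate_drift_scores` above.

# Flag any feature whose drift category is not NO_DRIFT.
# Assumes `scores` is the dictionary returned by calculate_drift_scores above.
drifted_features = {
    feature: result
    for feature, result in scores.items()
    if result.get("drift_category") != "NO_DRIFT"
}

if drifted_features:
    for feature, result in drifted_features.items():
        print(f"Drift warning for '{feature}': {result['drift_category']} "
              f"(p-value={result['pvalue']:.4f})")
else:
    print("No drift detected across profiled features.")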

We can take data drift monitoring a step further with the WhyLabs Observatory, which makes it easy to store, visualize, and monitor profiles created with whylogs.

Using the WhyLabs platform to monitor data drift & ML performance

To write profiles to WhyLabs, we’ll create a free account and grab our `Org-ID`, `Access token`, and `Project-ID`, then set them as environment variables in our project.

import os

# Set WhyLabs access keys
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'YOURORGID'
os.environ["WHYLABS_API_KEY"] = 'YOURACCESSTOKEN'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'PROJECTID'

Once the access keys are set up, we can easily create a profile of our dataset and write it to WhyLabs. This allows us to monitor input data and model predictions with just a few lines of code!

# Initialize the WhyLabs writer, create a whylogs profile, and write it to WhyLabs
from whylogs.api.writer.whylabs import WhyLabsWriter

writer = WhyLabsWriter()
profile = why.log(dataset)
writer.write(file=profile.view())
Data Profiles visualized in WhyLabs

Now we can enable a pre-configured monitor with just one click (or create a custom one) to detect anomalies in our data profiles. This makes it easy to set up common monitoring tasks, such as detecting data drift, data quality issues, and model performance degradation.

Preset monitor configuration for data drift and data quality detection

Once a monitor is configured, it can be previewed while inspecting the feature it’s set to monitor.

Data drift detection in WhyLabs

When data drift is detected, notifications can be sent via email or Slack, or a workflow can be triggered using PagerDuty. Set notification preferences in Settings > Global Notification Actions.

Alert and workflow trigger configuration in WhyLabs

That’s it! We have gone through all the steps needed to monitor for data drift in ML pipelines and to get notified or trigger a workflow when drift occurs.

If you’d like to follow along with a full example in a notebook, check out the WhyLabs onboarding guide.

Data drift conclusion

As we’ve seen, data drift is a critical consideration in the life cycle of ML models. As the world and the data we collect continually evolve, our models must adapt to stay relevant and reliable. Integrating data drift monitoring into your MLOps pipeline is necessary to ensure the continuous delivery of high-performing ML models.

By understanding, monitoring, and mitigating data drift, you can increase the longevity of your ML models, maximize their value, and keep stakeholders confident in the insights they produce. The ultimate goal is to make your ML systems robust, reliable, and resilient in the face of change, a principle that lies at the core of effective MLOps.
