Visually Inspecting Data Profiles for Data Distribution Shifts
Editor’s note: Felipe de Pontes is a speaker for ODSC Europe 2022 coming up June 15th-16th. Be sure to check out his talk, “Visually Inspecting Data Profiles for Data Distribution Shifts,” there!
The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are a major post-production concern for any ML or data practitioner.
Distribution shift issues, if unaddressed, can cause significant performance degradation over time and can even render the model unusable. In a production environment, where there are large volumes of often sensitive data, detecting and diagnosing these issues can become especially challenging.
In this tutorial, we will see how to inspect data for distribution shift by comparing distribution metrics and by applying statistical tests to compute drift values.
We’ll also learn how to use data logging to process large volumes of data while addressing privacy concerns, so that we can inspect for data distribution shift in a production environment.
For this tutorial, you’ll need only a Python 3 environment, either on your own device or in the cloud, such as a Google Colab Notebook.
But what is Data Distribution Shift anyway?
Unlike traditional software, a supervised machine learning model will see its performance degrade over time, a phenomenon known as model decay or model degradation. One of the most common causes of model decay is a change in the distribution of production data relative to the data you used to test and validate your model. This can happen in different ways: the distribution of the input data can change, the distribution of the output data can change, or the relationship between the input and output can change.
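One common way to make these three cases precise is to factor the joint distribution of inputs $X$ and outputs $Y$. The notation and the names below are not from this article's text; they follow common usage, and the subscripts "ref" and "prod" are labels chosen here for the reference data and the production data.

```latex
\begin{align*}
P(X, Y) &= P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y) \\[4pt]
\text{Covariate shift:}\quad & P_{\text{ref}}(X) \neq P_{\text{prod}}(X),
  \quad P_{\text{ref}}(Y \mid X) = P_{\text{prod}}(Y \mid X) \\
\text{Label shift:}\quad & P_{\text{ref}}(Y) \neq P_{\text{prod}}(Y),
  \quad P_{\text{ref}}(X \mid Y) = P_{\text{prod}}(X \mid Y) \\
\text{Concept drift:}\quad & P_{\text{ref}}(Y \mid X) \neq P_{\text{prod}}(Y \mid X)
\end{align*}
```

Read this way, a change in the input data corresponds to the first case, a change in the output data to the second, and a change in the input-output relationship to the third.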
Why is this a problem?
This is a problem because, ultimately, distribution shifts can affect the performance of your model, leading to all sorts of negative impacts on your organization. Ideally, your model’s performance should be constantly monitored. Still, it’s not always easy to have performance results readily available, because the required ground truth might not be available or, if it is, it might arrive with a delay. In those cases, you can use the data you have as a signal or proxy for your model’s performance.
Let’s see a practical example of how we can inspect and detect distribution shifts with a simple case study.
Case Study: Covariate Shift with Wine Quality Dataset
As a case study, let’s use UCI’s Wine Quality Dataset. The goal of this task is to model wine quality based on physicochemical tests. This can be viewed as a classification task, where we predict the wine’s quality based on its features, like pH, density, and percent alcohol content.
In order to create a scenario of distribution shift, we will split the available dataset into two groups: wines with alcohol content (the alcohol feature) below 11 and wines with alcohol content above 11. The first group is considered our baseline (or reference) dataset, while the second will be our target dataset. This is an example of a covariate shift, one of the possible types of data distribution shifts, and it means that the input distribution changes between our reference and target datasets, but the relationship between the input and output doesn’t change. Since we’re only concerned about changes in the input data, we’ll skip the model training altogether and focus on the input features.
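The files used in this tutorial are already split, but a split like this might be produced roughly as follows. This is a minimal sketch on a tiny made-up dataframe: the column name alcohol and the threshold of 11 come from the text, while the values, the variable names, and the choice to put wines at exactly 11 into the target group are assumptions.

```python
import pandas as pd

# Toy stand-in for the wine dataset: the values are made up for illustration.
wine_df = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
        "pH": [3.51, 3.20, 3.26, 3.16, 3.39, 3.30],
    }
)

ALCOHOL_THRESHOLD = 11  # threshold named in the text

# Assumption: wines at exactly 11 go to the target group; the text does not say.
reference_split = wine_df[wine_df["alcohol"] < ALCOHOL_THRESHOLD]
target_split = wine_df[wine_df["alcohol"] >= ALCOHOL_THRESHOLD]

# The two groups now differ in their input distribution by construction.
print(reference_split["alcohol"].mean(), target_split["alcohol"].mean())
```

Because the split is made on an input feature, any other feature correlated with alcohol content will tend to shift between the two groups as well.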
The example used here was inspired by the article A Primer on Data Drift. If you’re interested in more information on this use case, or the theory behind Data Drift, it’s a great read!
Loading the dataframes
Let’s first download the dataframes. They are already preprocessed and split into target and reference dataframes.
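import pandas as pd
# Assumption: the target file name mirrors the reference file name.
target_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_target.csv"
reference_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_reference.csv"
target_df = pd.read_csv(target_url)
reference_df = pd.read_csv(reference_url)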
Data logging with whylogs
In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.
For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics. In whylogs, these statistical summaries are called profiles, which we’ll use to visualize the effect of covariate shift in our data.
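First of all, we can install whylogs. The viz extra adds the dependencies needed by the Notebook Profile Visualizer used later on:
pip install "whylogs[viz]"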
Let’s first create a profile of our target dataframe:
import whylogs as why
results = why.log(target_df)
profile = results.profile()
We can keep updating the profile by logging additional data, but for the moment let’s generate a Profile View from it, in order to continue our inspection:
profile_view = profile.view()
The profile_view is a lightweight statistical fingerprint of your dataset, which can be stored for later use or sent over to monitoring platforms. It will provide you with valuable statistics on a column (feature) basis, such as:
- Counters, such as number of samples and null values
- Inferred types, such as integral, fractional, and boolean
- Estimated Cardinality
- Frequent Items
- Distribution Metrics: min, max, median, quantile values
We can view these statistics as a pandas DataFrame, with one row per column of the original data:
target_summary = profile_view.to_pandas()
Let’s do the same for our reference dataframe:
result_ref = why.log(pandas=reference_df)
profile_reference = result_ref.profile()
profile_view_reference = profile_reference.view()
There are a lot of other exciting features of whylogs that are out of scope for this short tutorial but are worth mentioning. One of these is that the generated profiles are mergeable, which means that they can be combined with other profiles. In streaming systems, profiles can be captured over a mini-batch and merged into different time granularities of data without losing statistical accuracy. This is made possible by a technique called data sketching, pioneered by Apache DataSketches. If you want to know more about this and other aspects of whylogs, feel free to check out our open-source repository!
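To make the idea of mergeability concrete, here is a minimal sketch of a summary that can be built per batch and then combined. ColumnSummary and summarize are hypothetical names, not whylogs classes or functions, and this toy version tracks only statistics that merge exactly (count, sum, min, max). Quantiles and cardinality are harder, which is where data sketches come in: they provide approximations with bounded error.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Toy mergeable summary of one numeric column (hypothetical, not a whylogs class)."""

    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        """Fold a single value into the summary."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        """Combine two summaries without revisiting the raw data."""
        return ColumnSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan


def summarize(values):
    """Build a summary from an iterable of numbers."""
    summary = ColumnSummary()
    for value in values:
        summary.update(value)
    return summary


# Two mini-batches of made-up alcohol values.
batch_1 = [9.4, 9.8, 10.5]
batch_2 = [11.2, 12.0]

merged = summarize(batch_1).merge(summarize(batch_2))
whole = summarize(batch_1 + batch_2)
print(merged.count, merged.minimum, merged.maximum, merged.mean)
```

Merging the two per-batch summaries gives the same result as summarizing all the values at once, which is the property that lets profiles be rolled up from mini-batches into hourly or daily views.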
Inspecting and comparing distributions with the Profile Visualizer
There are a number of ways we could inspect our profiles for data shifts and other data quality issues. The first one is to simply compare distribution metrics like mean, median or quantile values, which could be done with the Profile View obtained in the previous section. The downside of this approach is that there might be real cases of data shift that are not perceived by simply inspecting these metrics.
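A small made-up example shows how this can happen. The two lists below share the same mean and median, yet they are clearly distributed differently. The values are invented for illustration and are not taken from the wine dataset.

```python
import statistics

reference_values = [4, 5, 5, 5, 6]  # concentrated around 5
target_values = [1, 1, 5, 9, 9]     # pushed toward the extremes

for name, values in (("reference", reference_values), ("target", target_values)):
    print(
        name,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "std:", round(statistics.pstdev(values), 2),
    )
```

Comparing only the mean and median would report no change here, while the spread shows that the two samples differ substantially.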
There are some other approaches we can take. Let’s go through them by using whylogs’ visualization module, the Notebook Profile Visualizer.
To do so, we can start by instantiating a visualizer, and setting the target and reference profiles obtained in the previous section:
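from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view, reference_profile_view=profile_view_reference)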
Applying Statistical Tests
Instead of simply comparing distribution metrics, we can quantitatively measure data shift (or drift) by applying two-sample hypothesis testing. We can use these tests in order to compare two different sets of data to verify if both come from a common underlying distribution.
There are a number of different methods for this purpose. In this tutorial, we will use two of the most popular ones: the Kolmogorov-Smirnov (K-S) test for numerical features and the chi-squared test for categorical features. We can do so by generating a Summary Drift Report with the visualizer’s summary_drift_report() method, which yields the p-values for all of the features common to both profiles, alongside other useful information, such as overall metrics, side-by-side histograms, and distribution metrics.
The null hypothesis is that the samples are drawn from the same distribution, which means that a low p-value is indicative of different distributions. In this example, we see that drift was detected for all of our features.
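To show what the K-S test measures, here is a minimal pure-Python sketch of the two-sample version. It is not how whylogs computes its drift values (whylogs works from profiles rather than raw samples, so its numbers may differ). The function names are hypothetical, the sample values are made up, the p-value uses the standard asymptotic approximation, and the 0.05 threshold is a conventional choice assumed here rather than one taken from the text.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_p_value(statistic, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    root_n = math.sqrt(n * m / (n + m))
    lam = (root_n + 0.12 + 0.11 / root_n) * statistic
    if lam < 0.3:
        # The series converges slowly here, and the true value is close to 1.
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# Made-up alcohol values: every reference value is below 11, every target value is not.
reference_alcohol = [9.0 + 0.04 * i for i in range(50)]
target_alcohol = [11.0 + 0.06 * i for i in range(50)]

d = ks_statistic(reference_alcohol, target_alcohol)
p = ks_p_value(d, len(reference_alcohol), len(target_alcohol))

SIGNIFICANCE_LEVEL = 0.05  # assumption: conventional threshold
print("statistic:", d, "drift detected:", p < SIGNIFICANCE_LEVEL)
```

Since the two samples do not overlap at all, the empirical CDFs are as far apart as they can be, and the p-value falls far below the threshold. Comparing a sample with itself gives a statistic of 0 and a p-value of 1.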
In addition to statistical tests, there are other approaches you can take to tackle distribution shifts, such as visually inspecting histograms and distribution charts for individual features, which can be useful to confirm the disparity between distributions. More generally, setting up rule-based data validation is key to ensuring the quality of your data, which includes guarding against distribution changes, whether they come from external factors or from systemic errors such as pipeline errors or missing data.
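As a rough illustration of rule-based validation, the sketch below checks a single column against a value range and a maximum share of missing values. It is plain Python, not the whylogs constraints API; check_column, the rule thresholds, and the sample values are all assumptions made for this example.

```python
def check_column(values, min_allowed, max_allowed, max_null_fraction):
    """Return a list of human-readable rule violations for one column."""
    violations = []
    non_null = [v for v in values if v is not None]
    null_fraction = 1 - len(non_null) / len(values) if values else 0.0

    if null_fraction > max_null_fraction:
        violations.append(f"null fraction {null_fraction:.2f} exceeds {max_null_fraction}")
    if non_null and min(non_null) < min_allowed:
        violations.append(f"minimum {min(non_null)} is below {min_allowed}")
    if non_null and max(non_null) > max_allowed:
        violations.append(f"maximum {max(non_null)} is above {max_allowed}")
    return violations


# Illustrative rules for a pH column: values within 0-14, at most 10% missing.
PH_RULES = {"min_allowed": 0.0, "max_allowed": 14.0, "max_null_fraction": 0.1}

clean_ph = [3.51, 3.20, 3.26, 3.16]
broken_ph = [3.51, None, 31.6, None]  # simulates missing data and a bad value

print(check_column(clean_ph, **PH_RULES))
print(check_column(broken_ph, **PH_RULES))
```

Rules like these catch pipeline errors that a drift test might only flag indirectly, and they produce a message that points at the specific problem.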
For a more in-depth view on this topic, you can sign up for my upcoming workshop at ODSC Europe this June, “Visually Inspecting Data Profiles for Data Distribution Shifts.” In the workshop, we will also see how to visually inspect histograms and distribution charts and how to do data validation with whylogs’ constraints. We will dig deeper into the concept of distribution shift and explore other popular packages for detecting data shifts.
See you there!