Analyzing Data Drift: How We Designed Visualizations to Support Feature Investigation
The world is not stationary — things change. Thus, it is unlikely that our production data will follow the same distribution as the data we used to train our Machine Learning model.
In the first blog post, we introduced Feature Investigation, a system for automatically detecting and analyzing data drifts, and explained how its algorithm works. Now, it is time to learn about the visualizations we developed to help data scientists analyze the outputs and identify patterns and major changes in feature distributions.
Designing the Visualizations
First things first. In the beginning, we knew the problem that the Feature Investigation system tackles and its relevance. However, we still had a lot of questions about the current practices and needs of data scientists, as well as other solutions out there. In other words, we didn’t know what interface to design and how to integrate it into users’ workflow.
Before we go through the visualizations, let’s recap our design process. We knew the what, why, and who, but not the how, when, and where. Of course, we started by interviewing data scientists.
User Interviews
We conducted four semi-structured interviews with data scientists working on different projects/clients at Feedzai. Each interview lasted between 30 and 45 minutes. Together, we identified current data monitoring and distribution analysis practices, how these practices fit into their workflows, the triggers for these processes, and pain points.
From these interviews, we gained some interesting and generalizable insights:
- Data scientists compare feature distributions between the production environment’s most recent or relevant time window and the historical/training ones. Sometimes they also compare distributions over time (and not just two snapshots).
- Data scientists leverage charts and summary statistics via single-use Jupyter notebooks or spreadsheets.
- Data scientists prioritize the most important features for predicting the target variable (Feature Importance results), raw features, and features obtained from external data enrichers. Two examples of problems are null values coming from the data enrichers and missing categorical values for one of the features.
- Data scientists have two main triggers to investigate features: Model/System performance changes and deployments. An example of detecting changes is checking the distribution of model scores.
If the value of the performance metric differs from the expected one, data scientists start by readjusting the threshold to maintain the desired performance. They may then conduct a manual feature analysis to identify potential causes before contacting the relevant data provider (data enricher or client). After solving the problem, they revert the initial threshold adjustment (or act as needed).
Using all the information collected so far (including internal documentation), we moved on to the next part: defining the user flow and requirements.
User Flow
The user flow is a diagrammatic representation of the typical data scientist workflow that we aim to support with Feature Investigation and its visualizations. Each box corresponds to a step or task. The boxes are connected, forming the expected path and branching off whenever there is more than one suitable option. This technique is useful not only to help us design a solution but also to validate mockups: by walking through the mockups with the user flow in hand, we can confirm whether they support all the paths defined earlier.
For Feature Investigation, we identified five major parts:
- Data preparation: Get the data, run Feature Investigation, and select the outputs to analyze.
- Overview: Do the first screening and identify problematic features.
- Profiling: Freely explore features/periods and select features for further analysis.
- Knowledge management: Document insights and save assets.
- Root cause analysis: Characterize the previously identified features and the respective main changes.
Breaking down each of the parts, the complete user flow looks like this:
In other words (or images), the typical workflow boils down to the following steps:
After the user flow, we compiled a set of requirements that established the objectives to be accomplished with the visualizations.
User Requirements
At the end of the day, data scientists need to quickly generate hypotheses and decide on the next steps to ensure model performance is not affected by data drifts and data quality issues. With this in mind, we established five user requirements:
- Identify and compare data drift patterns.
- Identify relevant features and periods.
- Compare related features.
- Identify the main changing values in the distributions of the features.
- Integrate with the current workflow of data scientists.
After researching and defining the problem, we started to mock possible solutions and shape the visualizations for Feature Investigation. We also reviewed the literature — check out our short paper for more information on related work.
Mockups
For this stage, we used Figma as our collaborative UI design tool. At this point, we had already decided to design two complementary visualizations and approached them in the expected order of use. In other words, after we were satisfied with the design of the first visualization, we moved on to the second one. We will explore each in the “Introducing the Visualizations” section. For now, let’s focus on the process.
We were motivated to include this stage in our project for two reasons:
- To have a moment focused on designing a concrete solution that is as complete as possible, iterating quickly through the various versions.
- To have a clear baseline to plan the development and a common goal for the development itself. This reason was particularly important because we were two people implementing the visualizations, and there were multiple stakeholders.
This way, we prepared everything from paper wireframes with our first sketches to mockups that helped us during development. In addition to mockups for each of the visualizations in their different main states, we also prepared several variations for some parts of them.
A tool like Figma is handy for creating multiple versions of a Y-axis, for example, so that we can compare the various ideas and decide which one to use. After experimenting with various combinations of font families, font sizes, text alignments, tick styles, and other properties, we implemented a version similar to the Y-axis in the previous image. We left-aligned the text (instead of right-aligning it, as is usual) to make it easier to compare feature names based on their prefixes.
After we finished the mockup stage and organized the final mockups (which may change during development), we started planning the implementation of the visualizations for Feature Investigation.
Development Planning
To plan our effort, we set milestones and more granular tasks. We added these tasks to GitLab as issues in order to manage the work from the Feature Investigation repository. As for the milestones, there were four between October 2021 and January 2022:
- MVP Development.
- User Testing.
- Development.
- User Acceptance Testing.
On November 18th and 19th, we conducted two user tests to validate our core features and get feedback (User Testing). Up until that point, we had focused on implementing the most important features (including styling) and getting a first working version of both charts (MVP Development).
After organizing the collected feedback and checking our backlog, we got back to development (Development). The focus was on finishing and fine-tuning each visualization to get to the final module version for Feature Investigation.
In January 2022, before the release near the end of the month, we conducted User Acceptance Tests with two data scientists who were also part of the Feature Investigation project (User Acceptance Testing). These tests consist of executing a set of test cases/tasks, previously defined in a script, to verify that each piece works as expected. Beforehand, we prepared an environment similar to what a data scientist uses, with a real-world dataset. In our case, we used the same dataset we had leveraged during development so that the output would be as representative as possible. Both tests went as expected, and we found less than a handful of minor issues (we improved one part before we released Feature Investigation and the rest after that).
As a final note, in early January, we added some unit and component tests to our codebase. We also improved one of the visualizations (the Histogram) in terms of keyboard navigation and ARIA labeling to make it more accessible. Leaving these additions for late in development is something we want to improve in our future projects.
Fast forward to the present, let’s explore the visualizations and how they can help with data drift analysis!
Introducing the Visualizations
The visualization module comprises two charts: the Overview and the Histogram. We designed these charts to be primarily used in sequence. After the data scientist gets a set of drifting features from the Overview, they will try to figure out the issues for each one from the Histogram: Are there new values? Or null values? Has the distribution been skewed?
Let’s move on and get to know each visualization in detail and some of our design choices. To do so, imagine the following situation: Two months have passed since we deployed our first model. Now, we want to understand how the production data distribution is behaving. We still do not have new labels to evaluate the model’s performance, but we can see if any features are drifting. Our Reference distribution corresponds to the training dataset, while the production data corresponds to the Target distribution. We don’t know what to expect, as this is the first time we have deployed a model and conducted this type of analysis. So, let’s start by leveraging the Overview to understand the features’ behavior in general.
Overview
The Overview is a heatmap combined with a toolbar and a context menu. These control elements adjust the layout and filter the data to help the data scientist focus on the most relevant parts. This interactivity is particularly useful in this visualization because the heatmap can be huge and overwhelming. There are cases with more than 200 (and even 500!) features and a significant time window for analysis.
Nevertheless, we can have a first look at all the features by checking the heatmap and scrolling up and down. The Y-axis corresponds to features and the X-axis to time (e.g., days). Each row allows us to analyze the behavior of a feature over time and extract possible patterns. When hovering over a cell, a tooltip shows information about the associated values.
Toolbar
Since we don’t have any features in mind to check, let’s rearrange the layout to highlight features that are worthy of our attention. We have three possibilities at the top, more precisely on the toolbar:
- Features widget: A selection dropdown to show and hide specific features in the visualization. It includes a search prompt within the menu, as well as two helper buttons to select all or none of the features. By default, all features are selected.
- Feature Sorting widget: A selection dropdown to sort the features (Y-axis) according to certain criteria. By default, features keep their original input order.
- Feature Grouping widget: A selection dropdown to vertically cluster features according to an attribute. Currently, it supports the Raw / Engineered option, which creates a nested layout where the raw features are separated from the ones engineered from them. In practice, we end up with two heatmaps.
Let’s focus on the second widget. In this widget, we have four criteria to sort the features as if we were going to sort a table:
- Original order: The default option. The features are sorted according to the input data.
- Alphabetical order: The features are sorted alphabetically in a case-insensitive way.
- Most alarms: The features are sorted in descending order of alarm count. An alarm corresponds to a feature whose p-value, at a given moment, is less than or equal to a threshold (significance level). In other words, a feature drifts when there is an alarm.
- Least sum of p-values: The features are sorted by the total sum of p-values per feature, from smallest to largest. This option overlaps with the previous one but allows pulling up features that, over time, have smaller p-values, even if they are not small enough to be considered alarms. We may want to keep an eye on these features to act proactively.
If there are ties, the features keep the same original relative order. By selecting the Least sum of p-values option, we move all (near-)alarming features to the top. We first notice that there are six features (M1-M3) in alarm whose distributions are significantly different from the Reference ones. We represent alarming features with black squares ⬛.
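To make these sorting criteria concrete, here is a minimal sketch in Python. It assumes a hypothetical layout where the p-values come as a pandas DataFrame with one row per feature and one column per time step; the feature names, dates, and threshold value are illustrative, not the actual Feature Investigation output format.

```python
import pandas as pd

# Hypothetical layout: one row per feature, one column per day of the Target period.
p_values = pd.DataFrame(
    {"2022-01-01": [0.50, 0.008, 0.03], "2022-01-02": [0.45, 0.004, 0.02]},
    index=["amount", "merchant_category", "card_age"],
)
SIGNIFICANCE_LEVEL = 0.01  # alarm threshold (illustrative value)

# "Most alarms": descending count of time steps where the p-value is at or below the threshold.
alarm_counts = (p_values <= SIGNIFICANCE_LEVEL).sum(axis=1)
most_alarms_order = alarm_counts.sort_values(ascending=False, kind="mergesort").index.tolist()

# "Least sum of p-values": smallest total first, surfacing (near-)alarming features.
least_sum_order = p_values.sum(axis=1).sort_values(ascending=True, kind="mergesort").index.tolist()

print(most_alarms_order)  # ['merchant_category', 'amount', 'card_age']
print(least_sum_order)    # ['merchant_category', 'card_age', 'amount']
```

Using a stable sort (mergesort) keeps the original relative order on ties, matching the behavior described above.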
Color Scale
Before we go further with our analysis, let’s understand how color works in this heatmap. Color is our visual encoding channel for p-values. After the toolbar, we have a legend for the color scale (and the symbol ✱ used to identify raw features on the Y-axis). We have adopted a tripartite color scale. In the rightmost part, we have all p-values equal to or below the threshold. They are clamped and identified by the black color ⬛. On the other hand, in the leftmost part, we have all p-values greater than a new threshold introduced for the visualizations: the analysis threshold. They are also clamped and identified by a light gray tone ⬜. We have a sensible default for this threshold, but the data scientist can adjust it when starting a new Overview. Finally, in the middle, we have a linear scale of p-values that follows a sequential color scheme (YlOrRd 🟨 🟧 🟥).
At first glance, this choice may seem strange. Before we continue, please take a look at the image below:
Is your first impression, at least, the same for both images? The data is the same in both, but the heatmap on the left uses a color scale based on the minimum (the threshold) and maximum values, while the heatmap on the right leverages our tripartite color scale. We believe that using the maximum (normalized) p-value in the color scale, that is, the full range of values, can lead to misinterpretations of the state of the features and make it challenging to analyze the respective patterns.
In this way, the analysis threshold acts as the new maximum p-value. The analysis threshold allows us to focus on a shorter, more relevant range of p-values (occupying the same space). It helps us surface actionable patterns when changes between distributions become more noticeable. It doesn’t matter, for example, whether the (normalized) p-value is 15 or 20 since, for the user, the interpretation is the same — it’s the values closest to the threshold (0.01), and their evolution, that matter. In practice, both endpoints work as categories, while the middle works as a continuous linear scale.
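As a rough sketch of the idea behind this tripartite mapping (the real implementation lives in the front end and uses D3 color scales; the analysis threshold default and the three-step YlOrRd ramp below are simplifying assumptions):

```python
def cell_color(p_value, alarm_threshold=0.01, analysis_threshold=0.20):
    """Map a (normalized) p-value to a heatmap color following the tripartite idea.

    Values at or below the alarm threshold are clamped to black (alarm ⬛),
    values above the analysis threshold are clamped to light gray (⬜),
    and the range in between is mapped onto a YlOrRd-like ramp.
    """
    if p_value <= alarm_threshold:
        return "#000000"
    if p_value > analysis_threshold:
        return "#f0f0f0"
    # Position within the middle range: 0 = right above the alarm threshold,
    # 1 = at the analysis threshold.
    t = (p_value - alarm_threshold) / (analysis_threshold - alarm_threshold)
    ramp = ["#bd0026", "#fd8d3c", "#ffffb2"]  # dark red -> orange -> light yellow (YlOrRd tones)
    return ramp[min(int(t * len(ramp)), len(ramp) - 1)]


print(cell_color(0.005))  # "#000000" (alarm)
print(cell_color(0.05))   # a red tone, close to the alarm threshold
print(cell_color(0.80))   # "#f0f0f0" (above the analysis threshold)
```

The stepped ramp only approximates the linear middle part described earlier; the important point is the clamping at both endpoints.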
Investigation View
Now we know how to use the heatmap and have an overview. Let’s continue our analysis and drill down to one of the top features.
probability_log_T… is an engineered feature, and although it has no alarms, its distribution appears to be very different from the Reference one. To investigate this feature, we left-click on the feature name to open the context menu. From this menu, in addition to copying the feature name, we can access the investigation view. In this view, the heatmap is focused on the selected feature and its related ones.
There are two types of related features:
- Feature ancestors: These are the features used to create the feature under investigation. Raw features have no ancestors.
- Feature descendants: These are the features created from the feature under investigation. Some of them can be created from more than one feature.
Some features only have one of the types, while others have both types. There are also features unrelated to any other ones (from the engineering point of view). The investigation view helps surface the features’ hierarchy based on the selected feature (and others added later, if any).
In addition, we have equipped this view with some actions to facilitate the analysis of a subset of features. Right below the feature under investigation, it is possible to add more features to investigate on the Y-axis. The Related features section is then updated with the ones related to the added feature. As long as there is more than one feature under investigation in this view, it is also possible to adjust the subset of visible related features. By default, all features related to at least one of the features under investigation are shown. However, when clicking on COMMON, only the ones related to all features under investigation are considered.
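To illustrate the logic behind the related-features subset, here is a simplified sketch. The lineage representation (a parent map), the feature names, and the function names are hypothetical; Feature Investigation's internal representation may differ.

```python
# Hypothetical lineage: each engineered feature maps to the features it was built from.
PARENTS = {
    "probability_log_T": ["probability"],
    "card_proxy_stdDev": ["card_proxy", "amount"],
    "card_proxy": ["card_id"],
}


def ancestors(feature):
    """Features used (directly or indirectly) to create `feature`; raw features have none."""
    found = set()
    stack = list(PARENTS.get(feature, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


def descendants(feature):
    """Features created (directly or indirectly) from `feature`."""
    children = {f for f, parents in PARENTS.items() if feature in parents}
    for child in set(children):
        children |= descendants(child)
    return children


def related(under_investigation, common=False):
    """Union of related features by default; intersection when COMMON is selected."""
    sets = [ancestors(f) | descendants(f) for f in under_investigation]
    result = set.intersection(*sets) if common else set.union(*sets)
    return result - set(under_investigation)


print(related(["card_proxy_stdDev"]))                             # {'card_proxy', 'amount', 'card_id'}
print(related(["card_proxy_stdDev", "card_proxy"], common=True))  # {'card_id'}
```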
Using this view and its various actions, we can get insights progressively, locking the analysis on an initial feature and its relationships. For example, we can see that the probability_log_T… feature’s pattern is similar to that of the related card_proxy_stdDev… feature, at least for the most recent days (see the first image in this section). To return to the initial state, we need to click on the reset button next to the title of the first heatmap.
After analyzing some features carefully, we can go even deeper and compare the Reference and Target distributions for one of them. Looking at the Overview again, we can identify a gradually drifting raw feature that deserves our attention: C9.
Histogram
The Histogram is a bar-based chart for comparing the distributions of a single feature. If the feature is continuous, this visualization is a histogram; if categorical, it is a bar chart. For simplicity, we call it just the Histogram.
In this visualization, the Reference distribution is encoded with lines ( — ), while the Target distribution is encoded with gray bars. This design choice was motivated by the main task the Histogram supports: comparing the same bin across distributions to see whether the Target bar is above, below, or similar to the Reference line (intra-bin comparison). Comparing distribution shapes is secondary when looking for concrete changes to generate hypotheses.
Distributions
Please note that we don’t use the Histogram to compare the original data distributions but the moving histograms (“What are Moving Histograms?” section, first blog post). In the case of continuous features, by default, we show bins of equal width (although the intervals of each bin are different) to facilitate intra-bin comparisons. Near the X-axis, we can toggle between this discrete option and the corresponding linear bins, that is, the original varying-width bins from the moving histograms.
In addition to the feature values, there are also some complementary bins on the X-axis:
- For continuous features, the rightmost bin (NaN), separated from the histogram, accounts for the null values. If this percentage is relatively high, there may be a technical problem with the input data, for example. In addition to this bin, the leftmost and rightmost bins of the histogram accommodate values smaller and larger than the limits of the Reference data range. In other words, these bins are used to record values that did not exist in the Reference period.
- For categorical features, the rightmost bin (New values) accounts for all new values of a given feature in the Target distribution. In other words, this bin accommodates the new categories that did not exist in the Reference period (and their total percentage).
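As an illustration of how such complementary bins can be derived for a categorical feature, here is a small sketch with toy data. The function name, the data layout (two pandas Series of raw values), and computing relative frequencies on the fly are assumptions; the actual module works on the precomputed moving histograms.

```python
import pandas as pd


def categorical_bins(reference: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Compare relative frequencies per category, folding categories unseen in the
    Reference period into a single "New values" bin (illustrative only)."""
    ref_freq = reference.value_counts(normalize=True)
    tgt_freq = target.value_counts(normalize=True)

    known = ref_freq.index
    new_values_pct = tgt_freq[~tgt_freq.index.isin(known)].sum()

    table = pd.DataFrame({
        "reference": ref_freq,
        "target": tgt_freq.reindex(known, fill_value=0.0),
    })
    table.loc["New values"] = [0.0, new_values_pct]
    return table


# Toy usage example: "amex" only appears in the Target period.
ref = pd.Series(["visa", "visa", "mastercard", "mastercard"])
tgt = pd.Series(["visa", "amex", "amex", "mastercard"])
print(categorical_bins(ref, tgt))
```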
Interactivity
Since some features have many bins, we can brush the chart area with the mouse so that we only see a subset of them. When hovering over a bin, a tooltip shows information about the associated values and differences.
For the Y-axis, we only consider relative frequencies (percentages). Near this axis are toggle buttons for the scale type: linear (default) or logarithmic scale.
Similar to the Overview, the Histogram also has a toolbar. Here, we can find a slider and a date picker to select the date for which we want to compare the distributions of a feature. The p-values are also encoded in the slider (via its color), approximating the corresponding row in the heatmap.
Feature Stability
Between the toolbar and the chart, we have a legend. In addition to clarifying which mark corresponds to each distribution, it also summarizes the stability of a feature during the Reference period according to three categories: low, moderate, and high. During the Reference period, changes over time also occur in the feature distributions. So, this metric allows us to consider the changes in our feature distributions over this period based on the divergence values (“Build Reference” section, first blog post).
In this way, feature stability helps to weigh the changes we see in the Histogram. We can find some noticeable changes, but the p-value is not that low (see the image below for an example). While some features are very stable (high), others are not (low or moderate). Thus, this metric helps us judge bin changes and contextualize the conclusions we can get.
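Here is a minimal sketch of how such a stability summary could be derived from the Reference-period divergence values. The cut-off values and the use of the mean divergence are assumptions; the first blog post describes the actual divergence computation.

```python
import statistics


def stability_category(reference_divergences, high_cutoff=0.02, low_cutoff=0.10):
    """Summarize how much a feature's distribution moved during the Reference period.

    Small average divergence -> "high" stability; large -> "low".
    The cut-off values here are illustrative assumptions.
    """
    mean_divergence = statistics.fmean(reference_divergences)
    if mean_divergence <= high_cutoff:
        return "high"
    if mean_divergence <= low_cutoff:
        return "moderate"
    return "low"


print(stability_category([0.010, 0.015, 0.012]))  # "high": the feature barely moved
print(stability_category([0.20, 0.25, 0.18]))     # "low": noticeable movement even in Reference
```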
Combining the Overview and the Histogram, we can explore Feature Investigation outputs and gain insights into the current state of our data distributions. This information is essential to mitigating performance degradation. Before we wrap up this blog post, here are a few words about how we implemented these visualizations.
Implementation and Tech Stack
We designed the visualizations to be used via JupyterLab. Thus, they are easily integrated into notebooks, one of the main tools of data scientists. Due to this requirement, we considered the output size of notebook cells when arranging the layout of both visualizations.
To have a Python subpackage for Feature Investigation with full flexibility to develop visualizations/small interfaces, we followed an approach based on a Python API, web-based visualizations, and IPython’s display(). The most similar implementations we found are those of PipelineProfiler and NOVA.
Going into more detail, Python is responsible for preparing and serializing the data from the Feature Investigation output, as well as for the user-facing API. In this way, each visualization function is imported and used like any other in a Python package. On the other hand, we developed the UI using HTML, CSS, and JavaScript, with React as the framework. To connect both parts, Python is also responsible for preparing an HTML file for each visualization and embedding it in JupyterLab (via IPython’s display()).
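In outline, the embedding step looks something like the sketch below. This is not the actual Feedzai code: the function name, the template file, and the data placeholder are hypothetical, but the pattern (serialize data in Python, inject it into a bundled web app, render it with IPython's display()) is the one described above.

```python
import json

from IPython.display import HTML, display


def show_overview(p_values_df, analysis_threshold=0.2):
    """Serialize the data, inject it into the bundled web app, and render it in the cell."""
    payload = json.dumps({
        "features": p_values_df.index.tolist(),
        "dates": [str(d) for d in p_values_df.columns],
        "pValues": p_values_df.values.tolist(),
        "analysisThreshold": analysis_threshold,
    })
    # overview_template.html would contain the bundled React/D3 application,
    # with a placeholder that gets replaced by the serialized data.
    with open("overview_template.html", encoding="utf-8") as template:
        html = template.read().replace("/*__DATA__*/", payload)
    display(HTML(html))
```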
In short, we have adopted a tech stack consisting mainly of:
- IPython for embedding web-based visualizations in JupyterLab.
- pandas as the DataFrame package.
- React as the front-end framework.
- Lodash for its utilities.
- MUI as the UI component library.
- D3 for the math needed to create the visualizations, such as the scales (rendering/DOM is React’s responsibility).
- visx as the Data Visualization component library (for axes, for example).
We checked out some component libraries and opted for MUI. It has all the UI components we needed and an icon set. One of these components is the tooltip, whose implementation allows it to be used together with SVG and Canvas elements. We also chose MUI for its ease of use, documentation, and maturity.
Finally, the Overview chart area is implemented with Canvas for scalability reasons (typically, there are too many cells to use SVG properly). On the other hand, the Histogram and all guides, such as axes, are SVG (implemented with visx and custom code).
Future Work
Feature Investigation is a system for automatically detecting and analyzing data drifts. It supports data scientists in mitigating performance degradation that could negatively impact the business and its users. In addition to data drifts, it can help detect internal errors or data quality issues. Thus, we developed two complementary visualizations designed for data scientists to extract insights and define the next steps.
To validate the “final” visualizations, we met with one of the data scientists who helped us define the requirements. Over 45 minutes and following a think-aloud protocol, we gathered several findings — check out our short paper for more details! In fact, this blog post is not about the final version of our work. In the future, we want to:
- Conduct more user testing and interview users after using Feature Investigation in their projects.
- Implement features from our backlog that were not considered for the current version (e.g., improved Y-axis label truncation and screenshotting).
- Improve support for keyboard navigation and screen readers.
- Explore ways to integrate and visualize individual instances.
- Evaluate reimplementing both visualizations as a JupyterLab extension using ipywidgets as a framework. One advantage is the synchronization of data between Python and the front-end.
Finally, if you have any questions or feedback, let us know in the comments section.
Beatriz Feliciano, Francisca Calisto, Hugo Ferreira, Javier Pérez, Ricardo Moreira, Rita Costa, and Sanjay Salomon: thank you so much for your feedback and help on this blog post!
Beatriz Malveiro: it was great working with you on this project and reaching such a result — thank you very much!