Detecting Data Quality Issues: A Practical Guide with WhyLabs

Piyush Talreja
11 min read · Mar 27, 2024


Run AI With Certainty — WhyLabs

Deploying a model into production is like sending it into an ever-changing world, where the data it encounters can be vastly different from what it was trained on. Without a way to track and adapt to these shifts, the model’s performance could degrade, leading to inaccurate predictions and costly mistakes. WhyLabs serves as a critical tool in this scenario, providing the means to monitor the data landscape continuously, identify issues early, and ensure that the model remains accurate and reliable over time.

What is WhyLabs?

Source: WhyLabs

WhyLabs is an advanced platform designed to assist data scientists and machine learning engineers in monitoring and evaluating the performance of their models in production environments. It provides in-depth insights into model behavior over time by tracking variations in data distributions and pinpointing potential degradation in model performance. This capability empowers users to maintain high-quality AI applications, reducing the risk of costly errors. In simpler terms, it functions like a diagnostic tool that keeps a vigilant eye on AI models to ensure they are learning correctly and making accurate predictions.

What is Drift?

In the context of machine learning models, “drift” is a term used to describe the change in the data patterns or the relationships between input and output variables over time. This change can lead to a model’s performance degrading since the model’s assumptions about the data no longer hold true.

Data Drift: A change in the distribution of the model’s input data. For example, a movie recommender might suddenly start receiving ratings mostly from a new user demographic it rarely saw during training.

Concept Drift: A change in the relationship between the inputs and the desired outcome. For example, viewer tastes may shift over time, so the same user features now predict different ratings.

Source: WhyLabs

What Does WhyLabs Offer?

  • Data Drift Detection: WhyLabs identifies shifts in data distributions, enabling teams to swiftly address changes that could affect model accuracy.
  • Model Performance Monitoring: The platform continuously assesses model output, ensuring that performance remains at peak levels and alerting to any degradation.
  • Privacy Preservation: The platform tracks statistical data profiles rather than raw data, protecting user privacy and complying with data regulations.

Scope of this blog:

This blog focuses on using WhyLabs to identify data drift and walks through setting up email notifications that alert users when drift is observed.

Setup & Installations

  1. Sign Up

To get started with WhyLabs, head over to the WhyLabs website and look for the sign-up section to create your account.

https://auth.whylabsapp.com/u/signup

2. Getting an Access Token

I have recorded a video that will guide you through the process of obtaining an access token.

WhyLabs: Getting an Access Token

3. Creating your first model in WhyLabs

Once you get the access token, you can create a new “model resource” in WhyLabs. A model resource is a container that will store the metadata about your ML model.

WhyLabs: Creating a new model

After the project is created, the current state of the project will look like the below image. You can check this state by heading to the WhyLabs dashboard.

WhyLabs: Model Summary & Layout

We can observe that there are currently no data profiles and that monitoring is not yet configured. This will be updated once we create profiles.

Note: The creation of profiles via Python code will be covered in the later part of this blog.

After completing steps 2 and 3, ensure you have the following three items available: the API Key, the Org-ID, and the Project-ID (also known as the Model-ID).

4. Installing the WhyLogs package in Python

WhyLabs leverages WhyLogs as a foundational component within its platform for monitoring and debugging machine learning models in production. WhyLogs provides efficient data logging capabilities, enabling WhyLabs to collect statistical summaries of data used by machine learning models without storing raw data, facilitating streamlined monitoring and analysis. This integration enables WhyLabs to offer comprehensive insights into model behavior and performance while optimizing resource usage.

Install the whylogs package with the pip package manager. Depending on your whylogs version, the WhyLabs writer may require the whylabs extra (i.e., pip install "whylogs[whylabs]"):

!pip install whylogs

Using WhyLabs with Movie Streaming Scenario

We use past user data to train movie recommendation models, but the data encountered after deployment can be different due to changing user habits or new content. WhyLabs helps by monitoring this live data and notifying us if there are any unexpected changes or new patterns. This monitoring assists in keeping the data used for recommendations accurate. WhyLabs also identifies any potential bias or decline in how well the model is working, which supports data scientists in updating the model to keep up with the latest viewer preferences while ensuring user privacy is protected.

Identifying Data Drift Using WhyLabs

  1. Data Overview

To analyze data drift, I have simulated data spanning seven different days, from March 20, 2024, to March 26, 2024. Each day’s dataset represents a snapshot of the data landscape at that specific time.

Below is a snapshot of a part of the dataset for March 20th, which will serve as our baseline for comparison. By leveraging WhyLabs, we can track changes in the dataset across these days and detect any significant drift.
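The raw data files are not included in the post, but a minimal sketch of how such daily snapshots could be simulated is shown below. The column names are taken from the profile view discussed later (user_id, movie_name, genre, occupation, rating); the movie titles, genres, occupations, and row count are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N_ROWS = 500  # rows per daily batch (arbitrary)

def make_batch(date: str, rating_choices: list) -> pd.DataFrame:
    """Simulate one day's snapshot and save it as data_<date>.csv."""
    df = pd.DataFrame({
        "user_id": rng.integers(1, 1000, size=N_ROWS),
        "movie_name": rng.choice(["Movie A", "Movie B", "Movie C"], size=N_ROWS),
        "genre": rng.choice(["Drama", "Comedy", "Sci-Fi"], size=N_ROWS),
        "occupation": rng.choice(["student", "engineer", "artist"], size=N_ROWS),
        "rating": rng.choice(rating_choices, size=N_ROWS),
    })
    df.to_csv(f"data_{date}.csv", index=False)
    return df

# Baseline and "no drift" days: ratings spread across 1-5
baseline = make_batch("2024-03-20", [1, 2, 3, 4, 5])
# A drifted day: every rating forced to 1 star
drifted = make_batch("2024-03-24", [1])
```

With files generated this way, the rest of the tutorial's loading and profiling code runs unchanged.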

2. Import relevant libraries and set authentication keys

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
import numpy as np
import pandas as pd
import datetime
import pytz
import os

# set authentication & project keys
# (avoid hardcoding real API keys; prefer an environment variable or secrets manager)
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'org-gS6frY'
os.environ["WHYLABS_API_KEY"] = 'BQCxVBBu66.hZxPJ8yRh9WpJSHa1alFuyJXPUKTUfmPsseHPKirVKUAj0ffdgDoq:org-gS6frY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'model-5'

3. Load Data

The data files named data_2024-03-24, data_2024-03-25, and data_2024-03-26 contain ratings that differ significantly from our baseline data. I’m attempting to set up a data drift alert specifically for the ratings feature using this data. The same steps can be followed to detect drift on any other feature.

# Import data batches

# Reference data
X_batch_1 = pd.read_csv('data_2024-03-20.csv')

# No Drift
X_batch_2 = pd.read_csv('data_2024-03-21.csv')

# No Drift
X_batch_3 = pd.read_csv('data_2024-03-22.csv')

# No Drift
X_batch_4 = pd.read_csv('data_2024-03-23.csv')

# Drift
X_batch_5 = pd.read_csv('data_2024-03-24.csv')

# Drift
X_batch_6 = pd.read_csv('data_2024-03-25.csv')

# Drift
X_batch_7 = pd.read_csv('data_2024-03-26.csv')


dfs = [X_batch_1, X_batch_2, X_batch_3, X_batch_4, X_batch_5, X_batch_6, X_batch_7]

4. Creating a profile in WhyLabs

A profile in the context of WhyLabs usually represents a dataset profile at a specific point in time or for a specific batch of data. Now, I am going to create a single profile for the data collected on March 20th.

This code logs a pandas DataFrame X_batch_1 to WhyLabs for analysis and then writes the generated profile to a file using a WhyLabsWriter object.

# WhyLabs needs timezone aware datetime object
dt = datetime.datetime(2024, 3, 20, 00, 00, tzinfo=pytz.UTC)

# Single Profile for 1st Batch
writer = WhyLabsWriter()
profile= why.log(X_batch_1)
profile.set_dataset_timestamp(dt)
writer.write(file=profile.view())

This will create a single profile for our first batch. Let’s check out the WhyLabs dashboard.

Select your WhyLabs project, and head over to the profiles tab.

WhyLabs Profile Analysis

The Profiles tab in the WhyLabs platform displays various statistical metrics and metadata about a machine learning model’s data.

Select Resource: This allows you to select the model or data resource for which you want to view profiles. Here, ‘MovieRecommenderSystem Model’ is selected.

Select Segment: You can segment your data to see profiles for specific subsets; ‘All data’ is currently selected.

Profiles: This section allows you to select and compare different dataset profiles. These profiles (P1, P2, P3, etc.) represent snapshots or versions of the dataset at different times or under different conditions. You can add additional profiles for comparison using the “Add profile” button.

Batch Profile Lineage: It displays the range of dates for the selected batch profile, allowing users to understand the time frame of the data they are viewing.

Column Name and Graphs: For each column, there is a histogram that represents the distribution of values within that column for the selected profile (P1). The columns shown are ‘genre’, ‘movie_name’, ‘occupation’, ‘rating’, and ‘user_id’.

Column Metrics: For each column, several metrics are presented:

  • Frequent Items: Common values or categories within the column.
  • Discreteness: Whether the column holds discrete or continuous values.
  • Total Count: The number of values within the column.
  • Null Fraction: The proportion of missing values.
  • Estimated Unique Values: An estimate of the number of unique entries in the column.

5. Configure Data Drift Monitor for Rating Column in WhyLabs

In the video below, I have demonstrated how to enable a drift monitor for the ‘rating’ column.

Enabling a Drift Monitor

Here, I have selected the reference date range as 20th March, which is going to serve as the baseline to which future input distributions will be compared. I have set the drift threshold to 0.6; if the drift score exceeds this number, the monitor flags it as drift.

Configuring Actions for Email Alerts

In the above video, I selected pre-defined action rating-drift which sends an email to the specified email address in case of an anomaly.

To create a new action, click the “here” hyperlink next to the Edit notification actions settings.

Creating Actions

You can enter your email address and provide a unique ID for this action. This ID will then be displayed in the Actions drop-down menu when you create or edit a monitor.

Creating Email Action

6. Creating more profiles for the remaining data.

As mentioned earlier, I created a single profile for the dataset snapshot dated March 20th. Moving forward, I created additional profiles for the incoming data, ranging from March 21st to March 26th. This approach is tailored for a production environment where data is continuously received on a daily basis.

In the data for March 24th and 25th, every rating was deliberately set to 1 star. On March 26th, all the ratings were 4 stars. I chose these values intentionally to trigger drift detection in our monitoring system.

# Profiles for March 21st through 26th
# (the last three batches contain the intentionally drifted data)
writer = WhyLabsWriter()
remaining_batches = [X_batch_2, X_batch_3, X_batch_4, X_batch_5, X_batch_6, X_batch_7]

for day, batch in zip(range(21, 27), remaining_batches):
    dt = datetime.datetime(2024, 3, day, 0, 0, tzinfo=pytz.UTC)
    profile = why.log(batch)
    profile.set_dataset_timestamp(dt)
    writer.write(file=profile.view())

7. Data Drift Analysis

The screenshot below shows that the next monitor run is scheduled in about 5 hours, but we can preview the analysis for the current snapshot by clicking the “Preview Analysis” button for the rating feature in the Inputs tab of the project.

Rating Distribution across 7 days

After clicking on the “Preview Analysis” button.

Drift Detected

We can see that data drift occurred on the 24th, 25th, and 26th, as the Hellinger distance metric exceeded the drift threshold (0.6) on those days in the drift monitoring chart.

The Hellinger distance is a number that tells us how much two distributions differ from each other: a value near 0 means they are very similar, and a value near 1 means they barely overlap.
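To build intuition for the 0.6 threshold, here is a small standalone computation of the Hellinger distance between two rating histograms. The formula is standard; the histograms themselves are made up for illustration.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.
    0 means identical; values near 1 mean the distributions barely overlap."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

# Hypothetical rating histograms (counts of 1-5 stars)
baseline_hist = [10, 20, 40, 20, 10]  # ratings spread around 3 stars
drifted_hist = [90, 5, 3, 1, 1]       # almost everyone rated 1 star

print(hellinger_distance(baseline_hist, baseline_hist))  # 0.0
print(hellinger_distance(baseline_hist, drifted_hist))   # ~0.64, above the 0.6 threshold
```

A day of all-1-star ratings pushes the distance well past 0.6 relative to the baseline, which is exactly the behavior the drift monitor flagged.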

Distance > Drift Threshold

8. Drift Alert via Email

Email Alert

9. Collaboration and Team Workflow

WhyLabs offers collaborative features that enable teams to work together on monitoring and debugging ML models. It provides shared dashboards, alerts, and insights, facilitating collaboration among data scientists, engineers, and other stakeholders.

In conclusion, by following the steps outlined above, we can effectively configure WhyLabs to detect drift in our data. This ensures continuous monitoring of our model’s performance, allowing us to maintain its accuracy and reliability in a live production environment.

10. Future Scope

In future iterations, I plan to use this tool for data constraint validation and to enhance data quality assurance. Additionally, it could serve as a means to detect and mitigate bias.

Strengths and Limitations

Strengths

  • Data Handling: WhyLabs supports both structured and unstructured data and integrates seamlessly with existing data pipelines and multi-cloud architectures.
  • Scalability: It is capable of handling large data volumes and transforming them into actionable insights efficiently.
  • Security and Privacy: It offers secure integration by analyzing raw data without moving or duplicating it, ensuring data privacy, which is especially crucial for the healthcare and banking sectors.
  • Ease of Use: It is designed for fast and privacy-friendly integration, with a straightforward presentation of monitoring results and strong data privacy.
  • LLM Security Features: It provides strategies and tools for detecting and preventing prompt injections and jailbreak attempts in Large Language Models (LLMs).
  • Language Support: WhyLabs stands out for its support of multiple programming languages, enhancing its usability across various technical environments rather than restricting it to Python-only applications.

Limitations

  • Support and Learning Materials: The platform has limited documentation, issue logs, and tutorials, potentially complicating the user’s ability to resolve issues or learn how to use WhyLabs effectively. One possible solution that I would recommend to the WhyLabs team is to create mini-video tutorials for explaining one functionality at a time.
  • Room for Enhancement in the UI/UX: The WhyLabs UI is a little difficult for first-time users to understand. I personally spent a long time figuring out how to delete a project model through the UI.
  • Limited Profile Creation in Free Tier: Users on the free tier can only create two models.
  • Interpretability of Insights: While WhyLabs provides insights into model performance and data behavior, interpreting these insights and translating them into actionable recommendations might require some expertise. Improving the clarity and context provided with insights could help users better understand and utilize the information provided by WhyLabs.

Comparison with Evidently AI

Evidently AI is a competitor to WhyLabs in the field of machine learning tools for AI monitoring and observability.

  • While Evidently AI is restricted to the Python programming language, WhyLabs offers support for multiple languages.
  • Evidently AI generates in-line reports within Jupyter Notebooks, whereas WhyLabs operates as a UI-based tool.
  • WhyLabs offers enhanced collaborative capabilities compared to Evidently AI.
  • Evidently AI provides better interpretability of insights compared to WhyLabs.

Key Takeaways

I used this tool to detect data drift and enable email alerts. Here are the key value additions:

  • This tutorial offers practical insights into detecting shifts in data distribution.
  • This tutorial offers detailed guidance on setting up email alerts within WhyLabs to notify of anomalies promptly.

Disclosure

This blog is affiliated with the CMU Machine Learning in Production course. For access to the complete set of code for implementation and data files, please visit the GitHub repository: https://github.com/piyush-talreja/MLIP-I3-WhyLabs

References

  1. https://docs.whylabs.ai/docs/
