WhyLabs Weekly MLOps: Data Quality Validation

Detecting data quality issues, combining the power of LLMs and computer vision, ML monitoring for data drift, quality, and bias

Sage Elliott
WhyLabs
5 min read · Aug 4, 2023

A lot happens every week in the WhyLabs Robust & Responsible AI (R2AI) community! This weekly update serves as a recap so you don’t miss a thing!

Start learning about MLOps and ML Monitoring:

💡 MLOps tip of the week:

Use data validation to detect data quality issues with the open source library, whylogs.

Data quality check example

Once you install whylogs in your Python environment (e.g., `pip install whylogs`), you can create profiles of your datasets with just a few lines of code! These data profiles contain only summary statistics about your dataset, so they can be used to monitor for data drift and data quality issues across all industries, even with sensitive data.

import whylogs as why
import pandas as pd

# profile pandas dataframe
df = pd.read_csv("path/to/file.csv")
profile = why.log(df) # can also log other data types
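
If you want to peek at what a profile contains, the summary statistics can be pulled into a pandas DataFrame. Below is a minimal sketch assuming the `profile` result set from above; `view()` and `to_pandas()` are the usual whylogs accessors for the profile summary.

# Inspect the summary statistics captured in the profile
summary_df = profile.view().to_pandas()

# Each row is a column from the original dataset; columns are whylogs metrics
print(summary_df.head())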

To detect data quality issues with whylogs, we'll use the built-in validation features. These essentially let us write unit tests for our data!

In this example, we'll write a test for each of the four features in the iris dataset to make sure they're within the expected range.
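
If you want to follow along end to end, here's one way to build the profile we'll validate. This is a small sketch that assumes the iris data is loaded from scikit-learn as a DataFrame, which is where column names like 'petal length (cm)' come from.

import whylogs as why
from sklearn.datasets import load_iris

# Load iris as a pandas DataFrame (columns like 'petal length (cm)')
iris_df = load_iris(as_frame=True).frame

# Profile the DataFrame with whylogs
profile = why.log(iris_df)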

Start by importing the constraints and metric selector methods from whylogs.

# Data Quality Validation whylogs
from whylogs.core.constraints import (
    Constraints,
    ConstraintsBuilder,
    MetricsSelector,
    MetricConstraint,
)

Each block that contains `builder.add_constraint` defines a data validation constraint. It consists of:

  • name — a string name for the constraint; I like to use it to describe the condition being checked
  • condition — defines the passing condition for the selected metric
  • metric_selector — specifies which metric from the whylogs profile to check; it can look at a numerical value from the distribution metric, a specific data type, etc.

# Using Constraints for Data Quality Validation
def validate_features(profile_view, verbose=False):
    builder = ConstraintsBuilder(profile_view)

    # Define a constraint for validating data
    builder.add_constraint(MetricConstraint(
        name="petal length > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='petal length (cm)')
    ))

    builder.add_constraint(MetricConstraint(
        name="petal width > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='petal width (cm)')
    ))

    builder.add_constraint(MetricConstraint(
        name="sepal length > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='sepal length (cm)')
    ))

    builder.add_constraint(MetricConstraint(
        name="sepal width > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='sepal width (cm)')
    ))

    # Build the constraints object
    constraints: Constraints = builder.build()

    # Optionally print the report as (name, passed, failed) tuples
    if verbose:
        print(constraints.report())

    return constraints

We can now pass a whylogs profile view into our function to generate a constraint report.

# Pass the profile view into our validation function
const = validate_features(profile.view(), True)

The report returns a list with a tuple for each constraint, such as `[('petal length > 0 and < 15', 1, 0)]`.

Each tuple can be read as `('name', passed, failed)`, so this example indicates the constraint passed.

[('petal length > 0 and < 15', 1, 0), ('petal width > 0 and < 15', 1, 0), ('sepal length > 0 and < 15', 1, 0), ('sepal width > 0 and < 15', 1, 0)]
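
If you prefer a quick plain-text summary over scanning the raw tuples, you can loop over the report yourself. A small sketch, assuming the `const` object returned above:

# Print a PASS/FAIL line for each constraint in the report
for name, passed, failed in const.report():
    status = "PASS" if failed == 0 else "FAIL"
    print(f"{status}: {name}")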

Let's visualize our data quality report to make it easier to read, especially if we have a lot of constraints!

from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(const, cell_height=300)

This generates a visual data quality report that we can filter and search!

Data quality report in whylogs

If we simply want a single pass or fail value for the report, we can call `.validate()` on the constraints object.

# check all constraints for passing:
constraints_valid = const.validate()
print(constraints_valid)

If all data quality constraints pass, this returns `True`; if any constraint fails, it returns `False`.
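
This makes it easy to gate a pipeline step on data quality. A minimal sketch (the error handling here is just an illustration, not part of whylogs):

# Stop the pipeline run if any data quality constraint fails
if not const.validate():
    raise ValueError("Data quality validation failed - see the constraints report")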

Learn more about detecting data quality with whylogs:

📝 Latest blog posts:

Hugging Face and LangKit: Your Solution for LLM Observability

Hugging Face has quickly become a leading name in the world of natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations turn to Hugging Face’s language models for their NLP needs, the need for robust monitoring and observability solutions becomes more apparent. Read more on WhyLabs.AI

7 Ways to Monitor Large Language Model Behavior

In the ever-evolving landscape of AI, Large Language Models (LLMs) have revolutionized Natural Language Processing. With their remarkable ability to generate coherent and contextually relevant human-like text, LLMs have gained immense importance and adoption, transforming the way we interact with technology. Read more on WhyLabs.AI

🎥 Event recordings

Intro to ML Monitoring: Data Drift, Quality, Bias and Explainability — Sage Elliott

In this workshop we covered detecting data drift, measuring model drift, monitoring model performance, data quality validation, measuring bias & fairness, and model explainability!

Intro to ML monitoring: Data drift, Data Quality, Model Bias and Explainability

📅 Upcoming R2AI & WhyLabs Events:

Join this workshop on August 9th — RSVP on Eventbrite

💻 WhyLabs open source updates:

whylogs v1.2.7 has been released!

whylogs is the open standard for data logging & AI telemetry. This week’s update includes:

  • Add rolling logger to end of Writing_to_WhyLabs example notebook
  • Getting Started w/ WhyLabs — change html to make link open in new tab
  • Put all unit test UDFs in named schema

See full whylogs release notes on Github.

LangKit 0.0.13 has been released!

LangKit is an open-source text metrics toolkit for monitoring language models.

  • Add helper method to configure metadata logging

See full LangKit release notes on Github.

🤝 Stay connected with the WhyLabs Community:

Join the thousands of machine learning engineers and data scientists already using WhyLabs to solve some of the most challenging ML monitoring cases!

Request a demo to learn how ML monitoring can benefit your company.

See you next time! — Sage Elliott
