WhyLabs Weekly MLOps: Data Quality Validation
Detecting data quality issues, combining the power of LLMs and computer vision, ML monitoring for data drift, quality, and bias
A lot happens every week in the WhyLabs Robust & Responsible AI (R2AI) community! This weekly update serves as a recap so you don’t miss a thing!
Start learning about MLOps and ML Monitoring:
- 📅 Join the next event: LLMs in Production: Lessons Learned
- 💻 Check out our open source projects whylogs & LangKit!
- 💬 Join 1,185 Robust & Responsible AI Slack members
- 🤝 Request a demo to learn how ML monitoring can benefit you
💡 MLOps tip of the week:
Use data validation to detect data quality issues with the open source library, whylogs.
Once you install whylogs in your Python environment using `pip`, you can create profiles of your datasets with just a few lines of code! These profiles contain only summary statistics about your dataset, so they can be used to monitor for data drift and data quality issues across industries, even with sensitive data.
import whylogs as why
import pandas as pd
# profile pandas dataframe
df = pd.read_csv("path/to/file.csv")
profile = why.log(df) # can also log other data types
To detect data quality issues with whylogs, we’ll use the built-in validation features. This essentially lets us write unit tests for our data!
In this example we’ll write a test for all four features in the iris dataset to make sure they’re within the expected range.
Start by importing the constraints and metric selector methods from whylogs.
# Data Quality Validation with whylogs
from whylogs.core.constraints import (
    Constraints,
    ConstraintsBuilder,
    MetricsSelector,
    MetricConstraint,
)
Each block that contains `builder.add_constraint` defines a data validation constraint, which consists of:
- `name` — A string name for the constraint; I like to use it to describe the condition.
- `condition` — Defines the passing condition for the selected metric.
- `metric_selector` — Specifies which metric in the whylogs profile to check. It can check for a numerical value from a distribution, a specific data type, etc.
# Using Constraints for Data Quality Validation
def validate_features(profile_view, verbose=False):
    builder = ConstraintsBuilder(profile_view)

    # Define a constraint for each feature's expected range
    builder.add_constraint(MetricConstraint(
        name="petal length > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='petal length (cm)')
    ))
    builder.add_constraint(MetricConstraint(
        name="petal width > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='petal width (cm)')
    ))
    builder.add_constraint(MetricConstraint(
        name="sepal length > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='sepal length (cm)')
    ))
    builder.add_constraint(MetricConstraint(
        name="sepal width > 0 and < 15",
        condition=lambda x: x.min > 0 and x.max < 15,
        metric_selector=MetricsSelector(metric_name='distribution',
                                        column_name='sepal width (cm)')
    ))

    # Build the constraints and optionally print the report
    constraints: Constraints = builder.build()
    if verbose:
        print(constraints.report())
    return constraints
We can now pass a view of our whylogs data profile to the function to generate a constraint report.
const = validate_features(profile.view(), verbose=True)
The report returns a list with a tuple for each constraint, such as `[('petal length > 0 and < 15', 1, 0)]`.
Each tuple reads as `('Name', Pass, Fail)`, so this example indicates the constraint passed.
[('petal length > 0 and < 15', 1, 0), ('petal width > 0 and < 15', 1, 0), ('sepal length > 0 and < 15', 1, 0), ('sepal width > 0 and < 15', 1, 0)]
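Because the report is just a list of `(name, pass, fail)` tuples, it's easy to pull out failures programmatically. A small sketch over a hand-written report list (the failing entry is hypothetical):

```python
# example report in the [(name, passed, failed), ...] format shown above
report = [
    ("petal length > 0 and < 15", 1, 0),
    ("petal width > 0 and < 15", 0, 1),  # hypothetical failure
]

# collect the names of any constraints that failed
failed = [name for name, passed, failed_count in report if failed_count > 0]
print(failed)  # -> ['petal width > 0 and < 15']
```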
Let’s visualize our data quality report to make it easier to read, especially if we have a lot of constraints!
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(const, cell_height=300)
This generates a visual data quality report that we can filter and search!
If we simply want a pass or fail value for the whole report, we can call `.validate()`.
# check all constraints for passing:
constraints_valid = const.validate()
print(constraints_valid)
If all data quality constraints pass, this returns `True`; if any constraint fails, it returns `False`.
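This boolean makes it easy to gate a pipeline step on data quality. A minimal sketch (the raise-on-failure behavior is my own convention, not part of whylogs):

```python
def require_valid(constraints_valid: bool) -> None:
    """Halt the pipeline if any data quality constraint failed."""
    if not constraints_valid:
        raise ValueError("Data quality validation failed; see the constraints report.")

require_valid(True)   # passes silently
# require_valid(False) would raise ValueError, stopping downstream steps
```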
Learn more about detecting data quality with whylogs:
📝 Latest blog posts:
Hugging Face and LangKit: Your Solution for LLM Observability
Hugging Face has quickly become a leading name in the world of natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations turn to Hugging Face’s language models for their NLP needs, the need for robust monitoring and observability solutions becomes more apparent. Read more on WhyLabs.AI
7 Ways to Monitor Large Language Model Behavior
In the ever-evolving landscape of AI, Large Language Models (LLMs) have revolutionized Natural Language Processing. With their remarkable ability to generate coherent and contextually relevant human-like text, LLMs have gained immense importance and adoption, transforming the way we interact with technology. Read more on WhyLabs.AI
🎥 Event recordings
Intro to ML Monitoring: Data Drift, Quality, Bias and Explainability — Sage Elliott
In this workshop we covered detecting data drift, measuring model drift, monitoring model performance, data quality validation, measuring bias & fairness and model explainability!
📅 Upcoming R2AI & WhyLabs Events:
- 8/9 Combining the Power of LLMs with Computer Vision — Jacob Marks, Voxel51
- 8/17 Building Better Computer Vision Models — Harpreet Sahota at Deci AI
- 8/23 Build and Monitor Computer Vision Models with TensorFlow/Keras
- 9/6 Monitoring LLMs in Production with Hugging Face & WhyLabs
💻 WhyLabs open source updates:
whylogs v1.2.7 has been released!
whylogs is the open standard for data logging & AI telemetry. This week’s update includes:
- Add rolling logger to end of Writing_to_WhyLabs example notebook
- Getting Started w/ WhyLabs — change html to make link open in new tab
- Put all unit test UDFs in named schema
See full whylogs release notes on Github.
LangKit 0.0.13 has been released!
LangKit is an open-source text metrics toolkit for monitoring language models.
- Add helper method to configure metadata logging
See full LangKit release notes on Github.
🤝 Stay connected with the WhyLabs Community:
Join the thousands of machine learning engineers and data scientists already using WhyLabs to solve some of the most challenging ML monitoring cases!
- 1,185+ Robust & Responsible AI Slack members
- 2,320+ whylogs GitHub Stars
- 1,147+ Robust & Responsible AI Meetup Members
- 9,297+ WhyLabs LinkedIn followers
- 892+ WhyLabs Twitter followers
Request a demo to learn how ML monitoring can benefit your company.
See you next time! — Sage Elliott