WhyLabs Weekly: Creating powerful data profiles

Create privacy preserving data profiles that can be used in data and ML monitoring applications

Sage Elliott
WhyLabs
4 min read · Aug 25, 2023


A lot happens every week in the WhyLabs Robust & Responsible AI (R2AI) community! This weekly update serves as a recap so you don’t miss a thing!

Start learning about MLOps and ML Monitoring:

💡 MLOps tip of the week:

Last week we looked at how to monitor ML models for data drift in the WhyLabs Observatory platform. This week we’ll take a closer look at the profiles that get created with whylogs. These data profiles contain only summary statistics about your dataset and can be used for monitoring data drift, ML performance, and data quality issues.

Once you install whylogs using `pip`, you can create a profile of your dataset with just a few lines of code! You can run this example in a notebook here.

# import whylogs and pandas
import whylogs as why
import pandas as pd

# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)

Let’s create a quick example dataset from a Python dictionary. We’ll include several different data types.

# create a simple test dataset
data = {
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
               "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
}

# Create dataframe with test dataset
df = pd.DataFrame(data)

Next, we’ll create a data profile with whylogs and view it in a pandas dataframe.

# Log data with whylogs & create profile
results = why.log(pandas=df)
profile = results.profile()

# Create profile view dataframe
prof_view = profile.view()
prof_df = prof_view.to_pandas()

# View Profile dataframe for dataset statistics
prof_df

The profile dataframe has one row for each column of the logged dataset. Each column of the statistics dataframe contains a specific dimension of a given metric.

Let’s take a quick look at the generated statistics:

animal

The animal row shows 9 entries (counts/n), all of type string. Cardinality estimates 7 distinct animal types in the dataset, and the frequent items metric shows jellyfish appearing most often.

weight

Our weight data contains 9 entries, all of them fractional values. Cardinality estimates that all 9 values are unique. Since every entry is numerical, the distribution statistics (mean, standard deviation, quantiles) are generated.

legs

We can see that there are 9 entries for leg values, but they span several data types: 3 null, 4 integral, 1 fractional, and 1 string. Cardinality estimates 5 unique values, and the most frequent leg count in the dataset is 4.

These lightweight data profiles can be used to monitor for data quality issues, data drift, and ML performance degradation using other features in the whylogs library; see how to do that with links below 👇

Learn more about the whylogs profile statistics and using them for ML monitoring:

📝 Latest blog posts:

Hugging Face and LangKit: Your Solution for LLM Observability

Hugging Face has quickly become a leading name in the world of natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations turn to Hugging Face’s language models for their NLP needs, the need for robust monitoring and observability solutions becomes more apparent. Read more on WhyLabs.AI

🎥 Event recordings

Build and Monitor Computer Vision Models with TensorFlow/Keras + WhyLabs

If you want to build reliable computer vision pipelines with trustworthy data and responsible ML models, you’ll need to monitor your models and data.

In this workshop, we cover how to use ML monitoring techniques to implement your own AI observability solution for computer vision classification applications.

📅 Upcoming R2AI & WhyLabs Events:

Join this workshop on September 6th — RSVP on Eventbrite

💻 WhyLabs open source updates:

whylogs v1.3.0 has been released!

whylogs is the open standard for data logging & AI telemetry. This week’s update includes:

  • Don’t log session init warning if running outside of notebook or ipython
  • Preserve metadata when uncompounding DatasetProfileView
  • Update example notebook schema documentation

See full whylogs release notes on Github.

LangKit 0.0.16 has been released!

LangKit is an open-source text metrics toolkit for monitoring language models.

  • Allow has_patterns to be called on strings
  • Simplify create pull-request-action workflow

See full LangKit release notes on Github.

🤝 Stay connected with the WhyLabs Community:

Join the thousands of machine learning engineers and data scientists already using WhyLabs to solve some of the most challenging ML monitoring cases!

Request a demo to learn how ML monitoring can benefit your company.

See you next time! — Sage Elliott
