WhyLabs Weekly: Creating powerful data profiles
Create privacy preserving data profiles that can be used in data and ML monitoring applications
A lot happens every week in the WhyLabs Robust & Responsible AI (R2AI) community! This weekly update serves as a recap so you don’t miss a thing!
Start learning about MLOps and ML Monitoring:
- 📅 Join the next event: Monitoring LLMs in Production with Hugging Face & WhyLabs
- 💻 Check out our open source projects whylogs & LangKit!
- 💬 Join 1,249+ Robust & Responsible AI Slack members
- 🤝 Request a demo to learn how ML monitoring can benefit you
💡 MLOps tip of the week:
Last week we looked at how to monitor ML models for data drift in the WhyLabs Observatory platform. This week we’ll take a closer look at the profiles created with whylogs. These data profiles contain only summary statistics about your dataset and can be used to monitor data drift, ML performance, and data quality issues.
Once you install whylogs with `pip install whylogs`, you can create a profile of your dataset with just a few lines of code! You can run this example in a notebook here.
# import whylogs and pandas
import whylogs as why
import pandas as pd
# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)
Let’s create a quick example dataset from a Python dictionary. We’ll include several different data types.
# Create a simple test dataset with mixed data types
data = {
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
               "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
}
# Create dataframe with test dataset
df = pd.DataFrame(data)
Next, we’ll create a data profile with whylogs and view it in a pandas dataframe.
# Log data with whylogs & create profile
results = why.log(pandas=df)
profile = results.profile()
# Create profile view dataframe
prof_view = profile.view()
prof_df = prof_view.to_pandas()
# View Profile dataframe for dataset statistics
prof_df
The profile dataframe has one row for each column of the logged data. Each column of the statistics dataframe corresponds to a specific dimension of a whylogs Metric, such as counts or cardinality.
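To build intuition for what those profile statistics hold, here is a plain-pandas sketch that recomputes a few of them by hand for the animal column. The `counts/n`-style keys simply mirror the profile dataframe's column names; note that whylogs computes cardinality with a streaming sketch, so there it is an estimate rather than an exact count like `nunique()`.

```python
import pandas as pd

# The "animal" column from the example dataset above
col = pd.Series(["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
                 "jellyfish", "jellyfish", "fish"])

# Recompute a few profile statistics exactly; whylogs approximates
# cardinality with a sketch instead of an exact nunique()
summary = {
    "counts/n": len(col),                               # total entries
    "counts/null": int(col.isna().sum()),               # missing values
    "cardinality/est": col.nunique(),                   # distinct values
    "frequent_items/top": col.value_counts().idxmax(),  # most common value
}
print(summary)
# → {'counts/n': 9, 'counts/null': 0, 'cardinality/est': 7, 'frequent_items/top': 'jellyfish'}
```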
Let’s take a quick look at the generated statistics:
animal
The animal row shows 9 entries (counts/n), all of type string. Cardinality estimates 7 distinct animal types in the dataset, and the frequent items metric shows jellyfish appearing most often.
weight
Our weight data contains 9 entries, all of them fractional values. Cardinality estimates that all 9 values are unique. Since every entry is numerical, the distribution statistics are also generated.
legs
We can see that there are 9 entries for leg values, but they span several different data types: 3 null, 4 integral, 1 fractional, and 1 string. Cardinality estimates 5 unique values. The most frequent number of legs in the dataset is 4.
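Those per-type counts can be reproduced with a quick pandas tally over the raw legs values. This is just a hand check to mirror the numbers whylogs reports in its type-count columns, not the whylogs API:

```python
import pandas as pd

# The mixed-type "legs" column from the example dataset above
legs = pd.Series([4, 0, 4, 4.0, None, 2, None, None, "fins"])

def type_name(v):
    """Bucket a value by type, treating None/NaN uniformly as null."""
    if pd.isna(v):
        return "null"
    return type(v).__name__

# Tally entries per type bucket (sorted for a stable display order)
type_counts = dict(sorted(legs.map(type_name).value_counts().items()))
print(type_counts)
# → {'float': 1, 'int': 4, 'null': 3, 'str': 1}
```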
These lightweight data profiles can be used to monitor for data quality issues, data drift, and ML performance degradation using other features in the whylogs library; see how to do that with links below 👇
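As a toy illustration of the drift idea, a check boils down to comparing a statistic from a reference profile against the same statistic from recent production data. This is not the whylogs API; the "production" sample and the 25% threshold below are made up, and real monitoring compares full distributions rather than a single mean, but the reference-vs-current pattern is the same:

```python
import pandas as pd

# Hypothetical reference (training) weights vs. current (production) weights
reference = pd.Series([14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2])
current = pd.Series([44.0, 38.5, 19.9, 61.2, 15.4, 150.3, 12.8, 11.1, 9.6])

def mean_shift(ref, cur):
    """Relative change in mean between two samples."""
    return abs(cur.mean() - ref.mean()) / abs(ref.mean())

shift = mean_shift(reference, current)
drifted = shift > 0.25  # hypothetical alert threshold
print(f"mean shift: {shift:.2f}, drifted: {drifted}")
# → mean shift: 0.92, drifted: True
```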
Learn more about the whylogs profile statistics and using them for ML monitoring:
📝 Latest blog posts:
Hugging Face and LangKit: Your Solution for LLM Observability
Hugging Face has quickly become a leading name in the world of natural language processing (NLP), with its open-source library becoming the go-to resource for developers and researchers alike. As more organizations turn to Hugging Face’s language models for their NLP needs, the need for robust monitoring and observability solutions becomes more apparent. Read more on WhyLabs.AI
🎥 Event recordings
Build and Monitor Computer Vision Models with TensorFlow/Keras + WhyLabs
If you want to build reliable computer vision pipelines with trustworthy data and responsible ML models, you’ll need to monitor your models and data.
In this workshop, we cover how to use ML monitoring techniques to implement your own AI observability solution for computer vision classification applications.
📅 Upcoming R2AI & WhyLabs Events:
- 9/6 Monitoring LLMs in Production with Hugging Face & WhyLabs
- 9/13 Intro to AI Observability: Monitoring ML Models & Data in Production
- 9/20 Monitoring LLMs in Production using OpenAI, LangChain & WhyLabs
💻 WhyLabs open source updates:
whylogs v1.3.0 has been released!
whylogs is the open standard for data logging & AI telemetry. This week’s update includes:
- Don’t log session init warning if running outside of notebook or ipython
- Preserve metadata when uncompounding DatasetProfileView
- Update example notebook schema documentation
See full whylogs release notes on Github.
LangKit 0.0.16 has been released!
LangKit is an open-source text metrics toolkit for monitoring language models.
- Allow has_patterns to be called on strings
- Simplify create pull-request-action workflow
See full LangKit release notes on Github.
🤝 Stay connected with the WhyLabs Community:
Join the thousands of machine learning engineers and data scientists already using WhyLabs to solve some of the most challenging ML monitoring cases!
- 1,249+ Robust & Responsible AI Slack members
- 2,341+ whylogs GitHub Stars
- 1,210+ Robust & Responsible AI Meetup Members
- 9,477+ WhyLabs LinkedIn followers
- 901+ WhyLabs Twitter followers
Request a demo to learn how ML monitoring can benefit your company.
See you next time! — Sage Elliott