Fire up your MLOps with a scalable, lightweight, open source data logging library
Co-author: Bernease Herman
We are thrilled to announce the open-source package whylogs. It enables data logging for any ML/AI pipeline in a few lines of code. Data logging is a critical component of an ML/AI pipeline, as it provides essential insights into the health and performance of the pipeline. We hope whylogs will make it into the toolset of every data science practitioner and researcher.
We built whylogs because in our years of troubleshooting AI systems, we observed that the vast majority of problems start with the data that the models consume. Teams frequently overestimate the resiliency of machine learning algorithms and underestimate the effects of poor data quality, leading to data dependencies that can cause costly failures. Keeping an AI model healthy requires developers and data scientists to be aware of changes in the quality and consistency of their data. Our team at WhyLabs aims to address these challenges with the open-source data monitoring library whylogs.
Data monitoring, why is it so hard?
Much of the difficulty in maintaining an ML system comes from data drift. This isn’t always due to bad data from upstream; sometimes the data coming in just changes abruptly, and the system is unprepared to deal with it. While monitoring machine learning data is similar in theory to traditional DevOps monitoring, simply applying existing DevOps systems to ML data often leads to trouble. ML data can have unusual data types that DevOps tools aren’t built to handle, or, what’s worse, contain personally identifiable information that needs extra security and privacy measures many DevOps systems are unable to implement.
Some organizations have already started investing in building dedicated monitoring systems for ML pipelines, such as Netflix’s Metaflow and Uber’s Michelangelo. However, most of these systems are so specialized that implementing them requires teams to perform resource-intensive restructuring of the pipelines they monitor.
What does a good data logging solution look like?
A good data logging system should combine the transparent, flexible methodology of traditional DevOps with the human-centric design of dedicated systems like Metaflow. To do this, it has to accomplish the following:
- Log properties of data over time, as that data moves through the ML system
- Aggregate logs, so you can use distributed systems to process more data at once
- Support a wide range of ML data types, including complex inputs like images and video
- Tag data segments with labels that the humans involved can use for deeper analysis
In a production setting, there are a few more requirements for a good data logging system:
- Use a lightweight structure that can run in parallel with model training and inference
- Integrate easily with a variety of ML pipelines and architectures
- Enable data logging at any point in a ML model’s life cycle
Since data scientists lack the specialized tools that software developers have, they often repurpose existing DevOps approaches for ML monitoring. Doing that creates significant problems, some of which we outline below:
- Log collection: DevOps solutions like Kafka and Splunk have made log collecting accessible. However, these solutions are designed for application logs, so they struggle with handling the large data volume of MLOps. There is also the risk of losing the structural information associated with ML data if log messages aren’t carefully managed.
- Metrics collection: One popular integration is Prometheus — it’s supported by many systems, including AirFlow and Kubeflow. It can be streamed to different metrics systems for storage (DataDog, New Relic etc.) and visualized using OSS tools like Grafana. However, using Prometheus can limit your ability to monitor various important metrics (average, sum, min, max, count, etc).
- API monitoring: Technologies like Tensorflow Serving and AWS SageMaker Model Hosting make model serving simple. They allow for monitoring REST API requests and responses as a method for extracting insights. While this approach provides deep insights, it doesn’t scale well because it requires every data point to be stored.
As longtime AI practitioners, we grappled with the challenges posed by applying ML to massive amounts of data first-hand. We set out on a mission to find a monitoring approach that delivers on both the DevOps performance requirements, and the data science insight requirements.
Introducing whylogs, the MLdata logging library
We engineered whylogs by working backwards from the requirements. We focused on lightweight data collection, enterprise scalability, and flexibility designed for data science workflow. We added built-in data tagging and aggregation capabilities. Furthermore, we optimized the installation to take minutes and to seamlessly integrate with existing tools . whylogs comes with the following features out of the box:
- Massively scalable due to static memory footprint: whylogs uses estimation algorithms such as HyperLogLog to build detailed statistical profiles of data. We incorporated streaming algorithms and schema tracking techniques to enable deep data insights and to capture important data quality signals.
- Lightweight output minimizes storage costs: Using these algorithms allows us to produce minimal output. No matter how massive the size of the input dataset is, the output size remains small and thus makes the solution cost-effective. Instead of growing linearly with the data as in traditional logging techniques, the size of whylogs output only depends on the number of features being tracked in the data.
- Supports interactive analysis: whylogs is also designed to fit the interactive data science workflows in a notebook environment. The output is lightweight and can be aggregated on a local machine. Additionally, certain metrics such as histograms can be re-binned, thus enabling more flexibility than traditionally static metrics. Finally, tagging support helps scientists to “slice-and-dice” the data, an important requirement of interactive analysis.
- Prevents overcollection of data: While collecting key signals are important in data science, overcollecting data creates both technical challenges and data security and privacy challenges. WhyLogs is conservative about data collection to reduce costs, security vulnerabilities and operational complexity.
So, what does whylogs actually collect?
whylogs collects metrics that include approximate statistics, counters, and sketches of data columns in per-feature statistical profiles:
- Simple counters including boolean, null values, and data types.
- Summary statistics including sum, min, max, and variance.
- Unique value counter or cardinality: whylogs uses the HyperLogLog algorithm to track an approximate unique value for each feature.
- Histograms for numerical features: whylogs binary output can be queried with dynamic binning based on the shape of your data.
- Top frequent items (default is 128). This configuration affects the memory footprint, especially for text features.
Today, whylogs supports columnar data and is available in Python and Java. Our team is working on adding support for time series, text and image data types. We are also working o adding more languages; you can request your favorite one by commenting here.
Setting Up whylogs in 5 Minutes
To get started with the Python library, simply run:
pip install whylogs
We provide a quick initialization command to configure your workspace for whylogs. Just run the following command in your project folder:
Now that you have a configuration file (.whylogs.yaml) in your workspace, whylogs can generate the correct metadata for the dataset. If you have experience with DevOps metrics systems, you can think of metadata as “tags”. The metadata will enable users to group and aggregate data.
To get started with whylogs, you’ll need to create a whylogs session. A session tracks multiple loggers and is associated with a batch or window of data.
First, create a session:
Once you have a session, you can create a Logger. A Logger is a 1–1 mapping to a dataset or a model and has a unique name associated with it. You can create different loggers with different names under the same session:
A few ways to log data:
Writing log data to disk: when a logger is closed, the dataset profile it created will be written to disk. whylogs can write out to different formats depending on the YAML configuration. By default, the logger will write to the local path for all available formats. You can check out further details in our documentation.
Once you have data written to disk, you can run complex operations on it. If you have data from the same session with matching name and tags, you can merge them:
A whylogs dataset profile object can be complex, but we’ve built some convenient methods to make it easier to make sense of them through data analysis:
There are currently four major plots with variants for discrete and continuous data. Here’s an example whylogs Data Type visualization for the Lending Club dataset using information from the dataset profile computed for variable desc.
I Am Excited, So What’s next?
whylogs provides access to the data insights that ML engineers and data scientists need without the heavy infrastructure requirements of other systems. It’s designed to integrate with your existing system, and to make insights understandable and actionable to the AI Builders who need them. We are already working on exciting additions, including:
- Further performance optimization, especially for Python
- Additional data sources, including cloud storage and SQL databases
- More integration with existing ML tooling such as AirFlow and Kubeflow
- Support for additional data types and languages