Data Logging: Sampling versus Profiling

by Isaac Backus and Bernease Herman · WhyLabs · Oct 29, 2020

In traditional software, logging and instrumentation have been adopted as standard practice to create transparency and to make sense of the health of a complex system. When it comes to AI applications, the lack of tools and standardized approaches means that logging is often spotty and incomplete. Here, I compare two approaches to data logging: sampling and profiling.

I have two goals in this post. First, I will demonstrate that profiling is superior to sampling. Profiling provides a lightweight, robust approach to characterizing distributions for all types of data encountered in ML. Next, I want to convince every data scientist to give data logging a shot. To that end, I present whylogs: an open source library developed by the team here at WhyLabs. The whylogs approach is suitable for any ML framework and enables scalable, statistical data logging and profiling in only a few lines of code.
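
To make the “few lines of code” claim concrete, here is a minimal sketch based on the session API from whylogs’ v0-era releases (the entry points have changed between versions, so treat the exact calls as illustrative rather than canonical):

```python
import pandas as pd
from whylogs import get_or_create_session

# Reuse or create a whylogs session for this process
session = get_or_create_session()

df = pd.read_csv("my_dataset.csv")  # any tabular data source

# Log a statistical profile of the dataset in one call;
# the profile is written out when the logger closes
with session.logger(dataset_name="my_dataset") as logger:
    logger.log_dataframe(df)
```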

Logging

In a post on Towards Data Science, I argued that data logging is required for robust, mature ML/AI applications, and I outlined five requirements for logging software. Whylogs, particularly in combination with the WhyLabs AI Observability Platform, aims to hit those targets. The requirements are:

  • Ease of use
  • Lightweight
  • Standardized and portable
  • Configurable
  • Close to the code

There are two main approaches to data logging: profiling and sampling. Let’s see why profiling (as implemented by whylogs) beats out sampling.

Sampling

Data sampling is a straightforward approach to monitoring data in production environments. The idea is simple: randomly or programmatically select samples of data from a larger data stream and store them for later analysis. Implementing sampling typically requires no special software and can be achieved with little extra up-front design.
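
For concreteness, the classic way to keep a fixed-size uniform sample from a stream of unknown length is reservoir sampling. A minimal sketch (my illustration, not a whylogs component):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10**6), k=1000)
```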

However, sampling has drawbacks that profiling attempts to address. It can incur large I/O and storage costs. It tends to be noisy, and even though implementation is straightforward, consuming samples generally still requires statistical analysis specific to the dataset. The output data format depends on the input data, making it difficult to build generic tooling for consuming the sampled data.

Rare events and outliers are frequently missed, and distribution metrics such as min/max or the number of unique values cannot be estimated accurately. These metrics are especially important for logging, since outliers and rare events often correlate with data issues.

Profiling

In contrast, profiling collects statistical measurements of the data. In the case of whylogs, the metrics produced come with mathematically derived uncertainty bounds. These profiles are scalable, lightweight, flexible, and configurable. Rare events and outlier-dependent metrics can be captured accurately, and the results are statistical summaries in a standard, portable format that is directly interpretable. Whylogs packages all of this up with multi-language support, ease of use, reliability, and flexibility.

whylogs profiles

Whylogs implements a number of useful statistics for data profiling, all collected in a streaming fashion: the data is read in a single pass with minimal memory overhead. The resulting profiles are mergeable, allowing statistics from multiple hosts, data partitions, or datasets to be combined post hoc. The approach is therefore trivially parallelizable and map-reducible, making it highly scalable.
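
To illustrate the merge property with a toy accumulator (my sketch, not the whylogs data structure): exact statistics like count, sum, min, and max can be tracked independently per partition and combined afterward, and whylogs’ sketch-based statistics merge the same way.

```python
from dataclasses import dataclass
import math

@dataclass
class Summary:
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Single-pass, constant-memory update
        self.count += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other: "Summary") -> "Summary":
        # Merging partition summaries matches one pass over all of the data
        return Summary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

# Profile two partitions independently (e.g., on different hosts), then merge
a, b = Summary(), Summary()
for x in [1.0, 5.0, 3.0]:
    a.update(x)
for x in [2.0, 8.0]:
    b.update(x)
merged = a.merge(b)
print(merged.count, merged.total / merged.count, merged.minimum, merged.maximum)
# 5 3.8 1.0 8.0
```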

Certain statistics can be tracked exactly, such as record count, data type counts, null count, min, max, and mean. Others — such as quantiles, histograms, or cardinality — require approximate statistics.
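
Under the hood, whylogs builds these approximations on streaming sketches (Apache DataSketches). As a standalone illustration with the datasketches Python package, a KLL sketch tracks quantiles and an HLL sketch tracks cardinality, both in a single pass with bounded memory; the parameter values below are illustrative, not whylogs’ settings.

```python
import random
from datasketches import kll_floats_sketch, hll_sketch

quantiles = kll_floats_sketch(256)  # larger k -> tighter quantile error bounds
uniques = hll_sketch(12)            # lg_k = 12 -> roughly 1.6% relative error

rng = random.Random(0)
for _ in range(100_000):
    x = rng.paretovariate(1.5)      # long-tailed stream
    quantiles.update(x)
    uniques.update(round(x, 2))     # count distinct rounded values

print("approx. median:", quantiles.get_quantile(0.5))
print("approx. unique count:", uniques.get_estimate())
```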

Profiling vs sampling — experimental results

To compare profiling and sampling, I ran a number of experiments. The results demonstrate profiling’s improved accuracy over sampling, especially for outlier-dependent metrics, long-tailed distributions, and metrics such as cardinality estimates (the number of unique values). Here I present two sets of experiments: the first targets distributional metrics, and the second targets unique value counts.

Experiment 1 — distributional metrics

The first set of experiments was run as follows (a simplified sketch of one run follows the list):

  1. Select a distribution (from the set outlined below)
  2. Randomly sample 10⁵ records
  3. Sample a subset of n_sample records such that the subset is as many bytes as the profile. This is to compare apples to apples. Accuracy can be improved for sampling and profiling by increasing the data size.
  4. Compare with exact values
  5. Repeat steps 2 through 4 for a total of 24 runs and average the results
  6. Repeat for every distribution
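
For illustration, here is a simplified version of steps 2 through 5 in numpy, with a fixed, hypothetical n_sample of 1,000 (the actual experiments sized the sample to match the profile’s bytes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_sample, n_runs = 10**5, 1_000, 24

abs_err = {"mean": [], "max": []}
for _ in range(n_runs):
    data = rng.pareto(1.5, size=n_records)                  # long-tailed distribution
    sample = rng.choice(data, size=n_sample, replace=False)
    abs_err["mean"].append(abs(sample.mean() - data.mean()))
    abs_err["max"].append(abs(sample.max() - data.max()))

for metric, errs in abs_err.items():
    print(f"{metric}: mean absolute error = {np.mean(errs):.3f}")
```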

As can be seen in the figures below, profiles outperform samples across all distribution types and for every metric. This is especially clear for the long-tailed Pareto distribution, which produces “outliers.” Outlier-related metrics cannot be captured by random sampling.

Figure captions (plots omitted here):

  • Mean estimates: errors in the estimate of the mean for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown. Profiling errors are too small to be visible.
  • Median estimates: errors in the estimate of the median for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
  • Edge quantiles: errors in the estimates of the 0.05 and 0.95 quantiles for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
  • Max: errors and bias in the estimate of the maximum for sampling vs profiling for various distributions. The mean relative error (left) and mean relative bias (right) are shown, where relative bias is bias divided by the true value. Profiling errors are too small to be visible.
  • Min: errors and bias in the estimate of the minimum for sampling vs profiling for various distributions. The mean relative error (right) and mean relative bias (left) are shown, where relative bias is bias divided by the true value. Profiling errors are too small to be visible.

Experiment 2 — unique value counts

When it comes to estimating the number of unique values, particularly at high cardinality or in unbalanced datasets where certain categories are rare, profiling significantly outperforms sampling.

This experiment proceeds as follows (a simplified sketch appears after the list):

  1. Select a distribution and a number of unique values (n_true)
  2. Randomly sample 10⁶ records
  3. Sample a subset of n_sample records such that the subset is as many bytes as the profile.
  4. Estimate number of unique values (for both methods)
  5. Repeat steps 2 through 4 for a total of 15 runs and average the results
  6. Repeat steps 1 through 5 with a new choice of unique value count n_true
  7. Repeat for all distributions
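
A simplified version of one run, comparing a subsample’s naive unique count against an HLL sketch over the full stream; n_true, n_sample, and the Zipf parameter here are illustrative choices, not the values from the experiment:

```python
import numpy as np
from datasketches import hll_sketch

rng = np.random.default_rng(0)
n_records, n_true, n_sample = 10**6, 50_000, 10_000

# Zipf-like category stream: many categories appear only rarely
values = rng.zipf(1.5, size=n_records) % n_true

sketch = hll_sketch(12)
for v in values:
    sketch.update(int(v))

sample = rng.choice(values, size=n_sample, replace=False)
print("true uniques:", len(np.unique(values)))
print("sample estimate:", len(np.unique(sample)))        # badly undercounts
print("profile (HLL) estimate:", round(sketch.get_estimate()))
```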

As can be seen in the figure below, only in the case of a uniform distribution and fairly low cardinality does sampling accurately estimate the number of categories.

Figure: the discrete distributions used in experiment 2.

Unique value counts: errors in the estimate of the number of unique values (y-axis) for profiling vs sampling across different (discrete) distributions and a range of true unique value counts (x-axis). Results are for 10⁶ samples.

Monitoring profiles

Beyond their statistical accuracy, another motivation for data logging with profiles is how well they lend themselves to automated monitoring of ML/AI applications and pipelines. Several properties make profiles especially well suited for monitoring. Profiles are:

  • Lightweight
    This encourages broad monitoring across many data sources. There is very little cost in terms of person hours (implementation), storage, or compute.
  • Controlled
    whylogs profiles are a standardized, cross-language, cross-platform format. They provide monitoring targets consistent across customers, platforms, databases, etc.
  • Simple
    Monitoring on arbitrary data can add an additional complex, fragile data layer to an already complex system. The structured, standardized profiles are much simpler than samples of arbitrary data.
  • Human-centered
    Profiles produce interpretable statistics and signals, which are essential for debugging data pipelines and understanding model performance.
  • Statistical
    Statistical monitoring algorithms are more interpretable and robust than black-box ML monitoring, and one can incorporate statistical knowledge of the profiles when designing monitoring algorithms (a toy drift check is sketched after this list).
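
As a toy illustration of that last point (my sketch, not the WhyLabs monitoring logic): given summary statistics from a baseline profile and a current profile, a transparent monitor can be as simple as a thresholded relative-change rule.

```python
def check_drift(baseline: dict, current: dict, rel_threshold: float = 0.25) -> list:
    """Flag metrics whose relative change from baseline exceeds the threshold."""
    alerts = []
    for metric, base_value in baseline.items():
        cur_value = current.get(metric)
        if cur_value is None or base_value == 0:
            continue
        rel_change = abs(cur_value - base_value) / abs(base_value)
        if rel_change > rel_threshold:
            alerts.append((metric, base_value, cur_value, rel_change))
    return alerts

# Profile summaries for one feature in two time windows (illustrative values)
baseline = {"mean": 3.8, "median": 3.0, "unique_count": 120}
current = {"mean": 6.1, "median": 3.1, "unique_count": 480}
for metric, base, cur, change in check_drift(baseline, current):
    print(f"ALERT: {metric} moved {base} -> {cur} ({change:.0%} change)")
```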

Monitoring demo

To see what monitoring and data logging can look like in production, check out the live sandbox demo of the WhyLabs Platform, whose built-in monitoring consumes the statistical profiles generated by whylogs.

Extensibility

Machine learning applications are, by their nature, statistical. Profiling is a broadly applicable approach to characterizing distributions, making it viable for all types of data encountered in ML. Additionally, since the whylogs approach is streaming, trivially parallelizable, and map-reducible, it is naturally suited to all ML frameworks. At WhyLabs, our goal is to make data logging available and easy to implement for all AI practitioners.

The data explored in this blog post is primarily structured, but the team here at WhyLabs is already working on profiling support in whylogs for images, natural language data, time series data, and more. Current integrations include Python, Java, pandas, NumPy, Spark, MLflow, and more. We are rapidly expanding these to target every ML environment.
