Drift Reporting with Automunge

The only constants are laws of nature

Nicholas Teague
Automunge
Oct 19, 2021 · 10 min read


For those that haven't been following along, I've been using this forum to document the development of Automunge, a python library that automates the preparation of tabular data for machine learning. The tool is intended for data scientists comfortable working at the coding layer, such as in the context of jupyter notebooks. Under automation, numeric features are normalized, categoric features are binarized, and missing data is imputed. But a user need not defer to automation: there is an extensive internal library of data transformations that may be applied based on a 'fit' to properties of a feature in a training set, which can then be used to prepare additional corresponding data on a consistent basis, as for preparing data for inference from a trained model. The interface is channeled through two master functions: automunge(.) prepares training data and also returns a populated dictionary, what we call the postprocess_dict, recording the steps and parameters of transformations, which may then be used as a key for automatically preparing additional corresponding data in the postmunge(.) function. Basically we're encapsulating pandas pipelines fit to a train set similar to how machine learning encapsulates neural networks fit to a train set. And we're doing this in the context of an open source library.

I wanted to use this essay to quickly offer some further detail on what has been kind of a neglected aspect of the library in this forum. I mostly write essays when rolling out new functionality, and occasionally will make note of something that didn't really get sufficient coverage on first pass. This was probably the case when I introduced the Automunge drift reporting feature back on New Year's Eve 2019. What can I say, I was probably getting ready for the ball to drop.

Drift reporting refers to the aggregation of statistics when preparing training data that can then be compared to corresponding statistics when preparing test data. In this context training data refers to data prepared by automunge(.) that could be used to train a model in supervised learning, and test data refers to subsequent corresponding data prepared by postmunge(.) that could be used to run inference with that trained model. The point of comparing statistics between training data and subsequently prepared test data is that cases of drift can identify that the original model may be losing validity and thus may need to be retrained to accommodate new distribution properties. After all, it is a fundamental assumption of supervised learning that data used for inference has comparable properties to the data used to train the model. When that assumption begins to fade, the validity of the model does as well.

driftreport

The aggregation of training data statistics in automunge(.) is automatic; no settings need to be changed. This includes statistics associated with raw values in the input data before transformations are applied, and also statistics derived while transformation functions are performed. The records of these aggregated statistics are populated in various entries of the returned postprocess_dict dictionary for retrieval when populating a drift report in postmunge(.).

The activation of drift reporting is then performed when preparing additional data via the postmunge(.) driftreport parameter [Table 1], which accepts entries as one of {False, True, 'efficient', 'report_full', 'report_effic'}. The default is False, meaning no drift report is assembled (which has latency benefits). Activating scenarios True or 'efficient' results in a drift assessment being populated, followed by processing of the data consistent with a regular postmunge(.) call, with drift results displayed in printouts and returned in the postmunge(.) dictionary postreports_dict. Activating scenarios 'report_full' or 'report_effic' assembles comparable drift reporting but without further preparing data in postmunge(.), which just speeds things up a bit when a user is only interested in drift statistics without further preparation of data. Basically the efficient scenarios refer to aggregating drift statistics just for the input features, while the full reporting scenarios also aggregate statistics for derived features.

driftreport       input feature stats   derived feature stats   data prepared
False (default)   no                    no                      yes
True              yes                   yes                     yes
'efficient'       yes                   no                      yes
'report_full'     yes                   yes                     no
'report_effic'    yes                   no                      no

Table 1: postmunge(.) driftreport options

The reporting of drift statistics is provided in two fashions. First, the postmunge(.) printouts detail each evaluation (available when a user sets printstatus=True). Second, the results are collected in the postreports_dict dictionary returned from postmunge(.), which is kind of like the postmunge analog to the automunge postprocess_dict and is used for reporting postmunge validation results, feature importance results, drift results, and so on.
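To make the workflow concrete, here is a minimal sketch of a call sequence. The driftreport and printstatus parameters and the postprocess_dict / postreports_dict returns are as described above; the import style, file names, and the assumption that each dictionary is the final item of its returned tuple may differ by version, so treat the README as the authoritative reference.

import pandas as pd
from Automunge import *   # assumed import style for illustration
am = AutoMunge()          # assumed instantiation

df_train = pd.read_csv('train.csv')   # placeholder file names
df_test = pd.read_csv('test.csv')

# automunge(.) fits transformations to the training data; the populated
# postprocess_dict is assumed here to be the final item of the returned tuple
returned_sets = am.automunge(df_train)
postprocess_dict = returned_sets[-1]

# postmunge(.) prepares additional data on a consistent basis; driftreport=True
# additionally assembles drift statistics, shown in printouts when
# printstatus=True and returned in postreports_dict (assumed final item)
returned_sets = am.postmunge(postprocess_dict, df_test,
                             driftreport=True, printstatus=True)
postreports_dict = returned_sets[-1]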

Input feature drift statistics

The distinction between input feature drift statistics and derived feature drift statistics is worth some clarification. Input feature drift statistics refer to statistics collected on the raw data as received, prior to performing feature transformations. The collection takes place after any automated evaluation of data properties, such that the root category assigned to an input feature is used as the basis for what type of drift statistics to collect. (More particularly, the basis is established from the NArowtype recorded in the process_dict entry for the root category assigned to an input feature, where NArowtype is basically a specification classifying the expected inputs to a transformation, used for a few different purposes.) For example, a transform that expects all numeric inputs would have numeric distribution statistics aggregated, while a transform that allows non-numeric inputs (as would be the case for a categoric encoding) would have distribution statistics collected associated with categoric entries.

We selected the numeric drift statistics to try and record a few metrics that capture a lot of information about the distribution. The aggregation of multiple quantile values at segments like 99/90/66/33/10/01 we believe will be helpful for evaluating mixed distributions with multiple peaks and for evaluating spread at different segments of the distribution. We collect some typical numeric statistics like max/min/median/mean/std/MAD for reference, and then also inspect some more advanced statistics like skew and the Shapiro statistic, which can help evaluate tail thickness and normality. We also calculate the percent of data with missing values.
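To make the shape of these aggregations concrete, here is a minimal sketch of the kinds of numeric statistics described, written with pandas and scipy. It is an approximation for illustration, not the library's internal implementation.

import pandas as pd
from scipy import stats

def numeric_driftstats(feature):
    """Illustrative numeric drift statistics for a pandas Series."""
    numeric = pd.to_numeric(feature, errors='coerce')
    valid = numeric.dropna()
    return {
        # quantile values at several segments of the distribution
        'quantiles': {q: valid.quantile(q)
                      for q in (0.01, 0.10, 0.33, 0.66, 0.90, 0.99)},
        'max': valid.max(),
        'min': valid.min(),
        'median': valid.median(),
        'mean': valid.mean(),
        'std': valid.std(),
        # MAD taken here as mean absolute deviation from the mean
        'MAD': (valid - valid.mean()).abs().mean(),
        'skew': valid.skew(),
        # Shapiro-Wilk statistic as a normality check (sampled for large sets)
        'shapiro': stats.shapiro(valid.sample(min(len(valid), 5000),
                                              random_state=0))[0],
        # percent of entries with missing or non-numeric values
        'nan_ratio': numeric.isna().mean(),
    }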

The categoric drift statistics we selected report the unique entries, the number of unique entries, and what percent of the feature each unique entry constitutes (reported as unique_ratio). Similarly, we calculate the percent of data with missing values. (To mitigate memory overhead in high cardinality scenarios, some of these results aren't populated when the unique entry count exceeds a heuristic threshold of 5000.)
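And a comparable sketch for the categoric statistics, again illustrative rather than the library's implementation; the 5000 entry cutoff mirrors the heuristic threshold noted above.

def categoric_driftstats(feature, max_uniques=5000):
    """Illustrative categoric drift statistics for a pandas Series."""
    valid = feature.dropna()
    n_unique = valid.nunique()
    driftstats = {
        'unique_count': n_unique,
        'nan_ratio': feature.isna().mean(),
    }
    # skip per-entry results for very high cardinality features
    # to mitigate memory overhead
    if n_unique <= max_uniques:
        driftstats['unique_entries'] = list(valid.unique())
        driftstats['unique_ratio'] = (valid.value_counts() / len(valid)).to_dict()
    return driftstats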

The drift report statistics for training data input features can be inspected in postreports_dict['sourcecolumn_drift']['orig_driftstats'] which can be compared to comparable stats for test data recorded in postreports_dict['sourcecolumn_drift']['new_driftstats']. The results are also reported in the printouts.
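For instance, the two sets of input feature statistics could be surveyed side by side along these lines (assuming, as an unverified detail, that the entries are keyed by input feature header):

orig_stats = postreports_dict['sourcecolumn_drift']['orig_driftstats']
new_stats = postreports_dict['sourcecolumn_drift']['new_driftstats']
for feature, feature_stats in orig_stats.items():
    print(feature)
    print('  train:', feature_stats)
    print('  test: ', new_stats.get(feature))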

Example of postmunge(.) printouts for input feature drift statistics

Derived feature drift statistics

The derived feature drift statistics are a little trickier, and I'll provide some context for why that is. The aggregation of derived feature drift statistics is performed within the application of the transformation functions associated with their derivation. For example, if you are applying a z-score normalization via the 'nmbr' transform, then the same function that performs that transformation also records any drift statistics associated with it. These are recorded in a data structure we refer to as the normalization_dict, the same data structure that records derived training data properties used as the basis to perform corresponding transformations on a test data feature set. In our example of z-score normalization, we are already recording the training data mean and standard deviation for use in preparing test data (which could also be considered of interest for drift assessment). Basically the drift reporting is piggybacking on that data structure to record other properties that might be of interest, which in the case of z-score normalization includes the maximum and the minimum of values in the received feature set. So when the drift report returns results, all it is doing is displaying the recorded normalization_dict populated during a transformation function.
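As an illustration of that pattern, a simplified z-score transform might look something like the following, with the drift statistics riding along in the returned normalization_dict. This is a sketch for intuition, not the library's actual 'nmbr' implementation or its custom_train template.

def zscore_train(df, column):
    """Illustrative z-score normalization fit to a training data feature."""
    mean = df[column].mean()
    std = df[column].std()
    std = std if std != 0 else 1   # guard against zero variance

    # parameters needed to reproduce the transform on test data
    normalization_dict = {'mean': mean, 'std': std}

    # additional properties recorded purely for drift reporting
    normalization_dict['maximum'] = df[column].max()
    normalization_dict['minimum'] = df[column].min()

    df[column] = (df[column] - mean) / std
    return df, normalization_dict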

Even though a derived feature drift statistic may match an input feature drift statistic, such as calculating the maximum and minimum values for z-score normalization, when a derivation is performed as a downstream transform the feature set received by the downstream transform will differ from the input feature (due to some preceding derivation performed upstream), so the derived feature version may still carry information beyond what was recorded for the input feature.

When it comes time to collect the corresponding drift statistics for test data, this is performed by applying a family tree of transforms to the test data in a manner consistent with applying transforms to train data. To describe what that means, we'll use the custom_train convention for custom transformation functions as an example. The convention is that custom_train transforms are used to prepare a training data feature, and custom_test transforms are used to prepare a corresponding test data feature. When a custom_train transform is performed, it returns the result of the derivation (as one or more derived columns) and the normalization_dict dictionary recording parameters of those derivations (which also records drift statistics). In normal operation the normalization_dict dictionary is then inspected by the custom_test function to perform corresponding transforms to a test data feature on a consistent basis. In the context of deriving drift statistics for test data, this is accomplished a little differently: the test data is passed as if it were training data through a family tree of transforms, resulting in the test data being fed to the custom_train functions (instead of custom_test), and resulting in an aggregation of normalization_dict entries derived from properties of the test data, which can then be compared to those derived from the training data. Using again our example of z-score normalization, this means we then have access to the mean and standard deviation of both the training data feature and the corresponding test data feature, which can be compared to identify cases of distribution drift.
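Continuing the same simplified sketch, the test data counterpart consumes the recorded parameters, while the test data drift statistics are gathered by routing the test feature back through the train-style function and comparing the two resulting normalization_dict entries. Here 'column1' and the data frames are hypothetical placeholders.

def zscore_test(df, column, normalization_dict):
    """Illustrative counterpart applying the recorded training parameters."""
    mean = normalization_dict['mean']
    std = normalization_dict['std']
    df[column] = (df[column] - mean) / std
    return df

# drift assessment, conceptually: fit the train-style transform to the test
# feature only to harvest its statistics, then compare against training
_, train_stats = zscore_train(df_train.copy(), 'column1')
_, test_stats = zscore_train(df_test.copy(), 'column1')
for key in train_stats:
    print(key, ':', train_stats[key], '->', test_stats[key])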

As a little bit of minutiae, in cases where a transform returns multiple columns, the reported statistics are keyed under the first returned column in that set, which is the same column header used as a key to access the normalization_dict. A complication of this approach for aggregating test data drift statistics is that in some cases custom_train transforms may return columns conditioned on data set properties. For example, in the case of one hot encoding via the 'text' transform, returned column headers are derived as a function of the feature set's unique entries. This leads to the complication that the drift statistics may be reported under keys of different returned column headers between train and test data. We accommodate this by prioritizing, when available, reporting for comparison under corresponding returned columns, and when corresponding column keys are not available between train and test data, reporting separately the drift statistics of the column as returned from training data and the different column as returned from the test data drift stat derivation. Each of those will still be aggregated under the input feature that was their basis.

Thus, the derived feature drift statistics can be inspected in postreports_dict['driftreport']['(input feature)']['newreturnedcolumn'], or for cases where the first returned column was inconsistent between train and test data, separately in postreports_dict['driftreport']['(input feature)']['orignotinnew'] and postreports_dict['driftreport']['(input feature)']['newnotinorig']. The results are also reported in the printouts.
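As a hedged illustration for a hypothetical input feature 'column1' (the nesting beneath the input feature key follows the description above and may vary):

for key, stats in postreports_dict['driftreport']['column1'].items():
    print(key, ':', stats)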

Example of postmunge(.) printouts for derived feature drift statistics

Cargo Cult Science

In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas — he’s the controller — and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.

Richard Feynman

In the field of engineering, when some evaluation or design is performed, the profession dictates that a single point of responsibility be identified for "signing and sealing" that result. This professional has liability exposure for this practice, and only earns the right through years of education, professional experience, and certification examination. The field of machine learning is currently moving way too fast to adopt such practices.

It is easy to get distracted by the ever accelerating pace of new technologies and new software. This is probably one of the fastest changing domains in industry right now. With every new project and experiment, there may be new parameters, new architectures, new data sets, or even new things that we haven’t yet discovered. The only constant is change. So if we want to make this a real profession, with ethics and guidelines, we need to search for the fundamentals. What principles will endure independent of new gadgets and gizmos?

A machine learning practitioner’s job is not just to prepare data, train a model, and roll out into production. Any cargo cult could do that. I would argue that we have a professional duty to diligently pursue validity of our models. Without formal data distribution metrics with periodic inspections, one is blind to loss of validity, and all of the ethical concerns that entails. This is what drift reporting is for.

The first principle is that you must not fool yourself — and you are the easiest person to fool.

Professor Longhair — Big Chief

Books that were referenced here or otherwise inspired this post:

Surely You’re Joking, Mr. Feynman! — Richard Feynman


As an Amazon Associate I earn from qualifying purchases.

Intellectual Property Disclaimer

Automunge is released under GNU General Public License v3.0. Full license details available on GitHub. Contact available via automunge.com. Copyright © 2021 — All Rights Reserved. The Automunge library is Patent Pending, including applications 16552857, 17021770

For further readings please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
