Data Distribution Drift

Automunge makes machine learning easy

Nicholas Teague
Automunge
8 min read · Jan 1, 2020


One

Approaching the one year mark of the Automunge project (non-sequential months, took a salary slave sabbatical at the mid-point), and well, looking back we really have a lot to show for it. A robust software package filling several unmet needs for data scientists looking to compile machine learning models that incorporate elements of tabular data in their pipelines — or in other words most machine learning models. A real book! (I mean ok, it’s just a bunch of essays on Medium, which is literally so prestigious a platform that I can’t even share links to the data science Reddit feed, to give you an idea.) The beginnings of some intellectual property claims (patent pending) for this open source software, which may sound like an oxymoron but I assure you is a sound strategy, meant to simultaneously establish trust in a user base while providing some degree of protection against competitors copying our inventions in the context of a commercial offering. Oh, and frankly not enough users. That’s kind of the big gap in making this a reputable platform.

There are three key insights that have inspired much of what makes the Automunge platform unique. The first is the segregated application of feature engineering transformations between sets intended for training data, validation data, or inference data, basing the transformation parameters for the validation and inference sets on properties inferred from the training set — such as to prevent “data leakage”, which may interfere with hyperparameter tuning operations for instance. The second involves methods for automated assembly of feature-column-specific training sets, used to train intermediate models that predict infill for missing or improperly formatted data points — which is built into the platform, what we call ML infill. The third originates from the recognition that, by building on an assumption that data sets are received in a “tidy” form, meaning one feature per column and one row per observation, it becomes possible to fully automate the preprocessing, basing simple feature engineering transformations on properties inferred from each column for cases where methods are not otherwise specified by a user.
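To make the first of these insights concrete, here is a minimal sketch of the general principle in plain pandas (an illustration of the idea, not the Automunge internals): transformation parameters are derived from the training set only, then reused as-is for any subsequent data, so no information from validation or inference sets leaks into the preparation basis.

```python
import pandas as pd

# derive transformation parameters from the training set only
def fit_zscore(train_column):
    return {'mean': train_column.mean(), 'std': train_column.std()}

# apply those saved parameters consistently to any subsequent data
def apply_zscore(column, params):
    return (column - params['mean']) / params['std']

train = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({'feature': [2.5, 5.0]})

params = fit_zscore(train['feature'])                     # basis: training data
train['feature'] = apply_zscore(train['feature'], params)
test['feature'] = apply_zscore(test['feature'], params)   # no leakage from test
```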

From the foundation of these three insights, the Automunge development journey has basically amounted to the simple act of building out the skeleton between inputted tidy data and outputted ML-ready data, encompassing all of those tabular data preparation steps that may precede the application of machine learning, which we have done in the context of two master functions: the automunge(.) function for the initial processing applied to training data (and optionally simultaneous application to inference data), and the postmunge(.) function for subsequent consistent processing of additional data (based on parameters inferred from the training data and saved in a dictionary object returned from an automunge(.) call). In the context of these two functions the Automunge platform includes a whole host of useful and unique methods, such as a library of feature engineering transformations which may be assigned to distinct columns individually or as sets of transformations, including generations and branches, such as to present data to the ML algorithms in multiple configurations. The sets may be assembled from transformation functions built into the library or alternatively custom defined by the user, making use of just a few simple data structures, for incorporation into the platform. Oh, and a whole bunch of other useful stuff, including several alternatives to one-hot encoding for categorical sets, methods to prepare data for oversampling in cases of class imbalance in labels, label engineering, label smoothing, fitted label smoothing (it’s a thing!), feature importance evaluation, dimensionality reduction such as via PCA, methods for data versioning that remove the need to archive multiple versions of a data set, and a really efficient means to prepare streams of data for ML inference. In short, we make machine learning easy.
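For orientation, a hedged sketch of what a basic call pairing might look like. Treat the import path, the 'target' label column, and the shape of the returned tuple as assumptions (the exact number of returned sets has varied across versions); only the automunge(.)/postmunge(.) pairing and the returned dictionary basis are as described above.

```python
import pandas as pd
from Automunge import AutoMunge  # import path is an assumption

am = AutoMunge()
df_train = pd.read_csv('train.csv')  # hypothetical file names
df_test = pd.read_csv('test.csv')

# automunge(.) returns a tuple of prepared sets; the final entry is the
# postprocess_dict recording every parameter inferred from the training data
*prepared_sets, postprocess_dict = am.automunge(
    df_train, labels_column='target')

# postmunge(.) then prepares any additional data consistently on that basis
test, test_ID, test_labels, postreports_dict = am.postmunge(
    postprocess_dict, df_test)
```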

Two

One of the features I’d like to highlight this week is built to support the real world challenges of managing a machine learning implementation, specifically with respect to tracking data distribution drift between training data and subsequent data intended for inference from a corresponding trained model. This turns out to be a common challenge for machine learning practitioners, in that a built-in assumption of any trained model is that the properties of data used to generate predictions are consistent in form with the data used to train the model, and identifying drift in those properties can serve as a key signal for when it is time to retrain a model. Because the Automunge platform is built around the workflow that immediately precedes the application of machine learning training or inference, it turns out to be an ideal step for incorporating an evaluation of data set distribution properties. To support this method, two sets of distribution properties (for source columns and for derived columns, as detailed below) are automatically evaluated from a training set in the context of an automunge(.) call.
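As a toy illustration of the underlying idea (generic code, not the library’s internal implementation): distribution properties are captured at training time, then serving data is compared against those stored properties to flag columns whose statistics have moved.

```python
import pandas as pd

# capture distribution properties at training time
def capture_stats(df):
    return {col: {'mean': df[col].mean(), 'std': df[col].std()}
            for col in df.select_dtypes('number').columns}

# compare serving data against the stored training properties
def drift_report(df, train_stats, tolerance=0.5):
    report = {}
    for col, stats in train_stats.items():
        shift = abs(df[col].mean() - stats['mean'])
        # flag columns whose mean has moved more than `tolerance`
        # training standard deviations from the training mean
        report[col] = {'mean_shift': shift,
                       'drifted': shift > tolerance * stats['std']}
    return report

train = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
serving = pd.DataFrame({'x': [4.0, 5.0, 6.0, 7.0]})

print(drift_report(serving, capture_stats(train)))
```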

Mozart’s Sonata I — Nicholas Teague

The driftreport assessment is then accessed in the corresponding postmunge(.) call, in which subsequent data is consistently processed and evaluated. Because a postmunge(.) call may be performed much more frequently, such as for streams of inference data, we’ve put a little more effort into efficient operation to ensure speed and energy efficiency at large scale — so the default for a driftreport assessment is False (off). A user may activate the assessment for subsequent data by passing the postmunge(.) driftreport parameter as one of {True, ‘efficient’, ‘report_effic’, ‘report_full’}. In the True assessment the full range of distribution properties are evaluated and returned in the ‘postreports_dict’ object returned from postmunge(.), and optionally displayed by passing printstatus=True. This assessment includes two series of distribution property evaluations: first for original source columns based on each column’s root category, and second for properties associated with each derived column based on the associated categories of transformations. By passing driftreport = ‘efficient’, only the source column properties are evaluated, which, since it excludes the need for data transformations, is a much more efficient operation. The ‘report_full’ and ‘report_effic’ options are comparable to True and ‘efficient’ respectively, save that postmunge(.) halts after the drift evaluation and no preparation or further transformation of the passed data is performed, just a report of drift properties, for cases where a user wishes only to evaluate data distribution drift. Actually come to think of it, the question of evaluating distribution property drift kind of came up in a NeurIPS talk when Yoshua Bengio proposed some means to distinguish paradigms of artificial cognition between Systems One and Two.
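Putting those options together, a hedged sketch continuing from the earlier example (reusing the am instance, postprocess_dict, and df_test from above; the driftreport and printstatus parameters and the postreports_dict return are as described, while the exact return tuple is an assumption):

```python
# full assessment: data is prepared and distribution properties are
# evaluated for both source columns and derived columns
test, test_ID, test_labels, postreports_dict = am.postmunge(
    postprocess_dict, df_test, driftreport=True, printstatus=True)

# 'efficient' assessment: source column properties only, skipping the
# derived-column transformations for a faster evaluation
test, test_ID, test_labels, postreports_dict = am.postmunge(
    postprocess_dict, df_test, driftreport='efficient')

# report-only variants: postmunge(.) halts after the drift evaluation,
# returning the report without preparing the passed data
_, _, _, postreports_dict = am.postmunge(
    postprocess_dict, df_test, driftreport='report_full')
```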

Three

It just occurred to me that it has already been three years since I began writing essays for the collection From the Diaries of John Henry — consecutive months, non-stop. Along the way there has admittedly been some serious drift of attention: what started as a bunch of sort of random explorations eventually transitioned to formal essays, many around themes of machine learning — and more recently a concerted focus on entrepreneurship, documenting the development of the Automunge platform. (Actually if you check out A Table of Contents these themes are sort of aggregated in separate collections — currently working on Book Four!) And well, part of the problem with so much energy applied to software development as of late has been a neglect of creative writing, which has really been as much of the fuel for Automunge as the python code; I find the different aspects of the project — software development, creative writing, and research — to be mutually reinforcing in each direction. Sort of a rock paper scissors game where everybody wins. So yeah, just a few creative thoughts here to close. Any “serious” reader please feel free to disregard.

I just had a chance to watch a bowl game with my dad the other day; we’re both kind of college football nerds, so we always look forward to a little friendly trash talk, such as when the Gators play the Tigers every year (and boy, LSU looks strong this year, don’t they :). My dad has this ridiculous tradition where every time our teams compete he fishes out from some hiding place this unbelievably laughable “football vest” which he hand-made himself, including random “alligator” and “tiger” patches sewn on the front, which he must have found in the discount bin at a local craft store, and then in large felt lettering on the back “LSU Grad, Gator Dad”. I tried to buy him a nice team button-up collar shirt once and he still prefers the vest; it is both endearing and oh so cringeworthy all at the same time. Something about those LSU fans I swear, must be something in the water.

I’m vaguely reminded of a story Richard Feynman told about his father, who worked at a uniform store, you know, the kind that might supply the outfits worn by a general, or a bishop, or a nurse, or a Disney character, or a football player. And forgive me, I just tried to grab the book off the shelf and must have misplaced it in the move, but I believe the senior Feynman’s point was that it is a fallacy to believe that these people in uniforms are superior or somehow special due to their station, and that under that outfit they are just like the rest of us. Consider the movie Amadeus about the life of Mozart, a man of genius in composition (yet unafraid to make a bawdy joke here and there when the time called for it) — in comparison to the court of noblemen with their regalia and crowns, all just as confused as the rest of us. The point, if I have a point and I’m not sure that I do, is that by recognizing that these people in power are just as human and fallible as the rest of us, well, perhaps it becomes a little easier to have faith that maybe you yourself might have something to contribute. After all, wearing a uniform is really just a cheap hack for state of mind.

Books that were referenced here or otherwise inspired this post:

Thinking, Fast and Slow — Daniel Kahneman

From the Diaries of John Henry — Nicholas Teague

Surely You’re Joking, Mr. Feynman! — Richard Feynman

As an Amazon Associate I earn from qualifying purchases.

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
