Data Distribution Drift

Automunge makes machine learning easy

Nicholas Teague
Jan 1

One

We’re approaching the one-year mark of the Automunge project (in non-sequential months, since I took a salary slave sabbatical at the mid-point), and looking back we really have a lot to show for it. A robust software package filling several unmet needs for data scientists looking to compile machine learning models that incorporate elements of tabular data in their pipelines, or in other words most machine learning models. A real book! (I mean ok, it’s just a bunch of essays on Medium, which is literally so prestigious a platform that I can’t even share links to the data science Reddit feed, to give you an idea.) The beginnings of some intellectual property claims (patent pending) for this open source software, which may sound like a contradiction but I assure you is a sound strategy meant to simultaneously establish trust in a user base while providing some degree of protection from competitors copying our inventions in the context of a commercial offering. Oh, and frankly not enough users. That’s kind of the big gap in making this a reputable platform.


Two

One of the features I’d like to highlight this week is built to support the real-world challenges of managing a machine learning implementation, specifically tracking data distribution drift between training data and the subsequent data intended for inference from the corresponding trained model. This turns out to be a common challenge for machine learning practitioners: a built-in assumption of any trained model is that the properties of the data used to generate predictions are consistent in form with those of the data used to train the model, and identifying drift in those properties can serve as a key signal that it is time to retrain. Because the Automunge platform is built around the workflow that immediately precedes machine learning training or inference, it turns out to be an ideal step at which to evaluate data set distribution properties. To support this method, two sets of distribution properties are evaluated from a training set automatically in the context of an automunge(.) call.
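To make the idea of tracking drift a little more concrete, here is a minimal sketch in plain Python. To be clear, this is not the Automunge implementation or its API: the function names (summarize, drift_report), the choice of statistics, and the tolerance value are all hypothetical, picked only to illustrate saving a few distribution properties from a training column and comparing them against a later batch of data.

```python
import statistics

def summarize(values):
    """Collect a few distribution properties for one numeric column.
    (Hypothetical helper; the choice of properties is an assumption.)"""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def drift_report(train_stats, new_values, tolerance=0.5):
    """Compare a new batch against saved training properties.
    Flags drift when the mean moves by more than `tolerance`
    training standard deviations (an arbitrary illustrative threshold)."""
    new_stats = summarize(new_values)
    scale = train_stats["std"] or 1.0  # avoid dividing by zero on constant columns
    shift = abs(new_stats["mean"] - train_stats["mean"]) / scale
    return {
        "train": train_stats,
        "new": new_stats,
        "mean_shift": shift,
        "drifted": shift > tolerance,
    }

# Properties are measured once on the training data and saved...
train_column = [1.0, 2.0, 3.0, 4.0, 5.0]
train_stats = summarize(train_column)

# ...then compared against each later batch intended for inference.
similar_column = [1.5, 2.5, 3.0, 3.5, 4.5]
shifted_column = [6.0, 7.0, 8.0, 9.0, 10.0]

report_similar = drift_report(train_stats, similar_column)
report_shifted = drift_report(train_stats, shifted_column)
print(report_similar["drifted"], report_shifted["drifted"])
```

A comparison of means is about the crudest signal available; a real drift check would likely look at more properties per column, but the shape of the workflow (measure at training time, compare at inference time) is the same.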

Mozart’s Sonata I — Nicholas Teague

Three

It just occurred to me that it has already been three years since I began writing essays for the collection From the Diaries of John Henry (consecutive months, non-stop). Along the way there has admittedly been some serious drift of attention: what started as a bunch of sort of random explorations eventually transitioned to formal essays, many around themes of machine learning, and more recently to a concerted focus on entrepreneurship documenting the development of the Automunge platform. (Actually if you check out A Table of Contents, these themes are sort of aggregated in separate collections. Currently working on Book Four!) And well, part of the problem with so much energy applied to software development of late has been a neglect of creative writing, which has really been as much a part of the fuel for Automunge as the Python code; I find the different aspects of the project (software development, creative writing, and research) to be mutually reinforcing in each direction. Sort of a rock paper scissors game where everybody wins. So yeah, just a few creative thoughts here to close. Any “serious” reader please feel free to disregard.



Thinking, Fast and Slow — Daniel Kahneman

From the Diaries of John Henry — Nicholas Teague

Surely You’re Joking, Mr. Feynman! — Richard Feynman


Written by Nicholas Teague

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com
