Open Source Automated Data-Wrangling

All your code are now belong to us

Nicholas Teague
Automunge
Nov 9, 2018


Ray Charles — What’d I Say (live)

github repository available here

For those who haven’t been following along, I’ve been using this forum over the last three months to document the development of a tool for the automation of data wrangling (aka “munging”) of structured data sets for the direct application of machine learning in the framework of your choice. In its current iteration the tool is not yet a replacement for feature engineering, but it is suitable as a replacement for the final steps of data processing prior to the application of machine learning. The culmination of this work has been a tool and a corresponding business vehicle dubbed Automunge, which includes the function automunge(.) for the initial processing of data intended to train a downstream model (along with comparable processing of corresponding test data intended to generate predictions from that same downstream model), and the function postmunge(.) for the subsequent consistent processing of test data that wasn’t available at the initial address.

HT Tidy Data by Wickham / Pandas Cookbook by Petrou

To facilitate the application of neural networks, the tool evaluates each column in an inputted structured data set and assigns a category based on the properties of the data to determine an appropriate processing approach. Numerical data is treated with a z-score normalization to mean 0 and standard deviation 1, binary data is translated to a 0/1 designation, and categorical data is similarly encoded into multiple columns via one-hot encoding. The algorithm is designed to facilitate the further development of supplemental feature engineering transformations which may be applied based on the properties of a column. For example, numerical data may be treated with a power law transformation via the Box-Cox method to address scenarios with fatter-tailed distributions, as may be evidenced by a measure of distribution skewness. Numerical sets may also be broken into bin identifier columns based on the number of standard deviations from the mean (a new feature now btw). Time series data may be segregated into multiple columns by time scale. Each of the transformations listed already has a version of its implementation included in the current package, and the intent is to continue building out the set of potential feature engineering transformations. After all, for each feature engineering transformation included we are “tutoring” our machine learning model, giving it access to different realizations of the same data that the neural network would otherwise have had to learn via backpropagation. The hope is thus that our range of supplemental feature engineering transformations will make for a more accurate and efficiently trained downstream model. Of course such a brute force approach to processing data will eventually risk exposing the data to the curse of dimensionality, so as we build out our range of feature engineering transformations the intent will be to incorporate some kind of mechanism to evaluate the suitability of transformations.
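
To make the flavor of these transforms a bit more concrete, here is a rough pandas sketch of the kinds of per-column operations described above (z-score normalization, one-hot encoding, and standard deviation bins). This is an illustration of the concepts rather than the automunge implementation, and the function names are just for the example.

```python
import pandas as pd

def zscore_normalize(df, column):
    """Scale a numerical column to mean 0 and standard deviation 1."""
    mean, std = df[column].mean(), df[column].std()
    df[column] = (df[column] - mean) / std
    return df

def onehot_encode(df, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    dummies = pd.get_dummies(df[column], prefix=column)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

def stddev_bins(df, column, edges=(-2, -1, 0, 1, 2)):
    """Add bin identifier columns based on standard deviations from the mean."""
    z = (df[column] - df[column].mean()) / df[column].std()
    bins = pd.cut(z, bins=[-float('inf'), *edges, float('inf')])
    return pd.concat([df, pd.get_dummies(bins, prefix=column + '_bin')], axis=1)
```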

HT Deep Learning by Goodfellow, Bengio, and Courville

I’ve actually encountered multiple options when it comes to evaluating the suitability of features. The above highlight from the Deep Learning textbook is certainly one viable approach. Another, discussed and presented in the fast.ai Machine Learning course, involves a method whereby a downstream model is trained and predictions are generated once for each feature, but in each step with that feature’s column having its values randomly shuffled — such that one may evaluate the impact to model accuracy without that feature’s support. Things get even more interesting when you consider that some feature engineering transformations supporting learning may be produced from a combination of parent features, such that a single parent feature may not individually contribute significantly to efficient learning but some derived combination of parent features does — thus as the scope of original features grows, the range of potential feature transformations may grow exponentially. So this definitely isn’t a simple problem, but I have some thoughts that perhaps were sparked by a recent TWiML meetup presentation.
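
A minimal sketch of that shuffle-based evaluation might look something like the following, assuming a fitted scikit-learn style model and a held-out validation set as numpy arrays (the variable and function names here are just for illustration):

```python
import numpy as np

def shuffle_importance(model, X_val, y_val, metric):
    """Measure the drop in validation score when each feature is shuffled in turn."""
    baseline = metric(y_val, model.predict(X_val))
    importances = []
    for col in range(X_val.shape[1]):
        X_shuffled = X_val.copy()
        np.random.shuffle(X_shuffled[:, col])  # break this feature's link to the target
        score = metric(y_val, model.predict(X_shuffled))
        importances.append(baseline - score)   # bigger drop implies a more important feature
    return importances
```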

I think it would be a fair criticism of the automunge tool in its current form that it is fairly unsophisticated from a statistical standpoint. I think one of the big opportunities that remains insufficiently addressed in the machine learning industry (or at least the mainstream literature, I suppose) is the incorporation of probability theory into numerical address / feature engineering. There are certainly cases where a feature under measurement may not be amenable to simple z-score normalization (a good resource on this issue can be found in the forthcoming technical addendum to Nassim Taleb’s Incerto series). I’ve tried to move in the direction of this issue with the incorporation of the Box-Cox transformation (a kind of power law transformation that can be tailored to the distribution of the data), but even that I don’t feel is sufficient for the real world jungle of univariate probability distributions — many of which may not have the potential for a direct transformation to normal.
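
For what it’s worth, here is a hedged sketch of how a skewness check might gate a Box-Cox transform using scipy. It mirrors the intent described above rather than the exact automunge internals, and it assumes missing values have already been filled (Box-Cox also requires strictly positive inputs, hence the shift):

```python
import pandas as pd
from scipy import stats

def maybe_boxcox(series, skew_threshold=1.0):
    """Apply a Box-Cox power transform when a numerical column looks fat-tailed."""
    if abs(stats.skew(series)) < skew_threshold:
        return series  # distribution looks tame enough, leave it as-is
    shifted = series - series.min() + 1.0       # Box-Cox requires strictly positive values
    transformed, fitted_lambda = stats.boxcox(shifted)
    return pd.Series(transformed, index=series.index, name=series.name)
```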

via Univariate Distribution Relationships by Leemis and McQueston

A further and I think valid criticism of the tool in its current iteration is the insufficient address of text data. As currently offered, any column that is not numerical or time-series is squeezed into the bucket of categorical and converted to a series of associated columns via one-hot encoding. (*A quick tangent — I actually met someone who was studying data science in university and wasn’t familiar with the term one-hot encoding, so perhaps a brief refresher is worthwhile — one-hot encoding simply means creating a distinct column for each category in a set, giving a sparse representation in which a row has a 1 in the column for its category and 0’s elsewhere.) So a weakness of the current method is that if we have a set of all distinct values — say serial numbers or addresses etc, well then we end up with one new column for each row, which obviously isn’t ideal. Fortunately the tool does allow you to pass the names of any columns which are to be excluded from transformations, so I guess what I’m getting at is that if you do have some column with all unique values, your best bet for automunge is to process that column prior to application however you see fit and then pass the name of the column to exclude it from further transformations. An important caveat is that if you want to make use of the ML infill technique (I’ll talk about that next), you’ll need to make sure you are passing a column that is already in ML-ready format (i.e. numerically encoded). So where could we go to extend the address of text? I think a good future iteration could be to incorporate a text parser that could coarse-grain the field, such as maybe to extract geographic categories, topical categories, sentiments, gender relationships, or any other adjectives that we would like our tool to test for. This is where there is real potential to incorporate third-party APIs.
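
As a quick illustration of the one-hot refresher and the exclusion workaround (the column names here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                   'serial_number': ['A001', 'B002', 'C003']})

# one-hot encode the low-cardinality categorical column
encoded = pd.get_dummies(df, columns=['color'])
# 'color' becomes color_blue / color_red indicator columns, while
# 'serial_number' (all unique values) is better handled separately
# and passed to automunge as a column excluded from transformations.
```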

HT Superintelligence by Nick Bostrom

So I touched briefly on the ML infill technique, but since we’re apparently trying to cover key features here it’s probably worth a quick refresher. One of the unique value propositions for the automunge software is a really cool method to address missing values in a data set. Current mainstream data science practice for data set infill is not extremely sophisticated — commonly practitioners will use an approach such as inserting the mean for the set, inserting the most common value, inserting a value from an adjacent cell, or some comparable shortcut. What automunge accomplishes with ML infill is to make our infill values more tailored to the properties of the data set, by training a series of machine learning models, one for each column, and predicting infill using the properties of adjacent cells in the same row. Thus if you consider arbitrary estimates for infill for what they really are, an accumulation of micro obstacles to efficient downstream learning, by tailoring our infill to the data set properties we are (slightly) lowering the bar for efficient learning. Further, automunge accomplishes this in a generalized and fully automated fashion.
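
As a simplified sketch of the concept (not the automunge implementation itself), here is what predicting infill for a single numerical column might look like with scikit-learn, assuming the remaining columns are already numerically encoded and free of missing values:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def ml_infill_column(df, target_column):
    """Fill missing values in target_column using a model trained on the other columns."""
    features = df.drop(columns=[target_column])
    known = df[target_column].notna()
    if known.all():
        return df  # nothing to fill
    model = RandomForestRegressor(n_estimators=100)
    model.fit(features[known], df.loc[known, target_column])        # train on rows where the value is present
    df.loc[~known, target_column] = model.predict(features[~known])  # predict the missing entries
    return df
```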

demonstration of automunge(.) application — more on github.com/automunge
Haley Reinhart — Sittin’ on the Dock of the Bay
demonstration of postmunge(.) application — more on github.com/automunge

Of course the big announcement here isn’t just the self-flagellation over current product shortcomings (believe me, I could go on), but instead something of a little more significance. Here it comes, the big announcement: automunge is hereby an open source project. … (pause for applause) … Thank you, thank you, no that’s enough, really. Everyone please be sarcastically seated. I’ve been working on this project full time for about three months now, publishing regular essays on the status throughout, and to be honest I haven’t found a great deal of feedback forthcoming from the machine learning community. The biggest validation of the concepts I’ve found so far has come from some of the speakers at a recent data science conference held at Rice University. For example, the featured speaker there was from a company in the oil and gas industry who was using machine learning to address missing values in drilling sensor measurements, basically a version of ML infill specialized to a specific use case. But the Automunge version of this technique is generalized and fully automated for structured data sets. I think that having a product which goes beyond what the keynote speaker at this conference presented is a kind of validation. Further validation from this same conference came from a speaker presenting on tools for automated machine learning — tools such as Google’s AutoML and the like. I got to speak with this speaker after the presentation and was thrilled by the additional validation of the uniqueness of fully automated data wrangling. I believe we have a unique product and distinct value proposition — now if I could just find the means to get this into users’ hands — let’s just say distribution is not fully figured out yet.

via My Inventions by Nikola Tesla

That is not to say that this invention is without competition. There is no shortage of companies offering data wrangling tools. Some of the notable offerings include Trifacta, an extremely well funded firm ($100M+ in venture funding) with an established partnership with Google — their Wrangler product is now incorporated directly as an offering in Google Cloud, for instance. Although Trifacta’s product certainly streamlines the data processing workflow with a polished interface, it still makes use of a manual address. Another, earlier stage venture that I see as more directly competing today with where I would like to take Automunge is Feature Labs; from what I’ve gathered in their literature, they already have a product intended for the automated generation and evaluation of feature engineering transformations. I haven’t had a chance to try out their product so I’m not sure how effective it is, but I don’t believe they have the advantage of the ML infill technique, for which (as a reminder) we currently have patent pending status. There are certainly others that come to mind. In my prior blog post documenting a Kaggle competition entry I tried out RapidMiner, for instance; I’m not sure how their product sits now, but based on my use last year I thought they had a ways to go. However, I see the biggest potential competitors for the features that automunge is trying to capture with the current open source offering not necessarily in these data processing tools, but rather in the mainstream machine learning frameworks. When you consider that the automunge application of today is just a simply called python class, I could see this fitting right into some implementation from the likes of scikit-learn, keras, fast.ai, or, you know, any of those types of players. Given that the whole point of rolling out an open source implementation is to get integrated into user workflows so as to make a future paid product for feature engineering the path of least resistance, it would certainly damage that potential were comparable functionality to be incorporated into one of these mainstream tools. I haven’t seen any indication that that is happening, but it certainly wouldn’t shock me if some variation is in development.

(actually more of a dog person)

This is my first open source project, so I’m kind of venturing into uncharted territory. I toiled a little over the license strategy, and in the end settled on the GNU GPLv3 license. This is one of those known as a “copyleft” license. What that means is that anyone who incorporates this code into their own project will have to maintain consistent licensing. I believe an earlier version of this license (GPLv2) is used by the Linux kernel, for instance. There are other, more permissive license approaches (MIT, Apache, BSD, etc.) which allow for more open distribution. So the benefit of the GPL is that it maintains some semblance of copy protection — after all, a third party can’t adopt the code into their own closed commercial offering; they can only fork it if it remains under consistent (open source) licensing. Does it make sense to open source a tool that we are trying to patent? I think it does, based on the following rationale. Although we are in effect granting a free license to anyone who uses our code, I think we still have recourse for a potential paid licensing agreement with any commercial product that makes use of the automated wrangling via other means. I’ve heard some call this kind of protection a “lottery ticket for a lawsuit”, and believe me, turning into a patent troll is not the goal here; at the same time the goal is to encourage wide use of the tool such as to facilitate a future paid product utilizing external computing resources that integrates with the offering. We’re certainly not trying to handcuff anyone into our product; the goal is to make this future offering the path of least resistance.

Where do we go from here? Well, I’ll be starting a new job soon as a data scientist with a Houston firm (one I’m looking forward to). Part of the rationale for taking this code open source is so that I’ll be able to use the resource in the context of this new job without jeopardizing intellectual property. But where does that leave the automunge project? Well, to be honest, despite a fair bit of technical validation of the concepts and potential for the technology, I have very little validation that I personally have what it would take to run this type of enterprise full time. Take for instance this blog: I’ve been writing on Medium for over two years now with a fairly regular publishing schedule, and despite a few lean exceptions this blog has literally no audience to speak of. If I can’t even hack growth by offering free content in high value domains, it doesn’t seem likely that a commercial product would be more successful. When I consider some of the essential characteristics of a founder — networking, growth hacking, etc. — they’re not exactly my strengths. Don’t get me wrong, I think I would make an excellent executive, but to grow a firm would require a team — ideally assembled for complementary strengths — and all evidence at hand is that recruiting a team is outside of my ability. The hope is that by releasing this code to the open source community I might be able to circumvent that obstacle through the strength of the product’s value; at the very minimum the hope is that perhaps a few people might find this of use. I mean, no one can predict the future, after all.

HT The Quark and the Jaguar by Murray Gell-Mann

I’ll close with a brief thought to put all of this neural network stuff in perspective. Life doesn’t exclusively rely on neural cognition to solve problems; it also relies on adaptive strategies through experiments in genetic variation via mutation or combinatorial transformation. You know, species evolve, people have children and stuff. Let’s not lose sight of what’s important.

The ground was still hanging menacingly above his head, and he thought it was probably time to do something about that, such as to fall away from it, which is what he did. … “Would you like to try?” She bit her lip and shook her head, not so much to say no, but just in sheer bewilderment. She was shaking like a leaf. “It’s quite easy,” urged Arthur, “if you don’t know how. That’s the important bit. Be not at all sure how you’re doing it.”

— Douglas Adams, So Long and Thanks For All the Fish

Books that were referenced here or otherwise inspired this post:

Pandas Cookbook — Theodore Petrou

Deep Learning — Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Quantum Computation and Quantum Information — Michael Nielsen and Isaac Chuang

Incerto — Nassim Taleb

Superintelligence — Nick Bostrom

My Inventions — Nikola Tesla

The Quark and the Jaguar — Murray Gell-Mann

So Long and Thanks For All the Fish — Douglas Adams

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.

Jeff Bezos — Regret Minimization Framework
