A Family for Data

Infrastructure of Insights

Nicholas Teague
Automunge
20 min read · Nov 18, 2018


Automunge GitHub repository available here

“… I shall endeavor, while intending no discredit to anyone else, to make myself understood to Your Excellency for the purpose of unfolding to you my secrets, and thereafter offering them at your complete disposal, and when the time is right bringing them into effective operation all those things which are in part briefly listed below:” — Leonardo da Vinci, letter to a ruler, Letters of Note

Buckwheat Zydeco — My Feet Can’t Fail Me Now

Background

For those who haven’t been following along, I’ve been using this forum over the last three months to document the development of a tool intended to automate the processing of structured data sets for the application of machine learning. The project started in a somewhat haphazard fashion as an attempt to take some of the lessons learned from a few brief forays into Kaggle competitions (a platform for competitive crowdsourced innovation contests in data science) and build my personal competency in working with data. Along the way I was helped by a few leaps of insight:

  1. First insight: The current generation of machine learning frameworks require data in a specialized format (numerically encoded) in order to run their algorithms. Further, additional steps of processing, such as z-score normalization of numerical features and one-hot encoding of categorical features, can make for more efficient learning. Current mainstream practice in data science requires at least some degree of manual address to prepare data into this form, varying based on the tools applied.
  2. Second insight: Given some general functions for processing different categories of dataframe columns (categories such as numeric, binary, categorical, time-series, etc), it is possible to incorporate an evaluation function that determines from properties of the data which of these processing approaches to apply to each column, thus facilitating a fully automated address for final steps of data preparation, which we have implemented with what we call the automunge(.) function.
  3. Third insight: Current mainstream practice for addressing missing values in a structured dataset is somewhat unsophisticated, relying on approaches such as inserting the mean of a set, the most common value, or values from an adjacent cell. Each of these approaches in aggregate slightly raises the bar to efficient downstream learning. Using the framework of automation from automunge, it is possible to incorporate an additional step of predicting infill for missing values in a set based on properties of adjacent cells in the same row, using machine learning models trained on the rest of the data in a fully generalized and automated fashion. We call this feature of automunge “ML infill”. (A toy sketch illustrating the normalization, encoding, and infill ideas from these insights follows this list.)
  4. Fourth insight: Part of the challenge of preparing data for machine learning algorithms is that a fully consistent processing approach is required between the data used to train a model and any subsequent data used to generate predictions from that same model in the future. This realization led to the development of the function we call postmunge(.) for the consistent processing of any downstream data.
  5. Fifth insight: The processing of a column of data does not have to be realized as just a direct translation into some alternate form. By presenting our data to a machine learning algorithm in multiple alternate forms our model may be able to recognize features earlier in the application of backpropagation than it might otherwise. Through the automunge development process we have continued to build out the range of feature engineering transformations to present our data in multiple forms for purposes of training.
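
To make these ideas a little more concrete, here is a minimal toy sketch using pandas and scikit-learn. It is not the Automunge implementation, just simplified stand-ins for the kinds of transformations described above; the column names and model choices are purely for illustration.

```python
# Toy sketch only — simplified stand-ins for the transformations described
# above, not the actual Automunge implementation.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'age': [34.0, 29.0, None, 41.0],
                   'color': ['red', 'blue', 'red', 'green']})

# one-hot encode the categorical column (insight 1)
df = pd.concat([df, pd.get_dummies(df['color'], prefix='color')], axis=1)

# "ML infill" in miniature (insight 3): train a model on the rows where the
# target column is populated, using the other encoded columns as features,
# then predict values for the rows with missing entries
features = ['color_blue', 'color_green', 'color_red']
known, missing = df[df['age'].notna()], df[df['age'].isna()]
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(known[features], known['age'])
df.loc[df['age'].isna(), 'age'] = model.predict(missing[features])

# z-score normalize the (now filled) numeric column (insight 1)
df['age_nmbr'] = (df['age'] - df['age'].mean()) / df['age'].std()
```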

“To speed up the experimental work, which is at present being carried on within the limits of the budgets of University laboratories, by providing funds, if such funds be required, through his contacts with private persons who are willing to make contributions for this cause, and perhaps also by obtaining the co-operation of industrial laboratories which have the necessary equipment.” — Albert Einstein, letter to the president, Letters of Note

Buckwheat Zydeco — Ya Ya

Intentions

As the automunge project progressed, I began to see potential for this to become more than just a building and learning experiment. Although I still don’t consider my knowledge of the competitive environment exceptional, by following machine learning industry publishing closely through resources like the TWiML podcast, the Import AI newsletter, fast.ai courses, and especially a honed twitter feed, I at least got a strong sense that there was no single dominant commercial player established relevant to this project. Long story short, Automunge progressed from a research project to an incorporated business entity, with the hope that monetization could be reached based on the following goals:

  1. Goal one: the provision to data science industry participants of a packaged python product based on the current capabilities of the tool, intended as a free resource to simplify the final steps of data processing for purposes of machine learning, with the hope that, due to the simplicity of the fully automated address coupled with the inclusion of novel capabilities for ML infill, some degree of reach would be established for a user base.
  2. Goal two: although this blog has long been fully transparent about the algorithmic components through the inclusion of a series of Colaboratory notebooks serving as the development environment, by open-sourcing this tool (a recent step) we hoped to encourage further scrutiny of the algorithms, for without the name brand credibility of a large firm we obviously deserve some increased skepticism. Such scrutiny and feedback could potentially be a source of improvements to the code-base, and if a supportive developer community were to emerge it would be welcomed.
  3. Goal three: the intent is to continue building out the range of feature engineering transformations to facilitate a more fine grained address to different categories of data that can be inferred from the properties of a set. Although the tool in its current iteration is not extremely sophisticated from a statistical standpoint, I believe the framework of the tool is getting close to the point where it could serve as a generalized platform to build capabilities for more sophisticated types of transformations and feature engineering address.
  4. Goal four: and this is the big one. All of this processing will require some degree of computational load. The longer term goal is to establish a paid model for application of the automunge tool in cases requiring external computing resources. We believe that by establishing a strong enough user base with our initial offering, such a second generation tool, integrated into the framework, could become a “path of least resistance” for a practitioner.
  5. Goal five: the art of feature engineering is about more than just transforming distinct subsets of data; there is potential to further refine data with multivariate transformations. Of course evaluating the full range of combinatorial data transformations is what deep learning is all about, and we expect some might argue that deep learning negates the need for any significant effort toward feature engineering. We believe there is potential for a middle ground, with some “pre-training” achievable with a degree of algorithmic evaluation for some generic classes of sub-feature transformations. To be honest this is more of a hunch than a tested hypothesis, and the realization of this will require further experimentation.
via Nick Cave — Feat on display at Orlando Museum of Art

“I do not pretend to give such a sum; I only lend it to you. When you shall return to your country with good character, you cannot fail of getting into some Business, that will in time enable you to pay all your debts. In that case, when you meet with another honest man in similar Distress, you must pay me by lending this Sum to him; enjoining him to discharge the Debt by a like operation, when he shall be able, and shall meet with another opportunity. I hope it may thus go thro’ many hands…” — Benjamin Franklin, letter to a debtor, Letters of Note

Buckwheat Zydeco — Hard to Stop (Live)

How Automunge Works

The tool in its current iteration is a defined Python class called AutoMunge. Although I have not yet taken the step of packaging on PyPI for pip install, it is getting close to that point. Once imported, the class is intended to be operated by interface with the two master functions automunge(.) and postmunge(.), described further here:

1. automunge(.): the automunge(.) function takes as input a “tidy” structured data set such as might be imported from a CSV file, albeit currently the tool expects the set as a Pandas dataframe. The tool takes as input a “train” set intended for use in the downstream training of a machine learning model, and if available a consistently formatted “test” set intended to develop predictions from that same downstream model. The tool allows for the specification of any special case columns such as any label column in the train set (used as the target variable in the machine learning cost function), ID columns in the train or test set intended to identify rows, or a set of any columns which are to be excluded from transformations. The function has a series of other inputs that allow for some customization of address (such as selection of various feature transforms or selection for use of the ML infill method). When run, the function will process these train and test sets into a format directly suitable for application of machine learning algorithms in a framework of a user’s choice. Numerical data is normalized, categorical data is one-hot encoded, and any missing values receive predicted infill using properties of the rest of the set via ML infill. The function returns a series of NumPy arrays intended for use as a training set, two separate validation sets (one for hyperparameter tuning and a second for final evaluation prior to release), a test set, corresponding labels and ID sets, a list of column identifiers, and finally a python dictionary which should be saved for the subsequent consistent processing of test data through the postmunge(.) function described below. The function is not a replacement for feature engineering, however it is suitable as a replacement for the final steps of data processing prior to application of machine learning. Some of the key support functions that facilitate operation include:

  • evalcategory(.): to allow for automation of the data wrangling address, the tool requires some capability to assess the contents of a column and identify an appropriate processing approach. The evalcategory(.) function relies on a series of tests to assign the contents of each column into one of a series of pre-defined categories. Some of these categories include ‘text’ for categorical data, ‘bnry’ for binary values, ‘nmbr’ for numerical sets, ‘date’ for time-series data, etc. The function looks at the most common datatypes in a column’s set, and uses a few relatively simple heuristics to assign a category.
  • processing(.): The assigned category from the evalcategory(.) function allows the algorithm to select which of a series of processing functions to apply to each column of data. The processing functions take as input a column of name ‘column’, and output a column of name ‘column_ctgy’, where ‘ctgy’ is an identifier for the type of transform applied. The exception to this naming convention is for categorical sets, where the outputted one-hot encoded columns are labeled as ‘column_category’, where in this case ‘category’ is the value from the categorical set corresponding to that column’s activation. This naming convention allows for a simple means of recognizing the steps of derivation for a set in the output, for with each step of transformation the column heading identifier will accumulate suffix labels such as ‘column_ctgy1_ctgy2’ etc. Also outputted from each processing function is a dictionary capturing all of the information and normalization parameters required both for the ML infill techniques and for the consistent processing of some future test set in our postmunge(.) function, which we will discuss shortly.
  • MLinfillfunction(.): Once our tool has prepared the sets into a machine learning ready form in the processing step, it can then implement the ML infill technique for infill of missing data. Note that in the prior processing step the sets receive some initial infill using conventional methods such as the mean value for a numeric set, the most common value for binary, or a distinct column identifier for categorical. The processing function also develops a distinct column of boolean identifiers indicating which rows in each column will be in need of infill, which serves as an input to the ML infill techniques. ML infill works by running a for loop through each column in the set, and for each column developing a corresponding set of training data, labels, and features from the rest of the data. Note that some care is required for multicolumn transformations to avoid leakage, which is taken into account, although a future extension might allow a user to specify other potential sources of leakage between columns that were present in the original data such that these columns may be carved out from that column’s ML infill address (I wonder if there’s some way to test for this to automate that step, such as flagging two columns exceeding some bar of correlation). Using a column’s developed sets, a predictive model is trained. The tool currently makes use of a random forest regressor for numerical sets and a random forest classifier for categorical sets, although I expect there could be potential to adopt other methods (part of the reasoning here is that the random forest methods are fairly good for a generalized address, not requiring a great deal of hyperparameter tuning, although the intent for future development is to customize to a degree the parameters of the model training operation based on properties of the column). Once trained, the model corresponding to a column is used to predict infill for any missing values.

2. postmunge(.): The second master function for operation is called postmunge(.). Postmunge is intended for the consistent processing of test data (used to generate predictions from a trained model) that wasn’t available for the initial application of automunge. Note that in addition to the test data inputted as a Pandas dataframe, the function requires as input the python dictionary called “postprocess_dict” which was returned from the application of automunge. The postmunge function extracts from the postprocess_dict all of the relevant parameters required for consistent processing of data, including the generation of infill predictions via ML infill. It returns a NumPy array of consistently processed data. A brief illustrative sketch of the intended call pattern for these two functions follows.
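
For orientation, here is a rough sketch of what that call pattern might look like. The argument names, returned set names, and import path are illustrative placeholders based on the description above rather than the exact current signature (the tool is not yet packaged for pip install).

```python
# Illustrative call pattern only — argument names and returned sets are
# placeholders based on the description above, not the exact current API.
import pandas as pd
from Automunge import AutoMunge   # hypothetical import; not yet on PyPI

df_train = pd.DataFrame({'ID': [1, 2, 3],
                         'age': [34, 29, 41],
                         'color': ['red', 'blue', 'green'],
                         'label': [0, 1, 0]})
df_test = df_train.drop(columns=['label'])

am = AutoMunge()

# process a train set (and optionally a consistently formatted test set)
(train, labels, validation1, validation2, test,
 trainID, testID, finalcolumns, postprocess_dict) = \
    am.automunge(df_train, df_test,
                 labels_column='label',   # target column in the train set
                 trainID_column='ID',     # identifier column to carve out
                 MLinfill=True)           # predict infill for missing values

# ... train a downstream model on the returned NumPy arrays ...

# later, consistently process new data using the saved postprocess_dict
df_newdata = df_test.copy()
test2 = am.postmunge(postprocess_dict, df_newdata)
```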

“In other words, we think we have found the basic copying mechanism by which life comes from life. The beauty of our model is that the shape of it is such that _only_ these pairs can go together, though they could pair up in other ways if they were floating about freely. You can understand that we are very excited. We have to have a letter off to Nature in a day or so.” — Francis Crick, letter to his son, Letters of Note

Buckwheat Zydeco — Help Me Understand You

This Week’s Updates

This week was something of a sprint to the finish line to try and get a few more architectural revisions in place — I’ll be starting a new job next week and expect I won’t have nearly as much time to work on coding and writing, so am trying to get as many loose ends tied up as possible:

1. Some of our prior updates to the tool included a few steps to generalize the application of ML infill into a generic function called independently of the category of data. Upon some reflection I realized that this type of object oriented generalization ideally would be needed for the application of processing functions as well — which is what we have now accomplished with our processfamily(.) function.

2. Not only have we generalized the application of processing to a category-neutral basis, but we’ve also established some new primitives for processing steps. The primitives are meant to allow multi-tiered processing in a kind of family tree structure, loosely inspired by the Teague family and kin.

Automunge family of data primitives

3. These primitives are presented to the automunge function in the form of a dictionary we call transform_dict, one that, although currently a static entity within the function, is intended in future extensions to potentially be fed to automunge as an input, serving as a kind of API allowing for programming of the steps of processing — kind of like how TensorFlow allows you to program the architecture of a neural network.

4. The processing functions themselves have been presented in this forum in prior posts, but as a quick refresher we have a distinct set of processing functions for each evaluated category corresponding to the keys found in the transform_dict presented above. Most categories have a pair of processing functions for the dual treatment of train and test sets in automunge, plus a corresponding “postprocess” function for the treatment of the test set based on input from the postprocess_dict. However, for processing steps where no input from the postprocess_dict is required, we have a single process function applied to one set at a time (thus the three designations in this code: dualprocess, singleprocess, and postprocess).

5. The processing of each column starts in the processfamily(.) function, which applies a different processing function based on the key derived from the transform_dict presented above. Keys without offspring are processed using the processcousin(.) function and keys with offspring are processed using the processparent(.) function, which implements a new family generation using recursion by calling processparent(.) from within itself. (A rough sketch of this family tree structure and recursive dispatch follows.)
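
To illustrate the idea (and only the idea — the key names and structure below are speculative simplifications, not the actual internal transform_dict or processing functions), a family tree of primitives and its recursive dispatch might look something like this:

```python
# Speculative simplification of the family tree idea — not the actual
# internal transform_dict structure or processing functions.
import pandas as pd

transform_dict = {
    'nmbr': {'cousins': ['NArw'], 'parents': ['nmbr'], 'children': ['bins']},
    'bins': {'cousins': ['bins'], 'parents': [],       'children': []},
    'NArw': {'cousins': ['NArw'], 'parents': [],       'children': []},
}

def applytransform(df, column, label):
    """Toy stand-in for a processing function: derives a new column and
    appends the transform label as a suffix, e.g. 'column_nmbr'."""
    newcolumn = column + '_' + label
    df[newcolumn] = df[column]
    return df, newcolumn

def processcousin(df, column, label):
    """A key without offspring: apply the transform and stop."""
    df, _ = applytransform(df, column, label)
    return df

def processparent(df, column, label):
    """A key with offspring: apply the transform, then recurse into the
    next generation using the newly derived column as the source."""
    df, newcolumn = applytransform(df, column, label)
    for child in transform_dict[label]['children']:
        if transform_dict[child]['children']:
            df = processparent(df, newcolumn, child)    # recursion
        else:
            df = processcousin(df, newcolumn, child)
    return df

def processfamily(df, column, category):
    """Dispatch each primitive registered for an evaluated category."""
    for cousin in transform_dict[category]['cousins']:
        df = processcousin(df, column, cousin)
    for parent in transform_dict[category]['parents']:
        df = processparent(df, column, parent)
    return df

df = pd.DataFrame({'score': [1.0, 2.0, 3.0]})
df = processfamily(df, 'score', 'nmbr')
print(list(df.columns))
# ['score', 'score_NArw', 'score_nmbr', 'score_nmbr_bins']
```

Note how the accumulating suffixes mirror the column naming convention described above for the processing(.) functions, so the steps of derivation remain legible in the output.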

“To let another man define your own goals is to give up one of the most meaningful aspects of life — the definitive act of will which makes a man an individual. Let’s say you have a choice of eight paths to follow (all pre-defined paths of course). And let’s assume that you can’t see any real purpose in any of the eight. THEN — and here’s the essence of all I’ve said — you MUST FIND A NINTH PATH.” — Hunter S Thompson, letter to a friend, Letters of Note

Buckwheat Zydeco — Turning Point

Conclusion

I’ll finish with a few admittedly random thoughts here, as I probably won’t be writing with as much frequency now that I am venturing back into the 9–5 world. I’ve had a few tidbits that were intended to be fleshed out into essay form, and perhaps they may be better served here as brief forays to close:

  1. I’ve gotten a sense that there is a kind of active debate in the machine learning community involving artificial general intelligence (AGI). The issue being that research and tools that are not working towards superintelligence are seen by many as a distraction that will soon be obsolete and thus are not necessarily of value, like investing in compact discs near the dawn of the iPod. The automunge project obviously does not subscribe to this philosophy. When our ancestors were first trying to implement human flight, they sought means to mimic the movements of birds flapping their wings, but in the end it was through wings with propellers attached and then gas turbines that wide scale human flight was realized. Yes it is possible that some giant leap will be realized that allows a machine to mimic the brain in interpreting arbitrary input data without human supported intervention, but in the meantime data scientists and researchers who work with data have a real need for tools that simplify what is largely in mainstream practice a manual task of transforming data and preparing for application of machine learning algorithms. Besides, even if researchers do reach the domain of AGI in our lifetime, I would find it surprising if it was allowed unfettered implementation by all classes of users — it is a reality of our regulatory environment that technology of sufficient sophistication leading to potential dual use (e.g. defense) applications could very likely be restricted from universal access, leaving tiers of users — some that may have access to domain independent interpretation capabilities, and some that are otherwise left with some reduced capacity implementation resources (to be clear this is mostly speculation about where we might be headed). For now we don’t have AGI, it is the age of the compact disc, but you know what, there are certainly companies in the music industry that survived the transition to streaming. Wayne Gretzky says hockey players shouldn’t skate where the puck is, they should skate where it is going — well you know that advice only works if you then take the puck somewhere you may have a reasonable shot on goal.
  2. So let’s ruminate for a second on what a monetized version of automunge would look like. Given that this is an open source product, we’re obviously not intending to sell the software itself as a product. When Xerox initially had so much success commercializing their copy machine, they didn’t sell the equipment outright, they placed it into offices to incorporate into their workflow, and then monetized on a per copy basis. Now as an open source python class we’re obviously not going to be charging per application, however as the framework is intended to be built into a platform with an API interface for running different versions of transformation chains, we expect this application to reach computational loads that may require external computing resources — at which point we may be able to offer the service on a per use basis, like charging per copy. The potential for revenue streams is enhanced when you consider that a user making initial use of automunge(.) to train a downstream machine learning model will then be tied to postmunge(.) for purposes of processing subsequent data intended for use to generate predictions. This could be a channel for recurring revenue streams. The idea being that, for applications of automunge that make use of external computing resources, we may decide to incorporate encryption into the postprocess_dict normalization dictionary output from automunge and used in the postmunge operation — this is an important business decision though and a final path for this measure may require some more rumination. After all, if the WAV files on compact discs were encrypted we might never have seen the success of Napster.
  3. From a development standpoint, I do see a few more updates that would be useful before rolling out an API. First, as we discussed prior, the intent is to make the transform_dict assembly (the keys to the generational tree of transform chains) available for external interface as opposed to the static form as currently addressed. There certainly is a need to increase the sophistication of numerical set statistical address, both from the evaluation inference standpoint as well as the transformations themselves — I know that Wolfram Language has some functions for inferring distribution types and parameters from a set for instance, and I wonder if there may be some way to incorporate that capability into our paid model. Right now we lack the functionality to perform downstream transformations on multicolumn sets. I have an idea of how to enable that; it will require reworking a few things, but I think it is doable. I’d really like to incorporate a method to allow for multivariate transformations as well. I think that will be the big feature when we roll out version 2.0. I think when we get to the stage of an external API interface, it will help to have a capacity not just to allow specification of which of the existing transforms are to be applied, but also perhaps the ability for user specified transformation functions and corresponding user specified evaluation functions — perhaps even developing a library supported by an open source community. That would be really cool. I guess what I’m getting at is the hope that this will be more than just a static tool, but a platform allowing for building processing evaluation and transformation capabilities. Now there are still a few menial issues that need address as well. For instance we currently have a fairly tacky heuristic that deletes a column if it has >80% missing values, which I need to put some more thought into how to make more sophisticated. There’s also currently an annoying bug that provides an additional copy of the NArws column for categorical sets which I just haven’t had time to figure out. It wouldn’t surprise me if there may be another issue or two that I haven’t found yet. Testing and validation have not yet been a primary priority.
  4. Those few that have come across some of my earlier posts may know that I’ve had an interest for a while in quantum computing, including a fair bit of research and writing on the subject. One of the sacrifices made in putting so much focus on this Automunge project has been putting on hold this fun hobby of research into the coming quantum computer revolution. Not long before I started this project I did get a chance to audit the series of edX courses on Quantum Information Science taught by Aram Harrow, Peter Shor, and Isaac Chuang at MIT — this was an excellent resource and I do recommend the series to any that may be looking to develop some knowledge in the field. I found one of the most fascinating elements of the class to be the issue of copying quantum states. As anyone familiar with quantum dynamics probably knows, the no-cloning theorem is there as a kind of protection for the Heisenberg uncertainty principle — the classic example being that we can’t simultaneously know both the position and the momentum of a particle in superposition. Well it turns out there is a kind of technique that can be realized with quantum computers enabling what is known as a gentle measurement — one in which, although the no-cloning theorem still applies (we can’t make an exact copy of a qubit’s state), a probabilistic copy is allowed, whereby the more qubits applied to the method the higher the fidelity of the copy. I find this concept fascinating and hope to be able to research further if time allows.
  5. If you consider our lives as like the paths from Borges’s short story The Garden of Forking Paths, then with every person we are lucky enough to cross paths with, we may on different routes experience a whole gamut of experiences. In some paths our brother may annoy us by always nitpicking our facebook posts, in another our sisters may out of nowhere get married or have twins practically at the same time. In some paths that girl you met at the bookstore may turn out to be a soulmate, perhaps in others forever a distant connection. I consider myself blessed to have run down this distinct path with such great family and friends, and I want to take a quick second to thank a few. Cheers to Laura for making college football fun again. Cheers to Heidi, someone I could always count on. Cheers to my fellow UF misfits and vagabonds — I never even came close to rushing a fraternity, but it certainly felt like we were filming a reality remake of Animal House. Cheers to all of my coworkers that I’ve had the pleasure to toil in obscurity alongside over the years. (You didn’t know I had a blog? Well you never asked.) Cheers to the entire board of the Gina McReynolds Foundation for your selfless giving to such a noble cause, you guys are an inspiration to me. Cheers to fellow alumni of the Real World Risk Institute — a loose collection of random interactions that has shaped the way I see the world, may there be many more to come. Cheers to my amazingly quirky family, whoever said that all happy families are alike obviously didn’t know about the Teagues. You know Seneca said that “every individual can make himself happy. … For he has always made the effort to rely as much as possible on himself and to derive all delight from himself.” Over the years some have criticized Seneca as somewhat hypocritical in that he preached the tenets of stoicism, but lived the life of wealth and influence. But consider his own words from On the Shortness of Life that immediately follow those preceding: “So what? Am I calling myself a sage? Certainly no.” I never claimed I could do any of this alone. I consider myself beyond lucky to have found myself surrounded by so many people I can look up to.

Books that were referenced here or otherwise inspired this post:

Letters of Note — Shaun Usher

Gödel Escher Bach — Douglas Hofstadter

From the Diaries of John Henry — Nicholas Teague

Collected Fictions — Jorge Luis Borges

On the Shortness of Life — Seneca

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.

“P.S. I’LL ALWAYS CHERISH THAT AFTERNOON WE SPENT TOGETHER IN RIO, WALKING ALONG THE BEACH, LOOKING AT _rocks_.” — Steve Martin, letter to a fan, Letters of Note

Paul Simon — That Was Your Mother
