Automunge Inc.

Artificial Learning, Intelligent Machines

Nicholas Teague
Automunge
14 min read · Oct 30, 2018


OAR — That Was a Crazy Game of Poker (live, 34th and 8th)

For those that haven't been following along, I've been using this forum in recent weeks to document the development of a tool for the automation of data wrangling (aka "munging") of structured data sets, preparing them for the direct application of machine learning algorithms in the framework of your choice. In its current iteration the tool is not yet a replacement for feature engineering; however, it is suitable as a replacement for the final steps of data processing prior to the application of machine learning. Numerical data is treated with a normalization to mean 0 and standard deviation 1, along with the separate optional application of a power law transform to address sets that may be subject to fatter-tailed distributions. Categorical data is encoded into multiple columns via the one-hot encoding method, and time series data is segregated by time scale and normalized. What's more, the tool addresses missing data points in the sets by deriving predicted infill using machine learning models trained on the rest of the data, in an automated fashion. The two primary functions are "automunge", for the simultaneous address of initially available train and test sets, and "postmunge", for the subsequent consistent processing of test data that wasn't available for the initial address. As we continue the development journey, the intent is to keep building out the range of feature transformations so as to facilitate a full automation of the feature engineering workflow. Initially the tool will be intended for application on a user's local hardware, and as we reach features requiring external computing resources we will layer on a user account with a pay-per-use pricing model. I'll use this week's write-up as a summary of the updates that have been incorporated since last week, probably with a tangent or two along the way.
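To make the numerical treatment concrete, here's a minimal sketch in pandas of the kind of transforms described above: a z-score normalization plus an optional Box-Cox power law transform. This is just an illustration of the concept, not the tool's actual implementation, and the function name is my own.

import pandas as pd
from scipy import stats

# Illustrative sketch only: z-score normalization with an optional
# Box-Cox power law transform for fatter-tailed distributions.
def normalize_numerical(column, powertransform=False):
    if powertransform:
        # Box-Cox requires strictly positive inputs, so shift the set first
        shifted = column - column.min() + 1
        transformed, _ = stats.boxcox(shifted)
        column = pd.Series(transformed, index=column.index)
    # normalize to mean 0 and standard deviation 1
    return (column - column.mean()) / column.std()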

In our last post, a primary point of focus was an attempt to clean up the architecture of the code base. The evolutionary nature of the development process to date was a source of a lot of redundancy in data stores, and probably symptomatic of a lack of overarching vision from the beginning. Admittedly, the primary saving grace that kept the code from evolving in the direction of dysfunction was that the process flow was inspired by the typical manual address of data that I had found through my explorations of Kaggle kernels and the like. That being said, the full automation of this workflow, along with the integration of the ML infill technique within it, was based more on a few sparks of inspiration than on any prior art.

This week's coding updates were largely a continuation of this effort to consolidate components and lay the groundwork for future buildout. One architectural update originated from the realization that the application of column-specific processing functions was somewhat dysfunctionally broken into pieces: some actions took place inside a category-specific processing function, while additional actions took place separately in the master automunge function. The solution was simply to move all of the processing into each category-specific processing function. Those functions that had previously been fed a series of normalization parameters were now revised to take a simple column dictionary as both input and output. Another architectural revision involved the data stored in the column-specific dictionary. Previously the dictionary had captured a list of the full range of columns derived from a source column, with a second list populated only for cases of multicolumn categorical derivations. The updated approach makes use of this second list (which we call categorylist) for all transformations, even single-column ones. By generalizing to this use of two separate lists, we were able to generalize the application of ML infill methods to a single function call, independent of the type of transformation applied. This generalization of the ML infill call I think went a long way toward simplifying the code. In fact, through this generalization we were able to replace ~300 lines of code in the automunge function with a call to a single ~60 line function, and similarly replace ~200 lines of code in the postmunge function with a call to a single ~60 line function. On their own these don't sound like tremendous savings, but as we continue building out our range of feature transformations these savings will compound.
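For illustration, here's a hypothetical example of what this two-list convention might look like in the column dictionary (the key names here are my own assumptions, not the tool's internal naming); the point is that categorylist is populated for every transform, so the ML infill routine can operate uniformly.

# hypothetical illustration of the two-list convention
column_dict = {
    'gender': {'category': 'text',
               'columnslist': ['gender_male', 'gender_female'],
               'categorylist': ['gender_male', 'gender_female']},
    'age':    {'category': 'number',
               'columnslist': ['age_normalized'],
               # single-column transforms now populate categorylist too
               'categorylist': ['age_normalized']}}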

Patent Pending

Some of the other updates were focused on a few revisions to the address of numerical sets. I had noted previously that one weakness of the current approach was the potential for a numerically formatted categorical column to be interpreted as, and processed consistently with, a numerical set, which could lead to erroneous processing for data like phone numbers, zip codes, etc. The solution was multifaceted. First, it is already a property of pandas (the Python data analysis library that the tool is built on) to allow a user to assign a dataframe column the categoric data type, even in cases of numeric data.

df['serialnumber'] = df['serialnumber'].astype('category')

So one of the means of address was to update our category evaluation function to recognize when a user has applied this type transform prior to running automunge. Another means for a user to assign categorical treatment was implemented with the new addition of a list argument to automunge, allowing a user to pass the names of columns that should be treated categorically. Finally, as a fallback for cases of full automation of data streams (the end goal after all), we turned to a simple (and admittedly hacky) heuristic: a user-assigned ratio, currently defaulted to 0 if not otherwise addressed, which designates the ratio of a column's distinct values to its total number of rows below which a column should be treated as categorical. This gives a user a means for an automated address of the data. However, since any appropriate default value would necessarily be highly dependent on the scale of data being processed (after all, 0.01 applied to 100 rows would be a very different assessment than 0.01 applied to 1,000,000 rows), we left the default at 0. A future iteration of this tool may scale the default value based on the scale of the data set.
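Putting those three means together, the evaluation logic amounts to something like the following sketch (the function and argument names are mine, and this is a simplification of the tool's actual evaluation function):

def evaluate_categorical(df, column, catlist=(), ratio_threshold=0.0):
    # user pre-assigned the pandas 'category' dtype
    if df[column].dtype.name == 'category':
        return True
    # user passed the column name in the categorical list argument
    if column in catlist:
        return True
    # fallback heuristic: ratio of distinct values to rows below threshold
    # (threshold defaults to 0, i.e. the heuristic is off unless assigned)
    return df[column].nunique() / len(df[column]) < ratio_threshold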

Note: fast.ai cited “A systematic study of the class imbalance problem in convolutional neural networks” by Mateusz Buda, Atsuto Maki, Maciej A. Mazurowski

This offhand comment in one of the lectures from fast.ai's Machine Learning course inspired what I think may be a pretty useful addition to the project. The automunge function now includes a boolean argument, "TrainLabelFreqLevel", which when set to True will automatically levelize the frequency of label categories populated in the train set, specifically by copying multiples of the rows with lower-frequency labels to achieve an approximately (give or take) equal distribution of label categories in the set intended for training. This address is currently limited to labels populated with either binary or categorical values; a future extension will allow the method to be applied to numerical sets as well, by leveling the frequency of different ranges of the data, e.g. numerical data sorted by quintiles. A few more updates, which in the end I think will prove quite useful in expanding the range of use cases for the tool, included a simplification of the option to pass train sets to automunge either without a label column attached or without a parallel test set for initial processing. Finally, we added a default to output the validation sets for downstream model training as two separate sets: the first for validation purposes in hyperparameter / model tuning, and the second for final validation prior to production release. Of course, if you're training a model with k-fold cross-validation then only one of these validation sets will be needed, so you can just set one of the validation ratios to 0. Anyway, there's a companion Colaboratory notebook available if you're interested in the details.
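As a rough sketch of what the frequency leveling does under the hood (my own simplified rendition, not the tool's exact method):

import pandas as pd

def level_label_frequency(train_df, labelcolumn):
    # count rows per label category and target the most frequent one
    counts = train_df[labelcolumn].value_counts()
    target = counts.max()
    pieces = [train_df]
    for label, count in counts.items():
        shortfall = target - count
        if shortfall > 0:
            # copy multiples of the lower frequency rows (with replacement)
            rows = train_df[train_df[labelcolumn] == label]
            pieces.append(rows.sample(shortfall, replace=True))
    # shuffle so the copied rows aren't clustered at the end
    return pd.concat(pieces).sample(frac=1).reset_index(drop=True)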

application of automunge(.) function for initial processing of data
OAR — Night Shift…Stir It Up
application of postmunge(.) function for subsequent processing of data

Of course the week wasn't all Red Bulls and late night coding sessions (*metaphorical Red Bulls that is, never touch the stuff, but they say it gives you wings). The culmination of this development session has actually been the realization of a material step toward commercialization. Yes, we are hereby incorporated. Automunge Inc. Tag line: Artificial Learning, Intelligent Machines. Trademark and patent pending, copyright 2018, freedom of speech, all rights reserved, henceforth, so forth and so on. Just between you and me, I'm kind of figuring this out as I go along. I think we've got a material innovation and opportunity on our hands, but I have been so focused on the building aspects in recent weeks that some of the other considerations (marketing, funding, product design) aren't fully fleshed out just yet, to be honest. Not to worry, nothing that a few more long bike rides shouldn't resolve.

image via The Hudsucker Proxy (tongue in cheek)

So let's quickly take stock of where we are. We now have a working prototype of a software implementation of a useful and, I believe, novelly generalized approach to cleaning up messy data. The current iteration is realized in a Jupyter style notebook on the Colaboratory platform, and the primary reason I haven't packaged it in its current form is that the easiest way to do so, on the PyPI platform, necessitates an open source license. I toiled over whether that was an appropriate next step and finally concluded that for now I'm going to claim intellectual property protections. That first of all means copyright, and I'm now finally able to make the claim "patent pending". A patent is a pretty big deal, especially for such a potentially central function in the data science realm, but I don't want to overstate my case here. I've hardly had a full team of lawyers working on the registration, so there remains some uncertainty as to whether a full patent can be realized. Coupled with the intellectual property, we now have a business vehicle to go with it. Automunge Inc., trademark pending, is now established. It took kind of a leap of faith to circumvent an LLC and jump straight to corporation status; the hope is that if funding does materialize, this will allow us to move forward quickly. It's probably worth noting that I also recently accepted a job as a data scientist with a Houston firm, so starting in a few weeks Automunge Inc. will transition to a side project. Who knows, perhaps there's value to extract from the established intellectual property.

Side bar: Being “Fustest with the Mostest”

Highlights from Innovation and Entrepreneurship by Peter Drucker


“Fustest with the Mostest” requires an ambitious aim; otherwise it is bound to fail. It always aims at creating a new industry or a new market. … perhaps because “Fustest with the Mostest” must aim at creating something truly new, something truly different, non-experts and outsiders seem to do as well as experts, in fact, often better. … the outsider may have an advantage. He does not know what everybody within the field knows, and therefore does not know what can’t be done. … being “Fustest with the Mostest” is very much like a moon shot. … there has to be one clear-cut goal and all efforts have to be focused on it. And when this effort begins to produce results, the innovator has to be ready to mobilize resources massively. … it must always aim at creating a business that dominates its market … demands substantial and continuing efforts to retain a leadership position; otherwise, all one has done is create a market for a competitor. … For this strategy to succeed at all, the innovation must be based on a careful and deliberate attempt to exploit one of the major opportunities for innovation … In innovations that are based on process need, everybody in the organization always knows the need exists, yet usually no one does anything about it. However, when the innovation appears, it is immediately accepted as “obvious” and soon becomes “standard”. … the process need opportunity has to be tested against three constraints: do we understand what is needed? Is the knowledge available or can it be procured within the “state of the art”? And does the solution fit, or does it not violate the mores and values of the intended users? … the solution must fit the way people do the work and want to continue to do it.

A common pitfall for aspiring entrepreneurs is to mistake an innovation for a market opportunity. The two sets only occasionally overlap. So let's take a second to ask ourselves if there is a market opportunity here. I've previously noted in this forum a 2017 survey of data science practitioners. Somehow when I first read this report I thought it said there were around 20,000 professionals in the industry, but apparently that was a bit of confusion on my part; I think that might have been the number of survey respondents, lol. According to the survey, the largest problem these scientists address is cleaning up messy data. Since finding that study, a friend recently brought to my attention a (slightly dated) 2013 report, one slightly more eye opening, which estimated that by 2018 there would be a shortage of 190,000 data scientists. I found that number kind of surprising. However, I considered it corroborated by another report that recently estimated the number of data science professional jobs in the US would exceed 2.7M by 2020. Yes, you read that right, around 2% of the active workforce (!). If you believe these numbers, we should still consider that of course not all of these professionals are working directly with data, and of those that do I'm sure many use platforms other than Python; the 2017 Kaggle survey said around 76% use Python, for whatever that's worth. Still, I think this speaks to the market size, so let's just play a game where we try to assign a number. Say our automation for the final steps of data wrangling makes these workers 1% more productive; at an assumed $100k salary, that means we've achieved labor savings of 0.01 × 2.7M × 0.76 × $100k ≈ $2B per year. Capturing even a small fraction of that value would be a success. But the end goal is not to make these workers 1% more productive, it's to automate a sizable chunk of the data address, at which point we could be talking real money. Is it far-fetched to imagine that a one man startup could address a market of this size? Perhaps. But we've all got to start somewhere, and I certainly don't intend to continue tackling this alone indefinitely.

Sublime — Marley medley (acoustic)

So we've spoken to the market size, but how about the product? Does it fulfill an unmet need? Is there some driving market trend causing a potential shift in some direction, like say Swanson's law for the falling costs of solar panels? Although the end goal is to fully automate the feature engineering workflow, I think the current iteration could potentially meet the usefulness criteria even in its current form. The practice of data wrangling for purposes of machine learning application will generally benefit from normalizing numerical sets and encoding categorical sets into binary registers. That's a universal step that, from what I've gathered, has in general practice largely required a manual address of each specific column. Not only are we automating this portion of the workflow, but we are also adding a novel generalized address of infill for missing values, one that I believe is superior to commonly applied methods such as inserting plug values based on a set's mean or the like. The ML infill technique infers infill for a missing cell based on the properties of that specific row, and it does so in a fully automated fashion, using machine learning models trained on the rest of the data set. In addition to the automunge function, the incorporation of the postmunge function further generalizes the workflow by allowing the practitioner a means to easily and consistently process subsequent test data sets that were not available during the initial automunge application, for purposes of generating machine learning predictions. Through each of these features I believe we are materially reducing the complexity of the workflow for a machine learning practitioner addressing structured data sets.
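For the curious, the concept behind ML infill can be sketched in a few lines with scikit-learn. This is an illustration of the idea under simplifying assumptions (a numerical target column and fully populated feature columns), not the tool's actual implementation:

from sklearn.ensemble import RandomForestRegressor

def ml_infill(df, targetcolumn):
    # train on the rows where the target value is present,
    # using the other (already processed) columns as features
    features = [c for c in df.columns if c != targetcolumn]
    known = df[df[targetcolumn].notna()]
    missing = df[df[targetcolumn].isna()]
    if len(missing) > 0:
        model = RandomForestRegressor()
        model.fit(known[features], known[targetcolumn])
        # predict infill for the missing rows from their own row properties
        df.loc[missing.index, targetcolumn] = model.predict(missing[features])
    return df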

image from the notebooks of Leonardo da Vinci

Side bar: On “lecturing birds on how to fly”

Excerpt from Antifragile by Nassim Taleb


…governments and universities have done very, very little for innovation and discoveries, precisely because, in addition to their blinding rationalism, they look for the complicated, the lurid, the scientistic, and the grandiose, rarely for the wheel on the suitcase. Simplicity, I realized, does not lead to complication. As Fat Tony would say, they “tawk” too much. Less is usually more. — Nassim Taleb

I'll close with a quick tangent. The author Nassim Taleb is known to call out academics and the like for "lecturing birds on how to fly". I consider this type of sentiment unnecessarily dismissive of the value of theory. Peter Drucker, who many consider the father of management theory, never built a company of his own, but through his writings had a tremendous influence on organizations, for-profit and non-profit alike. The tinkering or experimentation necessary to navigate some new terrain is not always possible without non-ergodic risk. The ancient Greeks had a limited vocabulary for colors, lacking a word for the color blue for instance, and as a result some of their literature in a few cases struggled with confusing analogy. Similarly, without the mental models, frameworks, or equations of theory to give us a vocabulary, we may be left with obstacles or opportunities that are difficult to conceptualize. Don't get me wrong, the map is not the territory, and it is certainly possible to introduce new risk due to that fact, so theory should never be taken for granted as physical law; some healthy skepticism is needed. Consider Leonardo da Vinci, who in his notebooks documented his study of birds and flight. Yes, he never built a working flying machine, but you know what, he came pretty damn close.

There is as much pressure exerted by a substance against the air as by the air against a substance. Observe how the beating of the wings against the air suffice to bear up the weight of the eagle in the highly rarefied air which borders on the fiery element! Observe also how the air moving over the sea, beaten back by the bellying sails, causes the heavily laden ship to glide onwards! So that by adducing and expounding the reasons of these things you may be able to realize that man when he has great wings attached to him, by exerting his strength against the resistance of the air and conquering it, is enabled to subdue it and to raise himself upon it. — Leonardo da Vinci

Bob Marley — One Love (live)
Enjoyed Orlando for a quick spell of R&R

Books that were referenced here or otherwise inspired this post:

Innovation and Entrepreneurship — Peter Drucker

Antifragile — Nassim Taleb

Notebooks — Leonardo da Vinci

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I'm a blogger writing for fun. If you enjoyed or got some value from this post, feel free to like, comment, or share. I can also be reached on LinkedIn for professional inquiries or Twitter for personal ones.

Haley Reinhart with Postmodern Jukebox — Seven Nation Army
