Artificial Learning, Intelligent Machines

What we’ve been up to lately at Automunge

Nicholas Teague
Automunge
9 min read · Sep 1, 2019


The Stowaway Peers Out at the Speed of Light —by James Rosenquist, on display at the Orlando Museum of Art

Introduction

It’s been a wild kind of month or two here in the Automunge family. I’d taken a little break from our traditionally somewhat aggressive essay publishing schedule to focus on software refinements, and well, I think the effort has paid off, at least from an elegance-of-codebase standpoint. I’m not sure what a typical software update rollout schedule looks like on a mature project, but rolling out a new patch almost daily, I’m guessing, is probably a touch below the median. I’ll try to use this essay to quickly highlight a few of the updates for anyone following along. Warning: this essay probably won’t be as, how should I say, poetic as some of my others. Really just trying to get caught up with update disclosures.

v2.13

The automunge function accepts string column identifiers to the ‘ID_column’ parameter for any columns which are to be excluded from processing: these are segregated from the data sets, consistently shuffled and/or partitioned, and returned as ‘ID’ sets, which can serve as index identifiers for rows of tabular data that won’t necessarily be fed directly into a predictive algorithm. With v2.13 we started also accepting lists of string column identifiers, allowing multiple columns to be apportioned to these ID sets. We also introduced in 2.13 a new normalization method somewhat comparable to z-score normalization, but with the standard deviation replaced by the mean absolute deviation.
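
As a quick illustration of the idea, a mean-absolute-deviation normalization might be sketched like this (the function name here is hypothetical, not the actual Automunge implementation):

```python
import pandas as pd

def mad_normalize(train_col, test_col):
    # statistics are derived from the training column only,
    # then applied consistently to both sets
    mean = train_col.mean()
    # mean absolute deviation: the average distance from the mean,
    # standing in for the standard deviation of z-score normalization
    mad = (train_col - mean).abs().mean()
    return (train_col - mean) / mad, (test_col - mean) / mad

train = pd.Series([1.0, 2.0, 3.0, 4.0])
test = pd.Series([2.5])
norm_train, norm_test = mad_normalize(train, test)
```

One nice property of the mean absolute deviation is that it’s less sensitive to outliers than the standard deviation.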

v2.14

This was a simple update: if the user elects for outputted Numpy arrays instead of Pandas dataframes, then in cases where the labels set intended for a training operation consists of a single column, that column is flattened using the ravel function, as is standard practice. Just stripping away one more bit of complexity from the training operation.
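
The flattening step amounts to something like this (a sketch of the convention, not Automunge’s exact code):

```python
import pandas as pd

# a single-column labels set comes out of a dataframe as a 2D array
labels_df = pd.DataFrame({'label': [0, 1, 1, 0]})
labels_2d = labels_df.to_numpy()     # shape (4, 1)

# ravel flattens it to the 1D shape most fit() methods expect
labels_1d = labels_2d.ravel()        # shape (4,)
```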

v2.15

Lol, this was just applying the same ravel flattening to the other label sets, those intended for validation. Kind of an oversight in 2.14. Hey, nobody’s perfect!

v2.16

As one of the cool features of automunge is the potential for automated dimensionality reduction of a data set via means such as principal component analysis (PCA), we introduced here a really useful method to exclude from PCA transforms columns with boolean (1/0) content, which might otherwise result in memory issues for the returned set. Other than that, I believe this also included a couple of bug fixes for label set engineering and the oversampling method (for those not familiar, automunge has a method for preparing data sets to oversample labels with lower frequency in training).
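
The boolean exclusion amounts to partitioning the columns before the transform, something like this sketch (function name hypothetical, not the library’s actual code):

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_excluding_booleans(df, n_components):
    # treat columns whose values are only 0/1 as boolean and pass them through
    bool_cols = [c for c in df.columns if set(df[c].unique()) <= {0, 1}]
    other_cols = [c for c in df.columns if c not in bool_cols]
    pca = PCA(n_components=n_components)
    reduced = pd.DataFrame(pca.fit_transform(df[other_cols]), index=df.index)
    # recombine the reduced numerical columns with the untouched boolean columns
    return pd.concat([reduced, df[bool_cols]], axis=1)

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'b': [2.0, 1.0, 4.0, 3.0, 5.0],
                   'flag': [0, 1, 0, 1, 0]})
out = pca_excluding_booleans(df, 1)   # 'flag' survives untouched
```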

v2.17

Realized here that in real-world use cases there could certainly be a desire to apply the automunge function to Pandas dataframes with a non-range-integer index column (or columns), so yeah, this update incorporated methods to allow that kind of data, with any index columns partitioned into the returned ‘ID’ sets we mentioned earlier.

v2.18

A few updates associated with infill functions. Infill to numerical sets with a column’s mean has been part of the default infill methods since almost day one, but here we added the (somewhat redundant) ability to specifically designate that approach, along with a comparable method for infill with a set’s median (in each case derived from the training data). We also made the function call a little more user friendly with respect to the assigninfill parameter, especially considering the increasing range of infill options available: a user no longer has to pass a full list of infill function entries to specify any particular method.
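
In spirit, the designated infill looks something like this (names are hypothetical, a sketch rather than the assigninfill API itself):

```python
import numpy as np
import pandas as pd

def infill_from_train(train_col, test_col, method='mean'):
    # the plug value is always derived from the training data,
    # then applied consistently to both train and test sets
    plug = train_col.mean() if method == 'mean' else train_col.median()
    return train_col.fillna(plug), test_col.fillna(plug)

train = pd.Series([1.0, 2.0, np.nan, 5.0])
test = pd.Series([np.nan, 3.0])
train_filled, test_filled = infill_from_train(train, test, method='median')
```

Deriving the statistic from the training data only is what keeps the test set free of data leakage.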

v2.19

Quick bug fix for functionality introduced in 2.18. Move fast and fix things, that’s our motto.

v2.21

Quick bug fix associated with infill insertion. MFAFT, as is our motto. Oh yeah, hat tip for the movie recommendations screenshot from one of those Kaggle intro to ML courses (sorry, can’t remember which one).

v2.22

So this was a good day. I think a bug fix or two, but also a big philosophical departure: trying to reduce our reliance on methods from the Scikit-Learn library. Here we replaced a few cases where our processing functions had relied on their LabelEncoder / OneHotEncoder methods with some Pandas-based homegrown variants. We also introduced some new processing functions to the library for performing logarithmic transforms on numerical sets and for assembling binned groupings of data based on powers of ten.
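
For flavor, here’s roughly what those additions look like in plain Pandas/Numpy terms (a sketch, not the library’s actual processing functions):

```python
import numpy as np
import pandas as pd

# a Pandas-based stand-in for sklearn's OneHotEncoder
cats = pd.Series(['cat', 'dog', 'cat'])
onehot = pd.get_dummies(cats)          # one 1/0 column per unique value

# logarithmic transform for a positive-valued numerical set
values = pd.Series([3.0, 42.0, 850.0, 7600.0])
log_values = np.log10(values)

# binned groupings based on powers of ten: floor(log10) buckets by magnitude
magnitude_bins = np.floor(np.log10(values)).astype(int)
```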

v2.23

Well, let’s see, a few things here. For one, we introduced a new ordinal processing method for categorical sets (encoding into a single column of integer identifiers as opposed to one-hot encoding, which might be beneficial for categorical sets with a large number of values). Oh yeah, hat tip to the Kaggle “Intermediate Machine Learning” Micro-Course by Alexis Cook for helping me recognize the need for this method. Along with the ordinal processing we also introduced an option to exclude ordinal sets from PCA transforms. We included some fixes here for labels engineering on validation sets and, I think, an improvement to the oversampling preparation operation. But really the most important update here was the introduction of a ‘printstatus’ argument, kind of equivalent to a “verbose” option as you may see in other libraries, allowing a user to track the status of processing through an operation. The specific printouts were further refined in a few more updates to come, but this is where it started.
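
The ordinal idea in Pandas terms, for anyone curious (a sketch, not Automunge’s implementation):

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green', 'red'])

# one-hot encoding would need a column per unique value;
# ordinal encoding keeps a single column of integer identifiers
# (pandas assigns codes by sorted category order: blue=0, green=1, red=2)
ordinal = colors.astype('category').cat.codes
```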

v2.24 & 2.25

I was really happy with my productivity in this rollout (two updates rolled into one), but I don’t want to give the impression that this was all performed in one day; some of the changes here I had been chewing on in the back of my head for a little while before putting down the code. I won’t go through every update because there were a bunch, but some of the most critical included the revision of the family tree primitives establishing methods of applying transformation functions (reduced to a smaller set of primitives with equivalent functionality). I also introduced some methods for statistical evaluations of sets for purposes of determining means of transformations. Oh yeah, also added an intellectual property disclaimer to the code header for clarity.

v2.27

I mean, after those 2.24 & 2.25 updates it’d be hard to do anything quite as impressive; 2.27 was just improvements to the printouts for tracking the status of operations. I think these printouts are a real improvement from a usability standpoint, as they make troubleshooting without diving into the codebase much more feasible.

v2.28

In the announcement I kind of breezed over this update by just calling it a few code cleanups, but in reality the cleanups were not insignificant. Mostly for purposes of cleaning up the order of operations for a clearer presentation of printouts, oh, and I think some further revisions to the printouts were incorporated here as well. There may have been some golf within proximity here :).

v2.29

I found that the processing for binary sets had a few outlier scenarios needing to be addressed (helpfully identified when applying the tool to a few Kaggle-sourced datasets), and one of those was addressed here. Umm, let’s see, a few updates to the logic tests for determining categories of processing, a few updates to the data set properties archived in the returned dictionary, oh, and more improvements to the printouts. Oh, and just a wonderful day to boot.

v2.30

Found another outlier scenario for processing binary sets. Cool. Oh yeah, and also replaced another Scikit-Learn method with a homegrown Pandas-based variant, in this case the label binarizer.
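
A Pandas stand-in for a label binarizer on a two-value set might look like this (a sketch of the idea, not the actual replacement code):

```python
import pandas as pd

labels = pd.Series(['spam', 'ham', 'ham', 'spam'])

# map the two unique values onto 1/0, with the sort order
# making the encoding deterministic across train and test sets
positive = sorted(labels.unique())[-1]        # 'spam'
binarized = (labels == positive).astype(int)
```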

v2.31

Another bug fix for binary sets (the gift that keeps on giving). Oh, and this update I think is pretty useful: added a new parameter allowing the user to turn on or off the collective assembly of sets identifying the rows of each column that were subject to infill. Sometimes you feel like a nut, sometimes you don’t.

v2.32

Yet another (geez) outlier scenario fix for binary sets. Who knew it could be this hard. I think our binary set processing was in pretty good shape by this point.

v2.33

Some updates to the defined tree of transformations associated with default Box-Cox transformations. I’m not sure, but I think this may have had something to do with the update to the family tree primitives introduced in 2.25. Proceed.

v2.35

Let’s see: it looks like we introduced an option for an infill method with a plug value of 1, updated a few of the logic tests for statistical analysis, added some new printouts for PCA transformations, oh, and for feature importance evaluations incorporated methods to take account of the user-passed parameters for the infill activation column sets I mentioned in 2.31.

v2.37

Here we started incorporating a few methods from the automunge function, available for training set transformations, into the postmunge function for subsequently available data sets, a design philosophy we continued in a few subsequent updates. In this case we introduced the label frequency levelizer options from automunge to the postmunge function. Also an update to the nbr3 definition of family tree entries, oh yeah, and this could be useful: added the ability to assign the ‘eval’ statistical inference processing method to specific columns instead of blanket application. Trying to make the tool more user friendly, we also incorporated a few validation functions for user-passed parameters.
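
The automunge/postmunge split follows the familiar fit-then-apply pattern: derive parameters from the training data once, then reuse them on later data. A rough sketch of that design philosophy (these helper names are hypothetical, not the actual API):

```python
import pandas as pd

def fit_normalization(train_df):
    # analogous to the automunge(.) call: derive and store
    # parameters from the training data
    return {c: (train_df[c].mean(), train_df[c].std()) for c in train_df.columns}

def apply_normalization(params, new_df):
    # analogous to the postmunge(.) call: reuse the stored
    # parameters on subsequently available data
    out = new_df.copy()
    for c, (mean, std) in params.items():
        out[c] = (out[c] - mean) / std
    return out

train = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
params = fit_normalization(train)
later = apply_normalization(params, pd.DataFrame({'x': [2.0]}))
```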

v2.38

So this first update is kind of an esoteric detail, don’t worry if you don’t follow along, but we added a method to ensure user-passed transform_dict entries that are missing a replacement primitive in the first layer of transformations incorporate a passthrough transformation, such as to maintain the original column unaltered. I think there were also a few updates to the logic test steps in evaluation of column properties, oh, and this next one was kind of a material decision. Previously we had performed a few validation steps for each column to ensure properties of the train set match properties of the associated columns in the test set. This turned out to be leading to a whole bunch of edge cases that were getting hard to keep track of, so for now we’re just making an assumption, without verification, that column properties are equivalent (we may reintroduce this step in a later update, tbd).

v2.39

So this was kind of a momentary slowdown in preparation for the grand finale. First a few updates to printouts for feature importance evaluation. Then a few tweaks to the derivation of binned sets for powers of ten. Oh and a few more entries for information purposes to the returned postprocess_dict.

v2.40

Automunge now supports processing of data sets passed in the form of Numpy arrays in addition to Pandas dataframes. (Yes that’s a big deal.)

v2.41

Automunge now supports performance of automated feature importance evaluation of additional data sets passed to the postmunge function. (Yes that’s a big deal.)

Conclusion

The whole philosophy of the Automunge tool is grounded in the reality that, prior to the realization of artificial general intelligence, our predictive algorithms may sometimes need the help of pre-processing methods to make the data for training and inference more digestible, what we call Artificial Learning. The hypothesis is that by consistently normalizing our data, we are taming the problem of model hyperparameter selection. Yes, the tool is useful for the numerical encoding of data that modern machine learning libraries require, but that is only part of the value here. Automunge automates the application of column-specific feature engineering methods. Automunge automates the preparation of tabular data for machine learning, with minimal prerequisites of tidy data (single column per feature and single row per observation): literally a single function call for the initial data and, using the returned database, a single function call for subsequently available data. Automunge includes options for automated feature importance evaluations with the shuffle permutation method, automated dimensionality reductions, label engineering, partitioned processing of validation sets to avoid data leakage, oh, and I forgot to mention, an automated method for machine-learning-derived infill to missing or improperly formatted data in a set. In short, we make machine learning easy. More to come.

Time to Run — Lord Huron

Books that were referenced here or otherwise inspired this post:

From the Diaries of John Henry — Nicholas Teague

For further reading please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge:
