A Tool for Data

Developers, developers, developers, developers

Nicholas Teague
Automunge
9 min read · Oct 13, 2018


Jimmy Buffett — One Particular Harbor (full album)

For those who haven’t been following along, I’ve been using this forum in recent weeks to document the development of a tool for the automation of data wrangling (aka “munging”) of structured data sets, to prepare them for the direct application of machine learning algorithms in the framework of your choice (we keep getting more to choose from, after all). In the current iteration the tool is not yet a replacement for feature engineering, however it is suitable as a replacement for the final steps of data processing prior to the application of machine learning. Numerical data is treated with a normalization to a mean of 0 and standard deviation of 1, as well as a separate application of a power law transform to address sets that may be subject to fatter-tailed distributions. Categorical data is encoded into multiple columns via the one-hot encoding method, and time series data is segregated by time scale and normalized. What’s more, the tool addresses missing data points in the sets by deriving predicted infill from machine learning models trained on the rest of the data, in an automated fashion. The two primary functions are “automunge” for the simultaneous address of the initially available train and test sets, and a function we call “postmunge” for the subsequent processing of test data that wasn’t available for the initial address.

As we continue the development journey, the intent is to keep building out the range of feature transformations so as to facilitate a full automation of the feature engineering workflow. Initially the tool will be intended for application on a user’s local hardware, and then as we reach features requiring external computing resources we will layer on a user account with a pay-per-use pricing model. (No underwear gnomes here — phase 2 will not be easy but at least it is a known.) I’ll use this write-up as a summary of those updates that have been incorporated since last week, probably with a tangent or two along the way.
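For anyone new here, a minimal sketch of the kinds of transformations just described, in plain pandas and scipy rather than the tool’s actual internals (the column names are hypothetical):

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'nmbr': [1.0, 2.0, 3.0, 100.0],
                   'text': ['a', 'b', 'a', 'c']})

# numerical: normalize to mean 0 and standard deviation 1
df['nmbr_norm'] = (df['nmbr'] - df['nmbr'].mean()) / df['nmbr'].std()

# numerical: separate power law (Box-Cox) transform for fat-tailed sets
# (Box-Cox requires strictly positive values)
df['nmbr_bxcx'], _ = stats.boxcox(df['nmbr'])

# categorical: one-hot encode into multiple columns
df = pd.concat([df, pd.get_dummies(df['text'], prefix='text')], axis=1)
```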

The Last Lecture — Randy Pausch

One of my big takeaways from this week is a simple statement: memory management in Python (or Pandas for that matter) is hard. I spent a whole day trying to clean up assignment operations to use inplace operations such as:
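(fillna(.) here is just a representative stand-in for the kind of call involved, not the actual automunge code:)

```python
import pandas as pd

df = pd.DataFrame({'column1': [1.0, None, 3.0]})

# inplace variant (what I was migrating toward)
df.fillna(0, inplace=True)

# assignment variant (what the code originally used, and what I reverted to)
df = df.fillna(0)
```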

The expectation was that this approach would be more memory efficient, but a fascinating result was that when we incorporated inplace operations throughout there was a material slowdown, on the order of a factor of 3. This was unexpected. I speculate that it may be due to increased RAM utilization, but that’s just a guess. I had a little trouble finding reference material on this subject — I hope there are some resources out there for garbage collection, memory, and speed performance tradeoffs in Pandas that I just haven’t found yet. I would not expect that this performance issue was specific to Colaboratory. The decision was to revert to the original assignment calls in lieu of inplace, as the speed tradeoff was too costly. If anyone wants to offer navigation advice here I’m all ears.

Another edit for performance considerations was to reformat the 1/0 values from our encoded categorical columns, replacing the 64-bit integers (which would be wasted on binary values) with a more memory-efficient 8-bit (1 byte) datatype. This was a step recommended by the Pandas Cookbook text. Note that when I was doing a little digging here I found that Pandas has an approach that could potentially offer even more memory savings, via “sparse data structures”, which would replace all of the zeros with “blank” representations and an associated fill value stored only once. However, my thinking here is that the most critical bottleneck for memory usage will be during the training operation, whether for our ML infill methods or for the training performed after completion of the automunge application. In both cases we will have converted to Numpy arrays in the process, so the sparse representation won’t provide any more horsepower there.
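A quick sketch of the downcast on a hypothetical column, along with what the sparse alternative would look like (the sparse dtype API shown here assumes a fairly recent pandas version):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# one-hot encode, then ensure the 1/0 columns are stored as 8-bit integers
# (1 byte per value) rather than 64-bit integers (8 bytes per value)
dummies = pd.get_dummies(df['cat'], prefix='cat').astype('int8')

# the sparse alternative mentioned above: only non-zero entries are stored,
# along with a single fill value standing in for the zeros
sparse = dummies.astype(pd.SparseDtype('int8', fill_value=0))
```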

The next update was with respect to the ML infill methods, and although it sounds like an extreme step it really only took a few minor edits to incorporate: we replaced the machine learning architecture used for the ML infill methods, from support vector machines to random forests (using both scikit-learn’s RandomForestRegressor and RandomForestClassifier). This was partly inspired by some discussions from the new fast.ai machine learning course. In the current iteration I’m using all default settings for the models; the intent is that a future iteration will tailor the function calls to the characteristics of the data set. For example, if the data exceeds a certain number of rows we might limit the training operation to a subsample of the set, or alternatively if our number of features exceeds some threshold we may adjust the max_depth parameter to ensure the training doesn’t get out of hand. In the process of this update to the predictinfill(.) function I also corrected an annoying output that was showing up with each application of automunge(.): a warning message indicating that the shape of the y labels needed to be corrected with ravel(.), which acts to flatten an array. Although I am a fan of the composer Ravel I didn’t quite follow why this was necessary, and despite the message the ML infill methods were still working. I finally gave in, and it turns out it really was just as simple as an application of the ravel(.) method to the y labels, so it looks like we’re clear now of those annoying messages, albeit the output arrays were fine either way.
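For context, here’s a minimal sketch of the ML infill idea with a random forest and default settings. This is not the actual predictinfill(.) implementation, just the general shape of it, including the ravel(.) call on the labels:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0],
                   'target':  [2.0, None, 6.0, 8.0]})

# split rows by whether the target column has a value
known = df[df['target'].notna()]
missing = df[df['target'].isna()]

model = RandomForestRegressor()  # default settings for now

# .ravel() flattens the (n, 1) label array to shape (n,), which is what
# silences the warning message mentioned above
model.fit(known[['feature']], known[['target']].values.ravel())

# predicted infill for the rows that were missing the target value
df.loc[df['target'].isna(), 'target'] = model.predict(missing[['feature']])
```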

The next address of the week, and one probably of more import and more fundamental to the problem here, involved a rework of the core architecture associated with column transformations. Previously we had different approaches to column transformations for each category of ‘bnry’, ‘nmbr’, ‘text’, ‘date’, and the new ‘bxcx’. The simplest were the nmbr and bnry categories, which were basically transformed in place with consistent column naming conventions, while the ‘text’, ‘date’, and ‘bxcx’ categories were each more complex, as their original single column was transformed in the processing into a set of columns: for the categorical text category a one-hot encoded set, and for the bxcx or date categories a series of numerically transformed columns derived from the original. Those categories transformed into multiple columns resulted in a fair bit more complexity in the ML infill methods, for instance, as well as in the postmunge function, as we had to deal with inconsistent column naming between the original set of column names and the processed set of column names, each of which were used as keys for different python dictionaries in use.

The key decision here was to implement a column-specific dictionary for those categories, ‘nmbr’ and ‘bnry’, that were originally in-place transformations. The origin of the decision came from the desire to implement an added column transformation for a boolean identification of those rows containing missing values which were subject to infill, but an expected benefit of the revision is that going forward, as we identify additional potential transformations for these sets, we will already have the architecture in place to facilitate them. Thus we are kind of making an investment in added code complexity for the purpose of laying the groundwork for future buildout. But in the meantime, this will enable us to incorporate added columns for each category to designate rows which received infill for that portion of the data.

Quick tangent: I got into a ridiculous troubleshooting hangup from attempting to rename a column in the dataframe in the processing functions. Apparently the pandas .rename() function is more complicated than it looks (or alternatively I might just be a moron), so I took the easy way out and just copied the column to a new name and deleted the old one. It would not surprise me if this approach is not optimized for performance :).
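For reference, the workaround from that tangent looks roughly like this (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'old_column': [1, 2, 3]})

# what I was attempting, which kept tripping me up:
# df = df.rename(columns={'old_column': 'new_column'})

# the easy way out: copy the column to the new name and delete the old one
df['new_column'] = df['old_column']
del df['old_column']
```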

An interesting side effect of this update to the processing functions is that, because we were re-using the processing functions from our training set columns to process the labels column, we were ending up generating a series of added column transformations to our labels as well. It hadn’t yet occurred to me whether there might be comparable benefit to applying multiple transformations to a labels set and training a subsequent model to predict each of them simultaneously, versus our prior approach of just outputting a single column of labels for all but the categorical values, which were output as one-hot encodings. I think this is a question that deserves some more thought. I recall a TWiML podcast episode where the benefit of adding additional label features for parallel training was discussed, even if we don’t necessarily need those features for our application, but at the time I had interpreted those added features not as derived from the original labels but as some adjacent data points that we suspected might be available for training runs but not for predictions. I think this deserves address as a hypothesis which we can go back and run experiments on when we find the time.

Hypothesis: Just as multiple transformations of training data can help the learning process, multiple transformations of labels which our model can predict simultaneously will benefit training as well.

The biggest architectural update of the week turned out to be a revamping of the support dictionaries which are used throughout. After an audit of the various supporting variable stores, it turned out that we had a fair bit of redundancy in the code, most strikingly separate column data dictionaries for each category of data, which collectively were redundant with the same data stored elsewhere. The solution was to consolidate, consolidate, consolidate, consolidate, with a goal of generalization and consistency of the column-specific data stores between the different categories of data. This generalization of variables I expect will greatly help with future buildout.

demonstration of the types of generalization and consolidation that took place between the various stores of data
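As a rough illustration (the key names here are hypothetical, not the tool’s actual internals), the consolidated approach amounts to one consistent per-column dictionary regardless of category, rather than a separate store for each category:

```python
# Hypothetical example of a consolidated, column-specific data store with a
# consistent set of keys across the 'nmbr', 'bnry', 'text', 'date', and 'bxcx'
# categories (the actual key names in the tool may differ).
column_dict = {
    'column1': {
        'category': 'nmbr',
        'columnslist': ['column1_nmbr', 'column1_infill'],  # derived columns, incl. the infill flag
        'normalization_dict': {'mean': 0.0, 'std': 1.0},
        'infillmodel': None,  # populated when ML infill is applied to this column
    },
}
```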

I’m highlighting Chollet’s tweet here because I see echoes of this issue in how I spent my time this week. The automunge tool is a prime example of software that started organically and evolved — hell, you only need to look at this collection of blog posts to see what that has entailed. I think the only thing that made the progress we’ve made so far possible is that the design was developed primarily to mimic a specific workflow derived from my (admittedly lean) experience exploring Kaggle competitions. When I took a step back this week to audit the clarity of the code, I found some significant obstacles to consistency of address between different classes of data, with corrections made in the hope of enabling this tool to be further built out in a modular fashion for generalized feature engineering transformations. The updates of this week I think have helped to bridge that gap, but I don’t consider the matter settled. Not to worry, I’ve got a few things in mind for next week that I think will help. Oh yeah, before signing off, there’s a companion Colaboratory notebook available here. Until next time.

Rebirth Brass Band — It’s All Over Now
Rebirth Brass Band — Let’s Do It Again

Books that were referenced here or otherwise inspired this post:

The Last Lecture — Randy Pausch

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on LinkedIn for professional inquiries or Twitter for personal.
