Munging in the Real World

In which we graduate from the Kaggle workflow

The Dixie Cups — Iko Iko

For those who haven’t been following along, I’ve been using this forum over the last few weeks to develop some algorithms for automating the data wrangling process for structured data sets, a step that generally serves as a precursor to the application of machine learning algorithms:

I’ve used some of the beginner problems offered on the Kaggle platform for data science competitions as inspiration for the workflow. Traditionally these competitions offer two sets of data, a ‘train’ and a ‘test’ set. The train set includes all of the training data for the application of machine learning, along with a column which serves as the labels for our post-processing application of machine learning, as well as a column of row identifiers. Once the train set is processed, a model can be trained via machine learning to facilitate predictions of labels for the test set, which are then submitted to the competition for evaluation of prediction accuracy.

Our automunge function is intended for the first step of this problem, the processing of our training data in preparation for the application of machine learning algorithms, and assumes that the data is provided as pandas dataframes with comparable naming and ordering of columns between the train and test sets. Because the train and test sets need comparable processing, such as normalization and a common structure of columns, automunge was designed for simultaneous processing of both sets. There’s certainly a foreseeable scenario where the “test data” (aka the data fed into a machine learning algorithm to generate predictions) will not be available at the time of processing our training data. The design intent for this scenario was to still use automunge for processing by initially entering a dummy test dataframe, with structure and column naming conventions comparable to our train data, to serve as the test set, and then once the test data becomes available simply re-applying automunge with this test set in conjunction with our original (pre-processed) train data.
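As a minimal sketch of that dummy-test-set workaround (with hypothetical column names, not the actual Kaggle sets), we can derive a placeholder test frame from the train set’s own feature columns:

```python
import pandas as pd

# Hypothetical train set with the structure automunge expects:
# a row-identifier column, feature columns, and a labels column.
train_df = pd.DataFrame({
    'id': [1, 2, 3],
    'feature_a': [0.5, 1.2, 3.3],
    'feature_b': ['x', 'y', 'x'],
    'label': [0, 1, 0],
})

# Dummy test frame: same feature columns (no labels), seeded with a
# single placeholder row, to stand in until the real test data arrives.
feature_columns = [c for c in train_df.columns if c != 'label']
dummy_test_df = train_df[feature_columns].head(1).copy()
```

Since the dummy frame shares column names, order, and dtypes with the train features, the simultaneous processing runs exactly as if real test data were present.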

In my last post I noted that a reasonable future extension would be to store the parameters from processing our train set in a returned dictionary, so that a subsequent test set may be processed in a more efficient fashion (without parallel processing of the original train data). The purpose of this post is to facilitate this more efficient approach for processing test data that is not available at the time of our initial application of automunge. To do so we’ll update our automunge function to generate a dictionary capturing processing parameters, which can be saved outside of the notebook for future application. We’ll also roll out some new functions to support the standalone processing of subsequent test data, the analogue to automunge being a function we’ll call postmunge. There is a companion Colaboratory notebook available here.
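Since the returned dictionary needs to survive outside the notebook session, one straightforward option is to serialize it with Python’s pickle module (a sketch with hypothetical dictionary contents; pickle can also handle trained scikit-learn models stored inside the dictionary):

```python
import pickle

# Hypothetical contents; in practice this is the dictionary
# returned by the automunge call.
postprocess_dict = {'labels_column': 'label', 'column_dict': {}}

# save at the end of the automunge session
with open('postprocess_dict.pkl', 'wb') as f:
    pickle.dump(postprocess_dict, f)

# reload in a later session, before calling postmunge
with open('postprocess_dict.pkl', 'rb') as f:
    restored = pickle.load(f)
```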

The Grateful Dead — Iko Iko (4–11–78)

Post processing dictionaries

The automunge function is first revised to return a dictionary we’ll call the postprocess_dict. This python dictionary is intended to capture all of the relevant parameters necessary for subsequent processing of test data, to ensure normalization and formatting consistent with the original train data. The postprocess_dict includes a set of global parameters as well as a column-specific dictionary column_dict for each column in the train data:
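To illustrate the idea, the layout might look something like the following (these keys are hypothetical placeholders for illustration, not the actual automunge internals):

```python
# Hypothetical sketch of the postprocess_dict layout.
postprocess_dict = {
    # global parameters
    'labels_column': 'label',
    'trainID_column': 'id',
    # one column_dict entry per processed column
    'column_dict': {
        'feature_a_nmbr': {
            'category': 'nmbr',   # category assigned by evalcategory(.)
            'mean': 1.67,         # normalization parameters from the train set
            'std': 1.21,
            'infillmodel': None,  # trained ML infill model, if applicable
        },
    },
}
```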

Define new processing functions

  • postprocess_numerical_class(.) — function for processing of test data evaluated as category ‘nmbr’ by evalcategory(.) function using parameters from postprocess_dict in place of train data
  • postprocess_text_class(.) — function for processing of test data evaluated as category ‘text’ by evalcategory(.) function using parameters from postprocess_dict in place of train data
  • postprocess_time_class(.) — function for processing of test data evaluated as category ‘date’ by evalcategory(.) function using parameters from postprocess_dict in place of train data
  • (no new function required to process ‘bnry’ category data, reused version from automunge)
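To illustrate the idea behind these postprocess functions, here is a minimal sketch of numerical (‘nmbr’) test-set processing that reuses a saved train-set mean and standard deviation rather than recomputing them (the function name matches the list above, but the body and parameters are simplified stand-ins for the real implementation):

```python
import pandas as pd

def postprocess_numerical_class(test_df, column, column_dict):
    """Simplified stand-in: z-score normalize a test column using the
    mean and std saved from the train set, instead of recomputing them."""
    mean, std = column_dict['mean'], column_dict['std']
    # plug missing cells with the train-set mean before normalizing
    test_df[column] = test_df[column].fillna(mean)
    test_df[column] = (test_df[column] - mean) / std
    return test_df

test_df = pd.DataFrame({'feature_a': [1.0, None, 3.0]})
test_df = postprocess_numerical_class(
    test_df, 'feature_a', {'mean': 2.0, 'std': 1.0})
```

Because the parameters come from the stored dictionary, the test set lands on the same scale as the train set without touching the train data again.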

Define new ML Infill functions

*(A general note is that for the following functions I mostly just edited the original automunge versions by ‘commenting out’ via ‘ctrl-/’ those train-set-specific points that were no longer applicable. This results in slightly messier code, but is intended to facilitate easier updates to these functions should we decide to update the automunge versions in the future, or vice versa, by retaining a comparable code layout between the two.)

  • createpostMLinfillsets(.) — function to generate the data sets that will be used to infill points of missing data in the test set using machine learning
  • predictpostinfill(.) — function to generate infill data derived from the infill sets of the preceding function and a machine learning model trained on the train set in the original application of automunge and stored in the postprocess_dict
  • (no new function required to infill the data, reused version from automunge)
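The core idea of predictpostinfill(.) can be sketched as follows: a scikit-learn model fit on the train set at automunge time is reused at postmunge time to predict infill for the test set’s missing cells (a toy illustration with synthetic data, not the actual automunge code):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# automunge time: fit the infill model on train rows without missing
# values; this fitted model is what gets stored in the postprocess_dict
X_train = rng.normal(size=(20, 2))
y_train = X_train[:, 0] + X_train[:, 1]
infillmodel = SVR().fit(X_train, y_train)

# postmunge time: reuse the stored model to predict infill values for
# the test rows flagged as containing missing data
X_test_missing = np.array([[0.5, 0.5], [1.0, -1.0]])
infill = infillmodel.predict(X_test_missing)
```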

Define new postmunge function

The postmunge function is comparable in many ways to our original automunge function, but for the replacement of train set processing code with parameters pulled from the postprocess_dict to facilitate consistent normalization and formatting between train and test sets. Again the goal here was to enable efficient processing of test data that isn’t available during the initial training phase of our machine learning algorithms. I’ll present the code here because I think it makes me look smart to use code in my essays even though I don’t think many people will read it. Too many toothpicks to count.

The Belle Stars — Iko Iko


We definitely should attempt a validation metric on this new postmunge function. Definitely. Since the whole purpose of the update was to facilitate added efficiency for subsequent processing of test data not available at the time of initial training of our machine learning model, a reasonable metric is the processing time delta between automunge and postmunge. Note that this processing time will vary based on your processor; those shown are based on running the functions through a Colaboratory session. Although the session included GPU acceleration, I don’t expect that would impact this metric since no part of our function made use of Nvidia’s CUDA language for parallelization. To be honest I’m not sure if there is some potential to incorporate it in the current iteration (all of the machine learning algorithms for the ML infill technique are currently run through scikit-learn, which to my understanding does not currently include GPU acceleration). The following metrics compare automunge’s simultaneous processing of a train and test set vs postmunge’s processing of just a test set using the postprocess_dict output from the original application of automunge, shown with and without the ML infill technique. Three different data sets are applied: a bare-bones demonstration set that I defined in the notebook, the Titanic set from Kaggle, and the House Prices set from Kaggle. The validation metrics demonstrate a speedup on the order of 50–70% for the postmunge technique.
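For reference, this kind of timing comparison can be reproduced with a simple wall-clock harness (a sketch; the commented usage assumes automunge/postmunge are in scope, and the actual notebook measurements may have been taken differently):

```python
import time

def timed(fn, *args):
    """Return a function's result alongside its wall-clock runtime."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# hypothetical usage:
#   _, t_auto = timed(automunge, train_df, test_df)
#   _, t_post = timed(postmunge, postprocess_dict, test_df)

def speedup(t_auto, t_post):
    """Fractional time saved by postmunge relative to automunge."""
    return 1.0 - t_post / t_auto

_, elapsed = timed(sum, [1, 2, 3])
```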

postmunge validation table

So was this a successful exercise? Is postmunge a worthwhile function to incorporate in our workflow? Is the added complexity of code worth the benefits to efficiency? Note that this post-processing of test set functionality was previously already available to us by simply reapplying automunge to new test data along with the original pre-processed train set, so the only improvement we’re getting here is computation time. The cost is that we’re roughly doubling the number of functions and the length of the code base. My take is that although the savings for the size of data sets used in validation are not tremendous, if this tool is to be applied in the real world there will certainly be cases of data sets of scale well beyond these from beginner Kaggle competitions. After all, when computer scientists study computational complexity they’re not simply comparing the runtime of a fixed-size problem; what matters is the scaling of resources required with increased problem size. The postmunge function represents our first material deviation from the workflow associated with Kaggle competitions to address challenges from real world applications. In these cases a halving of the processing time for subsequent test sets certainly seems like a worthwhile return on the investment in code complexity.

Future improvements

Given the determination that efficiency of application is a worthwhile investment of development time and code complexity, where next could we look to further improve this tool? The current processing times demonstrated by this validation attempt certainly aren’t very impressive, and if this tool is to be applied to large scale data sets I think material speed ups are necessary. Based on my understanding of the current algorithms, I expect that there are two primary sources of inefficiency that are costing us processing time.

  1. The first source of inefficiency is found in the application of the ML infill technique, where we train a separate model for each column to derive infill for that column. As the number of columns goes up we are not only adding time via the number of models trained, but the time to train each column’s model also goes up. I’m just guessing, but I’m thinking that implies a computational complexity scaling of this feature on the order of O(n²), where n is the number of columns (again, that’s just a guess; if there are any computer scientists out there who want to offer feedback, I’m all ears). This inefficiency of application is actually one with a solution. If we can trade our use of scikit-learn algorithms for a CUDA GPU-accelerated framework like TensorFlow or PyTorch, then we will be able to accelerate this piece simply by adding GPU hardware. Part of the basis for the current ML infill approach was simplicity of application: support vector machines do not require a great deal of hyperparameter tuning. TensorFlow does have a comparable built-in model, one which I assume allows for GPU acceleration, so investigation of that approach is certainly a reasonable future extension of this technique.
  2. The second source of inefficiency I expect is associated with a specific technique for iterating through the rows of the dataframe when addressing missing values, a practice we used both in the evalcategory(.) function shared by automunge and postmunge and in the insertinfill(.) function used to apply ML infill. That specific technique is the pandas iterrows() call. From what I’ve gathered, iterating through a pandas dataframe row by row is a common rookie mistake: you lose a lot of the memory efficiency of the class, and you are no longer parallelizing the derivation. That being said, there are certainly some applications where iteration is required. Further investigation into this issue, including possible alternate techniques, will be a next step of development.
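As an example of the kind of vectorization that could replace the iterrows() pattern in the infill functions, compare a row-by-row infill loop with its boolean-mask equivalent (a toy column and infill value, just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan]})

# row-by-row infill via iterrows(), the slow pattern described above
slow = df.copy()
for i, row in slow.iterrows():
    if pd.isna(row['a']):
        slow.at[i, 'a'] = 0.0

# vectorized equivalent: a single boolean mask, no Python-level loop
fast = df.copy()
fast.loc[fast['a'].isna(), 'a'] = 0.0
```

Both produce the same result, but the mask version stays inside pandas’ optimized internals rather than looping in Python.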

I agree vectorization is the right solution where possible — sometimes an iterative algorithm is the only way though. — Wes McKinney via Stack Overflow

Well, I think I’ll go ahead and chalk this one up as a success. Feedback is always welcome.

Skyline of Savannah, GA
Elvis — Suspicious Minds (Live in Honolulu, 1973)
John Lennon — #9 Dream
Tedeshi Trucks Band with Taj Mahal and Jerry Douglas — Leavin’ Trunk

Books that were referenced here or otherwise inspired this post:

Emerson: Essays and Lectures — Ralph Waldo Emerson


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on linkedin for professional inquiries or twitter for personal.




Nicholas Teague