ML Infill for Data Wrangling

Attempting validation for an automation innovation

Nicholas Teague
Automunge
8 min read · Sep 19, 2018

--

The Sugarhill Gang — Rapper’s Delight

For those that haven’t been following along, I’ve been using this forum over the last few weeks to develop some algorithms for automating the data wrangling that generally serves as a precursor to applying machine learning to structured data sets. Along the way we’ve (as the Dude would say, “the royal we”, you know, the editorial) rolled out functions for processing different categories of data set columns such as numerical, binary, or categorical data. We’ve wrapped those functions into what we’ve called our automunge function, which evaluates each column in our train and test sets and selectively applies one of these processing functions. We’ve also rolled out a processing function specific to dates and time series data (albeit, to be clear, not one intended for application of recurrent neural networks). We briefly offered some detail on the post-prediction transformations that might be desired for some potential applications. A prior post that I’m somewhat proud of introduced what I believe is a novel innovation: the automated application of machine learning algorithms to infill missing data points in our data sets. This post will now serve to validate that approach, which I’ve dubbed ML infill. There is a companion Colaboratory notebook available here.

Michael Jackson — Don’t Stop ‘Til You Get Enough

Before doing so I’ll offer a few elaborations on the application of these algorithms. I’ve used as inspiration for the workflow some of the beginner problems offered on the Kaggle platform for data science competitions. Traditionally these competitions offer two sets of data: a “train” set, which includes all of the training data for application of machine learning along with a column serving as the labels and a column of row identifiers, and a “test” set without labels. Once the train set is processed, a model can be trained via machine learning to facilitate predictions of labels for the test set, which are then submitted to the competition for evaluation of prediction accuracy. Our automunge function is intended for the first step of this problem, the processing of our training data in preparation for the application of machine learning algorithms, and assumes that the data is provided as pandas dataframes with comparable naming and ordering of columns between the train and test sets. Because the train and test sets need comparable processing, such as consistent normalization and a common structure of columns, automunge is designed for simultaneous processing of both the train and test sets. There’s certainly a foreseeable scenario where the “test data” (aka the data fed into a machine learning algorithm to generate predictions) will not be available at the time of processing our training data; after all, a wiser fellow than myself once said “sometimes you eat the bar, and sometimes, well, it eats you.” In this scenario we can still use automunge for processing by initially entering a dummy test dataframe with comparable structure and column naming conventions as our train data, and then once the test data becomes available simply re-applying automunge to that test set in conjunction with our original (pre-processed) train data. A reasonable future extension would be to store the parameters from processing our train set in a returned dictionary so that a subsequent test set could be processed in a more efficient fashion.
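
To make the intended workflow a little more concrete, here’s a minimal sketch of how such a call might look. To be clear, the signature, argument names, and return values shown here are illustrative placeholders rather than the function’s actual interface (the Titanic column names are used just for the sake of example):

import pandas as pd

# load the Kaggle-style train and test sets as pandas dataframes
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# if the test data isn't available yet, a dummy test dataframe with the
# same column structure (minus the labels) can stand in until it arrives
# test_df = train_df.drop(columns=['Survived']).head(5)

# hypothetical call: both sets are processed together so that
# normalization parameters and column structure stay consistent
train_arrays, train_labels, test_arrays = automunge(
    train_df, test_df, labelcolumn='Survived', idcolumn='PassengerId')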

L.A. Dream Team — Rockberry Jam

The automunge function will output a set of self-explanatory numpy arrays in numerical format suitable for direct application of machine learning algorithms in the framework of your choice. Numerical data (category ‘nmbr’) is normalized to a mean of zero and a standard deviation of 1 (suitable for thin-tailed distributions). Binary categorical data (category ‘bnry’) is encoded to 0/1. String categories of >2 types (category ‘text’) are one-hot encoded such that each category receives its own column, with the naming convention ‘column_category’. An additional column is created for these ‘text’ categories to designate any rows with missing data, with the naming convention ‘column__missing’. Datetime formatted data (category ‘date’) is broken into constituent features for year/month/day/hour/minute/second, with each column normalized to 0 mean and 1 std (to reiterate, this approach is not intended for application of recurrent neural networks). This segregation of time data into different time scales is meant to ease the extraction of patterns that might be more prevalent at different time scales — after all I wouldn’t go out looking for a job dressed like this on a weekday. Wait, what day is this? But I digress — cells with missing data are addressed by initially infilling with the mean value for numerical or datetime columns, the most common value for binary columns, and the creation of a ‘__missing’ column for text categories. If elected, the ML infill technique will then derive infill for each of these cells in both the train and test sets by applying a machine learning model trained on the ‘train’ set, using the other cells in the applicable row as features to predict the missing value.
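
As a rough illustration of the transforms described above (a simplified sketch in pandas, not the automunge implementation itself), the per-category treatments might look something like this:

import pandas as pd

def process_numerical(df, column):
    # infill missing cells with the column mean, then normalize to
    # zero mean and unit standard deviation
    df[column] = df[column].fillna(df[column].mean())
    df[column] = (df[column] - df[column].mean()) / df[column].std()
    return df

def process_binary(df, column):
    # infill missing cells with the most common value, then encode to 0/1
    df[column] = df[column].fillna(df[column].mode()[0])
    df[column] = (df[column] == df[column].max()).astype(int)
    return df

def process_text(df, column):
    # flag rows with missing data in a 'column__missing' indicator column,
    # then one-hot encode into 'column_category' columns
    df[column + '__missing'] = df[column].isna().astype(int)
    dummies = pd.get_dummies(df[column], prefix=column)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

def process_date(df, column):
    # split datetimes into year/month/day/hour/minute/second features,
    # each infilled with its mean and normalized like the numerical case
    dates = pd.to_datetime(df[column])
    for part in ('year', 'month', 'day', 'hour', 'minute', 'second'):
        values = getattr(dates.dt, part).astype(float)
        values = values.fillna(values.mean())
        df[column + '_' + part] = (values - values.mean()) / values.std()
    return df.drop(columns=[column])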

Grandmaster Flash — Wildstyle

Before proceeding to the validation, a few notes about some updates to the functions from prior posts. Just trying to fix the cable.

  1. Note the process_text_class(.) function has been updated to address the scenario where data is missing in our test set but not in the train set.
  2. Also made an update to the process_text_class(.) function to force a ‘__missing’ column even when there are no missing values in the train set, just in case there are some in the test set, as well as to fix a bug with inconsistency in the naming convention for this column.
  3. The evalcategory(.) function has been updated with respect to treatment of majority-NaN columns. We also have a new column category of ‘null’ for columns with >80% NaN values (an arbitrary figure), which are dropped from our datasets — this was partly a troubleshooting hack, and a future extension will be to facilitate processing of columns with majority NaN in a different fashion (likely by inserting a 0 in place of NaN).
  4. The predictinfill(.) function (used in the ML infill process) has been updated extensively via a revision of the class of models applied to each category. It turned out the basic linear regression model was insufficient for numerical data. After some experimentation with various scikit-learn models I settled on support vector regression (SVR) in place of linear regression for numerical data and support vector classification (SVC) for the other categories (a rough sketch of this model selection is shown just after this list). My expectation is that this approach is more generalizable without hyperparameter optimization versus other methods such as stochastic gradient descent (SGD) based models, for instance. I expect there may be scenarios where other models may be more computationally efficient, for instance depending on the relation between the quantity of samples and the quantity of features; a future extension will be to incorporate an evaluation of properties of the train set for potential selection of an alternate model architecture in the predictinfill(.) function, or to otherwise adjust parameters of the support vector models. Currently we are sticking to models available through scikit-learn; other approaches such as gradient boosting or neural networks via the TensorFlow or PyTorch frameworks may also be considered as a future extension.
  5. The biggest automunge function update here was addressing the new evalcategory(.) output of ‘null’ for columns with excessive NaN values, which are simply columns we’ll drop from our set (for now; this was intended as a temporary troubleshooting hack).
  6. We’ve removed the application of insertinfill(.) to the test set for cases where there were no NaN values in the train set’s corresponding column. Addressing this scenario will be a future extension.
  7. There is also a structural update to the automunge function’s application of MLinfill to the ‘nmbr’ and ‘bnry’ categories, which were segregated primarily for ease of troubleshooting.
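
Here is the rough sketch of the predictinfill(.) model selection referenced in item 4 above, simplified relative to the actual function and assuming scikit-learn’s default hyperparameters:

from sklearn.svm import SVR, SVC

def train_infill_model(category, X_train, y_train):
    # support vector regression for numerical columns, support vector
    # classification for binary / categorical columns
    if category == 'nmbr':
        model = SVR()
    else:
        model = SVC()
    model.fit(X_train, y_train)
    return model

# the trained model then predicts infill values for rows with missing
# cells, using the other (already processed) columns as features:
# infill = model.predict(X_rows_with_missing_values)
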
Newcleus — Jam On It

The validation of our ML infill technique was conducted with two different data sets from the beginner problem competitions on Kaggle — specifically the Titanic set as well as the set from the House Prices Advanced Regression competition (the same competition previously addressed on this blog in 2017). Of the two sets the Housing set contained the higher prevalence of missing data points, so a hypothesis was that we would see a stronger effect on model accuracy there. We processed the train and test sets initially using standard infill techniques for missing data and then again with application of ML infill. Each scenario was trained with a dense network of somewhat arbitrary architecture roughly appropriate for the data (without much effort at hyperparameter tuning other than some rules of thumb). The trained model was then scored with the model’s accuracy metric on the validation set and labels carved out from the train set through the application of automunge. Since there is some noise in the distribution of a trained model’s accuracy from the randomness of backpropagation, the model training and scoring was repeated 33 times (an arbitrary number selected for time constraints) for each scenario to derive an average model accuracy. A consistent model architecture and hyperparameters were used for both scenarios, with and without ML infill, so the hope is that any difference in average model accuracy can be attributed to the ML infill technique. An improvement to the average accuracy metric of 0.6% was demonstrated for the Titanic set, while the Housing set saw a 2% accuracy disadvantage — so I wouldn’t consider the benefit of ML infill to final model accuracy a settled matter, no funny stuff. Again, please note that this training operation has some noise due to the randomness of backpropagation, and the accuracy metric did show material variability, so I would caution against drawing a final conclusion on the benefit of the approach from this sample size. If any reader would like to offer some GPU or TPU cycles for a more thorough validation I would certainly be grateful. I’ll close with the validation code from the Titanic set for reference:
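
As a simplified sketch of that repeated train-and-score loop (not the notebook code itself; the dense architecture shown here and the array names train_arrays, train_labels, val_arrays, and val_labels are stand-ins for the outputs of automunge, for the Titanic binary classification case):

import numpy as np
from tensorflow import keras

def build_model(input_dim):
    # a small dense network of somewhat arbitrary architecture
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(input_dim,)),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# repeat training and scoring to average out the noise from the
# randomness of backpropagation
accuracies = []
for trial in range(33):
    model = build_model(train_arrays.shape[1])
    model.fit(train_arrays, train_labels, epochs=20, batch_size=32, verbose=0)
    _, accuracy = model.evaluate(val_arrays, val_labels, verbose=0)
    accuracies.append(accuracy)

print('average validation accuracy:', np.mean(accuracies))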

Validation result for Titanic set without ML infill. Average accuracy = 81.2%
Validation result for Titanic set with ML infill. Average accuracy = 81.7%
Architectural detail from the Maitland Art Center (Maitland, FL)

“Viva Las Vegas.” — Elvis

“I am the Walrus.” — John Lennon

“Well that about does her. Wraps her all up.” — The Stranger

The Neville Brothers — Brother John / Iko Iko

Books that were referenced here or otherwise inspired this post:

Deep Learning With Python — Francois Chollet

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on linkedin for professional inquiries or twitter for personal.
