Automated Data Wrangling with Machine Learning Derived Infill for Missing Values

The shiny new thing

Nicholas Teague
Automunge
6 min read · Sep 4, 2018


In my last few posts I’ve been publishing algorithms developed to automate the process of data wrangling for structured datasets. Here we’ll continue the journey by adding a new feature that updates the approach for infilling missing values. Previously, missing values were addressed by inserting the mean of the column for numerical data, inserting the most common value for binary classes, or adding a dedicated missing-data column for categorical data addressed with one-hot encoding. Specifically, we’ll turn to machine learning algorithms via scikit-learn to derive a predicted value for each infill cell based on the properties of the other features in that row (‘ML infill’). My expectation is that this approach could prove beneficial to training on particularly messy data sets. This work is intended as a proof of concept, as I have not previously seen ML infill methods in the literature.
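
As a point of reference, here is a minimal sketch of those baseline infill conventions on a hypothetical toy dataframe (the column names and values here are illustrative, not from the notebook):

```python
import pandas as pd

# hypothetical toy frame to illustrate the baseline infill conventions
df = pd.DataFrame({'age': [34.0, None, 29.0],
                   'sex': ['male', 'female', None],
                   'color': ['red', None, 'blue']})

# numerical column: insert the mean of the column
df['age'] = df['age'].fillna(df['age'].mean())

# binary column: insert the most common value
df['sex'] = df['sex'].fillna(df['sex'].mode()[0])

# categorical column: one-hot encode, with an extra column flagging missing entries
df = pd.get_dummies(df, columns=['color'], dummy_na=True)
```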

Before deriving the ML predicted infill values for missing points, the data is processed such that numerical data is normalized and categorical data is encoded. Missing data is initially filled in with the baseline approaches detailed above. We then take a ‘for loop’ through each column and create a set of features and labels for that column. The current method uses linear regression for numerical data and logistic regression for categorical data, although a natural extension would be to incorporate other architectures for this task. The ML infill models are trained on the features of the train set, and once trained the same models are applied to infill the test data. Once we derive our infill predictions, the prior placeholder infill values are replaced.
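
To make the loop concrete, here is a rough sketch of the idea in scikit-learn terms. This is not the notebook’s implementation: the missing_masks and numeric_columns structures are assumed bookkeeping (which rows were originally missing, and whether a column is numerical), and real code would need a few more safeguards.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# rough sketch of the ML infill idea: for each target column, train a model
# on the rows where the value was originally present, using the remaining
# columns as features, then predict replacements for the originally missing rows
def ml_infill_sketch(df_train, df_test, missing_masks, numeric_columns):
    for column in df_train.columns:
        mask_train = missing_masks['train'][column]   # True where value was missing
        mask_test = missing_masks['test'][column]
        features = [c for c in df_train.columns if c != column]

        # linear regression for numerical targets, logistic regression for categorical
        if numeric_columns[column]:
            model = LinearRegression()
        else:
            model = LogisticRegression()

        # fit only on rows that originally had a value for this column
        model.fit(df_train.loc[~mask_train, features],
                  df_train.loc[~mask_train, column])

        # overwrite the placeholder infill in both train and test sets
        if mask_train.any():
            df_train.loc[mask_train, column] = model.predict(
                df_train.loc[mask_train, features])
        if mask_test.any():
            df_test.loc[mask_test, column] = model.predict(
                df_test.loc[mask_test, features])
    return df_train, df_test
```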

The ML infill address of the text category of columns required a little more complexity, since those columns were processed into multiple columns with one-hot encoding, such that training our ML infill required extracting multiple columns at once; I’ll defer to the code if you’d like to learn more about the approach. Note that this notebook in its current iteration only applies the ML infill method to numerical, binary, and categorical columns. A natural extension would be to extend the ML infill technique to the datetime category included in our automunge(.) function.
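
For a sense of what that extraction looks like, here is a hedged sketch for a single one-hot encoded source column. The column names are hypothetical and this is not the notebook’s code; the actual notebook handles this bookkeeping through dedicated helper functions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# rough sketch of ML infill for a column that was one-hot encoded into
# multiple columns (assumes at least one row was originally missing)
def onehot_infill_sketch(df, missing_mask, onehot_columns):
    feature_columns = [c for c in df.columns if c not in onehot_columns]

    # collapse the one-hot group back to a single class index for training
    labels = df.loc[~missing_mask, onehot_columns].values.argmax(axis=1)

    model = LogisticRegression()
    model.fit(df.loc[~missing_mask, feature_columns], labels)

    # predict a class index for each missing row and re-expand to one-hot
    predicted = model.predict(df.loc[missing_mask, feature_columns])
    df.loc[missing_mask, onehot_columns] = np.eye(len(onehot_columns), dtype=int)[predicted]
    return df
```

One wrinkle worth noting: if you use scikit’s LabelBinarizer for the re-expansion instead of the identity-matrix trick above, it returns a single column when there are only two classes, which is the kind of inconsistency that has to be corrected for.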

I’ll present the code here in its entirety. There is also a companion Colaboratory notebook available here. Note that some updates have been incorporated to the functions addressed in last week’s posts. We’ll demonstrate the operation first on a miniature data set that I created for this purpose, then we’ll apply it to a slightly more expansive set from the Titanic competition available on Kaggle.

Preservation Hall Jazz Band — Just A Closer Walk With Thee, Part I

1) preprocess and evalcategory functions from prior notebooks

Here are the preprocess functions we’ve previously introduced, with a few updates.

The evalcategory(.) function has also had a few updates from our prior introduction.

Preservation Hall Jazz Band — Just A Closer Walk With Thee, Part II

2) Define new functions

Here we’ll introduce the following functions to support our ML infill methods:

  • NArows(.) — used to identify which rows for a column to infill with ML derived plug data
  • createMLinfillsets(.) — returns a series of dataframes which can be applied to training a machine learning model to predict appropriate infill values for those points that had missing values in the original sets
  • labelbinarizercorrect(.) — ensures that the re-encoding following application of scikit’s LabelBinarizer is consistent with the original array
  • predictinfill(.) — returns predicted infills for the train and test feature sets’ missing values based on application of machine learning against other points in the set
  • insertinfill(.) — replaces the missing values in rows corresponding to the NArows with the values from predictinfill (see the sketches just below)
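
As a rough illustration of the first and last of these, here are hedged sketches of what NArows and insertinfill accomplish (not the notebook’s exact signatures):

```python
def NArows_sketch(df, column):
    # boolean Series marking which rows of a column were originally missing,
    # captured before the placeholder infill is applied
    return df[column].isna()

def insertinfill_sketch(df, column, na_rows, predicted_values):
    # overwrite the placeholder infill with the ML derived predictions
    df.loc[na_rows, column] = predicted_values
    return df
```
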
Sweet Emma and Her Preservation Hall Jazz Band — Closer Walk With Thee

3) Update automunge(.) function

Here we update the master automunge(.) function to incorporate the MLinfill methods.

4) Test Functions

Here we’ll create some Train and Test datasets for demonstration of our functions. First we’ll use a small set similar to those I introduced in prior posts, then we’ll apply the Titanic dataset available from the Kaggle competition to demonstrate the application in the real world.

Now let’s apply our automunge function to see how we did. First we’ll use the MLinfill method, then we’ll try again without the MLinfill to compare results. I’ll highlight in red those cells that were impacted by the MLinfill technique.
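
For orientation, the comparison might look roughly like the following, assuming a simplified call pattern (only the MLinfill argument is taken from the post; the other names here are placeholders rather than the notebook’s actual signature):

```python
# hypothetical call pattern for the comparison; see the notebook for the real signature
train_ML, test_ML = automunge(df_train, df_test, MLinfill=True)

# rerun with ML infill switched off to fall back to the baseline infill methods
train_base, test_base = automunge(df_train, df_test, MLinfill=False)
```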

train array output from automunge function with MLinfill changes highlighted in red

Here is the output of automunge on the same data with MLinfill argument set to False for comparison.

train array output from automunge function with NA cells using standard infill techniques highlighted in red

Now we’ll upload the titanic data and confirm application of automunge on a real world set. First here’s what the data looks like upon upload:

titanic data train set first five rows before any processing

Now there are certain aspects of feature engineering that our automunge won’t address. Because each row’s entry is unique, I expect there won’t be any learning for the Cabin, Name, or Ticket columns, so we’ll go ahead and delete those columns for our demonstration. It is certainly feasible that there is some learnable feature buried in these columns that could be extracted with further processing prior to application of automunge. For example, one could extract from the Mrs/Ms/Miss designation in the Name column whether a female passenger is married or single. From the Cabin field perhaps we could infer on what deck the passenger was sleeping or whether they even had a cabin. Since this current iteration doesn’t evaluate a text field’s individual characters, only character groupings, this type of evaluation would need to be done prior to application of automunge; such further evaluation could be a future extension.
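
As a quick sketch of those manual steps, assuming the standard Kaggle Titanic column names (the married flag is just an illustration of the kind of extraction described, not something automunge performs):

```python
import pandas as pd

df_train = pd.read_csv('train.csv')

# optionally derive a marital-status style feature from the honorific in Name
df_train['married'] = df_train['Name'].str.contains(r'Mrs\.').astype(int)

# drop the high-cardinality text columns before applying automunge
df_train = df_train.drop(columns=['Cabin', 'Name', 'Ticket'])
```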

So the good news is the application succeeded. The array output didn’t reveal any specific MLinfill examples (the nan values are probably a little sparser than in our prior example set), so I’ll just show an image of the output without a comparison.

titanic set train array output from automunge function

Great well I think I’ll chalk this one up as a success. Until next time.

Bobby McFerrin — Don’t Worry Be Happy

Books that were referenced here or otherwise inspired this post:

So Long and Thanks For All the Fish — Douglas Adams

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on linkedin for professional inquiries or twitter for personal.
