Automunge

In which we keep on wranglin’

Nicholas Teague
Automunge
3 min read · Aug 28, 2018

In my last post I drew up some functions for wrangling structured datasets. One extension of that method is a function that evaluates each column in a data frame and determines which of the three processing functions defined there is most appropriate. That could open the door to automated data preprocessing, which may prove beneficial for training sets at big data scale. In this notebook we’ll create that function, so that the choice of processing function for each column of a structured dataset is automated. There’s a companion Colaboratory notebook available [here]. Cheers.

1) Import data pre-processing functions from last notebook

Here again are the functions we defined in our last post:

2) Define evalcategory(.) and automunge(.) functions

  • evalcategory(.): looks at a column in a dataframe to determine which preprocessing function to apply
  • automunge(.): takes as input the train set and test set dataframes, the id of the label column in the train set, and a validation ratio, then outputs a set of numpy arrays ready for machine learning algorithms in the framework of your choice.
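
Since the embedded notebook code is not reproduced here, the following is only a rough sketch of how these two functions could fit together, not the notebook's actual implementation. The three category codes ('bnry', 'nmbr', 'text'), the helper process_column, the name automunge_sketch, and the argument names valpercent and seed are all assumptions made for illustration. Labels are passed through unprocessed to keep the sketch short.

```python
import numpy as np
import pandas as pd


def evalcategory(df, column):
    """Pick a category code for one column: 'bnry', 'nmbr', or 'text'."""
    values = df[column].dropna()
    if values.nunique() <= 2:
        return "bnry"   # two or fewer distinct values
    if pd.api.types.is_numeric_dtype(values):
        return "nmbr"   # numeric with more than two distinct values
    return "text"       # everything else is treated as categorical


def process_column(train_col, test_col, category):
    """Encode one column of train and test using statistics from train only."""
    if category == "nmbr":
        # z-score normalization; missing values land on the mean (0.0)
        mean, std = train_col.mean(), train_col.std()
        return (((train_col - mean) / std).fillna(0.0).to_frame(),
                ((test_col - mean) / std).fillna(0.0).to_frame())
    if category == "bnry":
        # map the (at most two) train values to 0 and 1; unseen values become 0
        mapping = {v: i for i, v in enumerate(sorted(train_col.dropna().unique()))}
        return (train_col.map(mapping).fillna(0).astype(float).to_frame(),
                test_col.map(mapping).fillna(0).astype(float).to_frame())
    # 'text': one-hot encode, aligning test columns to those seen in train
    train_hot = pd.get_dummies(train_col, prefix=train_col.name, dtype=float)
    test_hot = pd.get_dummies(test_col, prefix=test_col.name, dtype=float)
    return train_hot, test_hot.reindex(columns=train_hot.columns, fill_value=0.0)


def automunge_sketch(df_train, df_test, labels_column, valpercent=0.2, seed=0):
    """Return train, labels, validation, validationlabels, test as numpy arrays."""
    labels = df_train[labels_column].to_numpy()
    features = df_train.drop(columns=[labels_column])
    train_parts, test_parts = [], []
    for column in features.columns:
        category = evalcategory(features, column)
        tr, te = process_column(features[column], df_test[column], category)
        train_parts.append(tr)
        test_parts.append(te)
    train = pd.concat(train_parts, axis=1).to_numpy(dtype=float)
    test = pd.concat(test_parts, axis=1).to_numpy(dtype=float)
    # shuffle the rows, then hold out the last rows for validation
    order = np.random.default_rng(seed).permutation(len(train))
    split = len(train) - int(len(train) * valpercent)
    train, labels = train[order], labels[order]
    return train[:split], labels[:split], train[split:], labels[split:], test
```

One design choice worth noting: every statistic (mean, standard deviation, set of categories) comes from the train set, and the test set is encoded with those same values so the two stay consistent.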

Note that this approach assumes that the test data is available at the time of wrangling. A reasonable extension for cases where the test data is processed later would be to have the function also output the values it derived from the train set, such as the normalization parameters, so that the same transformations can be applied to the test data afterward. Output of those values could be triggered by an additional True/False argument to the function.
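
To make that extension a little more concrete, here is a minimal sketch for the numerical case only. The function names process_numerical and apply_numerical, the return_params flag, and the params dictionary are hypothetical, chosen just to illustrate the True/False idea.

```python
import pandas as pd


def apply_numerical(col, params):
    """Apply previously saved z-score parameters to any column."""
    return ((col - params["mean"]) / params["std"]).fillna(0.0)


def process_numerical(train_col, test_col=None, return_params=False):
    """Normalize train (and test, if given); optionally return the parameters."""
    # parameters are always derived from the train column
    params = {"mean": float(train_col.mean()), "std": float(train_col.std())}
    train_out = apply_numerical(train_col, params)
    test_out = apply_numerical(test_col, params) if test_col is not None else None
    if return_params:
        return train_out, test_out, params
    return train_out, test_out


# Wrangle the train set now and keep the parameters for later.
train_out, _, params = process_numerical(pd.Series([1.0, 2.0, 3.0]), return_params=True)

# Later, when test data arrives, reuse the saved parameters.
later_test = apply_numerical(pd.Series([4.0, None]), params)
```

Changing the number of returned values based on a flag follows the True/False idea above, though always returning the parameters would arguably be simpler for callers.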

3) Test Functions

Here we’ll create some sample Train and Test datasets for demonstration of our functions.
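
The original sample data is not reproduced here, so as a stand-in, this is one way such datasets could be put together. The column names, sizes, and the helper make_sample are made up for illustration; the point is just to have one numerical, one binary, and one categorical column, plus a label column in the train set only.

```python
import numpy as np
import pandas as pd


def make_sample(n_rows, rng, include_label):
    """Build a small random dataframe with one column of each kind."""
    df = pd.DataFrame({
        "number": rng.normal(loc=10.0, scale=3.0, size=n_rows),
        "binary": rng.choice(["yes", "no"], size=n_rows),
        "category": rng.choice(["cat", "dog", "bird"], size=n_rows),
    })
    if include_label:
        # only the train set carries the label column
        df["label"] = rng.integers(0, 2, size=n_rows)
    return df


rng = np.random.default_rng(0)  # fixed seed so the sample is reproducible
df_train = make_sample(10, rng, include_label=True)
df_test = make_sample(4, rng, include_label=False)
```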

Now let’s apply our automunge and see how we did.

Here we’ll view the output numpy arrays:

train
labels
validation
validationlabels
test

Great, well, I think I’ll chalk this one up as a success. Until next time.

The Beatles — Let it Be

Books that were referenced here or otherwise inspired this post:

Seeking Wisdom, by Peter Bevelin

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post, feel free to like, comment, or share. I can also be reached on LinkedIn for professional inquiries or Twitter for personal ones.

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.