Wrangling for Smart People

Keep it simple stupid

Nicholas Teague
Automunge
Aug 23, 2018

This is an inside joke.

I drew this code up a while back when I was active on Kaggle and just realized that I never rolled it out anywhere other than some dark corner of github, so here now are some defined functions for wrangling structured datasets. Who knows, maybe those nice people at Towards Data Science will even consider picking this up. There’s a companion Colaboratory notebook available [here]. Cheers.

The following functions preprocess different categories of data in structured datasets. Each function takes as input one or two pandas dataframes and the name of the column to be processed. Once defined, the functions will be tested and demonstrated.

  • process_numerical_class(mdf_train, mdf_test, column)
  • process_binary_class(mdf, column, missing)
  • process_text_class(mdf_train, mdf_test, column)

These functions are meant to handle some potential data problems, such as missing values. Other conceivable problems, such as a column that mixes numeric and string input, are not addressed.

Define Functions

The following are imports to support our functions.

We’ll define process_numerical_class(.) to address columns containing numerical data. Note that the Train and Test data are processed together to ensure consistent normalization.
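The original code for this function is not reproduced in this text, so here is a minimal sketch of what such a function could look like. Only the signature comes from the list above. The choice of z-score normalization, the use of Train statistics for both sets, and filling missing values with the Train mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def process_numerical_class(mdf_train, mdf_test, column):
    """Z-score normalize `column` using statistics from the Train set."""
    # Work on copies so the caller's dataframes are left untouched.
    mdf_train = mdf_train.copy()
    mdf_test = mdf_test.copy()

    # Coerce non-numeric entries to NaN so they are treated as missing.
    mdf_train[column] = pd.to_numeric(mdf_train[column], errors="coerce")
    mdf_test[column] = pd.to_numeric(mdf_test[column], errors="coerce")

    # Statistics are taken from Train and reused for Test, which keeps
    # the two sets on the same scale.
    mean = mdf_train[column].mean()
    std = mdf_train[column].std()

    # Assumption: missing values are filled with the Train mean.
    mdf_train[column] = (mdf_train[column].fillna(mean) - mean) / std
    mdf_test[column] = (mdf_test[column].fillna(mean) - mean) / std
    return mdf_train, mdf_test


# Hypothetical sample values, for illustration only.
train = pd.DataFrame({"number": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"number": [2.0, np.nan]})
train_out, test_out = process_numerical_class(train, test, "number")
```

A column with zero variance in Train would divide by zero here, so a more careful version would guard against that case.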

We’ll define process_binary_class(.) to address columns containing binary data classifications.
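As a rough sketch of the idea, and not the original implementation, a binary column could be encoded as follows. The signature comes from the list above. Treating `missing` as the value substituted for absent entries, and assigning 0 and 1 in sorted order, are assumptions.

```python
import numpy as np
import pandas as pd


def process_binary_class(mdf, column, missing):
    """Encode a two-valued column as 0/1, filling gaps with `missing`."""
    mdf = mdf.copy()
    # Assumption: `missing` is the value substituted for absent entries.
    filled = mdf[column].fillna(missing)

    # Sort the distinct values so the 0/1 assignment is deterministic.
    categories = sorted(filled.unique())
    if len(categories) > 2:
        raise ValueError(f"{column!r} has more than two categories: {categories}")

    mapping = {value: code for code, value in enumerate(categories)}
    mdf[column] = filled.map(mapping).astype(int)
    return mdf


# Hypothetical sample values, for illustration only.
df = pd.DataFrame({"Y/N": ["Y", "N", np.nan, "Y"]})
encoded = process_binary_class(df, "Y/N", missing="N")
```

Because this sketch handles one dataframe at a time, the mapping is only consistent between Train and Test if both contain the same two values.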

We’ll define process_text_class(.) to process columns containing text classes with more than two categories, using one-hot encoding. Note that the Train and Test data are processed together to ensure consistent categorization.
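One possible way to get consistent one-hot columns across both sets is sketched below. This is an illustration built on `pd.get_dummies`, not the original code. The `column_category` naming scheme and the `'missing'` placeholder category are assumptions.

```python
import numpy as np
import pandas as pd


def process_text_class(mdf_train, mdf_test, column):
    """One-hot encode `column` with identical columns in Train and Test."""
    # Assumption: absent entries become their own 'missing' category.
    train_col = mdf_train[column].fillna("missing").astype(str)
    test_col = mdf_test[column].fillna("missing").astype(str)

    # Build the category list from both sets so each set gets every column,
    # in the same (sorted) order.
    dtype = pd.CategoricalDtype(sorted(set(train_col) | set(test_col)))

    def encode(mdf, col):
        # A categorical series yields a dummy column for every category,
        # including categories that never occur in this particular set.
        onehot = pd.get_dummies(col.astype(dtype), prefix=column, dtype=int)
        return pd.concat([mdf.drop(columns=column), onehot], axis=1)

    return encode(mdf_train, train_col), encode(mdf_test, test_col)


# Hypothetical sample values, for illustration only.
train = pd.DataFrame({"shape": ["circle", "oval", np.nan, "square"]})
test = pd.DataFrame({"shape": ["square", "circle"]})
train_out, test_out = process_text_class(train, test, "shape")
```

With this sample, both outputs carry the same four columns, and the `shape_oval` column in the Test output is all zeros because no Test row has that category.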

Test Functions

Here we’ll create some sample Train and Test datasets for demonstration of our functions.

First we’ll separate our labels from the Train data.
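The demonstration data itself is not reproduced in this text, so the following is a hypothetical stand-in. The column names 'number', 'Y/N', and 'shape' come from the walkthrough; the values and the `label` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the values and the 'label' column name are
# illustrative, not the original demonstration data.
train = pd.DataFrame({
    "number": [1.0, 2.0, np.nan, 4.0],
    "Y/N": ["Y", "N", "Y", np.nan],
    "shape": ["circle", "oval", np.nan, "square"],
    "label": [1, 0, 1, 0],
})
test = pd.DataFrame({
    "number": [3.0, np.nan],
    "Y/N": ["N", "Y"],
    "shape": ["square", "circle"],
})

# Separate the labels so that only feature columns are preprocessed.
# DataFrame.pop removes the column from `train` and returns it as a Series.
train_labels = train.pop("label")
```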

Train dataframe for demonstration

Now we’ll apply process_numerical_class(.) to the ‘number’ column.

Train dataframe after application of process_numerical_class(.)

Now we’ll apply process_binary_class(.) to the ‘Y/N’ column.

Train dataframe after application of process_binary_class(.)

Now we’ll apply process_text_class(.) to the ‘shape’ column. Note that our one-hot-encoding function created a new column for missing category values.

Train dataframe after application of process_text_class(.)

Here we’ll view the Test set, which we processed along with the Train set. Note that our one-hot-encoding function inserted a column for the ‘oval’ category even though this set contains no datapoint with that value. Note also that the order and naming convention of the category columns are consistent between the Train and Test sets.

Test dataframe which was simultaneously processed.

Now that we have defined our functions for preprocessing structured data, we’ll begin a new notebook that incorporates these functions and applies them to the processing and training of the Kaggle house price regression data set. To be continued in the next notebook.

The Beatles — Her Majesty

Books that were referenced here or otherwise inspired this post:

Python for Data Analysis — Wes McKinney

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations.
