Predictions and Encodings

From a blog about nothing

Nicholas Teague
Automunge
4 min read · Sep 6, 2018


This is an inside joke (painting via Paint the Trail)

Rolling out a few quick updates to the automunge function from prior posts, this time adding a few outputs to support postprocessing once a machine learning algorithm has been applied. The most notable is functionality for encoding and reverse encoding of labels, so that predictions from the framework of your choice can be returned in the original label format. There’s a companion Colaboratory notebook available here if you’d like to try it out on your own data.

1) Import functions from prior notebooks

(Code not shown for brevity.)

#process functions
process_numerical_class(.)
process_binary_class(.)
process_text_class(.)
process_time_class(.)
#category evaluation
evalcategory(.)
#ML infill functions
NArows(.)
createMLinfillsets(.)
labelbinarizercorrect(.)
predictinfill(.)
insertinfill(.)

2) Update our automunge function

Incorporated the following updates for this version 1.02:

1) Removed the column sort feature (which I apparently couldn’t get working) and replaced it with a validation check that the train and test sets have their columns in the same order.

2) Added an output from automunge containing a dictionary of label encodings to facilitate reverse encoding of predictions (see the sketch following this list).

3) Made randomseed an input to the automunge function, although it does not yet feed into the scikit MLinfill models (a future extension).

4) Created a new automunge argument accepting a list of columns to exclude from transformations (excludetransformscolumns = []). Note that these columns need to already be in ML-ready format, otherwise the predictinfill function will return errors.

5) Added a step early in the automunge function to delete training set rows that are missing labels, since we don’t want to do infill on label columns.

6) Added an argument to the automunge function setting the number of iterations for the predictinfill ML methods. The idea is that since missing values are initialized with standard infill techniques, additional ML iterations might improve the infill values for a set with a high number of infill cells.

7) Created an output listing the column names of the returned arrays.

8) Removed shuffling of test data (test data doesn’t need shuffling).
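
To make the new arguments and outputs a little more concrete, here’s a rough sketch of what a call might look like. The argument names, return order, and the ‘Survived’ label column are illustrative assumptions rather than the exact interface; refer to the companion notebook for the real signature.

#illustrative sketch of a v1.02 call - argument names, return order, and the
#'Survived' label column are assumptions, see the companion notebook for the
#actual interface
train, trainlabels, test, labelsencoding_dict, columnnames = \
    automunge(titanic_df_train, titanic_df_test,
              labelscolumn = 'Survived',
              randomseed = 42,
              excludetransformscolumns = [],
              MLinfilliterations = 1)

The labels encoding dictionary output is what we’ll use for the reverse encoding step demonstrated below, and the column names output gives the column order of the returned arrays.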

3) Test our functions

We’ll use the titanic set from Kaggle for our demonstration. The imports are addressed in the companion Colaboratory notebook; here we’ll simply apply the automunge function. First, here again is the head of the train set to illustrate the format:

titanic_df_train.head()
Titanic data from Kaggle

From my last post, a reminder on our preprocessing prior to application:

Now there are certain aspects of feature engineering that our automunge won’t address. Because each row is unique, I expect there won’t be any learning from the Cabin, Name, or Ticket columns, so we’ll go ahead and delete those columns for our demonstration. It is certainly feasible that some learnable feature is buried in these columns that could be extracted with further processing prior to applying automunge. For example, one could extract from the Mrs/Ms/Miss designation in the Name column whether a female passenger is married or single. From the Cabin field, perhaps we could infer which deck the passenger was sleeping on, or whether they even had a cabin. Since this current iteration doesn’t evaluate a text field’s individual characters, only character groupings, this type of evaluation would need to be done prior to applying automunge; such further evaluation could be a future extension.
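For reference, a minimal pandas sketch of that manual preprocessing might look like the following. The commented title extraction is just a hypothetical illustration of the kind of feature that could be engineered first; it isn’t used in the demonstration below.

#hypothetical illustration of feature engineering one could do before automunge,
#e.g. pulling the Mr/Mrs/Miss designation out of the Name column (not used below)
#titanic_df_train['Title'] = titanic_df_train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

#drop the columns where every row is unique prior to applying automunge
titanic_df_train = titanic_df_train.drop(columns=['Cabin', 'Name', 'Ticket'])
titanic_df_test = titanic_df_test.drop(columns=['Cabin', 'Name', 'Ticket'])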

Now there are obviously a lot more sophisticated ways to train on this data that should provide greater accuracy in predictions; the following is just to demonstrate re-encoding of labels after prediction using scikit’s vanilla logistic regression model.

predictions output array
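
A rough sketch of that training step, reusing the (assumed) variable names from the call sketched earlier:

import numpy as np
from sklearn.linear_model import LogisticRegression

#fit scikit's vanilla logistic regression on the automunge-processed train set
#(train, trainlabels, and test follow the hypothetical call sketched above)
model = LogisticRegression()
model.fit(train, np.ravel(trainlabels))

#generate predictions on the processed test set
predictions = model.predict(test)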

Now we’ll quickly re-encode the predictions into the format of the original train set. It turns out this is kind of a silly example since our original train set was already encoded numerically as 0/1, but please take my word that this would work just as well if the original encoding was, for instance, a text classification. We’ll also merge with the ID column to be consistent with the format typically requested for Kaggle submissions.

reverse-encoded predictions (kind of a silly example since the encoding was already 0/1 to begin with)
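
The reverse encoding itself might look roughly like the following; the structure of the labels encoding dictionary as a mapping from encoded values back to the original label categories is an assumption here for illustration.

import pandas as pd

#map encoded predictions back to the original label categories - the layout of
#labelsencoding_dict is an assumed illustration, not the exact output structure
reverse_map = labelsencoding_dict['Survived']
decoded = [reverse_map.get(value, value) for value in predictions]

#merge with the PassengerId column in the format of a typical Kaggle submission
submission = pd.DataFrame({'PassengerId': titanic_df_test['PassengerId'],
                           'Survived': decoded})
submission.to_csv('submission.csv', index=False)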

Great, well I think I’ll chalk this one up as a success. Until next time.

Books that were referenced here or otherwise inspired this post:

Innovation and Entrepreneurship — Peter Drucker


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on LinkedIn for professional inquiries or Twitter for personal.
