An intro to Automunge

The stuff I couldn’t fit in a tweet

Nicholas Teague
Automunge
21 min read · Sep 30, 2019


Central Florida is like the Silicon Valley of the south they say.

Preface

I recently published a demonstration notebook for Automunge on Kaggle (a data science competition platform) based on the data sets for the IEEE fraud detection competition, and, well, two likes later I'm thinking I might want to try sharing on a real peer reviewed channel like Medium. So yeah, here we are. Following is a demonstration of some of the core features of the Automunge platform for automated feature engineering. And without further ado.

Introduction

Hi there! I'm kind of new to Kaggle but would like to use this competition as an opportunity to demonstrate the Automunge tool for automated data wrangling. This entry is possibly in the realm of Automated ML Tool territory, but to be fair Automunge isn't really a turnkey machine learning replacement. Its intended use is primarily the preparation of tabular data for machine learning (basically performing the data preparation pipelines in the steps immediately preceding training a machine learning model, plus the subsequent consistent processing to prepare data for generating predictions), and the application of predictive algorithms is intended to be conducted separately. That being said, the tool does make use of some predictive algorithms along the way, including the optional use of machine learning to predict infill to missing or improperly formatted data, what we call ML infill (more on that later).

In short, Automunge prepares tabular data intended for training a machine learning model, and enables consistent processing of subsequently available data for generating predictions from that same model. Through preparation, numerical data is normalized, categorical data is encoded, and time-series data is also encoded. A user may defer to automated methods where the tool infers properties of each column to assign a processing method, or alternately assign custom processing methods to distinct columns from our library of feature engineering transformations.

A user may also consider Automunge a platform for data wrangling, and may pass their own processing functions incorporating simple data structures, such that through the incorporation of their transforms into the tool they can make use of extremely useful methods such as machine learning derived infill to missing or improperly formatted data (ML infill), feature importance evaluation, automated dimensionality reduction via feature importance or Principal Component Analysis (PCA), and perhaps most importantly the simplest means for consistent processing of subsequently available data with just a single function call. In short, we make machine learning easy.

Prerequisites

Before proceeding with the demonstration we'll conduct a few data preparations. Note that Automunge needs the following prerequisites to operate:

  • tabular data in Pandas dataframe or Numpy array format
  • “tidy data” (meaning one feature per column and one observation per row)
  • if available, a label column may be included in the set with column name passed to function as string
  • a “train” data set intended to train a machine learning model and if available a “test” set intended to generate predictions from the same model
  • corresponding columns in the train and test data must have consistently formatted data and consistent column headers

Ok, introductions complete, let's go ahead and manually munge to meet these requirements.

Data imports and preliminary munging
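For those following along, here's a minimal sketch of the kind of preliminary munging involved; the file path, feature subset, and split sizes here are illustrative assumptions rather than the exact cells from the notebook.

#minimal sketch of preliminary munging (path, subset, and sizes are assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split

#IEEE fraud detection competition transaction data (path per Kaggle kernel convention)
df_train = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')

#reduced feature set for the demonstration (subset chosen for illustration)
columns = ['isFraud', 'TransactionDT', 'TransactionAmt', 'ProductCD',
           'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
           'addr1', 'addr2', 'dist1', 'P_emaildomain']

#carve out a small "tiny_train" set, leaving the bigger remainder
tiny_train, tiny_train_bigger = \
    train_test_split(df_train[columns], train_size=0.01, random_state=42)

#and similarly a small "tiny_test" set for the postmunge demonstration below
tiny_test, _ = \
    train_test_split(tiny_train_bigger, train_size=0.01, random_state=42)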

Automunge install and initialize
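The install is just a pip install from PyPI. The import idiom below is how I recall the README of this era reading, so treat it as an assumption if you're on a different version.

#install (e.g. in a notebook cell)
!pip install Automunge

#initialize the class
from Automunge import Automunger
am = Automunger.AutoMunge()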

Ok let’s give it a shot

Well at the risk of overwhelming the reader I’m just going to throw out a full application. Basically, we pass the train set and if available a consistently formatted test set to the function and it returns normalized and numerically encoded sets suitable for the direct application of machine learning. The function returns a series of sets (some of which, based on the options selected, may be empty), I find it helps to just copy and paste the full range of parameters and returned sets from the documentation for each application.

So first let’s just try a generic application with our tiny_train set. Note tiny_train here represents our train set. If a labels column is available we should include and designate, and any columns we want to exclude from processing we can designate as “ID columns” which will be carved out and consistently shuffled and partitioned. Note here we’re only demonstrating on the set with the reduced number of features to save time.

etc.

So what's going on here is we're calling the function am.automunge and assigning the returned sets to a series of objects:

#full application (parameter defaults shown as I recall them from the README of this era)
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test=False, labels_column=False,
             trainID_column=False, testID_column=False,
             valpercent1=0.0, valpercent2=0.0, shuffletrain=False,
             TrainLabelFreqLevel=False, powertransform=False,
             binstransform=False, MLinfill=False, infilliterate=1,
             randomseed=42, numbercategoryheuristic=15, pandasoutput=False,
             NArw_marker=False, featureselection=False,
             PCAn_components=None, PCAexcl=[], ML_cmnd={},
             assigncat={}, assigninfill={},
             transformdict={}, processdict={}, printstatus=True)

Again we don’t have to include all of the parameters when calling the function, but I find it helpful just to copy and paste them all. For example if we just wanted to defer to defaults we could just call:

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train)

Those sets returned from the function call are as follows:

  • train, trainID, labels : these are the sets intended to train a machine learning model. (The ID set is simply any columns we wanted to exclude from transformations comparably partitioned and shuffled)
  • validation1, validationID1, validationlabels1 : these are sets carved out from the train set intended for hyperparameter tuning validation based on the designated validation1 ratio (defaults to 0.0)
  • validation2, validationID2, validationlabels2 : these are sets carved out from the train set intended for final model validation based on the designated validation2 ratio (defaults to 0.0)
  • test, testID, testlabels : these are the sets derived from any passed test set intended to generate predictions from the machine learning model trained from the train set, consistently processed as the train set
  • labelsencoding_dict : this is a dictionary which may prove useful for reverse encoding predictions generated from the machine learning model to be trained from the train set
  • finalcolumns_train, finalcolumns_test : a list of the columns returned from the transformation, may prove useful in case one wants to ensure consistent column labeling which is required for subsequent processing of any future test data
  • featureimportance : this stores the results of the feature importance evaluation if user elects to conduct
  • postprocess_dict : this dictionary should be saved as it may be used as an input to the postmunge function to consistently process any subsequently available test data

Let’s take a look at a few items of interest from the returned sets.

Notice that the returned sets now include a suffix appended to each column name. These suffixes identify what types of transformations were performed. Here we see a few different types of suffixes:

#suffixes identifying steps of transformation
list(train)
#And here's what the returned data looks like.
train.head()

Upon inspection:

  • addr2, card4, and ProductCD each have a series of suffixes which represent the different categories derived from a one-hot-encoding of a categorical set
  • each of TransactionDT, TransactionAmt, card1, card2, card3, card5, addr1, addr2, dist1 has the suffix ‘nmbr’ which represents a z-score normalization
  • card6 has the suffix ‘bnry’ which represents a binary (0/1) encoding
  • P_emaildomain has the suffix ‘ordl’ which represents an ordinal (integer) encoding

Automunge uses suffix appenders to track the steps of transformations. For example, one could assign transformations to a column which resulted in multiple suffix appenders, such as, say, ‘column1_bxcx_nmbr’, which would represent a column with original header ‘column1’ upon which were performed two steps of transformation: a box-cox power law transform followed by a z-score normalization.

Labels

When we conducted the transformation we also designated a label column which was included in the set, so let's take a peek at the returned labels (returned as a separate object from the train data).

list(labels)

As you can see, the returned values in the labels column are consistently encoded as they were passed.

labels['isFraud_bnry'].unique()

Note that if our original labels weren’t yet binary encoded, we could inspect the returned labelsencoding_dict object to determine the basis of encoding. Here we just see that the 1 value originated from values 1, and the 0 value originated from values 0 — a trivial example, but this could be helpful if we had passed a column containing values [‘cat’, ‘dog’] for instance.

labelsencoding_dict

Subsequent consistent processing with postmunge(.)

Another important object returned from the automunge application is what we call the "postprocess_dict". In fact, good practice is that we should always externally save any postprocess_dict returned from an application of automunge whose output was used to train a machine learning model. Why? Well, using this postprocess_dict object, we can then pass any subsequently available "test" data that we want to use to generate predictions from that machine learning model, giving fully consistent processing and encoding. Let's demonstrate.

When we performed a train_test_split above to derive the "tiny_train" set, we also ended up with a bigger set called "tiny_train_bigger". Let's try applying the postmunge function to such an additional set to process it consistently.

Note a few prerequisites for the application of postmunge:

  • requires passing a postprocess_dict that was derived from the application of automunge
  • data formatted consistently with the train set used in the application of automunge from which the postprocess_dict was derived
  • consistent column labeling as the train set used in the application of automunge from which the postprocess_dict was derived (or alternatively, for the case of Numpy arrays, consistent order of columns)

And there we have it. Let's demonstrate the postmunge function on the "tiny_test" set we prepared above.

etc.
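For concreteness, here's a minimal sketch of such a call; the number and order of returned sets follow the README of this era as I recall it, so treat the specifics as assumptions.

#consistent processing of additional data using the saved postprocess_dict
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, tiny_test)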

And if we’re doing our job right then this set should be formatted exactly like that returned from automunge, let’s take a look.

test.head()

Looks good!

So if we wanted to generate predictions from a machine learning model trained on a train set processed with automunge, we now have a way to consistently prepare data with postmunge.

Let’s explore a few of the automunge parameters

Ok let’s take a look at a few of the optional methods available here. First here again is what a full automunge call looks like:
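(Parameter defaults shown here are as I recall them from the README of this era; treat the specific values as assumptions.)

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test=False, labels_column=False,
             trainID_column=False, testID_column=False,
             valpercent1=0.0, valpercent2=0.0, shuffletrain=False,
             TrainLabelFreqLevel=False, powertransform=False,
             binstransform=False, MLinfill=False, infilliterate=1,
             randomseed=42, numbercategoryheuristic=15, pandasoutput=False,
             NArw_marker=False, featureselection=False,
             PCAn_components=None, PCAexcl=[], ML_cmnd={},
             assigncat={}, assigninfill={},
             transformdict={}, processdict={}, printstatus=True)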

So let’s just go through these one by one. (This list is kind of diving into the weeds, not required reading)

  • df_train and df_test: First note that we can pass two different Pandas dataframe or Numpy array sets to automunge, such as might be beneficial if we have one set with labels (a “train” set) and one set without (a “test” set); alternatively we can just pass the train set and leave df_test = False. Note that the normalization parameters are all derived just from the train set, and applied for consistent processing of any test set if included. Again a prerequisite is that any train and test set must have consistently labeled columns and consistently formatted data, with the exception of any designated “ID” columns or “label” columns, which will be carved out and consistently shuffled and partitioned. Note too that we can pass these sets with a non-integer-range index or even multi-column indexes, in which case those index columns will be carved out and returned as part of the ID sets, consistently shuffled and partitioned.
  • labels_column is intended for passing string identifiers of a column that will be treated as labels. Note that some of the methods require the inclusion of labels, such as feature importance evaluation or the “label frequency levelizer” (for oversampling rows with lower frequency labels).
  • trainID_column and testID_column are intended for passing strings or lists of strings identifying columns that will be carved out before processing and consistently shuffled and partitioned.
  • valpercent1 and valpercent2 parameters are intended as floats between 0 and 1 that indicate the ratios of the data that will be carved out for the two validation sets. If shuffletrain is activated then the validation sets will be carved out randomly, else they will be taken from the bottom sequential rows of the train set. Note that these values default to 0.0.
  • shuffletrain parameter indicates whether the train set will be (can you guess?) yep you’re right the answer is shuffled.
  • TrainLabelFreqLevel parameter indicates whether the train set will have the oversampling method applied where rows with lower frequency labels are copied for more equal distribution of labels, such as might be beneficial for oversampling in the training operation.
  • powertransform parameter indicates whether default numerical column evaluation will include an inference of distribution properties to assign between z-score normalization, min-max scaling, or box-cox power law transformations. Note this one is still somewhat rough around the edges and we will continue to refine methods going forward.
  • binstransform indicates whether default z-score normalization application will include the development of bins sets identifying a point’s placement with respect to number of standard deviations from the mean.
  • MLinfill indicates whether default infill methods will predict infill for missing points using machine learning models trained on the rest of the set in generalized and automated fashion. Note that this method benefits from increased scale of data in the train set, and models derived from the train set are used for predicting values for the test set.
  • infilliterate indicates how many times the predictive methods for MLinfill will be iterated, such as may be beneficial for particularly messy data.
  • randomseed seed for randomness for all of the random seeded methods such as predictive algorithms for ML infill, feature importance, PCA, shuffling, etc
  • numbercategoryheuristic an integer indicating for categorical sets the threshold between processing with one-hot-encoding vs ordinal methods
  • pandasoutput quite simply True means returned sets are pandas dataframes, False means Numpy arrays (defaults to Numpy arrays)
  • NArw_marker indicates whether returned columns will include a derived column indicating rows that were subject to infill (can be identified with the suffix “NArw”)
  • featureselection indicates whether a feature importance evaluation will be performed (using the shuffle permutation method); note this requires the inclusion of a designated labels column in the train set. Results are presented in the returned object “featureimportance”
  • PCAn_components triggers PCA dimensionality reduction when != None. Can be a float indicating the percent of columns to retain in PCA or an integer indicating the number of columns to retain. The tool evaluates whether the set is suitable for kernel PCA, sparse PCA, or regular PCA. Alternatively, a user can assign a desired PCA method in ML_cmnd[‘PCA_type’]. Note that a value of None means no PCA dimensionality reduction will be performed unless the scale of data is below a heuristic based on the number of features. (A user can also just turn off default PCA with ML_cmnd[‘PCA_type’])
  • PCAexcl a list of any columns to be excluded from PCA transformations
  • ML_cmnd allows a user to pass parameters to the predictive algorithms used in ML infill, feature importance, and PCA (I won’t go into full detail here, although note one handy feature is we can tell the algorithm to exclude boolean columns from PCA which is useful)
  • assigncat allows a user to assign distinct columns to different processing methods, for those columns that they don’t want to defer to default automated processing. For example a user could designate columns for min-max scaling instead of z-score, or box-cox power law transform, or you know we’ve got a whole library of methods that we’re continuing to build out. These are defined in our README. Simply pass the column header string identifier to the list associated with any of these root categories.
  • assigninfill allows a user to assign distinct columns to different infill methods for missing or improperly formatted data, for those columns that they don’t want to defer to default automated infill which could be either standard infill (mean to numerical sets, most common to binary, and boolean identifier to categorical), or ML infill if it was selected.
  • transformdict and processdict allow a user to design custom trees of transformations or even custom processing functions, such as documented in our essays that no one reads. Once defined, a column can be assigned to these methods in assigncat.
  • printstatus You know, like, prints the status during operation. Self-explanatory!

Now we’ll demonstrate a few.

TrainLabelFreqLevel

Let’s take a look at TrainLabelFreqLevel, which serves to copy rows such as to (approximately) levelize the frequency of labels found in the set. First let’s look at the shape of a train set returned from an automunge application without this option selected.

OK now let’s try again with the option selected. If there was a material discrepancy in label frequency (aka a class imbalance) we should see more rows included in the returned set.
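A minimal sketch of the comparison, assuming the tiny_train set and labels column from earlier:

#without levelizing
train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud', pandasoutput=True)
print(train.shape)

#with levelizing, rows with less frequent labels are duplicated
train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud',
                 TrainLabelFreqLevel=True, pandasoutput=True)
print(train.shape)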

binstransform

binstransform just means that default numerical sets will include an additional set of bins identifying number of standard deviations from the mean. We have to be careful with this one if we don’t have a lot of data as it adds a fair bit of dimensionality. It can also be assigned to distinct columns in assigncat if you don’t want default for all numerical columns, such as for instance assigning a column to ‘nmb3’, which includes a z-score normalization and assembly of standard deviation bins. We’ll simply demonstrate defaulting to inclusion.

etc.
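In other words, a sketch like the following (parameter names per the list above):

#z-score normalization plus standard deviation bins for numerical columns
train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud',
                 binstransform=True, pandasoutput=True)

#inspect returned column headers
list(train)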

So the interpretation is that columns with a suffix including “bint” contain bins for the number of standard deviations from the mean. For example, nmbr_bint_t+01 would indicate values between the mean and +1 standard deviation.

MLinfill

MLinfill changes the default infill method from standardinfill (such as mean for numerical sets, most common for binary, and boolean marker for categorical), to a predictive method in which a machine learning model is trained for each column to predict infill based on properties of the rest of the set. This one’s pretty neat, but caution that it performs better with more data as you would expect. Please note current ML infill predictive models are random forest based, adding some more sophisticated options here is intended for a future extension.

Let’s demonstrate. First, here’s an application without MLinfill; we’ll turn on the NArw_marker option to output an identifier of rows subject to infill.

So upon inspection it looks like we had a few infill points on columns derived from column ‘dist1’ (as identified by the corresponding NArw columns), so let’s focus on that. As a reminder, the suffix ‘_nmbr’ indicates that this ‘dist1_nmbr’ column was returned from applying a z-score normalization to a column originally titled ‘dist1’. As you can see, the plug value here is just the mean, which for a z-score normalized set is 0.

columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()

Now let’s try with MLinfill:
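Here's a sketch of that call, again on the tiny_train set from earlier:

#ML infill with the infill marker columns activated
train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud',
                 MLinfill=True, NArw_marker=True, pandasoutput=True)

#inspect the rows that were subject to infill
columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()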

As you can see the method predicted a unique infill value to each row subject to infill (as identified by the NArw column). We didn’t include a lot of data with this small demonstration set, so I expect the accuracy of this method would improve with a bigger set.

pandasoutput

pandasoutput simply indicates whether to return Pandas dataframes or Numpy arrays in the returned sets (defaults to Numpy arrays, which are a more universally eligible input to the different machine learning frameworks, although I believe Pandas has some potential advantages, for instance in memory efficiency, as a dataframe can hold multiple numerical precisions across columns).

Note that if we return Numpy arrays and want to view the column headers (which remember track the steps of transformations in their suffix appenders) good news that’s available in the returned list finalcolumns_train.

print("finalcolumns_train")
finalcolumns_train

NArw_marker

The NArw_marker option helpfully outputs for each column a marker indicating which rows were subject to infill. Let’s quickly demonstrate. First, here again are the returned columns without this feature activated.

Now with NArw_marker turned on.

print("list(train)")
list(train)
etc.

If we inspect one of these we’ll see a marker for which rows were subject to infill (we actually already did this a few cells ago, but just to be complete).

columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()

featureselection

featureselection performs a feature importance evaluation with the shuffle permutation method. (Basically trains a machine learning model, and then measures impact to accuracy after randomly shuffling each feature.) Let’s try it out. Note that this method requires the inclusion of a designated labels column.
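A sketch of the call and inspection; featureimportance is returned second to last per the list of returned sets above, and the exact result structure shown is an assumption based on the 'metric' / 'metric2' entries described below.

#feature importance evaluation requires a designated labels column
*rest, featureimportance, postprocess_dict = \
    am.automunge(tiny_train, labels_column='isFraud',
                 featureselection=True, pandasoutput=True)

#print the shuffle permutation metrics for each evaluated column
for column, entry in featureimportance.items():
    print(column, entry['metric'], entry['metric2'])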

Now we can view the results from the returned feature importance dictionary like so (a future iteration of tool will improve the reporting method, for now this works):

etc.

(I’m sure the small size of this small demonstration set impacted these results.)

Note that for interpreting these, ‘metric’ represents the impact after shuffling the entire set originating from the same source feature, where a larger metric implies more importance, while ‘metric2’ is derived after shuffling all but the current column originating from the same source feature, where a smaller metric2 implies greater relative importance within that set of derived features. In case you were wondering.

PCAn_components, PCAexcl

Now if we want to apply some kind of dimensionality reduction to the returned sets, we can conduct it via Principal Component Analysis (PCA), a type of unsupervised learning. (As an asterisk, please be wary of performing this transformation in domains with potential fat-tailed distributions, just a word of caution.)

A few defaults here: PCA is automatically performed if the number of features is > 50% of the number of rows (this can be turned off via ML_cmnd), and the PCA type defaults to kernel PCA for all-non-negative sets, sparse PCA otherwise, or regular PCA if PCAn_components is passed as a percent. (All via the scikit-learn PCA library.) If there are any columns we want to exclude from PCA, we can specify them in PCAexcl. We can also pass parameters to the PCA call via ML_cmnd.

Let’s demonstrate, here we’ll reduce to four PCA derived sets, arbitrarily excluding from the transformation columns derived from ‘dist1’.
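A sketch of that call (the naming of the returned PCA-derived columns is an assumption):

#reduce to four PCA-derived columns, excluding derivations from 'dist1'
train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud',
                 PCAn_components=4, PCAexcl=['dist1'],
                 pandasoutput=True)

#inspect returned column headers
list(train)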

Note that any subsequently available data can easily be consistently prepared with postmunge (by simply passing the postprocess_dict object returned from automunge, which you did remember to save, right? If not, no worries, it's also possible to process consistently by passing a subsequently available test set along with the exact same original train set to the automunge function). I'll tell you more about postmunge below :).

Another useful method might be to exclude any boolean columns from the PCA dimensionality reduction. We can do that with ML_cmnd by passing the following:
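(The nested structure below is my recollection of the README's ML_cmnd format; treat the exact keys as assumptions.)

#exclude boolean columns from PCA via the PCA_cmnd entry
ML_cmnd = {'MLinfill_type' : 'default',
           'MLinfill_cmnd' : {'RandomForestClassifier' : {},
                              'RandomForestRegressor'  : {}},
           'PCA_type' : 'default',
           'PCA_cmnd' : {'bool_PCA_excl' : True}}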

assigncat

A really important point is that we don’t have to defer to the automated evaluation of column properties to determine processing methods; we can also assign distinct processing methods to specific columns. Now let’s try assigning a few different methods to the numerical sets. Remember we’re assigning based on the original column names, before the appended suffixes.

How about we arbitrarily assign min-max scaling to these columns:

minmax_list = ['card1', 'card2', 'card3']

And since we previously saw (in another notebook) that TransactionAmt might have some skewness based on our prior powertransform evaluation, let’s set that to ‘pwrs’, which puts it into bins based on powers of 10.

pwrs_list = ['TransactionAmt']

Let’s say we don’t feel P_emaildomain is very useful; we can just delete it with null.

null_list = ['P_emaildomain']

And if there’s a column we want to exclude from processing, we can exclude with ‘excl’. Note that any column we exclude from processing needs to be already numerically encoded if we want to use any of our predictive methods like MLinfill, feature importance, or PCA on other columns. (excl just passes data untouched, exc2 performs a modeinfill just in case some missing points are found.)

exc2_list = ['card5']

And we can leave the rest to default methods.
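Assembled into the assigncat parameter that might look like the following, where 'mnmx' as the min-max root category is my recollection of the README's naming (treat it as an assumption):

assigncat = {'mnmx' : ['card1', 'card2', 'card3'],
             'pwrs' : ['TransactionAmt'],
             'null' : ['P_emaildomain'],
             'exc2' : ['card5']}

train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud',
                 assigncat=assigncat, pandasoutput=True)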

Here’s what the resulting derivations look like:

train.head()

assigninfill

We can also assign distinct infill methods to each column. Let’s demonstrate. I remember when we were looking at MLinfill that one of our columns had a few NArw entries (rows subject to infill), so let’s try a different infill method on those. How about we try adjinfill, which carries the value from an adjacent row. Remember we’re assigning columns based on their title prior to any suffix appenders.
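A sketch of that assignment:

#apply adjacent cell infill to the 'dist1' column, per its original header
train, trainID, labels, *rest = \
    am.automunge(tiny_train, labels_column='isFraud', NArw_marker=True,
                 assigninfill={'adjinfill' : ['dist1']},
                 pandasoutput=True)

columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()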

transformdict and processdict

transformdict and processdict are for more advanced users. They allow the user to design custom compositions of transformations, or even incorporate their own custom defined transformation functions into use on the platform. I won’t go into full detail on these methods here, I documented these a bunch in the essays which I’ll link to below, but here’s a taste.

Say that we have a numerical column on which we want to apply multiple transformations. Let’s just make a few up, say that we have a set with fat tail characteristics, and we want to do multiple transformations including a box-cox power law transformation, a z-score normalization on that output, as well as a set of bins for powers of 10. Well our ‘TransactionAmt’ column might be a good candidate for that. Let’s show how.

Here we define our custom transformdict using our “family tree primitives”. Note that we always need to use at least one replacement primitive; if a column is intended to be left intact we can include an excl transform as a replacement primitive.

So let’s define our custom transformdict for a new root category we’ll call ‘cstm’.
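Something like this, with 'bxcx' as a parent (a replacement primitive with offspring) and the powers-of-ten bins and infill marker as cousins (supplements without offspring); the eight primitive names are per my reading of the README of this era:

transformdict = {'cstm' : {'parents'       : ['bxcx'],
                           'siblings'      : [],
                           'auntsuncles'   : [],
                           'cousins'       : ['pwrs', 'NArw'],
                           'children'      : [],
                           'niecesnephews' : [],
                           'coworkers'     : [],
                           'friends'       : []}}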

Note that since ‘bxcx’ is a parent category, it will look for offspring in the primitives associated with bxcx root category in the internal library, and find there a downstream ‘nmbr’ category (for z-score normalization).

Note that since we are defining a new root category, we also have to define a few parameters for it, which I’ll demonstrate here. Further detail on this step available in documentation. If you’re not sure you might want to try just copying an entry in the READ ME available on GitHub.

Note that since our custom defined category ‘cstm’ is only a root category and not included as an entry in any family tree primitives we don’t have to define an associated processing function (for the dualprocess / singleprocess / postprocess entries), we can just enter None.
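A sketch of such an entry (the exact entry keys here are my recollection of the README; treat them as assumptions):

processdict = {'cstm' : {'dualprocess'   : None,
                         'singleprocess' : None,
                         'postprocess'   : None,
                         'NArowtype'     : 'numeric',
                         'MLinfilltype'  : 'numeric',
                         'labelctgy'     : 'nmbr'}}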

We can then pass this transformdict and processdict to the automunge call and assign the intended column in assigncat. (Note that the NArw infill indicator won’t be applied to custom sets of transformations unless you specifically designate ‘NArw’ in the family tree primitives, such as for instance as a ‘cousin’.)
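Putting it together, something like:

#assign 'TransactionAmt' to the new 'cstm' root category
*rest, postprocess_dict = \
    am.automunge(tiny_train, labels_column='isFraud',
                 assigncat={'cstm' : ['TransactionAmt']},
                 transformdict=transformdict, processdict=processdict,
                 pandasoutput=True)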

etc.

And then of course a user also has the ability to define their own transformation functions to incorporate into the platform (such as may be indicated in the processdict); I’ll defer to the essays for that bit in the interest of brevity (too late).

postmunge

And the final bit, which I’ll just reiterate here, is that automunge facilitates the simplest means for consistent processing of subsequently available data with just a single function call; all you need is the postprocess_dict object returned from the original automunge call.

This even works when we passed custom transformdict entries as was the case with that postprocess_dict derived in the last example, however if you’re defining custom transformation functions for now you need to save those custom function definitions and redefine them in the new notebook when applying postmunge.

Here again is a demonstration of postmunge. Since the last postprocess_dict we returned was with our custom transformations in preceding example, the ‘TransactionAmt’ column will be processed consistently.

etc.

As you can see the returned columns are the same:

list(test)
etc.

Closing thoughts

Great, well, I certainly appreciate your attention and the opportunity to share. I suppose the next step for me is to try and home in on my entry and perhaps get on the leaderboard. That’d be cool.

Oh, before I go, if you’d like to see more, I recently published my first collection of essays, titled “From the Diaries of John Henry”, a big chunk of which included documentation through the development of Automunge. Check it out, it’s all online:

turingsquared.com

Or for more on Automunge our website and contact info is available at:

automunge.com

The Beatles — Here Comes the Sun

Albums that were referenced here or otherwise inspired this post:

Abbey Road — The Beatles


Ella and Louis — Ella Fitzgerald and Louis Armstrong


(As an Amazon Associate I earn from qualifying purchases.)
