Custom Transformations with Automunge

Because sometimes you want to drive the car yourself

Nicholas Teague
Automunge
Aug 16, 2021


For those who haven’t been following along, I’ve been using this forum to document the development of Automunge, an open source python library for tabular data preprocessing — we prepare data for machine learning. The library is intended as a resource for the full tabular learning workflow in between receipt of “tidy data” (one column per feature and one row per sample) and returned sets suitable for direct application of machine learning. A helpful way to think about it is that Automunge applies univariate data transformations to the features of a tabular data set, with those transformations fit to properties of a training set. The transformations may be applied under automation (such as numeric normalizations and categoric binarizations), may be sourced from an extensive internal library, or, as we’ll demonstrate in this essay, may even be custom defined from a very simple template. By assembling data pipelines through our platform, a user gets access to a host of useful features, such as automatic preparation of additional data on a consistent basis, statistic sampling to detect data distribution drift, auto ML derived missing data infill, and even pushbutton aggregate transformation inversions.

The purpose of this essay is to offer an introduction to a newly streamlined standard for Automunge custom transformation functions. To be a little more precise, custom transformation functions are user defined operations applied to transform the entries of a column in a tabular data set, with support for basing those operations on properties of the entries in a designated “training set”, so that additional data in a corresponding “test set” can be prepared on a consistent basis by a separate corresponding custom transformation function. By implementing custom transformations through the platform, a user can then integrate such operations within a set of transformations whose order of operations is defined by our family tree primitives, with the set potentially mixed with transforms available in our internal library. The integration of custom transformations comes with built in support for auto ML derived missing data infill and pushbutton aggregate inversions.

We actually have had the option for custom transformation functions for quite some time now — they were first detailed in this blog in our 2019 essay Automunge 1.79: An Open Source Platform for Feature Engineering (to give you an idea of how far we’ve come since, we’re currently on version 6.64, do the math :).

The new streamlined convention was rolled out in version 6.41, and serves the purpose of abstracting away almost all of the complexity of prior conventions, where by complexity I am referring to accommodating and populating various data structures passed alongside transformations. In the new convention, which we refer to in the documentation as the ‘custom_train’ convention, operations can be defined independent of those data structures, which are all managed separately in a wrapper function. A user simply defines a function that receives a training set dataframe (df), a target column (column), and a dictionary of received parameters (normalization_dict), and returns the resulting transformed dataframe (df) along with the same dictionary (normalization_dict) logging any properties from the train set needed to consistently prepare additional data. And since the function targets a column in a Pandas dataframe, it may include Pandas or Numpy operations, for instance.

def custom_train_template(df, column, normalization_dict):
    #custom transformation
    ...
    return df, normalization_dict

For additional data, what we refer to as test data, a user can either allow the same function to be applied as was applied to the training set (as may be appropriate when operations are independent of training set properties), or alternatively may define a corresponding custom transformation that transforms the test data column based on properties accessed from the corresponding training data column, as logged in the returned dictionary.

def custom_test_template(df, column, normalization_dict):
    #corresponding transformation
    ...
    return df

The custom_train convention has even streamlined the specification of a corresponding custom inversion operation, which may be defined with a similar simple template or otherwise omitted when inversion support is not needed.

def custom_inversion_template(df, returnedcolumn_list, inputcolumn, normalization_dict):
    #corresponding inversion
    ...
    return df

Having defined our custom transformation functions, we can then pass them to an automunge(.) call by way of populating two corresponding data structures, the processdict and the transformdict. These data structures are for defining properties of “transformation categories” which can then be assigned to a column to apply the transformation functions. Each of these data structures is addressed in some detail in the recent essay Data Structure.

The processdict is for defining properties associated with a “transformation category”, including what kind of received data is considered valid input, the form and structure of data returned from the transform (e.g. integers, floats, or boolean integers in one or multiple returned columns), as well as the associated transformation functions, which as used here will be our custom transformation functions defined in the ‘custom_train’ convention. (The bulk of the internal library has transformations defined in an alternate convention.) Here we demonstrate populating a processdict entry for a new transformation category we’ll refer to as ‘newt’. This string, in addition to serving as a transformation category identifier, will also be included as a suffix appender on the returned column (thus an input column with header ‘targetcolumn’ would be returned as ‘targetcolumn_newt’).

Note that if we’re not sure what processdict entries to apply for the other properties, we can just copy entries from another category in the library. Here we’ll match our other entries to the ‘nmbr’ category by way of a functionpointer entry, which will populate processdict entries corresponding to ‘nmbr’ for any entries not already specified. (Note also that if we omit the entry for ‘custom_test’ or designate it as None, the same custom_train_template will be applied to both training and test data.)

processdict = \
{'newt' :
{'custom_train' : custom_train_template,
'custom_test' : custom_test_template,
'custom_inversion' : custom_inversion_template,
'functionpointer' : 'nmbr',
}}

The transformdict is for defining sets of transformation categories to be associated with a root transformation category by populating as entries to the Automunge family tree primitives. Thus, when the root category is assigned to a column, the transformation functions associated with the transformation category entries to the family tree primitives will be applied to that column. Here we demonstrate populating a root category set for the ‘newt’ category we just specified in the processdict, which will have transformation functions applied based on the family tree entries of a ‘newt’ category to apply our custom transformation functions as well as the ‘NArw’ category to populate markers for missing data. (Using these same primitives, it is possible to define sets of transformations that include generations and branches of derivations.)

transformdict = \
{'newt' :
{'parents' : [],
'siblings' : [],
'auntsuncles' : ['newt'],
'cousins' : ['NArw'],
'children' : [],
'niecesnephews' : [],
'coworkers' : [],
'friends' : [],
}}

Now that we’ve defined transformation category properties to the new category ‘newt’ including our custom transformation functions and populated a family tree for the use of ‘newt’ as a root category, we can then assign the root category to a target column with the automunge(.) assigncat parameter. Note that when assigning the same root category to multiple target columns, the columns can be entered as a list of headers (using [list] brackets) instead of a single string value as shown here.

assigncat = \
{'newt' : 'targetcolumn'}

Note that if we want to pass parameters to our custom transformation function we can do so with an additional parameter known as assignparam.

assignparam = \
{'newt' :
{'targetcolumn' :
{'(parameter)' : '(value)'}}}
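As a sketch of how such parameters surface inside the custom function, assuming assignparam entries are received through the normalization_dict as described for the custom_train convention (the ‘multiplier’ parameter name and function name here are purely illustrative, not part of the library):

```python
import pandas as pd

def custom_scaling_train(df, column, normalization_dict):
    # an assignparam entry such as {'multiplier': 2} would be
    # accessible in the received normalization_dict
    # ('multiplier' is an illustrative parameter name)
    multiplier = normalization_dict.get('multiplier', 1)
    df[column] = df[column] * multiplier
    return df, normalization_dict
```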

Putting it all together for an automunge(.) call to prepare a training data set df_train would look something like this.

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(
df_train,
assigncat = assigncat,
assignparam = assignparam,
processdict = processdict,
transformdict = transformdict)

Having populated our postprocess_dict through the automunge call which logs all of the steps and parameters of transformations (you did remember to save it, right?), we can then prepare additional df_test data in a pushbutton operation with a postmunge(.) call which will apply transformations and imputations to corresponding data on a consistent basis. Note that we’ll need to initialize our custom functions again if this is taking place in a separate notebook.

test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(
postprocess_dict,
df_test)

Similarly, we can use the postprocess_dict returned from automunge(.) to invert transformations. This is shown here for the test set that was just prepared (test); inversions could also be performed separately to invert predictions after an inference operation and recover the original form of labels.

df_invert, recovered_list, \
inversion_info_dict = \
am.postmunge(
postprocess_dict,
test,
inversion='test')

You now have all you need to define custom transformations for integration into an Automunge data pipeline. Remember, with great power comes great responsibility. Don’t forget to have fun.

Here Comes the Sun — The Beatles

Appendix A — Custom Train Template

In this example, we’ll define a custom_train_template that applies an operation similar to z-score normalization.
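A minimal sketch of such a function, assuming the target column contains numeric entries (the property names logged to normalization_dict are illustrative choices, and edge cases like constant columns are handled only crudely here):

```python
import pandas as pd

def custom_train_template(df, column, normalization_dict):
    # derive properties from the training set entries
    mean = df[column].mean()
    std = df[column].std()
    # crude guard against division by zero for constant columns
    if std == 0:
        std = 1
    # log the properties needed to consistently prepare additional data
    normalization_dict.update({'mean': mean, 'std': std})
    # apply the normalization to the target column
    df[column] = (df[column] - mean) / std
    return df, normalization_dict
```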

Appendix B — Custom Test Template

This is an example of a corresponding operation using properties derived from the train data to conduct a form of z-score normalization on a consistent basis.
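A corresponding sketch, assuming the ‘mean’ and ‘std’ entries were logged to normalization_dict by the train function above (those key names are illustrative):

```python
import pandas as pd

def custom_test_template(df, column, normalization_dict):
    # access properties derived from the corresponding training data
    mean = normalization_dict['mean']
    std = normalization_dict['std']
    # apply the transformation on a basis consistent with the train set
    df[column] = (df[column] - mean) / std
    return df
```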

Appendix C — Custom Inversion Template

And here is an example of a corresponding inversion operation for the same operation.
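A sketch of the inversion, again assuming the ‘mean’ and ‘std’ entries from the train function, and assuming a single returned column whose header is the first entry of the received returnedcolumn_list:

```python
import pandas as pd

def custom_inversion_template(df, returnedcolumn_list, inputcolumn, normalization_dict):
    # access the training set properties logged during the forward pass
    mean = normalization_dict['mean']
    std = normalization_dict['std']
    # this transform returns a single column, found at the head of the list
    returnedcolumn = returnedcolumn_list[0]
    # recover the original form of the input column
    df[inputcolumn] = df[returnedcolumn] * std + mean
    return df
```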

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

* Patent Pending
