Automunge Explained (in depth)

In case you were, like, really wondering

Nicholas Teague

Published in

Automunge

21 min read · Feb 15, 2020

Automunge promo (in depth) video, transcript follows

Transcript

Hello and thank you for stopping by. This video will serve as an introduction and explainer to the Automunge platform for tabular data feature engineering. We expect that given some of the fundamental territory that Automunge covers in the machine learning workflow, this tutorial may also be useful as an educational resource for those looking to gain some vocabulary and insight for the basics of machine learning practice. Oh and as a heads up, our graphic design team really went all out on the special effects in this video, so please prepare to be wowed.

But first, before we get to all of the special effects, here is a kind of Reader’s Digest summary. Automunge is an open source python library available now for pip install. As a python library, it is intended for use by data scientists comfortable with working at the coding layer, such as in the context of Jupyter notebooks, without the support of a graphical user interface. Not to worry, it’s very easy to use! By passing the automunge(.) function a Numpy array or Pandas dataframe, the function automatically converts to numerical encoding and infills missing values, thus giving a user means to pass raw data directly to a machine learning algorithm.

But a user doesn’t have to defer to automation. The tool also serves as a platform for feature engineering, which, for those unfamiliar with the term “feature engineering”, just means data transformations intended to make some stream more easily digestible by machine learning. We have a whole library of data transformations available to be assigned, and a user may also pass their own defined transformations within the platform. Oh and we’ll get to this in a bit, we also have some pretty nifty methods to predict infill to missing or improperly formatted points by making use of machine learning, what we call ML infill. And so much more!

All of this functionality is available through the application of two simple functions: automunge(.) for the initial preparation of tabular data for machine learning, and postmunge(.) for consistently preparing additional data. In fact an important point to keep in mind as we progress in this video is that most of what we’re going to demonstrate here is available by activating just a single parameter in one of these functions — it’s all push-button and automatic.

The workflow of preparing data with Automunge is pretty easy to visualize. After importing and initializing the class, such as in a Jupyter notebook, we simply pass a Numpy array or Pandas dataframe of the tabular data we intend to use to train a machine learning algorithm, what we’ll call our “train” set, and optionally, if available, a consistently formatted set that we intend to use to generate predictions from the algorithm, what we’ll call our “test” set.

The automunge(.) function returns a series of sets including prepared training data, validation data, and test data — each further segregated by the training features, labels, and index columns — and by prepared meaning numerically encoded, with infill to missing values, such as to be suitable for direct application of machine learning. Importantly, the automunge(.) function also returns a python dictionary which serves as a key for consistent processing of additional data.

Once we have that key, we can then pass any additional data to the postmunge(.) function to return consistently prepared data — with no external database required, everything we need to consistently prepare data is embedded in the returned dictionary.
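For readers who like to see the shape of this in code, here is a minimal sketch of the two-function pattern in plain Python. It is not the Automunge API: the names `toy_automunge` and `toy_postmunge`, the dictionary layout, and the choice of a z-score with mean infill are all assumptions made for illustration.

```python
import statistics

def toy_automunge(train_columns):
    """Prepare numeric training columns and return them with a dictionary 'key'.

    Hypothetical stand-in for the workflow described above; None marks a missing value.
    """
    key, prepared = {}, {}
    for column, values in train_columns.items():
        present = [v for v in values if v is not None]
        mean = statistics.mean(present)
        std = statistics.pstdev(present) or 1.0  # guard against a zero spread
        key[column] = {"mean": mean, "std": std}
        prepared[column] = [((mean if v is None else v) - mean) / std for v in values]
    return prepared, key

def toy_postmunge(key, new_columns):
    """Prepare additional data using only the properties stored in the key."""
    prepared = {}
    for column, params in key.items():
        mean, std = params["mean"], params["std"]
        prepared[column] = [((mean if v is None else v) - mean) / std
                            for v in new_columns[column]]
    return prepared

train = {"weight": [4.0, 6.0, None, 8.0]}
test = {"weight": [6.0, None, 10.0]}

train_prepared, key = toy_automunge(train)
test_prepared = toy_postmunge(key, test)  # no second look at the training data needed
```

The point of the sketch is that `toy_postmunge` never touches the training data again; everything it needs travels in `key`.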

Ok let’s back up for a second and talk about the kind of data that we are passing to the functions. One way to categorize data sets could be the shape describing the number of dimensions of data point entries. For example a scalar would just be a single value, such as the number one, and thus it is just one entry with no rows or columns to count.

If we then assembled a list of scalars, this could be described as a vector, such as a set of measurements for a single variable, and the shape would just be the number of measurements, here we show three measurements.

If we then aggregate multiple vectors into another list, such as a list of lists, that can be described as a matrix, with the shape having two entries for number of rows and number of columns. Oh and if we continue to embed lists within lists of lists, these types of higher order aggregations are known as tensors. One example of a tensor dataset could be a collection of images, each with their own aggregation of RGB pixel values.
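To make the vector, matrix, and tensor shapes concrete, here is a small sketch using nested Python lists. The `shape` helper is a hypothetical utility written for this illustration, and it assumes rectangular, non-empty lists.

```python
def shape(data):
    """Return the shape of a rectangular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)

vector = [1, 2, 3]                      # three measurements of one variable
matrix = [[1, 2, 3], [4, 5, 6]]         # two rows, three columns (tabular data)
image = [[[0, 0, 0], [255, 255, 255]],  # a 2x2 image, each pixel an RGB triple
         [[255, 0, 0], [0, 0, 255]]]

vector_shape = shape(vector)
matrix_shape = shape(matrix)
image_shape = shape(image)
```

The matrix case, with one entry for rows and one for columns, is the one that corresponds to tabular data.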

But let’s not get distracted by higher order tensors, because it turns out that the matrix style of data, also known as “tabular data” or “structured data”, is a very common grouping, and in fact one way to think about Automunge is that it is a tool explicitly intended as a means for preparing tabular data for machine learning.

So let’s take a closer look at a tabular data set for a machine learning project. Again by tabular we just mean a table of data with rows and columns, such as data found in an excel spreadsheet, a Pandas dataframe, or a Numpy array for instance. One of the only pre-requisites for Automunge is that data be received in a “tidy” form, which just means a single column per feature and a single row per observation. This example table represents a set of “features” (or properties) of small animals which we would like to apply machine learning for predicting classifications of whether that animal is a dog or a cat. In order to train a machine learning model, we’ll need a set of features and corresponding set of labels which we will use to algorithmically infer relationships between the features and labels. But in order to train the machine learning model, we’ll first need to perform some preparations — such as numerical encoding, normalizations, infill to missing points — basically all that stuff that Automunge is for.

In practice, once we have our training data prepared, we’ll often segregate into two or more sets, one for use to train a machine learning model, and the other to validate that machine learning model and tune hyperparameters. Once we have trained our machine learning model, we can then use it to predict classifications of unlabeled data, demonstrated here for what we’re calling our “test data”. An important point to keep in mind is that any data transformations that are applied to the training and validation data will need to be consistently applied to the test data to run the inference operation. But before getting into that, let’s take a closer look at the workflow associated with training a machine learning model.

Once we have our tabular data set prepared for training with Automunge, we can then feed our training data sets of features and corresponding labels to train a machine learning model — which could be any of the multiple options to choose from for tabular data such as neural networks, decision tree methods, etc.

Once we have trained a machine learning model, we can then use the validation set that we had carved out from the training data to evaluate the accuracy of that model. Such validations can be used to iteratively tune the parameters of the training operation to improve model accuracy. Oh and not shown here it may be appropriate to carve out a second validation set, one not used in the hyperparameter tuning, to estimate accuracy of the final model.

Having trained a model, we now have the ability to pass to that model unlabeled data to infer label predictions. Of course the model operates on the assumption that this unlabeled data is consistent in form and distribution with the training data used to train the model, so any data preparations that were applied to the training data with the automunge(.) function must now be consistently applied to the test data with the postmunge(.) function.

The application of the automunge(.) and postmunge(.) functions in this workflow is now depicted. Automunge, here abbreviated as “am”, receives a raw data set which may not yet be ready for machine learning, such as for instance not fully numerically encoded or having missing data points requiring infill. Through the application of automunge(.) the data is segregated into training and validation sets, each consistently prepared with feature engineering transformations and infill for ingest by the machine learning training operation.

Just as how the machine learning training operation returns a trained model that is used to infer predictions, the application of the automunge(.) function returns a “trained” dictionary capturing all of the steps and parameters of transformations that can then be used to consistently prepare additional data with the postmunge(.) function, shown here consistently converting additional raw data to a test set for use to feed the trained model and infer predictions.

So let’s talk about what we mean by preparing data for machine learning. It turns out that mainstream machine learning libraries have a few prerequisites of data for training a model. First the data will need to be numerically encoded, meaning raw numbers with no units or strings in the representation. There are some libraries that allow a user to designate categorical sets for internal representation of passed strings, but that extra intervention is not required as long as data is directly served in numerical encodings.

Another common type of feature engineering involves the normalization of numerical sets — think of this as like scaling and centering the set of values within a designated range. Now not all machine learning algorithms uniformly benefit from this step, for example neural networks have a more pronounced benefit from normalization than decision tree methods for instance. There are some types of algorithms which require all-positive values in which case a normalization may be a firm prerequisite. Since the application of normalization here is pushbutton and automatic we recommend just defaulting to normalization of numeric sets.

Another prerequisite for training is that the data sets must contain all valid entries. It is common that real world data sets may have instances of missing or improperly formatted entries. In the Automunge library every transformation function includes a default method for infill and alternatively a user may also designate specific means of infill to distinct columns.

The prerequisite shown here of data pipelines simply means that whatever transformations we apply to our training data, we will need to consistently apply to subsequent data used in the inference operation to generate predictions. The Automunge library makes this as simple as can be, as subsequent consistent processing is achieved with just a single function call to postmunge(.) using the returned dictionary from automunge(.) capturing the steps and parameters of transformations.

Another point of value that we’ll briefly highlight is that in some cases a training operation may benefit by presenting our feature sets to the algorithms in multiple configurations, such as with varying information content. This might not be as important when a user has access to infinite training data, but when faced with limited data, presenting feature sets to the machine learning algorithm in multiple configurations may facilitate more efficient extraction of properties for deriving relationships between training data features and labels.

When we talk about performing feature engineering transformations to the columns of our data sets, it’s worth highlighting that there are a few ways to go about this. Although some transformations can be performed independent of the data set contents, most transforms require first extracting some properties from the data to use as a basis for transformations. As an example, for a numerical set we may be basing normalizations on a set’s mean and standard deviation, or as another example we may base the methods for a categorical column on the set of values found in the data.

For the logistics of extracting these properties we have a few options for how to handle the different data sets, such as for distinctions between training / validation / or test data sets. One way would be to just handle each set separately, such as to extract separately between training data and validation data and simply apply transforms on those two sets on each distinct basis. The problem with this method is that this may result in an inconsistency of transforms — for example the mean of the values found in the training set may be different from the mean of the values found in the validation set.

Another way to go about extracting properties for the basis of transformations could be to lump all of the sets together in evaluation. For example we could evaluate the mean of a column including both the training and validation data in aggregate. It turns out this approach also has a problem in that it can lead to what is called data leakage, by which we mean that when we go to train our model, some of the properties of the validation set will have leaked into the training set, which means our validation assessment could overstate the model’s accuracy score.

Automunge offers a solution to both of these problems by basing the transformations for all sets on the original properties derived from the training data. By using the same basis for all sets we are ensured of consistency of transforms and by not including any of the validation data for the derivation of those properties we are ensured that no data leakage will take place between the training and validation operations.
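The three options can be compared side by side with a few lines of plain Python. This is a sketch with made-up numbers, using simple mean centering as the transformation; it only illustrates the reasoning and is not taken from the library.

```python
import statistics

train = [2.0, 4.0, 6.0]
validation = [8.0, 10.0]

def center(values, mean):
    """Subtract a previously derived mean from every value."""
    return [v - mean for v in values]

# Option 1: each set on its own basis, so the same raw value is shifted differently.
separate_means = (statistics.mean(train), statistics.mean(validation))

# Option 2: lumped basis, so validation values influence how training data is prepared.
lumped_mean = statistics.mean(train + validation)

# Option 3: derive the basis from training data only and reuse it everywhere.
train_mean = statistics.mean(train)
train_centered = center(train, train_mean)
validation_centered = center(validation, train_mean)
```

With option 3 a raw value of 8.0 always maps to 4.0, whichever set it arrives in, and nothing about the validation set was consulted to get there.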

Oh and as an added bonus of this method, by removing the need to assess properties of data prior to transformations for the test data, we end up with a much more efficient means of preparing that data, which may translate to energy efficiency or even reduced carbon intensity for processing streams of data at scale.

The Automunge platform includes a built-in library of feature engineering transformations, fully documented in our README on GitHub. We’ve found it helpful to organize the transformations based on the intended data types of target columns, such that we’ve aggregated between categories of numerical target columns, sequential target columns (such as for time series data), categorical target columns, and some really neat stuff for date-time data.

Each of these sets then has a few sub-aggregations of categories. For example, with numerical data we have several types of normalizations available, such as z-score normalization, mean scaling, min-max scaling, and a few variations upon each. For sequential data we have a few methods of taking first, second, or even higher order derivatives, such as may be useful for bounding cumulative data for instance. For categorical sets we have several methods of encoding, including one-hot encoding, ordinal encoding, and a really neat binary encoding which is the default. Our date-time methods allow a user to segregate entries by time scale and apply sin and cos transformations to address periodicity, as well as aggregating bins for various traditional boundaries such as business hours, weekdays, and holidays.
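As one illustration of the date-time ideas, here is a sketch of a sin and cos encoding for the hour of day, plus a simple business-hours bin. The function names and the 9:00 to 17:00, Monday to Friday boundary are assumptions for this example and are not the library’s definitions.

```python
import math

def encode_hour(hour):
    """Map an hour of day (0-23) onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def is_business_hour(hour, weekday):
    """Hypothetical bin: 9:00-17:00 on Monday (0) through Friday (4)."""
    return int(weekday < 5 and 9 <= hour < 17)

midnight = encode_hour(0)
late_evening = encode_hour(23)
noon = encode_hour(12)
```

The periodic encoding places 23:00 right next to midnight, which a raw 0 to 23 integer would not.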

Oh and I’d be remiss not to highlight that categorical sets can also be further processed with string parsing methods, such as to extract numerical portions of entries, or even identify character grouping overlaps between entries. This is a very very useful part of the tool for automating what might otherwise require manual extractions of data.
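A rough sketch of the first kind of string parsing, extracting a numerical portion from an entry, could look like the following. The regular expression and the `extract_number` helper are assumptions for illustration; the library’s own parsing rules may differ.

```python
import re

def extract_number(entry):
    """Return the first numeric portion of a string as a float, or None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", entry)
    return float(match.group()) if match else None

entries = ["12 kg", "weight: 7.5", "unknown"]
parsed = [extract_number(e) for e in entries]
```

Entries with no numeric portion come back as None, which leaves them available for whatever infill method is chosen.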

Let’s give a few examples of what these data transformations are actually accomplishing. Shown here is a simple data set consisting of three source columns of what we’re calling raw data, each with three entries. The first red column is a numerical set, which could be populated with floats or integers for instance. The second white column is categorical, shown with string entries of three colors to demonstrate. And then the third column is another categorical set, this time with only two distinct values in the set.

First for the numerical set we’ll demonstrate two different types of normalizations. The four character strings of ‘nmbr’ and ‘mnmx’ are how the transformations are accessed when assigning transformations to distinct columns. The README documentation contains a full catalog of the available transforms, each with their own four character string designation. Here, ‘nmbr’ refers to a z-score normalization procedure, in which data is centered around 0 and scaled by the standard deviation. Another kind of normalization available is for min/max scaling, with the ‘mnmx’ transform, which scales data within range 0 to 1 based on the maximum and minimum values found within the set. This one might be useful for cases where we want to ensure that all of our returned values are non-negative for instance. These are just two examples; we have a whole catalog of various normalization methods along these lines.
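The arithmetic behind these two normalizations is short enough to write out. This is a sketch in plain Python rather than the library’s implementation; the use of the population standard deviation is an assumption, and the variable names only echo the ‘nmbr’ and ‘mnmx’ labels.

```python
import statistics

def z_score(train_values, values):
    """Center on the training mean and scale by the training standard deviation."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # population std is an assumption here
    return [(v - mean) / std for v in values]

def min_max(train_values, values):
    """Scale so the training minimum maps to 0 and the training maximum to 1."""
    low, high = min(train_values), max(train_values)
    return [(v - low) / (high - low) for v in values]

raw = [1.0, 2.0, 3.0]
nmbr_like = z_score(raw, raw)   # centered around 0
mnmx_like = min_max(raw, raw)   # within the range 0 to 1
```

Both functions take the training values as a separate argument so the same basis can be reused on later data.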

For categorical sets, we’ll demonstrate here a few of the methods available. The ‘text’ transform is what we call our one-hot encoding, which returns a distinct column for each category found in the set, here one column each for the entries of red, white, and blue. If we want to encode a categorical set in a single returned column, we could use the ‘ordl’ transform which is an ordinal, or integer encoding. And then somewhere in the middle is the ‘1010’ transform, a binary encoding that allows multiple column activations per entry, and so needs fewer columns than one-hot encoding for a reduced memory bandwidth.
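Here is a sketch of the three encodings on the red, white, and blue example. The alphabetical ordering of categories and the particular bit patterns are assumptions for illustration; the library may assign its codes differently.

```python
def one_hot(categories, value):
    """One column per category, with a single activation."""
    return [int(value == c) for c in categories]

def ordinal(categories, value):
    """A single column holding an integer code."""
    return categories.index(value)

def binary(categories, value):
    """The integer code written in base 2, one column per bit."""
    width = max(1, (len(categories) - 1).bit_length())
    code = categories.index(value)
    return [int(bit) for bit in format(code, f"0{width}b")]

column = ["red", "white", "blue"]
categories = sorted(set(column))  # ['blue', 'red', 'white'], ordering is an assumption

text_like = [one_hot(categories, v) for v in column]
ordl_like = [ordinal(categories, v) for v in column]
binary_like = [binary(categories, v) for v in column]
```

Three categories take three columns as one-hot, one column as ordinal, and two columns in the binary form, which is where the savings come from as the number of categories grows.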

We also have a second kind of binary transform intended for cases where the source column only has two distinct entries, designated the ‘bnry’ transform, which encodes to a single column of 0’s and 1’s. Oh, and it’s worth noting that these four character strings which designate the different types of transformations are also generally appended to the column header titles as suffixes, such that the column header titles of the returned data sets log the steps of transformations by way of suffix appenders to these header titles.

Although the basis of the tool is that transformations must all originate from a single source column, there’s no need to limit our transformations to a single application for each source column. Using a set of what we call our “family tree” primitives (more on that in the documentation), it’s possible to specify generations and branches of multiple transformations, such as for instance if we wanted to present our features to the machine learning algorithms in multiple configurations of varying information content. Here’s an example of a numerical set with three different returned configurations to be presented to the training operation: first a version with min/max scaling, and then the other two configurations derived by a power law transform followed by a z-score normalization and a set of bins based on number of standard deviations from the mean. Of course if we have infinite training data there may not be much benefit to presenting feature sets in multiple configurations, but for cases where we have limited training data these types of operations may in some cases make the extraction of properties in our training operation more efficient.

Another example of multiple tiers of transformations originating from a single column is given here for a source column of a sequential data stream, such as some time series data or a cumulative counter. In this demonstration the data stream is presented to the machine learning algorithms in three configurations, the first as the raw data with a simple z-score normalization applied, the second derived by taking a dx/dt derivative value based on some time step followed by a z-score normalization — thus returning a normalized calculation of velocity, and the third returned set derived by applying a second derivative operation after the first, again followed by a z-score normalization, thus providing some estimate of acceleration of our data stream.
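The velocity and acceleration idea can be sketched with finite differences. This is an illustration under simple assumptions (evenly spaced samples, a made-up cumulative stream); each difference shortens the series by one, and how the leading rows would be handled is left out here.

```python
import statistics

def difference(values, step=1.0):
    """Finite-difference estimate of dx/dt for evenly spaced samples."""
    return [(b - a) / step for a, b in zip(values, values[1:])]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

counter = [0.0, 1.0, 8.0, 27.0, 64.0]   # a cumulative stream, here x = t**3

velocity = difference(counter)           # first derivative
acceleration = difference(velocity)      # second derivative

raw_normalized = z_score(counter)
velocity_normalized = z_score(velocity)
acceleration_normalized = z_score(acceleration)
```

The three normalized lists correspond to the three configurations described above: the stream itself, its rate of change, and the rate of change of that rate.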

These demonstrations have been for cases where a user assigns sets of transformations to distinct columns. It’s also important to keep in mind that alternatively, preparing our data for machine learning with automunge(.) is possible even with full automation. For cases where a user does not assign distinct methods of transformation categories to a column, the algorithm assigns its own transformations, based on an evaluation of data set properties to infer appropriate means of numerical encoding and infill. Thus it is possible for a user to pass raw data and the algorithm automatically returns numerically encoded data with infilled missing values, such as to meet the minimum requirements for the direct application of machine learning.

The feature engineering transformations are not the only automated methods to prepare the data. Another common obstacle for real world data sets is that our data streams may have instances of missing or improperly formatted data points. Consider here a numerical set in which can be found a “NaN” point, which stands for “not a number”, or perhaps may contain instances of string entries which are inconsistent with the desired numerical formatting. Traditionally, a data scientist has several options for how to infill missing points, which may include replacing those points with values derived from the set, such as mean, median, or mode, or maybe replacing missing points with an arbitrary value such as 0 or 1. Another method could be to just copy the value from an adjacent cell. The Automunge platform allows a user to choose any one of these methods for assignment to distinct columns, or a user may just defer to automation. Under automation, each category of transformation has its own default infill method. Another option for the automated infill of missing values is a really neat method that we refer to as “ML infill”, which basically means that infill values are predicted using machine learning models trained on the rest of the set in a fully automated fashion.
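The traditional infill options listed above are simple enough to sketch in a single function. The `infill` helper and its method names are hypothetical, and None stands in for a missing or improperly formatted entry.

```python
import statistics

def infill(values, method, constant=0.0):
    """Replace None entries using a simple, column-level rule."""
    present = [v for v in values if v is not None]
    if method == "adjacent":
        # Copy the most recent valid value; before any is seen, use the first valid one.
        filled, last = [], present[0]
        for v in values:
            last = last if v is None else v
            filled.append(last)
        return filled
    if method == "mean":
        fill = statistics.mean(present)
    elif method == "median":
        fill = statistics.median(present)
    elif method == "mode":
        fill = statistics.mode(present)
    elif method == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

column = [1.0, None, 3.0, 3.0, None, 8.0]
mean_filled = infill(column, "mean")
mode_filled = infill(column, "mode")
adjacent_filled = infill(column, "adjacent")
```

Note that the fill value is computed only from the entries that are present, so the missing points do not distort the statistic.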

We’ll go into a little more detail of the ML infill methods here. The data set on the left represents our raw data, segregated between purple shading for features and orange for labels, which may include instances of missing or improperly formatted values, here designated by the red x’s. The ML infill method takes this data, sets aside the original labels, and then from the feature set partitions into subsets which serve as training data and labels to train a column specific predictive model, as well as feature sets corresponding to the missing points which are applied to that trained model to predict infill, which is then inserted in place of the missing values.
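The partitioning described here can be sketched with a deliberately tiny model. As a stand-in for whatever predictive model the library trains, this example fits an ordinary least squares line from a single other feature; the `ml_infill` and `fit_line` names, the data, and the choice of model are all assumptions for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def ml_infill(target, feature):
    """Predict missing entries of `target` from another feature column."""
    # Rows with a valid target serve as training features and labels for the column model.
    known = [(x, y) for x, y in zip(feature, target) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    # Rows with a missing target supply the features that are fed to the trained model.
    return [slope * x + intercept if y is None else y
            for x, y in zip(feature, target)]

height = [20.0, 30.0, 40.0, 50.0]
weight = [4.0, None, 8.0, 10.0]   # one missing point to infill
weight_filled = ml_infill(weight, height)
```

The project’s actual labels (dog or cat, in the running example) play no part; the column being infilled temporarily acts as the label for its own small model.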

Another really useful built-in method that we want to highlight here is for purposes of evaluating drift of data set properties between the original data used to train a machine learning model which was prepared with the automunge(.) function and subsequent data used in inference which was prepared with the postmunge(.) function. We actually have by default that when data streams are prepared in the automunge(.) function, distribution properties are derived for each source column. This includes distribution properties evaluated based on the raw data that was passed, as well as those properties evaluated along with each transformation function applied. Then, when it comes time to prepare additional data with the postmunge(.) function, by activating a simple parameter those same properties are evaluated to generate a comparison report. This type of analysis may be useful for instance to determine when it is time to retrain a model based on drift of data set properties.
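In spirit, a drift comparison amounts to storing a few distribution properties at preparation time and re-evaluating them on new data. The sketch below tracks only mean and standard deviation, and the `column_stats` and `drift_report` names and report layout are hypothetical rather than the library’s output format.

```python
import statistics

def column_stats(values):
    """Distribution properties to store when the training data is prepared."""
    return {"mean": statistics.mean(values), "std": statistics.pstdev(values)}

def drift_report(train_stats, new_values):
    """Compare stored training statistics with those of newly received data."""
    new_stats = column_stats(new_values)
    return {name: {"train": train_stats[name],
                   "new": new_stats[name],
                   "delta": new_stats[name] - train_stats[name]}
            for name in train_stats}

train_stats = column_stats([10.0, 12.0, 14.0])          # stored at preparation time
report = drift_report(train_stats, [13.0, 15.0, 17.0])  # evaluated on later data
```

Here the spread is unchanged but the mean has shifted by 3, the kind of signal that might prompt a look at retraining.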

Continuing our exploration of additional push-button methods built into the library, another really neat method is available for evaluation of feature importance by making use of what is known as the shuffle permutation method. In this method, the raw data is prepared with the automunge(.) function, and then is used to train a model with an evaluated accuracy metric. The feature importance metrics are then derived by, for each source column, randomly shuffling the data, such as to dampen the accuracy of the inference operation. Then, by comparing the original model accuracy to the inference accuracy after the shuffle operation, we can derive a metric which gauges the importance of the feature that was shuffled. Automunge actually has expanded on the traditional shuffle permutation method by incorporating a second importance metric, where the first metric measures the importance of the original source column, and the second metric measures the relative importance of each feature derived from the same source column.
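A bare-bones version of the traditional shuffle permutation method is sketched below. It covers only the first metric described above (importance of a column), uses a hand-written rule as a stand-in for a trained model, and averages over several shuffles; all names and data are made up for illustration.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, repeats=20, seed=0):
    """Mean drop in accuracy after randomly shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / repeats)
    return importances

def model(row):
    """Toy stand-in for a trained model: it only looks at the first feature."""
    return int(row[0] > 0.5)

rows = [[0.1, 5], [0.9, 3], [0.2, 8], [0.8, 1], [0.3, 7], [0.7, 2], [0.4, 6], [0.6, 4]]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
importances = permutation_importance(model, rows, labels)
```

Shuffling the second column changes nothing because the toy model never reads it, so its importance comes out as zero, while shuffling the first column hurts accuracy.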

There’s more. Another useful push-button method built into the library is for purposes of dimensionality reduction of our training data. We actually have two options here, the first making use of Principal Component Analysis, or PCA, which is a type of unsupervised learning in which the columns are collectively transformed such as by aggregating to a reduced number of axes. And of course the subsequent application of the trained PCA model to data is automatically performed in the postmunge(.) function.

A second type of dimensionality reduction available in the library makes use of the feature importance evaluation. Here the feature importance metric is evaluated for the set of columns and dimensionality reduction is performed by simply dropping a designated range of the low scores. This may prove useful in cases where a user is uncertain of the applicability of various information streams to a problem. However, a user should be cautioned that in cases of highly redundant data, single column scores may be impacted.

Another neat push-button method is for purposes of label engineering. Label smoothing refers to the practice of updating the activation values of a one-hot encoded label set to a reduced threshold for activations and an increased threshold for nulls. As shown here, we’ve set an activation metric of 0.9 and then the null values are derived from the number of remaining categories. Label smoothing serves the purpose of allowing a training operation to account for the fact that there may be some improperly designated values, or noise, in the labels, and so has the effect of allowing the model to recognize that there is some uncertainty associated with the training data.
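The arithmetic of traditional label smoothing is brief. In this sketch the activation is 0.9 as in the example above, and the remaining 0.1 is spread evenly over the other categories; the function name is made up for illustration.

```python
def smooth_labels(one_hot_rows, activation=0.9):
    """Replace 1s with `activation` and spread the remainder evenly over the 0s."""
    smoothed = []
    for row in one_hot_rows:
        null = (1.0 - activation) / (len(row) - 1)
        smoothed.append([activation if v == 1 else null for v in row])
    return smoothed

labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
smoothed = smooth_labels(labels)
```

With three categories each null becomes 0.05, so every row still sums to 1.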

Automunge actually has extended the traditional label smoothing operation to add a little more intelligence to the null activation values, a method we refer to as fitted label smoothing. Here we see that the null activations in a fitted smoothing preparation are tailored to the number of activations associated with each column. Although not an extensively tested hypothesis, the expectation is that this method may offer some further improvement to probabilistic calibrations.
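One plausible reading of this description, offered strictly as an interpretation and not as the library’s confirmed formula, is that the leftover probability is shared among the null columns in proportion to how often each of those columns is activated in the training labels. A sketch under that assumption:

```python
def fitted_smooth_labels(one_hot_rows, activation=0.9):
    """Share the leftover probability among nulls in proportion to class frequency.

    This weighting is an assumed interpretation of 'fitted' smoothing.
    """
    n_classes = len(one_hot_rows[0])
    counts = [sum(row[j] for row in one_hot_rows) for j in range(n_classes)]
    smoothed = []
    for row in one_hot_rows:
        active = row.index(1)
        others = sum(counts) - counts[active]  # activations in the remaining columns
        smoothed.append([activation if j == active
                         else (1.0 - activation) * counts[j] / others
                         for j in range(n_classes)])
    return smoothed

labels = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
fitted = fitted_smooth_labels(labels)
```

Under this reading, a row labeled with the first class assigns more of its leftover 0.1 to the common second class than to the rare third class.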

Continuing our exploration of the data set preparation options, yet another push-button method worth highlighting is for purposes of preparing data sets for oversampling during the training operation. Oversampling is a method to address cases of label set class imbalance, which just refers to when we have an unequal distribution of the different entries in a label set. Shown here is an example where we have four 1’s and two 0’s — this is an example of class imbalance where the 0 values are under-represented, which may dampen our model’s extraction of properties from those 0 labels.

When it comes time to train our model, a simple solution to this problem is available by identifying the under-represented labels and performing a copy and paste operation, such as to append additional copies of the underrepresented labels in order to better balance the examples. What’s really cool is that Automunge doesn’t just offer this method for categorical sets. Numerical label sets, such as for a regression problem, can also be treated with an oversampling preparation by aggregating bins from the data, such as for instance the number of standard deviations from the mean.
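The copy and paste operation can be sketched as follows, using the four 1’s and two 0’s example. Balancing every class up to the size of the largest one is an assumption made here, as is the `oversample` name; for a numerical label set, the same function could be applied after first mapping each label to a bin identifier.

```python
from collections import Counter

def oversample(rows, labels):
    """Duplicate rows of under-represented labels until each class matches the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        members = [r for r, y in zip(rows, labels) if y == label]
        for i in range(target - count):
            out_rows.append(members[i % len(members)])  # cycle through existing rows
            out_labels.append(label)
    return out_rows, out_labels

rows = ["a", "b", "c", "d", "e", "f"]
labels = [1, 1, 1, 1, 0, 0]
balanced_rows, balanced_labels = oversample(rows, labels)
```

The two rows labeled 0 are each appended once more, giving four examples of each class.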

At its core, the Automunge platform consists of two simple functions. The automunge(.) function accepts Numpy arrays or Pandas dataframes, and prepares raw data to a form suitable for the direct application of machine learning. A user may apply sets of data transformation functions from our library to distinct columns for purposes of feature engineering, or may alternatively defer to automation, allowing the algorithm to infer appropriate means of normalizations, numerical encodings, and infill. The function returns these processed data sets along with a simple python dictionary capturing all of the steps and parameters of transformations. This returned dictionary can then be passed to the postmunge(.) function along with any subsequently available data to easily, efficiently, and consistently prepare data for training or inference. Think of Automunge as a method intended for integration into the workflow in the step immediately preceding the application of machine learning to tabular data. And when you try it out, please be sure to let us know!

Oh and don’t worry we didn’t forget that we promised some special effects for this video. Are you ready? Here they come.