Automunge Explained (in brief)

In case you were wondering

Nicholas Teague
Feb 14 · 6 min read

Automunge promo video, transcript follows

Transcript:


Automunge is an open source python library that automatically performs data normalizations, numerical encodings, and infill to missing points — transforming raw data into a form suitable for the direct application of machine learning.

The two functions of the library are automunge(.) for the initial preparation of training data and postmunge(.) for subsequent, consistent processing of additional data. Applying automunge(.) populates and returns a simple python dictionary containing the steps and parameters of the transformations, so the same pipeline can be applied to later streams of data, with no external database required.


The tool is intended for tabular data received in a “tidy” form, which means a table of columns and rows, with one feature per column and one observation per row.

Shown here is a simple example of the features and labels for a classification between cats and dogs — based on age, color, and weight. Through the application of Automunge, data is consistently prepared between sets intended to train, validate, and generate predictions from a machine learning model.


The automunge(.) function is applied at the step immediately preceding machine learning. Feeding raw data through the function numerically encodes and normalizes it and fills in missing data points, making it suitable for the direct application of machine learning.

Just like how the training operation returns a model we can save and later use to generate predictions, the preparation of training data in automunge(.) populates and returns a python dictionary capturing all the steps and parameters of transformations, allowing subsequent raw data to be passed to the postmunge(.) function for fully consistent processing, with just a single function call.
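To make the "returned dictionary" idea concrete, here is a minimal sketch in plain Python. It is not the Automunge API: fit_prepare and apply_prepare are hypothetical stand-ins for the two roles described above, and the z-score with mean infill is just one assumed choice of transformation.

```python
import statistics

def fit_prepare(train_rows, numeric_columns):
    """Hypothetical training-time step: derive parameters from the training
    rows, then return the prepared rows plus the dictionary of parameters."""
    params = {}
    for col in numeric_columns:
        values = [row[col] for row in train_rows if row[col] is not None]
        params[col] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values) or 1.0,  # guard against zero spread
        }
    return apply_prepare(train_rows, params), params

def apply_prepare(rows, params):
    """Hypothetical later step: reuse the stored parameters, derive nothing new."""
    prepared = []
    for row in rows:
        new_row = dict(row)
        for col, p in params.items():
            # Missing values fall back to the training mean (assumed infill).
            value = row[col] if row[col] is not None else p["mean"]
            new_row[col] = (value - p["mean"]) / p["std"]
        prepared.append(new_row)
    return prepared

train = [
    {"age": 2.0, "weight": 4.0},
    {"age": 4.0, "weight": 8.0},
    {"age": 6.0, "weight": None},
]
train_prepared, params = fit_prepare(train, ["age", "weight"])

# Later data is processed with a single call, using only the saved dictionary.
new_prepared = apply_prepare([{"age": 4.0, "weight": None}], params)
```

Because the dictionary holds everything the second function needs, it can be saved alongside the trained model and reused whenever new data arrives.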


The preparation of data can either be based on automated inference of data set properties to determine the types of transformations applied, or a user can manually assign sets of transformation functions to distinct columns, thus serving as a platform for feature engineering.

The library of transformation functions is organized into methods tailored to several categories of data, such as numerical data, time-series data, categorical sets, date-time sets, and also some really neat methods for extracting structure from categorical sets based on string parsing.


Here are a few examples of possible transformations: shown here are derivations originating from three source columns — one numerical set and two categorical sets. The four-character string is how we designate transformation categories. The red numerical set you see here may be normalized by the application of a z-score normalization, which centers and scales the data — or also by min/max scaling within the range 0–1.
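For readers who want to see the two numerical options side by side, here is a short sketch in plain Python. The function names are illustrative, not the library's, and the population standard deviation is an assumed convention.

```python
import statistics

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

weights = [2.0, 4.0, 6.0, 8.0]
print(z_score(weights))  # values centered around zero
print(min_max(weights))  # values within the range 0 to 1
```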

For categorical sets, options include one-hot encoding, where each entry has its own column for activations, and ordinal encoding, which may be more appropriate for categorical sets with a high number of values. Binary encoding offers a mid-point for memory bandwidth, representing some of the values with multiple simultaneous column activations. An extension of binary encoding, built explicitly for categorical sets with two entries, returns a single column.
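The trade-off between these encodings is easiest to see on a small column. The sketch below is plain Python with illustrative function names, and it assumes categories are indexed in alphabetical order, which may not match how the library orders them.

```python
def one_hot(values):
    """One column per distinct category, with a single 1 per row."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal(values):
    """A single column of integer codes."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

def binary(values):
    """Integer codes written out in base 2, one column per bit."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    width = max(1, (len(index) - 1).bit_length())
    return [[int(bit) for bit in format(index[v], f"0{width}b")] for v in values]

colors = ["black", "white", "brown", "white", "grey"]
print(one_hot(colors))  # four columns for four categories
print(ordinal(colors))  # one column
print(binary(colors))   # two columns for four categories

# With exactly two categories, the binary encoding collapses to one column.
print(binary(["cat", "dog", "cat"]))
```

Four categories need four one-hot columns but only two binary columns, and the gap widens as the number of categories grows.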


What’s cool about this platform is there’s no need to limit yourself to one returned set from each source column. Using the “family tree” primitives, it’s possible to generate multiple configurations of columns for presentation to machine learning.

Presenting feature sets to machine learning algorithms in multiple configurations is what we refer to as “artificial learning”. We have several pre-configured, multi-output sets available for simple assignment from the library of transformations.
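As a rough illustration of one source column yielding several returned columns, here is a plain Python sketch. It does not reproduce the library's "family tree" primitives: the assignments mapping, the suffixes, and the two placeholder transformations (a log and a two-way bin) are all hypothetical.

```python
import math

# Hypothetical assignment: each source column maps to one or more
# (suffix, function) pairs, and each pair yields its own output column.
assignments = {
    "weight": [
        ("logn", lambda v: math.log(v)),
        ("bins", lambda v: 0 if v < 5 else 1),
    ],
    "age": [("orig", lambda v: v)],
}

def derive_columns(rows, assignments):
    """Build every derived column, named as source + '_' + suffix."""
    derived = []
    for row in rows:
        out = {}
        for source, transforms in assignments.items():
            for suffix, fn in transforms:
                out[f"{source}_{suffix}"] = fn(row[source])
        derived.append(out)
    return derived

rows = [{"age": 3, "weight": 4.0}, {"age": 7, "weight": 12.0}]
derived = derive_columns(rows, assignments)
print(list(derived[0]))  # column names of the derived set
```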


In most cases, applying a transformation requires extracting properties of the data to serve as a basis. A z-score normalization, for example, is based on a set’s mean and standard deviation.

Automunge avoids the risk of inconsistent transformations or data leakage by basing transformations on properties derived from the training data. As a bonus, because no extraction of properties is required for subsequent data transformations, preparations in the postmunge(.) function are very efficient.


There are several useful push-button methods available in the library. One item worth highlighting is an option to predict infill to missing or improperly formatted data, based on machine learning trained on the rest of the set, in a fully automated fashion.

Under the hood, the training set is partitioned to train column-specific machine learning models, which are then used to generate predictions for the missing points. We call this method “ML infill”. It’s fully automated and could not be simpler to use.
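The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a one-feature least-squares line stands in for whatever models the library actually trains, and ml_infill and fit_line are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept.
    Assumes the feature values are not all identical."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def ml_infill(rows, target, feature):
    """Train on rows where the target is present, predict where it is missing."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in known], [r[target] for r in known]
    )
    filled = []
    for r in rows:
        new_row = dict(r)
        if r[target] is None:
            new_row[target] = slope * r[feature] + intercept
        filled.append(new_row)
    return filled

rows = [
    {"age": 1.0, "weight": 3.0},
    {"age": 2.0, "weight": 5.0},
    {"age": 3.0, "weight": None},  # the missing point to be predicted
    {"age": 4.0, "weight": 9.0},
]
filled = ml_infill(rows, target="weight", feature="age")
print(filled[2])
```

The rows with a known weight act as the training set for that column, and the row with the gap plays the role of the data to predict.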


Another push-button method worth highlighting is the option to prepare data for oversampling in cases of class imbalance, which refers to cases where there is an unequal distribution of categories in a label set.

Here’s an example where we have four 1’s and two 0’s. We can increase the model’s training exposure to the 0 labels by simply duplicating the rows with the lower-frequency label to improve the ratio. The Automunge platform can do this automatically, and this method isn’t limited to categorical label sets. It can also handle numerical labels by way of aggregated bins.
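Here is the four 1’s and two 0’s example as a plain Python sketch. The oversample function is a hypothetical name, and duplicating until every label matches the most frequent one is an assumption about the target ratio, not a statement of what the library does.

```python
from collections import Counter

def oversample(rows, label_key):
    """Duplicate rows of less frequent labels until all labels are equally common."""
    counts = Counter(row[label_key] for row in rows)
    target = max(counts.values())
    result = list(rows)
    for label, count in counts.items():
        members = [row for row in rows if row[label_key] == label]
        # Cycle through the existing rows of this label to fill the shortfall.
        for i in range(target - count):
            result.append(dict(members[i % len(members)]))
    return result

rows = [{"x": i, "label": lab} for i, lab in enumerate([1, 1, 1, 1, 0, 0])]
balanced = oversample(rows, "label")
print(Counter(row["label"] for row in balanced))
```

For numerical labels, the same routine could be applied after first assigning each label to a bin and balancing on the bin instead.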


Once again, all of the methods demonstrated here are available in the context of two very simple python functions, which can be run in a Jupyter notebook.

The automunge(.) function accepts raw tabular data and automatically converts it to normalized, numerically encoded sets with infill to missing points, making the returned data suitable for direct application of machine learning.

The automunge(.) function also populates a simple python dictionary, capturing all of the steps and parameters of transformation, so additional data can then be consistently prepared by simple application of the postmunge(.) function.

In short, we make machine learning easy.


Please find us online for more information. Oh and once you try it out — please let us know!


A huge debt of gratitude is owed to Kelley Teague for lending her incomparable polish of voice to this video. Thank you Kelley! For an extended version of this material which goes into more detail, please refer to the “in depth” presentation and transcript linked here:



For further readings please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
