Munging Supremacy

A Seven Part Suite

Nicholas Teague
Automunge
19 min read · Nov 3, 2019


For those that haven’t been following along, I’ve been using this forum to document the development of Automunge, an open source platform for the preparation of tabular data for machine learning. At its core the tool is a method for the numerical encoding of tabular data that is a prerequisite for modern machine learning libraries, with minimal requirements of tidy data (single column per feature and single row per observation). Such encoding may be the product of an automated evaluation of data set column properties to determine simple feature engineering methods (such as boolean encoding for binary data, one-hot encoding for categorical data, z-score normalization for numerical data, sin/cos transforms for time series data at the period of the time scale, etc.), or the user may alternatively assign methods to distinct columns to apply custom feature engineering transformations or even custom sets of transformation functions. Such transformation functions may be sourced from our internal library of feature engineering methods (which we are continuing to build out), or alternatively may be custom defined by the user, making use of just a few simple data structures, for incorporation into the platform.

By implementing transformations in the platform, a user gets access to a whole host of useful push-button features, such as segregated feature engineering for validation data based on properties derived from training data, automated machine learning derived infill, automated feature importance evaluation, automated preparation of data sets for oversampling in cases of class imbalance in labels, automated evaluation of data property drift between training data and subsequent data, and perhaps most importantly the simplest means for consistent preparation of additional data with just a single function call. In short, we make machine learning easy.

Orphée Suite — Nicholas Teague (composed by Philip Glass)

So I’ve been writing essays along the development journey for quite some time, and taking stock of where we are, I finally realized that I haven’t done as good a job as would really be appropriate in documenting our library of transformations, which after all is really what the tool is for. So I’ll use this essay to highlight a few of the methods available for implementation in the tool. Each of these transforms can be assigned for application to one or more distinct columns in a tabular data set by passing the column header as a string to the entry lists in the Automunge parameter ‘assigncat’ dictionary. We have traditionally followed a convention where each category of transformation is designated by a four character string, and yes, I get that this is not exactly user friendly for visual browsing of the list of transforms, but these abbreviations are more appropriate when you consider that they also generally serve as the suffix appenders to returned column titles, which log the steps of transformations for a given column. A user is encouraged to refer to the documentation included in our READ ME for a succinct definition of each of these category keys (in the section titled “Library of Transformations”). Great, so without further ado let’s get started. Oh wait, one more thing: soundtrack for this essay is Orphée Suite by Philip Glass, one of my favorites. It’s a little hard to keep up in a few spots, I find it helps to take it on at 80bpm as opposed to the specified 160 ;).
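As a quick illustration of what an assignment looks like in practice, here’s a hedged sketch — the column headers are hypothetical and the automunge(.) call is abbreviated (refer to the READ ME for the full signature):

```python
# illustrative only: 'address', 'price', 'listing_date' are hypothetical column headers;
# each key is a transformation category and each entry list holds the assigned columns
assigncat = {'1010': ['address'],        # multi-column binary encoding
             'mnmx': ['price'],          # min-max scaling
             'dat6': ['listing_date']}   # default date-time set of transforms

# the dictionary is then passed to an automunge(.) call, abbreviated here
# (where am is an instantiated Automunge object, per the READ ME):
# ... = am.automunge(df_train, assigncat=assigncat, ...)
```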

I. Categorical Data

We’ll begin our arrangement by taking on our extensive suite of options to prepare categorical data for machine learning. First it’s worth highlighting that one of the primary benefits of the tool here is that we base our transformations and returned columns exclusively on properties of the data derived from the set designated for training (that is, training in some subsequent machine learning derivation). So for example in one-hot encoding, if our test set is missing some categories from the train set, columns for those features will still be returned, just populated with all zeros. Or alternatively in one-hot encoding, if the test set has some extra category entries that weren’t present in the train set, those rows will simply not have an activation. The suffix appenders (which for various transforms track the steps of derivations in the returned column headers) for one-hot encoding are pretty straightforward: the returned columns are the original column header string plus a ‘_’ separator and then a string of the category entry target for that column’s activations. The default infill approach for one-hot encoding is simply the lack of activations in a row (such as shown here for NaN). Probably worth noting that for memory considerations the one-hot encoding as default transform for categorical data is only implemented when the number of unique values in a column falls within some (potentially user-specified) threshold (which defaults to an arbitrary 15 unique values), and beyond that categorical sets are by default encoded via an ordinal integer encoding, specifically what we call ‘ord3’ which sorts the integer encodings by frequency of occurrence (as opposed to ‘ordl’ for instance which sorts encodings alphabetically).
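To illustrate the train-set basis of the encoding, here’s a minimal sketch in pandas (not the Automunge internals — just the behavior described above):

```python
import pandas as pd

# minimal sketch: one-hot encode a test set using only the categories observed
# in the train set, so a category missing from test still gets an all-zero
# column, and an unseen test category simply gets no activation
def onehot_train_test(train_col, test_col):
    categories = sorted(train_col.dropna().unique())
    train_encoded = pd.DataFrame(
        {f"{train_col.name}_{c}": (train_col == c).astype(int) for c in categories})
    test_encoded = pd.DataFrame(
        {f"{test_col.name}_{c}": (test_col == c).astype(int) for c in categories})
    return train_encoded, test_encoded

train = pd.Series(['circle', 'square', 'circle', None], name='shape')
test = pd.Series(['square', 'triangle', None], name='shape')  # 'triangle' unseen in train
train_hot, test_hot = onehot_train_test(train, test)
```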

Of course one-hot encoding is not the only option for numerical transformation of categorical sets. We also have some versions of binary encoding available. The simplest, dubbed ‘bnry’, is intended for use in sets with up to two unique values, and encodes with a boolean transformation to 0 and 1, where 1 is the most common value and 0 the second most common (unless 0 or 1 were present in the set, in which case the encoding defaults to those prior designations). The default infill method sets missing cells to the most common value — note that a user has plenty of options for alternate infill techniques, such as can be assigned to distinct columns in the ‘assigninfill’ parameter in an automunge(.) call — such as specifically setting infill to 0 or 1 or the adjacent cell value for instance. Note that as a special case, if there is only one unique value then missing cells (like NaN) will instead be encoded as the second most common value. Binary encoding can also be extended to sets with >2 unique values with the ‘1010’ transform, which encodes similar to one-hot but multiple columns may be activated for some category (such as shown here with two activations for ‘triangle’, or all 0’s for ‘1234’). This is a new method in the library, so we don’t have ML infill set up for it just yet — a future extension. And note that the default infill (such as shown here for NaN) is a unique infill encoding, which differs from one-hot encoding. The whole point of ‘1010’ is that it is more memory efficient than one-hot encoding — for example a set with 15 elements would have 15 one-hot encoded columns vs 4 columns for ‘1010’, while still retaining the advantages of one-hot encoding vs ordinal encoding for instance (such as avoiding distractions from a training operation potentially seeking to interpret the order of integers in an integer encoding). As a quick aside, note that some machine learning libraries offer options to perform ML derived embeddings for categorical sets — Automunge does not currently have this option, but if you want to take advantage my advice would be to pass an ordinal encoded set for embeddings conducted in an external training operation.
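For a rough sense of how a ‘1010’ style encoding compresses categories into bit patterns, here’s a hedged sketch (not the Automunge source; the suffix convention is just borrowed for illustration):

```python
import pandas as pd

# minimal sketch of a '1010'-style binary encoding: each unique value maps to a
# distinct bit pattern, so roughly log2(n) columns cover n categories, and an
# all-zeros pattern is reserved here for infill of missing cells
def binary_encode(train_col):
    categories = train_col.value_counts().index.tolist()
    width = max(1, len(categories).bit_length())    # bits needed, zero pattern reserved
    mapping = {cat: format(i + 1, f'0{width}b') for i, cat in enumerate(categories)}
    codes = train_col.map(mapping).fillna('0' * width)
    return pd.DataFrame(
        {f"{train_col.name}_1010_{j}": codes.str[j].astype(int) for j in range(width)})

encoded = binary_encode(pd.Series(['triangle', 'square', 'circle', None], name='shape'))
```

With 15 unique values this approach returns 4 columns, consistent with the comparison above.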

(that’s an E flat in case you were wondering)

II. Numerical Data

In this next scene we’ll talk about numerical transformations, one of the key parts of the tool after all. The z-score normalization technique (subtract the mean, divide by the standard deviation) was one of the first implementations I built so perhaps has some sentimental value — it is the default for evaluated numerical columns under automation after all. Of course the transformations are based on numerical properties derived from the designated train set, and then consistently applied to test or subsequent data in the postmunge function. In addition to the properties applied for transformations, the evaluation also collects a few simple metrics that can be used to track drift of data set properties when evaluating subsequent data in postmunge (with the ‘driftreport’ parameter). Of course driftreport metrics are not exclusive to numerical sets — we also track some drift metrics for categorical data for instance — this just seems as good a place to bring it up as any other. An alternative to z-score normalization is available in the ‘MADn’ transform, which bases normalization on mean absolute deviation instead of standard deviation — the difference is somewhat esoteric, basically there are some distributions (fat-tailed distributions, for example) where mean absolute deviation may be more tractable. These are also the kind of distributions that may be appropriate for a box-cox power law transformation, which derives a power law (lambda) parameter based on the train set properties and then applies the corresponding power law transformation — our default here follows the box-cox transform with a z-score normalization, since if there are low digit decimals for instance the returned distribution can get kind of wild.
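A minimal sketch of the train-then-apply pattern for z-score normalization (not the Automunge source — in practice automunge(.) derives the parameters and postmunge(.) applies them to subsequent data):

```python
import pandas as pd

# minimal sketch: the mean and standard deviation are derived from the train
# set only, then applied consistently to test or subsequent data
def zscore_fit(train_col):
    return {'mean': train_col.mean(), 'std': train_col.std()}

def zscore_apply(col, params):
    return (col - params['mean']) / params['std']

train = pd.Series([1.0, 2.0, 3.0, 4.0], name='nmbr')
test = pd.Series([2.5, 10.0], name='nmbr')
params = zscore_fit(train)                # derived from train data only
train_norm = zscore_apply(train, params)
test_norm = zscore_apply(test, params)    # consistent basis for subsequent data
```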

There are certainly some applications where you might prefer an alternative to z-score normalization, for instance you might want your output to remain all non-negative. We have min-max scaling available via ‘mnmx’, which scales values to within a range of 0–1. One of the drawbacks here is that the min/max values are based on values found in the train set, so for instance if there are more extreme values in the test set then the subsequent data could still return values outside that range (such as below 0 for values under the train set minimum). We have a few ways to address this built in: for one, with the ‘mnm3’ variant the ceiling and floor of the min-max scaling are capped based on the 99th and 1st percentile quantiles of the train set. Or another approach, in the ‘mnm6’ variant, is that a floor is placed on subsequent data below the minimum from the train set. So these kinds of methods help to address outlier scenarios when the test data does not perfectly line up with the train data, as there are some methods (kernel PCA is just one example) where negative values could corrupt a transformation. Oh and on a tangent, if you look closely you might see that some of the different methods here have different rows subject to infill (generally a default infill of the mean is applied) — for instance bxcx and log0 apply infill for any non-positive value (including 0), while the square root transform ‘sqrt’ allows values of 0.
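As a rough sketch of the ‘mnm6’ idea of flooring subsequent data at the train set minimum (not the Automunge source):

```python
import pandas as pd

# minimal sketch: min and max come from the train set, and subsequent data is
# floored at the train set minimum so returned values stay non-negative
def minmax_fit(train_col):
    return {'min': train_col.min(), 'max': train_col.max()}

def minmax_apply(col, params):
    floored = col.clip(lower=params['min'])     # floor subsequent data at train minimum
    return (floored - params['min']) / (params['max'] - params['min'])

train = pd.Series([10.0, 20.0, 30.0], name='mnmx')
test = pd.Series([5.0, 25.0, 40.0], name='mnmx')   # 5.0 falls below the train minimum
params = minmax_fit(train)
test_scaled = minmax_apply(test, params)           # 5.0 -> 0.0 rather than a negative value
```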

III. Binned Data

Continuing to Act II, the next set of transforms we’ll discuss involve the graining of continuous numerical data into binned sets. We have a few methods built in right now, including bins based on number of standard deviations from the mean and bins based on powers of 10. For the standard deviation bins, we have two options, ‘bins’ and ‘bint’, which differ only in that bint assumes the data was previously z-score normalized while bins assumes raw numerical data. They default to returning six boolean columns with activations for data that falls within a bin, basically identifying values by number of standard deviations from the mean: <-2, -2 to -1, -1 to 0, 0 to +1, +1 to +2, and >+2 — sort of similar to one-hot encoding in a way. The ‘bsor’ transform is similar except instead of creating a distinct column per bin it ordinal encodes the groupings as integers (0–5). Generally, unless you’re constrained by scale of data, my thought is that the binned boolean sets are more effective, but certainly an argument could be made for running this as ordinal instead (part of the rationale for creating an ordinal alternative is relevant to my discussion of categorical sets, such as if one wanted to create some kind of embedding based on these groupings).
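Here’s a minimal sketch of the standard deviation bins concept (not the Automunge source; column labels are made up for illustration):

```python
import numpy as np
import pandas as pd

# minimal sketch of 'bins'-style standard deviation bins: six boolean columns
# for <-2, -2 to -1, -1 to 0, 0 to +1, +1 to +2, and >+2 standard deviations
# from the train set mean
def std_bins(train_col):
    mean, std = train_col.mean(), train_col.std()
    z = (train_col - mean) / std
    edges = [-np.inf, -2, -1, 0, 1, 2, np.inf]
    labels = ['s<-2', 's-2-1', 's-10', 's+01', 's+12', 's>+2']
    binned = pd.cut(z, bins=edges, labels=labels)
    return pd.get_dummies(binned, prefix=train_col.name).astype(int)

bins_out = std_bins(pd.Series(np.random.normal(size=100), name='nmbr'))
```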

The powers of ten equivalent, ‘pwrs’, similarly creates a distinct column for data falling within each group, however note that the range of columns need not be contiguous — for example, if our source data had values in the ranges 10–99 and 1,000–9,999 (but no values within 100–999), then no column would be created for the intermediate gap associated with 10². And of course negative powers of ten are allowed for values in the range 0–1. It just occurred to me that it might be appropriate to build a comparable transform that allows for powers of ten bins that include values <0, such as -10¹ etc — I’ll need to get on that. And just like the standard deviation bins we have an ordinal equivalent. A reasonable objection to these transforms is “well, you’re actually throwing away information in the aggregations, so how is this helpful?” Well, my thought is that by giving the data consolidated groupings it may help the training get started in the early epochs. But more importantly, one of the benefits of creating these types of aggregations in a labels set for instance is that they then enable the preparation of data for oversampling in cases of label set class imbalance with our ‘TrainLabelFreqLevel’ parameter. For example, if you wanted to leave numerical labels unprocessed but aggregate standard deviation bins for levelizing, you could assign the label column to category ‘exc3’, or for an equivalent z-score normalization coupled with standard deviation bins you could assign category ‘nbr3’. Pretty cool stuff.
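A rough sketch of the powers-of-ten idea (not the Automunge source; infill for non-positive values is omitted here):

```python
import numpy as np
import pandas as pd

# minimal sketch of 'pwrs'-style powers-of-ten binning: each positive value is
# assigned to the power of ten of its magnitude, and columns are only created
# for powers actually present in the data (so gaps get no column)
def pwrs_bins(col):
    # non-positive values would need a separate infill step, omitted in this sketch
    powers = np.floor(np.log10(col)).astype(int)
    return pd.get_dummies(powers, prefix=f"{col.name}_10^").astype(int)

out = pwrs_bins(pd.Series([15.0, 72.0, 3500.0, 0.4], name='nmbr'))
# returns columns only for 10^-1, 10^1, and 10^3 — no column for the 10^2 gap
```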

IV. Date Time

Our date time methods are really some of the standout transforms in the tool. Built on top of the Pandas platform, which has excellent capabilities to recognize even potentially inconsistent time series data formatting, the default methods for time series data segregate these date time objects by time scale before applying their respective transforms. Our default time series method ‘dat6’ (it took a few iterations to get there) treats years on their own with a simple z-score normalization of the value, and then for other time scales an aggregation is performed, such as between months/days (mdsn/mdcs) and hours/minutes/seconds (hmss/hmsc). You’ll notice here the md and hms groupings each have two columns — these are based on separately performing a sin transform and a cos transform based on the period of the time scale. For instance the sin transform would start at 0.5 on Dec 31st and follow a sin curve back through 0.5 at some intermediate point and then again on Dec 31st the next year. The training operation will be able to distinguish between these points by the simultaneous exposure to the cos transform (whose curve has a constant offset), plus the year value will be present as well. Similarly, the hms cycle is at a time scale of 24 hours, and follows a similar trigonometric cycle through each day.
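As a rough sketch of the hms sin/cos idea (not the Automunge source — the suffix naming is just borrowed for illustration):

```python
import numpy as np
import pandas as pd

# minimal sketch of sin/cos encoding at the hours/minutes/seconds scale: the
# time of day is mapped onto a 24-hour cycle with paired sin and cos columns
# so a training operation can resolve where in the cycle a point falls
def hms_sin_cos(dt_col):
    seconds = dt_col.dt.hour * 3600 + dt_col.dt.minute * 60 + dt_col.dt.second
    fraction = seconds / 86400.0                     # fraction of the 24-hour period
    return pd.DataFrame({
        f"{dt_col.name}_hmss": np.sin(2 * np.pi * fraction),
        f"{dt_col.name}_hmsc": np.cos(2 * np.pi * fraction),
    })

times = pd.Series(pd.to_datetime(['2019-11-03 09:30:00', '2019-11-03 21:30:00']),
                  name='tmsr')
encoded = hms_sin_cos(times)
```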

Our default time series also includes a few added bin sets for a few special categories that might make a training operation a little more fruitful. For instance our ‘bshr’ category identifies those times that fall within 9:00am — 5:00pm, sort of a traditional workday unless you’re like me and like to sleep in. Although probably worth noting that this designation in the current implementation is not time-zone aware. Pandas also has the ability to build on top of time zone designations, but since I was trying to keep this in a more generalizable form I left that for a future extension. A comparable bin set is ‘wkdy’ for identification of any days that fall within Mon–Fri. And ‘hldy’ identifies days of US Federal holidays. The inclusion of these bin sets really falls in with the Automunge philosophy of ‘artificial learning’ — without these identifiers a training operation may take a while longer to distinguish such designations just based on data-set properties, so the expectation is that this will support a more efficient training operation.
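Here’s a hedged sketch of the ‘bshr’ / ‘wkdy’ style bins (not the Automunge source, and like the original, not time-zone aware):

```python
import pandas as pd

# minimal sketch: business hours taken as 9:00am-5:00pm, weekdays as Mon-Fri;
# a 'hldy' equivalent could draw on a US Federal holiday calendar (omitted here)
def time_bins(dt_col):
    return pd.DataFrame({
        f"{dt_col.name}_bshr": ((dt_col.dt.hour >= 9) & (dt_col.dt.hour < 17)).astype(int),
        f"{dt_col.name}_wkdy": (dt_col.dt.dayofweek < 5).astype(int),
    })

times = pd.Series(pd.to_datetime(['2019-11-03 10:00:00',   # a Sunday morning
                                  '2019-11-04 20:00:00']),  # a Monday evening
                  name='tmsr')
bins_out = time_bins(times)
```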

V. Cumulative Data

Let’s take a second to look at another few methods potentially useful for time series data. One element that might be relevant to data collected along a time domain is that there may be unbounded cumulative sets, such as for example distance traveled on a road trip or dollars incurred on a credit card. Thus the training operation may not be exposed to multiple instances of the same state — after all, by definition a cumulative measure only reaches each state once. In such a scenario we may instead look to infer properties of an observation not based on a specific point but instead on relative properties between different time steps. To do so we’ll need to maintain the order of rows in our tabular set (no shuffling for instance), and Automunge has a really useful method to derive such properties based on an analogue to a set’s derivative, through transforms built off of ‘dxdt’ which take deltas between adjacent time steps.

The analogues to derivatives (like velocity, acceleration, jerk) are built on simply taking a difference between adjacent time steps, where for instance acceleration takes the difference between adjacent velocity steps. This is kind of like an estimate of a variable’s derivative where we take the delta of the variable divided by the delta of time, but here we’re just setting the time delta equal to 1 (as in one time step). For cases where we’re interested in longer term relations between time steps, we’ve demonstrated a comparable approach with ‘dxd2’, which takes the average of the latest two time steps minus the average of the two preceding time steps, which kind of smooths out and de-noises the data. Of course this method could be extended for any arbitrary length of time step with comparable methods. The cool thing to keep in mind is that we are building on the ‘family tree’ primitives for defining a sequence of transformations with potential for generations and branches etc — for more on how that is specified please feel free to read the READ ME, and more on that below.
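Here’s a minimal sketch of the idea using pandas (not the Automunge source; column name and sample values are made up):

```python
import pandas as pd

# minimal sketch of 'dxdt' / 'dxd2' style delta transforms: dxdt takes the
# difference between adjacent rows (time delta of 1), dxd2 takes the average
# of the latest two rows minus the average of the two preceding rows, which
# smooths out some noise; row order must be preserved (no shuffling)
def dxdt(col):
    return col - col.shift(1)

def dxd2(col):
    latest = (col + col.shift(1)) / 2
    preceding = (col.shift(2) + col.shift(3)) / 2
    return latest - preceding

miles = pd.Series([0.0, 10.0, 25.0, 45.0, 70.0], name='odometer')
velocity = dxdt(miles)           # analogue to a first derivative
acceleration = dxdt(velocity)    # difference of adjacent velocity steps
smoothed = dxd2(miles)
```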

VI. String Parsing

The ‘splt’ string parsing methods are a recent addition to the library and I think are certainly worthy of consideration. Consider as an example if we were evaluating a collection of real estate listings by address, with a single column capturing combined entries for street address / city / state / zip code. I think it is intuitive that it would be beneficial for training to independently identify cases where portions of the address string have overlaps — for example, addresses that share the same zip code or are on the same street. Now one method could be for a user to designate a formula to extract portions of the string based on the order of segments separated by spaces for instance. Well that’s not what’s being offered here — the Automunge philosophy is to try and automate where possible, so instead we have a method to automatically evaluate the set of unique values in the column and identify cases where unique values may share some overlap in string characters. The splt transformation does just that (in the current implementation giving precedence to identified overlaps of greater length, limited to one overlap per row, and a minimum of five characters in an overlap), with returned overlaps presented as a one-hot encoding of sorts. Of course one of the limitations of the method is that, for example, if two houses share the same street and same city with a few other houses, the overlap would be identified for the combined street and city and thus would not carry a consistent activation with other addresses in the same city, even though they do share a (shorter length) string overlap.
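As a rough illustration of the idea (a simplified sketch, not the Automunge source — the real implementation derives overlaps from the train set and applies them consistently to subsequent data):

```python
import pandas as pd
from itertools import combinations

# minimal sketch of a 'splt'-style string overlap search: unique values are
# compared pairwise for shared substrings of at least five characters, longer
# overlaps take precedence, and matches become boolean activation columns
def longest_overlap(a, b, minimum=5):
    best = ''
    for i in range(len(a)):
        for j in range(i + minimum, len(a) + 1):
            if a[i:j] in b and j - i > len(best):
                best = a[i:j]
    return best

def splt(col, minimum=5):
    uniques = col.dropna().unique().tolist()
    overlaps = set()
    for a, b in combinations(uniques, 2):
        found = longest_overlap(a, b, minimum)
        if found:
            overlaps.add(found)
    overlaps = sorted(overlaps, key=len, reverse=True)   # longer overlaps take precedence
    activations = pd.DataFrame(0, index=col.index,
                               columns=[f"{col.name}_splt_{o}" for o in overlaps])
    for idx, value in col.items():
        for o in overlaps:
            if isinstance(value, str) and o in value:
                activations.loc[idx, f"{col.name}_splt_{o}"] = 1
                break                                    # limit to one overlap per row
    return activations

addresses = pd.Series(['12 Main St Orlando, FL 32789',
                       '98 Oak Ave Orlando, FL 32789',
                       '7 Pine Rd Miami, FL 33101'], name='address')
activations = splt(addresses)
```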

A few variants of the splt transform are spl2 and spl5, which share similar functionality. In spl2 a set is evaluated for overlaps just like in splt, but instead of creating boolean activations, it replaces instances of overlap with the (potentially shorter) string of the identified overlap — for example ‘Orlando, FL 32789’ might be replaced with just ‘Orlando, FL’. And in the default implementation, this transformation is followed by an ordinal transform for integer encoding. The ‘spl5’ transform shares similar functionality but differs in that rows without a replacement of overlap are simply set to 0, which could be beneficial if one wanted to run multiple tiers of overlap tests for instance. As an example of what I mean by multiple tiers of overlap, consider the transform in our library ‘ors6’, which produces a defined ‘family tree’ of transforms. By presenting the same set in multiple formats to the training algorithm we expect to facilitate a more efficient extraction of properties for generating predictions. For example, as shown here ors6 takes a categorical set of strings and generates separately an ordinal encoding sorted by frequency (ord3), a spl5 overlap encoding followed by an ordinal encoding sorted by frequency (spl5_ord3), and then based on the spl5 overlap encoding generates a second tier of offspring overlap encodings using the splt method for boolean activation — identifying overlaps within the overlaps so to speak.

VII. Family Tree

The specification of a set of transformations can be conducted by an elegant arrangement of what we call our ‘family tree primitives’. The specification of such a set always starts with a root category, which would be the category intended for assignment in the assigncat parameter. The user can then define a set of transformation category entries associated with any of the family tree primitives. The primitives are there to specify some fundamental operations, such as whether a source column is supplemented or replaced and whether a transformation category will have any offspring generations. Any upstream transformation category entries without offspring, such as auntsuncles and cousins, are applied and returned. If one (or more) transformation categories are entered in an upstream family tree primitive with offspring (such as ‘parents’ or ‘siblings’), the family tree of that transformation category entry is inspected to identify any corresponding offspring transformations and how they should be applied. Offspring children are treated as the next generation’s parents, niecesnephews as siblings, coworkers as auntsuncles, and friends as cousins.
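To make that a little more concrete, here’s a hedged sketch of what such a specification might look like. The root category ‘newt’ and the populated entries are hypothetical, the exact data structures and parameter names are documented in the READ ME, and the replace/supplement comments reflect my reading of the conventions there:

```python
# illustrative family tree for a hypothetical root category 'newt', using the
# eight primitives named above (populated entries chosen for demonstration only)
transformdict = {
    'newt': {
        'parents':       ['bxcx'],   # upstream with offspring: bxcx's own family tree is inspected next
        'siblings':      [],         # upstream with offspring, supplementing the source column
        'auntsuncles':   [],         # upstream without offspring, replacing the source column
        'cousins':       ['mnmx'],   # upstream without offspring, supplementing the source column
        'children':      [],         # offspring: treated as the next generation's parents
        'niecesnephews': [],         # offspring: treated as siblings
        'coworkers':     [],         # offspring: treated as auntsuncles
        'friends':       [],         # offspring: treated as cousins
    },
}

# the root category could then be assigned to a column, e.g.
# assigncat = {'newt': ['somecolumn']}
```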

Through the simple assignment of transformation categories to this set of family tree primitives, sets of transformations may be specified for a given column. Such sets of transformations may include multiple generations and/or branches of offspring originating from the same source column. Some current limitations of the implementation include that generations may only originate from a single source column, there is no mixing via multicolumn derivations. We consider this acceptable in that the intended use of Automunge is for the final steps of data wrangling prior to the application of machine learning — this is supposed to be the step that immediately precedes the application of training a predictive model in a mainstream machine learning library. We are not suggesting that experiments in multicolumn derivations are not ever appropriate, but given that automated multicolumn extractions are part of the core strengths of neural network models, we feel comfortable that automated feature engineering methods for tabular data that generate offspring from a single source column may prove quite useful for a machine learning practitioner.

Hat Tips:

The following list is intended as a citation list of sorts for those resources that served as some of the inspiration for methods demonstrated in this essay:

  • Categorical:

It’s been a while since I put these together, but I believe the one-hot encoding was inspired by discussions in either Deep Learning With Python by Francois Chollet or Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (not sure which). The idea for ordinal encoding as an alternate to one-hot encoding was inspired by the Kaggle Intermediate Machine Learning micro-course by Alexis Cook. The multi-category binary encoding was partly inspired by the Medium post Visiting: Categorical Features and Encoding in Decision Trees by “Laurae”.

  • Numerical:

The use of z-score normalization for numerical data I believe was inspired by discussions in either Deep Learning With Python by Francois Chollet or Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (not sure which). Min-max scaling, box-cox power law transform, and logarithmic transforms were inspired by discussions in Feature Engineering for Machine Learning by Alice Zheng & Amanda Casari. The use of mean absolute deviation was inspired by discussions at the Real World Risk Institute seminar about tractability of mean absolute deviation vs. standard deviation in fat-tailed distributions. The square root transform was inspired by discussions in Probability for Machine Learning by Jason Brownlee.

  • Binned Data:

I can’t remember if they discussed standard deviation bins directly, but the idea of binning numerical data in general was inspired by discussions in Feature Engineering for Machine Learning by Alice Zheng & Amanda Casari.

  • Date Time:

I believe the original spark for segregating time series data by time scale was inspired by discussions in fast.ai lectures by Jeremy Howard, which may have also contributed to the idea of binning for things like weekends and business hours — although I think some discussions by Sam Charrington at TWiML may have also contributed to those aspects. The idea for applying sin and cos transformations to time data was partly inspired by the blog post Encoding cyclical continuous features — 24-hour time by Ian London.

  • Cumulative Data:

The spark for taking derivatives of cumulative data was just kind of a bit of inspiration. I’ll offer a hat tip to Mark Ryan as that inspiration came about while perusing some of the pre-release chapters in his forthcoming Deep Learning With Structured Data, but really just kind of a coincidence there.

  • String Parsing:

The methods demonstrated here for string parsing are original, but the idea that some type of evaluation for string overlaps may be beneficial was partly inspired by Kaggle discussions of competition data properties provided by Chris Deotte after his winning entry to the IEEE fraud detection competition.

  • Family Tree:

The transformation primitives for Python-based specification of sets of transformations are an original invention inspired by the process of building out Automunge.

  • Music:

For more (and better) performances of Philip Glass piano, I also recommend any of the recordings by Nicolas Horvath.

Books that were referenced here or otherwise inspired this post:

Orphée Suite — Philip Glass

Albums that were referenced here or otherwise inspired this post:

Glassworks — Philip Glass

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

Patent Pending, application 16552857
