Oversampling with Automunge

2 + 2 > 2 + 1

Nicholas Teague
Automunge
Mar 29, 2021


I wanted to quickly offer an introduction to a particular feature of the Automunge python library for tabular data preprocessing, this one associated with preparing training data for oversampling in cases of label set class imbalance. In some ways it is a trivial thing, which is probably why I haven’t written about it previously outside of the documentation, but even such a simple operation has the potential to materially benefit model performance, so perhaps some expansion on the details is justified.

Label set class imbalance refers to cases where, for a classification application, the distribution of label classes in the training data has some segments that are underrepresented. You see, in a perfect world the features of our training data would be iid, a statistical term standing for independent and identically distributed, which would by proxy result in a data set with an equal distribution of labels. When researchers derive various learning algorithms, it is a common assumption that training data is received in this form. In the real world, the adherence of training data to the iid property is generally a much looser assumption. Before the era of deep learning, the practice of feature engineering was partly an art of trying to transform the data to more closely adhere to this assumption, such as by applying principal component analysis to decorrelate the features. Now that we have deep networks, the iid assumption may in many cases be less relevant, although that relaxation may not extend to label class distribution properties. After all, among the training data’s features and labels, the label set has special significance in that its values are directly integrated into the loss function, and so some extra care is appropriate for its presentation.

Automunge offers a push-button solution to prepare training data in a fashion such that a downstream learning model can be exposed to a more balanced set of labels. By activating the (admittedly somewhat inelegantly named) automunge(.) parameter TrainLabelFreqLevel = True, the label set is evaluated to identify cases of class imbalance, and underrepresented labels (along with their corresponding feature samples) are duplicated in the returned sets, resulting in an increased number of rows.
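To make that concrete, here is a minimal sketch of the invocation, following the calling conventions from the Automunge documentation (the toy dataframe and the 'target' column name are just stand-ins for illustration):

```python
# minimal sketch of activating oversampling, following the calling
# conventions from the Automunge documentation (the toy dataframe and
# the 'target' column name are stand-ins)
import pandas as pd
from Automunge import *

am = AutoMunge()

# a small training set with an imbalanced label column
df_train = pd.DataFrame({
    'feature': [1.2, 0.7, 3.4, 2.2, 0.1, 1.9],
    'target':  ['a', 'a', 'a', 'a', 'a', 'b'],
})

# prepared data is returned as [features, index, labels] sets for each
# of the train, validation, and test partitions
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
    am.automunge(df_train,
                 labels_column='target',
                 TrainLabelFreqLevel=True)
```

With the single 'b' row underrepresented here, the returned train and labels sets would carry appended duplicates of that row.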

(As a hat tip the idea for oversampling was partly inspired by a comment in one of Jeremy Howard’s fast.ai lectures from a while back (couldn’t tell you which one though).)

Of course it is worth noting that there is more than one way to handle oversampling. The Automunge approach is to increase the size of the training data by appending duplicates of underrepresented label classes. As an alternative, some learning libraries support a similar effect by weighting the loss contribution of each class during training; for example, in TensorFlow you can pass a class_weight argument to a Keras fit operation and manually assign weights that way. When available, this approach has some benefit as far as memory overhead is concerned, since you aren’t duplicating training data samples. The reason we still offer our clunkier alternative is that not all learning libraries offer this feature, and as a learning framework agnostic preprocessing library we want to make oversampling as easy as possible.
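For reference, a sketch of the weighting alternative in Keras might look something like the following, with each class weighted inversely to its frequency (the toy data and two-layer model are just for illustration):

```python
# sketch of the class weighting alternative in Keras: instead of duplicating
# rows, each class's loss contribution is weighted inversely to its frequency
import numpy as np
import tensorflow as tf

# toy imbalanced data: 100 samples of class 0, 10 samples of class 1
x_train = np.random.rand(110, 4)
y_train = np.array([0] * 100 + [1] * 10)

# derive weights from the class counts
counts = np.bincount(y_train)
class_weight = {label: float(counts.max() / count)
                for label, count in enumerate(counts)}
# -> {0: 1.0, 1: 10.0}

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# class_weight scales each sample's contribution to the loss by its class
model.fit(x_train, y_train, epochs=10, class_weight=class_weight)
```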

When I say that the automunge operation is push-button, I mean that the derivation of duplication counts is automatic. The formula is pretty straightforward: for each label class the counts are collected, and then the number of extra duplicates is derived as

(extra duplicates count) = round( (maximum class count) / (target class count) ) - 1

Here the rounding operation has the effect of only applying duplicates when the target count is less than two thirds of the max count, since round( (maximum class count) / (target class count) ) - 1 only reaches 1 once that ratio is at least 1.5.
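To illustrate the heuristic, here is a from-scratch sketch applied to a pandas dataframe. This just mirrors the formula above; the Automunge internals may differ:

```python
# from-scratch sketch of the duplication heuristic (mirrors the formula
# above; not the actual Automunge implementation)
import pandas as pd

def levelize(df, label_column):
    counts = df[label_column].value_counts()
    max_count = counts.max()
    appended = [df]
    for label, count in counts.items():
        # (extra duplicates count) = round( (max count) / (target count) ) - 1
        extra_duplicates = int(round(max_count / count)) - 1
        if extra_duplicates > 0:
            rows = df[df[label_column] == label]
            appended.extend([rows] * extra_duplicates)
    return pd.concat(appended, ignore_index=True)

df = pd.DataFrame({'x': range(6), 'label': ['a'] * 5 + ['b']})
print(levelize(df, 'label')['label'].value_counts())
# a: 5, b: 5 -- the single 'b' row received round(5/1) - 1 = 4 duplicates
```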

(As a quick tangent, when writing this I was vaguely reminded of David Deutsch’s discussion in The Beginning of Infinity of the apportionment paradox, which results from just this kind of rounding operation when trying to assign representative counts in legislatures based on population size. In our case we don’t need to worry about perfect balance; close to balanced is good enough.)

There is a failure mode worth noting. For cases of extreme class imbalance, say where the class 1 count >> the class 2 count, the number of class 2 duplicates may climb quite high, an effect that may be compounded when there are multiple edge case labels, resulting in excessively inflated duplication counts that in the worst case may approach memory constraints. For example, with a majority class of 1,000,000 rows, each row of a 100-row minority class would receive round(1,000,000 / 100) - 1 = 9,999 duplicates, adding close to a million rows for that one class alone. So yeah, just something to be aware of; as long as label class counts aren’t too extremely imbalanced this won’t be an issue.

Since the convention of the automunge(.) function is that prepared data (separately for the sets intended as train, validation, or test) is returned in three corresponding sets of [features, index, labels], for purposes of oversampling each of these sets is consistently transformed with the corresponding duplications. Also, if data set shuffling is elected (which is on by default for train data), the shuffling takes place after the oversampling duplications — so if you want to inspect the duplications you can deactivate shuffling for an easier view (such as by inspecting the index numbers at the bottom rows of the returned dataframe). Note that oversampling in the base configuration is applied just to the designated training set. There are some workflows where comparable treatment may be desired for test data with labels — this is fully supported by passing TrainLabelFreqLevel = 'traintest' (to oversample both train and test data) or 'test' (to oversample just the test data). Similarly, test data with labels passed to the postmunge(.) function can be prepared for oversampling by passing TrainLabelFreqLevel=True.
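In code, those variations might look something like the following, reusing the objects from the earlier sketch (df_test is a stand-in for a labeled test set; my understanding is that shuffletrain is the relevant shuffling parameter, though that is worth verifying against the current README):

```python
# variations on the base configuration, reusing am, df_train, and
# postprocess_dict from the earlier sketch (df_test is a stand-in for a
# labeled test set; shuffletrain as the shuffling control is my reading
# of the documentation)

# oversample both train and test data, with shuffling deactivated so the
# appended duplicates are easy to inspect at the bottom rows
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
    am.automunge(df_train, df_test=df_test,
                 labels_column='target',
                 TrainLabelFreqLevel='traintest',
                 shuffletrain=False)

# or prepare subsequently received labeled data for oversampling in postmunge(.)
test, test_ID, test_labels, \
postreports_dict = \
    am.postmunge(postprocess_dict, df_test,
                 TrainLabelFreqLevel=True)
```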

Oh and this particular feature is pretty nifty: unlike in other libraries, Automunge oversampling need not be limited to classification applications. In regression applications, by assigning numeric labels to a transformation set that supplements the numeric data with aggregated bins (such as bins for number of standard deviations from the mean, powers of ten, or custom bins), oversampling can be automatically performed to levelize the distribution of those bins. That’s right, oversampling for regression. Pretty neat.
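As a sketch of what that might look like, the assigncat parameter can assign the numeric label column to a transformation category that aggregates bins alongside the numeric encoding. Here 'bins' (standard deviation bins) is used purely as an illustration, and df_houses and 'price' are stand-ins; consult the READ ME for the root categories applicable to label sets:

```python
# illustrative sketch of oversampling for regression: the numeric label is
# assigned to a transformation category aggregating standard deviation bins
# ('bins' is an illustration; see the documentation for the root categories
# applicable to label sets; df_houses and 'price' are stand-ins)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
    am.automunge(df_houses,
                 labels_column='price',
                 assigncat={'bins': ['price']},
                 TrainLabelFreqLevel=True)
```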

In closing, to zoom back out for a minute, the whole point of these oversampling preparations is to encourage the training operation to pay more attention to any underrepresented labels, which may improve inference calibration. This may be especially noticeable when returning probabilistic predictions, for instance. In real world applications it is often the edge case label classes that are of most interest to an evaluation, and so any label engineering that can help improve their assessment is well worth the effort — especially in the case of Automunge, where the preparation simply requires the activation of a single parameter.

Automunge: we make machine learning easy.

Books that were referenced here or otherwise inspired this post:

Deep Learning for Coders with Fastai and PyTorch — Jeremy Howard and Sylvain Gugger

The Beginning of Infinity — David Deutsch

As an Amazon Associate I earn from qualifying purchases.

Oscar Peterson’s Hymn to Freedom — Nicholas Teague

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
