Noise Injection Benchmarks

The fine print

Nicholas Teague
Automunge
Nov 17, 2021


Introduction

As followers of this blog may be aware, the author has recently been offering some hypotheses regarding potential benefits of noise injections in the context of tabular learning applications. It is probably worth reiterating that several aspects of these suggestions are merely that, hypotheses. We have been building out features in the library to support that premise as well as to enable experiments to validate the concept. This essay is provided in the interest of transparency into the evaluations performed since then to validate the performance impacts of various noise settings. This will not be a very literary presentation; mostly it is intended to distribute experiment results for advanced users’ consideration. Along with the extensive figures we’ll provide some commentary related to our interpretations. One way to think about this material is that it may serve as an appendix to the essay Noise Injections with Automunge.

Findings

Broadly, the experiments demonstrated that noise injections targeted to test data had higher performance sensitivity with increasing noise profiles than injections to train data. We found that, in general, injections to both train and test data performed similarly to injections to just test data. The performance penalty for categoric features was fairly linear with increasing noise profile, and weighted categoric sampling materially outperformed uniform sampling. The performance penalty for numeric injections had an immediate drop in the transition from 0 noise to the smallest settings, but further increased settings directed to train data were in many cases neutral to performance, especially when the adjacent setting was low, where adjacent refers to the other of the two parameters sigma (scale) and flip_prob (ratio). There did not appear to be a significant difference between numeric injections with Gaussian or Laplace sampled noise.

We believe these experiments have validated the suitability of the current default distribution parameters for train data injections, including 0.03 flip_prob and 0.06 sigma for numeric features, and 0.03 flip_prob with weighted sampling for categoric features. However, based on these findings we are reducing the corresponding default distribution parameters for test data injections: numeric sigma from 0.06 to 0.03, and categoric flip_prob from 0.03 to 0.01.
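
For orientation on what these two parameters control, here is a minimal numpy sketch of the numeric injection mechanics as described above, where flip_prob selects the ratio of injected entries and sigma scales the sampled noise. This is an illustration of the concept, not the library’s internal implementation.

```python
import numpy as np

def inject_numeric_noise(x, flip_prob=0.03, sigma=0.06,
                         distribution='normal', rng=None):
    """Sketch of numeric noise injection to a z-score normalized feature.

    A flip_prob ratio of entries is randomly selected as injection
    targets, and each target receives additive noise sampled at scale
    sigma from a Gaussian or Laplace distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float).copy()
    mask = rng.random(len(x)) < flip_prob  # select injection targets
    # note: for Laplace, sigma is applied here as the scale parameter b
    sampler = rng.normal if distribution == 'normal' else rng.laplace
    x[mask] += sampler(0.0, sigma, mask.sum())
    return x

# example with the default train data profile noted above
feature = np.random.default_rng(0).normal(size=1000)  # stand-in z-scored feature
noisy = inject_numeric_noise(feature, flip_prob=0.03, sigma=0.06)
```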

Experiments

Experiments were performed to benchmark noise injection parameter settings. The IEEE CIS data set was selected for scale and familiarity. The data set was prepared in three versions: just numeric features, just categoric features, and all features. In general, the IEEE data set has a higher prevalence of numeric features than categoric, which explains the disparity in performance between those two configurations. We prepared the data in automunge(.) with a common validation split between experiments to dampen stochasticity of results. Numeric features were given a z-score normalization and categoric features a binarization, in both cases without ML infill for purposes of speed. Noise injections were applied to either all numeric or all categoric features depending on the scenario. Each scenario was repeated with injections to just train data, just test data, and both train and test data. We varied the noise injection parameters across scenarios to evaluate influence on model performance, including categoric parameters of {flip_prob, weighted} and numeric parameters of {sigma, flip_prob, noisedistribution}. In an alternate configuration we injected “swap noise”, which refers to randomly sampling from the set of entries in a feature. We applied the CatBoost learning library as a representative of gradient boosting, which is well suited to benchmarking without hyperparameter tuning. The reported results are based on an average of the ROC AUC metric (for this binary classification task) over 12 trials for each scenario in order to dampen stochasticity of results.
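
As a rough sketch of the evaluation loop (the automunge(.) data preparation is omitted here; the untuned CatBoost defaults and 12-trial averaging follow the description above, while the per-trial seed handling is our own assumption):

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

def average_auc(train, labels, val, val_labels, n_trials=12):
    """Fit untuned CatBoost n_trials times and average validation
    ROC AUC to dampen stochasticity of results."""
    scores = []
    for seed in range(n_trials):
        model = CatBoostClassifier(random_seed=seed, verbose=False)
        model.fit(train, labels)
        probs = model.predict_proba(val)[:, 1]  # positive class probability
        scores.append(roc_auc_score(val_labels, probs))
    return float(np.mean(scores))
```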

For clarity, these experiments were primarily intended to evaluate the impact of different noise profiles on model performance as measured by ROC AUC.

We conducted three experiment configurations:

  1. In the first, parameters were adjusted each in isolation of the others, with flip_prob or sigma varied through 31 settings of 0, 0.005, …, 0.150. These sets were repeated with numeric noise distributions of normal or Laplace, and the categoric sets with weighted or uniform sampling.
  2. In the second round of scenarios we focused on the numeric injections with normal distribution and varied the two parameters sigma and flip_prob together instead of in isolation, with 11 settings for each of 0, 0.01, …, 0.10.
  3. The third extended the first round of experiments to include swap noise injections for comparison to distribution sampling, including swap noise injected to numeric or categoric features.

Results

Experiment 1

In these experiments each parameter was varied in isolation, ranging from 0 (no noise) to 0.15 (which represents either the ratio of injected entries for the flip_prob parameter or the scale of the noise distribution for the sigma parameter), with other parameters set to the default (e.g. when varying flip_prob, sigma was set to 0.03, and when varying sigma, flip_prob was set to 0.03). Noise was injected to either all numeric or all categoric features depending on the noise type. The leftmost chart represents noise injected only to training data, the center chart noise injected only to test data, and the rightmost chart noise injected to both train and test data. Feature sets were composed of all numeric features (blue), all categoric features (orange), or all features (green). For the inspected benchmark data set there was a larger prevalence of numeric than categoric features.

Numeric noise interpretations:

  • One striking feature of the numeric injections was the immediate performance drop-off in the transition from 0 noise to the smallest setting. After this initial impact, increasing noise ratios or noise scale had a closer to linear impact. In some cases the slope of performance drop with increasing noise profiles after the initial drop was negligible.
  • In general, the injections to test data (center charts), in comparison to the injections to train data (leftmost charts), appeared to demonstrate that the noise penalty was slightly higher for test set injections, and that the slope was slightly steeper with increasing noise profiles for test injections.
  • It was an interesting finding that the injections to both train and test data (rightmost charts) had a closely comparable performance profile to the injections to just the test data (center charts).
  • Injections with Laplace distributed noise had closely comparable profiles to Gaussian noise (see the sketch following the figure caption below).

The first two rows here have Gaussian distributed noise, either varying flip_prob (injection ratio) with 0.03 sigma, or varying sigma (noise scale) with 0.03 flip_prob. The third and fourth rows are comparable but with Laplace distributed noise instead of Gaussian.
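
As a brief aside on comparing the two distributions: a Gaussian with standard deviation sigma and a Laplace with scale b = sigma / √2 have matching standard deviations, since the Laplace variance is 2b². A small sketch of that equivalence (whether the library matches standard deviations or applies sigma directly as the Laplace scale is an assumption to verify against the read-me):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.03

gaussian = rng.normal(0.0, sigma, 100_000)
# Laplace variance is 2 * b**2, so b = sigma / sqrt(2) matches the
# Gaussian standard deviation
laplace = rng.laplace(0.0, sigma / np.sqrt(2), 100_000)

print(gaussian.std(), laplace.std())  # both approximately 0.03
```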

Categoric noise interpretations:

  • Similar to numeric noise, the performance penalty was more pronounced with test set injections, and injections to just test data had similar profiles to injections to both train and test data.
  • Unlike the numeric noise, with categoric noise there was no initial drop-off, and the penalty appeared linear through the entire range with increasing noise profiles.
  • The first row demonstrates categoric noise with weighted sampling, the second row categoric noise with uniform sampling. (Referring to sampling from the set of alternate activations for injection targets; see the sketch below.)
  • Please note that these categoric charts have a different y-axis scale than the numeric charts shown above.
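
A minimal sketch of the weighted vs uniform distinction, where weighted sampling draws replacements in proportion to observed frequencies (in the library these weightings are fit to the training set, as noted below; here they are computed from the passed feature for simplicity):

```python
import numpy as np

def inject_categoric_noise(x, flip_prob=0.03, weighted=True, rng=None):
    """Sketch of categoric noise: flip a flip_prob ratio of entries to an
    alternate unique value, sampled either in proportion to observed
    frequencies (weighted) or uniformly."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=object).copy()
    uniques, counts = np.unique(x, return_counts=True)
    if len(uniques) < 2:
        return x  # no alternate activations available
    targets = np.flatnonzero(rng.random(len(x)) < flip_prob)
    for i in targets:
        alternates = uniques != x[i]  # exclude the current entry
        weights = counts[alternates].astype(float) if weighted \
            else np.ones(int(alternates.sum()))
        x[i] = rng.choice(uniques[alternates], p=weights / weights.sum())
    return x
```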

Distribution comparison interpretations:

  • The first two rows here show Gaussian (blue) and Laplace (orange) side by side while varying either the flip_prob (injection ratio) or sigma (noise scale) parameters for all numeric features.
  • For the most part the two curves appear to coincide, with perhaps a very slight Laplace penalty in the traintest scenarios.
  • The third row demonstrates categoric injections comparing weighted sampling (blue) vs uniform sampling (orange) to all categoric features. Clearly weighted sampling had a superior performance profile. Weighted refers to weighting the sampling from alternate categoric unique entries based on their distribution in the training set.

General comments:

  • One interpretation here could be that numeric noise injections can tolerate a larger flip_prob than categoric. This is broadly in line with intuition; after all, numeric noise is centered on the original entry, while categoric noise may sample from the entire range of the feature.
  • As noted above, further findings include that Laplace noise performance is roughly in line with Gaussian. Weighted categoric sampling materially outperforms uniform sampling.

Experiment 2

In these experiments we focused on numeric noise injections with Gaussian distributed noise, with parameter settings varied together for both flip_prob (injection ratio) and sigma (scale), each taking one of the settings 0, 0.01, …, 0.10. Each chart here shows a static sigma value with increasing flip_prob on the x axis and performance on the y axis. Otherwise we used the same feature compositions (just numeric features and all features) and injection targets (train data, test data, or both train and test data). Each row represents an increasing static sigma value.

Interpretations:

  • When we first collected these charts we had omitted the 0 noise case, which yielded the surprising finding that, at the smallest sigma settings, increasing injection ratios to train data actually appeared to have a slight positive slope toward performance. However, adding the 0 noise case revealed that this benefit was more than offset by the initial drop in performance from 0 to the smallest setting.
  • These charts reaffirm the conclusion that injections to both train and test data have comparable performance in comparison to injections to just test data.
  • A more subtle property demonstrated by these charts is that the rate of change of the performance versus flip_prob slope with respect to increasing sigma has a larger absolute magnitude for test injections than for train injections. In other words, test data injections are more sensitive to increasing noise profiles than train injections.
  • The performance versus flip_prob curves for train data injections at sigma = 0.07 were very similar to the performance curves for test data injections at sigma = 0.02. Similarly, the performance curves for train data injections at sigma = 0.10 were somewhat similar to the performance curves for test data injections at sigma = 0.04.
  • We believe these curves demonstrate that our default settings for train data injections for z-score normalized data (0.03 flip_prob and 0.06 sigma) are not unreasonable. Our default for the DP family of transforms is that noise is injected to train but not test data. However, based on these findings, we have decided to update the default distribution settings for test noise from 0.06 to 0.03 sigma, which setting takes effect when test injections are activated by the testnoise parameter.
  • We also inspected the same data presented as a static flip_prob on each chart with sigma on the x-axis. The curves and conclusions were very similar. (Figures not shown here for brevity.)

Experiment 3

In this experiment we inspected an alternate noise type of “swap noise” (whose integration was inspired by a description in the NeurIPS paper “SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning” by Talip Ucar et al). Swap noise refers to replacing an injection target with a random sample from the set of entries in a feature, and it can be applied in a similar manner to both numeric and categoric features. One way to think about swap noise is that it samples from the underlying distribution of the feature. Numeric noise from Gaussian or Laplace distribution sampling will have a more compact noise profile, especially with Gaussian at default settings, as it is centered on the entry. The expectation was that categoric swap noise would perform roughly in line with our current weighted sampling scenario. Swap noise may be more suitable for injections to train data than to test data, since in some cases test data may be prepared with very few samples.
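
A minimal sketch of swap noise as described, replacing injection targets with uniform draws from the feature’s own entries:

```python
import numpy as np

def inject_swap_noise(x, flip_prob=0.03, rng=None):
    """Sketch of swap noise: replace a flip_prob ratio of entries with
    values sampled uniformly from the feature's own entries, i.e. from
    its empirical distribution. Applies to numeric or categoric alike."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x).copy()
    mask = rng.random(len(x)) < flip_prob  # select injection targets
    # the sampling pool is the original (pre-injection) set of entries
    x[mask] = rng.choice(x, size=int(mask.sum()), replace=True)
    return x
```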

The first two rows here represent swap noise injected to sets of all numeric features (blue), all categoric features (orange), and all features (green). The green curves are the same on both charts, with a different y-axis scale.

Interpretations:

  • The swap noise injections to numeric features had a linear degradation profile, more closely resembling what prior experiments realized with categoric injections. This makes sense, since swap noise samples from the entire range of the distribution as opposed to being centered on the entry, which more closely resembles how our categoric noise samples from the entire range of entries.
  • The categoric swap noise had a similar profile to the previously shown categoric weighted sampling, as expected.

The next two rows represent all numeric features (first row) and all categoric features (second row). The first row has Gaussian sampled noise (blue) compared to swap noise (orange). The second row has weighted categoric sampling (blue) compared to swap noise (orange).

Interpretations:

  • In the first row the difference between Gaussian and swap noise is stark with increasing injection ratios, with swap noise much less tolerant to high penetrations. That being said, there are actually a few points at the lowest end of the noise profile where swap noise outperforms Gaussian, owing to swap noise’s linear degradation compared to Gaussian’s initial drop at the smallest setting noted earlier.
  • The second row’s train injection demonstration shows a very similar profile, as expected; however, the test injections found an unexpected disparity between weighted and swap. We believe the cause originates from the fact that categoric weighted sampling is fit to the distribution of the training data, which is also the basis of the trained model, while swap noise applied to test data draws from the distribution of the test data, which apparently in this case did not align with the train data distribution. Interestingly, this disparity was erased when swap noise was applied to both train and test data, which is the first case we found where the traintest injection scenario did not align with the test injection scenario.

Closing

Although there is clearly a small performance impact demonstrated with each of these noise injection scenarios, these experiments were conducted on the original data set without duplication. In our paper Numeric Encoding Options with Automunge we demonstrated that noise injection can serve as a resource for data augmentation, with training set duplicates each prepared with a unique sampled noise injection in order to increase the training set size. That paper found that in a deep learning application such augmentation was neutral to performance with a fully represented data set, and was increasingly beneficial to performance with underserved training data. Please note that the library now has push-button support for data augmentation by noise injection by way of the automunge(.) noise_augment parameter.
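
For reference, a hedged usage sketch (assuming the conventional automunge(.) calling pattern, with noise_augment taken as an integer count of additional noise-injected training duplicates; the file path and label column are hypothetical placeholders, so please verify specifics against the library read-me):

```python
import pandas as pd
from Automunge import *
am = AutoMunge()

# hypothetical placeholder for the raw training data
df_train = pd.read_csv('train.csv')

# prepare the data with two additional noise-injected duplicates of the
# training set appended (assumption: noise_augment counts duplicates)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = am.automunge(df_train,
                                labels_column='target',
                                noise_augment=2)
```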

For further readings please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
