Noise Injections with Automunge

Determinism is overrated

Nicholas Teague
Automunge
14 min read · Oct 17, 2021


Abstract

This paper offers a full introduction to the practice of stochastic noise injections into the features of tabular data fed to machine learning training or inference. Noise injection to the entries of continuous numeric feature sets may be applied by sampling from discrete distributions to select entry injection targets and sampling from continuous distributions for the added noise. Noise injections to the entries of categoric feature sets may be applied by sampling from discrete distributions to select entry injection targets and from separate discrete distributions to identify replacement activations for targeted entries. This paper offers demonstrations for applying noise injection with the Automunge library for tabular data munging. We speculate that machine learning applications that could benefit from noise injections include data augmentation, model perturbations for the aggregation of ensembles, differential privacy, bias mitigation, and other scenarios benefiting from non-determinism.

Introduction

The Automunge python library is intended as a resource for automatically preparing tabular data for machine learning by way of numeric normalizations, categoric binarizations, and missing data infill. When training data is prepared in the automunge(.) function, a compact python dictionary is returned that records the steps and parameters of transformations, which may then serve as a key for preparing additional corresponding data on a consistent basis in the postmunge(.) function, as may include the preparation of streams of data for inference. In addition to preparing data under automation, Automunge may also be applied as a platform for specifying univariate transformations fit to training data properties. The library has a unique API, which includes a set of family tree primitives for simple command line specification of transformation sets that may include generations and branches of derivations. In addition to univariate feature transformations, the library also supports numeric consolidations (via PCA) and categoric consolidations (via our Binary transform). Missing data is automatically imputed by training feature set specific models to infer imputations from properties of the surrounding features, which we call ML infill.

This paper offers a full introduction to Automunge options for noise injection. Noise injection refers to feature set preparations that incorporate sampled stochastic noise in their derivations. Stochastic noise may be sampled from discrete distributions (such as the Bernoulli distribution) or continuous distributions (such as the Gaussian distribution). Noise injections may be applied to all entries in a feature or only to a sampled subset of entries. Noise injections to numeric features may directly apply the sampled distribution or may scale noise based on entry properties to maintain a consistent range of entries before and after injection. Noise injections to categoric features may apply a uniform random draw from the set of unique entries to replace targeted entries with a sampled alternate activation, or may apply a weighted draw based on the distribution of unique entries as found in the training data.

Noise Injection

Noise sampling in the library [1] is built on top of Numpy’s [2] np.random module, which defaults to the PCG [3] algorithm as a pseudo random number generator (which recently replaced the Mersenne twister [4] generator used in earlier implementations). np.random returns an array of samples, where for example samples from a Bernoulli distribution could be a set of 0’s and 1’s, and samples from a Gaussian could be a set of floats. The sampling operations accept parameters specific to their distribution (for a Gaussian, the mean and standard deviation), as well as a specification for the shape of the returned samples, which we match to the shape of the feature targeted for injection.
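As a minimal sketch of this sampling (using numpy directly, with the 0.03 figures matching defaults noted later in this writeup rather than anything specific to the library’s internals):

```python
import numpy as np

# default_rng returns a PCG64-based generator
rng = np.random.default_rng()

# shape matched to the targeted feature, here a hypothetical 1000-entry column
feature_shape = (1000,)

# Bernoulli samples (0/1 entries) to select injection targets
targets = rng.binomial(n=1, p=0.03, size=feature_shape)

# Gaussian samples (float entries) for numeric noise
noise = rng.normal(loc=0.0, scale=0.03, size=feature_shape)
```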

For numeric injections, noise is applied on a scaled version of the feature, which allows for specification of distribution parameters independent of feature properties. For example, when diverse numeric features are z-score normalized to a mean of 0 and a standard deviation of 1, a single noise profile can be specified across all of them. [Eq 1] demonstrates noise injection to a z-score normalized feature. The multiplication of columns for sampled Bernoulli (with 0/1 entries) and Gaussian (with float entries) results in Gaussian noise injected only to entries targeted by the Bernoulli sampling.

Equation 1: Noise injection to z-score normalized feature
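A minimal sketch of the form of [Eq 1] with numpy, assuming an input array that has already been z-score normalized (the function name and default values are for illustration):

```python
import numpy as np

def zscore_noise_injection(x, flip_prob=0.03, sigma=0.03, rng=None):
    """Inject Gaussian noise into a Bernoulli-sampled subset of entries
    of a z-score normalized feature, per the form of [Eq 1]."""
    rng = rng if rng is not None else np.random.default_rng()
    targets = rng.binomial(1, flip_prob, size=x.shape)  # injection targets
    noise = rng.normal(0.0, sigma, size=x.shape)        # Gaussian noise
    return x + targets * noise
```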

The z-score injection [Eq 1] may have the potential to increase the maximum value found in the returned feature set or decrease the minimum value. In some cases, applications may benefit from retention of feature properties before and after injection. When a feature set is scaled with a normalization that produces a known range of values, as is the case with min-max scaling (which scales data to the range 0–1 inclusive), it becomes possible to scale noise as a function of entry properties to ensure a retained range of values after injection [Eq 2]. Other normalizations with a known range other than 0–1 (such as our ‘retn’ scaling) can be shifted to the range 0–1 prior to injection and then reverted afterward for a comparable effect. (Please note that since this results in a noise distribution derived as a function of feature properties, the sampled noise mean is adjusted to more closely approximate a zero mean for the scaled noise.)

Equation 2: Noise injection to min-max normalized feature with scaled noise
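As a rough sketch of the idea behind [Eq 2] (an assumption of one valid scaling rather than the library’s exact formula, and omitting the mean adjustment noted above): positive noise can be scaled by an entry’s headroom to 1 and negative noise by its distance from 0, keeping injected entries within the 0–1 range.

```python
import numpy as np

def minmax_noise_injection(x, flip_prob=0.03, sigma=0.03, rng=None):
    """Sketch of noise injection to a min-max scaled feature (entries in
    [0, 1]) with noise scaled by entry properties to retain range."""
    rng = rng if rng is not None else np.random.default_rng()
    targets = rng.binomial(1, flip_prob, size=x.shape)
    # clip to +/-1 so the scaled result is guaranteed to stay in range
    noise = np.clip(rng.normal(0.0, sigma, size=x.shape), -1.0, 1.0)
    # scale positive noise by headroom (1 - x) and negative noise by x
    scaled = np.where(noise > 0, noise * (1 - x), noise * x)
    return x + targets * scaled
```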

The injection of noise into categoric features is realized by sampling from discrete distributions. For boolean integer categoric sets (as applied to features with 2 unique values in our ‘bnry’ transform), in one configuration the injection may be applied by directly applying a Bernoulli 0/1 sample to flip targeted activations [Eq 3], although our base configuration applies a comparable approach as for ordinal encodings in order to take advantage of weighted replacements. For ordinal integer encoded categoric sets, a Bernoulli 0/1 sampling is applied to select injection targets, and a random choice sampling is applied to select alternate activations from the set of unique entries found in the training data for that feature [Eq 4]. In our base configuration, the sampling of alternate activations is weighted by the frequency of those activations as found in the training data, which can be deactivated in favor of a uniform draw for a potential latency benefit. Having injected noise, a downstream transform can then be applied to convert from ordinal to some other form, such as one hot encoding or binarization. We also have an alternate noise injection transform that can be applied directly downstream of multi-column categoric encodings for a comparable result.

Equation 3: Boolean integer activation flip
Equation 4: Ordinal integer activation replacement
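A sketch of the ordinal form [Eq 4] with numpy, where the unique entries and their weights are assumed to have been derived from the training data (the function name is for illustration):

```python
import numpy as np

def ordinal_noise_injection(x, unique_entries, weights, flip_prob=0.03, rng=None):
    """Replace a Bernoulli-sampled subset of ordinal activations with
    alternate activations drawn per training data frequencies."""
    rng = rng if rng is not None else np.random.default_rng()
    targets = rng.binomial(1, flip_prob, size=x.shape).astype(bool)
    # weighted draw of replacement activations (pass p=None for a uniform draw)
    replacements = rng.choice(unique_entries, size=x.shape, p=weights)
    return np.where(targets, replacements, x)
```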

Noise Options

In the preceding section we demonstrated noise injections that made use of the Bernoulli distribution for sampling injection targets, the Gaussian distribution for sampling numeric noise, and weighted random choice for sampling categoric noise. We have a few variations available for the derivation of numeric noise. A Laplace distribution may be applied as a direct drop-in replacement for the Gaussian, a practice simplified by Numpy’s use of comparable distribution parameters for both. The Laplace distribution, aka the double exponential distribution, has a sharper distribution peak than the Gaussian, meaning more samples will be within close proximity to the mean, but also thicker tails, meaning more outlier entries may be sampled. One way to think about it is that exponential tails can be considered a kind of boundary between what would be considered thin or thick tails in other distributions. We expect there may prove to be benefit from other noise profile distributions in certain applications; this is an open area for future research.

There may be some applications where a user prefers noise to be injected to a feature as all positive noise (for increased targeted entry values) or all negative noise (for decreased targeted entry values). This could be relevant to applications where the direction of transience carries added significance. This option is available in the library through distribution parameter selection. In the context of machine learning applications, there will likely be tradeoffs associated with uni-directional noise: with a non-zero mean, the injected positive signed noise won’t be countered by negative signed noise, which could introduce bias.
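In numpy terms, these variations reduce to the choice of sampling call. A sketch, where the interpretation of the ‘abs_normal’ option noted later (an absolute value applied to Gaussian samples) is our assumption based on the parameter naming:

```python
import numpy as np

rng = np.random.default_rng()
shape = (1000,)  # hypothetical feature shape

gaussian = rng.normal(loc=0.0, scale=0.03, size=shape)   # default noise profile
laplace = rng.laplace(loc=0.0, scale=0.03, size=shape)   # drop-in replacement
positive = np.abs(rng.normal(0.0, 0.03, size=shape))     # uni-directional noise
```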

In regards to categoric noise injections, the defaulted use of weighted sampling of alternate activations based on the frequency of entries in the training data feature set was a recent addition to the library, and initial benchmarks have demonstrated that weighted categoric noise materially outperforms uniform sampling. It is our expectation that weighted sampling is particularly beneficial in cases of feature set entry imbalance. The np.random.choice parameter documentation upon which this is built implies there may be latency impacts associated with the practice in comparison to uniform sampling. Both the weighted and uniform scenarios are available based on parameter selection.
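The distinction between the two scenarios reduces to the p argument of the choice sampling, for example:

```python
import numpy as np

rng = np.random.default_rng()
entries = np.array([0, 1, 2])       # unique ordinal activations
freqs = np.array([0.7, 0.2, 0.1])   # frequencies found in training data

weighted = rng.choice(entries, size=1000, p=freqs)  # base configuration
uniform = rng.choice(entries, size=1000)            # potential latency benefit
```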

Noise injections in the library are applied in conjunction with other preparations; for example, noise injections for numeric features are applied in conjunction with normalizations and scaling, and noise injections to categoric features are applied in conjunction with integer encodings. Thus root categories for noise injection transformations can be considered a drop-in replacement for the corresponding encoding [Table 1].

Table 1: Noise injection root categories

Train and Test Data

One of the key distinctions in the library with respect to data preparation is the difference between training data and test data. In a traditional supervised learning application, training data would be used to train a model, and test data would be used for validation or inference. When Automunge prepares training data, in many cases it fits transformations to the properties of a feature as found in the training data, which are then used for preparing that feature in the test data on a consistent basis. The automunge(.) function for initial preparations accepts training data and, optionally, additional test data. The postmunge(.) function for preparing additional data assumes that received data is test data.

In the context of noise injections, the train/test distinction comes into play. Our default configuration is that noise is injected to training data and not injected to test data. This was built around use cases of applying noise during model training, such as data augmentation, differential privacy, and model perturbation in the aggregation of ensembles. We thus assumed that test data, as intended for inference, may not have comparable benefit from stochasticity. As we’ve continued to iterate, we came to realize that there may actually be scenarios for noise injection to test data as well as train data, or perhaps even just to test data and not to train data. After all, by injecting noise into an inference basis a user can make model predictions non-deterministic.

We thus have a few relevant parameters for distinguishing between these scenarios. In the base configuration, the training data set returned from automunge(.) receives noise when relevant transforms are applied, while the corresponding features do not receive noise in test data, including test data sets returned from either automunge(.) or postmunge(.). To treat test data passed to postmunge(.) as training data, postmunge has the traindata parameter, which can be turned on and off as desired with each postmunge call. To configure a transformation’s default for applying injected noise to train or test data, parameters can be passed to specific transformations as applied to specific columns with the automunge(.) assignparam parameter. The noise injection transforms accept a trainnoise specification (defaulting to True) signaling whether noise will be injected to training data, and a testnoise specification (defaulting to False) signaling whether noise will be injected to test data. Please note that these assignparam parameters, once specified in an automunge(.) call, are retained as the basis for preparing additional data in postmunge(.). If validation data is prepared in automunge(.), it is treated comparably to test data.
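A hedged sketch of these options (am stands for an instantiated Automunge class per the library’s import convention; the four-item postmunge(.) return signature is our recollection of the read me and should be verified there):

```python
# treat test data passed to postmunge(.) as training data, so that
# transforms with trainnoise=True will inject noise
test, test_ID, test_labels, postreports_dict = am.postmunge(
    postprocess_dict, df_test, traindata=True)

# or configure a specific transform to inject noise to test data by default
assignparam = {'DPnb': {'<targetcolumn>': {'testnoise': True}}}
```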

Table 2: Injection scenarios for train and test data

Automunge Demonstrations

Noise injection root categories are assigned to targeted input columns with the assigncat automunge(.) parameter, and once assigned will be carried through as the basis for postmunge(.). Here we demonstrate assigning DPnb as the root category for a list of numeric features, DPod for a list of categoric features, and DPmm for a specific targeted numeric feature.
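A sketch of that specification (the column header strings are hypothetical placeholders):

```python
assigncat = {
    'DPnb': ['<numericcolumn1>', '<numericcolumn2>'],
    'DPod': ['<categoriccolumn1>', '<categoriccolumn2>'],
    'DPmm': '<targetcolumn>',
}
```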

To default to applying noise injection under automation, one approach could be to overwrite the family trees associated with root categories applied under automation. For example, the default root category applied to numeric features is the ‘nmbr’ category; we can overwrite that family tree with the transformdict automunge(.) parameter, resulting in the updated family tree being applied when the nmbr category is assigned under automation. The various family trees in the library are detailed in the documentation for reference.
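As a hedged sketch of such an overwrite (the primitive names follow the library’s family tree convention, and the entries shown, including the NArw missing data marker aggregation, are an assumption of one valid configuration rather than the documented default):

```python
transformdict = {
    'nmbr': {
        'parents': [],
        'siblings': [],
        'auntsuncles': ['DPnb'],  # noise injection variant replaces z-score
        'cousins': ['NArw'],      # aggregate missing data markers
        'children': [],
        'niecesnephews': [],
        'coworkers': [],
        'friends': [],
    }
}
```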

Another approach for applying noise injections under automation could be to take advantage of the automunge(.) powertransform parameter, which is used to select between scenarios for the default transformations applied under automation. powertransform accepts specification as ‘DP1’ or ‘DP2’, resulting in automated encodings that apply noise injection, as further detailed in the read me powertransform parameter writeup.
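For example, abbreviating the returned sets (per the read me, postprocess_dict is the final item of the returned tuple):

```python
# apply noise injection defaults under automation
returned_sets = am.automunge(df_train, powertransform='DP1')
postprocess_dict = returned_sets[-1]
```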

Noise injection parameters can be passed to transformation functions through the automunge(.) assignparam parameter, which will then be carried through as the basis for preparing additional data in postmunge. In order of precedence, parameter assignments may be designated targeting a transformation category as applied to a specific column header with suffix appenders, a transformation category as applied to an input column header (which may include multiple instances), all instances of a specific transformation category, all transformation categories, or may be initialized as default parameters when defining a transformation category.

Here we demonstrate passing three different kinds of assignparam specifications, consolidated in the sketch following this list.

  • ‘global_assignparam’ passes a specified parameter to all transformation functions applied to all columns; if a function does not accept that parameter it is simply ignored. In this demonstration we turn on test noise injection for all transforms via the ‘testnoise’ parameter.
  • ‘default_assignparam’ passes a specified parameter to all instances of a specified category. Here we demonstrate updating the ‘flip_prob’ parameter from its 0.03 default for all instances of the DPod transform, which represents the ratio of entries that will be targeted for injection.
  • To target parameters to specific categories as applied to specific columns, one can specify as {category : {column : {parameter : value}}}. Here we demonstrate targeting the application of the DPmm transform to a column ‘<targetcolumn>’ in order to apply all positive signed noise injections by setting the ‘noisedistribution’ parameter to ‘abs_normal’, and also reducing the standard deviation of the injections from the default of 0.03 to 0.02 with the ‘sigma’ setting.
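A sketch of the combined specification (the 0.06 value for ‘flip_prob’ is a hypothetical choice, since the target value wasn’t stated above):

```python
assignparam = {
    # passed to all transformation functions on all columns
    'global_assignparam': {'testnoise': True},
    # passed to all instances of the DPod transform
    'default_assignparam': {'DPod': {'flip_prob': 0.06}},
    # targeted to the DPmm transform as applied to a specific column
    'DPmm': {'<targetcolumn>': {'noisedistribution': 'abs_normal',
                                'sigma': 0.02}},
}
```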

Having defined our assignparam specification dictionary, it can then be passed to the automunge(.) assignparam parameter. As a caveat, it’s important to keep in mind that targeting a category for assignparam specification is based on that category’s use as a tree category (as opposed to its use as a root category), which in some cases may be different. The read me documentation on noise injection details the cases where the tree category accepting a noise injection parameter may differ from the root category, such as for a few of the categoric injection transforms.

Having defined our relevant parameters, we can then pass them to an automunge(.) call.
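A hedged sketch of such a call, assuming the import convention and the ten-set return signature from the read me (which should be verified there, as signatures may vary between versions):

```python
import pandas as pd
from Automunge import *
am = AutoMunge()

df_train = pd.read_csv('train.csv')  # hypothetical training data

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = am.automunge(
    df_train,
    assigncat=assigncat,
    transformdict=transformdict,
    assignparam=assignparam,
)
```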

In addition to preparing our training data and any validation or test data, this function also populates the postprocess_dict dictionary, which we recommend downloading with pickle if you intend to train a model with the returned data (pickle code demonstrations are provided in the read me). The postprocess_dict can then be uploaded in a separate notebook to prepare additional corresponding test data on a consistent basis, as may be used for inference.
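A standard pickle round trip looks like the following (file name hypothetical, and the postmunge(.) return signature again hedged per the read me):

```python
import pickle

# save the returned postprocess_dict alongside the trained model
with open('postprocess_dict.pickle', 'wb') as f:
    pickle.dump(postprocess_dict, f)

# later, in a separate notebook, reload it to prepare data for inference
with open('postprocess_dict.pickle', 'rb') as f:
    postprocess_dict = pickle.load(f)

test, test_ID, test_labels, postreports_dict = am.postmunge(
    postprocess_dict, df_test)
```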

Applications

Noise injections originated in the library in hope of supporting differential privacy, which refers to machine learning applications where noise may be injected into various segments of supervised learning in order to mask specific training data entries from recovery with inference. As we began to learn more about model ensembles, we realized this type of training data stochasticity could become a useful resource for perturbing trained models in the aggregation of ensembles, although we have yet to test this hypothesis. In our paper Numeric Encoding Options with Automunge we validated numeric noise injections for use towards data augmentation in a deep learning tabular application, and found that this form of data augmentation increasingly benefited model performance in scenarios of underserved training data, and was otherwise fairly benign towards model performance with the fully represented data set. Subsequent to that paper, we also conducted a similar pipeline to validate categoric noise injections for use towards data augmentation and found benefit to model performance, although this result wasn’t published.

We noted in the abstract that one potential benefit of injected stochasticity could be associated with bias mitigation. This thought occurred to me when attending a Black in AI social at ICLR 2020, and we found some reinforcement from NeurIPS reviews of our paper Missing Data Infill with Automunge, which review comments partly inspired our incorporation of stochasticity into derived imputations. Specifically, one of our reviewers noted that in the context of stochastic versus deterministic derivations of imputations, deterministic imputation often leads to bias in the downstream analysis, citing (Little, 2019) [5]. We believe it is a reasonable extension of this finding that if stochasticity in derived imputations benefits bias mitigation, then stochasticity in meta model inference operations should likewise benefit bias mitigation.

A helpful way to think about what is taking place with stochastic noise injections, particularly with injections into test data, is that one is lifting determinism from an inference operation. The same feature inputs applied across multiple inference inspections may return different results. We speculate that this may reduce susceptibility to adversarial example [6] extractions by third parties with exposure to a model, as could be addressed by channeling any inference inquiry through a locally managed postmunge(.) application. There are likely some scenarios where determinism may be desired, as when a concrete finding is preferred. We believe there are potentially many scenarios where non-deterministic inference should be considered as the default.

Future Research

There are several opportunities for further research in this domain. It is probably worth noting that a prior literature review has not been extensively performed for this discussion. We are software developers first; if we spent all of our time reading papers it would interfere with building. That being said, we expect it is probably an open research question as to how much noise can be tolerated in the context of model training or inference. We expect there are likely some existing studies performed in applications of differential privacy for noise injection. The benefit of Gaussian noise versus Laplace noise probably deserves some attention; a blog post from John Cook suggested that Laplace may be more appropriate for differential privacy. We expect this question may not have a single answer and may instead require consideration of the application, such as potential outlier impacts towards safety, bias, etc. There are probably scenarios where other distributions may be beneficial as well. We’ve noted in this write-up several applications that we have not had resources to extensively validate, including data augmentation, model perturbation for the aggregation of ensembles, and bias mitigation.

Automunge is open source software. It is available for free install and use by data scientists looking to automate data prep for tabular learning. The goal is to provide as much value as we can to the machine learning community with intent to commercialize on adjacent services. We welcome inquiries from potential collaborators in research and industry who may find benefit from such applications. Our contact information is provided at automunge.com.

Glass’ 19th Etude — Nicholas Teague

References

[1] Nicholas Teague. Numeric Encoding Options with Automunge. https://medium.com/automunge/string-theory-acbd208eb8ca, (2020).

[2] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, & Travis E. Oliphant. Array programming with NumPy. Nature, 585, 357–362, (2020).

[3] Melissa E. O’Neill. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation. https://www.cs.hmc.edu/tr/hmc-cs-2014-0905.pdf (2014).

[4] M. Matsumoto, & T. Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation. 8 (1): 3–30, (1998).

[5] R. J. Little, & D. B. Rubin. Statistical analysis with missing data (Vol. 793). John Wiley & Sons, (2019).

[6] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep Learning. MIT Press, (2016).

Intellectual Property Disclaimer

Automunge is released under GNU General Public License v3.0. Full license details are available on GitHub. Contact available via automunge.com. Copyright © 2021. All Rights Reserved. The Automunge library is Patent Pending, including applications 16552857 and 17021770.

For further readings please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
