DataPerf ML with Automunge

ICML 2022, DataPerf Workshop

Nicholas Teague
Automunge
4 min read · Jun 14, 2022

--

The following presentation is based on the paper Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge, to be presented at the ICML 2022 DataPerf workshop. We welcome inquiries and opportunities; contact information is available at automunge.com.

Stochastic Perturbations poster

Transcript

Hello. This is Nicholas Teague, the developer of Automunge, a Python library that encodes dataframes for machine learning.

Automunge encodes dataframes with normalizations and binarizations. The encodings can be automated or custom defined, and are fit to a training data basis. The library includes machine learning derived imputations, detects data distribution drift, and can recover a prior form by inversion. It is built on Pandas dataframes and numpy.random distributions. This paper introduces a new method to channel quantum sampled perturbations into features for non-deterministic inference, representing a new way to route quantum algorithms into classical learning.
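As a rough sketch of this workflow (simplified for illustration; the full automunge(.) signature accepts many additional options, so please defer to the documentation at automunge.com for current details):

```python
import pandas as pd
from Automunge import *

am = AutoMunge()

# hypothetical file paths for illustration
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# automunge(.) prepares the training data and populates a postprocess_dict
# capturing the fit basis of the transformations
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train, labels_column='target')

# postmunge(.) then consistently prepares additional data on that same basis
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, df_test)
```

The returned postprocess_dict is the key artifact, capturing the training data basis so that subsequent data can be prepared consistently.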

Non-deterministic inference refers to the injection of stochastic noise into features at inference time. Most prior work on noise injections has targeted training data, for purposes of regularization, data augmentation, differential privacy, or adversarial robustness. The prior work on inference noise has primarily considered adversarial robustness. We propose that in addition to adversarial robustness, inference noise may benefit fairness by exposing a broader range of possible inference scenarios, and may also benefit adjacent quantum computing applications.

These noise injections, which we refer to as stochastic perturbations, may be sampled from several types of distributions. For numeric features we suggest sampling from a Gaussian or Laplace distribution, channeled to a subset of entries based on a Bernoulli sampling. The scaling of that noise can be specified independently of feature properties by injecting into a normalized feature. For categoric data, we suggest weighted activation flips drawn by a choice sampling, again channeled to a subset based on a Bernoulli sampling, as sketched below.
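Here is a minimal sketch of these sampling mechanics in plain NumPy; the flip_prob and sigma defaults are illustrative placeholders rather than the library's defaults:

```python
import numpy as np

rng = np.random.default_rng()

def perturb_numeric(x, flip_prob=0.03, sigma=0.06):
    # Bernoulli sampling selects the subset of entries receiving noise
    mask = rng.binomial(1, flip_prob, size=x.shape).astype(bool)
    # Gaussian noise scaled independently of feature properties, since the
    # target feature is assumed already normalized
    noise = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return np.where(mask, x + noise, x)

def perturb_categoric(x, categories, flip_prob=0.03):
    # Bernoulli sampling again selects the injection subset
    mask = rng.binomial(1, flip_prob, size=x.shape).astype(bool)
    # choice sampling draws replacement activations (uniform weights shown;
    # weighted flips follow the same pattern via the p argument)
    flips = rng.choice(categories, size=x.shape)
    return np.where(mask, flips, x)
```

Note that entries outside the Bernoulli subset pass through unchanged, which helps keep the performance impact small at low injection ratios.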

Our appendix includes several benchmark sensitivity analysis trials, spanning gradient boosting and neural network platforms. We found a distinction between the two: gradient boosting appeared to better tolerate noise applied only to test data at inference, with a small performance impact at low scales. Neural networks, on the other hand, benefited from injecting comparable noise into both train and test data, with the regularization benefit improving with increasing noise scale at a low injection ratio.

To access purer randomness profiles, we suggest sampling stochastic noise from quantum circuits, since quantum entropy is closer to i.i.d. than the output of a pseudo random number generator. We note several types of quantum circuits that can serve this purpose. The sampling can be conducted through a numpy.random formatted generator, as is available through the QRAND library, or Automunge can accept an integer array of externally sampled entropy seeds. Automunge is thus agnostic to the quantum hardware ecosystem.
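Here is a minimal sketch of channeling external entropy through this interface, assuming the entropy_seeds and random_generator parameter names from the paper; a classical PCG64 generator stands in for a quantum BitGenerator such as QRAND's, and the seed values are placeholders:

```python
import numpy as np
import pandas as pd
from Automunge import *

am = AutoMunge()
df_train = pd.DataFrame({'feature': [0.1, 0.5, 0.9], 'target': [0, 1, 0]})

# integers sampled externally, e.g. from a quantum circuit (placeholder values)
entropy_seeds = [811, 4052, 93, 2217]

# any numpy.random formatted generator can replace the default PRNG; QRAND
# wraps quantum hardware behind this same BitGenerator interface, with a
# classical PCG64 standing in here for illustration
random_generator = np.random.Generator(
    np.random.PCG64(np.random.SeedSequence(entropy=entropy_seeds)))

# per the paper, either form can be passed to automunge(.) or postmunge(.)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train, labels_column='target',
             entropy_seeds=entropy_seeds,
             random_generator=random_generator)
```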

For additional information, including links to documentation, tutorials, and essays, please check out automunge.com. Thank you.

For further reading please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
