A Library of Contributions
We had a good experience interacting with reviewers for a recent conference submission associated with our paper Missing Data Infill with Automunge; the following notes are inspired by a few of those exchanges.
The full scope of Automunge
Automunge attempts to consolidate the full range of the tabular learning workflow between two boundaries: 1) received “tidy data” (one column per feature and one row per sample) and 2) returned sets suitable for direct application of machine learning, channeled through the single interface of a preprocessing platform built on top of the Pandas dataframe library. A helpful way to think of Automunge is that, in addition to missing data infill, it is a platform for applying univariate data transformations to tabular feature sets that may be fit to properties of a training set for a consistent basis on additional data. Feature set transformations may be fairly simple, such as numeric normalization and categoric binarization under automation, or may be more elaborate.
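The principle of fitting transformations to training set properties for a consistent basis on additional data can be illustrated with a minimal sketch. This is not the Automunge interface, just the underlying idea; the helper names are hypothetical:

```python
import pandas as pd

# A minimal sketch of the principle (not the Automunge interface):
# a transform is "fit" to properties of the training set, and the
# recorded basis is reused to prepare additional data consistently.
def fit_normalization(train_col):
    # record the training set properties that define the transform basis
    return {"mean": train_col.mean(), "std": train_col.std()}

def apply_normalization(col, basis):
    # apply to any additional data using the stored training set basis
    return (col - basis["mean"]) / basis["std"]

train = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"feature": [2.0, 5.0]})

basis = fit_normalization(train["feature"])
train["feature"] = apply_normalization(train["feature"], basis)
test["feature"] = apply_normalization(test["feature"], basis)
```

The key point is that the test data is scaled on the training set basis rather than its own, so the encoding remains consistent between train and inference time.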
Data transformations may be assigned in sets that include generations and branches of derivations, such as to present feature sets to ML training in multiple configurations of varying information content. Transformations fit to properties of a train set may be applied under automation (such as based on an evaluation of data or distribution properties), may be sourced from an extensive internal library, or may be custom defined by users with a very simple template; custom transformations can then be integrated into a pushbutton operation for preparing streams of data, including transformations and imputations. We gave several examples of potential data transformations in section 3, including our own “parsed categoric encoding” (an alternative to one hot encoding) in which categoric features are vectorized based on grammatical structure shared between entries. We also noted several other benefits of channeling preprocessing through our interface, such as automated measurement of data distribution drift and an option for pushbutton inversion of data transformations.
The integration of ML infill into a preprocessing platform enables several benefits not available to other ML imputation libraries, such as the ability to recognize features encoded in multicolumn representations and the ability to automatically account for potential data leakage between redundantly encoded feature sets.
New early stopping criteria now implemented
ML infill is an important component of the library and was the main focus of this paper, but a more accurate representation is that, in addition to missing data infill, the core of Automunge is a platform for applying univariate data transformations to tabular feature sets that may be fit to properties of a training set for a consistent basis on additional data. We took the approach of building a preprocessing pipeline standard from the ground up (on top of the Pandas dataframe library); we have not sought to integrate into the Scikit-learn interface, so please consider Automunge an alternative to Scikit-learn for data transformation pipelines. We suggest applying our library as a precursor to any subsequent model training incorporating elements like cross-validation and grid search.
We’ve tried to offer for ML infill different autoML library scenarios with different strengths, a topic not addressed at depth in the paper. Scikit-learn’s random forest was selected as the default for its simplicity, latency, and tendency not to overfit. We have seen benchmarks suggesting that AutoGluon will likely outperform random forest, albeit with a high disk storage requirement associated with model ensembles and reduced latency performance. We included the FLAML library for the simplicity of setting a maximum training duration for hyperparameter tuning, as well as its claimed latency performance. We selected CatBoost as a gradient boosting option and for GPU support. Partly we did not feel we had sufficient rigor in our benchmarking to include discussion of this topic in the paper. We believe an autoML library’s performance against generic tabular learning benchmarks should be a good proxy for its sufficiency as an imputation model basis, and we have tried to offer some diversity for user choice.
We took reviewer input as a clear signal that our library would benefit from increased sophistication in the stopping criteria for imputation iterations, which were previously based on specifying a hard coded number of rounds. We have thus now rolled out our own version of early stopping for ML infill iterations. Our approach compares the imputations derived in the current iteration to those of the preceding iteration, deriving one metric in aggregate for all numeric features and another in aggregate for all categoric features, with stopping conducted when both the numeric metric and the categoric metric fall within a configurable tolerance. Our numeric criteria has some similarity to the approach of Scikit-learn’s IterativeImputer, although we apply a different denominator in the formula (we believe the IterativeImputer formula may lose validity in the presence of distribution outliers in their denominator set), and our categoric stopping criteria has some similarity to the MissForest approach, although we evaluate the metric against a tolerance instead of the sign of the rate of change. More particularly, our numeric halting criteria compares, for each numeric feature, the ratio of max(abs(delta)) between imputation iterations to mean(abs(entries)) of the current iteration; these ratios are then weighted between features by the quantity of imputations associated with each feature and compared to a numeric tolerance value. The categoric halting criteria compares the ratio of the number of unequal imputations between iterations to the total number of imputations across categoric features against a categoric tolerance value.
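The halting criteria above can be sketched as follows. This is a simplified illustration with names of our own choosing, not the library's internal implementation, and the tolerance defaults shown are placeholders:

```python
import numpy as np

def numeric_stop_metric(prev_imputations, curr_imputations):
    """For each numeric feature: ratio of max(abs(delta)) between
    iterations to mean(abs(entries)) of the current iteration,
    weighted across features by imputation counts."""
    ratios, weights = [], []
    for prev, curr in zip(prev_imputations, curr_imputations):
        prev, curr = np.asarray(prev, float), np.asarray(curr, float)
        ratios.append(np.max(np.abs(curr - prev)) / np.mean(np.abs(curr)))
        weights.append(len(curr))
    return np.average(ratios, weights=weights)

def categoric_stop_metric(prev_imputations, curr_imputations):
    """Ratio of unequal imputations between iterations to the total
    number of imputations across categoric features."""
    unequal = total = 0
    for prev, curr in zip(prev_imputations, curr_imputations):
        prev, curr = np.asarray(prev), np.asarray(curr)
        unequal += int(np.sum(prev != curr))
        total += len(curr)
    return unequal / total

def should_stop(num_metric, cat_metric, num_tol=0.03, cat_tol=0.03):
    # stop when both metrics fall within their configurable tolerances
    return num_metric < num_tol and cat_metric < cat_tol
```

Here each argument is a list of per-feature imputation arrays, one entry per feature, comparing the current iteration against the preceding one.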
Imputations are now non-deterministic
In our experiments we tried to simulate a wide range of missing data assumptions for validation against each. We engineered missing data injection scenarios of missing (completely) at random and missing not at random in targeted numeric and categoric features, each selected for their influence on downstream model performance by way of an Automunge feature importance evaluation by shuffle permutation. Our goal was to validate the relative benefit of ML infill and NArw aggregation against each missing data assumption, such as to demonstrate that defaulting to these methods will often reach at or near top performance against the other imputation scenarios independent of assumptions, relying on the model training to navigate any peculiarities of a particular feature / missing data assumption.
We have recently rolled out some novel methods for incorporating stochasticity into our derived imputations, partly in response to reviewer feedback. First, we have introduced a default convention that learning algorithms accepting random seeds are fed a randomly sampled seed with each application, as opposed to the prior convention of a uniform random seed applied between each. This introduces some stochasticity into model training; while still resulting in a deterministic imputation basis, it at least encourages some diversity between iteration cycles. Second, we have introduced an option for a novel method of stochastic noise injection into derived imputations prior to insertion. The noise injections were partly inspired by some of our existing data transformations with noise injection available for uses like data augmentation and differential privacy (these noise injection transforms are discussed in the cited paper Numeric Encoding Options with Automunge). We refer to this new method as “stochastic_impute”, which can be activated for noise injections to imputations associated with numeric or categoric features. Numeric noise injections sample from either a default normal distribution or optionally a laplace distribution. The default noise profile is mu=0, sigma=0.03, and flip_prob=0.06 (where flip_prob is the ratio of a feature set’s imputations receiving injections), each of which can be custom configured. Please note that this noise scale applies to a min/max scaled representation of the imputations, which after injection are converted back to their prior form, where the min/max scaling is based on properties of the feature in the training set. Noise outliers are capped and the noise distribution is scaled to ensure that the range of resulting imputation values remains consistent with the feature properties in the training set.
Categoric noise injections sample from a uniform random draw from the set of unique activation sets in the training data (as may include one or more columns of activations for categoric representations), such that for a ratio of a feature set’s imputations based on the flip_prob (defaulting to 0.03 for categoric), each target imputation activation set is replaced with the randomly drawn activation set. We make use of numpy.random for distribution sampling in each case. (Part of the reason we preferred deterministic imputations previously was to support our software rollout validations, which compare outputs of train and test data to ensure consistency; stochasticity somewhat interferes with that.)
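A rough sketch of the described noise injections under the stated defaults. This is an illustration, not the library's implementation; the function names are hypothetical, and the simple clipping step stands in for the library's capping and rescaling of noise outliers:

```python
import numpy as np

rng = np.random.default_rng()

def numeric_noise(imputations, train_min, train_max,
                  mu=0.0, sigma=0.03, flip_prob=0.06):
    """Sketch of numeric stochastic_impute: inject gaussian noise into a
    min/max scaled representation of the imputations for a flip_prob
    ratio of entries, then invert the scaling."""
    span = train_max - train_min
    scaled = (np.asarray(imputations, float) - train_min) / span
    mask = rng.random(len(scaled)) < flip_prob      # subset receiving noise
    noise = rng.normal(mu, sigma, len(scaled))
    scaled = np.where(mask, scaled + noise, scaled)
    scaled = np.clip(scaled, 0.0, 1.0)              # cap to training set range
    return scaled * span + train_min

def categoric_noise(imputations, unique_activation_sets, flip_prob=0.03):
    """Sketch of categoric stochastic_impute: for a flip_prob ratio of
    imputations, replace the activation set with a uniform random draw
    from the unique activation sets observed in the training data."""
    out = [list(row) for row in imputations]
    for i in range(len(out)):
        if rng.random() < flip_prob:
            draw = rng.integers(len(unique_activation_sets))
            out[i] = list(unique_activation_sets[draw])
    return out
```

Because injections are applied in the min/max scaled representation and then clipped, the returned imputations stay within the value range observed for the feature in the training set.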
We tried to balance a few agendas with this paper. One was to offer an introduction and validation of ML infill with justification for casting it as a new default in the library. Another was to serve as an introduction to a fully featured preprocessing library that may serve as a replacement for, for instance, Scikit-learn data pipelines, including demonstrations of code. We tried to clarify in the abstract that we considered one of the contributions of this paper to be the integration of ML imputation into a preprocessing platform, a platform which has yet to be published in any formal venue. For more details on the derivations, intuitions, and analysis please refer to any of the many preprints that we cited covering this territory, including for instance the papers Parsed Categoric Encodings with Automunge and Numeric Encoding Options with Automunge. Outside of the preprints cited there is also a much more expansive backlog of writings covering similar material, but given their informal nature we decided not to reference them in this venue.
Integration of ML imputations into a preprocessing platform
One of the contributions of this paper is that we are not only introducing ML infill, but also introducing the integration of ML imputations into a novel tabular preprocessing platform for applying univariate data transformations to the features of a tabular data set, as may be fit to properties of features in a training set for a consistent basis on corresponding additional data. Data transformations can be sourced from an extensive library or custom defined with a simple template, and may be applied under automation or with custom engineered data pipelines. In short, we prepare tabular data for machine learning.
The benefits associated with integrating ML imputations into a preprocessing platform include the ability to recognize features encoded in multicolumn representations and the ability to automatically account for potential data leakage between redundantly encoded feature sets. Our ability to apply imputations to received raw data based on an initial automated encoding was also novel when implemented, although it appears there is now a similar capability, in a much reduced fashion, in the DataWig imputation library that another reviewer noted. Our library also predates IterativeImputer, which based on our review of their documentation appears to limit infill to continuous numeric features with regression models.
Automunge allows imputations to mixed data sets that include both numeric and categoric features. We assume data is received in tidy form, shorthand for tabular data with one column per feature and one row per observation. However, data returned from Automunge may not strictly adhere to the tidy data principle, as features may be redundantly encoded in multiple configurations (such as configurations of varying information content) and those configurations may include multicolumn representations (for example a categoric binarization).
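As an illustration of a multicolumn representation, a categoric binarization encodes N unique entries into ceil(log2(N)) activation columns, versus N columns for a one hot encoding. This is a sketch of the general technique, not the library's encoder; the helper name is hypothetical:

```python
import math
import pandas as pd

def binarize(col):
    """Sketch of categoric binarization: map each unique entry to a
    distinct binary code spread across ceil(log2(N)) activation columns."""
    categories = sorted(col.unique())
    width = max(1, math.ceil(math.log2(len(categories))))
    mapping = {cat: [int(b) for b in format(i, f"0{width}b")]
               for i, cat in enumerate(categories)}
    return pd.DataFrame([mapping[v] for v in col],
                        columns=[f"{col.name}_{j}" for j in range(width)])

# three unique categories fit in two activation columns
encoded = binarize(pd.Series(["red", "green", "blue", "green"], name="color"))
```

A downstream imputation model that treats each returned column independently would miss that these columns jointly encode a single feature, which is why recognizing multicolumn representations matters for imputation.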
New automated data leakage detection
Please note that as we were drafting our prior responses clarifying the benefits of integrating ML imputations into a preprocessing platform, surrounding the capacity to automatically account for potential data leakage between redundantly encoded feature sets, it occurred to us that further refinements could be implemented to automatically recognize other sources of data leakage for imputation models from the surrounding features. In particular, separate features with a high prevalence of correlated missing data entries in shared rows could be evidence that an imputation model including that correlated feature in its basis would be a source of data leakage, since in imputation model training both features may have valid entries, but in imputation model inference both features will have missing data.
Thus we have now implemented a method that compares the aggregated NArw activations of a target feature in a train set to those of the surrounding features, and for cases where a surrounding feature shares a high correlation of missing data based on the shown formula, we exclude that surrounding feature from the imputation model basis of the target feature.
((NArw1 + NArw2) == 2).sum() / NArw1.sum() > tolerance
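The check can be sketched as follows. This is an illustration under assumed data structures, not the library's implementation: NArw activations are represented here as one boolean array per feature, and the tolerance default is a placeholder:

```python
import numpy as np

def correlated_missing_exclusions(narw, target, tolerance=0.9):
    """Sketch of the described leakage check: narw maps feature name to
    an array of NArw (missing data) activations over the train set rows.
    Surrounding features whose missing entries highly co-occur with the
    target feature's missing entries are excluded from the target's
    imputation model basis."""
    target_narw = np.asarray(narw[target], bool)
    exclusions = []
    for feature, activations in narw.items():
        if feature == target:
            continue
        activations = np.asarray(activations, bool)
        # ((NArw1 + NArw2) == 2).sum() / NArw1.sum() > tolerance
        ratio = np.sum(target_narw & activations) / target_narw.sum()
        if ratio > tolerance:
            exclusions.append(feature)
    return exclusions
```

The numerator counts rows where both features are missing, and the denominator counts rows where the target feature is missing, so the ratio measures how often the surrounding feature will also be missing exactly when the imputation model is needed at inference.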
We believe this automated detection for another source of data leakage could be considered another contribution of this library, and propose that we could make note of it in the paper and provide further detail in the appendix.
Please note that citations associated with this essay are provided as embedded hyperlinks; formal citations are noted in the preprint Missing Data Infill with Automunge.
For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com