Missing Data Infill with Automunge

An ML infill validation

Nicholas Teague
Automunge
14 min read · Feb 27, 2021


Abstract

Missing data is a fundamental obstacle in the practice of data science. This paper surveys a few conventions for imputation as available in the Automunge open source python library platform for tabular data preprocessing, including ML infill in which auto ML models are trained for target features from partitioned extracts of a training set. A series of validation experiments was performed to benchmark imputation scenarios against downstream model performance, in which it was found for the given benchmark sets that ML infill performed best for numeric target columns in cases of missing not at random, and was otherwise at minimum within the noise distribution of the other imputation scenarios. Evidence also suggested that supplementing ML infill with support columns of boolean integer markers signaling the presence of infill was beneficial to downstream model performance. We consider these results sufficient to recommend defaulting to ML infill for tabular learning, and further recommend supplementing imputations with support columns signaling presence of infill, each as can be prepared with push-button operation in the Automunge library.

Introduction

Missing data is a fundamental obstacle for data science practitioners. Missing data refers to feature sets in which a portion of entries may not have samples recorded, which may interfere with model training and/or inference. In some cases, the missing entries may be randomly distributed within the samples of a feature set, a scenario known as missing at random (MAR). In other cases, certain segments of a feature set’s distribution may have a higher prevalence of missing data than other portions, a scenario known as missing not at random (MNAR). In some cases, the presence of missing data may even correlate with label set properties, resulting in a kind of data leakage for a supervised training operation.

In a tabular data set (that is a data set aggregated as a 2D table of feature set columns and collected sample rows), missing data may be represented by a few conventions. A common one is for missing entries to be received as a NaN value, which is a special numeric data type representing “not a number”. Some dataframe libraries may have other special data types for this purpose. In another configuration, missing data may be represented by some particular value (like a string configuration) associated with a feature set.

When a tabular data set with missing values present is intended to serve as a target for supervised training, machine learning libraries may require as a prerequisite some kind of imputation to ensure the set has all valid entries — which for most libraries means all numeric entries (although there are some libraries that accept designated categoric feature sets in their string representations). Conventions for imputation may follow a variety of options to target numeric or categoric feature sets [Table 1], many of which apply a uniform infill value, which may either be arbitrary or derived as a function of other entries in the feature set.

Other, more sophisticated conventions for infill may derive an imputation value as a function of corresponding samples of the other features. For example, one of many learning algorithms (like random forest, gradient boosting, neural networks, etc.) may be trained for a target feature where the populated entries in that feature are treated as labels and surrounding features sub-aggregated as features for the imputation model, and where the model may serve as either a classification or regression operation based on properties of the target feature.

This paper documents a series of validation experiments that were performed to compare downstream model performance resulting from a few of these different infill conventions. We crafted a contrived set of scenarios representing paradigms like missing at random and missing not at random, injected into either a numeric or categoric target feature selected for its influence on downstream model performance. Along the way we will offer a brief introduction to the Automunge library for tabular data preprocessing, particularly those aspects of the library associated with missing data infill. The results of these experiments, summarized below, may serve as a validation of defaulting to ML infill even when faced with different types of missing data in real world tabular data sets.

Automunge

Automunge [1], put simply, is a python library platform for preprocessing tabular data for machine learning. The interface is channeled through two master functions: automunge(.), for the initial processing of training data, and postmunge(.), for subsequent processing of additional data on the train set basis. In addition to returning transformed data sets, the automunge(.) function also populates and returns a compact dictionary capturing all of the steps and parameters of transformations, which dictionary may then serve as a key for processing additional data in the postmunge(.) function on a consistent basis.
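
For orientation, a minimal invocation may resemble the following sketch, which follows the conventions of the Automunge documentation; the number and order of returned sets may vary slightly between library versions, and fuller demonstrations are provided in the appendix.

from Automunge import *
am = AutoMunge()

#prepare a pandas dataframe df_train for machine learning
#(consult the READ ME for the current return signature)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train)

#consistently prepare subsequent data df_test on the train set basis
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, df_test)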

Under automation the automunge(.) function performs an evaluation of feature set properties to derive appropriate simple feature engineering transformations that may serve to normalize numerical sets and binarize categoric sets. A user may also apply custom transformations, or even custom sets of transformations, assigned to distinct columns. Such transforms may be sourced from an extensive internal library, or may even be custom externally defined with only minimal requirements of simple data structures. The returned transformed data sets log the applied stages of transformations by way of suffix appenders on the returned column headers.

Missing data imputation is handled automatically in the library, where each transformation applied includes a default imputation convention, one that may also be overridden for use of alternative imputation conventions by assignment.

Included in the library of infill options is an auto ML solution we refer to as ML infill, in which a distinct model is trained for each target feature and saved in the returned dictionary for a consistent imputation basis of subsequent data in the postmunge(.) function. The model architecture defaults to random forest by Scikit-Learn [2], and other auto ML options are also supported including CatBoost [3] and AutoGluon [4].

The ML infill implementation works by first collecting a ‘NArw’ support column for each received feature set containing boolean integer markers corresponding to entries with missing or improperly formatted data. The types of data to be considered improperly formatted are tailored to the root transformation category to be applied to the column, where for example for a numeric transform non-numeric entries may be subject to infill, or for a categoric transform invalid entries may just be special data types like NaN or None. Other transforms may have other configurations, for example a power law transform may only accept positive numeric entries.

This NArw support column can then be used to perform a target feature specific partitioning of the training data for use in training an ML infill model [Fig 1]. The partitioning may segregate rows between those corresponding to missing data in the target feature versus those rows with valid entries, with the target feature valid entries to serve as labels for a supervised training and the other features to serve as training data. Note that for cases where a transformation set has prepared a target input feature in multiple configurations, those derivations other than the target feature are omitted from the partitions to avoid data leakage.

Figure 1: ML Infill partitioning

There is a classification associated with each transformation category to determine the type of training operation; for example, a target feature set derived from a transform that returns a numeric form may be a target for a regression operation, while a target feature set derived from a transform that returns an ordinal encoding may be a target for a classification operation. In some cases a target feature may be composed of a set of more than one column, for example in the case of a one-hot encoding. For cases where a learner library does not accept some particular form of encoding as valid labels, there is a conversion of the target feature set for training and an inverse conversion after any inference; for example, it may be necessary to convert a binarized target feature set to a one-hot or ordinal encoding for use as labels in different ML frameworks.

A similar partitioning is performed for test data sets for ML infill imputation, although in this case only the rows corresponding to entries of missing data in the target feature set are utilized.
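
To make this partitioning concrete, here is a simplified illustration of the logic for a single numeric target feature using pandas and scikit-learn; this is a sketch of the concept rather than the library's internal implementation, and it assumes the surrounding features have already been numerically encoded with their own imputations applied.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def ml_infill_sketch(df, target):
    # derive a boolean integer marker column flagging missing entries in the target
    narw = df[target].isna().astype(int)

    # partition rows with valid target entries (training) from rows needing infill
    train_rows = df[narw == 0]
    infer_rows = df[narw == 1]
    features = [c for c in df.columns if c != target]

    # the valid target entries serve as labels for a supervised training
    model = RandomForestRegressor()
    model.fit(train_rows[features], train_rows[target])

    # derive imputations for the rows flagged as missing
    if len(infer_rows) > 0:
        df.loc[narw == 1, target] = model.predict(infer_rows[features])
    return df, model, narw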

As a further variation available for any of the imputation methods, the NArw support columns may themselves be appended to the returned data sets as a signal to training of entries that were subject to infill.

Experiments

Some experiments were performed to evaluate the efficacy of a few different imputation methods in different scenarios of missing data. To amplify the impact of imputations, each of two data sets was pared down to a reduced set of the top 15 features based on an Automunge feature importance evaluation [5]. (This step had the side benefit of reducing the training durations of experiments.) The categoric and numeric features with the highest importance rankings were selected to separately serve as targets for injections of missing data, with such injections simulating scenarios of both missing at random and missing not at random.

To simulate cases of missing not at random, and also again to amplify the impact of imputation, the target features were evaluated to determine the most influential segments of the features’ distributions [6], which for the target categoric features was one of the activations and for the target numeric features turned out to be the far right tail for both benchmark data sets.

Further variations were aggregated based on the ratio of the feature (or feature segment) entries injected with missing data, ranging from no injections to full replacement.
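
For illustration, the two injection paradigms can be simulated roughly as follows; the helper names and the tail fraction shown are just for demonstration purposes.

import numpy as np
import pandas as pd

def inject_mar(df, column, ratio, seed=None):
    # missing at random: null a random subset of the column's entries
    rng = np.random.default_rng(seed)
    idx = rng.choice(df.index, size=int(ratio * len(df)), replace=False)
    df.loc[idx, column] = np.nan
    return df

def inject_mnar_right_tail(df, column, ratio, tail=0.2, seed=None):
    # missing not at random: null a fraction of entries in the far right tail
    rng = np.random.default_rng(seed)
    threshold = df[column].quantile(1 - tail)
    tail_idx = df[df[column] > threshold].index
    idx = rng.choice(tail_idx, size=int(ratio * len(tail_idx)), replace=False)
    df.loc[idx, column] = np.nan
    return df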

Finally, for each of these scenarios, variations were assembled associated with the type of infill applied by Automunge, including scenarios for defaults (mean imputation for numeric or distinct activations for categoric), imputation with mode, adjacent cell, and ML infill. The ML infill scenario was applied making use of the CatBoost library to take advantage of GPU acceleration.

Having prepared the data in each of these scenarios with an automunge(.) call, the final step was to train a downstream model to evaluate impact, again here with the CatBoost library. The performance metric applied was root mean squared error for the regression applications. Each scenario was repeated 100 times with the metrics averaged to de-noise the results.
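
The evaluation loop for each scenario can be summarized roughly as in the following sketch, with CatBoost serving as the downstream learner, root mean squared error as the performance metric, and a hold-out validation set assumed for scoring; in the actual experiments each repetition also re-sampled the injections and re-prepared the data with an automunge(.) call.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

def evaluate_scenario(train, labels, val, val_labels, repetitions=100):
    # train a downstream CatBoost regressor and score against validation data,
    # repeating and averaging to de-noise the comparison between infill scenarios
    scores = []
    for _ in range(repetitions):
        model = CatBoostRegressor(verbose=0)
        model.fit(train, labels)
        preds = model.predict(val)
        scores.append(mean_squared_error(val_labels, preds, squared=False))
    return np.mean(scores)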

Finally, a few of the scenarios were repeated again with the addition of the NArw support columns to supplement the target features.

Results

The results of the various scenarios are presented [Fig 2, 3, 4, 5]. Here the y axis shows the performance metric and the x axis the ratio of entries injected with missing data, given as {0, 0.1, 0.33, 0.67, 1.0}, where in the 0.0 case no missing data was injected and with 1.0 the entire feature was injected. Because these two cases had equivalent entries between infill types, their spread across the four infill scenarios is a good approximation for the noise inherent in the learning algorithm. An additional source of noise for the other ratios was the stochasticity of injections, with a distinct set drawn for each trial. Consistent with common sense, as the injection ratio was ramped up the trend across infill scenarios was a degradation of the performance metric.

We did find that with increased repetitions incorporated the spread of the averaged performance metrics was tightened, leading us to repeat the experiments at increased scale for improved statistical significance.

For the missing at random injections [Fig 2, 3], ML infill was at or near top performance across both data sets, although the spread between imputations was not extremely pronounced. In most of the setups, mode imputation and adjacent cell trended toward reduced performance in comparison to ML infill or the default imputations (mean for numeric sets and a distinct activation set for categoric).

Figure 2: Missing at Random — Numeric Target Feature
Figure 3: Missing at Random — Categoric Target Feature

For not at random injections to the right tail of numeric sets [Fig 4], it appears that ML infill had a pronounced benefit for the Ames Housing data set, especially as the injection ratio increased, and more of an intermediate performance for the Allstate Claims data set. We speculate that ML infill had some degree of variability across these demonstrations due to correlations (or lack thereof) between the target feature and the other features, without which ML infill may struggle to establish a basis for inference. The final scenario of not at random injections to the categoric sets [Fig 5] appeared to have only negligible spread between infill methods.

Figure 4: Not at Random — Numeric Target Feature
Figure 5: Not at Random — Categoric Target Feature

An additional comparable series of injections was conducted with ML infill and the added difference of appending the NArw support columns corresponding to the target columns for injections. Again, these NArw support columns are the boolean integer markers for the presence of infill in the corresponding entries, which also support the partitioning of sets for ML infill. The expectation was that by using these markers to signal to the training operation which of the entries were subjected to infill, there would be some benefit to downstream model performance. For most of the scenarios the visible impact was that supplementing with the NArw support column improved the ML infill performance, demonstrated here for missing at random [Fig 6, 7] and missing not at random [Fig 8, 9] with the other imputation scenarios shown again for context.

Figure 6: NArw comparison — Missing at Random — Numeric Target Feature
Figure 7: NArw comparison — Missing at Random — Categoric Target Feature
Figure 8: NArw comparison — Not at Random — Numeric Target Feature
Figure 9: NArw comparison — Not at Random — Categoric Target Feature

Discussion

One of the primary goals of this experiment was to validate the efficacy of ML infill as evidenced by improvements to downstream model performance. For the Ames Housing benchmark data set, there was a notable demonstration of ML infill benefiting model performance in the scenario of the numeric target column with not at random injections, and also to a lesser extent with random injections. We speculate that this advantage for the numeric target columns may partly be attributed to the fact that the downstream model was also a regression application, so that the other features selected for label correlation may by proxy have correlations with the target numeric feature. The corollary is that the more mundane performance of ML infill toward the categoric target columns may be a result of these having less correspondence with the surrounding features. The fact that even in these cases the ML infill still fell within the noise distribution of the other imputation scenarios, we believe, presents a reasonable argument for defaulting to ML infill for tabular applications.

Note that another argument for defaulting to ML infill as opposed to static imputations is that the imputation model may serve as a hedge against imperfections in subsequent data streams, for instance if one of the features experiences downtime in a streaming application.

The other key finding of the experiment was the pronounced benefit to downstream model performance when including the NArw support column in the returned data set as a supplement to ML infill. This finding was consistent with our intuition, which was that increased information retention about infill points should help model performance. Note there is some small tradeoff, as the added training set dimensionality may increase training time. Another benefit to including NArw support columns may be for interpretability in inspection of imputations. We recommend including the NArw support columns for model training based on these findings.

Conclusion

We hope that these experiments may serve as a kind of validation of defaulting to ML infill with NArw support columns in tabular learning for users of the Automunge library, as even though the material benefits to downstream model performance were not demonstrated for all target feature scenarios in our experiments, in the other cases there did not appear to be any material penalty. Note that ML infill can be activated for push-button operation with the automunge(.) parameter MLinfill=True and the NArw support columns included with the parameter NArw_marker=True. Based on these findings these two parameter settings are now cast as defaults for the Automunge library.

References

[1] Teague, N. (2021) Automunge, GitHub repository https://github.com/Automunge/AutoMunge

[2] Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825–2830, 2011.

[3] Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin. CatBoost: gradient boosting with categorical features support. arXiv:1810.11363

[4] Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv:2003.06505

[5] Teague, N. Parsed Categoric Encodings with Automunge (2020) https://medium.com/automunge/string-theory-acbd208eb8ca

[6] Teague, N. Automunge Influence (2020) https://medium.com/automunge/automunge-influence-382d44786e43

Appendix

A. Function Call Demonstrations

Automunge is available for pip install:
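
pip install Automunge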

Or to upgrade (we currently roll out upgrades fairly frequently):
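
pip install Automunge --upgrade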

Once installed, run this in local session to initialize:
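
#initialize the class, per the Automunge READ ME conventions
from Automunge import *
am = AutoMunge()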

Then, assuming we want to prepare a train set df_train for ML, can apply default parameters as:
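
#here df_train is a pandas dataframe of the training data
#(the set of returned objects may vary by version; consult the READ ME)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train)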

Note that if our df_train set included a labels column, we should designate the column header with the labels_column parameter. Or likewise we can designate any ID columns with the trainID_column parameter.
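
#the placeholder headers shown should be replaced with the actual column headers
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column='<labels_column_header>',
             trainID_column='<ID_column_header>')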

The returned postprocess_dict should be saved such as with pickle.
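
import pickle

#save the postprocess_dict returned from automunge(.)
with open('postprocess_dict.pickle', 'wb') as handle:
    pickle.dump(postprocess_dict, handle)

#then in a later session it can be recovered as
with open('postprocess_dict.pickle', 'rb') as handle:
    postprocess_dict = pickle.load(handle)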

We can then consistently prepare subsequent test data df_test in postmunge(.):
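
#(the set of returned objects may vary by version; consult the READ ME)
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, df_test)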

I find it helps to just copy and paste the full range of parameters for reference:

Or for postmunge(.) with full range of parameters:

B. Assigning Infill

Each transformation category has a default associated infill convention. For cases where a user wishes to override those defaults, assignments can be passed to the assigninfill parameter. Here we demonstrate assigning zero infill to column1 and ML infill to column2. We can also apply ML infill as the default infill for columns not otherwise assigned using the MLinfill parameter.
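
#(consult the READ ME for the full set of infill category keys)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             MLinfill=True,
             assigninfill={'zeroinfill': ['column1'],
                           'MLinfill': ['column2']})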

Note that the column headers can be assigned in assigninfill using the received column headers to apply consistent infill to all sets derived from an input column, or may alternatively be assigned using the returned column headers with transformation suffix appenders to assign infill to distinct returned columns.

C. ML Infill Parameters

The default ML infill architecture is a Scikit-Learn random forest with default parameters. Alternate auto ML options are available as CatBoost and AutoGluon. First we’ll demonstrate applying ML infill with the CatBoost library.
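
#ML infill with CatBoost as the autoML option
#(consult the READ ME to confirm current ML_cmnd conventions)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             MLinfill=True,
             ML_cmnd={'autoML_type': 'catboost'})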

Note that we can either defer to the library default parameters or also pass parameters to the model initializations or fit operations. Here we also demonstrate assigning a particular GPU device number.
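
The following sketch illustrates the convention; note that the nested key names shown for the model initialization parameters are illustrative and should be confirmed against the READ ME documentation.

#sketch: passing initialization parameters (here a GPU device) to the
#CatBoost imputation models; nested key names illustrative
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             MLinfill=True,
             ML_cmnd={'autoML_type': 'catboost',
                      'MLinfill_cmnd': {'catboost_classifier_model': {'task_type': 'GPU', 'devices': '0'},
                                        'catboost_regressor_model': {'task_type': 'GPU', 'devices': '0'}}})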

As another demonstration, here is an example of applying the AutoGluon library for ML infill, also applying the best_quality option which causes AutoGluon to train extra models for the aggregated ensembles. (Note this will likely result in a large memory overhead, especially when applied to every column, so we recommend reserving this option for a final production run, if at all.)
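
Again as a sketch, with the nested key names to be confirmed against the READ ME documentation:

#sketch: ML infill via AutoGluon with the best_quality presets
#(nested key names illustrative)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             MLinfill=True,
             ML_cmnd={'autoML_type': 'autogluon',
                      'MLinfill_cmnd': {'AutoGluon': {'presets': 'best_quality'}}})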

Just to be complete, here we’ll demonstrate passing parameters to the Scikit-Learn random forest models. Note that for random forest there are built in methods to perform grid search or random search hyperparameter tuning when parameters are passed as lists or distributions instead of static figures. Here we’ll demonstrate performing tuning of the n_estimators parameter (which otherwise would default to 100).
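
Here is a sketch of the convention, with n_estimators passed as a list to trigger the built in tuning; again the nested key names are illustrative and should be confirmed against the READ ME documentation.

#sketch: passing scikit-learn random forest parameters, with n_estimators
#as a list to activate hyperparameter tuning (nested key names illustrative)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             MLinfill=True,
             ML_cmnd={'MLinfill_cmnd': {'RandomForestClassifier': {'n_estimators': [100, 200, 400]},
                                        'RandomForestRegressor': {'n_estimators': [100, 200, 400]}}})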

D. Architecture Comparisons

In general, accuracy performance of the autoML options is expected as AutoGluon > CatBoost > Random Forest.

In general, latency performance of the autoML options is expected as Random Forest > CatBoost > AutoGluon.

In general, memory performance of the autoML options is expected as Random Forest > CatBoost > AutoGluon.

Random Forest and CatBoost are more portable than AutoGluon, since they don't require a local model repository saved to hard drive.

And AutoGluon and CatBoost include GPU support.

E. Intellectual Property Disclaimer

Automunge is released under GNU General Public License v3.0. Full license details available on GitHub. Contact available via automunge.com. Copyright © 2021 — All Rights Reserved. Patent Pending, applications 16552857, 17021770

Eddie Vedder — Into the Wild (soundtrack)

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
