Automunge

The preprint

Nicholas Teague
Apr 19, 2020

Abstract

Automunge is a Python library platform for preparing tabular data for machine learning. Through its application, feature engineering transformations may be applied to distinct columns on a basis of data properties derived from a designated training set used to “fit” the transformations, and then consistently applied to subsequent data on that basis. The type of transformations applied may be determined by automated inference of a column’s data properties, by user assignment from a library of transformations, or by custom defined functions with minimal requirements of simple data structures. The transformations may be applied in sets, including generations and branches of derivations. Missing or improperly formatted data may be automatically cleaned with a library of infill options, including ML infill, in which column specific machine learning models are trained on partitioned subsets of the training data.

1. Automunge

This paper will serve as a brief introduction to the Automunge library for preparing tabular data for machine learning. Automunge (Teague, 2020) is an open source Python library, available now for pip install, built on top of Pandas (McKinney, 2010), Scikit-learn (Pedregosa et al., 2011), SciPy (Virtanen et al., 2020), and NumPy (van der Walt et al., 2011). It takes as input tabular data received in a tidy form (Wickham, 2014), meaning one column per feature and one row per observation, and returns numerically encoded sets with infill to missing points, thus providing a push-button means to feed raw tabular data directly to machine learning algorithms. The complexity of numerical encodings may be minimal, such as automated normalization of numerical sets and encoding of categorical sets, or may include more elaborate feature engineering transformations applied to distinct columns. Generally speaking, the transformations are performed based on a “fit” to properties of a column in a designated train set (e.g. based on a set’s mean, standard deviation, or categorical entries), and then that same basis is used to consistently and efficiently apply transformations to subsequent designated test sets, such as may be intended for use in inference or for additional training data preparation.
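As a point of orientation, installation and import follow standard Python conventions. A minimal sketch follows; the import idiom shown reflects the project README around the time of writing and may differ in other versions:

```python
# install from PyPI:
#   pip install Automunge

# import and instantiate the class containing the two master functions
from Automunge import Automunger
am = Automunger.AutoMunge()
```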

The library consists of two master functions, automunge(.) and postmunge(.). The automunge(.) function receives a train set and, if available, a consistently formatted test set, and returns a collection of sets intended for training, validation, and inference, with each of these groups further segregated into subsets of training data, index columns, and label sets. The validation sets, such as may be designated ratios of partitioned data from the train set, are segregated from the train set prior to transformations and then consistently prepared on the train set basis to avoid data leakage between training and validation operations. The function also returns a populated Python dictionary, which we call the postprocess_dict, capturing all of the steps and parameters of the transformations. This dictionary may then be passed along with subsequent test data to the postmunge(.) function for consistent processing on the train set basis, as may for instance be applied sequentially to streams of data. Because it makes use of train set properties evaluated during a corresponding automunge(.) call instead of directly evaluating properties of the test data, processing of subsequent test data in the postmunge(.) function is very efficient.
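To make the workflow concrete, here is a sketch of the two master function calls, continuing from the import above. The return signatures shown follow the project documentation at the time of writing; the file names and 'target' label designation are hypothetical placeholders, and the README should be consulted to confirm the current interface:

```python
import pandas as pd

# tidy data: one column per feature, one row per observation
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# automunge(.) fits transformations on the train set and returns prepared
# sets for training, validation, and inference, along with the
# postprocess_dict capturing all steps and parameters of transformations
train, train_ID, labels, \
validation, validation_ID, validation_labels, \
test, test_ID, test_labels, \
postprocess_dict \
= am.automunge(df_train, df_test=df_test, labels_column='target')

# postmunge(.) consistently prepares subsequent data on the train set basis
df_subsequent = pd.read_csv('subsequent.csv')
test2, test2_ID, test2_labels, \
postreports_dict \
= am.postmunge(postprocess_dict, df_subsequent)
```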

2. Transformations

When the automunge(.) function is called, the categories of transformations applied to each column may be based on an automated evaluation of data set properties, on user specification, or on some combination thereof. The automated evaluation considers properties such as data types and distributions to determine an appropriate root category of transformation, with such transformations intended to numerically encode sets so as to make raw data suitable for direct application of machine learning. The defaults under automation, including the default root categories of transformations for the different data types and the default evaluation function itself, are modular and may be custom configured by the user.
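For example, a user may override the automated evaluation for specific columns by passing root category assignments to the automunge(.) call. A sketch follows using built-in category identifiers from the library’s documentation ('mnmx' for min-max scaling, '1010' for binary encoding, 'excl' for pass-through without transformation); the column names are hypothetical:

```python
# assign root categories of transformation to distinct columns,
# overriding the automated evaluation for those columns
assigncat = {'mnmx': ['floats_column'],
             '1010': ['categoric_column'],
             'excl': ['passthrough_column']}

# passed to an automunge(.) call, e.g.
# am.automunge(df_train, assigncat=assigncat)
```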

Included in the platform is a library of feature engineering methods. Each transformation category in the library is defined with a series of properties in two data structures we refer to as the “transformdict” and the “processdict”. The transformdict defines the progression of sets of transformations associated with a root category, which may include multiple generations and branches of transformation categories as entries to a set of “family tree” primitives. The processdict collects a set of properties associated with each transformation category, including: the associated transformation functions; a designation of the types of values that will be subject to infill (such as non-numeric entries or non-positive numeric entries); a designation of returned data properties (such as whether a returned set consists of floats or boolean activations); and a designation of a returned set target category for internal feature importance predictive algorithms in cases where a label set may be returned from a root category in multiple configurations. Both of these data structures may have entries altered or custom configured by users in the context of an automunge(.) call.
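As an illustration of the processdict conventions, here is a sketch of an entry for a hypothetical user-defined category 'newt'. The key names shown follow the project documentation at the time of writing, and the function entries are stubs standing in for transformation functions of the kind sketched in the next passage:

```python
def process_newt(*args, **kwargs):
    pass  # stub: "dualprocess" transformation function (train and test together)

def postprocess_newt(*args, **kwargs):
    pass  # stub: "postprocess" transformation function (test set on train basis)

processdict = {'newt': {'dualprocess': process_newt,     # fits basis on train set
                        'postprocess': postprocess_newt,  # applies saved basis to test set
                        'singleprocess': None,            # not used for this category
                        'NArowtype': 'numeric',           # non-numeric entries subject to infill
                        'MLinfilltype': 'numeric',        # returned set treated as numeric for ML infill
                        'labelctgy': 'newt'}}             # target category for label feature importance
```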

The transformation functions passed as entries to the processdict may be specified in multiple configurations. For cases where a transformation applied to a test set requires some extraction of properties from a train set as its basis, transformation functions are prepared in two configurations: a “dualprocess” configuration in which a train and test set may be processed simultaneously and the train set basis saved in a normalization dictionary, and a “postprocess” configuration in which the train set basis is accessed from the normalization dictionary for application to just a test set. Any function entry to dualprocess requires a corresponding entry to postprocess. For cases where a transformation applied to a test set does not require extraction of properties from a train set, a function may instead be stored as a “singleprocess” type, which may be applied comparably to either a train or test set in isolation. The dualprocess, postprocess, and singleprocess conventions may all be accessed at some point during an automunge(.) call, while a postmunge(.) call will only access the postprocess or singleprocess versions.
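To illustrate the dualprocess / postprocess convention, here is a schematic pair of functions for z-score normalization. This is a simplified sketch, not the library’s internal signatures; the actual functions also manage the column_dict data structures and suffix appenders described below:

```python
def dualprocess_zscore(df_train, df_test, column):
    """Derive basis from the train set, apply to train and test together."""
    mean = df_train[column].mean()
    std = df_train[column].std()
    std = std if std != 0 else 1.0  # guard against division by zero
    df_train[column + '_zscr'] = (df_train[column] - mean) / std
    df_test[column + '_zscr'] = (df_test[column] - mean) / std
    # the train set basis is saved for subsequent postmunge(.) application
    normalization_dict = {'mean': mean, 'std': std}
    return df_train, df_test, normalization_dict

def postprocess_zscore(df_test, column, normalization_dict):
    """Apply a previously derived train set basis to a test set in isolation."""
    mean, std = normalization_dict['mean'], normalization_dict['std']
    df_test[column + '_zscr'] = (df_test[column] - mean) / std
    return df_test
```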

As a convention for transformation functions, there is a simple set of data structures that are collected and returned with each application. These properties are stored in the postprocess_dict dictionary returned from an automunge(.) call and support various methods we will discuss, such as infill and subsequent processing in the postmunge(.) function. These “column_dict” data structures are stored under the key of the column header string of each column returned from a transformation. The column header strings are used as column identifiers throughout, and also serve the purpose of logging the steps of transformations for each returned column by way of suffix appenders affixed for each transformation function applied. (In some cases the column header strings may be integers derived during conversion from a NumPy array to a Pandas dataframe.) Entries in the column_dict data structures include category identifiers such as the current category and root category, and column identifiers such as the source column, preceding column, current column, and returned columns. Column identifiers are further aggregated as lists of columns originating from the same source column and, as a subset thereof, columns originating from the same transformation, with such aggregations supporting the application of infill and other methods. The normalization parameters derived from the train set properties used to consistently prepare test data are also stored here. Finally, these data structures include boolean identifiers to support infill and column retention management.
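A schematic of a column_dict entry follows. The key names are illustrative of the kinds of entries described above rather than an authoritative listing of the library’s internals; 'nmbr' is the built-in z-score normalization category:

```python
# schematic column_dict entry for a returned column 'column_nmbr'
# derived from source column 'column' (key names illustrative)
column_dict = {'column_nmbr': {
    'category': 'nmbr',                    # current transformation category
    'origcategory': 'nmbr',                # root category for the source column
    'origcolumn': 'column',                # source column identifier
    'columnslist': ['column_nmbr'],        # all columns from the same source column
    'categorylist': ['column_nmbr'],       # columns from the same transformation
    'normalization_dict': {'mean': 1.2,    # train set basis for test set preparation
                           'std': 0.4},
    'infillcomplete': False,               # boolean identifier supporting infill
    'deletecolumn': False}}                # boolean identifier for column retention
```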

The implementation includes means to pass transformation category specific parameters to the steps of transformations using the column header and transformation category as identifiers. Such parameters may be set as the default for a transformation category across columns, or may alternatively be passed for application to specific columns. These parameters allow further customization of feature engineering methods while making use of transformations pre-defined in the library.
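As an illustration of the parameter assignment convention (the column name and parameter value here are hypothetical placeholders; 'splt' is a built-in string-parsing category, and exact parameter names are per the library’s documentation):

```python
# assign transformation parameters by category and column
assignparam = {
    # hypothetical: override a parameter for one column receiving
    # the 'splt' string-parsing category
    'splt': {'address_column': {'minsplit': 4}},
    # a default for a category across all columns may also be designated
    # (per documentation, under a 'default_assignparam' entry)
}

# passed to an automunge(.) call, e.g.
# am.automunge(df_train, assignparam=assignparam)
```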

The implementation also includes means for a user to externally define fully custom transformation functions following these conventions and incorporate them into sets of transformations mixed with those built into the library.

3. Family Trees

For sets of transformations defined in the transformdict, such as may include generations and branches of derivations, the order of implementation is designated by passing transformation categories as entries to a set of family tree primitives, where each transformation category is defined for use as a root category with its own set of family tree primitives. A transformation category may be passed as an entry to one of its own family tree primitives, although that is not required. The primitives are: parents / siblings / auntsuncles / cousins // children / niecesnephews / coworkers / friends. The first four of these are “upstream” primitives, which means their entries are only applied in a root category’s first generation of transformations. The second four are “downstream” primitives, and their entries are only inspected when the root category is found as an entry in a tree’s primitive with downstream offspring.

Figure 1: Family Tree Primitives

The primitives can be distinguished by three properties: generation, action, and offspring. Generation refers to the distinction between upstream primitives for the first generation of transformations and downstream primitives for subsequent generations. Downstream primitive entries are treated as the upstream primitive entries for each successive generation. Action refers to the distinction of whether the column serving as input to a transformation is retained. Offspring refers to the distinction of whether an additional generation will be performed after completion of a primitive entry’s transformation. Category entries to a primitive with offspring have their own family trees inspected for presence of downstream primitive entries, which are then applied as upstream primitive entries for the successive generation.

The transformation functions associated with a category entry to a primitive are accessed from the corresponding entry in the processdict. A unique set of dualprocess / postprocess (or singleprocess) transformation functions may be associated with multiple different transformation categories used as entries to family tree primitives, while each transformation category may only be associated with a single set of transformation functions — thus it is possible to specify multiple configurations of unique branches downstream of common transformation functions in different family trees.
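To ground the family tree conventions, here is a sketch of a transformdict specification for a hypothetical root category 'cstm' that applies a Box-Cox transform upstream (replacing its input), supplements with a missing-data marker column, and passes the Box-Cox output downstream for z-score normalization. The identifiers 'bxcx', 'nmbr', and 'NArw' are built-in library categories; 'cstm' would additionally require its own processdict entry of the kind sketched in Section 2, and note that passing an entry for 'bxcx' here would override the library’s internal tree for that category in the scope of the call:

```python
transformdict = {
    'cstm': {'parents':       ['bxcx'],  # upstream, input replaced, offspring inspected
             'siblings':      [],        # upstream, input retained, offspring inspected
             'auntsuncles':   [],        # upstream, input replaced, no offspring
             'cousins':       ['NArw'],  # upstream, input retained, no offspring
             'children':      [],        # downstream analogs of the above
             'niecesnephews': [],
             'coworkers':     [],
             'friends':       []},
    # because 'bxcx' appears in a primitive with offspring, its own tree's
    # downstream primitives are inspected for the next generation
    'bxcx': {'parents':       [],
             'siblings':      [],
             'auntsuncles':   [],
             'cousins':       [],
             'children':      [],
             'niecesnephews': [],
             'coworkers':     ['nmbr'],  # z-score applied to the Box-Cox output,
             'friends':       []}}       # with no further offspring
```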

4. Infill

Besides feature engineering and numerical encodings, another fundamental challenge in preparing tabular data for machine learning stems from real world data sets, which may include instances of missing or improperly formatted values requiring infill. Automunge offers several options to address this challenge.

For each column’s root category of transformation, whether based on an automated evaluation of data properties or on user assignment, the corresponding processdict entry contains a designation of the type of values in the target column that will be subject to infill. For example, for a categorical transform we may only desire infill for received NaN values, while for the application of a power law transform we may desire infill for either non-numeric values or target values ≤ 0. This designation is used as the basis for each source column to extract a set of boolean identifiers corresponding to rows that will be subject to infill. These sets may optionally be included with the returned prepared data, such as to signal to subsequent ML training the presence of infill in distinct columns.

For infill under automation, the convention is that each transformation function includes a default infill method. For example, for z-score normalization of numerical sets the default infill may be a zero value representing the set’s mean, or for one hot encoding of categorical sets the default infill may be the lack of activations in a row. Alternatively, a user can assign to distinct columns infill methods from a built-in library, which includes options like infill with a set’s mean, median, mode, 0, 1, adjacent cell, or ML infill. Some of these infill methods also support application to transformations returning multi-column sets. A user can assign a common infill method to all columns originating from a source column by passing the source column header string, or can assign infill to a subset of derived columns by passing column header strings with their included suffix appenders. The application of assigned infill methods takes place after the initial default infill so as to allow predictive algorithms to be trained for any use of ML infill.
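Infill assignments are passed to an automunge(.) call with the infill method as key and target column headers as entries. A sketch with hypothetical column names follows; the method keys shown, such as 'MLinfill' and 'meaninfill', follow the library’s documentation:

```python
# assign infill methods to distinct columns; a source column header applies
# to all derived columns, while a header with suffix appenders (such as
# 'column4_nmbr') targets a specific derived column
assigninfill = {'MLinfill':   ['column1', 'column2'],
                'meaninfill': ['column3'],
                'adjinfill':  ['column4_nmbr']}

# passed to an automunge(.) call, e.g.
# am.automunge(df_train, assigninfill=assigninfill)
```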

The ML infill option refers to a method in which machine learning models to predict infill are trained specific to each derived column (or, for transformations returning multi-column sets, specific to each set of derived columns). For each target column or set of columns in a designated train set, the identified infill points are used as the basis to partition the data into subsets serving as training data and labels for model training, and as features for inference. For cases where a source column was transformed into multiple configurations, only the target column(s) among those derived from that source column are retained in the partition, such that sibling derivations are excluded from the features to avoid data leakage. Subsequent test data is similarly partitioned into feature sets for inference. Note that ML infill may even be run for cases where the initial training data does not require infill, as a precaution against imperfections in subsequent data streams.
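The following is a schematic of the partitioning logic for a single numeric target column. It is not the library’s internal code; the function name and the direct use of RandomForestRegressor are for illustration of the technique described above:

```python
from sklearn.ensemble import RandomForestRegressor

def ml_infill_sketch(df_train, df_test, target_col, feature_cols,
                     infill_rows_train, infill_rows_test):
    """Schematic ML infill for one numeric target column.

    infill_rows_* are boolean Series marking rows subject to infill;
    feature_cols excludes other columns derived from the same source
    column as target_col to avoid data leakage.
    """
    # rows without infill serve as training features and labels
    X_fit = df_train.loc[~infill_rows_train, feature_cols]
    y_fit = df_train.loc[~infill_rows_train, target_col]

    model = RandomForestRegressor()
    model.fit(X_fit, y_fit)

    # rows flagged for infill serve as features for inference
    if infill_rows_train.any():
        df_train.loc[infill_rows_train, target_col] = \
            model.predict(df_train.loc[infill_rows_train, feature_cols])
    if infill_rows_test.any():
        df_test.loc[infill_rows_test, target_col] = \
            model.predict(df_test.loc[infill_rows_test, feature_cols])

    # the trained model would be retained for subsequent postmunge(.) data
    return df_train, df_test, model
```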

Figure 2: ML Infill

The methods for model training take into account the processdict entries associated with a target column’s originating transformation category. The predictive algorithms for ML infill are currently implemented using Scikit-learn’s Random Forest packages. A user can pass parameters to these models, and hyperparameter tuning is available by way of grid search or random search by passing these parameters as lists or distributions instead of distinct values.
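A sketch of parameter passing for the ML infill models follows. The ML_cmnd structure shown reflects my reading of the project documentation at the time of writing and should be confirmed against the current README:

```python
# pass parameters to the ML infill Random Forest models
ML_cmnd = {'MLinfill_type': 'default',
           'MLinfill_cmnd': {'RandomForestClassifier': {'max_depth': 8},
                             'RandomForestRegressor':  {'max_depth': 8}}}

# hyperparameter tuning: pass lists (grid search) or distributions
# (random search) in place of distinct values
ML_cmnd_tuned = {'MLinfill_type': 'default',
                 'MLinfill_cmnd': {'RandomForestRegressor':
                                       {'max_depth': [4, 8, 12],
                                        'n_estimators': [50, 100]}}}

# passed to an automunge(.) call, e.g.
# am.automunge(df_train, MLinfill=True, ML_cmnd=ML_cmnd)
```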

5. Etc.

Beyond the core points of feature engineering and infill, the Automunge library contains several other push-button methods. The goal is to automate the full tabular data workflow between receipt of tidy data and the return of sets suitable for machine learning application. Some of the options include feature importance evaluation (by shuffle permutation (Breiman, 2001)), dimensionality reduction (including by means of PCA (Jolliffe and Cadima, 2016), feature importance, and binary encodings), preparation for oversampling in cases of label set class imbalance, evaluation of data distribution drift between initial train sets and subsequent test sets, and, perhaps most importantly, the simplest means for consistently and efficiently processing subsequent data with postmunge(.).
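A sketch of activating a few of these options in an automunge(.) call follows. The parameter names are per the project documentation at the time of writing and should be treated as indicative rather than authoritative:

```python
# push-button options activated by parameter (returned tuple assigned
# to a single name for brevity)
returned_sets = am.automunge(
    df_train,
    featureselection=True,      # feature importance by shuffle permutation
    Binary=True,                # dimensionality reduction via binary encoding of categoric sets
    PCAn_components=0.95,       # PCA dimensionality reduction
    TrainLabelFreqLevel=True)   # oversampling preparation for label class imbalance

# drift evaluation between the train set basis and subsequent data
# is available in postmunge(.), e.g.
# am.postmunge(postprocess_dict, df_subsequent, driftreport=True)
```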

Acknowledgments

A thank you is owed to: Stack Overflow for repeatedly serving as a useful reference for fundamentals. Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn & TensorFlow for introducing me to Scikit-learn. Andrew Ng’s Coursera MOOC for machine learning basics. François Chollet’s Deep Learning With Python for getting me started on Kaggle. Wes McKinney’s Python for Data Analysis for helping with Pandas and introducing me to the word “munge”. Alice Zheng and Amanda Casari’s Feature Engineering for Machine Learning for introducing me to several feature engineering methods. Ted Petrou’s Pandas Cookbook for helping me understand the principles of tidy data. Jason Brownlee’s Better Deep Learning for helping me understand issues around data leakage. Jeremy Howard’s fast.ai lectures, such as for the introduction to shuffle permutation and class imbalance concepts as well as discussions of time series data, honestly really well done. Sam Charrington’s TWiML podcast for current events. Yan Xu’s Houston Machine Learning meetup for motivation. Steve McConnell’s Code Complete for giving me some foundation in software development. Sebastian Raschka’s Python Machine Learning for helping me think about data preparation. Levente Szabados for shared articles on GitHub. Saku Panditharatne’s MOOC for getting me started with a GPU. Nassim Taleb and Raphael Douady for discussions around probability distributions. Ian Goodfellow, Yoshua Bengio, and Aaron Courville’s Deep Learning for helping me understand advanced topics. Thanks to those facilitators behind Python, PyPI, GitHub, Colaboratory, Anaconda, and Jupyter. Special thanks to those facilitators behind Scikit-learn, NumPy, SciPy stats, and Pandas.

References

L. Breiman. Random forests. Machine Learning, 45(1), 2001.

I. Jolliffe and J. Cadima. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci, 374:2065, 2016.

W. McKinney. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

N. Teague. Automunge. https://github.com/Automunge/AutoMunge, 2020.

S. van der Walt, S. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13:22–30, 2011.

P. Virtanen, R. Gommers, T. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, I. Polat, Y. Feng, E. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. Quintero, C. Harris, A. Archibald, A. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.

H. Wickham. Tidy data. Journal of Statistical Software, 59(10), 2014.

Appendix A. Function Call Demonstrations

Appendix B. Assigning Transforms and Infill

Figure 3: Example Transformations

Appendix C. Custom Sets of Transforms

Figure 4: More Elaborate Family Tree Specification Example

Appendix D. Intellectual Property Disclaimer

Automunge is released under GNU General Public License v3.0. Full license details are available on GitHub. Contact available via automunge.com. Copyright © 2020 Nicholas Teague. All Rights Reserved. Patent pending, applications 16552857 and 17021770.

For more on Automunge: automunge.com
