In Other Words, Put Simply

Automunge by the numbers

Nicholas Teague
Automunge
Jul 29, 2020


1. Automunge is an open-source Python library, available now via pip install.

2. Automunge prepares tabular data for machine learning by way of numeric encodings and missing data infill.

3. Automunge is built on top of the pandas and NumPy libraries, and also uses scikit-learn for predictive models and SciPy stats for statistics.

4. Automunge assumes data is provided in a tidy form, which means one column per feature and one row per observation.

5. Automunge has a library of feature engineering transforms intended for different data types such as numeric, categoric, sequential, and date-time.

6. Automunge automatically evaluates column data properties to assign appropriate types of encodings.

7. Automunge also allows a user to manually assign types of encodings to distinct columns.

8. Automunge data transformations may be applied to distinct columns in sets, which may include generations and branches of derivations, by way of “family tree” primitives.

9. Through these transformation sets, received feature columns may be returned in multiple configurations.

10. Data transformations may be sourced from the library or custom defined by the user.

11. The two master functions are automunge(.), for the initial preparation of data, and postmunge(.), for subsequent consistent preparations of additional data.

12. The automunge(.) function, in addition to performing data transformations, populates and returns a Python dictionary capturing all of the steps and parameters of the transformations.

13. This Python dictionary may then be passed to the postmunge(.) function with additional data for fully consistent processing.

14. The Python dictionary may be published by researchers to allow others to replicate preprocessing and infill methods on additional data.

15. Missing data infill may be based on the defaults for each transformation or assigned by user from a library of infill methods.

16. The infill library includes ML infill, in which Random Forest machine learning models are trained for each column to predict infill based on properties of other features.

17. For sets with many missing values, the ML infill training operation may automatically be iterated multiple times so that the sets converge toward a solution.

18. ML infill allows users to pass parameters to Random Forest, and parameters can even be passed as lists to apply grid search or as distributions to apply random search hyperparameter optimization.

19. Automunge includes a push-button feature importance evaluation that can calculate metrics by shuffle permutation associated with each source column as well as each derived column.

20. The feature importance metrics may be used to support a dimensionality reduction based on ranked influence.

21. Dimensionality reduction is also available by principal component analysis (PCA).

22. Dimensionality reduction is also available by binary encodings to consolidate categoric sets.

23. Automunge defaults to binary transforms for categoric sets, z-score normalization for numeric sets, and segregation by time scale for date-time sets.

24. The defaults for automation may be custom configured.

25. Automunge offers numeric set encoding options like normalizations, transformations, and binning.

26. Automunge offers sequential set encoding options to supplement numeric streams with proxies for derivatives such as velocity, acceleration, etc.

27. Automunge offers categoric set transformation options like one-hot encoding, ordinal encoding, and binary encoding.

28. Automunge also offers options to extract structure from categoric string sets by a type of string parsing to identify shared grammar between categoric entries.

29. Automunge can be run with default parameters as a push-button means to feed raw tabular data directly to machine learning.

30. Automunge can also serve as a platform for custom engineered data pipelines.
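To make the idea in points 11 through 13 and 23 a little more concrete, here is a minimal sketch in plain pandas and NumPy. It is not Automunge's API or implementation. The function names fit_transform and transform, the "_z" and "_bit" column suffixes, the mean infill, and the toy data are all hypothetical, chosen only to illustrate the pattern: fit encodings on a first set, capture every parameter in a Python dictionary, then reuse that dictionary to prepare additional data consistently.

```python
import numpy as np
import pandas as pd


def transform(df, params):
    """Apply previously fitted encodings (hypothetical helper, not Automunge)."""
    out = pd.DataFrame(index=df.index)
    for col, p in params.items():
        if p["kind"] == "zscore":
            # Mean infill for missing entries, then z-score with the stored stats.
            filled = df[col].fillna(p["mean"])
            out[col + "_z"] = (filled - p["mean"]) / p["std"]
        else:
            # Integer codes start at 1; code 0 is reserved for missing or unseen.
            lookup = {c: i + 1 for i, c in enumerate(p["categories"])}
            codes = df[col].map(lookup).fillna(0).astype(int)
            # Spread each integer code across a fixed number of 0/1 columns.
            for bit in range(p["width"]):
                out[f"{col}_bit{bit}"] = (codes // 2**bit) % 2
    return out


def fit_transform(df):
    """Fit encodings on a tidy DataFrame and return (encoded data, parameter dict)."""
    params = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            std = df[col].std(ddof=0)
            if not std > 0:  # guards against a zero or NaN standard deviation
                std = 1.0
            params[col] = {"kind": "zscore",
                           "mean": float(df[col].mean()),
                           "std": float(std)}
        else:
            cats = sorted(df[col].dropna().unique())
            # Enough bits to hold every category plus the reserved code 0.
            width = max(1, int(np.ceil(np.log2(len(cats) + 1))))
            params[col] = {"kind": "binary", "categories": cats, "width": width}
    return transform(df, params), params


# Toy data: one row per observation, one column per feature.
train = pd.DataFrame({"size": [1.0, 2.0, np.nan, 3.0],
                      "color": ["red", "blue", "green", None]})
new = pd.DataFrame({"size": [2.0, np.nan],
                    "color": ["blue", "purple"]})

train_enc, params = fit_transform(train)  # initial preparation
new_enc = transform(new, params)          # consistent preparation of new data
print(new_enc)
```

The point of the design is that the second call never looks at statistics of the new data: the mean, the standard deviation, and the category list all come from the dictionary, which is why a dictionary like this could be shared so that others can replicate the same preprocessing.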

God bless.

Three Little Birds — Nicholas Teague

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
