Troubleshooting with Automunge

Getting to the root of the matter

Nicholas Teague
Automunge
Feb 23, 2022


For those who haven’t been following along, I’ve used this forum over the last few years to document the development of Automunge, an open source Python library platform that formalizes the preprocessing of dataframes for tabular machine learning. The tool may be applied for engineering custom feature transformation pipelines, or alternatively a user can simply defer to automation, in which case numeric data is normalized, categoric data is binarized, and missing data is imputed with automated machine learning trained on the surrounding features. Feature transformations are performed based on a “fit” to properties of the training data, and having established a pipeline and basis with the automunge(.) function, additional data for inference can then be consistently prepared in a pushbutton operation with postmunge(.).
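
As a quick illustration of that workflow, here’s a minimal sketch following the import and returned-set conventions from the library’s README (the toy dataframes are placeholders, and exact signatures may vary by version):

    import pandas as pd
    from Automunge import *

    am = AutoMunge()

    # a toy training set with a numeric feature and a label column
    df_train = pd.DataFrame({'feature': [1.0, 2.0, None, 4.0],
                             'labels':  ['a', 'b', 'a', 'b']})

    # automunge(.) fits a preprocessing basis to the training data
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = \
        am.automunge(df_train, labels_column='labels')

    # postprocess_dict then serves as the key for consistently preparing
    # additional data for inference with postmunge(.)
    df_test = pd.DataFrame({'feature': [3.0, 5.0]})
    test, test_ID, test_labels, \
    postreports_dict = \
        am.postmunge(postprocess_dict, df_test)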

One of the guiding principles for design that has yet to be discussed in this blog is simply to make everything fool-proof. It helps that this author is somewhat of a fool himself, so testing for this principle has been easy. Every time we run an experiment, benchmark, or validation, we naturally encounter edge cases or halt scenarios interfering with desired operation, which then immediately serve as firewood for enhanced troubleshooting guidance passed on to our users. In other words, every error channel we come across is logged, with corresponding validation tests and printouts then integrated into the workflow to mitigate it. Automunge is antifragile to errors. They make us stronger.

This essay is thus intended as an introduction to the automunge validation suite, and we’ll keep it short in case you’re into that whole brevity thing.

Most of these error channels are associated with validating parameter settings or dataframe properties. For example, dataframes require all unique column headers, matched in order between train and test data. An easy channel to miss is that if the training data includes a label feature, it must be designated to automunge(.) with the labels_column parameter, otherwise when preparing additional data in postmunge(.) the function will interpret it as a missing feature, and since the surrounding features all have an ML infill basis, this is a halt scenario that is flagged with a descriptive printout warning.
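
A sketch of that failure mode, under the same toy-data assumptions as the example above; the halt surfaces in the postmunge(.) call:

    # if the label column is not designated at automunge(.), it is
    # encoded as an ordinary feature and folded into the ML infill basis
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = \
        am.automunge(df_train)  # labels_column omitted

    # test data prepared for inference naturally lacks the label column,
    # which postmunge(.) interprets as a missing feature: a halt scenario
    test, test_ID, test_labels, \
    postreports_dict = \
        am.postmunge(postprocess_dict, df_test)

    # the remedy is to designate the label upstream:
    # am.automunge(df_train, labels_column='labels')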

Another validation result worth noting is associated with the “suffix overlap” channel. Because each derivation is logged on the returned column headers with an underscore and suffix appended, e.g. “column” returns as “column_suffix”, there is a remote edge case where a new column header may overlap with a pre-existing column header, which would be an error channel. An easy fix to avoid this channel is to abstain from use of the underscore character in headers passed to the functions, or one can simply run the function and check the validation result logged as postprocess_dict['miscparameters_results']['suffixoverlap_results'], which when flagged is accompanied by a prominent printout warning.
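
Here’s a sketch of checking that logged result after a call; the key path is as given above, while treating the entry as a truthy flag is an assumption based on the boolean convention described below:

    # False means the suffix overlap test passed, a truthy value
    # means an overlap was identified
    if postprocess_dict['miscparameters_results']['suffixoverlap_results']:
        print('suffix overlap detected, consider removing underscores '
              'from source column headers')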

Validation and troubleshooting results can be accessed through three redundant channels:

  1. Validation result printouts
  2. A returned log of validation results
  3. An external log of printouts and validation results

The printouts on their own will often be sufficient in manual operation. When a validation test is performed, the convention is that associated printouts are silent when everything passes, and otherwise a descriptive statement is displayed warning of a negative result. Ease of interpreting these printouts may vary by the printstatus parameter setting, where printstatus controls what some other libraries refer to as ‘verbosity’ (a usage sketch follows the list below).

  • In the most verbose case of printstatus=True, validation warnings will be intermingled with other status reporting which can be quite lengthy.
  • In the default printstatus='summary' case, printouts are more reasonable and validation warnings are visible without a lot of scrolling.
  • In the printstatus=False case, only validation warnings will be displayed, otherwise the printouts are silent.
  • In the printstatus='silent' case, no printouts at all are returned, which means for troubleshooting a user will need to inspect a log of validation results.
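
As a sketch, the parameter is passed directly to the function call, and can be passed similarly to postmunge(.); for instance, to retain only validation warnings:

    # printstatus=False mutes status reporting but retains validation warnings
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = \
        am.automunge(df_train, labels_column='labels', printstatus=False)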

Alternatively, a user can access a comprehensive validation log, which also records passed tests. The returned log can be accessed as postprocess_dict['miscparameters_results'], where for those less well-versed, the postprocess_dict is a dictionary populated by the automunge(.) function that serves as the key for preparing additional data in the postmunge(.) function for inference. The log is simply a dictionary which in most cases comprises test identifier keys and boolean activation values, with False meaning a test passed and True meaning a validation warning was identified. A small number of the validation tests may have more elaborate reporting conventions or may only be returned with specific parameter configurations. The test identifiers and reporting conventions are detailed in the documentation provided on GitHub as the Validations file.
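
One way to surface any flagged entries is a simple scan of the log, a sketch assuming the boolean convention just described:

    # report any validation tests that flagged (True) in the returned log
    for test_id, result in postprocess_dict['miscparameters_results'].items():
        if result is True:
            print(f'validation warning: {test_id}')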

The only obstacle that may arise for inspecting this validation log is kind of obvious: if the function fails to compile, then it won’t return a log. This scenario can be partially mitigated with printouts, but for cases of integration into automated workflows that may be less than ideal. To accommodate, we recently rolled out another channel for accessing an external validation log, in a manner that is accessible even if the function fails to compile. Not only does this option give access to the validation log, but each printstatus summary is also recorded and returned. The implementation makes use of some Python quirks for memory durability of dictionary data structures, and can be initiated by simply initializing an empty dictionary external to the function call and passing it to the logger parameter (which is a parameter for both of the API interfaces of the automunge(.) and postmunge(.) functions). Then if the functions fail to compile for any reason, all that is needed is to inspect the previously initialized dictionary, and the validation logs and printouts are there for your use. Like magic.
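
In other words, because the externally initialized dictionary is mutated in place, its contents survive an exception raised inside the call. A sketch, with the surrounding exception handling as an assumption:

    # initialize an external dictionary to capture validation logs and printouts
    validation_logs = {}

    try:
        train, train_ID, labels, \
        val, val_ID, val_labels, \
        test, test_ID, test_labels, \
        postprocess_dict = \
            am.automunge(df_train, labels_column='labels',
                         logger=validation_logs)
    except Exception as err:
        # even though the call failed to compile, the dictionary retains
        # the validation results and printouts recorded up to the failure
        print(f'automunge(.) halted: {err}')
        print(validation_logs)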

Automunge was developed by a licensed professional engineer, and we have sought to approach the process with a commensurate degree of rigor. We recognize that there is potential for this software to be integrated into mission critical applications, and with that exposure comes a certain degree of responsibility. We take that responsibility very seriously, and intend to continue doing so.

Until next time.

For further readings please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

Debussy’s Prélude Livre II, V — Bruyères
