Diverse Feature Engineering with Automunge v1.799

For those who haven’t been following along, I’ve been using this forum in recent months to document the development of a software tool for wrangling structured datasets (aka tabular data). The tool is a Python class that is available now for all of your Python data-wrangling needs (a companion notebook is provided at the end of this essay). I recently transitioned to full-time focus on this project, and am really happy with how much progress there is to show for it. Quitting your job is never an easy decision, but it helps to remember that you’re playing a long game, and a useful heuristic is that over-optimizing for short-term comfort is generally going to be a losing hand in the long run.

This week’s progress includes the rollout of version 1.799, which addresses a few points that I expect will really help this tool stand the test of time. First of all, we’ve updated and improved the feature importance evaluation method, where by improved I mean we’ve added separate evaluation metrics for original features vs derived features, and by updated I mean it actually works now lol. I’ll go into a little more detail about the feature importance methods shortly.

Another update is that we’ve taken some of the additional min-max scaling transformations demonstrated in the last essay as user-defined custom transformations and incorporated them into our core library, such that a user now has built-in access to the demonstrated transforms ‘mnm2’ and ‘mnm3’ for assignment (as well as a few more variations).

The last topic of this week’s essay is a demonstration of how to define the default transforms for automated Automunge application. For instance, our default address for columns evaluated as numeric data includes a range of derived transformations such as z-score normalization, power law transforms, and standard deviation bins. A user might not want quite as many children columns as I’ve specified, since the amount of available data may not support quite that much dimensionality, so we’ll introduce here methods to assign default generation configurations to our automated address.

Feature Importance

We rolled out a version of feature importance evaluation in our presentation for version 1.78. After a little more testing we found the method needed some work. No worries, when you move this quickly to roll out new functionality these things are bound to happen. The good news is that in the process of troubleshooting we identified a little hidden process flaw that probably would have gone unnoticed for a while, so this turned out to be a blessing.

The feature importance method makes use of permutation evaluation, in which a machine learning model is trained and we measure the impact on prediction accuracy for a validation set after randomly shuffling the feature we are evaluating. We implemented this permutation method in two different ways. The first method shuffles all columns derived from the original source column. The second method shuffles all but one of the columns derived from the original source column. Thus the purpose of the first metric is to evaluate the importance of the original source column, and the purpose of the second metric is to evaluate, by means of comparison, the relative importance of each of the columns derived from a single source column.

There was some additional functionality intended for the feature importance method, specifically the trimming of candidate features based on performance on the evaluation metric. I’m going to save that step for the next update, as I’m trying to get this essay out with quick turnaround. For now the purpose of the feature importance method is simply reporting. I’m sure another improvement will be a more formal means of reporting; for now, the relevant aspects can be printed from the returned feature importance object.
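Separately from any reporting code, the two permutation metrics described above can be illustrated with a small standalone sketch. This is not Automunge's implementation: it assumes a plain least-squares model in NumPy, synthetic data, and hypothetical helper names (fit_linear, r2, shuffled_score, permutation_metrics), and it uses the drop in validation R² as the importance measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares fit with an intercept (a stand-in for any trained model)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def r2(w, X, y):
    """Coefficient of determination of the fitted model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    ss_res = np.sum((y - A @ w) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def shuffled_score(w, X_val, y_val, cols, rng):
    """Validation score after shuffling the rows of the given columns together."""
    cols = np.asarray(cols, dtype=int)
    X_perm = X_val.copy()
    perm = rng.permutation(len(X_val))
    X_perm[:, cols] = X_val[perm][:, cols]
    return r2(w, X_perm, y_val)

def permutation_metrics(w, X_val, y_val, groups, rng):
    """groups maps a source column name to the indices of its derived columns."""
    base = r2(w, X_val, y_val)
    source_metric, derived_metric = {}, {}
    for source, cols in groups.items():
        # Metric 1: shuffle every column derived from this source.
        source_metric[source] = base - shuffled_score(w, X_val, y_val, cols, rng)
        # Metric 2: shuffle all derived columns except the one being evaluated.
        for keep in cols:
            others = [c for c in cols if c != keep]
            derived_metric[(source, keep)] = base - shuffled_score(
                w, X_val, y_val, others, rng)
    return base, source_metric, derived_metric

def make_data(n):
    """Synthetic data: each source yields a scaled column and a sign-bin column."""
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    X = np.column_stack([a, a > 0, b, b > 0]).astype(float)
    y = 3.0 * a + 0.1 * b + rng.normal(scale=0.1, size=n)
    return X, y

X_train, y_train = make_data(500)
X_val, y_val = make_data(500)
groups = {"a": [0, 1], "b": [2, 3]}

w = fit_linear(X_train, y_train)
baseline, source_metric, derived_metric = permutation_metrics(
    w, X_val, y_val, groups, rng)

print("baseline R^2:", baseline)
print("metric 1 (source columns):", source_metric)
print("metric 2 (derived columns):", derived_metric)
```

In this sketch a larger drop on the first metric suggests a more important source column. On the second metric, a smaller drop when a given column is the one left intact suggests that column carries more of the signal relative to its siblings.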

Min-Max Scaling

After the great reception for the last essay (now up to 7 readers!) it seemed appropriate to go ahead and build into the core library a few of the different variations on the min-max scaling technique that we defined for purposes of demonstrating custom transformations. I’ve now made available in the library the following defined transformation sets associated with min-max scaling, which can be assigned to a numeric column by passing the column name through the assigncat argument of the Automunge function.

As a reminder, the family tree of transformation primitives is a means of telling the algorithm whether a transformed column should replace or supplement the original column and what kind of downstream transformations should be performed. I’ll refer you to the last essay if you want to see the definitions of these primitives again. (It occurs to me that the official README document needs some updates to capture some of this functionality.)

A reminder that the process_mnmx function performs min-max scaling to the range 0–1, and the process_mnm3 function does the same after capping values at the 0.99 and 0.01 quantiles. Note that mnm3 is specifically intended for those applications where tail values do not influence your target variables. I expect that for many fat-tailed distributions this is the exception rather than the rule, so I only recommend using it when appropriate.
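To make the difference between plain min-max scaling and the quantile-capped variant concrete, here is a minimal NumPy sketch. The function names minmax_scale and capped_minmax_scale are hypothetical and are not the library's process_mnmx or process_mnm3; the sketch also derives its parameters from the array it is given, whereas a real pipeline would store the parameters from the training data and reuse them on later data.

```python
import numpy as np

def minmax_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no range to scale; return zeros by convention.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def capped_minmax_scale(values, lower_q=0.01, upper_q=0.99):
    """Clip values to the given quantiles, then apply min-max scaling."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.quantile(values, [lower_q, upper_q])
    return minmax_scale(np.clip(values, lo, hi))

# 99 evenly spaced values plus a single extreme outlier.
data = np.concatenate([np.linspace(0.0, 10.0, 99), [1000.0]])

plain = minmax_scale(data)
capped = capped_minmax_scale(data)

print("median after plain min-max: ", np.median(plain))
print("median after capped min-max:", np.median(capped))
```

With the outlier present, plain min-max scaling squeezes the bulk of the values toward 0, while the capped version spreads them across more of the 0–1 range. The cost is that everything beyond the caps becomes indistinguishable, which is why the capped form only suits cases where the tails do not matter.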

Default transformations for automation

One of our big focuses in the last two updates was building out the capabilities for user-assigned processing methods. With 1.78 we rolled out the ability for a user to assign processing methods and infill methods to specified columns. Then with 1.79 we rolled out the ability for a user to build on the Automunge platform and pass their own custom tree of transformations and processing functions for assignment. This lets a user take advantage of Automunge’s built-in capabilities for such useful methods as machine learning derived infill, feature importance, and especially the easiest possible way to consistently process subsequently available data, with just one simple call to the postmunge function.

As I was giving all of this attention to customization, it finally occurred to me that perhaps we need to allow a little more customization of the automated address (duh). So I’m going to describe here a method to overwrite the default transformations that are performed for the default evaluated categories of data, such as those implemented for nmbr, bxcx, bnry, text, or date.

Companion Colaboratory Notebook

*(link added 4/27/19)

