Diverse Feature Engineering with Automunge v1.799

Hashtag #WomenOfDataScience

Nicholas Teague
Automunge
10 min read · Apr 27, 2019

St Elizabeth Holding a Book, early 16th century
Tank and the Bangas — Tiny Desk Concert

For those who haven’t been following along, I’ve been using this forum in recent months to document the development of a software tool for wrangling structured datasets (aka tabular data), built as a Python class that is available now for all of your Python data-wrangling needs (a companion notebook is linked at the end of this essay). I recently transitioned to a full-time focus on this project, and am really happy with how much progress there is to show for it. Quitting your job is never an easy decision, but it helps to remember that you’re playing a long game, and a useful heuristic is that over-optimizing for short-term comfort is generally going to be a losing hand in the long run.

This week’s progress includes the rollout of version 1.799, which addresses a few points that I expect will really help this tool stand the test of time. First of all, we’ve updated and improved the feature importance evaluation method, where by improved I mean we’ve added separate evaluation metrics for original features vs derived features, and by updated I mean it actually works now lol. I’ll go into a little more detail about the feature importance methods shortly. Another update is that we’ve taken some of the additional min-max scaling transformations, demonstrated in the last essay as user-defined custom transformations, and incorporated them into our core library, such that a user now has built-in access to the demonstrated transforms ‘mnm2’ and ‘mnm3’ for assignment (as well as a few more variations). Finally, this week’s essay will demonstrate how to define the default transforms for automated Automunge application. For instance, our default address for columns evaluated as numeric data includes a range of derived transformations such as z-score normalization, power law transforms, and standard deviation bins. A user might not want quite as many children columns as I’ve specified (after all, the scale of the available data may not support quite that much dimensionality), so we’ll introduce here methods to assign default generation configurations to our automated address.

Feature Importance

We rolled out a version of feature importance evaluation in our presentation for version 1.78. It turns out that after a little more testing we found the method needed a little work. No worries, when you move this quickly to roll out new functionality these things are bound to happen. The good news is that in the process of troubleshooting we identified a little hidden process flaw that probably would have gone unnoticed for a while, so this turned out to be a blessing.

The feature importance method makes use of permutation evaluation, in which a machine learning model is trained and we measure the impact on prediction accuracy for a validation set after randomly shuffling the feature we are evaluating. We implemented this permutation method in two different ways. The first method shuffles all features derived from the original source column. The second method shuffles all but one of the columns derived from the original source column. Thus the purpose of the first metric is to evaluate the importance of the original source column, and the purpose of the second metric is to evaluate, by means of comparison, the relative importance of each of the columns derived from a single source column.

There was some additional functionality intended for the feature importance method, specifically the trimming of candidate features based on performance on the evaluation metric. I’m going to save that step for the next update, as I’m trying to get this essay out with a quick turnaround. For now the purpose of the feature importance method is simply reporting. I’m sure another improvement to the feature importance method will be a more formal means of reporting; for now, passing this code to the returned feature importance object will print the relevant aspects.
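To make the two permutation metrics a little more concrete, here is a minimal standalone sketch in plain Python. This is not Automunge’s implementation: the function names, the column names, and the hand-written rule standing in for a trained model are all hypothetical, and shuffling each column independently is an assumption on my part.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of validation rows the model labels correctly."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(labels)

def shuffle_columns(rows, columns, rng):
    """Copy the rows, independently shuffling each named column across rows."""
    out = [dict(row) for row in rows]
    for col in columns:
        values = [row[col] for row in out]
        rng.shuffle(values)
        for row, value in zip(out, values):
            row[col] = value
    return out

def feature_importance(model, rows, labels, families, seed=0):
    """Accuracy drops for each source column and for each of its derived columns."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    report = {}
    for source, derived in families.items():
        # Metric 1: shuffle every column derived from this source column.
        source_drop = base - accuracy(model, shuffle_columns(rows, derived, rng), labels)
        # Metric 2: keep one derived column intact and shuffle all of its siblings.
        derived_drop = {}
        for keep in derived:
            siblings = [c for c in derived if c != keep]
            shuffled = shuffle_columns(rows, siblings, rng)
            derived_drop[keep] = base - accuracy(model, shuffled, labels)
        report[source] = {"source": source_drop, "derived": derived_drop}
    return report

# Toy validation set (illustrative only): the label depends solely on a_nmbr.
data_rng = random.Random(42)
rows, labels = [], []
for _ in range(200):
    a = data_rng.gauss(0, 1)
    rows.append({"a_nmbr": a, "a_noise": data_rng.gauss(0, 1), "b_nmbr": data_rng.gauss(0, 1)})
    labels.append(int(a > 0))

def model(row):
    """Stand-in for a trained model: a fixed threshold rule on a_nmbr."""
    return int(row["a_nmbr"] > 0)

# Hypothetical map from each source column to the columns derived from it.
families = {"a": ["a_nmbr", "a_noise"], "b": ["b_nmbr"]}
report = feature_importance(model, rows, labels, families)
```

In this toy setup the first metric shows a large accuracy drop for source column a and none for b. Under the second metric, a small drop when a given column is the one kept intact suggests that column carries most of the signal on its own, which is how I’d read the comparison between derived columns.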

Min-Max Scaling

After the great reception for the last essay (now up to 7 readers!) it seemed appropriate to go ahead and build into the core library a few of the variations on the min-max scaling technique that we defined for purposes of demonstrating custom transformations. I’ve now made available in the library the following defined transformation sets associated with min-max scaling, which can be assigned to a numeric column by passing the column name through the assigncat argument of the Automunge function. As a reminder, the family tree of transformation primitives is a means of telling the algorithm whether a transformed column should replace or supplement the original column and what kind of downstream transformations should be performed. I’ll refer you to the last essay if you want to see the definitions of these primitives again. (It occurs to me that the official README document needs some updates to capture some of this functionality, I suppose.) A reminder that the process_mnmx function performs min-max scaling between 0–1, and the process_mnm3 function does the same after capping values at the 0.01 and 0.99 quantiles. Note that mnm3 is specifically intended for those applications where tail values do not influence your target variables. I expect that for many fat-tailed distributions this is the exception rather than the rule, so I only recommend using it when appropriate.
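As a rough illustration of the difference between plain and capped min-max scaling, here is a small standalone sketch. It is not the library’s process_mnmx or process_mnm3 code: the helper names are hypothetical, and the linearly interpolated quantile is an assumption about how the caps might be computed.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

def fit_mnmx(train):
    """Plain min-max: bounds are the observed minimum and maximum."""
    return min(train), max(train)

def fit_mnm3(train, low_q=0.01, high_q=0.99):
    """Capped min-max: bounds are the 0.01 and 0.99 quantiles."""
    ordered = sorted(train)
    return quantile(ordered, low_q), quantile(ordered, high_q)

def scale(values, bounds, cap=False):
    """Scale values to 0-1 using fitted bounds, optionally capping to the bounds first."""
    low, high = bounds
    span = high - low
    if span == 0:
        # Constant column: nothing to scale, so return zeros.
        return [0.0 for _ in values]
    if cap:
        values = [min(max(v, low), high) for v in values]
    return [(v - low) / span for v in values]

# Toy column: 0..99 plus one extreme tail value.
train = [float(v) for v in range(100)] + [10_000.0]
mnmx = scale(train, fit_mnmx(train))
mnm3 = scale(train, fit_mnm3(train), cap=True)
```

With the single outlier present, the value 50 lands near 0.005 under plain scaling but near 0.5 under the capped version, which is the trade-off described above: capping keeps the bulk of the distribution spread across the 0–1 range at the cost of flattening the tails.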

Default transformations for automation

So one of our big focuses in the last two updates was building out the capabilities for user-assigned processing methods. With 1.78 we rolled out the potential for user-assigned processing methods and infill methods for specified columns. Then in 1.79 we rolled out the potential for a user to build on the Automunge platform and pass their own custom tree of transformations and processing functions for assignment. This lets a user take advantage of Automunge’s built-in capabilities for such useful methods as machine learning derived infill, feature importance, and especially the easiest possible way to consistently process subsequently available data, with just one simple call to the postmunge function. And as I was giving all of this attention to customization, it finally occurred to me that perhaps we need to allow a little more customization to the automated address (duh). So I’m going to describe here a method to overwrite the default transformation methods that are performed for default evaluated categories of data, such as those implemented for nmbr, bxcx, bnry, text, or date.

If you’ve made it this far, I suspect you might be sticking around from curiosity associated with the hashtag teasing for #WomenOfDataScience, which full disclosure was initially conceived as a blatant attempt at pandering to a new audience to garner some more attention for this software. I suspect this is a risky strategy, after all there is a certain contingent of the online tech community that has a tendency to, how should I say, gang up on those who dare to deviate from party lines on what has surprisingly turned into a sort of prominent issue. Never having been one to shy away from a perfectly suited pseudo-controversial opportunity to generate free word of mouth (he wouldn’t have been elected if this messaging tactic for public figures wasn’t so rare you know), well what the hell let’s just jump right on in and see where it takes us.

I’ll go ahead and lean in on this helpful soundbite to get us started. While it’s certainly true that these statistics are a coarse-grained assessment of the complexity inherent to the dynamics at play, it’s hard to argue that this doesn’t paint a picture of a somewhat obvious disparity in equality of opportunity for some segments of the workforce. I’m going to focus on gender disparity here, but most of what I address can be extended to racial dynamics as well. I suspect part of the reason that this issue can generate so much energy is because a version of it has been played out in the mainstream political arena for generations around the subject of affirmative action (3rd rail alert). I believe affirmative action is appropriate and actually sort of an obligation for cases where some segment of the population has been previously subject to laws/regulations that limit their opportunities (I don’t think this should be an extremely controversial statement). Outside of this narrow band of duty however, I believe affirmative action in the public sector is sort of inappropriate. Notice the qualifier for public sector though. The statistics quoted here aren’t for employment in the public sector, these are for private business. And my argument here is going to be that the debate over affirmative action is not a direct analog for diversity considerations of private employers.

I’ll spare you the suspense. The rationale behind this assertion is based on the fact that the private enterprise has no duty to equal representation and treatment of citizens. Its duties are to the shareholders (*although as an asterisk this duty is not the only one and should be kept in the context of duty to support the infrastructure and environment that facilitate a private enterprise’s continued operation, sorry for the tangent). Anyway where I was going is that within that duty to shareholders there is the need to exceed industry norms for performance; there is a need to build a competitive advantage. So how does employee gender composition translate to advantage? This is where I’m going to start to deviate from core principles and start well basically just offering opinion. My argument is that organizational dynamics are more functional and employees are exposed to a better working environment — with improved employee retention — when there are employee dynamics that share aspects of work and social pursuits. Some companies may consider it a feature not a bug that they require their workers to segregate their thoughts by time of day, to live hours upon hours of their life in a cold and sterile environment. I’m not going to name any names but we all know this is out there. I believe that organizations that cultivate a culture and ecosystem where people are allowed to be people, where high stress environments are interspersed with periods of camaraderie and real world interactions, well I think their employees will work that much harder to keep a seat at the table.

Head of a Woman in Profile to the Right, Timoteo Viti, 15th-16th century

Probably worth a clarification here. I don’t want this to be interpreted as an argument that we need more women so that we can all date each other (lol I can see how you might interpret it that way, that’s not what I meant). What I’m getting at is that group dynamics and decision making are improved with diversity of representation. Each employee has some distinct circle of experience to draw from. When we are representationally stunted, well we have a narrower fitness landscape to explore, we have higher risk of missing some local minimum altogether. Since I opened the door of employee interactions, dating and the like, I’ll go ahead and throw out my personal philosophy. Occasionally, a coworker may get asked out on a date. There will always be some standards of behavior necessary, there is no getting around that, but by setting oppressive boundaries between an employee’s professional and personal life you are contributing to a slave/master mindset. Some companies are so worried about minimizing their liability for a rare bad actor that they end up sucking the joy out of everyone else’s friendships and interactions. Trying to bleach out the humanity from our professional interactions is a recipe for a neutered corporate culture. We are people. We live in a complex and contradictory world.

The Priestess Tagerem, 3rd century B.C.

The standards of interactions for our female colleagues in my mind can be thought of with very simple rules. On subjects pertaining to business there should be very little difference in how you should interact with a male vs female colleague. On subjects pertaining to social interactions, very simply a man should treat his female colleague with the exact same respect and courtesy that he would show to a sister. I recognize that many of us are not religious, but try to imagine what standard of behavior would be expected between two members of the same congregation. Sorry I’ll get off my pulpit now really just wanted to share a new update to the Automunge software. Hoping that this essay might generate some controversy but you know, like, not too much. All the best. Nick

Companion Colaboratory Notebook

*(link added 4/27/19)

Books that were referenced here or otherwise inspired this post:

From the Diaries of John Henry — Nicholas Teague

* This software does not offer investment advice or securities.

Nicholas Teague
Automunge

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.