Composing Features with Automunge v1.8

Don’t Stop Believing

Nicholas Teague
Automunge
9 min read · May 5, 2019

That time of year thou mayst in me behold

When yellow leaves, or none, or few, do hang

Upon these boughs which shake against the cold,

Bare ruin’d choirs where late the sweet birds sang.

- Shakespeare

For those who haven’t been following along, I’ve been using this forum in recent months to document the development of a Python class intended to automate and/or simplify the final steps of processing tabular data prior to the application of machine learning. The tool takes pandas dataframes as input and returns numerically encoded sets which can be fed directly to machine learning algorithms in the framework of a user’s choice. In addition to a few default methods for automated processing, the tool also allows a user to assign distinct processing and infill methods to specific columns. Another nifty feature is a method for deriving infill for missing or improperly formatted data with machine learning, in a fully generalized and automated fashion. Probably the most underrated feature of the tool is the capability to facilitate consistent transformations for subsequently available data with the simplest of function calls. Finally, it’s certainly worth noting that this class is more than just a static tool; it’s intended as a platform allowing users to incorporate their own feature engineering functions, using simple data structures to build on top of all of these built-in capabilities.
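
To make that workflow concrete, here’s a minimal sketch of the train/test pattern described above. Treat it as a sketch under assumptions: the import path and the unpacking of the returned sets are simplified (the actual automunge() call returns a number of additional sets), so consult the README at automunge.com for the current signature.

```python
import pandas as pd
from Automunge import AutoMunge  # import path assumed per the repo convention

am = AutoMunge()

# placeholder data for illustration
df_train = pd.DataFrame({'feature' : [1.0, 2.0, 3.0], 'label' : [0, 1, 0]})
df_test = pd.DataFrame({'feature' : [4.0, 5.0]})

# fit and apply transformations to the training data; the final
# returned entry, postprocess_dict, records the fitted parameters
returned_sets = am.automunge(df_train, labels_column = 'label')
postprocess_dict = returned_sets[-1]

# consistently encode subsequently available data
test_sets = am.postmunge(postprocess_dict, df_test)
```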

This week’s updates, in addition to attendance at a really fun charity golf tournament and concert to benefit victims of breast cancer, included a bit of catchup with respect to those snafus noted in last week’s essay. These included an issue with power-law transformations, an issue with feature importance methods applied in parallel with custom processing, as well as the implementation of feature selection methods for trimming candidate column derivations from the output. Each of these was thankfully resolved, and along the way we even got a chance to round out the library of feature engineering methods with some really neat supplemental derivations for time-series data, which we’ll demonstrate below.

The bug fixes turned out to be pretty straightforward. In the demonstration Colaboratory notebook linked from our last post “Diverse Feature Engineering with Automunge,” we had noted two issues. The first popped up when attempting to run automated numerical processing with the demonstration Boston housing market data set. It turns out that the Box-Cox transformation used to address power-law distributions is somewhat idiosyncratic, in that it may produce error messages for numerical sets with incompatible properties. Our method had been simply to test whether a numerical set contained all-positive values, and I guess the lesson here is that this single rule alone may be insufficient to detect sets appropriate for the transform in our automated address. The solution was simply to add a few transformations to the library that turn off the default power-law transform and apply it only to specified columns, allowing a user to test for themselves which columns are suited to the method; see the new library categories ‘bxc2’ and ‘bxc3’.
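
For a sense of the failure mode, here’s a hedged sketch of the kind of eligibility check involved, using scipy’s Box-Cox implementation (scipy and the helper name are assumptions here; the essay doesn’t specify the underlying implementation). Box-Cox requires strictly positive inputs, and as noted above even that condition isn’t always sufficient.

```python
import numpy as np
from scipy import stats

def boxcox_eligible(values):
    """Heuristic pre-check before attempting a Box-Cox transform."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    # strictly positive values are necessary for Box-Cox (though,
    # per the lesson above, not always sufficient)
    return arr.size > 0 and bool(np.all(arr > 0))

data = np.array([1.2, 3.4, 0.5, 7.8])
if boxcox_eligible(data):
    transformed, fitted_lambda = stats.boxcox(data)
```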

Another issue that came up in our demonstration notebook was identified for cases where a user was overwriting the default transformation functions via the automunge() passed objects transformdict and processdict, and specifically the issue came up when these methods were used in parallel with feature evaluation. I feel kind of silly about this one, as it turns out this wasn’t due to the internals of the automunge function; it actually had to do with the specified transformdict. Specifically, we had defined a new transformdict for the ‘nmbr’ category using the category ‘nmb2’ in the cousins primitive. Well, here’s the issue: because cousins is a primitive that supplements but does not replace the source column, our error arose because we had specified a cousin primitive for a root category without also specifying at least one replacement primitive. You see, if we do want to leave one of our source columns in place, we need to do so by passing the column to an ‘excl’ category, which excludes it from transforms (adding the ‘_excl’ suffix to the label and building out the appropriate data structure for support). So yeah, my bad: I should have passed ‘nmb2’ using the auntsuncles primitive instead. Something to keep in mind if you want to experiment with the tool.
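
To illustrate the distinction, here’s a hedged sketch of the two versions of the transformdict. The family tree keys follow the primitive names discussed in the preceding essays, though the exact dictionary layout may vary by version.

```python
# How the notebook had it: 'nmb2' passed as a cousins entry, a
# primitive that supplements but does not replace the source column,
# with no replacement primitive specified; hence the error
transformdict_broken = {'nmbr' : {'parents'     : [],
                                  'siblings'    : [],
                                  'auntsuncles' : [],
                                  'cousins'     : ['nmb2']}}

# The fix: 'nmb2' passed as an auntsuncles entry, a replacement
# primitive, so the source column is replaced as required
transformdict_fixed = {'nmbr' : {'parents'     : [],
                                 'siblings'    : [],
                                 'auntsuncles' : ['nmb2'],
                                 'cousins'     : []}}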

It was also perhaps a little embarrassing that I had tried to roll out a method in version 1.78 which allowed a user to automatically trim outputted feature sets based on their evaluated feature importance (either via a defined percent of ranked features or alternatively via some feature importance metric threshold), which, lol, didn’t actually work at the time per se. Fortunately our methods were sound; just a few implementation details in the nuts and bolts were the source. So long story short, the user now has the working option to pass either a featurepct or featuremetric value to the automunge function to allow the algorithm to trim those source features falling below the threshold. The curse of dimensionality shall not prevail if we have anything to say about it.
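
As a hedged sketch of the call, reusing the am instance and df_train frame from the sketch above (the featureselection flag and exact signature are assumptions based on the description here; check the README for the current interface):

```python
# retain only the top 75% of source features by importance ranking;
# the featureselection flag is assumed to enable the evaluation
returned_sets = am.automunge(df_train,
                             labels_column = 'label',
                             featureselection = True,
                             featurepct = 0.75)
```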

In me thou see’st the twilight of such day

As after sunset fadeth in the west,

Which by and by black night doth take away,

Death’s second self, that seals up all in rest.

- Shakespeare

Of course, this week wasn’t all troubleshooting and bug-zapping; we also took the opportunity to build on our library of feature engineering transformations, this time with a focus on time-series data sets. We’d kind of set time-series data aside after our original address in the essay “It’s Only a Munger of Time,” which demonstrated our methods for segregating time series data by time scale (year/month/day/hour/minute/second) followed by normalization. Of course, people have devoted entire careers to addressing time series data (this was even the original purpose of the pandas library when first authored by Wes McKinney), and we don’t pretend that our single function is the most sophisticated method, but it served its purpose as a placeholder while we built out the platform.

Just as a man working with his tools should know its limitations, a man working with his cognitive apparatus must know its limitations.

- Charlie Munger

The additions here to round out our library included a series of boolean columns serving to identify whether a given timestamp entry fell within a few categories of interest. We’ve demonstrated methods for assembling boolean identifiers of business hours (9–5), weekdays (M-F), and holidays (US Federal), which basically means that if a timestamp falls within one of these buckets the cell activates with a 1, and otherwise defaults to 0. These methods are now available in the core library under categories ‘bshr’, ‘wkdy’, and ‘hldy’, and can be applied to a time-series column in parallel with the original time processing methods by assigning the column to the category ‘dat2’. Probably worth noting that although the original time category is still carved out from the ML infill methods, these new boolean categories can be served by the machine learning infill methods.
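
Here’s a hedged sketch of that assignment, following the assigncat convention for column assignments and reusing the am instance from the earlier sketch (the column name is a placeholder):

```python
# assign a timestamp column to the 'dat2' root category so the
# boolean 'bshr' / 'wkdy' / 'hldy' columns are derived alongside
# the standard time-scale transforms
assigncat = {'dat2' : ['timestamp_column']}

returned_sets = am.automunge(df_train, assigncat = assigncat)
```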

I think the time-series portion of the tool has a lot of potential, and there are certainly a few things that could round it out further. Probably worth noting that the current methods are time-zone neutral, meaning they just process data based on the current time zone of the set (by default, pandas datetime objects are time-zone unaware). A future extension could certainly establish a time zone basis, such as might be useful if one wished to compare a set against trading hours for the NYSE, for instance. We could, for instance, test a column for whether a time zone is specified and, if so, incorporate a few additional methods. It’s probably also worth noting that the time basis as implemented is not especially well suited for recurrent neural nets or their analogs (such as convolutions or transformers, for instance), and I guess the current intent is that a user looking to incorporate this tool into that type of project pass their own processing functions using the methods established in the preceding two essays.
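
For reference, here’s a minimal sketch of that time zone test in pandas (the helper name is hypothetical): a tz-aware datetime column carries a tz attribute on its dtype, while the default naive column does not.

```python
import pandas as pd

def is_tz_aware(series):
    """Return True if a datetime series carries time zone info."""
    return getattr(series.dtype, 'tz', None) is not None

df = pd.DataFrame({'t' : pd.to_datetime(['2019-05-05 09:30'])})
print(is_tz_aware(df['t']))    # False: pandas defaults to tz-naive

df['t'] = df['t'].dt.tz_localize('US/Eastern')
print(is_tz_aware(df['t']))    # True: now tz-aware
```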

In me thou see’st the glowing of such fire

That on the ashes of his youth doth lie,

As the death-bed whereon it must expire,

Consum’d with that which it was nourish’d by.

- Shakespeare

Since we’re on the subject of time-series data, and since, well, I have been known to go on tangents (this will be a big one), I’d like to share a few thoughts on music composition, because hey, this is my blog and I can do whatever I want. Anyway, I’ve noticed that there is a certain lack of dimensionality in mainstream pop that I find kind of troubling. It’s like we have all of these generations of musicianship over the years being discarded in the interest of easily digestible formulas. Where I’m going with this is that I feel like the time scale or tempo axis is one that has a lot of room for exploration. Consider that SL-1200 turntables had a slider for speed, but speed also varies pitch and so must be used sparingly; after all, we’re kind of constrained by key signatures. In modern practice we have tools like autotune that can offset pitch deviation, so it could easily be implemented to allow a DJ, producer, digital opera singer, or whoever to remix or even compose time-axis fluidic works.

Can you imagine how much more interesting music would be without that redundant drum loop sample to lull you into a trance? I mean, we would need to use this sparingly, at least at first; we wouldn’t want to confuse an audience that has grown so accustomed to formulas and repetition. Heck, it’s bad enough that the dynamics of mainstream music rarely follow even the simplest volumetric dynamics (e.g. p/mp/mf/f); could you imagine a time-set tempo crescendo the likes of Ravel’s Bolero? Anyway, thought it might be fun to give it a try. All the best. Nick

This thou perceiv’st, which makes thy love more strong,

To love that well which thou must leave ere long.

- William Shakespeare

Books that were referenced here or otherwise inspired this post:

Python for Data Analysis — Wes McKinney

Five Lessons — Ben Hogan

(As an Amazon Associate I earn from qualifying purchases.)

* This software does not offer investment advice or securities.
