Automunge 1.79: An Open Source Platform for Feature Engineering

BYOF: Bring Your Own Function

Nicholas Teague
Automunge
13 min read · Apr 21, 2019


The Gutenberg printing press, 1455 (replica)
The Beatles — A Hard Day's Night (live)

In case you haven’t been following along, I’ve been using this forum over the last few months to document the development of a tool for wrangling structured data sets. The long range goal for the effort remains the full automation of the workflow, such that machine learning can be applied directly to some arbitrary form of data. In practice there are still some technical hurdles to clear before this becomes a viable approach, so in the meantime we’ve made a few compromises in the design to give a user options between data processing specification versus automation — in fact this was one of the key contributions of version 1.78, presented in our last essay.

One thing this essay won’t attempt is an overview of the software, heck I just did that with the presentation in the preceding link. I mean repetition is useful and all, but after a while you’re going to have to visit the table of contents I suppose. This is a book after all. Just because you’re not reading it on a printed page doesn’t make it any less so. (Although come to think of it, by that line of reasoning the definition of a book can become somewhat blurry — is a song also a kind of book? A play? An abstract painting? I’m kind of leaning toward yes. You’ll have to forgive me if I’m getting kind of spacey here, it’s been a hard day’s night. :)

artist credit: Dofey

What this essay will attempt is to flesh out the mechanisms behind the use of the Automunge software not just as a tool, but as a platform. Yes, this is the first time we’re using that P word that I imagine venture capitalists like to hear so much. Automunge started as a very narrow segment of the data science workflow, but through the development of the tool I think we’ve successfully realized a generalized method for applying feature engineering transformations to structured data, with the incorporation of machine learning derived infill, feature evaluation, consistent processing of subsequently available data, and, well, a whole host of useful mechanisms to simplify the life of Joe Schmo data scientist. (We’re here to help after all!)

(Please consider the inclusion of this totally random link as kind of like marketing for my book — you know, just trying to generate a buzz. :)

So let’s just go ahead and jump right on in. Along the way we’ll demonstrate a few methods to build on the Automunge platform. First we’ll highlight the development of a new transformation function for inclusion in the internals of the tool (this is open source software after all, perhaps some of you may want to contribute your own transformation functions). Second we’ll demonstrate how a user can pass their own transformation functions to the tool and make use of all of the nifty features without having to wait for me to figure out how to coordinate external contributions of transformation functions (hey, I’m kind of new at this whole open source thing, cut me some slack! Actually, come to think of it, if you want to get plugged in as part of the Automunge community feel free to drop a message on automunge.com requesting an invite to the Slack channel — currently a very exclusive club! :)

Automunge: we do things that don’t scale until they do.

The transformation function we’ll implement here is one that is pretty fundamental to the address of numeric sets. Actually it’s kind of embarrassing that this wasn’t incorporated yet (it’s not the number of users that matters, it’s the growth rate — same goes for the number of transformation functions I suppose). Specifically we’ll demonstrate a new category of transform, keyed (in the traditional four character string length) as “mnmx”, which has nothing to do with that rapper FYI, it actually stands for min-max scaling. Min-max scaling is quite simply implemented by finding the maximum of a set, which we’ll call max, and the minimum of a set, which we’ll call min, and then for each point xi deriving mnmx(xi) = (xi - min) / (max - min), which gives us a value within range 0 <= mnmx(xi) <= 1 for all points in the set. (A hat tip for this simple formula to the Feature Engineering for Machine Learning text by Zheng and Casari.)
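As a minimal sketch of just the formula itself (independent of any of the Automunge data structures, with the function name and example values here purely illustrative), applied to a pandas column it looks something like:

import pandas as pd

def minmax_scale(column):
    # derive the normalization parameters from the set
    colmin = column.min()
    colmax = column.max()
    # apply mnmx(xi) = (xi - min) / (max - min)
    return (column - colmin) / (colmax - colmin)

# example: returns values of 0.0, 0.5, 1.0
scaled = minmax_scale(pd.Series([3.0, 7.5, 12.0]))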

Now the complexity comes in because we are not simply applying the transformation in our function; we’re also in parallel developing the data infrastructure to support all of the methods in Automunge for machine learning infill, feature evaluation, consistent downstream processing, etc. Bearing in mind that we’re performing transformations on two different sets, the train and test data, but only using the properties of the train data to determine the normalization parameters (in this case the min and max values), we end up defining two different transformation functions: the first intended for simultaneous address of the train and test data, and the second intended for downstream transformations of test data that wasn’t available at the initial address, for which we’ll need the normalization parameters from the train data as an input. Here we demonstrate the nuts and bolts of these two functions for min-max scaling. Note that the infill performed here is only the initial address; user-specified or ML infill methods follow separately outside of this function. Also note that this demonstration is for a singular derived column; these methods can easily be extended to transformations with multiple derived columns using these same data structures.
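As a simplified sketch of what that pair looks like (the function signatures and dictionary keys here are illustrative, not the exact internals rolled out in version 1.79): one function derives the min and max from the train set and applies them to both the train and test sets, and a postprocess counterpart re-applies those saved normalization parameters to data that shows up later.

import pandas as pd

def process_mnmx_sketch(df_train, df_test, column):
    # normalization parameters are derived from the train set only
    colmin = df_train[column].min()
    colmax = df_train[column].max()
    mean = df_train[column].mean()
    # initial infill of missing cells with the train set mean
    df_train[column + '_mnmx'] = df_train[column].fillna(mean)
    df_test[column + '_mnmx'] = df_test[column].fillna(mean)
    # apply min-max scaling to both sets using the train-derived parameters
    df_train[column + '_mnmx'] = (df_train[column + '_mnmx'] - colmin) / (colmax - colmin)
    df_test[column + '_mnmx'] = (df_test[column + '_mnmx'] - colmin) / (colmax - colmin)
    # the normalization parameters are saved for consistent downstream processing
    normalization_dict = {'minimum' : colmin, 'maximum' : colmax, 'mean' : mean}
    return df_train, df_test, normalization_dict

def postprocess_mnmx_sketch(df_test, column, normalization_dict):
    # consistent processing of subsequently available data,
    # using the normalization parameters derived from the original train set
    colmin = normalization_dict['minimum']
    colmax = normalization_dict['maximum']
    df_test[column + '_mnmx'] = df_test[column].fillna(normalization_dict['mean'])
    df_test[column + '_mnmx'] = (df_test[column + '_mnmx'] - colmin) / (colmax - colmin)
    return df_test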

Tennessee Ernie Ford — 16 Tons (live)

You’ll be happy to know that I’ve gone ahead and built this ‘mnmx’ transform function into the tool, and it’s rolled out with the new version 1.79 that is being released simultaneously with this essay (“Move quickly and be careful not to break anything.” — that’s our motto :). This wasn’t the only update associated with 1.79; another key feature being rolled out here is the facilitation of user-defined transformation functions. We kind of hinted at that in the last Colaboratory demonstration notebook, and truth be told there were a few implementation details that still needed to be sorted out at the time. The good news is that those details are now officially sorted. We can now call ourselves a big functional platform. The way I see it, hopefully a supportive community can crop up to donate transformation functions to our library, and in parallel, if you’ve got some stuff unique to your dataset, no need to get in line: you now have the ability to define your own transformation functions on top of the platform. Cool! Let’s go ahead and demonstrate how.

Let’s say that we still want to make use of the min-max scaling transform (aka category = ‘mnmx’), but also want to perform a z-score normalization transform on the original column in parallel, thus ending up with derived columns of mnmx, nmbr, and, just because it’s good practice, a NArw column indicating cells that were subject to infill. The way we’ll do so is to define and pass to automunge some additional definitions for our “family tree” of data transformations. In so doing we’ll make use of previously defined functions, so there will be no need to define new transforms in the processdict (more on that below); this will just tell the function the order and composition of transformations to perform on a specified column. Note that if we pass a category to a primitive with downstream offspring, those downstream transforms previously defined in the family tree will be implemented. For example, if we pass the ‘nmbr’ category as a parent, then all of the associated downstream transformations for nmbr will be implemented (such as power law transform, standard deviation bins, etc.). However if we instead pass the nmbr category as an auntsuncles primitive, for instance, the nmbr branch will not include downstream offspring. Similarly, if we passed nmbr as siblings or cousins (instead of parents or auntsuncles) we would have similar results except the original column would be kept in place (i.e. supplement vs replace). Pretty nifty right? Here again are the defined primitives of our family tree:

Automunge family tree of transformation primitives

Here we define our new root category of ‘mnm2’ making use of existing category definitions. We do so by defining a transformdict and processdict outside of the automunge call and passing these to the automunge function as arguments. The result is that the assigned column col1 (and similarly col2) is transformed into derived columns [col1_mnmx, col1_nmbr, col1_NArw]. And since we assigned column col0 to the mnmx transform, which didn’t have nmbr in its family tree, that one is just realized as [col0_mnmx, col0_NArw] per the tree for root ‘mnmx’ previously defined internal to automunge.
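As a rough sketch of what those passed definitions can look like (the primitive names follow the family tree convention shown above, while the exact processdict entry requirements may differ slightly from what’s shown; treat this as illustrative rather than gospel):

# family tree for the new root category 'mnm2': replace the source column with
# mnmx and nmbr derivations (no offspring), and supplement with a NArw indicator
transformdict = {'mnm2' : {'parents' : [],
                           'siblings' : [],
                           'auntsuncles' : ['mnmx', 'nmbr'],
                           'cousins' : ['NArw'],
                           'children' : [],
                           'niecesnephews' : [],
                           'coworkers' : [],
                           'friends' : []}}

# since 'mnm2' only recombines existing category definitions, its processdict
# entry mainly carries the properties used for infill and row identification
processdict = {'mnm2' : {'dualprocess' : None,
                         'singleprocess' : None,
                         'postprocess' : None,
                         'NArowtype' : 'numeric',
                         'MLinfilltype' : 'numeric'}}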

Keeping up? I know we’re getting into the weeds a bit; it kind of goes with the territory I suppose. Don’t worry, the good news is this type of customization is not required to use Automunge. I’m only demonstrating the logistics if one wanted to build on top of the platform, such as to customize data transformations. The application of Automunge without customization is actually very, very simple, I promise! That being said ( :), we’re going to dive even deeper into the weeds for a minute to follow through on the real promise of Automunge: a platform for incorporating all of these neat concepts (ML infill, consistent downstream processing, infill methods, etc.) into user defined transformation functions. This is only intended for advanced users.

Tina Turner — Proud Mary

To demonstrate a custom transformation, let’s go ahead and make one up. I have an idea: how about we extend the idea of min-max scaling, but try to account for the fact that our data may have outliers that we want to ignore. For example, if we were going to perform some auto insurance analysis where in addition to how fast someone drives we took driver wealth into consideration, well we wouldn’t necessarily want to include Bill Gates in our analysis, after all he would skew all of our statistics. So let’s define a function where we implement min-max scaling but with the proviso that data points are capped at the 99% and 1% quantiles, noting that we would want to do this before deriving the mean for infill for instance. We’ll call this new transform mnm3, and we’ll have a process_mnm3 function for automunge and a postprocess_mnm3 function for postmunge. Now this is going to look very similar to the mnm2 defined above, but with the difference that since this is defined external to the automunge class, well, we won’t include the “self” convention in the definition (do a quick control-F search for “self” above if you don’t follow this point). Also important to keep in mind: this defined function won’t be saved in the postprocess_dict dictionary returned from automunge, so we won’t have access to it for postmunge unless we re-define the process_mnm3 functions in the same notebook where we apply postmunge. (It may be possible in a future iteration to save an externally defined function into our returned dictionary; tbh this question is a little above my pay grade (I’m bootstrapping, to give you an idea), so we’ll save this for a future extension.) What I’m getting at is: save your custom function definitions, people, it’s important! Here we go, let’s define our process_mnm3 functions (shown for both automunge and postmunge), which again are meant to illustrate how one would go about passing custom transformations on top of the Automunge platform.
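A simplified sketch of the pair follows (the quantile capping and infill sequence track the description above, while the exact signatures expected by the tool may carry a few additional arguments; consider this illustrative):

import pandas as pd

def process_mnm3(df_train, df_test, column):
    # cap outliers at the train set's 1% and 99% quantiles
    # before deriving the mean used for infill
    quantilemin = df_train[column].quantile(0.01)
    quantilemax = df_train[column].quantile(0.99)
    train_capped = df_train[column].clip(lower=quantilemin, upper=quantilemax)
    test_capped = df_test[column].clip(lower=quantilemin, upper=quantilemax)
    mean = train_capped.mean()
    # initial infill with the capped train set mean
    train_filled = train_capped.fillna(mean)
    test_filled = test_capped.fillna(mean)
    # min-max scale with the capped quantiles standing in for min and max
    df_train[column + '_mnm3'] = (train_filled - quantilemin) / (quantilemax - quantilemin)
    df_test[column + '_mnm3'] = (test_filled - quantilemin) / (quantilemax - quantilemin)
    # train-derived properties, saved for consistent postmunge application
    normalization_dict = {'quantilemin' : quantilemin,
                          'quantilemax' : quantilemax,
                          'mean' : mean}
    return df_train, df_test, normalization_dict

def postprocess_mnm3(df_test, column, normalization_dict):
    # consistent processing of subsequently available data
    # using the properties saved from the original train set
    quantilemin = normalization_dict['quantilemin']
    quantilemax = normalization_dict['quantilemax']
    test_capped = df_test[column].clip(lower=quantilemin, upper=quantilemax)
    test_filled = test_capped.fillna(normalization_dict['mean'])
    df_test[column + '_mnm3'] = (test_filled - quantilemin) / (quantilemax - quantilemin)
    return df_test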

Now before we demonstrate the assembly of the transformdict and processdict for passing these to Automunge, to be thorough let’s quickly define one more min-max processing approach to illustrate another scenario. Let’s say that we have a function that we want to apply to the train and test sets but that doesn’t require any properties to be derived from the train set (and thus no passing of train derived properties to the test set via the normalization parameters). Well, for this simpler case we won’t need to define separate functions for automunge and postmunge address; we can just define what we call a “singleprocess” function which is only applied to one data set at a time. Here we’ll define what we call mnm4 for cases where we know in advance that we want to use a hard coded mean for infill and hard coded min/max values for scaling. In addition to the single dataframe passed to the function, another difference for this scenario is how the function is saved in the processdict, which we’ll demonstrate below.
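A sketch of that singleprocess flavor (the hard coded values here are arbitrary placeholders, and again the exact expected signature is illustrative):

def singleprocess_mnm4(df, column):
    # hard coded properties, so nothing needs to be derived from the train set
    # or passed between the automunge and postmunge calls
    hardmin, hardmax, hardmean = 0.0, 100.0, 50.0
    # infill with the hard coded mean, then min-max scale with the hard coded range
    filled = df[column].fillna(hardmean)
    df[column + '_mnm4'] = (filled - hardmin) / (hardmax - hardmin)
    return df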

And now that we’ve defined these custom functions for mnm3 and mnm4, we’ll have to pass them to our automunge call, which we’ll do by constructing the associated transformdict and processdict objects. Again, a quick reminder that we’ll have to redefine any external functions (like for mnm3 and mnm4) if we apply postmunge in a separate notebook session. Also a word of caution: be careful not to introduce any infinite loops when populating the transformdict; I have some thoughts about how to solve the halting problem but haven’t quite rolled that feature out yet :).
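Roughly, the assembly looks like the following (the key names are illustrative of the data structures described above; these dictionaries then ride along as the transformdict and processdict arguments to the automunge call, with the target columns assigned to the new root categories):

# family trees for the new root categories, each replacing the source column
# with its scaled derivation and supplementing with a NArw infill indicator
transformdict = {'mnm3' : {'parents' : [], 'siblings' : [],
                           'auntsuncles' : ['mnm3'], 'cousins' : ['NArw'],
                           'children' : [], 'niecesnephews' : [],
                           'coworkers' : [], 'friends' : []},
                 'mnm4' : {'parents' : [], 'siblings' : [],
                           'auntsuncles' : ['mnm4'], 'cousins' : ['NArw'],
                           'children' : [], 'niecesnephews' : [],
                           'coworkers' : [], 'friends' : []}}

# mnm3 uses the dual automunge/postmunge pair, mnm4 the singleprocess function
processdict = {'mnm3' : {'dualprocess' : process_mnm3,
                         'singleprocess' : None,
                         'postprocess' : postprocess_mnm3,
                         'NArowtype' : 'numeric', 'MLinfilltype' : 'numeric'},
               'mnm4' : {'dualprocess' : None,
                         'singleprocess' : singleprocess_mnm4,
                         'postprocess' : None,
                         'NArowtype' : 'numeric', 'MLinfilltype' : 'numeric'}}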

And there you have it: you’re now equipped to make use of the Automunge platform for data wrangling with custom transformation functions! In summary, you now have the ability to take advantage of all of the capabilities of Automunge with user customized transformation functions, such as machine learning derived infill, feature evaluation, and, most importantly, a very simple method for consistent processing and normalization of data that was not available at the initial address — all with just one simple function call. Quick break to cleanse the visual palate before some closing thoughts:

Another masterpiece by Dofey

I watched an episode of Shark Tank the other night; in case you’re not familiar, it’s this really neat show where entrepreneurs are brought before a panel of judges / investors to simply pitch their startup for funding from those judges. Literally capitalism at its finest. And I was reminded of a story I heard about the evolution of the concept. In the first season of the show, contestants who wanted an opportunity to pitch, simply to pitch mind you, with no guarantee of funding, were required to give up a small slice of equity in their startup. This was literally craziness. While I’m sure the exposure from the show has some value, in no universe would it merit an equity stake. Fortunately noted investor Mark Cuban put his foot down with the producers and this egregious clause was stricken. If the investors wanted a stake they would have to negotiate for it on equal footing. I imagine the position of an angel investor or venture capitalist in the startup realm has some similarities to this Shark Tank show, such that proven cash flow / revenue streams are always first in line, and short of that it requires a really special opportunity or founders to merit much attention. In addition to being an admittedly somewhat creative exercise at times, this blog has partly been my attempt at trust-building. By documenting my progress and intentions thoroughly, I’ve tried to demonstrate potential for rapid progress — after all, when evaluating the popularity of some software, it’s not the number of users that matters, it’s the growth rate — same goes for the progression of a startup I suppose. The human mind sometimes finds it hard to grasp just what is possible with exponential growth rates, such as for instance the difference between a quantum computer with 10 qubits vs 54.

Hat tip image via Giuseppe Carleo

This Automunge tool started from the barest seed of an idea from tackling some of the beginner problem sets on Kaggle. I think it’s come a long way in a short amount of time and merits some attention from the data science community. I hope that my creative streaks don’t distract from the message; I mean, I could have taken the approach of Turing with his PhD thesis, writing in the most archaic formalism, the likes of lambda calculus, or in my case just dry python documentation, but instead I’ve tried to make it a little more digestible with music and art. One thing I don’t expect is that this same formula will necessarily work for the length of the project, after all what got you here won’t get you there as they say, but as we are trying to draw interest and build trust from a user base I think transparency and thorough communication have a lot of value. I admit there’s a little bit of a Field of Dreams scenario being played out with this tool; there’s a voice in the wind that keeps whispering to me “If you build it they will come.” While this is certainly motivating to me, I hardly expect some investor to buy in on such an intuition, however far it has taken me thus far. But you know what, every moon shot has to start somewhere.

(Nothing to do with my startup, I’m just happy with how this one turned out wanted to share again)

I’ll end this essay with a request to you the reader. If you are a data scientist or know someone who might benefit from this tool, please share! Even if you’re not a likely user yourself, consider giving it a retweet or posting one of our essays to Facebook. Exponential growth has to start somewhere, even with the barest seed. Who knows, you just might be that butterfly that flaps its wings and starts a hurricane. All it takes is one.

Outkast — Hey Ya

Books that were referenced here or otherwise inspired this post:

Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.

