Dimensionality Reduction with Automunge v1.9

Another week in the bag

Nicholas Teague
Automunge
8 min read · May 10, 2019

--

“go-la-nc” sculpture by Ed Wilson
The Allman Brothers — Jessica

For those that haven’t been following along, I’ve been using this forum in recent months to document the development of a software tool implemented in Python intended to simplify a data scientist’s workflow associated with wrangling structured (tabular) data in preparation for the application of machine learning. The tool takes as input Pandas dataframes and converts them into numerically encoded sets suitable for the direct application of machine learning. The tool allows a user the choice between deferring to automated evaluation of data set properties to determine appropriate processing methods for each column versus user-specified application of column-specific feature engineering and infill methods. The tool includes a library of feature engineering transforms that we are continuing to build out, and a user also has the ability to pass their own sets of feature engineering functions incorporating simple data structures to build on the capabilities of automunge. Along the way we offer such useful methods as machine learning derived infill, feature importance evaluation, and the simplest means available for consistent processing of subsequently available data with just a simple function call. In short, we make machine learning more chill.

In recent weeks we’ve continued to build out our library of feature engineering transformations for numerical and time series data, introduced functionality to flesh out the potential of automunge as a platform for user-defined feature engineering methods, and developed feature importance evaluation metrics. Last week we even demonstrated a method to trim features from our model based on performance in the feature importance evaluation. This week’s updates were primarily focused on an application somewhat tangential to this feature importance method, in that we incorporated another option for a user to address dimensionality reduction, now via PCA (Principal Component Analysis).

I’ll offer a quick introduction to PCA for those who may be unfamiliar. PCA is a method for reducing the dimensionality of a set, such that if, say, we have a tabular set with several columns of numerical data, with PCA we can specify how many columns we would like to reduce this to while still capturing as much variance as we can in the distributions of the retained transformed features. As an illustrative example, let’s say we have a data set including one column that carries values returned from some function f(x), and a second column that carries values returned from a similar function whose output is doubled, g(x) = 2 * f(x). Well if we were going to train a model from this two-column set, we expect the model probably won’t learn a great deal more with the inclusion of both columns f(x) and g(x) versus training with just the single column f(x); after all, the two columns carry very redundant information in that there is a direct linear relationship between f(x) and g(x) with respect to x. Now if we wanted to apply PCA to this two-column set, we could transform the set [f(x), g(x)] into a single-column set [h(x)] which contains all of the relevant dataset variance to serve as fuel for a subsequent training operation.
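
To make that concrete, here’s a minimal sketch of the redundant-column example using Scikit-Learn’s PCA directly on some made-up data (this is just the textbook operation, separate from the automunge interface):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=1000)

f_x = 3 * x + 1        # some function f(x)
g_x = 2 * f_x          # g(x) = 2 * f(x), perfectly redundant with f(x)

X = np.column_stack([f_x, g_x])

# reduce the two-column set [f(x), g(x)] to the single column [h(x)]
pca = PCA(n_components=1)
h_x = pca.fit_transform(X)

# the single retained component explains essentially all of the variance
print(pca.explained_variance_ratio_)   # ~[1.0]
```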

Molly Hatchet — Gator Country

Let’s consider a second example where we again have a column carrying values returned from some function f(x), but this time in addition to the column g(x) = 2 * f(x) we have a third column containing values returned from the function h(x) = f(x)², which represents a nonlinear relationship between the sets f(x) and h(x). Now there are versions of PCA that allow for conversion of non-linear relationships (such as the kernel PCA method in the Scikit-Learn library), however the version implemented here will be based on the vanilla PCA offered by Scikit, which is only meant to address linear variance (we’ll defer the non-linear variance to the machine learning model intended to be trained with the automunge function’s returned sets). A third example now: let’s say we have one more column containing values returned from the function i(x, y) = f(x + y). This time we have elements of the function that are orthogonal to f(x), in that we’re assuming there is no direct relationship between x and y. So one result from a PCA dimensionality reduction of the set [f(x), g(x), h(x), i(x, y)] could be something of the form [j(x), k(y)] for instance; after all, the goal of PCA is to break down the dimensionality into orthogonal components that explain the most variance with a reduced number of features. (For an example of orthogonality consider, say, the position and momentum of a particle in superposition such as may be described by the Heisenberg uncertainty principle, kind of a tangent.)
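
For a rough feel of that linear-only caveat, here’s another small sketch with made-up data, again calling Scikit-Learn directly: since f(x) and f(x)² are roughly uncorrelated for a symmetric distribution of x, a single vanilla PCA component leaves a chunk of the variance behind, whereas kernel PCA has the flexibility to model the nonlinear structure:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

f_x = x
h_x = f_x ** 2         # h(x) = f(x)^2, a nonlinear relationship with f(x)
X = np.column_stack([f_x, h_x])

# vanilla PCA only addresses linear variance: one component leaves a sizable
# share of the variance behind since f(x) and f(x)^2 are ~uncorrelated here
linear = PCA(n_components=1).fit(X)
print(linear.explained_variance_ratio_)   # well short of 1.0

# kernel PCA (not what automunge applies here) can capture nonlinear structure
nonlinear = KernelPCA(n_components=1, kernel='rbf').fit_transform(X)
```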

The whole point of all of this is dimensionality reduction, in that it results in a reduced number of features, just one more tool in our arsenal for the fight against the curse of dimensionality. The PCA also serves as a kind of regularization on a downstream model, albeit one applied prior to the actual model training operation. Note too that the application of PCA is a kind of unsupervised learning, in that we are not making use of the data set labels in the application. As implemented here, the PCA model is trained using the Scikit-Learn library based on properties of the train set, and then that same model is used to transform both the train and test sets, as well as saved in the returned dictionary for comparable application to subsequently available data through the postmunge function. The method is realized with two simple parameters passed to automunge. A user can pass either a float between 0 and 1 or an integer to the PCAn_components parameter, which mirrors a similar argument in Scikit’s PCA implementation. For float values between 0 and 1 the PCA will transform the features to the minimum number of columns that can explain that fraction of the variance, and for integers >1 the PCA will generate that number of columns with features transformed for maximum explained variance. The second passed argument is called PCAexcl, and basically this is a list of columns to exclude from the PCA operation, such that a user can either pass the name of original columns (pre-transformation) to exclude all associated derivations, or alternatively pass the column names for a subset of those columns produced in the automunge transformations (I’ll demonstrate these methods in the Colaboratory notebook linked below).
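
To give a rough picture of how these two parameters might look in a call, here’s an abbreviated sketch; the surrounding call signature and returned sets are simplified here (consult the documentation or the demonstration notebook for the full interface), and the ‘id’ column and file names are just hypothetical placeholders:

```python
import pandas as pd

# hypothetical train and test data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# assumes the automunge function has been imported / defined in the session
# per the project documentation; return structure abbreviated for illustration
returned_sets = automunge(df_train, df_test,
                          PCAn_components=0.90,   # retain enough components for 90% of variance
                          PCAexcl=['id'])         # exclude the (hypothetical) 'id' column from PCA
```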

“The Halo” — sculpture by James Lee Byars

I’ll offer in closing for this week’s updates a few thoughts on the Automunge project, including strengths and weaknesses as currently implemented. First it’s probably worth noting that the core functionality of the tool is implemented with the support of two mainstream libraries, Pandas and Scikit-Learn. In fact one way to think about Automunge is that it’s an interface for streamlined preparation of tabular data for the application of machine learning, which otherwise would often be implemented manually using some combination of these two libraries if not through an expensive ($$$) data preparation package. So what we’ve accomplished here is roughly analogous (loosely) to Francois Chollet building an interface on top of TensorFlow for streamlined machine learning projects with the Keras library. And we’re kind of moving in the direction of increased programmability of the Automunge tool such that it is more than a push-button operation and can be built on as a platform for user-customized feature engineering projects.

Of course with the use of Pandas and Scikit we inherit all of their limitations. Pandas works on a single CPU core and is limited by the memory capacity of the session, so although I haven’t extensively tested this yet my expectation is that the size of datasets that can be handled through the Automunge tool is possibly on the order of <10 GB; that’s kind of a guess though, some validation is needed here. Another limitation of these frameworks is that I believe neither Pandas nor Scikit allows for parallelization of operations such as might facilitate GPU acceleration, so for larger datasets it may take a few minutes to apply. I know that NVIDIA is actively developing some libraries here to make use of their GPUs for this purpose, and I expect down the road we’ll be hearing more from them as an alternative to Pandas.

But really the key takeaway I think a user should get here is that Automunge is the only resource that I know of for freely available open source software for user-customized push-button data wrangling. All of the other offerings from commercial providers have a subscription model or charge for access to computing resources. Of course the goal is to join this club, but our approach is different in that we only intend to charge once we reach the stage of offering processing methods that require external computing resources. In the meantime, you have here at your disposal a very powerful tool for working with tabular data in your own Python environment. Put it to use! Ok good week, must be traveling on.

Lynyrd Skynyrd — Free Bird

Books that were referenced here or otherwise inspired this post:

Troublemakers — Leslie Berlin

(As an Amazon Associate I earn from qualifying purchases.)

* This software does not offer investment advice or securities.
