Out of the Box Feature Engineering

Because learning is faster with a teacher

Nicholas Teague
Automunge
10 min read · Oct 5, 2018

--

The Traveling Wilburys — Volume 1 (full album)

For those who haven’t been following along, I’ve been using this forum over the last few weeks to document the development of some algorithms for the automated processing of structured data sets so as to prepare them for direct application of machine learning algorithms. Along the way we’ve developed the automunge tool, for simultaneous processing of a ‘train’ data set intended for training a neural network and a ‘test’ set intended for generating predictions from that trained model. We’ve also developed a tool we call postmunge for the subsequent processing of test data that was not available at the initial model training, which uses as input a python dictionary produced from the automunge application that we call postprocess_dict. Some of the key features of the tool include automated evaluation of each column in a 2D dataframe to determine the appropriate processing approach. For numerical data the column is normalized to mean 0 and standard deviation 1, for categorical binary data the column is converted to a 0/1 designation, for categorical data the column is processed via one-hot encoding into multiple columns, and for time series data the data is separated into multiple time scales of year/month/day/hour/minute/second and then normalized comparably to the numeric sets. Automunge also addresses cells with missing data by inferring appropriate infill using properties derived from the rest of the data, generating predictions for the missing cell’s row from a column-specific machine learning model, an approach we call ML infill.
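To make the per-column dispatch concrete, here’s a minimal sketch of the kind of evaluation and normalization described above. This is illustrative only, with hypothetical helper names, not the actual automunge internals:

```python
import pandas as pd

def evaluate_column(series: pd.Series) -> str:
    """Return a rough category label for a column (illustrative only)."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return 'date'
    if pd.api.types.is_numeric_dtype(series):
        return 'nmbr'
    if series.nunique(dropna=True) <= 2:
        return 'bnry'
    return 'text'

def normalize_numeric(series: pd.Series) -> pd.Series:
    """Z-score normalization: shift to mean 0, scale to standard deviation 1."""
    return (series - series.mean()) / series.std()
```

A 'nmbr' column would then pass through `normalize_numeric`, while the other categories route to their own encodings.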

I’ve noted before that although I consider the practice of feature engineering to be a candidate for full automation, the automunge tool in its current iteration has a ways to go before it can make that claim. However, those features that have been implemented are intended as suitable for application as the final step prior to application of a machine learning algorithm, and thus the intent is that as we continue our development journey, we will piece by piece move further up the value chain until we have reached the point of full automation. Initially we will roll out the tool as a free python library intended for application in a local session to address the final steps prior to application of machine learning, and then as we start to incorporate features requiring external servers we will introduce an API to interface with this module, one that requires a paid user account on a per-use pricing model. This is our secret master plan, just between you and me. (hat tip)

image via SpaceX

In my last post we implemented some revisions to the python code to speed up processing. In the time since I’ve had an opportunity to do a little reading, and of the books reviewed I found one particularly useful and relevant to this project: Feature Engineering for Machine Learning by Alice Zheng & Amanda Casari. Originally I was going to write a full review of the book here, but I think I would be doing it a disservice to try to capture so many relevant lessons in a single post, and instead will focus on one application.


This essay will build off some of the discussions in the book to derive a new automunge method for normalizing a numerical set. In prior explorations I had come across the idea of a logarithmic transform of a numerical series to address time series data, such as one might use for the price fluctuations of a security for instance. I was interested to see the book’s discussion of a generalization of the logarithmic transform, known as the Box-Cox transformation, which through the inclusion of a parameter lambda can represent the logarithmic transform, the square root, and related power transforms. Setting λ<1 acts to compress higher values, which helps normalize data that might have fatter tails, while setting λ>1 has the opposite effect. The transformation is applied to the data as follows:

Box-Cox power law transform
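Since the equation image may not reproduce in this format, here is the standard Box-Cox form, with the λ=0 case defined by continuity as the natural log:

```latex
y_i(\lambda) =
\begin{cases}
\dfrac{x_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x_i, & \lambda = 0
\end{cases}
```

Note that λ=1 leaves the data’s shape unchanged (a simple shift), λ=1/2 corresponds to a square-root-like transform, and λ=0 recovers the logarithm.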

Now you might ask what the benefit of the more generalizable form is if it requires an extra parameter to set; we’re trying to automate here after all, and extra parameters make that more challenging. Well, the cool point is that the Python SciPy stats library not only has the ability to implement this transformation, in so doing it can also estimate an appropriate parameter λ by maximizing the log-likelihood function for the given distribution. Now obviously as we get more data in our set the accuracy of the “optimal” λ parameter will improve, but we don’t need “optimal” to improve the performance of our neural network, we only need an improvement to tractability. So as long as we are moving in the right direction with this normalization technique, my hope is this transformation will have a positive effect on training even with inadequate data for a very accurate estimation of the maximum log-likelihood parameter λ for the transform.
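For example, when `scipy.stats.boxcox` is called without a lambda, it both transforms the data and returns the λ that maximizes the log-likelihood:

```python
import numpy as np
from scipy import stats

# Skewed, all-positive data: a lognormal sample stands in for something
# like security prices with a fat right tail.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With lmbda unspecified, boxcox returns (transformed data, fitted lambda).
transformed, fitted_lambda = stats.boxcox(x)
print(fitted_lambda)  # for lognormal data the fitted lambda tends toward 0
```

For lognormal data the λ=0 (logarithmic) case is the theoretically correct transform, so the maximum-likelihood fit lands near zero and the transformed set comes out approximately Gaussian.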

A further limitation of the Box-Cox transformation is that it only works on a set with all positive values. For a manual address we could work our way around this in a few ways, the simplest being that if we knew our set had a natural minimum value we could shift the set to all positive by adding a constant to each value (another approach could be to ‘squish’ the range of values to within a desired band such as 0 to 1). However, since we’re trying to automate here, I don’t see a way to reason about whether the set has a natural minimum; we’ll instead have to apply logic tests. Keeping with the philosophy of keep it simple stupid, I think the simplest address will be to test our train set for all positive values, and if so to proceed, otherwise just defer to the prior normalization approach. Because our transform has an asymptote at x=0, we’ll actually test our values for x>0.001. Further, if our test set turns out to have an inconsistent distribution, I’m thinking the easiest approach would be to clip any negative values and set them to NaN to facilitate infill as a quick fix (yes I know not ideal, but remember we’re trying to maintain processing consistent with the train set our machine learning model was trained on).
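The logic tests above might sketch out as follows (hypothetical helpers, not the automunge implementation): proceed with Box-Cox only if the train column clears the positivity threshold, and clip offending test-set values to NaN for downstream infill.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.001  # margin above the asymptote at x = 0

def boxcox_eligible(train_column: pd.Series) -> bool:
    """True if every non-missing train value clears the x > 0.001 test."""
    return bool((train_column.dropna() > THRESHOLD).all())

def clip_for_boxcox(test_column: pd.Series) -> pd.Series:
    """Set test values at or below the threshold to NaN, deferring to infill."""
    return test_column.where(test_column > THRESHOLD, np.nan)
```

If `boxcox_eligible` returns False on the train set, we simply fall back to the plain mean-0/std-1 normalization.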

Now the SciPy stats boxcox method will perform this transform for us for the train set. When we then apply the method to the test set we’ll want to set the lmbda value to that which was returned by our application of .boxcox to the train set. We’ll also want to store the lmbda value in our postprocess_dict for application in the postmunge function (forgive the terminology if this is your first post in this series, note we briefly re-introduced this naming convention in the opening paragraph).
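In code, that train/test consistency looks like this: fit λ on the train set only, then pass the same `lmbda` when transforming the test set.

```python
import numpy as np
from scipy import stats

train = np.array([0.5, 1.2, 3.4, 2.2, 5.1])
test = np.array([0.9, 4.3])

# Fit lambda on the train set; boxcox returns (transformed, lambda).
train_t, lmbda = stats.boxcox(train)

# Reuse the stored lambda on the test set; with lmbda given, boxcox
# returns just the transformed array.
test_t = stats.boxcox(test, lmbda=lmbda)
```

That stored `lmbda` is exactly the value we’d tuck into postprocess_dict so postmunge can apply the identical transform later.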

A further question we’ll have to ask ourselves is, other than the ‘all-positive’ test, when would it be appropriate to apply this Box-Cox transformation in the first place. I’m going to make a general statement here and I expect some may disagree, but from what I’ve gathered in explorations of Kaggle kernels, I think there is a tendency for machine learning practitioners to overuse graphical depictions of their data in their feature engineering analysis. I suspect this partly stems from the need to “do something”, and as a bonus graphs look cool. While I am sure there are cases where graphical depictions of data sets will reveal some useful directions, I think we can make a general statement that statistical measures will be a more robust gauge of a particular data stream’s usefulness, and certainly more amenable to automation. In the cases where both a column and the target labels are numerical, that statistical measure could simply be the Pearson R correlation statistic. For instance, we could test a potential feature engineering transformation’s usefulness by evaluating a correlation statistic between the column and the labels before and after the transformation to determine which is more suitable. For cases where either the column or the labels are categorical that type of evaluation becomes more challenging, and I’m going to have to put some thought into that. In the meantime, again in the interest of keeping it simple, I think a simple solution can be had by making a hypothesis.
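That numeric-vs-numeric screen could be sketched like so (a toy demonstration, with synthetic data standing in for a real column): compare Pearson r between feature and labels before and after the candidate transform, and keep whichever correlates more strongly.

```python
import numpy as np
from scipy import stats

# Synthetic example: labels depend on the log of a skewed feature, so a
# power-law transform should straighten out the relationship.
rng = np.random.default_rng(1)
feature = rng.lognormal(size=500)
labels = np.log(feature) + rng.normal(scale=0.1, size=500)

r_before = abs(stats.pearsonr(feature, labels)[0])
r_after = abs(stats.pearsonr(stats.boxcox(feature)[0], labels)[0])

# The automated rule: apply the transform only if correlation improves.
use_transform = r_after > r_before
```

Here the Box-Cox fit lands near λ=0, approximating the log that generated the labels, so the post-transform correlation comes out stronger and the rule selects the transform.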

Feature engineering hypothesis: when in doubt about which of two approaches to use to process a stream of data, include both and let the backpropagation select which one to pay attention to.

This hypothesis I think can be useful as long as we don’t overdo our transformations, as obviously as our different iterations accumulate, the cost of training will grow right along with them. The intent for automunge is to develop some heuristics along the way to keep this redundancy from getting out of hand. However, for the case of the Box-Cox transformation, I am going to move forward with the assumption that this transform is universal enough that it is worth the simple redundancy of address; after all, we are tailoring the transformation to the data based on the derivation of the lambda parameter. A future iteration of the tool may re-evaluate this approach.

From the collection of the College Football Hall of Fame

(This paragraph can be skipped if you’re not interested in the code:)

Ok, from a software standpoint, let’s quickly talk our way through this: When the evalcategory function looks at a set, it will first test for numeric. If it finds numeric, it will then test for all positive; if it finds all positive it will return a category of ‘bxcx’, else it will return a category of ‘nmbr’. Note that we need to confirm that this test for all positives will work with missing values prior to infill. I think we will need a number dictionary comparable to our text dictionary and date dictionary to keep track of the range of columns associated with the set. We’ll call it ‘nmbr_dict’ and we’ll use it for both the nmbr and bxcx categories; even though we don’t need it for nmbr, since it has the original column name, it will still be good to build the architecture for consistency (update: going to table this architectural cleanup of creating a dictionary for the nmbr and bnry categories for a future iteration). We’ll create a processing function specific to bxcx that first infills the mean in missing values and then applies the SciPy stats boxcox transform, adding a transformed column alongside the original column and also returning the lambda parameter associated with the transform. We’ll also call the process_numerical_class function here within to process the original column. The arguments to the process_bxcx function will include a parameter lmbda=None, so that we can then reuse the same function for processing the test set simply by passing the lmbda returned from the train set; it will also include parameters for the mean and std from the prior transform for the same purpose (thus we’ll be able to use the same function for processing the train set and post-processing test sets). We’ll perform all of this prior to application of ML infill, with the hope that ML infill will do a better job of assigning infill after the transformation.
Note that we’ll need to update our ML infill methods in automunge and postmunge to handle the additional column by use of the nmbr_dict and nmbrcolumns, in a fashion comparable to how we addressed the additional columns in the text and date classes. For the postprocess_dict we’ll create a new column_dict entry which will correspond to the bxcx class and include the lambda parameter in addition to the values used to normalize the associated nmbr column. We’ll also add an argument to the master automunge function defaulted as ‘powertransform = True’, which allows the user the option to turn off this power law transform if not desired for their data.

Here is the code for the processing function. There’s also a companion Colaboratory notebook available here. The essay concludes below:
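Since the embedded gist may not reproduce in this format, here is a simplified sketch in the spirit of the walkthrough above. This is hypothetical and pared down, not the published automunge function: mean-infill missing cells, apply Box-Cox with an optional pre-fit lambda, and return the lambda for reuse on test data and in postmunge.

```python
import numpy as np
import pandas as pd
from scipy import stats

def process_bxcx(df: pd.DataFrame, column: str, lmbda=None):
    """Add a Box-Cox transformed copy of `column`; return (df, lambda).

    Pass lmbda=None for a train set (lambda is fit by max log-likelihood);
    pass the stored train lambda when processing a test set.
    """
    series = df[column].astype(float)
    series = series.fillna(series.mean())  # mean infill prior to transform
    if lmbda is None:
        # Train path: fit lambda and transform in one call.
        transformed, lmbda = stats.boxcox(series.to_numpy())
    else:
        # Test / postmunge path: reuse the lambda fit on the train set.
        transformed = stats.boxcox(series.to_numpy(), lmbda=lmbda)
    df[column + '_bxcx'] = transformed
    return df, lmbda
```

A train call would look like `df_train, lmbda = process_bxcx(df_train, 'price')`, with the returned lambda stored for the matching test-set call.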

Before closing I want to thank the authors of Feature Engineering for Machine Learning for another few points that were brought to my attention. I recall they noted a word of caution about treating numerical data which is actually a numerically encoded categorical set. For example we might have values of zip codes or phone numbers or serial numbers etc. that, although they register as numerical data, do not carry information in the traditional numerical sense. Our current version of automunge is dumb from this standpoint: it doesn’t recognize the difference between numerically encoded categorical values and true numerical sets. Another issue, again with numerical data, might be a case of what was originally a sparse set with lots of zeros, where after normalization in our original methods we end up with a dense set which will be less computationally efficient for training. I’ll have to put some thought into how to address these scenarios; I expect the solution will involve some tacky heuristics and we will just have to call out the assumptions in the product documentation. Until next time.

Jimmy Buffett — Changes in Attitudes

Books that were referenced here or otherwise inspired this post:

Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on linkedin for professional inquiries or twitter for personal.
