Tool Review: Lessons learned from using FeatureTools to simplify the process of Feature Engineering

(note: this post was updated on Nov 9th)

Feature Engineering is a crucial step in many machine learning projects, but can be difficult and time consuming if you aren’t already deeply familiar with the data and/or domain. So when I came across the FeatureTools framework, which promises to make Feature Engineering faster and easier, I was excited to try it out.

FeatureTools allows you to set up Entities and relationships in your data and can then automatically generate tens to hundreds of new features for you.

Concept behind FeatureTools

  • relies on Deep Feature Synthesis to generate features
  • defines a library of “primitives”: basic operations such as max, average, etc. that are typically applied to numerical data
  • There are two types of feature primitives:
      • Transformations: applied to one or more columns within a single table
      • Aggregations: applied across multiple tables to entities with a parent/child relationship, such as max sales per customer
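
To make the distinction concrete, here is a minimal sketch in plain pandas (not FeatureTools itself) of what each kind of primitive computes. The customers/orders tables and all column names are made up for illustration.

```python
import pandas as pd

# Hypothetical parent/child tables: one customer has many orders
customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "quantity": [2, 1, 5, 3],
    "unit_price": [10.0, 40.0, 4.0, 5.0],
})

# Transformation: combines columns within a single table, row by row
orders["sales"] = orders["quantity"] * orders["unit_price"]

# Aggregation: summarizes the child rows for each parent (max sales per customer)
max_sales = (
    orders.groupby("customer_id")["sales"]
    .max()
    .rename("MAX(orders.sales)")
    .reset_index()
)
customers = customers.merge(max_sales, on="customer_id", how="left")
print(customers)
```

The aggregation ends up as a new column on the parent table, which is why it needs a parent/child relationship to exist.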

The following diagram from FeatureLabs shows an example for aggregating orders across multiple tables/data sets

from https://www.featurelabs.com/blog/deep-feature-synthesis/

Sounds great! Let’s try it out

I jumped in and tried FeatureTools on the Ames Housing data set, which seemed ideal for Feature Engineering. However, I was getting some strange results, so I decided to back off and try it out on the much simpler Titanic data set. I still didn’t quite get the results I was expecting, but I did learn a lot about how to use the tool in the process.

As usual, the first step is to install and import the package

#pip install featuretools
import featuretools as ft
import featuretools.variable_types as vtypes

I did some very basic cleanup of the data frame. After some experimentation, I realized that I needed to treat Pclass and Embarked as ordinal features rather than LabelBinarizing them (at least to start with). I won’t go through all the pre-processing steps here, but the full notebook is available on github. This is the resulting cleaned up data frame:

Before doing anything with FeatureTools, I ran this data through a basic DecisionTree and LogisticRegression which scored an accuracy of 0.812 and 0.798 respectively on the test set

Setup the EntitySet

FeatureTools requires you to set up an overall EntitySet and then add Entities to it. Entities can be thought of as tables in a relational database (e.g. Product, Sales and Customer) or separate data frames in Python that you are linking together.

For the Titanic dataset, I named the EntitySet “Survivors” and added my X_train dataframe as the Passengers entity (using the entity_from_dataframe() function)

# creating an entity set 'es'
es = ft.EntitySet(id = 'Survivors')
# adding a dataframe 
es.entity_from_dataframe(entity_id = 'Passengers', dataframe = X_train, index = 'PassengerId')

This created our base Entityset

We can use the variables attribute to check what variables FeatureTools has learned and can see that it has identified that all our columns are numeric, when really Sex, Pclass and Embarked should be treated as categorical.

es["Passengers"].variables

Lesson 1 — it’s safer to be explicit about the variable types when creating the entity. So let’s try that again…

variable_types = { 'PassengerId': vtypes.Categorical,
'Sex': vtypes.Categorical,
'Pclass': vtypes.Categorical,
'Embarked': vtypes.Categorical}
es.entity_from_dataframe(entity_id = 'Passengers', dataframe = X_train, index = 'PassengerId', variable_types=variable_types)

Deep Feature Synthesis

Now let’s try generating some features! We call the Deep Feature Synthesis, dfs(), function, telling it what entity to target (Passengers is our only entity in this case) and how deep to go when stacking primitives (i.e. how many calculated features it layers together — this only seems to be relevant when you have multiple Entities defined)

feature_matrix, feature_names = ft.dfs(entityset=es, 
target_entity = 'Passengers',
max_depth = 3,
verbose = 3,
n_jobs = 1)

Hmm… when we check the output, nothing has been generated!

Lesson 2: in general, you need to define more than one entity for FeatureTools to produce new features. One exception is if your entity contains a date field; in that case, FeatureTools will automatically generate Year, Month, Day and DayOfWeek columns for you.
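
As a rough illustration of that date exception, the sketch below splits a single date into the same kind of parts using only the Python standard library. The date value and the dictionary keys are hypothetical (the Titanic data used here has no date column), and the names FeatureTools gives these features may differ.

```python
from datetime import date

# Hypothetical date value, purely for illustration
sailing_date = date(1912, 4, 10)

# The kind of parts a date column gets broken into
date_features = {
    "Year": sailing_date.year,
    "Month": sailing_date.month,
    "Day": sailing_date.day,
    "DayOfWeek": sailing_date.weekday(),  # Monday is 0, Sunday is 6
}
print(date_features)
```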

Let’s try adding a custom entity to capture the Passenger’s class and highlight to FeatureTools that this is a column of interest

es = es.normalize_entity(base_entity_id='Passengers', new_entity_id='Pclass', index='Pclass')

Run DFS again…

feature_matrix, feature_names = ft.dfs(entityset=es, 
target_entity = 'Passengers',
max_depth = 2,
verbose = 3,
n_jobs = 1)

Success! This time FeatureTools generated 17 new features for us, focusing on the interactions between Pclass and the other features.

Let’s take a closer look at a few of those features to better understand how they are being constructed:

  • Pclass.SUM(Passengers.family_count): for all passengers with Pclass=3, sum up the family_count. We can reproduce this with the formula
X_train[X_train['Pclass']==3]['family_count'].sum()
>> 386
  • Pclass.MEAN(Passengers.family_count): similarly, this is the average family count for all passengers with Pclass=3
X_train[X_train['Pclass']==3]['family_count'].mean()
>> 1.0293333333333334
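
The two checks above cover a single class. A minimal pandas sketch of the same idea for every class at once, joined back onto each passenger, might look like this. The toy data frame is made up and stands in for X_train; only the feature names are taken from the FeatureTools output above.

```python
import pandas as pd

# Toy stand-in for X_train; the values are invented
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [3, 1, 3, 2, 3],
    "family_count": [1, 0, 4, 2, 1],
})

# One row per class: the aggregates computed on the Pclass entity
per_class = passengers.groupby("Pclass")["family_count"].agg(["sum", "mean"])
per_class.columns = [
    "Pclass.SUM(Passengers.family_count)",
    "Pclass.MEAN(Passengers.family_count)",
]

# Join back so every passenger carries the aggregates for their own class
features = passengers.join(per_class, on="Pclass")
print(features)
```

Every passenger in the same class gets identical values for these columns, which is worth keeping in mind when judging how much new information they add.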

Now we can see why the variable type of the different features is important. FeatureTools generated different features for the different column types:

  • numeric (e.g. family_count): SUM, STD, MAX, SKEW, MIN, and MEAN
  • categorical (e.g. Embarked or Sex): NUM_UNIQUE and MODE

Lesson 3: as an aside, sometimes the output features may contain nulls (although it didn’t happen this time), so you may want to add a function to strip out any features that contain nulls
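
One simple way to do that is with pandas. The helper below is a sketch, not part of FeatureTools; its name and the toy feature matrix are made up for illustration.

```python
import pandas as pd


def drop_null_features(feature_matrix: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without any column that contains a missing value."""
    return feature_matrix.dropna(axis="columns", how="any")


# Toy feature matrix with one generated column that contains a null
fm = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Pclass.STD(Passengers.Fare)": [1.5, None, 2.0],
})
cleaned = drop_null_features(fm)
print(list(cleaned.columns))
```

If you drop columns this way on the training data, remember to drop the same columns from the test data so the two stay aligned.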

Applying the same changes to our Test Set

Before we can run a model using the new features, we need to apply the same transformation to the test data set. However, it’s not obvious how to do this: most of the examples I’ve seen were done before the train/test split or were for time series data (for which FeatureTools does have a compelling answer). So, at first, I wrapped a function around the steps to create the entities and run dfs so I could call it for both the Train and Test data frames, but then I found a better approach on StackOverflow from someone at FeatureLabs.

Lesson 4: Max Kanter’s advice is to “create an EntitySet using the test data and recalculate the same features by calling the ft.calculate_feature_matrix with the list of feature definitions from before”

To try this out, we first need to encode the categorical features in our Training set and save the result

feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_names, include_unknown=False)

This essentially LabelBinarizes our categorical Features
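
For readers who have not seen this kind of encoding, here is a small sketch using pandas get_dummies as a stand-in for what the encoding step produces. The toy data is made up, and the "Embarked = S" naming pattern is an assumption about how the encoded columns are labelled.

```python
import pandas as pd

# Toy feature matrix with one categorical column
fm = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22, 38, 26, 35],
})

# One indicator column per category value
encoded = pd.get_dummies(fm, columns=["Embarked"], prefix_sep=" = ")
print(list(encoded.columns))
```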

We’ll save the output of this as our X_train

X_train = feature_matrix_enc.copy()
X_train.shape
>> (668, 27)

We have to create a new entity set for our test dataframe and repeat the steps for adding the Passengers and PClass entities

# creating an entity set 'es_tst'
es_tst = ft.EntitySet(id = 'Survivors')
# adding a dataframe
es_tst.entity_from_dataframe(entity_id = 'Passengers', dataframe = X_test, index = 'PassengerId')
# add Pclass entity
es_tst = es_tst.normalize_entity(base_entity_id='Passengers', new_entity_id='Pclass', index='Pclass')

Now we can call calculate_feature_matrix() on our test entity set and pass in the list of saved features from training

feature_matrix_tst = ft.calculate_feature_matrix(features=features_enc, entityset=es_tst)

We can confirm that the output matches the training data frame and then save this back into X_test

And now our train and test sets both have 27 columns
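
A quick way to confirm that the two frames line up is to compare their column lists directly, since a matching column count alone does not guarantee matching names or order. The sketch below uses made-up stand-ins for the encoded train and test feature matrices.

```python
import pandas as pd

# Toy stand-ins for the encoded train and test feature matrices
train_fm = pd.DataFrame({"Age": [22, 38], "Embarked = S": [1, 0], "Embarked = C": [0, 1]})
test_fm = pd.DataFrame({"Embarked = C": [1], "Age": [30], "Embarked = S": [0]})

# Same set of columns, but is the order identical too?
same_set = set(train_fm.columns) == set(test_fm.columns)
same_order = list(train_fm.columns) == list(test_fm.columns)
print(same_set, same_order)

# Put the test columns in the training order before scoring a model
test_fm = test_fm[train_fm.columns]
```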

Let’s test some models

Now that we have generated a bunch of new features, let’s try feeding them into our classification models. This function runs a simple DecisionTree classifier and LogisticRegression using the hyper-parameters that GridSearchCV selected for our original data frame.

Hmm… this resulted in accuracy scores of 0.587 and 0.601, which are significantly worse than what we achieved with the original data frame before we added the FeatureTools columns (0.8117 for the DecisionTree). Even worse, our confusion matrix shows that the decision tree is basically just predicting that everyone dies.

This doesn’t make any sense! How could adding some additional features completely destroy the predictive power of the model? I can see that the new features might not prove to be overly useful for such a simple dataset, but it shouldn’t have this effect! I tried a few methods of Feature Selection to prune back the number of features and weed out highly correlated features, but this didn’t make much difference.

Back to Basics — could our data frame have become reordered?

After a bit of puzzling and head scratching, it eventually occurred to me that perhaps the X_train/X_test data frames got shuffled during the feature generation steps. To test this out, let’s compare the first few rows of our new X_train to our original X_train before we added the new features:

Aha! The ages do not match up! Our data frame has been reordered and so no longer matches up with the y_train/y_test data frames, which explains why our model is performing so poorly!

Lesson 5: If you do a train/test split before using FeatureTools, you need to take some extra steps to ensure that your data frame is not reordered!!
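
The sketch below shows, on made-up data, how such a reordering can be detected and one possible way to undo it. It assumes the feature matrix keeps the same index labels as the original data frame, and it is an alternative to the approach I actually took in the steps that follow.

```python
import pandas as pd

# Toy training data in shuffled order, as left by a train/test split
X_orig = pd.DataFrame(
    {"Age": [35.0, 22.0, 54.0]},
    index=pd.Index([892, 15, 407], name="PassengerId"),
)
y = pd.Series([1, 0, 0], index=X_orig.index)

# Stand-in for a feature matrix that came back sorted by its index
feature_matrix = X_orig.sort_index()

# A positional comparison exposes the mismatch
reordered = not (feature_matrix["Age"].to_numpy() == X_orig["Age"].to_numpy()).all()
print(reordered)

# Realign the rows to the original order using the shared index labels
realigned = feature_matrix.loc[X_orig.index]
```

After realigning, the rows are back in the same order as the labels, so the features and targets can be passed to a model together again.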

Repeat the FeatureTools generation steps while preserving the order of our data frame

  • our original X_train and X_test data frames are shuffled, so let’s reset the index to an ascending sequence
X_train_orig.reset_index(drop=True, inplace=True)
X_test_orig.reset_index(drop=True, inplace=True)
  • when creating the FeatureTools EntitySet, we’ll preserve the PassengerId column rather than using it as our index
# creating an entity set 'es'
es = ft.EntitySet(id = 'Survivors')
variable_types = {
'Sex': vtypes.Categorical,
'Pclass': vtypes.Categorical,
'Embarked': vtypes.Categorical}
es.entity_from_dataframe(entity_id = 'Passengers', dataframe = X_train_orig, index = 'Id', variable_types=variable_types)
es = es.normalize_entity(base_entity_id='Passengers', new_entity_id='Pclass', index='Pclass')
  • When we call Deep Feature Synthesis (dfs), tell it to ignore the PassengerId column (we don’t want to generate any features for it)
feature_matrix, feature_names = ft.dfs(entityset=es, 
target_entity = 'Passengers',
max_depth = 2,
verbose = 3,
n_jobs = 1,
ignore_variables={'Passengers':['PassengerId']})
  • Repeat the step for encoding the features
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_names, include_unknown=False)
X_train = feature_matrix_enc.copy()
  • confirm that our X_train is still in the correct order
  • repeat the steps to generate features for X_test
# creating an entity set 'es_tst'
es_tst = ft.EntitySet(id = 'Survivors')
# adding a dataframe 
es_tst.entity_from_dataframe(entity_id = 'Passengers', dataframe = X_test_orig, index = 'Id')
# add Pclass entity
es_tst = es_tst.normalize_entity(base_entity_id='Passengers', new_entity_id='Pclass', index='Pclass')
feature_matrix_tst = ft.calculate_feature_matrix(features=features_enc, entityset=es_tst)
  • Now re-run the DecisionTree and Logistic Regression classifiers
  • That looks much better!! Our scores are back in the ballpark of our original model and the confusion matrix is much more balanced

Use Feature Selection to prune the features

Now that we’ve generated a number of new features, we probably need to go through a selection process to prune them back. Specifically, a number of the new features are likely to be highly correlated, so let’s identify and remove those features.

import numpy as np
# Threshold for removing correlated variables
threshold = 0.7
# Absolute value correlation matrix
corr_matrix = X_train.corr().abs()
# Keep only the upper triangle so each pair of features is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Select columns with correlations above threshold
collinear_features = [column for column in upper.columns if any(upper[column] > threshold)]
X_train_flt = X_train.drop(columns = collinear_features)
X_test_flt = X_test.drop(columns = collinear_features)
X_train_flt.shape, X_test_flt.shape

This trimmed our data frame down to the following columns

Let’s retry our classification:

This definitely looks better, although we’re now basically back to exactly where we were before we added the features. In fact, the DecisionTree is ignoring the new features, while LogisticRegression is giving a few of them a minor importance.

Concluding Thoughts

FeatureTools is undoubtedly a powerful tool, but isn’t a “magic bullet” that can be blindly applied to any data set. There is a learning curve involved in getting it to work properly and it may not bring much benefit to a simple data set, such as the Titanic data.

FeatureTools is targeted at more complex data sets where there are several entities with parent/child relationships (i.e. Customers who can have one to many Transactions which can include one to many Products) or time series data (i.e. Log files). For this kind of dataset, FeatureTools likely will prove to be a big time saver and a very useful tool for the tool-belt. I look forward to trying it out again on a more appropriate problem!

For reference, my full notebook with all my code is available on GitHub

Summary of Lessons Learned

  • Lesson 1: be explicit about the variable types when creating the entity and don’t pre-convert categorical variables to dummies
  • Lesson 2: in general, you need to define more than one entity for FeatureTools to produce new features. One exception is if your data includes a date/time column
  • Lesson 3: The output features may contain nulls that you will need to strip out before running a model
  • Lesson 4: If you train/test split before calling FeatureTools, you’ll need to take special steps to ensure that you have the same features generated for Train and Test. One way is to create a separate EntitySet using the test data and call calculate_feature_matrix() with the feature definitions from the training set
  • Lesson 5: If you do a train/test split before using FeatureTools, you need to take some extra steps to ensure that your data frame is not reordered!!

Useful resources for learning more about FeatureTools