Simple Automatic Feature Engineering — Using featuretools in Python for Classification

Preface

While I was thinking about this post I came across this post and subsequent repos, which helped solidify a lot of the concepts here. William Koehrsen, in particular, has some great posts on Medium, and I highly suggest you check out his other articles. He was also quite helpful in working through some of the problems I had when trying to use distributed computing with featuretools, so thank you for all your help, William.

As with most things, getting to the final product means processing raw material through different stages.

Preprocessing Takes Time

So the reality of data science, or any work that revolves around data, is that a majority of the time is spent on data preparation before any modelling occurs. To answer the question(s) you or a stakeholder has asked, you need your data in the right state before you can begin digging in. And, if you expect follow-up questions from the results, you should be able to answer them quickly rather than starting all over from scratch.

Consider if you're Uber and you want to give your drivers the best chance of making more money. You may want to look at your ride data and predict where trips occur in a city on weekends vs. weekdays, so your drivers know where to be without having to guess and burn money on gas.

By giving this information to the drivers they are happier because they make more money, and in turn so does Uber. However, to do this correctly, more and more features within the data are needed to give an accurate answer:

  • The area or neighbourhood of the city;
  • The types of vehicles used in those areas, and then;
  • The day, month, hour, and minute of each type of vehicle in each area over weekends and weekdays.

You can see how this starts to get complicated quickly, as the number of features you want to calculate from a simple question (where are trips occurring?) starts to grow exponentially.

It can be a long, arduous process creating all these features from scratch, but with a Python package called featuretools, a good chunk of time can be saved by automatically creating features for you. It's not a magic bullet, as you still need to put some thought into which features should be generated. However, if done correctly, you'll come out with more time for your project.

I'm going to be using some randomly generated data I prepared myself, which can be found at this link. I did this because I wanted to include several different data types (latlng, boolean, etc.) in the structure, to be able to explore some of the capabilities of the package.

Classification Goal

The goal of the project is to classify clients who make large orders, meaning an order of more than $8000 at one time, and to understand which features most help classify these clients.

E.g. do sales in a certain area have an effect on the clients who place large orders in those areas? Do certain promo_codes help with repeat orders and more orders? Maybe we need to log or standardize some of our columns so we can correctly draw an inference from them?

There are many, many things we could be doing to get more information out of our data, and that is what featuretools aims to help us with. It takes a lot of the heavy lifting out of creating additional columns, and lets you focus on the work rather than the grunt work.

Our Data

Consider a database with five different tables:

  • Clients;
  • Orders;
  • Products;
  • Areas, and;
  • Suppliers

Here are the field names, data types, descriptions, and an example for each. Oh, and the ASCII table generator I used for this can be found here.

# CLIENT Table
+--------------+--------+--------------------------+-----------+
| Name         | Type   | Desc.                    | Example.  |
+--------------+--------+--------------------------+-----------+
| client_id    | Int    | The client's id          | 44        |
| created_date | Date   | When created             | 12/4/2015 |
| on_rewards   | Bool   | Part of rewards program? | TRUE      |
| area_id      | Int    | Where they are located   | 1         |
| plan         | String | Type of plan             | Basic     |
+--------------+--------+--------------------------+-----------+
# ORDERS Table
+--------------------+--------+--------------------+------------+
| Name               | Type   | Desc.              | Example.   |
+--------------------+--------+--------------------+------------+
| order_id           | String | The order's id     | 4da2e6a5   |
| client_id          | Int    | The client's id    | 44         |
| supplier_id        | Int    | The supplier's id  | 2          |
| product_id         | Int    | The product's id   | 3          |
| cost_of_items      | Double | Base cost of order | 334.45     |
| shipping_costs     | Double | Shipping costs     | 23.2       |
| order_total_costs  | Double | Sum of costs       | 923.12     |
| shipping_date      | Date   | When order shipped | 23/05/2012 |
| order_arrived_late | Bool   | Late delivery?     | FALSE      |
| promo_code         | String | What promo code?   | PROMO      |
| rush_shipping      | Bool   | Was it rushed?     | TRUE       |
+--------------------+--------+--------------------+------------+
# AREAS Table
+------------+--------+----------------------+----------+
| Name       | Type   | Desc.                | Example. |
+------------+--------+----------------------+----------+
| area_id    | Int    | The area's id        | 3        |
| city       | String | Name of city         | New York |
| state      | String | Name of state        | NY       |
| population | Int    | Total num. of people | 452,234  |
| lat        | Geo    | Latitude             | 82.34    |
| lng        | Geo    | Longitude            | -79.321  |
+------------+--------+----------------------+----------+
# SUPPLIERS Table
+---------------------+--------+--------------------+----------+
| Name                | Type   | Desc.              | Example. |
+---------------------+--------+--------------------+----------+
| supplier_id         | Int    | The supplier's id  | 5        |
| type                | String | Type of goods      | Reseller |
| ships_international | Bool   | Ship overseas?     | TRUE     |
| preferred_supplier  | Bool   | Pref. supplier?    | FALSE    |
| on_time_reliability | Double | Late order percent | 0.069    |
+---------------------+--------+--------------------+----------+
# PRODUCTS Table
+-------------+--------+------------------+----------+
| Name        | Type   | Desc.            | Example. |
+-------------+--------+------------------+----------+
| product_id  | Int    | The product's id | 9        |
| part_number | String | Part number prod | DA7D471F |
+-------------+--------+------------------+----------+

Now let's get this into our notebook and make a master dictionary of our dataframes. I find this easier than having various dataframes which need to be referenced throughout the project, as there is only one object to reference.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import featuretools as ft

# Read every sheet of the workbook into a dictionary of dataframes
wb = pd.ExcelFile('./data.xlsx')
sheets = {}
for name in wb.sheet_names:
    sheets[name] = pd.read_excel(wb, name)

Let's do some basic manual feature engineering, as if we did not have featuretools available. Getting the mean and median of all orders by client_id, or the number of clients on each plan, is trivial and nothing new to even the novice data scientist.

In fact, it's all quite boring, which is why you probably don't want to write this code every time you need to do an analysis. This is the tedious part that keeps you from moving on to the more exciting part of data science: modelling.

# An example of how to do some basic feature engineering
# Shipping costs between different clients
sheets.get('ORDERS').groupby('client_id')['shipping_costs'].agg(
    ['sum', 'mean', 'max', 'min', 'std', 'median'])

# Clients and late shipping
sheets.get('ORDERS').groupby('client_id')['order_arrived_late'].agg(
    ['sum', 'mean', 'max', 'min', 'std', 'median']).head()

# Suppliers and rush shipping
sheets.get('ORDERS').groupby('supplier_id')['rush_shipping'].agg(
    ['sum', 'mean', 'max', 'min', 'std', 'median']).sort_values(
    'sum', ascending=False).head()

# Etc.

But what about all the features you did not think about? What about MAX(LOG(orders.shipping_costs)) - PERCENT_TRUE(clients.on_rewards), which is something I don't think I would ever have thought about or even considered as a feature when looking at specific clients, but could be useful to our model once we get to that stage.

We just don’t know until we try.

Also, I don’t even want to begin thinking about how to manually slice and dice all the latitude and longitude data with our datetime data, as I know it will be frustrating to no end.

Let’s Try featuretools

By using featuretools, an almost exhaustive list of different features can be created to test within the model. After relationships between each dataframe are established and data types defined, the package creates additional features to test.

It will become clearer once the example is displayed below.

Before we start engineering anything, here is the basic schema of our data presented above, as we need to understand how our five different tables connect to each other.

Our Schema

Now that we know the schema, we need to dive into the two main parts of featuretools. The first is entities, which you can think of as just a single table in a database. And, from the image above, we know what each of our entities (tables) is.

The second is entitysets: an entityset is a collection of entities (tables) and how they relate to each other. Again, because we have our schema laid out, we're able to easily make relationships within our entityset.

Pre Pre-processing

Before we start creating features, we want to reduce the size of our dataframe by modifying our data types to those that will take up less space in memory. This idea was pulled from this notebook, and modified for this article.

We'll be converting int64 to int32, float64 to float32, and any objects that could be categorical into category variables, so featuretools runs as efficiently as possible. When preparing for this article, one of the dataframes had over 1M rows and its feature set took over 36 hours to compute. So, this is a necessary step for saving time.

I also kept the boolean data types the same, and did not convert them to integers as it helped with the feature creation later on.

def convert_columns(df):
  for col in df.columns:
    col_type = df[col].dtype
    col_items = df[col].count()
    col_unique_items = df[col].nunique()
    # Only convert objects that repeat values, i.e. true categories
    if (col_type == 'object') and (col_unique_items < col_items):
      df[col] = df[col].astype('category')
    if (col_type == 'int64'):
      df[col] = df[col].astype('int32')
    if (col_type == 'float64'):
      df[col] = df[col].astype('float32')
  return df

# Convert the columns of each dataframe
for _, i in sheets.items():
  convert_columns(i)

I initially joined all the dataframes together into one massive dataframe to save on processing time. However, I found using the built-in relationship function between each of the entities to be the best approach. It just made for better features and less processing of the data.

Just as a quick note, because the Areas table has both lat and lng values, featuretools requires these to be combined into a tuple.

# Create a tuple for lat and lng
areas = sheets.get('AREAS')
areas['latlng'] = areas[['lat', 'lng']].apply(tuple, axis=1)
areas = areas.drop(columns=['lat', 'lng'])
sheets['AREAS'] = areas
print(areas['latlng'].sample(1))
>>
(40.71426773071289, -74.00597381591797)

I've also encoded the categorical variables to ints so they can be fed into our classifiers later on. Here is an example of how to do it on the city feature in the AREAS dataframe.

from sklearn.preprocessing import LabelEncoder

# Quick encoding for cities
le = LabelEncoder()
le.fit(sheets.get('AREAS')['city'])
sheets.get('AREAS')['city'] = le.transform(sheets.get('AREAS')['city'])

I’ve also done the same for promo_codes, plans, state, type and part_number.
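If you want to encode them all in one pass, a small loop along these lines works (a sketch; the sheet/column pairs are taken from the schema above, so adjust them if your workbook differs):

# Encode the remaining string categories listed above
# (sheet, column) pairs taken from the schema
to_encode = [('ORDERS', 'promo_code'), ('CLIENTS', 'plan'),
             ('AREAS', 'state'), ('SUPPLIERS', 'type'),
             ('PRODUCTS', 'part_number')]
for sheet, col in to_encode:
    le = LabelEncoder()
    sheets.get(sheet)[col] = le.fit_transform(sheets.get(sheet)[col])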

Starting to Put It Together: Entity Sets and Entities

As quickly mentioned above, we need to create an entity set and fill it with a collection of entities.

import featuretools as ft

# Create our first entityset
entity_set = ft.EntitySet(id='clients')
...
# Create client entity
entity_set = entity_set.entity_from_dataframe(
    entity_id='clients',
    dataframe=sheets.get('CLIENTS'),
    index='client_id',
    variable_types={
        'plan': ft.variable_types.Categorical
    }
)
...
# Create supplier entity
entity_set = entity_set.entity_from_dataframe(
    entity_id='suppliers',
    dataframe=sheets.get('SUPPLIERS'),
    index='supplier_id',
    variable_types={
        'type': ft.variable_types.Categorical
    }
)

To make sure the entities read in the correct data types, you can specify them directly when initializing the entity. As well, featuretools keeps a list of supported data types in ft.variable_types.ALL_VARIABLE_TYPES.

# Check entity_set to see what was added
entity_set
>>
Entityset: clients
Entities:
clients [Rows: 1000, Columns: 5]
orders [Rows: 21000, Columns: 11]
areas [Rows: 12, Columns: 5]
products [Rows: 500, Columns: 2]
suppliers [Rows: 20, Columns: 5]
Relationships:
# We don't have any relationships yet

As stated above, we need to let featuretools know the links between our data by creating relationships. When creating relationships, always think of parent and child: the parent has exactly one row per unique id, while that id can appear many times in the child.

In this example, there can only be one client, but a client can appear in multiple orders. So, client is the parent of orders. Or, there can only be one area (Chicago, New York, etc.) but an area can appear on multiple clients, making area a parent of client.

If you run into a case where you have a parent (A) and then a child (B), and that child has a child (C) of its own, you run the risk of creating a diamond graph and featuretools will throw an error. The best way to avoid this is to have each parent only connect with its own child. E.g. areas connects with clients, and then clients connects with orders.

# Create the relationships
relationship_clients_orders = ft.Relationship(entity_set['clients']['client_id'], entity_set['orders']['client_id'])
relationship_suppliers_orders = ft.Relationship(entity_set['suppliers']['supplier_id'], entity_set['orders']['supplier_id'])
relationship_area_clients = ft.Relationship(entity_set['areas']['area_id'], entity_set['clients']['area_id'])
relationship_products_orders = ft.Relationship(entity_set['products']['product_id'], entity_set['orders']['product_id'])
# Add the relationships to the entity set
entity_set = entity_set.add_relationship(relationship_clients_orders)
entity_set = entity_set.add_relationship(relationship_suppliers_orders)
entity_set = entity_set.add_relationship(relationship_area_clients)
entity_set = entity_set.add_relationship(relationship_products_orders)
# Check entity_set
entity_set
>> 
Entityset: clients
Entities:
clients [Rows: 1000, Columns: 5]
orders [Rows: 21000, Columns: 11]
areas [Rows: 12, Columns: 5]
products [Rows: 500, Columns: 2]
suppliers [Rows: 20, Columns: 5]
Relationships:
orders.client_id -> clients.client_id
orders.supplier_id -> suppliers.supplier_id
clients.area_id -> areas.area_id
orders.product_id -> products.product_id

If we go back to our schema, we should see four connections which is what we have here as well.

Primitives and Custom Primitives

So what exactly is being calculated for our features? The calculations come from primitives, and featuretools provides two types of them: aggregations and transformations. There are many already built into the package, and we'll be adding our own for additional calculations.

For aggregations, the data is combined to get a certain value such as the sum, count, mean, etc.

For transformations, the data is not combined but instead changed based on a function. E.g. Getting the hours from a datetime feature or the percentile of that value in the column.
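As a rough pandas analogy (just for intuition; this is not how featuretools computes primitives internally, and it assumes shipping_date was parsed as a datetime when the workbook was read):

orders = sheets.get('ORDERS')
# Aggregation: collapses the child rows down to one value per client
mean_shipping = orders.groupby('client_id')['shipping_costs'].mean()
# Transformation: returns one value for every row, no collapsing
shipping_month = orders['shipping_date'].dt.month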

# List the primitives available in featuretools
ft.primitives.list_primitives()
>>
...
num  name          type         desc
17   percent_true  aggregation  Finds the percent of 'True' value ...
18   max           aggregation  Finds the maximum non-null values ...
19   or            transform    For two boolean values, determine ...
20   days_since    transform    For each value of the base feature ...

I'm going to set a semi-random set of primitives to look at when we're running featuretools. However, I've made sure to include count and percent_true, as we'll be using those later for checking how many large orders a client has had. If the count or the percent_true is > 0.00, then we know they are a client who has had large orders.

# Select the agg and trans primitives you want to look over
agg_primitives = [
    'std', 'min', 'max', 'mean',
    'percent_true', 'last', 'count',
    'trend', 'n_most_common', 'time_since_last',
    'avg_time_between'
]
trans_primitives = [
    'years', 'month', 'weekday', 'percentile',
    'isin', 'cum_mean', 'subtract', 'divide',
    'time_since_previous', 'latitude', 'longitude'
]
# I'll go over this below
where_primitives = ['std', 'min', 'max', 'mean', 'count']

But before we finalize this list, I want to add two more transform primitives: Log and Square_Root. This is just to see if these transformations have any effect on our predictive variable, and also to show how to create custom primitives for your own projects.

from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Numeric

# Create two new functions for our two new primitives
def Log(column):
    return np.log(column)

def Square_Root(column):
    return np.sqrt(column)

# Create the primitives
log_prim = make_trans_primitive(
    function=Log, input_types=[Numeric], return_type=Numeric)
square_root_prim = make_trans_primitive(
    function=Square_Root, input_types=[Numeric], return_type=Numeric)

Now we have two objects to add to our trans_primitives array.

trans_primitives.append(log_prim)
trans_primitives.append(square_root_prim)

Look For A Specific Event

From the short description above, we want to look for clients who have large orders. For a problem you're going to solve on your own, given some domain knowledge, you might want to flag certain thresholds or events as important by labelling them for featuretools. These are called seed features, and using them lets you encode domain-specific knowledge into your feature engineering.

In our example we want to account for free_shipping, clients who live in large cities (large_city), and if it is a particularly large order over $8000 (large_order).

free_shipping = ft.Feature(entity_set["orders"]["shipping_costs"]) <= 0.00
large_city = ft.Feature(entity_set["areas"]["population"]) > 1500000
large_order = ft.Feature(entity_set["orders"]["cost_of_items"]) > 8000
# Put features in an array
seed_features=[free_shipping, large_city, large_order]

So, by looking for a column that shows us the count of large_orders, we're starting to get a sense of what the dependent variable for our model is, and how we can use the generated features to classify records.

Interesting Values

Sometimes we want to look at more than a seed feature; we want to look at an interesting value: something we know from our domain knowledge of the problem which allows us to dig deeper into the feature.

These are, unsurprisingly, called interesting_values in featuretools. Let's take a look at how three cities (New York, Los Angeles, and Chicago) from our areas influence any of the data.

# You want to see JUST how New York influences everything
# We then specify the aggregation primitive to make where clauses for using where_primitives
entity_set["clients"]["city"].interesting_values = ['New York', 'Los Angeles', 'Chicago']

The difference here is that we're adding them directly to our entity_set, rather than specifying them later in the dfs call (explained below) as we do with the other parameters. However, you will need to include in where_primitives whatever you want to be able to query your results with.

You want to count something to do with New York? Make sure to include where_primitives=['count'] in dfs. Then, when the features are created, you'll be able to query a feature based on the interesting value.

E.g. N_MOST_COMMON(orders.promo_code WHERE city = New York) or COUNT(orders.free_shipping WHERE city = Chicago)
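For clarity, where_primitives just rides along with the other arguments when the features are generated; something like this sketch (the full call is assembled later in the post, and feature_matrix / feature_defs here are just placeholder names):

# Sketch only: pass where_primitives to dfs so the WHERE features get built
feature_matrix, feature_defs = ft.dfs(
    entityset=entity_set,
    target_entity='clients',
    agg_primitives=agg_primitives,
    where_primitives=where_primitives,
    max_depth=1,
    verbose=True)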

Now you’re able to ask questions from your data, and that can be quite powerful.

Look at This, Not at This

The last two things we want to do are make sure we're only considering certain features and time frames, and not others. If we included everything, the number of features might get too high to compute in a reasonable timeframe. As well, we might not need to consider certain dates within our entity_set. The more we can exclude, the faster our features will be generated.

# Don't build any features from the raw part numbers
ignore_variables = {"products": ["part_number"]}

Between Certain Times

Finally, there is a parameter which I’m not going to include, but wanted to highlight. If we wanted to evaluate clients who were created after a certain date — or any other cutoff times for that matter — we can use the dfs parameter called cutoff_times which allows us to subset the data while we’re creating the features.

E.g. if I wanted to only look at clients created during or after 1986, I would subset that section of data and include it as my cutoff_time parameter. Keep in mind you need the index and the date in a dataframe for featuretools to recognize it correctly.

# Create some cutoff times for the entity_set
cutoff_times = sheets.get('ORDERS')[['client_id', 'shipping_date']]
# Keep only the cutoff times after our chosen date
cutoff_times = cutoff_times[cutoff_times['shipping_date'] > '1986-01-01']
# Now include cutoff_times as the cutoff_time parameter in dfs

Putting It All Together

Let’s review all the objects created before determining our features. We created:

  • Our entity_set;
  • An agg_primitives array;
  • A trans_primitives array;
  • A seed_features array, and;
  • An ignore_variables dictionary.

That’s a lot, but it gives us the foundation to create more features as we get deeper into this project.

Let's run dfs, which stands for Deep Feature Synthesis, to calculate a feature_matrix (a dataframe of all the calculated features) and a list of all the features we have created. dfs does all the heavy lifting when performing our feature engineering.

We're targeting specific clients, so target_entity='clients', which tells dfs to use the unique indexes of the clients entity and create features around them. We could do the same for areas, products, etc.

With verbose=True as a parameter, you should see output showing the features and how long is left until they are all calculated. I've also added n_jobs=-1, which parallelizes the calculation across the workers on your machine to speed up the final result.

The max_depth has also been set to 1, just to show what sort of features are created without any stacking between each of the features.

features, feature_names = ft.dfs(
    entityset=entity_set,
    target_entity='clients',
    agg_primitives=agg_primitives,
    trans_primitives=trans_primitives,
    seed_features=seed_features,
    ignore_variables=ignore_variables,
    max_depth=1,
    n_jobs=-1,
    verbose=True)
>>
Built 80 features
Elapsed: 00:07 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks

So the final compute time for 80 features is around 7 seconds. Let's see what we created.

print(feature_names)
[<Feature: on_rewards>,
<Feature: area_id>,
<Feature: city>,
<Feature: population>,
<Feature: plan>,
<Feature: state>,
<Feature: population > 1500000>,
<Feature: STD(orders.supplier_id)>,
<Feature: STD(orders.product_id)>,
<Feature: STD(orders.cost_of_items)>,
...
<Feature: MAX(orders.on_time_reliability)>,
<Feature: MEAN(orders.supplier_id)>,
<Feature: MEAN(orders.product_id)>,
<Feature: MEAN(orders.cost_of_items)>,
<Feature: MEAN(orders.shipping_costs)>,
<Feature: MEAN(orders.order_total_costs)>,
<Feature: MEAN(orders.on_time_reliability)>,
<Feature: PERCENT_TRUE(orders.order_arrived_late)>,
<Feature: PERCENT_TRUE(orders.rush_shipping)>,
<Feature: PERCENT_TRUE(orders.ships_international)>,
...
<Feature: N_MOST_COMMON(orders.promo_code)>,
<Feature: N_MOST_COMMON(orders.type)>,
<Feature: MONTH(created_date)>,
<Feature: WEEKDAY(created_date)>,
<Feature: PERCENTILE(area_id)>,
<Feature: PERCENTILE(population)>,
<Feature: client_id.isin(None)>,
...
<Feature: LOG(population)>,
<Feature: SQUARE_ROOT(area_id)>,
<Feature: SQUARE_ROOT(population)>,
<Feature: PERCENT_TRUE(orders.shipping_costs <= 0.0)>,
<Feature: PERCENT_TRUE(orders.cost_of_items > 8000.0)>,
<Feature: LAST(orders.shipping_costs <= 0.0)>,
<Feature: LAST(orders.cost_of_items > 8000.0)>,
<Feature: population > 1500000.isin(None)>]

How many features would we get if we went one more level deep? Simply change the max_depth parameter to 2 and see what is returned. Now the features will start to be stacked on each other, e.g. we'll have the log of one feature minus the max of another, and other combinations.

This is the beauty of the package and the payoff from all the previous work of creating and specifying seed_features and primitives: you get to see some genuinely interesting features created. For example, we now have a feature where the standard deviation of on_time_reliability is divided by the percent true of cost_of_items > 8000.

This might yield some help with predicting, and it might not. We don’t know until we test the model.

I would never have thought about creating <Feature: PERCENT_TRUE(orders.shipping_costs <= 0.0) / PERCENT_TRUE(orders.cost_of_items > 8000)>, and it just might be the feature needed for the model to correctly classify whether a client places large orders.
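For reference, the deeper run below is the same ft.dfs call as before with only max_depth changed; a sketch for completeness:

# Identical to the earlier call, but stacking features one level deeper
features, feature_names = ft.dfs(
    entityset=entity_set,
    target_entity='clients',
    agg_primitives=agg_primitives,
    trans_primitives=trans_primitives,
    seed_features=seed_features,
    ignore_variables=ignore_variables,
    max_depth=2,
    n_jobs=-1,
    verbose=True)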

Built 4135 features
Elapsed: 01:23 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks
>>
...
<Feature: MAX(orders.shipping_costs) - LAST(orders.order_total_costs)>,
<Feature: PERCENT_TRUE(orders.order_arrived_late) - STD(orders.cost_of_items)>,
<Feature: PERCENT_TRUE(orders.order_arrived_late) - MIN(orders.order_total_costs)>,
<Feature: PERCENT_TRUE(orders.order_arrived_late) - MIN(orders.cost_of_items)>,
<Feature: COUNT(orders) - MEAN(orders.order_total_costs)>,
<Feature: PERCENT_TRUE(orders.order_arrived_late) - LAST(orders.order_total_costs)>,
...
<Feature: STD(orders.on_time_reliability) / PERCENT_TRUE(orders.cost_of_items > 8000)>,
<Feature: LOG(PERCENT_TRUE(orders.shipping_costs <= 0.0))>,
<Feature: LOG(PERCENT_TRUE(orders.cost_of_items > 8000.0))>,
<Feature: SQUARE_ROOT(PERCENT_TRUE(orders.shipping_costs <= 0.0))>,
<Feature: SQUARE_ROOT(PERCENT_TRUE(orders.cost_of_items > 8000.0))>
...

And, what about our actual data? This is important to check, because if you specify any of your indexes incorrectly you might find duplicate values for the primary key in the target_entity.
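A quick sanity check is to confirm the feature matrix has exactly one row per client and a unique index; a minimal sketch using standard pandas attributes:

# One row per client, and no duplicated client_id in the index
print(features.shape)
print(features.index.is_unique)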

So far looking good

More Features, More Problems

Some of the features created might seem like ones we should obviously include, while others will seem completely unnecessary. But for the most part, we won't know what to keep or what to remove without first including the feature in our model.

But we have a problem: dimensionality.

As more features (dimensions) are added, models can become less effective and harder to compute. The problem is trivial here as we don't have that much data, but as datasets grow to 1 GB and beyond, computations slow down quickly and classification / prediction becomes difficult because there are too few records for the number of features.

So how many records are necessary per feature? Well, according to Caltech, roughly 10. So if we have 10 total features in our model, we should have at least 100 records. Given that max_depth=2 gives us 4135 features to work with, we should have 41,350 records for any sort of prediction to be drawn from this experiment. We only have 21,000, and given this is randomly generated data, I won't adhere too strictly to the recommended number of records.

However according to the sklearn docs, you’re going to need much much more data than that. And, here is why.

Assume you have 3 features (p = 3) and each feature is binary, where p_i ∈ {0, 1}. So if we wanted to cover all the possible combinations of these three features, we would need at least 8 (2*2*2) records to satisfy our constraints.

# All the samples to exhaust all combinations of the features
[0, 0, 0]; [1, 0, 0]; [0, 1, 0]; [0, 0, 1]; [1, 1, 0]; [1, 0, 1]; [0, 1, 1]; [1, 1, 1]

But what if we were not using binary variables but instead integer numbers, and our three features were now p_1 ∈ {1,2,3,4,5}, p_2 ∈ {1,2,3,4,5,6,7,8,9,10}, and p_3 ∈ {0, 1}. Now the minimum number of records we should be aiming to run is 100 (5 * 10 * 2). You can see how this can get out of hand quite quickly. And, with 4135 features we would need the number of records to be at least in the millions.
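The same arithmetic in code, just to make the growth concrete (a trivial sketch):

# Minimum number of records needed to cover every combination of values
cardinalities = [5, 10, 2]   # p_1, p_2, p_3 from the example above
print(np.prod(cardinalities))
>>
100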

How Long Is This Going to Take?

When I started getting into some of the harder computations, my MacBook slowed to a crawl, especially when using cutoff_time=cutoff_times in the dfs calculations. Including them with max_depth=2 increased the computation time from around 1 minute to 30 minutes. And, in one of my earlier runs with max_depth=3, where over 10,000 features were being calculated, it took over 36 hours.

Taking notes directly from Will Koehrsen’s notebook, here are the steps I took to optimize my featuretools selections.

  • Convert object data types to category (done above; functions included from his notebook);
  • Experiment with joining all the dataframes together, instead of linking them all through the entity_set relationship function;
  • Create and save partitions to disk, and;
  • Create an entity_set on each partition.

If you're interested in how to set up distributed computing locally on your computer, check out the notebook linked above. Not everything is guaranteed to improve performance, so try the different techniques on your own project to see what works and what does not.

Actual Machine Learning

Finally, we’re at the last step where we can get into the actual machine learning. We now have all the features we want to look at to find out if we can predict large_orders from a client, but also figure out which features actually matter to our classification.

The first model will be a vanilla random forest, followed by a logistic regression, with both then combined in a simple ensemble.

We need to clean up some of our columns, especially the features created from cost_of_items and order_total_costs, as those directly determine whether a client is a large orderer. It is also possible you will get NaN, inf, and -inf from some of the calculations, so make sure those are removed.

I've also coded our y to be 0 or 1, based on whether they have any large orders or not. The column large_ordering_client will be the dependent variable for our classification model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create binary variable for large_ordering_client
features['large_ordering_client'] = np.where(
    features['PERCENT_TRUE(orders.cost_of_items > 8000.0)'] > 0.00, 1, 0)

# Build X and y
X = features.copy()
X = X.reset_index()

# Drop the columns that leak the target, then any with inf and nan
X = X[X.columns[~X.columns.str.contains(
    'cost_of_items|order_total_costs|client_id')]]
X = X.replace([np.inf, -np.inf], np.nan)
X = X.dropna(axis=1, how='any')

# Get the dependent variable
y = X.pop('large_ordering_client').values

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X, y)
scores = cross_val_score(estimator=clf, X=X, y=y, cv=5,
                         scoring="roc_auc", verbose=True)
print("AUC {:.4f} +/- {:.4f}".format(scores.mean(), scores.std()))
>>
AUC 0.8817 +/- 0.1349

So an AUC of 0.88 seems good? Let's add some simple hyperparameter tuning. Nothing too extensive, as this is just for demonstration.

from sklearn.model_selection import cross_val_score

# Do some simple hyperparameter tuning over a small grid
n_estimators = [1, 3, 5, 10, 25, 50, 100, 250, 500]
max_depth = [1, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None]
scores_list = []
for est in n_estimators:
    for depth in max_depth:
        clf = RandomForestClassifier(
            n_estimators=est, max_depth=depth, n_jobs=-1)
        scores = cross_val_score(
            clf,
            X=X,
            y=y,
            scoring='roc_auc',
            cv=10,
            n_jobs=-1
        )
        scores_list.append([est, depth, scores.mean(), scores.std()])
top_scores_df = pd.DataFrame(scores_list, columns=[
    'n_estimators', 'max_depth', 'mean', 'std'])
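# Sketch: sort by mean AUC to surface the best combination
# (the post reports the result below; this just shows one way to get it)
print(top_scores_df.sort_values('mean', ascending=False).head(1))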
>>
# Best parameters are n_estimators=25, max_depth=1.0
# AUC = 0.903657

Great, our score has improved. So which features were the top five for predicting the outcome for large orders?
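The extraction step isn't shown here, but a sketch like the following would produce a table of this shape (best_clf and importance_df are names I'm introducing for illustration; it refits a forest with the best parameters found above and reads feature_importances_):

# Refit a forest with the best parameters, then rank features by importance
best_clf = RandomForestClassifier(n_estimators=25, max_depth=1, n_jobs=-1)
best_clf.fit(X, y)
importance_df = pd.DataFrame({
    'name': X.columns,
    'score': best_clf.feature_importances_
}).sort_values('score', ascending=False).reset_index(drop=True)
print(importance_df.head(6))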

index  name                                                score
0      MIN(orders.shipping_costs) - COUNT(orders)          0.16
1      PERCENT_TRUE(orders.shipping_costs <= 0.0) - C...   0.12
2      PERCENTILE(MAX(orders.shipping_costs))              0.08
3      COUNT(orders) - PERCENT_TRUE(orders.order_arri...   0.08
4      MEAN(orders.shipping_costs) - COUNT(orders)         0.08
5      MAX(orders.SQUARE_ROOT(shipping_costs))             0.04

Unsurprisingly COUNT(orders) made its way into the results, as those clients with more orders probably order larger items on a more frequent basis. Also, shipping_costs made it into the top 5, as those clients who have large orders probably incur larger shipping costs when getting their items.

But what is really interesting is the interplay between the aggregation and transformation primitives, and the features created from each. E.g. the MEAN(orders.shipping_costs) - COUNT(orders) feature is something I would never have thought of creating, but it turns out it was a good idea to include in our model.

Quick Voting and Ensemble

So we did a little better with some basic tuning. Let's see how it does when added to a simple voting classifier alongside a vanilla logistic regression. For simplicity, I've kept the same hyperparameters on the random forest classifier.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1075)
lr = LogisticRegression(random_state=1075)
eclf1 = VotingClassifier(estimators=[('lr', lr), ('rf', clf)],
                         voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
print("Score of voting classifier: {:.4f}".format(eclf1.score(X_test, y_test)))
>>
Score of voting classifier: 0.9067

So our ensemble model is just slightly better (0.9067 > 0.9036) than before, and we could probably improve that further if we tuned the logistic regression. But that will be for a different article.

So that was a lot of work to get to a simple answer; however, many topics were covered, and I think the article is better for it. There are no free lunches in data science, and even the simplest of answers involve many different layers and approaches.

Predicting Sales

Additionally, we could try to predict if specific clients are going to have large orders in the future, not just classify them as we did in this example.

There is a great notebook that goes over just that, but it is on how to predict banana sales. If you’re interested in learning more about prediction, you should check out the notebook.

Final Thought

When I first came across featuretools, I thought it would solve many of my feature engineering problems; however, using it came with its own set of problems. I enjoyed the package, and it helped me build a better conceptual understanding of much of my pre-processing, but it was not the magic bullet I initially thought it was.

There are some syntax considerations to take into account, and tooling knowledge that slowed me down at the start. If you know how to use featuretools effectively and have domain knowledge on how to ask it the right questions, you’ll get a lot out of it.

I am excited to see how this package develops, as I can see the data science community getting behind it as it becomes more robust.

As always I hope you’ve learned something new.

Additional Reading