Encoding Categorical Variables in Practice

EPFL Extension School
Apr 27, 2018 · 8 min read

Context: Many machine learning models require categorical variables to be encoded with numerical values. For instance, one-hot encoding creates a binary variable (0/1) for each possible value: if weather is a categorical variable with two possible values, sunny and rainy, this encoding leads to two new variables, is_sunny and is_rainy.
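To make this concrete, here is a minimal sketch of what one-hot encoding looks like with Pandas (note that get_dummies() names the new columns weather_sunny and weather_rainy rather than is_sunny and is_rainy; the example DataFrame here is ours, not part of the data used below):

import pandas as pd

# One-hot encode a toy "weather" column
example = pd.DataFrame({'weather': ['sunny', 'rainy', 'sunny']})
pd.get_dummies(example, columns=['weather'])
# -> two binary columns: weather_rainy and weather_sunny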

In a typical ML workflow, we fit, tune and evaluate a model using some training data and then deploy the model and make predictions for new data points. In this setup, encoding categorical variables can be tricky because the output of this one-hot encoding step will depend on the set of values present in those future data points, which might lead to compatibility issues between our model and the new data.

In this notebook, we will illustrate the issue with a toy example and see different ways to solve it. We will then go through a concrete example with the Ames house prices data set published by Dean De Cock from Truman State University.

# Import the necessary libraries
import pandas as pd
print('Pandas version:', pd.__version__)
import numpy as np
print('Numpy version:', np.__version__)
import sklearn
print('Sklearn version:', sklearn.__version__)
#Output
Pandas version: 0.22.0
Numpy version: 1.14.1
Sklearn version: 0.19.1

Toy example

Scenario: predict the number of customers of our new ice cream truck business based on the weather situation.

In this example, we collected a small “training set” with 6 data points and two features. The idea is to use this data to fit our ML model and use it to predict the number of customers for future data points.

# Define our "training set"
train_features = pd.DataFrame.from_items([
    # 1st feature: weather
    ('weather', ['sunny', 'cloudy', 'rainy', 'rainy', 'sunny', 'sunny']),
    # → It's a categorical variable: will need one-hot encoding

    # 2nd feature: rainfall intensity
    ('rainfall_intensity', [0, 16, 34, 28, 4, 0]),
    # → We can keep the integer values for this one
])
train_features

# The associated target values
train_target = pd.DataFrame.from_items([
    ('number_of_customers', [32, 18, 12, 16, 28, 36]),
])
train_target

Create our ML model

Now, let's try to generalize from our (small) training set. We will use a linear regression model, which requires the weather values to be encoded numerically. To achieve this, we can use the get_dummies() function from the Pandas library, which applies one-hot encoding to the selected variables.

# Apply one-hot encoding to the "weather" variable
train_features_encoded = pd.get_dummies(
    train_features, columns=['weather'])
train_features_encoded
As you can see, it created a new variable for each weather value. We can now fit our model to the encoded data using Scikit-learn estimators.

from sklearn.linear_model import LinearRegression

# Create linear regression estimator
lr = LinearRegression()
# Fit the model
lr.fit(train_features_encoded, train_target)
print('Model fitted')
#Output
Model fitted

Make predictions

We now want to estimate the number of customers for the next five days based on the weather forecast. Let’s create a DataFrame with the new data points.

new_data = pd.DataFrame.from_items([
    # 1st feature: weather
    ('weather', ['sunny', 'rainy', 'windy', 'sunny', 'stormy']),

    # 2nd feature: rainfall intensity
    ('rainfall_intensity', [2, 42, 8, 0, 12]),
])
new_data

Note that the set of weather values is different from the one in our training set!

# Set of values from the training set
train_features.weather.unique()
#Output
array(['sunny', 'cloudy', 'rainy'], dtype=object)
# Values in the new data points
new_data.weather.unique()
#Output
array(['sunny', 'rainy', 'windy', 'stormy'], dtype=object)

Before making predictions, we need to encode the weather values.

# Encode new data
new_data_encoded = pd.get_dummies(new_data, columns=['weather'])
new_data_encoded

Issue: The new data points have a different encoding! There is no weather_cloudy feature, and there are two new variables: weather_stormy, weather_windy!
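To make the mismatch explicit, we can compare the two sets of encoded column names (the variable names below are ours; the same kind of check is used later on the Ames data):

train_cols = set(train_features_encoded.columns)
new_cols = set(new_data_encoded.columns)
print('Missing in new data:', sorted(train_cols - new_cols))
print('Extra in new data:', sorted(new_cols - train_cols))
#Output
Missing in new data: ['weather_cloudy']
Extra in new data: ['weather_stormy', 'weather_windy']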

For this reason, Scikit-learn will return an error if we try to make predictions using this DataFrame.

# Try to make predictions for the new data points
try:
    lr.predict(new_data_encoded)
except ValueError as e:
    print('Value error:', e)
#Output
Value error: shapes (5,5) and (4,1) not aligned: 5 (dim 1) != 4 (dim 0)

Solution: reindex the DataFrame with Pandas

To solve the issue, we need to:

  • Remove the features that are not present in our train set, and hence not used by our model: weather_stormy, weather_windy
  • Add missing columns and fill them with 0 entries: weather_cloudy

We can address both issues with the reindex() function from Pandas.

# Reindex DataFrame with columns from the train set
new_data_reindexed = new_data_encoded.reindex(
    columns=train_features_encoded.columns
)
new_data_reindexed

Observation: The extra columns were removed, and the missing ones were created but filled with NaN (not a number) values!

We can replace them with 0s using the fill_value argument.

new_data_reindexed = new_data_encoded.reindex(
    columns=train_features_encoded.columns,
    fill_value=0  # Fill with 0s
)
new_data_reindexed

All good! Our new data has the same format as our training DataFrame, and we can make predictions.

# Make predictions for new data
predictions = lr.predict(new_data_reindexed)
# Number of customers should be greater than zero
predictions = np.maximum(predictions, 0)
# Convert to integer values
predictions.astype(int)
#Output
array([[31],
[ 3],
[29],
[33],
[25]])

Alternative solution — categorical data type

Reindexing the DataFrame is a simple way to handle this issue, but there are others. For instance, it’s possible to tell Pandas that a column corresponds to a categorical variable and list the different values.

from pandas.api.types import CategoricalDtype

# Create a function to encode the data
def encode_data(df):
    # Work on a copy
    df = df.copy()

    # Declare the categorical variable and its possible values
    df['weather'] = df['weather'].astype(
        # Pandas categorical data type
        CategoricalDtype(categories=[
            # List possible values
            'cloudy', 'rainy', 'sunny', 'stormy'
        ])
    )

    # Encode categorical variables
    return pd.get_dummies(df)

# Encode train data
encode_data(train_features)

Note that we no longer have to tell Pandas which columns are categorical with the columns argument of the get_dummies() function: since the weather column now has a categorical data type, it is encoded automatically.

Observation: Pandas created a stormy column even though there are no entries with this value in our training set! This is because stormy is listed in the possible values.

# Encode new data points
encode_data(new_data)

This time, in addition to adding a cloudy column, Pandas didn't create a windy one, since this value wasn't specified in the list of possible values: the windy entry simply gets 0s in all the weather columns.
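Yet another option, assuming a newer Scikit-learn than the 0.19.1 printed above (0.20 or later handles string categories), is the OneHotEncoder transformer with handle_unknown='ignore': it learns the categories from the training data and encodes unseen values as all zeros. A minimal sketch on the toy weather column only (the numerical rainfall_intensity column would still need to be concatenated separately, e.g. with a ColumnTransformer):

from sklearn.preprocessing import OneHotEncoder

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_features[['weather']])

# Categories unseen at fit time ('windy', 'stormy') become all-zero rows
encoder.transform(new_data[['weather']]).toarray()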

Concrete case: predict house prices

Scenario: In this example, we will create a linear regression model to estimate the sale price of houses in the city of Ames, Iowa.

First, let’s start by downloading the data.

url = 'http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls'

# Load the file into a Pandas DataFrame
data_df = pd.read_excel(url)
print('DataFrame size:', data_df.shape)
#Output
DataFrame size: (2930, 82)

As we can see, the data has 82 variables and 2,930 entries. Let's print the first five rows.

data_df.head()
#Output (truncated)
5 rows × 82 columns

In this example, we will try to predict the sale price using only the neighborhood (Neighborhood) and the above-ground living area (Gr Liv Area) variables. You can take a look at the data documentation to learn more about the different variables and their possible values.

# Display the first ten rows
features = ['Neighborhood', 'Gr Liv Area']
target = ['SalePrice']
data_df[features + target].head(10)

We will use the first 2,000 data points to fit our ML model and make predictions for the remaining houses.

train_features = data_df.iloc[:2000][features]
train_target = data_df.iloc[:2000][target]
print('Input size:', train_features.shape)
#Output
Input size: (2000, 2)

First, we need to preprocess the Neighborhood variable with one-hot encoding.

# One-hot encoding
train_features_encoded = pd.get_dummies(
    train_features, columns=['Neighborhood'])
print('Size when OH encoded:', train_features_encoded.shape)
#Output
Size when OH encoded: (2000, 27)

Observation: The Neighborhood categorical variable was transformed into 26 binary variables, one for each neighborhood present in our 2,000 training data points.

We can now fit a linear regression model.

# Create linear regression estimator
lr = LinearRegression()
# Fit the model
lr.fit(train_features_encoded, train_target)
print('Model fitted')
#Output
Model fitted

Let’s make predictions for the remaining data points.

# Select remaining data points
new_data = data_df.iloc[2000:][features]
# One-hot encoding
new_data_encoded = pd.get_dummies(
    new_data, columns=['Neighborhood'])
print('New data after encoding:', new_data_encoded.shape)
#Output
New data after encoding: (930, 29)

Issue: The sizes don’t match! The train DataFrame has 27 columns, and this one has 29!

Let’s compare the sets of features in both DataFrames.

train_feature_names = set(train_features_encoded.columns)
new_data_feature_names = set(new_data_encoded.columns)
# New features
new_data_feature_names - train_feature_names
#Output
{'Neighborhood_GrnHill', 'Neighborhood_Landmrk'}

Observation: There are two new neighborhoods! Green Hills (GrnHill) and Landmark (Landmrk).

# Missing features from the train set
train_feature_names - new_data_feature_names
#Output
set()

So, the new data points have all the possible neighborhood values from the train set and two additional ones. Let’s fix the issue by reindexing the DataFrame.

new_data_reindexed = new_data_encoded.reindex(
    columns=train_features_encoded.columns,
    fill_value=0  # Fill with 0s
)
print('DataFrame shape:', new_data_reindexed.shape)
#Output
DataFrame shape: (930, 27)

The DataFrames now have the same format.

# Features from the train set
train_features_encoded.columns
#Output
Index(['Gr Liv Area', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste',
'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_IDOTRR',
'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU',
'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker'],
dtype='object')
# Same order, same features!
new_data_reindexed.columns
#Output
Index(['Gr Liv Area', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste',
'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_IDOTRR',
'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU',
'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker'],
dtype='object')

We can now make predictions using our model.

# Make predictions using our linear regression model
predictions = lr.predict(new_data_reindexed)

In this scenario, we can evaluate the accuracy of our predictions because we have access to the true/observed house prices. For instance, we could estimate how far our predictions are from the actual prices with the Mean Absolute Error (MAE) metric.

true_prices = data_df[2000:][target].values
mae = np.abs(true_prices - predictions).mean()
print('Mean absolute error: {:,.0f} dollars'.format(mae))
#Output
Mean absolute error: 27,039 dollars
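As a quick sanity check, Scikit-learn's built-in metric should return the same value as the manual computation above:

from sklearn.metrics import mean_absolute_error

# Same metric, computed with Scikit-learn
print('MAE: {:,.0f} dollars'.format(mean_absolute_error(true_prices, predictions)))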

As you can see, our predictions are, on average, around 27 thousand dollars away from the observed prices. This is not bad compared to a median baseline, i.e. always predicting the median price of the training set.

# Compute the median price in our train set
median_price = np.median(data_df[:2000][target])
print('Median price: {:,.0f} dollars'.format(median_price))
mae_baseline = np.abs(true_prices - median_price).mean()
print('Median baseline: {:,.0f} dollars'.format(mae_baseline))
#Output
Median price: 162,500 dollars
Median baseline: 54,773 dollars

Summary: In this notebook, we saw how to handle the encoding of categorical variables when making predictions for new data points. Feel free to add more variables (e.g. Overall Qual and House Style) and extend the analysis with the appropriate preprocessing steps (handling missing values, outliers, feature engineering); a minimal starting point is sketched below.
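Here is one possible starting point for that extension, shown only as a sketch: it adds Overall Qual (numerical) and House Style (categorical) and reuses the same get_dummies() + reindex() pattern, while deliberately skipping the missing-value and outlier handling mentioned above:

# Sketch: extend the feature set with the same encode-then-reindex pattern
features = ['Neighborhood', 'House Style', 'Gr Liv Area', 'Overall Qual']
categorical = ['Neighborhood', 'House Style']

train_X = pd.get_dummies(data_df.iloc[:2000][features], columns=categorical)
new_X = pd.get_dummies(data_df.iloc[2000:][features], columns=categorical)

# Align the new data with the training columns, filling missing ones with 0s
new_X = new_X.reindex(columns=train_X.columns, fill_value=0)

lr = LinearRegression().fit(train_X, data_df.iloc[:2000][target])
predictions = lr.predict(new_X)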

Author

Fred Ouwehand, frederic.ouwehand@epfl.ch — April 2018

About

Interested in more content like this? The EPFL Extension School is an online school teaching digital skills in data science, web application development, and other areas. EPFL is a world-leading institution in Switzerland, ranked the No. 1 young university in the Times Higher Education ranking. Learn more about the EPFL Extension School at http://exts.epfl.ch.
