Target Encoding in Feature Engineering

Abhinaba Banerjee · Published in Geek Culture · 5 min read · Dec 6, 2022

This article explains the concept of target encoding, its significance in feature engineering, and its implementation in code.

This is the last part of the Feature Engineering series I have been uploading for the last two weeks. In this final part, on target encoding, we deal with categorical features instead of numerical ones. Target encoding is a technique for encoding categories as numbers, like one-hot or label encoding, with the difference that it also uses the target to create the encoding. That is why it falls in the category of supervised feature engineering techniques.
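To make the idea concrete, here is a minimal sketch of the simplest form of target encoding, a plain mean encoding done with pandas; the toy DataFrame and column names are invented for illustration (the article itself uses the smoothed MEstimateEncoder shown later):

import pandas as pd

# Toy data: one categorical feature and a numeric target
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "SalePrice": [100, 120, 200, 210, 190, 300],
})

# Mean encoding: replace each category with the mean of the target within that category
means = toy.groupby("Neighborhood")["SalePrice"].mean()
toy["Neighborhood_encoded"] = toy["Neighborhood"].map(means)
print(toy)

Because the encoding is built from the target itself, it has to be handled carefully so that target information does not leak into training, which is the main theme of the rest of this article.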

Use Cases for Target Encoding

Target encoding is great for:

High-cardinality features: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives like a label encoding might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature’s most important property: its relationship with the target (see the sketch after this list).

Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. Target encoding can help reveal a feature’s true informativeness.
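As a rough illustration of the high-cardinality point above, compare how many columns each approach produces; this sketch assumes the same ames.csv file that is loaded later in the article:

import pandas as pd

df = pd.read_csv("ames.csv")

# One-hot encoding creates one new column per category of the feature...
print("One-hot columns for Neighborhood:", df["Neighborhood"].nunique())

# ...while a target encoding always produces a single numeric column,
# no matter how many categories the feature has.

With 28 neighborhoods in this dataset, one-hot encoding adds 28 sparse columns, whereas target encoding adds just one dense, informative column.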

Code Implementation of Target Encoding

We will explore the Ames housing prices dataset to understand the application of target encoding.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
from category_encoders import MEstimateEncoder
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
warnings.filterwarnings('ignore')


# Model scoring
def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


df = pd.read_csv("ames.csv")

The code above is the same setup we used in the previous part on PCA; model scoring is done with cross-validation, and RMSLE is again used as the metric. Let's choose which features target encoding can be applied to. Categorical features with a large number of categories (high cardinality) are often good candidates.

df.select_dtypes(["object"]).nunique()


MSSubClass 16
MSZoning 7
Street 2
Alley 3
LotShape 4
LandContour 4
Utilities 3
LotConfig 5
LandSlope 3
Neighborhood 28
Condition1 9
Condition2 8
BldgType 5
HouseStyle 8
OverallQual 10
OverallCond 9
RoofStyle 6
RoofMatl 8
Exterior1st 16
Exterior2nd 17
MasVnrType 5
ExterQual 4
ExterCond 5
Foundation 6
BsmtQual 6
BsmtCond 6
BsmtExposure 5
BsmtFinType1 7
BsmtFinType2 7
Heating 6
HeatingQC 5
CentralAir 2
Electrical 6
KitchenQual 5
Functional 8
FireplaceQu 6
GarageType 7
GarageFinish 4
GarageQual 6
GarageCond 6
PavedDrive 3
PoolQC 5
Fence 5
MiscFeature 6
SaleType 10
SaleCondition 6
dtype: int64

Here the features with high cardinality are Neighborhood, MSSubClass, Exterior2nd, Exterior1st, and SaleType. To see how the observations are distributed across the categories of one of these features (say, SaleType), the code is:

df["SaleType"].value_counts()

WD 2536
New 239
COD 87
ConLD 26
CWD 12
ConLI 9
ConLw 8
Oth 7
Con 5
VWD 1
Name: SaleType, dtype: int64
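If you would rather pick out the high-cardinality candidates programmatically instead of scanning the printout by eye, here is a small sketch (the cutoff of 10 categories is an arbitrary choice):

# Object columns with at least 10 unique categories are candidates for target encoding
cardinality = df.select_dtypes(["object"]).nunique()
high_card_cols = cardinality[cardinality >= 10].index.tolist()
print(high_card_cols)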

Next, target encoding is applied to one of these features. To avoid overfitting, we need to fit the encoder on data held out from the training set.

# Encoding split
X_encode = df.sample(frac=0.20, random_state=0)
y_encode = X_encode.pop("SalePrice")

# Training split
X_pretrain = df.drop(X_encode.index)
y_train = X_pretrain.pop("SalePrice")

Now we apply target encoding to our chosen categorical feature, using a smoothing parameter of m=1.
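For context, the m-estimate encoding blends each category's mean with the overall target mean, so categories with few observations get pulled toward the global average. A minimal sketch of the idea, written as a hypothetical helper rather than the library's actual internals:

# m-estimate blending for a category seen n times in the encoding data:
#   encoding = (n * category_mean + m * overall_mean) / (n + m)
# Larger m pulls rare categories more strongly toward the overall mean.
def m_estimate(category_mean, n, overall_mean, m=1.0):
    return (n * category_mean + m * overall_mean) / (n + m)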

# Create the MEstimateEncoder
# Choose a set of features to encode and a value for m
encoder = MEstimateEncoder(
    cols=["Neighborhood"],
    m=1.0,
)


# Fit the encoder on the encoding split
encoder.fit(X_encode, y_encode)


# Encode the training split
X_train = encoder.transform(X_pretrain, y_train)
feature = encoder.cols

plt.figure(dpi=90)
ax = sns.distplot(y_train, kde=True, hist=False)
ax = sns.distplot(X_train[feature], color='r', ax=ax, hist=True, kde=False, norm_hist=True)
ax.set_xlabel("SalePrice");
Distribution plot of the encoded feature compared to the target (SalePrice) (image from Kaggle)

Now let's compare the RMSLE score of the encoded dataset with that of the baseline dataset.

X = df.copy()
y = X.pop("SalePrice")
score_base = score_dataset(X, y)
score_new = score_dataset(X_train, y_train)

print(f"Baseline Score: {score_base:.4f} RMSLE")
print(f"Score with Encoding: {score_new:.4f} RMSLE")


Baseline Score: 0.1428 RMSLE
Score with Encoding: 0.1402 RMSLE

Depending on which feature or features are chosen, the score may turn out worse than the baseline. In that case, it's likely that the extra information gained by the encoding couldn't make up for the data given up to fit the encoder.
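If you want to see this effect for yourself, one option is to loop over a few candidate column sets and compare their scores, reusing the encoding split and score_dataset from above; the candidate lists here are only suggestions:

# Compare a few candidate sets of columns to encode (illustrative choices)
for cols in [["Neighborhood"], ["MSSubClass"], ["Neighborhood", "SaleType"]]:
    enc = MEstimateEncoder(cols=cols, m=1.0)
    enc.fit(X_encode, y_encode)
    X_enc = enc.transform(X_pretrain, y_train)
    print(cols, f"{score_dataset(X_enc, y_train):.4f} RMSLE")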

Here we will explore the overfitting problem of target encodings. This illustrates the importance of fitting target encoders on data held out from the training set.

So let’s see what happens when we fit the encoder and the model on the same dataset. To emphasize how dramatic the overfitting can be, we’ll mean-encode a feature that should have no relationship with SalePrice.

# Try experimenting with the smoothing parameter m
# Try 0, 1, 5, 50
m = 5

X = df.copy()
y = X.pop('SalePrice')

# Create an uninformative feature
X["Count"] = range(len(X))
X["Count"][1] = 0 # actually need one duplicate value to circumvent error-checking in MEstimateEncoder

# fit and transform on the same dataset
encoder = MEstimateEncoder(cols="Count", m=m)
X = encoder.fit_transform(X, y)

# Results
score = score_dataset(X, y)
print(f"Score: {score:.4f} RMSLE")

Score: 0.0291 RMSLE

plt.figure(dpi=90)
ax = sns.distplot(y, kde=True, hist=False)
ax = sns.distplot(X["Count"], color='r', ax=ax, hist=True, kde=False, norm_hist=True)
ax.set_xlabel("SalePrice");
Distribution plot of Count in comparison to SalePrice (Image from Kaggle)

The RMSLE score looks dramatically better this time, and the encoded distribution lines up with the target far more closely than in the previous case. This is not a genuine improvement, though.

Since Count has (almost) no duplicate values, the mean-encoded Count is essentially an exact copy of the target. In other words, mean encoding turned a completely meaningless feature into a perfect feature.

Now, the only reason this worked is that we trained XGBoost on the same set we used to train the encoder. If we had used a hold-out set instead, none of this “fake” encoding would have transferred to the training data.
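A quick sanity check you could add (not part of the original notebook) makes the leakage obvious: the mean-encoded Count column is almost perfectly correlated with the target.

# With (nearly) unique Count values, the m-estimate encoding is just a shifted and
# scaled copy of SalePrice, so the correlation is close to 1.0
print(X["Count"].corr(y))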

So, in a nutshell, when using a target encoder it’s very important to use separate data sets for training the encoder and training the model. Otherwise, the results can be very poor!
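One practical way to follow this advice without permanently giving up training rows, a variation beyond what this article covers, is out-of-fold target encoding: each row is encoded by an encoder fitted on the other folds, so no row ever sees its own target. A rough sketch, reusing the X_pretrain/y_train split and encoder from earlier:

from sklearn.model_selection import KFold

# Out-of-fold target encoding: every row is encoded by an encoder that
# never saw that row's own target value
X_oof = X_pretrain.copy()
oof_values = pd.Series(index=X_oof.index, dtype=float)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(X_oof):
    enc = MEstimateEncoder(cols=["Neighborhood"], m=1.0)
    enc.fit(X_oof.iloc[fit_idx], y_train.iloc[fit_idx])
    encoded = enc.transform(X_oof.iloc[enc_idx])
    oof_values.iloc[enc_idx] = encoded["Neighborhood"].values
X_oof["Neighborhood"] = oof_values
print(f"Out-of-fold score: {score_dataset(X_oof, y_train):.4f} RMSLE")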

This is the end of the whole Feature Engineering course series from Kaggle. The target encoding part is here.

Do go through the notebook and experiment to understand feature engineering better; this will give you an edge when trying out recent real-life datasets.

Please check out my other articles and say hi. Also, check out my GitHub. You can buy me a few cups of coffee if you like my work, so that I can keep improving the quality of the content as I move forward in this writing journey.
