Advanced Categorical Feature Encoding Techniques: An Overview

Arash Nicoomanesh
12 min read · Nov 23, 2022


Image from Kirell Benzi, storytelling by art and data

The performance of machine learning models depends not only on the model, its hyper-parameters, and the optimization technique, but also on how we process and feed different types of features to the model. Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a necessary step: we need to convert them to numbers so the model can understand them and extract valuable information.

In the 1940s, Stanley Smith Stevens introduced four scales of measurement: nominal, ordinal, interval, and ratio. These are still widely used today as a way to describe the characteristics of a variable. Knowing the scale of measurement for a variable is an important aspect in choosing the right statistical analysis.

Types of data

Usually there are two kinds of categorical data:

  • Ordinal: the categories have an inherent order, e.g. socio-economic status (low income, middle income, high income), education level (high school, BS, MS, PhD), income level (less than 50K, 50K-100K, over 100K), or satisfaction rating (extremely dislike, dislike, neutral, like, extremely like).
  • Nominal: the categories have no inherent order, e.g. blood type, zip code, gender, race, or ethnicity. Binary data can be either nominal or ordinal.

How we select an encoding method depends on the algorithm(s) we apply:

  • Some algorithms can work with categorical data directly, e.g. CatBoost. A decision tree, for example, can be learned directly from categorical data with no transform required (this depends on the specific implementation and library we use).
  • Many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric.
  • Some implementations require all data to be numerical. For example, scikit-learn has this requirement.
  • Linear models are generally sensitive to the order of ordinal data, so we should select an appropriate encoding method, as illustrated in the sketch below.
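
As a minimal sketch of treating the two kinds differently (toy data, illustrative column names): an explicit category order for the ordinal feature, one-hot for the nominal one, so that a linear model does not see a spurious order.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({
    'education': ['high school', 'BS', 'MS', 'PhD', 'BS'],   # ordinal
    'blood_type': ['A', 'B', 'O', 'AB', 'O'],                # nominal
})

# Ordinal: keep the inherent order explicit
ord_enc = OrdinalEncoder(categories=[['high school', 'BS', 'MS', 'PhD']])
df['education_enc'] = ord_enc.fit_transform(df[['education']]).ravel()

# Nominal: one-hot, no artificial order
ohe = OneHotEncoder(handle_unknown='ignore')
blood_ohe = ohe.fit_transform(df[['blood_type']]).toarray()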

There are many methods for categorical encoding. In this article we will review some of them in Python (a quick sketch of a few, via the category_encoders package, follows the list), along with the built-in handling provided by CatBoost and LightGBM:

  • One Hot Encoding
  • Label Encoding
  • Ordinal Encoding
  • Helmert Encoding
  • Binary Encoding
  • Frequency Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder
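
Most of these are implemented in the category_encoders package. Here is a minimal sketch on a toy frame (column names and data are invented for illustration) showing a few of them:

import pandas as pd
import category_encoders as ce

toy = pd.DataFrame({'city': ['Berlin', 'Paris', 'Berlin', 'Rome', 'Paris']})
y = pd.Series([1, 0, 1, 0, 1])  # binary target for the supervised encoders

binary = ce.BinaryEncoder(cols=['city']).fit_transform(toy)              # ~log2(k) new columns
helmert = ce.HelmertEncoder(cols=['city']).fit_transform(toy)            # contrast coding
james_stein = ce.JamesSteinEncoder(cols=['city']).fit_transform(toy, y)  # shrunk target mean
woe = ce.WOEEncoder(cols=['city']).fit_transform(toy, y)                 # weight of evidence (binary target)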

To review these encoding techniques we will use a Kaggle competition dataset [1]. With this dataset, we predict a continuous target based on a number of feature columns. The feature columns cat0–cat9 are categorical, and cont0–cont13 are continuous.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

import category_encoders as ce
import catboost as cb
import lightgbm as lgb

train = pd.read_csv('../train.csv')
test = pd.read_csv('../test.csv')
del train['id']
del test['id']
train.shape, test.shape
Output:
((300000, 25), (200000, 24))

Let's do a quick EDA on the categorical features:

cats = [c for c in train.columns if train[c].dtypes=='object']
cats
Output:
['cat0',
'cat1',
'cat2',
'cat3',
'cat4',
'cat5',
'cat6',
'cat7',
'cat8',
'cat9']
def analyse_cats(df, cat_cols):
    """Summarize unique values, cardinality and missing counts per categorical column."""
    d = pd.DataFrame()
    cl, u, s, nans = [], [], [], []
    for c in cat_cols:
        cl.append(c)
        u.append(df[c].unique())
        s.append(df[c].unique().size)
        nans.append(df[c].isnull().sum())

    d["feat"] = cl
    d["uniques"] = u
    d["cardinality"] = s
    d["nans"] = nans
    return d


catanadf = analyse_cats(train, cats)
catanadf
Output: a summary table of the categorical features (uniques, cardinality, missing values)

Frequency Encoding

Frequency encoding uses the frequency of the categories as labels. When the frequency is somewhat related to the target variable, it helps the model assign weight in direct or inverse proportion, depending on the nature of the data. We replace each category with the count of observations showing that category in the dataset; similarly, we can replace the category with its frequency (or percentage) of observations.

Advantages of Count or Frequency encoding

  • Straightforward to implement.
  • Does not expand the feature space.
  • Can work well with tree-based algorithms.

Limitations of Count or Frequency encoding

  • Does not handle new categories in the test set automatically (see the sketch after this list).
  • We can lose valuable information if two different categories appear in the same number of observations, because we replace both with the same number.
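
As a minimal sketch of the first limitation (toy data; mapping unseen levels to 0 is a modeling choice, not a rule):

import pandas as pd

# Toy example: column 'c' has a category ("z") in test that never appears in train
tr = pd.DataFrame({'c': ['a', 'a', 'b', 'a', 'b']})
te = pd.DataFrame({'c': ['a', 'b', 'z']})

freq = tr['c'].value_counts(normalize=True)   # a: 0.6, b: 0.4
tr['c_freq'] = tr['c'].map(freq)
te['c_freq'] = te['c'].map(freq).fillna(0)    # unseen "z" -> 0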

Below we add frequency-encoded features alongside label-encoded features:


target = train.pop('target')

daset = pd.concat([train, test], axis=0)

for c in cats:
    # Frequency encoding: share of rows with this category
    daset[c + '_freq'] = daset[c].map(daset.groupby(c).size() / daset.shape[0])
    # Label encoding via factorize (sorted category order)
    indexer = pd.factorize(daset[c], sort=True)[1]
    daset[c] = indexer.get_indexer(daset[c])

train = daset.iloc[:len(train), ]
test = daset.iloc[len(train):, ]
cols = train.columns

# Scale data for Ridge (linear model)
ss = StandardScaler()
train = ss.fit_transform(train)
test = ss.transform(test)

# Ridge with KFold cross-validation
score = []

oof_rg = np.zeros(len(train))
pred_rg = np.zeros(len(test))

folds = KFold(n_splits=5, shuffle=True, random_state=42)

for fold_, (train_ind, val_ind) in enumerate(folds.split(train, target)):
    print('fold:', fold_, ' - Starting ...')
    trn_data, val_data = train[train_ind], train[val_ind]
    y_train, y_val = target.iloc[train_ind], target.iloc[val_ind]

    rg = Ridge(alpha=0.1, random_state=2021)
    rg.fit(trn_data, y_train)
    oof_rg[val_ind] = rg.predict(val_data)
    y = rg.predict(trn_data)
    print('train rmse:', np.sqrt(mean_squared_error(y_train, y)),
          'val rmse:', np.sqrt(mean_squared_error(y_val, oof_rg[val_ind])))

    score.append(np.sqrt(mean_squared_error(y_val, oof_rg[val_ind])))
    pred_rg += rg.predict(test) / folds.n_splits

print('-' * 50)
print(' Ridge rmse: ', np.mean(score))
Output:
fold: 0   - Starting ...
train rmse: 0.8669466058171931 val rmse: 0.8669048548301733
fold: 1 - Starting ...
train rmse: 0.8667430371692436 val rmse: 0.8677225787826589
fold: 2 - Starting ...
train rmse: 0.8666054510007822 val rmse: 0.8682431741538739
fold: 3 - Starting ...
train rmse: 0.8670735343520586 val rmse: 0.8663906582662209
fold: 4 - Starting ...
train rmse: 0.8672040302748487 val rmse: 0.8658517679934317
--------------------------------------------------
Ridge rmse: 0.8670226068052717

Target Encoding — Mean Likelihood Encoding

Mean encoding means replacing each category with the mean target value for that category: we group the observations by category, compute the mean of the target within each group, and assign that mean to the category. Thus the category is encoded with the mean of the target.
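
As a minimal illustration on invented toy data:

import pandas as pd

df = pd.DataFrame({'color':  ['red', 'blue', 'red', 'green', 'blue', 'red'],
                   'target': [1, 0, 1, 0, 1, 0]})

# Mean of the target per category, then mapped back onto the column
means = df.groupby('color')['target'].mean()   # red: 0.67, blue: 0.5, green: 0.0
df['color_mean_enc'] = df['color'].map(means)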

Advantages of Mean encoding

  • Does not expand the feature space.
  • Creates a monotonic relationship between categories and the target.

Limitations of Mean encoding

  • May lead to overfitting.
  • May lose information if two categories have the same target mean, since the same number then replaces both.

According to the references on target encoding ([4], [5], [6]), it is better to implement target encoding with K-Fold cross-validation and with smoothing.

min_samples_leaf defines a threshold at which the prior and the target mean (for a given category value) have the same weight: below the threshold the prior becomes more important, above it the category mean becomes more important. How the weight behaves against the value counts is controlled by the smoothing parameter.
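
A minimal sketch of what such a smoothed, out-of-fold target encoder can look like (the sigmoidal weighting follows [4] and [6]; the helper names and parameter values are illustrative, and the full version is in the notebook linked below):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def smooth_target_encode(trn, other, col, y, min_samples_leaf=20, smoothing=10):
    """Encode `col` in `other` using target statistics computed on (trn, y)."""
    prior = y.mean()
    stats = y.groupby(trn[col]).agg(['mean', 'count'])
    # Sigmoid weight: rare categories shrink towards the prior, frequent ones towards their mean
    weight = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))
    mapping = prior * (1 - weight) + stats['mean'] * weight
    return other[col].map(mapping).fillna(prior)

def kfold_target_encode(train, test, cat_cols, target, n_splits=5, seed=42):
    train_enc, test_enc = pd.DataFrame(index=train.index), pd.DataFrame(index=test.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for c in cat_cols:
        oof = pd.Series(np.nan, index=train.index)
        for trn_idx, val_idx in folds.split(train):
            trn, val = train.iloc[trn_idx], train.iloc[val_idx]
            # Each fold is encoded with statistics from the other folds only
            oof.iloc[val_idx] = smooth_target_encode(trn, val, c, target.iloc[trn_idx]).values
        train_enc[c + '_te'] = oof
        # The test set uses statistics from the full training data
        test_enc[c + '_te'] = smooth_target_encode(train, test, c, target)
    return train_enc, test_enc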

Sample code in my kaggle notebook : https://www.kaggle.com/code/arashnic/cats-on-a-hot-tin-roof-cats-encoding-methods

Hash Encoding

To understand hash encoding it helps to know about hashing. Hashing is the transformation of an arbitrary-size input into a fixed-size value. We use hashing algorithms to perform hashing operations, i.e. to generate the hash value of an input. Hashing is a one-way process: one cannot recover the original input from the hash representation.

Hashing has several applications like data retrieval, checking data corruption, and in data encryption also. We have multiple hash functions available for example Message Digest (MD, MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.

Just like one-hot encoding, the hashing encoder represents categorical features using new dimensions, but here the user can fix the number of dimensions after transformation via the n_components argument. A feature with 5 categories can be represented using N new features, and a feature with 100 categories can also be transformed using the same N new features.

By default, the HashingEncoder uses the md5 hashing algorithm, but a user can pass any algorithm of their choice.

data = pd.concat([train, test], axis=0)

encoder=ce.HashingEncoder(cols=cats,n_components=6)
data = encoder.fit_transform(data)
train = data.iloc[:len(train), ]
test = data.iloc[len(train):, ]

CatBoost and Cats

When running machine learning algorithms, simply assigning numbers to categorical variables works if a category has only two levels, as for gender (male/female), bought a product (yes/no), or attended a course (yes/no). When a category has several levels, as with nationality, assigning numbers to each level implies an order: one level is ranked lower than another. While this makes sense for ordinal variables (e.g., food preferences or educational degree), it is a wrong assumption for nominal variables such as color preference, nationality, or residential city, especially when we use linear algorithms. Algorithms like CatBoost take a different approach to this problem.

We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features.

In detail, CatBoost calculates, for every category of a nominal variable, a value (a target-based statistic). This is done in a number of steps:

1. We begin with one categorical feature (e.g., nationality), called x.
2. In one randomly chosen row (the k-th row of the training set), we replace one random level of this categorical feature (the i-th level of x) with a number (e.g., Dutch by 5).
3. This number (in our example 5) is usually based on the target variable (the one we want to predict) conditional on the category level; in other words, it is based on the expected outcome.
4. A splitting attribute is used to create two sets of the training data: one set with all categories (e.g., German, French, Indian, etc.) whose target statistic is greater than the value computed in step 3, and another set with smaller values.

In their paper [7], the authors describe how CatBoost deals with categorical features. The standard way is to compute some statistic of the label values of the category (e.g., the mean). However, this creates problems if there is only one example for a category value: the numerical value of the category would then be the same as the label value. For example, if the category Belgian is assigned the value 2 and there is only one Belgian student, this student would get the value 2 for nationality. This can cause overfitting.

To avoid this problem, the authors designed a solution that involves randomly changing the order of rows in the data set: we perform a random permutation and, for each example, compute the average label value over the examples with the same category value placed before the given one in the permutation. In the paper they also describe how different features are combined to create new features. Every individual combination of categorical and numerical data points describes one observation, and the chance that two observations are exactly identical is slim; hence different categorical and numerical values can be combined into a unique merged categorical variable that contains all the individual choices. While this might sound easy, doing it for all potential combinations is computationally intensive. Another way to combine features is a greedy search at every tree split: CatBoost combines all categorical and numerical values used in the current tree with all categorical values in the data set.
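
To make the "ordered" idea concrete, here is a minimal sketch of a permutation-based target statistic with a prior (this illustrates the principle, not CatBoost's actual implementation; the prior_weight parameter is an assumption):

import numpy as np
import pandas as pd

def ordered_target_statistic(cat, y, prior_weight=1.0, seed=42):
    """Encode each row using only the rows that precede it in a random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cat))
    prior = y.mean()
    sums, counts = {}, {}
    encoded = np.empty(len(cat))
    for pos in perm:
        level = cat.iloc[pos]
        s, n = sums.get(level, 0.0), counts.get(level, 0)
        # Average label of earlier same-category rows, shrunk towards the global prior
        encoded[pos] = (s + prior_weight * prior) / (n + prior_weight)
        sums[level] = s + y.iloc[pos]
        counts[level] = n + 1
    return pd.Series(encoded, index=cat.index)

# Example: a nationality column and a binary outcome
nat = pd.Series(['Dutch', 'German', 'Dutch', 'French', 'Dutch'])
outcome = pd.Series([1, 0, 1, 0, 0])
print(ordered_target_statistic(nat, outcome))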

CatBoost supports the following methods for transforming categorical features into numerical features:

  • Borders
  • Buckets
  • BinarizedTargetMeanValue
  • Counter

You can read more about these in the CatBoost documentation [2].
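
A minimal sketch of selecting these transformations through the simple_ctr and combinations_ctr parameters of the Python package (the specific values below are illustrative; see [2] for the full syntax):

import catboost as cb

# Choose which target statistics CatBoost builds for categorical features
model = cb.CatBoostRegressor(
    iterations=500,
    simple_ctr=['Borders', 'Counter'],      # CTR types for single categorical features
    combinations_ctr=['Borders'],           # CTR types for generated feature combinations
    verbose=False,
)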

categorical_features_indices = np.where(train.dtypes == 'object')[0]
categorical_features_indices
Output:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
cat_score = []
# Split data with KFold
kfolds = KFold(n_splits=3, shuffle=True, random_state=2018)
train_features = train.columns
# Make importance dataframe
importances = pd.DataFrame()

oof_preds = np.zeros(train.shape[0])
sub_preds = np.zeros(test.shape[0])

for n_fold, (trn_idx, val_idx) in enumerate(kfolds.split(train, target)):
    X_train, y_train = train.iloc[trn_idx], target.iloc[trn_idx]
    X_valid, y_valid = train.iloc[val_idx], target.iloc[val_idx]

    # CatBoost Regressor estimator
    model = cb.CatBoostRegressor(
        learning_rate=0.1,
        iterations=2000,
        eval_metric='RMSE',
        allow_writing_files=False,
        od_type='Iter',
        bagging_temperature=0.8,
        depth=6,
        od_wait=20,
        silent=False
    )

    # Fit
    model.fit(
        X_train, y_train,
        cat_features=categorical_features_indices,
        eval_set=[(X_train, y_train), (X_valid, y_valid)],
        verbose=100,
        early_stopping_rounds=100
    )

    # Feature importance
    imp_df = pd.DataFrame()
    imp_df['feature'] = train_features
    imp_df['gain'] = model.get_feature_importance()
    imp_df['fold'] = n_fold + 1
    importances = pd.concat([importances, imp_df], axis=0, sort=False)

    oof_preds[val_idx] = model.predict(X_valid)
    cat_score.append(np.sqrt(mean_squared_error(y_valid, oof_preds[val_idx])))
    test_preds = model.predict(test)
    sub_preds += test_preds / kfolds.n_splits

print(np.mean(cat_score))
0.8439344702373242

LightGBM and Cats

LightGBM sorts the categories according to the training objective at each split. More specifically, it sorts the histogram of a categorical feature according to its accumulated statistics (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram. The split can therefore be made on any subset of levels, giving on the order of 2^k possible partitions for a feature with k levels, compared with, e.g., 4 one-vs-rest splits for a 4-level one-hot-encoded feature.

The algorithm behind this mechanism follows Fisher (1958) to find the optimal split over categories [3]: http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf
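
A minimal sketch of the idea on toy data (using the mean target as a stand-in for LightGBM's sum_gradient / sum_hessian statistic):

import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'c', 'a', 'b', 'c', 'd', 'd'],
                   'y':   [1.0, 0.2, 0.7, 0.9, 0.1, 0.8, 0.4, 0.5]})

# Sort category levels by their accumulated statistic (here: mean target)
order = df.groupby('cat')['y'].mean().sort_values().index.tolist()  # ['b', 'd', 'c', 'a']

# Only contiguous prefixes of that order need to be evaluated as candidate splits,
# instead of all 2^k possible subsets
candidate_splits = [set(order[:i]) for i in range(1, len(order))]
print(candidate_splits)   # [{'b'}, {'b', 'd'}, {'b', 'd', 'c'}]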

Below we specify the categorical features for LightGBM. Since LightGBM relies on target statistics for categorical splits, I used two additional parameters, 'min_data_per_group' and 'cat_smooth', and changed their default values. These parameters help prevent overfitting, similar to what we did with K-Fold target encoding for Ridge.

Since LightGBM is inherently tree-based, we can label-encode ordinals just like nominals with LabelEncoder and then declare them as categorical features:

for c in cats:
    le = LabelEncoder()
    le.fit(list(train[c].astype('str')) + list(test[c].astype('str')))
    train[c] = le.transform(list(train[c].astype(str)))
    test[c] = le.transform(list(test[c].astype(str)))

lgb_params = {
    'objective': 'rmse',
    'boosting': 'gbdt',
    'bagging_fraction': 0.7,
    'bagging_freq': 1,
    'cat_smooth': 200,
    'feature_fraction': 0.7,
    'learning_rate': 0.01,
    'min_child_samples': 50,
    'min_data_per_group': 200,
    'num_leaves': 10,
    'reg_alpha': 2.,
    'reg_lambda': 3.,
    'metric': 'rmse',
}




oof_lgb = np.zeros(len(train))
pred_lgb = np.zeros(len(test))

scores = []

feature_importances_gain = pd.DataFrame()
feature_importances_gain['feature'] = train.columns

feature_importances_split = pd.DataFrame()
feature_importances_split['feature'] = train.columns


folds = KFold(n_splits=3, shuffle=True, random_state=42)

for fold_, (train_ind, val_ind) in enumerate(folds.split(train, target)):

    # Specify categorical features for LightGBM
    trn_data = lgb.Dataset(train.iloc[train_ind], label=target.iloc[train_ind], categorical_feature=cats)
    val_data = lgb.Dataset(train.iloc[val_ind], label=target.iloc[val_ind], categorical_feature=cats)

    lgb_clf = lgb.train(lgb_params, trn_data, num_boost_round=3000, valid_sets=(trn_data, val_data),
                        verbose_eval=100, early_stopping_rounds=100)
    oof_lgb[val_ind] = lgb_clf.predict(train.iloc[val_ind], num_iteration=lgb_clf.best_iteration)

    scores.append(np.sqrt(mean_squared_error(target.iloc[val_ind], oof_lgb[val_ind])))

    feature_importances_gain['fold_{}'.format(fold_ + 1)] = lgb_clf.feature_importance(importance_type='gain')
    feature_importances_split['fold_{}'.format(fold_ + 1)] = lgb_clf.feature_importance(importance_type='split')

    pred_lgb += lgb_clf.predict(test, num_iteration=lgb_clf.best_iteration) / folds.n_splits

print('rmse = ', np.mean(scores))
rmse = 0.844948714893312
feature_importances_gain['average'] = feature_importances_gain[['fold_{}'.format(fold + 1) for fold in range(folds.n_splits)]].mean(axis=1)
feature_importances_gain.to_csv('feature_importances.csv')

plt.figure(figsize=(20, 8))
sns.barplot(data=feature_importances_gain.sort_values(by='average', ascending=False).head(100),palette='Reds_r', x='average', y='feature');
plt.title('TOP n feature importance over {} folds average'.format(folds.n_splits));

Endnote

To summarize, encoding categorical data is an unavoidable part of preprocessing and feature engineering. What matters most is choosing the right encoding scheme, taking into consideration the dataset we are working with and the model we are going to use. In this article we reviewed several advanced encoding techniques along with their issues and suitable use cases.

References:

[1] https://www.kaggle.com/competitions/tabular-playground-series-feb-2021/overview

[2] https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html

[3] http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf

[4] https://maxhalford.github.io/blog/target-encoding/

[5] https://medium.com/@pouryaayria/k-fold-target-encoding-dfe9a594874b

[6] A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, Daniele Micci-Barreca

[7] CatBoost: unbiased boosting with categorical features, Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin

Connect with me on LinkedIn, Arash Nicoomanesh and Kaggle, Mobius
