Categorical Variable Encoding Techniques

Shubham Singh
Published in Analytics Vidhya
Feb 23, 2020

A categorical variable is one that has two or more categories (values). There are two types of categorical variables: nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering to the categories. An ordinal variable has a clear ordering, for example, size with the categories small, medium, and large.
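pandas makes this distinction explicit through its Categorical dtype, which may or may not carry an order. A minimal illustration (the size values are made up for the example):

import pandas as pd

# nominal: no intrinsic ordering among the categories
gender = pd.Categorical(['male', 'female', 'female'])
# ordinal: an explicit ordering, so comparisons and min/max make sense
size = pd.Categorical(['small', 'large', 'medium'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)
print(size.min(), size.max())  # small large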

Many ML algorithms are unable to operate on categorical or label data directly (decision trees are a notable exception and can learn from such data). They require all input variables and output variables to be numeric, which means that categorical data must be converted to a numerical form.

A few common categorical variable encoding techniques are:

1. One-hot encoding: Encoding each categorical variable with separate Boolean variables (also called dummy variables) which take the values 0 or 1, indicating whether a category is present in an observation.

2. Integer encoding / label encoding: Replace the categories with a number from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable.

3. Count or frequency encoding: Replace the categories with the count of the observations that show that category in the dataset. Similarly, we can replace the category with the frequency (or percentage) of observations in the dataset. That is, if 10 of our 100 observations show the color blue, we would replace blue with 10 when doing count encoding, or with 0.1 when replacing by the frequency.

4. Ordered integer encoding: Categories are replaced by integers 1 to k (or 0 to k-1, as in the implementation below), where k is the number of distinct categories in the variable, and the numbering is decided by the target mean of each category. For example, if Green has a target mean of 0, Red a target mean of 0.5, and Yellow a target mean of 1, then ordering by target mean gives Green 1, Red 2, and Yellow 3.

5. Encoding using “Weight of Evidence” (WoE): Each category is replaced by the natural log of p(1)/p(0), where p(1) is the probability of the “good” target outcome and p(0) the probability of the “bad” target outcome for that category of the categorical variable. In the famous “titanic” dataset, the categorical variable “Cabin” can be encoded this way, with “Survived” as the target variable: p(1) is the probability of surviving for each category and p(0) is the probability of death. For example, if 60% of the passengers in a cabin category survived, its WoE would be ln(0.6/0.4) ≈ 0.41. Note: WoE is well suited to Logistic Regression, because the logit transformation is simply the log of the odds, i.e., ln(P(Goods)/P(Bads)).


Let’s move on to the implementation of the above-mentioned techniques in Python.

Import the required libraries, load the “titanic” dataset, and do some pre-processing to handle the nulls:

import numpy as np   # for numpy operations
import pandas as pd  # for creating DataFrames using pandas
# to split the dataset using sklearn
from sklearn.model_selection import train_test_split

# load the titanic dataset
data = pd.read_csv('titanic.csv',
                   usecols=['sex', 'embarked', 'cabin', 'survived'])

# fill the missing cabins, then keep only the first letter
# of the cabin for this demonstration
data['cabin'] = data['cabin'].fillna('Missing')
data['cabin'] = data['cabin'].str[0]
data.head()
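The train_test_split import above is worth putting to use for the target-based techniques later in the article (ordered integer encoding and WoE): their mappings should be learned on a training split only and then applied to the test split, otherwise target information leaks into the evaluation. A minimal sketch of such a split (the 0.3 test fraction and random_state are arbitrary choices for illustration; the snippets below operate on the full dataset for brevity):

# learn the encodings on X_train only, then map X_test
# with the same dictionaries
X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked', 'cabin']], data['survived'],
    test_size=0.3, random_state=0)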

One Hot Encoding

get_dummies in the pandas library does the job of encoding, as shown below. It creates an extra column for each category, with 1 indicating that the category is present in an observation and 0 that it is absent.

pd.get_dummies(data)

Limitation: it expands the dimensionality of the dataset, since the number of columns grows with the number of categories, which may lead to over-fitting while training.
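One way to shave a column off each variable is get_dummies’ drop_first option: k categories are fully determined by k-1 dummy columns, because the dropped category corresponds to all the remaining dummies being 0. A small variation on the call above:

# k-1 dummies per variable; the first category is implied
# when all of its remaining dummies are 0
pd.get_dummies(data, drop_first=True)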

Integer Encoding / Label Encoding

To replace each category in a column, we first create a dictionary whose keys are the categories and whose values are arbitrary numbers for those categories. Then each category in the column is mapped to the number defined in the dictionary. This can be achieved using the two functions below:

# Returns a dictionary mapping each category to a number
def find_category_mappings(data, variable):
    return {k: i for i, k in enumerate(data[variable].unique())}

# Maps the column using the dictionary
def integer_encode(data, variable, ordinal_mapping):
    data[variable] = data[variable].map(ordinal_mapping)

for variable in ['sex', 'cabin', 'embarked']:
    mappings = find_category_mappings(data, variable)
    integer_encode(data, variable, mappings)
data.head()

The function find_category_mappings() returns the following dictionaries:

{'male': 0, 'female': 1} => 'sex'
{'M': 0, 'C': 1, 'E': 2, 'G': 3, 'D': 4, 'A': 5, 'B': 6, 'F': 7, 'T': 8} => 'cabin'
{'S': 0, 'C': 1, 'Q': 2, nan: 3} => 'embarked'

Limitation: label encoding is not suitable for linear models like Logistic Regression, because the arbitrary integers imply an ordering and a magnitude that the categories do not actually have.
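For reference, pandas’ category dtype produces the same style of arbitrary integer codes in one line; note that the codes it assigns may differ from the dictionaries above, and that it encodes NaN as -1 rather than giving it a code of its own:

# one-line alternative using pandas' category dtype
# (NaN becomes -1 instead of getting its own code)
for variable in ['sex', 'cabin', 'embarked']:
    data[variable] = data[variable].astype('category').cat.codes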

Count or frequency encoding

The first step is to create a dictionary with each category as key and the frequency (or count) of that category as value. Then replace the categories with their counts using the dictionary:

# create the dictionaries
# (re-load the dataset first if you ran the previous section,
# since integer_encode modified data in place)
count_map_sex = data['sex'].value_counts().to_dict()
count_map_cabin = data['cabin'].value_counts().to_dict()
count_map_embark = data['embarked'].value_counts().to_dict()
# map each column with its dictionary
data['sex'] = data['sex'].map(count_map_sex)
data['cabin'] = data['cabin'].map(count_map_cabin)
data['embarked'] = data['embarked'].map(count_map_embark)
data.head()
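The same three mappings can also be built in a loop, mirroring the pattern of the label encoding section; Series.map accepts the value_counts() result directly:

# equivalent loop: map each column onto its own value counts
for variable in ['sex', 'cabin', 'embarked']:
    data[variable] = data[variable].map(data[variable].value_counts())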

Limitation: if two different categories appear in the same number of observations, they will be replaced by the same number and become indistinguishable, so valuable information may be lost.

Ordered Integer Encoding

First, calculate the target mean of each category in the column (using groupby() in pandas) and sort the categories by it. Then assign numerical values in ascending order of target mean: the lower the target mean, the lower the numerical value, and vice versa.

def find_category_mappings(data, variable, target):
    # first we generate a list of labels ordered by target mean
    ordered_labels = data.groupby([variable])[target].mean().sort_values().index
    # return the dictionary with the mappings
    return {k: i for i, k in enumerate(ordered_labels, 0)}

def integer_encode(data, variable, ordinal_mapping):
    data[variable] = data[variable].map(ordinal_mapping)

# apply both functions to each column, as in the label encoding section
for variable in ['sex', 'cabin', 'embarked']:
    mappings = find_category_mappings(data, variable, 'survived')
    integer_encode(data, variable, mappings)
data.head()

The function find_category_mappings() returns the following dictionaries, based on the ordered target mean of each category:

{'male': 0, 'female': 1} => 'sex'
{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8} => 'cabin'
{'S': 0, 'Q': 1, 'C': 2} => 'embarked'

Limitation: it is prone to over-fitting, since the numbering is derived from the target variable.

Encoding using “Weight of Evidence”

Calculate the probability of survived = 1 and survived = 0 per category of the column ‘cabin’ (similarly for the other categorical columns), starting again from a freshly loaded dataset:

#probability of survived = 1
prob_df = data.groupby(['cabin'])['survived'].mean()
# and capture it into a dataframe
prob_df = pd.DataFrame(prob_df)
# and now the probability of survived = 0
# (probability of non-events or p(0))
prob_df['died'] = 1-prob_df['survived']
prob_df now holds p(1) (‘survived’) and p(0) (‘died’) for each category of the ‘cabin’ column.

Calculate the ‘Weight of Evidence’:

prob_df['ratio'] = np.log( prob_df['survived'] / prob_df['died'] )

After calculating the WoE, it can be captured in a dictionary and mapped onto the ‘cabin’ column.

I have defined two functions that perform all of the above-mentioned steps:

# Encoding using WoE
def find_category_mappings(data, variable, target):
    tmp = pd.DataFrame(data.groupby([variable])[target].mean())
    tmp['non-target'] = 1 - tmp[target]
    tmp['ratio'] = np.log(tmp[target] / tmp['non-target'])
    return tmp['ratio'].to_dict()

def integer_encode(data, variable, ordinal_mapping):
    data[variable] = data[variable].map(ordinal_mapping)

for variable in ['sex', 'cabin', 'embarked']:
    mappings = find_category_mappings(data, variable, 'survived')
    integer_encode(data, variable, mappings)

Limitation: it is also prone to over-fitting, and a category whose target mean is exactly 0 or 1 makes p(0) or p(1) zero, leaving its WoE undefined (the log goes to infinity).
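A common remedy for the infinite-WoE case, sketched below in a hypothetical variant of find_category_mappings, is to clip each category’s p(1) away from exact 0 and 1 before taking the log; the eps value is an arbitrary choice for illustration:

# WoE with clipped probabilities so the log stays finite
def find_category_mappings_smoothed(data, variable, target, eps=1e-4):
    # p(1) per category, kept strictly inside (0, 1)
    p1 = data.groupby([variable])[target].mean().clip(eps, 1 - eps)
    return np.log(p1 / (1 - p1)).to_dict()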

Conclusion

Since handling categorical variables is a crucial step in feature engineering for any dataset, any of the above techniques can be applied, depending on the type of model. Some techniques work better with linear models such as Logistic Regression, and others with non-linear models such as decision trees. If the data is nominal and has few categories, one-hot encoding works just fine. If the relationship between a categorical column (independent variable) and the target (dependent variable) is important, ordered integer encoding can be applied. For ordinal categorical data, simple label encoding can be used.

Hope you like the article :)
