A Comprehensive Guide to Categorical Variable Encoding: Hands-on Examples and Visualizations with the Titanic Dataset

Seharfatima Ds
Jun 25, 2023

Categorical variables pose unique challenges in data science and machine learning tasks. Encoding methods provide solutions to transform categorical data into numerical representations. In this article, we will explore popular encoding methods, explain how each method works, and provide code examples using the Titanic dataset. We will also visualize the results using plots.

Dataset Preparation:

Before applying encoding methods, let’s load and preprocess the Titanic dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load Titanic dataset
data = sns.load_dataset('titanic')

# Fill missing values
data['age'] = data['age'].fillna(data['age'].median())
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])

One-Hot Encoding:

One-Hot Encoding is ideal for nominal (unordered) categorical variables: it represents each category as its own binary feature. We can use the get_dummies function from pandas to perform one-hot encoding:

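# Column names in the seaborn Titanic dataset are lowercase
one_hot_encoded = pd.get_dummies(data, columns=['sex', 'embarked'])

# Plotting: number of passengers flagged in each one-hot column
one_hot_encoded[['sex_female', 'sex_male']].sum().plot(kind='bar')
plt.title('One-Hot Encoded - sex')
plt.show()
one_hot_encoded[['embarked_C', 'embarked_Q', 'embarked_S']].sum().plot(kind='bar')
plt.title('One-Hot Encoded - embarked')
plt.show()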
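To make the output of get_dummies easier to inspect, here is a minimal sketch on a small hand-made frame. The toy values are an assumption for illustration and are not taken from the Titanic dataset. It also shows drop_first=True, which drops one indicator per variable and can help when a model is sensitive to redundant columns.

```python
import pandas as pd

# Toy frame standing in for two Titanic columns (values are illustrative)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy, columns=["sex", "embarked"], dtype=int)
print(dummies.columns.tolist())

# drop_first=True removes the first (sorted) category of each variable
reduced = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

With drop_first=True, a row of all zeros in the remaining indicator columns stands for the dropped category.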

Label Encoding:

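Label Encoding assigns a unique integer to each category. scikit-learn's LabelEncoder numbers the categories in sorted order, so the integers do not express a meaningful ranking. It is designed for encoding target labels; when applied to input features, it is safest for binary variables or for tree-based models. We can use the LabelEncoder from scikit-learn for label encoding: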

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['sex_encoded'] = label_encoder.fit_transform(data['sex'])

# Plotting
sns.countplot(x='sex_encoded', data=data)
plt.title('Label Encoded - sex')
plt.show()
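The sorted numbering can be seen without scikit-learn. The sketch below uses pandas factorize with sort=True as a stand-in that yields the same kind of sorted integer codes; the toy values are an assumption for illustration.

```python
import pandas as pd

# Toy port-of-embarkation values (illustrative)
embarked = pd.Series(["S", "C", "Q", "S", "C"])

# sort=True numbers the categories in sorted order: C -> 0, Q -> 1, S -> 2
codes, uniques = pd.factorize(embarked, sort=True)
print(codes.tolist())
print(uniques.tolist())

# Decoding: index the unique values with the integer codes
decoded = uniques[codes]
print(decoded.tolist())
```

Note that "S" receives the largest code only because of alphabetical order, which is why these integers should not be read as a ranking.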

Ordinal Encoding:

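Ordinal Encoding assigns numerical values to categories based on their order. It is used when categories have a predefined order, such as passenger class. We can create a mapping dictionary and replace the categories with their corresponding values. The sex column used here has no natural order, so the example only demonstrates the mechanics: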

ordinal_mapping = {'female': 0, 'male': 1}
data['sex_encoded'] = data['sex'].map(ordinal_mapping)

# Plotting
sns.countplot(x='sex_encoded', data=data)
plt.title('Ordinal Encoded - sex')
plt.show()
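A variable with a genuine order makes the idea clearer. The sketch below assumes a class column with the values First, Second, and Third, as in the seaborn Titanic dataset, but runs on a small hand-made frame; the explicit category order is the part that carries the ordinal information.

```python
import pandas as pd

# Toy frame with a genuinely ordered variable (values are illustrative)
toy = pd.DataFrame({"class": ["Third", "First", "Second", "Third"]})

# State the order explicitly instead of relying on alphabetical sorting
class_order = ["First", "Second", "Third"]
ordered = pd.Categorical(toy["class"], categories=class_order, ordered=True)

# .codes is the position in class_order; +1 makes the code match the class number
toy["class_encoded"] = ordered.codes + 1
print(toy)
```

Any value missing from class_order would get the code -1 (0 after the shift), which is a convenient way to spot unexpected categories.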

Binary Encoding:

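Binary Encoding first assigns each category an integer and then writes that integer in binary, one column per bit. It is effective for high-cardinality variables because n categories need only about log2(n) columns rather than n. We can use the category_encoders library to perform binary encoding: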

import category_encoders as ce

binary_encoder = ce.BinaryEncoder(cols=['sex'])
binary_encoded = binary_encoder.fit_transform(data)

# Plotting
sns.countplot(x='sex_0', data=binary_encoded)
plt.title('Binary Encoded - sex (Bit 0)')
plt.show()
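With only two categories, the benefit of binary encoding is hard to see. The hand-rolled sketch below illustrates the idea on five categories. The binary_encode helper is hypothetical and written for this example; it is not the category_encoders implementation, which may assign codes differently.

```python
import pandas as pd

def binary_encode(series, prefix):
    """Hypothetical helper: integer-code the categories, then split into bits."""
    # Step 1: integer code per category, starting at 1, in order of first appearance
    codes = pd.factorize(series)[0] + 1
    # Step 2: number of bits needed to represent the largest code
    n_bits = int(codes.max()).bit_length()
    # Step 3: one column per bit, most significant bit first
    bits = {
        f"{prefix}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)

# Five distinct categories (illustrative deck letters) need only three bit columns
deck = pd.Series(["C", "E", "G", "D", "A", "C"])
encoded = binary_encode(deck, "deck")
print(encoded)
```

One-hot encoding would need five columns for the same data, and the gap widens quickly as the number of categories grows.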

Count Encoding:

Count Encoding replaces categories with the count of their occurrences in the dataset. We can use the category_encoders library to perform count encoding:

count_encoder = ce.CountEncoder(cols=['embarked'])
count_encoded = count_encoder.fit_transform(data)

# Plotting
sns.countplot(x='embarked', data=count_encoded)
plt.title('Count Encoded - embarked')
plt.show()
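The same transformation can be sketched in plain pandas, which makes the mechanics explicit. The toy values below are an assumption for illustration.

```python
import pandas as pd

# Toy port-of-embarkation values (illustrative)
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Count how often each category occurs, then map each row to its category's count
counts = embarked.value_counts()
count_encoded = embarked.map(counts)
print(count_encoded.tolist())

# Relative frequencies are a common variant of the same idea
freq_encoded = embarked.map(embarked.value_counts(normalize=True))
print(freq_encoded.tolist())
```

Two categories that happen to occur equally often receive the same value, so this encoding can merge categories that are otherwise unrelated.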

Target Encoding:

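Target Encoding replaces each category with the mean of the target variable for that category; the category_encoders implementation blends this with the overall mean so that rare categories are smoothed. Because the encoding is computed from the target, it should be fitted on training data only to avoid leaking the target into the features. We can use the category_encoders library to perform target encoding: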

target_encoder = ce.TargetEncoder(cols=['embarked'])
target_encoded = target_encoder.fit_transform(data, data['survived'])

# Plotting
sns.barplot(x='embarked', y='survived', data=target_encoded)
plt.title('Target Encoded - embarked')
plt.show()
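To show what happens under the hood, here is a minimal sketch of a smoothed target mean computed with pandas. It uses simple additive smoothing with a hypothetical weight m, which is an assumption for illustration and not the exact formula used by category_encoders. The statistics are learned on a toy training frame and then applied to separate rows, including an unseen category.

```python
import pandas as pd

# Toy training data (illustrative values)
train = pd.DataFrame({
    "embarked": ["S", "S", "S", "C", "C", "Q"],
    "survived": [0, 1, 0, 1, 1, 0],
})
# Rows to encode; "X" is a category that never appears in training
test = pd.DataFrame({"embarked": ["C", "S", "Q", "X"]})

m = 2  # hypothetical smoothing weight: larger values pull harder toward the global mean
global_mean = train["survived"].mean()

# Per-category mean and count of the target, computed on training data only
stats = train.groupby("embarked")["survived"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Unseen categories fall back to the global mean
test["embarked_te"] = test["embarked"].map(smoothed).fillna(global_mean)
print(test)
```

Learning the statistics on the training split and only applying them elsewhere keeps the target values of held-out rows out of their own features.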

Hashing Encoding:

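Hashing Encoding maps categorical variables into a fixed-size feature space using a hash function. The number of output columns is fixed in advance, so different categories can collide in the same column. We can use the category_encoders library to perform hashing encoding: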

hash_encoder = ce.HashingEncoder(cols=['embarked'], n_components=8)
hash_encoded = hash_encoder.fit_transform(data)

# Print column names
print(hash_encoded.columns)

# Plotting
sns.countplot(x='col_0', data=hash_encoded)
plt.title('Hashing Encoded - embarked (Component 0)')
plt.show()
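The hashing idea can be sketched with the standard library. The hash_bucket and hashing_encode helpers below are hypothetical and written for this example; they are not guaranteed to reproduce the exact columns that category_encoders produces. Each value is hashed with MD5 and the result is reduced modulo the number of components to pick a column.

```python
import hashlib
import pandas as pd

def hash_bucket(value, n_components):
    """Hypothetical helper: map a value to a column index with a stable hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hashing_encode(series, n_components=8):
    """Hypothetical helper: one row per value, with a 1 in the hashed column."""
    columns = [f"col_{i}" for i in range(n_components)]
    out = pd.DataFrame(0, index=series.index, columns=columns)
    for row, value in series.items():
        out.loc[row, f"col_{hash_bucket(value, n_components)}"] = 1
    return out

# Toy port-of-embarkation values (illustrative)
embarked = pd.Series(["S", "C", "Q", "S"])
hashed = hashing_encode(embarked, n_components=8)
print(hashed)
```

No mapping has to be stored, and categories never seen before still land in one of the existing columns, at the cost of possible collisions.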

The following comprehensive table serves as a quick reference guide for understanding and selecting the most suitable encoding method for different data science and machine learning tasks.

[Table: quick reference of encoding methods]

Encoding methods provide valuable tools for transforming categorical variables into numerical representations. In this article, we explored popular encoding methods and provided code examples using the Titanic dataset. By understanding how these encoding methods work and visualizing the results, you can effectively preprocess and represent categorical data for various data science and machine learning tasks.
