Bridging the Gap: Transforming Categorical Data for Superior Models
Table of Contents
- Introduction
- Understanding Categorical Features: Nominal vs. Ordinal, Challenges with Categorical Data
- Techniques for Handling Categorical Features: One-Hot Encoding, Label Encoding, Ordinal Encoding, Target Encoding, Feature Hashing
- Mathematical Formulations: One-Hot Encoding Formula, Label Encoding Formula, Target Encoding Formula
- Practical Implementation in Python: Data Preparation, Encoding Techniques, Example Dataset, Code Implementation
- Conclusion
- References
1. Introduction
Handling categorical features is a crucial aspect of data preprocessing in machine learning. Categorical data, such as gender, colour, or product type, requires special treatment to be effectively used in predictive models. This guide will delve into various techniques for handling categorical features, providing mathematical formulations, implementation in Python, and practical examples.
2. Understanding Categorical Features
Nominal vs. Ordinal
Categorical features can be broadly categorized into nominal and ordinal types. Nominal features have no inherent order or ranking, such as colour or country names. On the other hand, ordinal features have a specific order or ranking, such as educational levels or customer satisfaction ratings.
Challenges with Categorical Data
Traditional machine learning models cannot directly handle categorical data. They require numerical inputs, which necessitates encoding categorical features into a numerical format. However, improper encoding can lead to biases or incorrect interpretations by the model.
3. Techniques for Handling Categorical Features
One-Hot Encoding
One-Hot Encoding converts each category into a binary vector, where each category is represented by a binary value (0 or 1) in a separate column. This technique is suitable for nominal data without an inherent order.
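As a quick sketch of the idea, pandas' `get_dummies` produces one binary column per category (the colour values here are illustrative):

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green']})

# One binary column per distinct category: blue, green, red (alphabetical)
dummies = pd.get_dummies(colors['color'])
```

Each row has exactly one 1 (in the column matching its category) and 0s elsewhere, so no artificial ordering is imposed on the categories.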
Label Encoding
Label Encoding assigns a unique integer to each category. In scikit-learn, LabelEncoder assigns these integers alphabetically and is primarily intended for encoding target labels; because the resulting order is arbitrary, it can mislead models when applied to ordinal features. For features with a meaningful order, prefer Ordinal Encoding with an explicitly specified ranking.
Ordinal Encoding
Ordinal Encoding maps ordinal categories to numerical values based on their order. It preserves the ordinal relationship between categories and is suitable for ordinal data with a clear ranking.
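As a minimal sketch, scikit-learn's OrdinalEncoder lets you pass the category ranking explicitly (the size ordering S < M < L < XL here is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL', 'M']})

# Explicit ranking: S < M < L < XL, mapped to 0, 1, 2, 3
ordinal_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
encoded = ordinal_encoder.fit_transform(sizes)
# encoded[:, 0] -> [0., 1., 2., 3., 1.]
```

Passing `categories` explicitly is what preserves the domain ordering; the default would sort category labels alphabetically instead.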
Target Encoding
Target Encoding uses the target variable’s information to encode categorical features. It replaces each category with the mean of the target variable for that category, helping capture target-related information. Because it uses the target itself, it is prone to leakage and overfitting on rare categories, so smoothing or cross-validated encoding is typically applied in practice.
Feature Hashing
Feature Hashing is a dimensionality reduction technique that converts categorical features into a lower-dimensional space using hash functions. It is useful for handling high-cardinality categorical features.
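The core mechanism can be sketched with a plain hash function: each category string is mapped deterministically to one of a fixed number of buckets (the helper name and bucket count are illustrative, not a real library API):

```python
import hashlib

def hash_feature(value: str, n_features: int = 3) -> int:
    # Map a category string deterministically to a bucket index
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % n_features

buckets = [hash_feature(c) for c in ['red', 'blue', 'green', 'red']]
# Identical categories always land in the same bucket;
# distinct categories may collide, which is the trade-off for fixed dimensionality
```

Because the bucket index is computed rather than stored, the encoder needs no fitted vocabulary, which is why hashing scales to very high-cardinality features.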
4. Mathematical Formulations
One-Hot Encoding Formula
One-Hot Encoding(xi) = 1 if xi = category; 0 otherwise
Label Encoding Formula
Label Encoding(xi) = numerical label assigned to xi
Target Encoding Formula
Target Encoding(xi) = [ Σ_{j=1..n} yj · I(xj = xi) ] / [ Σ_{j=1..n} I(xj = xi) ]
where xi is the categorical feature value being encoded, yj is the target variable for sample j, n is the total number of samples, and I(⋅) is the indicator function.
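The formula is simply the per-category mean of the target, which can be checked directly with pandas (the data values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'green'],
                   'target': [1, 0, 1, 1, 0]})

# Mean of the target within each category: sum(yj * I(xj = c)) / sum(I(xj = c))
category_means = df.groupby('color')['target'].mean()
encoded = df['color'].map(category_means)
# red -> 1.0, blue -> 0.0, green -> 0.5
```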
5. Practical Implementation in Python
Data Preparation
Let’s start by importing the necessary libraries and loading a sample dataset.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from category_encoders import TargetEncoder
from sklearn.feature_extraction import FeatureHasher
# Sample dataset
data = {'color': ['red', 'blue', 'green', 'red', 'green'],
        'size': ['S', 'M', 'L', 'XL', 'M'],
        'target': [1, 0, 1, 1, 0]}
df = pd.DataFrame(data)
Encoding Techniques
Let’s apply different encoding techniques to our dataset.
# One-Hot Encoding (fit_transform returns a sparse matrix; convert for inspection)
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(df[['color', 'size']]).toarray()
# Label Encoding (call fit_transform on the column itself;
# note that labels are assigned alphabetically: L=0, M=1, S=2, XL=3)
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(df['size'])
# Target Encoding
target_encoder = TargetEncoder()
target_encoded = target_encoder.fit_transform(df['color'], df['target'])
# Feature Hashing (each sample must be an iterable of strings;
# passing raw strings would hash individual characters instead)
feature_hasher = FeatureHasher(n_features=3, input_type='string')
hashed_features = feature_hasher.fit_transform([[c] for c in df['color']]).toarray()
6. Conclusion
Handling categorical features is essential for building accurate and robust machine learning models. By understanding various encoding techniques and their mathematical formulations, you can preprocess categorical data effectively and improve model performance.
7. References
- Scikit-Learn Documentation: https://scikit-learn.org/stable/
- Category Encoders Documentation: https://contrib.scikit-learn.org/category_encoders/