Different Categorical Encoding Techniques
“My first attempt at writing a Medium post”
Introduction:
A dataset can contain both numerical and categorical attributes. Since most ML models work with numerical data rather than text, we need to encode the categorical attributes as numbers.
There are two types of categorical attributes:
- Nominal
- Ordinal
In simple terms,
Nominal data doesn’t have any rank or order.
Ex: names of countries, zip codes
Ordinal data has an inherent ranking or order.
Ex: education level, rank in class
Different Encoding Techniques:
1. One-hot Encoding:
A binary column is created for each unique value of the attribute; the column matching a row’s category is set to 1 and the rest to 0.
- Pros: Preserves category information, no assumption of ordinality, suitable for most algorithms.
- Cons: High dimensionality, multicollinearity, memory usage.
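A minimal sketch of one-hot encoding using pandas (the `country` column and its values are illustrative):

```python
import pandas as pd

# Toy dataset with one nominal attribute.
df = pd.DataFrame({"country": ["US", "IN", "US", "FR"]})

# One binary column per unique value of "country".
encoded = pd.get_dummies(df, columns=["country"], dtype=int)
print(encoded)
```

Each row now has exactly one 1 across the `country_*` columns, which is also why the columns are perfectly collinear (the multicollinearity con above); `get_dummies(..., drop_first=True)` drops one column to avoid that.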
2. Label Encoding:
Assign a unique integer label to each category in the categorical variable.
- Pros: Reduces dimensionality, simple to implement.
- Cons: Implies ordinal relationships, may not be suitable for nominal data.
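A minimal sketch with pandas (the `education` values are illustrative). Note how the default integer assignment is alphabetical, not by actual education level, which is exactly the con above:

```python
import pandas as pd

df = pd.DataFrame({"education": ["High School", "PhD", "Bachelors", "PhD"]})

# .cat.codes assigns one integer per unique category
# (alphabetical by default -- an arbitrary order).
df["education_code"] = df["education"].astype("category").cat.codes
print(df)
```

For truly ordinal data, pass an explicit order instead, e.g. `pd.Categorical(df["education"], categories=["High School", "Bachelors", "PhD"], ordered=True)`, so the integers reflect the real ranking.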
3. Count Encoding (Frequency Encoding):
Replace each category with its frequency (count) in the dataset.
- Pros: Captures information about category occurrence, can be useful for tree-based models.
- Cons: Distinct categories with the same frequency become indistinguishable, and rare categories carry little signal.
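A minimal count-encoding sketch with pandas (the `city` column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Map each category to how often it appears in the column.
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)
print(df)
```

Here "NY" appears three times and "LA" twice; if two different cities happened to share a count, they would get the same encoded value, illustrating the collision con.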
4. Target Encoding:
Replace each category with the mean of the target variable for that category.
- Pros: Captures information related to the target variable, useful for many algorithms.
- Cons: Risk of target leakage; may overfit rare categories unless smoothing or cross-validated fitting is used.
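A naive target-encoding sketch with pandas (the `city` and `target` columns are illustrative). This version computes the means on the full data, which is exactly the leakage risk noted above; in practice the means should come from training folds only:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["NY", "NY", "LA", "LA", "SF"],
    "target": [1, 0, 1, 1, 0],
})

# Mean of the target per category (naive, leakage-prone version).
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
print(df)
```

"SF" appears only once, so its encoded value is just its own target, a direct example of why rare categories overfit without smoothing toward the global mean.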