Different Categorical Encoding Techniques

Sai Niharika Naidu Gandham
2 min read · Oct 3, 2023


“My first attempt at writing a Medium post”

Taken from lecture notes

Introduction:

A dataset can contain both numerical attributes and categorical attributes. Since most ML models work with numerical data rather than text, we need to encode the categorical attributes into numerical form.

There are two types of categorical attributes:

  1. Nominal
  2. Ordinal

In simple terms:

Nominal data doesn’t have any rank or order.

Ex: names of countries, zip codes

Ordinal data has an inherent ranking or order.

Ex: education level, ranks in a class

Different Encoding Techniques:

1. One-Hot Encoding:

A binary column is created for each unique value of a given attribute.

  • Pros: Preserves category information, no assumption of ordinality, suitable for most algorithms.
  • Cons: High dimensionality, multicollinearity, memory usage.
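A minimal sketch of one-hot encoding with pandas, using a hypothetical toy column of country names (a nominal attribute):

```python
import pandas as pd

# Toy nominal data (hypothetical example)
df = pd.DataFrame({"country": ["US", "IN", "US", "FR"]})

# pd.get_dummies creates one binary column per unique value
encoded = pd.get_dummies(df, columns=["country"], prefix="country")
print(encoded)
```

Each row has exactly one active column, so the three new columns together encode the original category; this is also why dropping one column (`drop_first=True`) is a common fix for the multicollinearity con noted above.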

2. Label Encoding:

Assign a unique integer label to each category in the categorical variable.

  • Pros: Reduces dimensionality, simple to implement.
  • Cons: Implies ordinal relationships, may not be suitable for nominal data.
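A sketch of label encoding for an ordinal attribute, using a hypothetical education-level column. An explicit mapping is used here so the integers reflect the intended order rather than an arbitrary alphabetical one:

```python
import pandas as pd

# Toy ordinal data (hypothetical example)
df = pd.DataFrame({"education": ["High School", "Masters", "Bachelors"]})

# Explicit mapping preserves the intended ranking for ordinal data
order = {"High School": 0, "Bachelors": 1, "Masters": 2}
df["education_code"] = df["education"].map(order)
print(df)
```

For nominal data, these integers would falsely imply an ordering (the main con above), which is why one-hot encoding is usually preferred there.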

3. Count Encoding (Frequency Encoding):

Replace each category with its frequency (count) in the dataset.

  • Pros: Captures information about category occurrence, can be useful for tree-based models.
  • Cons: Doesn’t work well for rare categories, and distinct categories with the same frequency become indistinguishable.
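A sketch of count encoding with pandas, using a hypothetical city column:

```python
import pandas as pd

# Toy data (hypothetical example)
df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY"]})

# Count how often each category appears, then map the counts back
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)
print(df)
```

Note that "LA" and "SF" both map to 1 here, illustrating the collision problem for equally frequent categories.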

4. Target Encoding:

Replace each category with the mean of the target variable for that category.

  • Pros: Captures information related to the target variable, useful for many algorithms.
  • Cons: Potential leakage, may overfit if not used carefully.
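A sketch of target encoding with pandas, using hypothetical city/price data. To avoid the leakage mentioned above, in practice the per-category means should be computed on the training split only (or with cross-validation) and then applied to held-out data:

```python
import pandas as pd

# Toy data (hypothetical example): 'price' is the target variable
df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "SF"],
    "price": [10, 20, 30, 50, 40],
})

# Replace each category with the mean target value for that category
means = df.groupby("city")["price"].mean()
df["city_target"] = df["city"].map(means)
print(df)
```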
