Different Categorical Encoding Techniques
“My first attempt at writing a Medium post”
Introduction:
A dataset can contain both numerical and categorical attributes. Since most ML models work with numerical data rather than text, we need to encode the categorical attributes as numbers.
There are two types of categorical attributes:
- Nominal
- Ordinal
In simple terms,
Nominal data doesn’t have any rank or order.
Ex: names of countries, zip codes
Ordinal data has an inherent ranking or order.
Ex: education level, rank in class
Different Encoding Techniques:
1. One-hot Encoding:
A binary column is created for each unique value of the attribute; the column matching a row’s category is set to 1 and the rest to 0.
- Pros: Preserves category information, no assumption of ordinality, suitable for most algorithms.
- Cons: High dimensionality, multicollinearity, memory usage.
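A minimal sketch of one-hot encoding using pandas (the `country` column and its values are illustrative):

```python
import pandas as pd

# Toy dataset with one nominal attribute.
df = pd.DataFrame({"country": ["US", "IN", "US", "FR"]})

# One binary column per unique value of "country".
encoded = pd.get_dummies(df, columns=["country"], dtype=int)
print(encoded)
```

Each row now has exactly one 1 across the `country_*` columns, which is also why the columns are perfectly collinear (the multicollinearity con above); `get_dummies(..., drop_first=True)` drops one column to avoid that.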
2. Label Encoding:
Assign a unique integer label to each category in the categorical variable.
- Pros: Reduces dimensionality, simple to implement.
- Cons: Implies ordinal relationships, may not be suitable for nominal data.
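A minimal sketch with pandas (the `education` values are illustrative). Note how the default integer assignment is alphabetical, not by actual education level, which is exactly the con above:

```python
import pandas as pd

df = pd.DataFrame({"education": ["High School", "PhD", "Bachelors", "PhD"]})

# .cat.codes assigns one integer per unique category
# (alphabetical by default -- an arbitrary order).
df["education_code"] = df["education"].astype("category").cat.codes
print(df)
```

For truly ordinal data, pass an explicit order instead, e.g. `pd.Categorical(df["education"], categories=["High School", "Bachelors", "PhD"], ordered=True)`, so the integers reflect the real ranking.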
3. Count Encoding (Frequency Encoding):
Replace each category with its frequency (count) in the dataset.
- Pros: Captures information about category occurrence, can be useful for tree-based models.
- Cons: Distinct categories with the same frequency become indistinguishable, and rare categories carry little signal.
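A minimal count-encoding sketch with pandas (the `city` column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Map each category to how often it appears in the column.
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)
print(df)
```

Here "NY" appears three times and "LA" twice; if two different cities happened to share a count, they would get the same encoded value, illustrating the collision con.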
4. Target Encoding:
Replace each category with the mean of the target variable for that category.
- Pros: Captures information related to the target variable, useful for many algorithms.
- Cons: Risk of target leakage; may overfit rare categories unless smoothing or cross-validated fitting is used.
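A naive target-encoding sketch with pandas (the `city` and `target` columns are illustrative). This version computes the means on the full data, which is exactly the leakage risk noted above; in practice the means should come from training folds only:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["NY", "NY", "LA", "LA", "SF"],
    "target": [1, 0, 1, 1, 0],
})

# Mean of the target per category (naive, leakage-prone version).
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
print(df)
```

"SF" appears only once, so its encoded value is just its own target, a direct example of why rare categories overfit without smoothing toward the global mean.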