Hands-on Categorical Feature Encoding in Machine Learning
Categorical Feature Encoding: A Walkthrough in Python!
Categorical feature encoding is an important step in data preprocessing: it is the process of converting categorical features into numeric features. Categorical variables are also called qualitative variables. The results produced by a model vary depending on which encoding technique is used.
Two types of categorical features exist:
- Ordinal Features
- Nominal Features
Ordinal Features:
- Ordinal features are the features that have inherent ordering.
- E.g., ratings such as Good and Bad.
Nominal Features:
- Nominal features are features that don’t have any inherent ordering, as opposed to ordinal features.
- E.g., names of persons, gender, or yes/no responses.
Need for categorical feature encoding
- Categorical features must be encoded before being fed to a model because many machine learning algorithms don’t accept categorical features as input.
- Machine learning and deep learning algorithms support only numerical (quantitative) variables.
Ordinal Encoding Techniques
- Label Encoding or Ordinal Encoding
Nominal Encoding Techniques
- Frequency Encoding
- Target Encoding
- One-hot Encoding
- Leave One Out Encoding
- M-Estimate Encoding
Load the required libraries
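The original notebook loads its libraries here; a minimal set covering the examples that follow (assuming pandas and scikit-learn are installed) might look like:

```python
# Libraries used throughout the encoding examples
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder
```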
Ordinal Encoding
1. Ordinal Encoding using OrdinalEncoder in scikit-learn
OrdinalEncoder assigns a numerical value to each category of an ordinal feature.
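A small sketch of this, using a hypothetical `rating` column (the column name and values are illustrative, not from the original notebook). Passing `categories` explicitly makes the codes follow the meaningful order rather than the alphabetical default:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with an inherent order: Bad < Average < Good
df = pd.DataFrame({"rating": ["Bad", "Good", "Average", "Good", "Bad"]})

# Supply the categories in their semantic order so the codes reflect it
encoder = OrdinalEncoder(categories=[["Bad", "Average", "Good"]])
df["rating_encoded"] = encoder.fit_transform(df[["rating"]]).ravel()
print(df["rating_encoded"].tolist())  # [0.0, 2.0, 1.0, 2.0, 0.0]
```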
2. Ordinal Encoding using LabelEncoder in scikit-learn
LabelEncoder produces the same result as OrdinalEncoder with default settings, since both assign codes in alphabetical order. Note that LabelEncoder is intended for encoding target labels and works on one column at a time.
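A sketch with the same hypothetical ratings. Because LabelEncoder sorts categories alphabetically, the assigned codes here do not match the semantic Bad < Average < Good order:

```python
from sklearn.preprocessing import LabelEncoder

ratings = ["Bad", "Good", "Average", "Good", "Bad"]

le = LabelEncoder()
encoded = le.fit_transform(ratings)

print(list(le.classes_))  # ['Average', 'Bad', 'Good'] -- alphabetical order
print(encoded.tolist())   # [1, 2, 0, 2, 1]
```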
Nominal Encoding
1. Frequency Encoding
In frequency encoding, each category in the feature is replaced with the frequency of that category.
Category refers to each of the unique values in a feature.
- Encoding(category) = Frequency(category) / Size(data)
- Frequency(category) = number of rows belonging to that category
- Size(data) = number of rows in the entire dataset
Disadvantage: If two categories have the same frequency, they become indistinguishable after encoding.
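A minimal sketch in pandas, using a hypothetical `city` column (names and values are illustrative). `value_counts(normalize=True)` computes exactly the Frequency(category) / Size(data) ratio above:

```python
import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Relative frequency of each category: count(category) / len(df)
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

print(df["city_freq"].tolist())  # NY -> 3/6, LA -> 2/6, SF -> 1/6
```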
2. Target Encoding
In target encoding, each category is replaced with the mean of the target variable over the rows belonging to that category. Target encoding is one of the most used categorical encoding techniques on Kaggle.
- Target Encoding(category) = mean(target values of that category)
Disadvantage: Tends to overfit the data if some of the categories have a low number of occurrences.
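A minimal sketch with a hypothetical `city` feature and binary `target` (both illustrative), implemented with a pandas groupby rather than an encoding library:

```python
import pandas as pd

# Hypothetical feature and binary target
df = pd.DataFrame({
    "city":   ["NY", "LA", "NY", "SF", "NY", "LA"],
    "target": [1,    0,    0,    1,    1,    0],
})

# Replace each category with the mean of the target within that category
means = df.groupby("city")["target"].mean()
df["city_te"] = df["city"].map(means)

print(df["city_te"].tolist())  # NY -> 2/3, LA -> 0.0, SF -> 1.0
```

Note how `SF`, which occurs only once, gets an extreme value of 1.0; this is the overfitting risk for rare categories mentioned above.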
3. One-Hot Encoding
One-hot encoding replaces a categorical feature with binary indicator columns: if the feature has ‘N’ unique values, ‘N’ binary features are created.
Disadvantages:
- Tree-based algorithms tend to perform poorly on one-hot encoded data because it produces a sparse, high-dimensional matrix.
- When the feature contains many unique values, an equally large number of features is created, which may result in overfitting.
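A quick sketch using `pd.get_dummies` on the same hypothetical `city` column (scikit-learn's `OneHotEncoder` is the alternative when the encoding must be fit on training data and reused):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY"]})

# One binary indicator column per unique category (3 unique values -> 3 columns)
one_hot = pd.get_dummies(df["city"], prefix="city")

print(one_hot.columns.tolist())  # ['city_LA', 'city_NY', 'city_SF']
```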
4. Leave One Out Encoding
Leave One Out Encoding (LOOE) is very similar to target encoding, the difference being that LOOE excludes the current row when calculating the mean of the target for its category.
Disadvantage: Tends to overfit to the data.
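A hand-rolled sketch of the idea with the same hypothetical data (the `category_encoders` package offers a `LeaveOneOutEncoder` as a library option). Each row gets the category's target sum minus its own target, divided by the category count minus one; note that a category occurring only once yields NaN, since there are no other rows to average:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["NY", "LA", "NY", "SF", "NY", "LA"],
    "target": [1,    0,    0,    1,    1,    0],
})

# Leave-one-out mean: (sum of target in category - this row's target) / (count - 1)
g = df.groupby("city")["target"]
sums, counts = g.transform("sum"), g.transform("count")
df["city_loo"] = (sums - df["target"]) / (counts - 1)

print(df["city_loo"].tolist())
```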
5. M-Estimate Encoding
M-Estimate Encoding, also called additive smoothing, overcomes the overfitting disadvantage of target encoding by blending each category mean with the global target mean, controlled by a smoothing factor M.
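A sketch of the smoothing formula with the same hypothetical data, assuming the common form encoding = (category target sum + M × global mean) / (category count + M); `category_encoders` provides an `MEstimateEncoder` as a library option. Rare categories are pulled more strongly toward the global mean:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["NY", "LA", "NY", "SF", "NY", "LA"],
    "target": [1,    0,    0,    1,    1,    0],
})

m = 2.0                            # smoothing factor M (a tunable choice)
global_mean = df["target"].mean()  # the prior we shrink toward
stats = df.groupby("city")["target"].agg(["sum", "count"])

# Shrink each category mean toward the global mean; rare categories shrink more
encoding = (stats["sum"] + m * global_mean) / (stats["count"] + m)
df["city_me"] = df["city"].map(encoding)

print(df["city_me"].tolist())
```

Compare `SF` here with plain target encoding: instead of an extreme 1.0 from a single row, it receives a value pulled toward the global mean of 0.5.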
Find this post in my Kaggle notebook: https://www.kaggle.com/srivignesh/categorical-feature-encoding-techniques