Introduction:
In this article I present Feature Encoding, from basic to advanced.
Feature Encoding:
Feature Encoding is the process of converting categorical values into numerical values.
Most algorithms cannot handle categorical variables unless they are converted into numerical values, so we use feature encoding methods to convert the categories into numbers.
Categorical features are generally divided into 2 types:
1.Nominal categorical features:
The variable or feature contains a finite set of discrete values with no relationship between them, i.e. no ordering among the discrete values.
2.Ordinal categorical features:
The variable or feature contains a finite set of discrete values with a relationship or ranked ordering between them.
There are many types of feature encoding to handle nominal and ordinal categorical features:
For Handling Ordinal Categorical Features:
1.Label Encoding or Integer Encoding
2.Ordinal Encoding
3.Target Guided Ordinal Encoding or Ordered Encoding
4.Mean Guided Ordinal Encoding
Note: Target Guided and Mean Guided both come under Guided Ordinal Encoding
5.Rare Label Encoding
6.LabelCount encoding
For Handling Nominal Categorical Features:
1.One Hot Encoding
2.One Hot Encoding with Multiple Categories
3.Dummy Encoding
4.Binary Encoding
5.Frequency or count Encoding
6.Mean or Target Encoding or Likelihood encoding
7.Leave one out Encoding
8.K Fold Target Encoding or Leave Fold out Encoding
9.M-estimator Encoding
10.Weight of evidence Encoding
11.Hashing Encoding
12.James-Stein Encoding
13.Base N Encoding
14.CatBoost Encoding
15.QuantileEncoder
16.SumEncoder
17.PolynomialEncoder
18.HelmertEncoder
19.GLMMEncoder
20.BackwardDifferenceEncoder
For Handling Ordinal Categorical Features:
1.Label Encoding:
Label Encoding is a popular encoding technique for handling categorical variables. This method assigns a unique integer to each label in the column based on the alphabetical ordering of the labels (that is, the integer depends on where the label falls alphabetically, not on its actual meaning).
In label encoding, each category or label is assigned a value from 0 to n-1, where n is the number of categories present in that column.
Eg: Let us consider the example given below and apply the LabelEncoder.
[Correct or actual ordering: very cold, cold, warm]
After applying Label Encoding on the above column, we get
[Label Encoder ordering: cold, very cold, warm]
So the Label Encoder assigns values from 0 to 2 (n-1 = 3-1 = 2) based on the alphabetical order of the labels: cold = 0, very cold = 1, warm = 2.
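A minimal sketch of this behaviour with sklearn's LabelEncoder, using a made-up "temperature" column for illustration:

# Minimal sketch of alphabetical label encoding with sklearn (illustrative data).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"temperature": ["very cold", "cold", "warm", "cold", "very cold"]})

le = LabelEncoder()
df["temperature_encoded"] = le.fit_transform(df["temperature"])

print(list(le.classes_))   # ['cold', 'very cold', 'warm'] -> alphabetical order
print(df)                  # cold -> 0, very cold -> 1, warm -> 2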
Drawbacks or Problems:
1.It assigns values based on alphabetical order, which can mislead the model by imposing an ordering that does not match the actual one.
Take the same example as above: the correct or actual ordering is very cold, cold, warm, but the Label Encoder's ordering is cold, very cold, warm, which destroys the ordinal information contained in the labels.
2.It does not work very well for tree based models.
Solution for the first problem:
To rectify the above problem, we use the "Ordinal Encoding" from the category_encoders library, not from the sklearn library.
2.Ordinal Encoding:
Ordinal Encoding is also a popular encoding technique for handling categorical variables. This method assigns a user-given integer to each label in the column, so we have to specify the integer for each label as a parameter of the Ordinal Encoder if we are using the category_encoders library.
If you use the Ordinal Encoder from sklearn instead, its behaviour is the same as Label Encoding in sklearn.
NOTE: The category_encoders library does not contain a "Label Encoder"; the Label Encoder is only present in sklearn.
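A minimal sketch of the category_encoders Ordinal Encoder with a user-given mapping, again on a made-up "temperature" column (the list-of-dicts format of the mapping parameter follows the library's convention):

# Sketch: OrdinalEncoder from category_encoders with a user-defined ordering (illustrative data).
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"temperature": ["very cold", "cold", "warm", "cold"]})

# We specify the integer for each label ourselves, so the true ordering is preserved.
encoder = ce.OrdinalEncoder(
    mapping=[{"col": "temperature",
              "mapping": {"very cold": 0, "cold": 1, "warm": 2}}]
)
print(encoder.fit_transform(df))   # very cold -> 0, cold -> 1, warm -> 2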
3.Target Guided Encoding or Ordered Encoding:
As the name suggests, it uses the target or output variable to encode the categories or labels. It is used for ordinal categorical features.
It assigns values from 0 to N-1 to the labels in the column.
Algorithm:
1.First, calculate the mean of the target values for each label in the feature.
2.After finding the mean value for each label, sort the labels in descending or ascending order of their means.
3.Then assign the values from N-1 down to 0 (if you are using descending order) or from 0 up to N-1 (if you are using ascending order) following the sorted order.
Eg: Let's take the dataset below; here we use the descending order.
step 1:
Mean for Label 1 (B.E) = (200 + 150) / 2 = 175
Mean for Label 2 (Masters) = 500 / 1 = 500
Mean for Label 3 (PHD) = (15000 + 999 + 1000) / 3 ≈ 5666.33
step 2:
Descending order based on the mean values:
PHD, Masters, B.E
step 3: Then we assign the values from N-1 to 0 (since we are using descending order):
Here N is 3, so PHD = 2, Masters = 1 and B.E = 0.
From the above we can see that, by calculating the mean salary for each label, we can give the labels their correct order: as the education qualification increases, the salary also increases, so the mean salary for the highest qualification is also the highest. Target Guided Encoding is based on this intuition.
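Since there is no ready-made library function for this method (see the disadvantages below), here is a minimal pandas sketch of the three steps, using made-up Education/Salary data in the spirit of the example above:

# Sketch: Target Guided (Ordered) Ordinal Encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "Education": ["B.E", "B.E", "Masters", "PHD", "PHD", "PHD"],
    "Salary":    [200,   150,   500,       15000, 999,   1000],
})

# Step 1: mean of the target (Salary) for each label.
means = df.groupby("Education")["Salary"].mean()

# Step 2: sort the labels by their mean in descending order.
ordered_labels = means.sort_values(ascending=False).index

# Step 3: assign N-1 .. 0 along the descending order (highest mean gets the largest value).
mapping = {label: len(ordered_labels) - 1 - rank for rank, label in enumerate(ordered_labels)}
df["Education_encoded"] = df["Education"].map(mapping)

print(mapping)   # {'PHD': 2, 'Masters': 1, 'B.E': 0}
print(df)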
Advantages:
1.Straightforward to implement
2.Does not expand the feature space
3.Creates monotonic relationship between categories and target
Disadvantages:
1.It does not work well for categorical target features, because then we cannot compute a mean and so cannot give a correct ordering to the labels. So when we have a categorical target feature, we cannot use this method.
2.The presence of other features in the dataset can influence the target value, so the order may not be captured correctly.
For example: Let's take the same example as above along with one more feature called "Location".
So if you calculate the mean salary for each label, you get
Mean for Label 1 (B.E) = (200 + 20000) / 2 = 10100
Mean for Label 2 (Masters) = 500 / 1 = 500
Mean for Label 3 (PHD) = (15000 + 999 + 1000) / 3 ≈ 5666.33
And if you now assign the values, B.E ends up ranked above PHD.
So in this case our result is misleading and not in the correct order.
The presence of the other feature influences the salary, and that is what misleads us.
3.There is no library that implements this method directly.
4.It may lead to over-fitting: information from the target leaks into the independent variable (this is called "target data leakage"), which results in overfitting.
4.Mean Guided Ordinal Encoding:
As the name suggests, we use the mean values of the target variable to encode the categories or labels in the column.
Target Guided and Mean Guided are almost similar, but they differ in how the encoding is applied to the categories or labels.
Algorithm:
1.First, calculate the mean of the target values for each label in the feature.
2.After finding the mean value for each label, instead of sorting in ascending or descending order and giving a rank, assign each label its own mean value.
Eg: Let's take the same dataset as in the previous example.
step 1:
Mean for Label 1 (B.E) = (200 + 150) / 2 = 175
Mean for Label 2 (Masters) = 500 / 1 = 500
Mean for Label 3 (PHD) = (15000 + 999 + 1000) / 3 ≈ 5666.33
step 2: Then assign each label its corresponding mean value: B.E = 175, Masters = 500, PHD ≈ 5666.33.
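A minimal pandas sketch of this variant, reusing the same made-up Education/Salary data: each label is replaced directly by its mean salary instead of by a rank.

# Sketch: Mean Guided encoding with pandas (illustrative data) - each label becomes its target mean.
import pandas as pd

df = pd.DataFrame({
    "Education": ["B.E", "B.E", "Masters", "PHD", "PHD", "PHD"],
    "Salary":    [200,   150,   500,       15000, 999,   1000],
})

# Step 1: mean of the target for each label.
means = df.groupby("Education")["Salary"].mean()

# Step 2: map every label to its own mean value.
df["Education_encoded"] = df["Education"].map(means)
print(df)   # B.E -> 175.0, Masters -> 500.0, PHD -> ~5666.33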
Disadvantages:
1.It has the same disadvantages as Target Guided Ordinal Encoding.
5.Rare Label Encoding:
As the name suggests, it renames to "Rare" (or a user-given name) the labels whose occurrence in the categorical feature is smaller than a threshold (which is set by us), and then we perform any other categorical encoding on the result.
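A minimal pandas sketch of the idea, with a made-up "city" column and an illustrative 10% threshold: labels whose relative frequency falls below the threshold are grouped into a single "Rare" label before any further encoding.

# Sketch: Rare Label Encoding with pandas (illustrative data and threshold).
import pandas as pd

df = pd.DataFrame({"city": ["Delhi"] * 10 + ["Mumbai"] * 8 + ["Agra"] * 1 + ["Pune"] * 1})

threshold = 0.10                                 # labels occurring in less than 10% of rows are "rare"
freq = df["city"].value_counts(normalize=True)   # relative frequency of each label
rare_labels = freq[freq < threshold].index

df["city_grouped"] = df["city"].where(~df["city"].isin(rare_labels), "Rare")
print(df["city_grouped"].value_counts())         # Delhi and Mumbai are kept; Agra and Pune become "Rare"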
6.LabelCount encoding:
As the name suggests, it counts the occurrences of each label in the categorical feature and then assigns integer values based on those counts: the label with the lowest count gets 0, and so on, so the integer assigned to each label depends on how often that label occurs.
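A minimal pandas sketch of this count-based ranking on a made-up "city" column: the least frequent label gets 0, the next one 1, and so on.

# Sketch: LabelCount encoding with pandas (illustrative data) - rank labels by how often they occur.
import pandas as pd

df = pd.DataFrame({"city": ["Delhi"] * 5 + ["Mumbai"] * 3 + ["Agra"] * 1})

counts = df["city"].value_counts()               # Delhi: 5, Mumbai: 3, Agra: 1
# Least frequent label -> 0, next -> 1, and so on.
mapping = {label: rank for rank, label in enumerate(counts.index[::-1])}
df["city_encoded"] = df["city"].map(mapping)
print(mapping)   # {'Agra': 0, 'Mumbai': 1, 'Delhi': 2}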
That's all, folks. In the next article we will look at the nominal categorical methods. If anything is wrong, please correct me, and please give feedback. Thank you.