Feature Encoding Basic to Advance - Part 2

Banarajay
9 min read · Dec 22, 2022


Introduction:

In this article we will look at feature encoding for nominal categorical features.

Part 1 Link : https://medium.com/@banarajay/feature-encoding-basic-to-advance-part-1-5fb72e415561

1. One-Hot Encoding:

One-Hot Encoding is another popular technique for treating categorical variables. This type of encoding creates a new binary feature, called a "dummy variable", for each possible category or label in the column, and then assigns a value of 1 to the dummy variable that corresponds to a sample's original label and 0 to the dummy variables of all the other labels.

In other words, one-hot encoding is the process of creating dummy variables and then, for each data point, setting the dummy variable of the label that the data point contains to 1 and the dummy variables of all the other labels to 0. Doing this for every data point leaves us with dummy variables filled with 0's and 1's.

The number of dummy variables created by one-hot encoding is equal to the number of labels in the column or feature.

Eg: Let's take the dataset below:

The edu column contains 3 labels or categories, so one-hot encoding creates 3 dummy variables.

Let's take the first data point: we assign 1 to the B.E dummy variable (Dum_B.E), because the first data point contains the B.E label, and we assign 0 to the other dummy variables, Dum_Master and Dum_PHD, because the first data point does not contain those labels.

We then do the same for all the other data points.
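To make this concrete, here is a minimal sketch in Python using pandas, with a small made-up edu column standing in for the example dataset:

```python
import pandas as pd

# Toy "edu" column with 3 labels, standing in for the example dataset
df = pd.DataFrame({"edu": ["B.E", "Masters", "PHD", "B.E", "PHD"]})

# One dummy variable per label; each row gets a 1 in the column of its
# own label and 0 in the columns of all the other labels
dummies = pd.get_dummies(df["edu"], prefix="Dum").astype(int)
print(dummies)
#    Dum_B.E  Dum_Masters  Dum_PHD
# 0        1            0        0
# 1        0            1        0
# 2        0            0        1
# 3        1            0        0
# 4        0            0        1
```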

Advantages:

1. One-hot encoding captures the dependencies between the categorical features; other methods such as mean encoding do not have this advantage.

Problems or Disadvantages:

1. Dummy Variable Trap:

The dummy variable trap means that the value of one variable can easily be predicted with the help of the remaining variables.

In other words, the dummy variable trap is a scenario in which the variables are highly correlated with each other: if we take three dummy variables, the first can be derived from the other two, the second can be derived from the other two, and likewise the third, so we can recover any one variable by using the other two.

So the dummy variable trap creates a "multicollinearity problem".
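We can see the trap directly in code. In this sketch (same toy edu column as above), every row of the full dummy set sums to 1, so each dummy column is exactly determined by the others:

```python
import pandas as pd

df = pd.DataFrame({"edu": ["B.E", "Masters", "PHD", "B.E", "PHD"]})
d = pd.get_dummies(df["edu"], prefix="Dum").astype(int)

# The full dummy set always sums to 1 across each row, so any one
# column equals 1 minus the sum of the others: perfect multicollinearity
print((d.sum(axis=1) == 1).all())                                   # True
print((d["Dum_B.E"] == 1 - d["Dum_Masters"] - d["Dum_PHD"]).all())  # True
```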

2. If the feature or variable contains a large number of categories or levels, then we end up with a large number of dummy variables, which creates the following problems:

1. A high-dimensionality problem, which can lead to high memory consumption.

2. Due to the higher dimensionality, the learning process slows down, reducing the speed of the algorithm.

3. It can also decrease accuracy.

4. It also introduces sparsity into the model: most columns contain 0 and only a few contain 1, so these dummy variables provide little useful information to the model.

NOTE: We only need to worry about the dummy variable trap when we apply one-hot encoding with a linear model, such as linear or logistic regression, because linear models suffer from multicollinearity.

We do not need to worry about the dummy variable trap when we apply one-hot encoding with a non-linear model, such as a tree-based model (decision tree, etc.), SVM, or KNN, because non-linear models do not suffer from multicollinearity, even when multicollinearity is present among the independent variables.

Solution:

To rectify the first problem, we have three solutions; we can use any one of them:

1. Drop one dummy variable by using the drop parameter of OneHotEncoder

2. Drop one dummy variable based on the VIF

3. Dummy encoding

1. Drop one dummy variable by using the drop parameter of OneHotEncoder:

OneHotEncoder has a drop parameter through which we can drop a dummy variable.

The drop parameter accepts: "first", "if_binary", or an array.

"first": OneHotEncoder always drops the first category of each variable.

"if_binary": OneHotEncoder drops the first category only for variables that have exactly two categories.

array: drop[i] is the category in feature X[:, i] that should be dropped.

So by using the drop parameter of OneHotEncoder we can drop any dummy variable we choose, whereas dummy encoding always drops the first dummy variable, i.e. the first category of the variable.
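A minimal sketch with scikit-learn (assuming version 1.2+, where the dense-output flag is named sparse_output):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["B.E"], ["Masters"], ["PHD"], ["B.E"]])

# drop="first" removes the first category of the feature, leaving
# U - 1 dummy variables and avoiding the dummy variable trap
enc = OneHotEncoder(drop="first", sparse_output=False)
print(enc.fit_transform(X))
print(enc.get_feature_names_out())  # ['x0_Masters' 'x0_PHD'] - B.E dropped
```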

2. Drop one dummy variable based on the VIF:

In this method we compute the VIF (variance inflation factor) for every dummy variable, drop the dummy variable with the highest VIF, and then stop the process; there is no need to drop any more variables.
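A sketch of the idea with statsmodels is below. Note that in this tiny example, which uses only the dummies plus an intercept, all the VIFs tie at infinity (that is the trap itself); on a real dataset you would compute the VIFs on the dummies alongside the model's other features:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Full dummy set plus a constant column, since VIF assumes an intercept
d = pd.get_dummies(
    pd.Series(["B.E", "Masters", "PHD", "B.E", "PHD", "Masters"]),
    prefix="Dum",
).astype(float)
X = sm.add_constant(d)

# VIF of each dummy column (index 0 is the constant, so we skip it)
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=d.columns,
)
print(vifs)

# Drop the dummy with the highest VIF and keep the remaining columns
reduced = d.drop(columns=[vifs.idxmax()])
```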

3. Dummy encoding:

The dummy encoding scheme is similar to one-hot encoding: it also creates dummy variables in binary format. But dummy encoding automatically drops one dummy variable to rectify the first problem. As a result, the number of dummy variables created is N - 1,

where N is the number of categories or labels in the column.

Dummy encoding always drops the first variable, which we request by setting drop_first = True; it cannot drop any other variable.

If we instead set drop_first = False, dummy encoding does not drop any dummy variables, resulting in N dummy variables.
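In pandas, dummy encoding is just get_dummies with drop_first, as this sketch shows:

```python
import pandas as pd

df = pd.DataFrame({"edu": ["B.E", "Masters", "PHD", "B.E"]})

# drop_first=True drops the first category's dummy,
# so N categories yield N - 1 dummy variables
print(pd.get_dummies(df["edu"], prefix="Dum", drop_first=True).astype(int))

# drop_first=False (the default) keeps all N dummy variables
print(pd.get_dummies(df["edu"], prefix="Dum", drop_first=False).astype(int))
```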

Disadvantages of dummy encoding and of dropping one dummy variable via the drop parameter of OneHotEncoder:

1. The dropped feature may contain useful information with respect to the output variable, so most of the time the second (VIF-based) method is used to decide which dummy feature to drop.

The above solutions are only needed for linear models; they are not needed for non-linear models such as SVM, KNN, tree-based models, etc.

To rectify the second problem, we can use:

1. One-hot encoding with multiple categories

2. Other encoding methods, such as binary encoding, etc.

1. One-Hot Encoding with Multiple Categories:

In this method we find the top N categories of the variable based on how many times each category occurs in it. Then we build dummy variables (i.e. do one-hot encoding) using only those N categories.

Eg: Let's take the dataset below; here I will select n = 2:

1. First, find the number of times each category occurs in the variable.

From the counts above, I select the top 2 categories that occur most often; in this case PHD and B.E are repeated the most in the column.

2. Then I perform one-hot encoding using only those 2 categories, resulting in:

In the above case, the encoding does not keep the information of the ignored labels: Highschool and Masters are ignored, and both are assigned the same 0 in Dum_B.E and Dum_PHD. So if you use these dummy variables to predict on a future sample with 0,0 in Dum_B.E and Dum_PHD, the model cannot produce a well-informed prediction, because the encoding keeps no information about the ignored labels.
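A minimal sketch of top-N one-hot encoding in pandas, using a made-up edu column where PHD and B.E are the two most frequent labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"edu": ["PHD", "B.E", "PHD", "Masters", "B.E", "Highschool", "PHD"]}
)

# Step 1: count occurrences and keep the n = 2 most frequent categories
top2 = df["edu"].value_counts().head(2).index.tolist()  # ['PHD', 'B.E']

# Step 2: one-hot encode using only those top categories; the ignored
# labels (Masters, Highschool) get 0 in every dummy column
for cat in top2:
    df[f"Dum_{cat}"] = (df["edu"] == cat).astype(int)
print(df)
```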

Advantages:

1. Straightforward to implement.

2. Does not require hours of variable exploration.

3. Does not massively expand the feature space (the number of columns in the dataset).

Disadvantages:

1. Does not add any information that may make the variable more predictive.

2. Does not keep the information of the ignored labels.

2. Binary Encoding:

Binary encoding converts each category into binary digits, and each binary digit then becomes its own feature column.

It is also used to solve the second problem of one-hot encoding.

It is similar to one-hot encoding; the only difference is that the number of features created by binary encoding is smaller than with one-hot encoding.

Algorithm:

1. The categories or labels in the column are first converted to integers from 1 to U, assigned in the order in which the unique categories appear in the column. This means we do not consider any relationship between the categories or impose a rank ordering on the labels; we simply give integers to the unique categories in their order of appearance,

where U is the number of unique categories or labels in the column.

2. Those integers are then converted into binary.

3. After converting the integers to binary, each digit of the binary number forms a separate column.

The number of columns created depends on the number of binary digits of the largest integer assigned to the categorical column.

For example: if the largest integer assigned to the categorical column is 10, meaning we have 10 unique labels in the column, then 10 in binary is 1010, so the encoding creates 4 columns, because the binary value of 10 has 4 digits and hence 4 features are required to represent 10 in binary.

Eg: Let's take the dataset below:

Here the number of unique labels is 4.

Step 1: The categories or labels in the column are first converted to integers from 1 to U, in order of appearance.

In this case, we give the integers 1 to 4 to the unique categories in the order in which they appear in the column. B.E comes first, so we give it 1; Masters comes second, so we give it 2; PHD comes third, so we give it 3; when B.E appears again, we give it the same integer as before, and likewise for the remaining values; finally, Highschool gets 4.

So we assign the integers to the categories or labels based on their order of appearance in the column.

Step 2: Then convert the integers into binary.

Here the largest integer is 4, and 4 in binary is 100, so we require 3 features or columns (because the binary value of 4 has 3 digits) to represent all the binary values.

If instead the largest integer were 50, whose binary representation is 110010, we would require 6 features or columns (because the binary value of 50 has 6 digits) to represent all the binary values.

Step 3: Then give each digit of the binary number its own column.
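Here is a from-scratch sketch of the three steps in pandas (in practice, the category_encoders package offers a BinaryEncoder that automates this):

```python
import pandas as pd

df = pd.DataFrame({"edu": ["B.E", "Masters", "PHD", "B.E", "Highschool"]})

# Step 1: integers 1..U in order of first appearance in the column
codes = {cat: i + 1 for i, cat in enumerate(pd.unique(df["edu"]))}
ints = df["edu"].map(codes)          # B.E -> 1, Masters -> 2, PHD -> 3, ...

# Step 2: bits needed for the largest integer (here 4 -> '100' -> 3 bits)
n_bits = int(ints.max()).bit_length()

# Step 3: one column per binary digit, most significant bit first
for b in range(n_bits):
    df[f"edu_bin_{b}"] = ints.apply(lambda v: (v >> (n_bits - 1 - b)) & 1)
print(df)
```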

Disadvantages:

1. When a categorical variable has many labels, binary encoding still increases the dimensionality of the data (though far less than one-hot encoding).

3. Frequency or Count Encoding:

In frequency encoding, each category in the feature is replaced with the frequency (or percentage) of that category in the feature.

Algorithm:

1. Select the categorical variable you would like to encode.

2. Group by category and obtain the count of each category.

3. Use the count of each category to find its percentage or frequency.

4. Replace each category with its respective percentage or frequency.

Eg: Let's take the dataset below.

Step 1: Select the categorical variable you would like to encode.

In this case, the Edu categorical variable is selected.

Step 2: Group by category and obtain the count of each category.

Step 3: Use the count of each category to find its percentage or frequency.

Step 4: Replace each category with its respective percentage or frequency.
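All four steps fit in a few lines of pandas; here is a sketch on a made-up edu column:

```python
import pandas as pd

df = pd.DataFrame({"edu": ["B.E", "PHD", "B.E", "Masters", "PHD", "B.E"]})

# Steps 2-3: count each category and normalize to get its frequency
freq = df["edu"].value_counts(normalize=True)  # B.E: 0.50, PHD: 0.33, ...

# Step 4: replace each category with its frequency
df["edu_freq"] = df["edu"].map(freq)
print(df)
```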

Advantages:

1. Straightforward to implement.

2. Does not expand the feature space.

3. Can work well enough with tree-based algorithms.

Disadvantages:

1. Not suitable for linear models.

2. Does not handle new categories in the test set automatically.

3. If two different categories appear with the same frequency in the dataset, they will be replaced by the same number, which may lose valuable information. This is called a "collision", because both categories have the same frequency.

OK folks, we will look at the remaining encodings in the next article. Thank you!
