Feature Encoding Basic to Advance-Part 3

Banarajay
17 min read · Dec 23, 2022


Introduction:

In this article we will cover the remaining feature-encoding techniques for nominal categorical variables.

Part 1 Link : https://medium.com/@banarajay/feature-encoding-basic-to-advance-part-1-5fb72e415561

Part 2 Link : https://medium.com/@banarajay/feature-encoding-basic-to-advance-part-1-5fb72e415561

4.Mean or Target Encoding or Likelihood Encoding:

As the name suggests, we compute the mean of the target variable for each category and encode each category with that target mean.

This technique works for both binary classification and regression.

Algorithm:

1.First, calculate the mean of the target values for each label (category) in the feature.

2.Then assign each label the mean value computed for it (a small pandas sketch follows the example below).

Eg: Let's take the below dataset (an Edu feature with the categories B.E, Masters and PHD, and a binary target):

Step 1:

Mean for Label 1 (B.E) = (1 + 0) / 2 = 0.5

Mean for Label 2 (Masters) = 1 / 1 = 1

Mean for Label 3 (PHD) = (1 + 1 + 0) / 3 ≈ 0.67

Step 2: Assign the mean value of the corresponding label to each label.

[ NOTE: In this case, we do not need to calculate the mean separately for each target class, e.g. B.E for target_0 and B.E for target_1. ]
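
A minimal pandas sketch of this, assuming an Edu column and a binary Target column; the values are reconstructed from the worked calculations above, since the original dataset image is not reproduced here:

```python
import pandas as pd

df = pd.DataFrame({
    "Edu":    ["B.E", "Masters", "PHD", "B.E", "PHD", "PHD"],
    "Target": [1, 1, 0, 0, 1, 1],
})

# Step 1: mean of the target for each category
category_means = df.groupby("Edu")["Target"].mean()

# Step 2: replace every category with its target mean
df["Edu_encoded"] = df["Edu"].map(category_means)
print(df)
```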

Difference between percentage and mean?

1.Percentage:

When we compute the percentage (relative frequency) of a category, every data point takes part in the calculation: the denominator is the total number of data points.

eg: Take the above dataset. B.E occurs two times and the total number of data points in the dataset is 6.

Percentage of B.E = 2 / 6 ≈ 0.33

So for a percentage, the denominator is the total number of data points.

2.Mean:

When we compute the mean for a category, only the rows belonging to that category take part: the denominator is the number of times the category occurs in the feature.

eg: Take the above dataset, where the B.E rows have the values 200 and 150 and B.E occurs two times in the Edu feature.

Mean of B.E = (200 + 150) / 2 = 175

So for a mean, the denominator is the number of times the category occurs in the feature.
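
The same distinction in a couple of lines of pandas; the Salary column is a made-up numeric column chosen only to match the 200 and 150 values above:

```python
import pandas as pd

df = pd.DataFrame({
    "Edu":    ["B.E", "Masters", "PHD", "B.E", "PHD", "PHD"],
    "Salary": [200, 300, 400, 150, 350, 250],   # hypothetical numeric column
})

pct_be  = (df["Edu"] == "B.E").sum() / len(df)         # 2 / 6 ≈ 0.33 -> denominator: all rows
mean_be = df.loc[df["Edu"] == "B.E", "Salary"].mean()  # (200 + 150) / 2 = 175 -> denominator: B.E rows only
print(pct_be, mean_be)
```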

Problems or disadvantages:

When we perform target encoding, we can run into the problem of “Overfitting”.

In target encoding, overfitting arises in two ways:

1.Target Leakage:

Information from the target leaks into the independent variable; this is called “Target Leakage” and it results in overfitting.

2.Imbalanced categories in the feature:

When a category occurs only a small number of times in the feature, we cannot rely on its average, because there are not enough data points of that category to trust the estimate, and we also cannot fairly compare the average of a rare category with that of a frequent category (one that occurs many times in the feature).

So imbalanced categories in the feature also result in “Overfitting”.

Solution:

To rectify the first problem, we have two solutions:

1.Leave One Out Encoding

2.Leave Fold Out (K-Fold) Encoding

1.Leave One Out Encoding:

Leave One Out Target Encoding computes the mean for a given row's category from the target values of all other rows of the same category, excluding the current row, which avoids target leakage. Because every row is excluded from its own calculation, rows of the same category can end up with different encoded values, and the target leakage (and hence this source of overfitting) is avoided.

For example, take the 1st data point with the B.E category in the Edu feature, which occurs 2 times in the dataset. With leave-one-out encoding, the mean for the 1st B.E row is computed from the target values of all other B.E rows, i.e. excluding the current row itself. The same is done for every other B.E row individually, so the B.E rows can receive different mean values.

The same process is applied to the other categories in the feature.

It is mainly used on small datasets, because it is relatively expensive to compute, so it is not practical for very large datasets.

But this technique can also lead to overfitting in some cases.

Eg: Let's take the below dataset:

Step 1: Find the mean value for each data point of each category in the feature (each data point of a specific category such as B.E, Masters, PHD, etc. gets its own value).

1.B.E (this category contains 2 data points):

Mean for the 1st data point (B.E) = 0 / 1 = 0

Mean for the 4th data point (B.E) = 1 / 1 = 1

To find the mean for the 1st B.E data point, we use the other B.E data points, in this case the 4th data point, and exclude the current row.

To find the mean for the 4th B.E data point, we use the other B.E data points, in this case the 1st data point, and exclude the current row.

2.Masters (this category contains 1 data point):

Mean for the 2nd data point (Masters) = 0 (there are no other Masters rows)

To find the mean for the 2nd Masters data point we would use the other Masters data points, but there are none, so the value is undefined (0/0). Here it is simply set to 0; in practice it is often replaced by the global target mean instead.

3.PHD (this category contains 3 data points):

Mean for the 3rd data point (PHD) = (1 + 1) / 2 = 1

Mean for the 5th data point (PHD) = (0 + 1) / 2 = 0.5

Mean for the 6th data point (PHD) = (0 + 1) / 2 = 0.5

To find the mean for the 3rd PHD data point, we use the other PHD data points, in this case the 5th and 6th, and exclude the current row.

To find the mean for the 5th PHD data point, we use the other PHD data points, in this case the 3rd and 6th, and exclude the current row.

To find the mean for the 6th PHD data point, we use the other PHD data points, in this case the 3rd and 5th, and exclude the current row.

As a result we get:

Mean for the 1st data point (B.E) = 0

Mean for the 4th data point (B.E) = 1

Mean for the 2nd data point (Masters) = 0

Mean for the 3rd data point (PHD) = 1

Mean for the 5th data point (PHD) = 0.5

Mean for the 6th data point (PHD) = 0.5

So rows of the same category can receive different mean values.

Step 2: Apply the mean values in the categorical column.

Through Leave One Out Encoding we can avoid this source of overfitting.
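
A minimal sketch using the LeaveOneOutEncoder from the category_encoders library (pip install category_encoders), with the same assumed Edu/Target columns. Note that for a category seen only once, the library typically falls back to the overall target mean rather than 0 as in the hand calculation above:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "Edu":    ["B.E", "Masters", "PHD", "B.E", "PHD", "PHD"],
    "Target": [1, 1, 0, 0, 1, 1],
})

encoder = ce.LeaveOneOutEncoder(cols=["Edu"])
# fit_transform applies the leave-one-out means on the training rows;
# transform on unseen data falls back to the full per-category means.
df["Edu_loo"] = encoder.fit_transform(df[["Edu"]], df["Target"])["Edu"]
print(df)
```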

2.Leave Fold Out Encoding or K-Fold Encoding:

As the name suggests, we divide the dataset into k folds and then calculate the mean for each category data point in the first fold using the target values of the same category in the other folds. We do the same to encode the second fold using the other folds, and so on for all the folds.

It is the same idea as leave-one-out encoding; the only difference is that it works fold by fold instead of row by row.

Eg: Let's take the below dataset, with the number of folds set to 3:

Step 1: First divide the dataset into 3 folds.

Step 2: Then find the mean for each category data point in each fold by using the same category's data points in the other folds.

In Fold 1 (contains two data points, of different categories):

mean for the 1st data point (B.E, fold 1) = 0 / 1 = 0

mean for the 2nd data point (Masters, fold 1) = 0 / 1 = 0

To find the mean of the B.E category in the 1st fold, we use the B.E data points in the other two folds, in this case the 4th data point in the 2nd fold.

To find the mean of the Masters category in the 1st fold, we use the Masters data points in the other two folds, in this case the 7th data point in the 3rd fold.

In Fold 2 (contains three data points, some of the same category):

mean for the 3rd data point (PHD, fold 2) = (1 + 1) / 2 = 1

mean for the 4th data point (B.E, fold 2) = 1 / 1 = 1

mean for the 5th data point (PHD, fold 2) = (1 + 1) / 2 = 1

To find the mean of the PHD category in the 2nd fold, we use the PHD data points in the other two folds, in this case the 6th and 8th data points in the 3rd fold.

To find the mean of the B.E category in the 2nd fold, we use the B.E data points in the other two folds, in this case the 1st data point in the 1st fold.

The second PHD data point in the 2nd fold is encoded the same way, again using the 6th and 8th data points in the 3rd fold.

In Fold 3 (contains three data points, some of the same category):

mean for the 6th data point (PHD, fold 3) = (0 + 1) / 2 = 0.5

mean for the 7th data point (Masters, fold 3) = 1 / 1 = 1

mean for the 8th data point (PHD, fold 3) = (0 + 1) / 2 = 0.5

To find the mean of the PHD category in the 3rd fold, we use the PHD data points in the other two folds, in this case the 3rd and 5th data points in the 2nd fold.

To find the mean of the Masters category in the 3rd fold, we use the Masters data points in the other two folds, in this case the 2nd data point in the 1st fold.

The second PHD data point in the 3rd fold is encoded the same way, again using the 3rd and 5th data points in the 2nd fold.

As a result we get:

mean for the 1st data point (B.E, fold 1) = 0

mean for the 2nd data point (Masters, fold 1) = 0

mean for the 3rd data point (PHD, fold 2) = 1

mean for the 4th data point (B.E, fold 2) = 1

mean for the 5th data point (PHD, fold 2) = 1

mean for the 6th data point (PHD, fold 3) = 0.5

mean for the 7th data point (Masters, fold 3) = 1

mean for the 8th data point (PHD, fold 3) = 0.5

Step 3: Apply the mean values in the categorical column.

K-fold encoding performs very well in many scenarios, but sometimes it doesn't work.
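
A minimal hand-rolled sketch of K-fold target encoding using scikit-learn's KFold; the column names and fold boundaries are assumptions for illustration, so the numbers will not exactly match the walk-through above:

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=3):
    """Encode each fold with per-category target means computed on the other folds."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=False)
    for rest_idx, fold_idx in kf.split(df):
        # means computed only on the OTHER folds -> no leakage from the current fold
        fold_means = df.iloc[rest_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[fold_idx] = df.iloc[fold_idx][cat_col].map(fold_means).to_numpy()
    # a category never seen outside its own fold gets the global mean
    return encoded.fillna(global_mean)

df = pd.DataFrame({
    "Edu":    ["B.E", "Masters", "PHD", "B.E", "PHD", "PHD", "Masters", "PHD"],
    "Target": [1, 1, 0, 0, 1, 1, 0, 1],
})
df["Edu_kfold"] = kfold_target_encode(df, "Edu", "Target", n_splits=3)
print(df)
```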

To rectify the second problem, we use “Smoothing techniques”:

1.Smoothing Technique or Additive Smoothing Technique:

This technique is particularly useful when some categories have very few data points in the feature.

In this imbalanced situation we cannot rely on the mean of a small category, because there are not enough data points to trust the estimate, so we pull its mean towards the “global mean”; this is what the smoothing technique does, and it helps avoid overfitting.

It combines the original category mean with a certain amount of the global mean; how much of the global mean is mixed in is decided by the parameter w.

The Naive Bayes algorithm also uses this kind of additive smoothing to deal with categories that occur very few times.

The formula for additive smoothing is

mu = (n × x̄ + w × m) / (n + w)

where

mu is the mean we're trying to compute (the one that's going to replace our categorical values),

n is the number of times the category (whose mu we are computing) occurs in the feature,

x̄ is the estimated (original) mean of that category,

w is the “weight”: how strongly the new category mean (mu) is pulled towards the overall or global mean,

m is the overall or global mean.

Here w is a parameter we have to set:

if we give a higher value to w, the new category mean (mu) moves closer to the global mean;

if we give 0 to w, the new category mean is exactly the original category mean;

if w is made extremely large (tending to infinity), the new category mean tends to the global mean itself.

The denominator (n + w) makes mu a weighted average, so the result always lies between the original category mean and the global mean.

The numerator is the mixing of the original mean (weighted by n) with the global mean (weighted by w).
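
A minimal pandas sketch of this smoothing formula, using the same assumed Edu/Target columns as before; the values are reconstructed from the worked example below and are only illustrative:

```python
import pandas as pd

def smoothed_target_means(df, cat_col, target_col, w=50):
    """mu = (n * x_bar + w * m) / (n + w) for every category."""
    m = df[target_col].mean()                               # global mean
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    n, x_bar = stats["count"], stats["mean"]
    return (n * x_bar + w * m) / (n + w)

df = pd.DataFrame({
    "Edu":    ["B.E", "Masters", "PHD", "B.E", "PHD", "PHD"],
    "Target": [1, 0, 0, 0, 1, 1],
})
# B.E stays at 0.5, Masters moves from 0 to ~0.49, PHD moves from ~0.67 to ~0.51
print(smoothed_target_means(df, "Edu", "Target", w=50))
```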

Eg: Let's take an example dataset with an imbalanced feature and w = 50:

If you calculate the mean for each category in the Edu feature, then:

Here the global mean = (1 + 0 + 0 + 0 + 1 + 1) / 6 = 0.50

In this case the Masters category appears only a single time, so we cannot rely on its mean value; instead we move the original Masters mean towards the global mean.

Step 1: Calculate the new (mu) value for every category in the Edu feature.

B.E:

Here

n = 2 (B.E occurs 2 times in the Edu feature)

x̄ = 0.5 (B.E mean value)

m = 0.5 (global mean)

w = 50 (user given)

mu(B.E) = (2 × 0.5 + 50 × 0.5) / (2 + 50) = 0.5

Masters:

Here

n = 1 (Masters occurs 1 time in the Edu feature)

x̄ = 0 (Masters mean value)

m = 0.5 (global mean)

w = 50 (user given)

mu(Masters) = (1 × 0 + 50 × 0.5) / (1 + 50) ≈ 0.49

PHD:

Here

n = 3 (PHD occurs 3 times in the Edu feature)

x̄ ≈ 0.67 (PHD mean value)

m = 0.5 (global mean)

w = 50 (user given)

mu(PHD) = (3 × 0.67 + 50 × 0.5) / (3 + 50) ≈ 0.51

From the above you can see that Masters occurs only once, so we do not rely on its original mean (0); instead its value is pulled towards the global mean, and the smoothed Masters mean never moves past the global mean.

This smoothing can be performed with the target (mean) encoder in the category_encoders library, and also with the M-estimator class, which we will see below.

5.M-estimator encoder:

The M-estimator encoder performs the above smoothing operation and has only one main parameter, “m”.

m is the weight that we give to the global mean (it plays the role of w above).
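
A minimal sketch using MEstimateEncoder from the category_encoders library; here the parameter m plays the role of the weight w above (the exact internal formula may differ slightly between library versions):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "Edu":    ["B.E", "Masters", "PHD", "B.E", "PHD", "PHD"],
    "Target": [1, 0, 0, 0, 1, 1],
})

# m is the smoothing weight: larger m pulls rare categories towards the global mean
encoder = ce.MEstimateEncoder(cols=["Edu"], m=50)
df["Edu_smoothed"] = encoder.fit_transform(df[["Edu"]], df["Target"])["Edu"]
print(df)
```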

6.Weight of Evidence Encoding (WOE):

Weight of evidence is generally described as a measure of the strength of a category (or bin) in separating events from non-events, so it gives us evidence about how strongly a particular category (for discrete variables) or bin (for continuous variables) predicts the event.

It is computed for each category of the variable to show how well that category separates events and non-events in the dependent variable.

Weight of evidence is also used to compute the “Information Value”, so we can define WOE as “the weight of evidence tells the predictive power of an independent variable in relation to the dependent variable”.

It can be used to transform categories into a continuous variable and also to convert a continuous variable into bins; here we focus on converting categories into a continuous value.

It is only used for a binary target.

It was developed mainly in the financial domain, where it is used for credit scoring.

It is calculated with the following formula:

WOE = ln( % of non-events / % of events )

In practice its values usually lie in a small range around 0 (roughly -1 to +1), although WOE is not strictly bounded.

where

ln is the natural log,

% of non-events is the percentage of all non-events falling in the particular category or bin,

% of events is the percentage of all events falling in the particular category or bin.

A positive WOE for a category or bin means the distribution of non-events (% of non-events) > the distribution of events (% of events), so that category or bin separates out the non-events more strongly; it gives evidence that a data point in this category is likely a non-event.

A negative WOE for a category or bin means the distribution of non-events (% of non-events) < the distribution of events (% of events), so that category or bin separates out the events more strongly; it gives evidence that a data point in this category is likely an event.

(The log of a ratio greater than 1 is positive; the log of a ratio less than 1 is negative.)

Algorithm:

1.Calculate the number of events and non-events in each category (bin).

2.Calculate the % of events and % of non-events in each category (as a share of total events and total non-events respectively).

3.Calculate WOE by taking the natural log of % of non-events divided by % of events (the formula above).

4.Then replace each category with its corresponding WOE value.
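
A minimal pandas/NumPy sketch of these four steps. The Occ_Imputed and Target values are made up (and use only 3 of the 5 categories) just to keep the example small; in practice a small constant is usually added to the counts so that a category with zero events or zero non-events does not produce log(0):

```python
import numpy as np
import pandas as pd

def woe_table(df, cat_col, target_col):
    grp = df.groupby(cat_col)[target_col]
    events        = grp.sum()                    # step 1: count of events (target = 1) per category
    non_events    = grp.count() - events         #         count of non-events (target = 0) per category
    pct_event     = events / events.sum()        # step 2: share of ALL events in this category
    pct_non_event = non_events / non_events.sum()
    woe = np.log(pct_non_event / pct_event)      # step 3: WOE = ln(% of non-events / % of events)
    return pd.DataFrame({"events": events, "non_events": non_events,
                         "pct_event": pct_event, "pct_non_event": pct_non_event,
                         "WOE": woe})

df = pd.DataFrame({
    "Occ_Imputed": ["Sal", "Sal", "Sal", "Prof", "Prof", "Self-Emp", "Self-Emp", "Self-Emp"],
    "Target":      [1, 0, 1, 1, 0, 0, 0, 1],
})
table = woe_table(df, "Occ_Imputed", "Target")
df["Occ_WOE"] = df["Occ_Imputed"].map(table["WOE"])   # step 4: replace categories with WOE
print(table)
```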

Eg: Let's take an occurrence matrix:

Step 1: Calculate the number of events and non-events in each category (bin):

From the matrix we can see that the Occ_Imputed categorical variable contains 5 categories, “Missing, Prof, Sal, Self-Emp, Senp”, and that cnt_resp and cnt_non_resp are the counts for the two label values.

We can also see the count of each category with respect to each label.

Step 2: Calculate the % of events and % of non-events in each category (bin).

Step 3: Calculate the WOE.

Step 4: Replace the value of each category with its corresponding WOE.
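
The same encoding is available as WOEEncoder in the category_encoders library; as a caveat, its sign convention and built-in regularization may differ slightly from the hand calculation above, so do not expect identical values:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "Occ_Imputed": ["Sal", "Sal", "Sal", "Prof", "Prof", "Self-Emp", "Self-Emp", "Self-Emp"],
    "Target":      [1, 0, 1, 1, 0, 0, 0, 1],
})

encoder = ce.WOEEncoder(cols=["Occ_Imputed"])
df["Occ_WOE"] = encoder.fit_transform(df[["Occ_Imputed"]], df["Target"])["Occ_Imputed"]
print(df)
```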

Information Value (IV):

Information value is one of the most useful techniques for selecting important variables for a predictive model.

Information value quantifies the predictive power of a variable with respect to the dependent variable, i.e. how well it separates the events from the non-events.

It helps to rank variables by their importance.

Its value is always greater than or equal to 0 (it has no fixed upper bound).

The IV is calculated using the following formula:

IV = sum over all N categories of ( % of non-events - % of events ) × WOE

where

N is the total number of categories in the categorical feature.

Note that each term of this sum is non-negative, because the difference (% of non-events - % of events) always has the same sign as the corresponding WOE.

Rules related to Information Value:

Information Value | Variable Predictiveness

Less than 0.02 | Not useful for prediction

0.02 to 0.1 | Weak predictive power

0.1 to 0.3 | Medium predictive power

0.3 to 0.5 | Strong predictive power

Greater than 0.5 | Suspicious predictive power

These thresholds are used in the financial domain; for other domains the cut-off values can change.

Eg: Let's take the same example:

The Information Value for the above case is 0.4574957676857.

A higher IV means higher predictive power with respect to the output variable.
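
A minimal sketch that computes IV from the same made-up data used in the WOE example above:

```python
import numpy as np
import pandas as pd

def information_value(df, cat_col, target_col):
    grp = df.groupby(cat_col)[target_col]
    events        = grp.sum()
    non_events    = grp.count() - events
    pct_event     = events / events.sum()
    pct_non_event = non_events / non_events.sum()
    woe = np.log(pct_non_event / pct_event)
    # each term is non-negative because the difference and the WOE share the same sign
    return ((pct_non_event - pct_event) * woe).sum()

df = pd.DataFrame({
    "Occ_Imputed": ["Sal", "Sal", "Sal", "Prof", "Prof", "Self-Emp", "Self-Emp", "Self-Emp"],
    "Target":      [1, 0, 1, 1, 0, 0, 0, 1],
})
print(information_value(df, "Occ_Imputed", "Target"))
```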

Difference between WOE and IV?

WOE quantifies the predictive power of each individual category of a variable in separating the events from the non-events: it looks at each category to see how well that category separates the two.

IV quantifies the predictive power of the variable as a whole in separating the events from the non-events: it aggregates over all categories to give one score for the entire variable.

7.Hash Encoding:

As the name suggests, hash encoding uses a hash function to perform the feature encoding.

Hashing is the process of transforming an arbitrary-size input (here, any number of categories) into a fixed-size value.

or

Hashing is the process of converting a string of characters (the category) into a hash value, and then taking the modulo of that hash value to map it into a fixed number of columns.

It is a powerful technique for handling sparsity (with high-cardinality data, one-hot encoding produces very sparse data, which this technique avoids) and for handling high dimensionality.

It is fast, memory-efficient and simple.

But it results in a loss of information, because a large number of categories are squeezed into a small fixed size.

Eg: 1000 categories reduced into 10 dimensions results in a loss of information.

Algorithm:

1.First we choose no_of_components (n), i.e. how many output features we want.

2.Then we compute the hash value for each category using a hash function (default: MD5).

Hash functions include:

1.Message Digest (MD, MD2, MD5),

2.Secure Hash Functions (SHA-0, SHA-1, SHA-2),

3.MurmurHash3, and many more.

NOTE: MurmurHash3 is a good choice because it behaves very well with respect to the “collision problem”.

3.Then we map each hash value to an index of the hash table by taking the hash value modulo n (no_of_components).

What is the hash table?

The hash table has two parts:

1.the index part,

2.the values part.

The index part has no_of_components (n) columns, named 0 to no_of_components - 1; the hash value modulo n tells us which of these columns (indices) a category maps to.

The values part contains binary values: if a hash value maps to a particular index (column), we put a 1 at that position and 0 everywhere else.

Why do we need the modulo n?

Because we want to convert any number of inputs (categories) into an n-dimensional representation (no_of_components, user given). Taking the hash value modulo n maps every category into this fixed number of dimensions; in other words, the modulo maps the hash value to an index of the hash table.

4.After mapping the hash values into the hash table, we get the final numerically encoded data.

Eg: Let's take the below categorical feature:

Step 1: Choose no_of_components (n).

In this case I take no_of_components (n) as 3.

Step 2: Then find the hash value for each category using a hash function.

In this case I chose MurmurHash3.

Step 3: Then map each hash value to an index of the hash table by taking it modulo n (no_of_components).

Step 4: Then put a 1 in the column whose index equals the modulo value (and 0 elsewhere).

Problem in the above example:

You can see that john, movie and work get the same modulo value and therefore end up with the same binary pattern. This problem is called a “collision”: when more than one category has the same modulo value, they land in the same binary notation and collide with each other.

Solution:

There is no way to eliminate this problem completely, but it can be managed by choosing the no_of_components parameter carefully.

That means, when we choose the value of no_of_components, we have to consider the “trade-off between the number of dimensions and collisions”:

if no_of_components is small, we get few dimensions but many collisions;

if no_of_components is large, we get many dimensions but few collisions;

a moderate no_of_components keeps the number of dimensions manageable while keeping collisions reasonably low.

So the problem is handled by tuning no_of_components.
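
A minimal sketch using HashingEncoder from the category_encoders library; the Name column and its values are made up. The library's default hash function is MD5; scikit-learn's FeatureHasher is an alternative that uses MurmurHash3:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"Name": ["john", "movie", "work", "alice", "bob"]})

# n_components is the no_of_components discussed above (here 3)
encoder = ce.HashingEncoder(cols=["Name"], n_components=3)
hashed = encoder.fit_transform(df)
print(hashed)   # columns col_0, col_1, col_2; colliding categories share the same pattern
```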

Advantages:

1.It is fast, memory-efficient and simple.

2.It deals well with high-dimensional (high-cardinality) data.

Disadvantages:

1.It results in a loss of information, because a large number of dimensions are reduced into a small fixed size.

Eg: reducing 1000 categories into 10 dimensions results in a large loss of information; reducing the same 1000 into 600 dimensions results in a much smaller loss.

OK folks, the remaining encodings we will see in the next article. Thank you.
