Feature Encoding

Deniz Gunay
19 min read · Aug 17, 2023

Although some machine learning models can deal with categorical (non-numerical) values, most can only work with numerical values. For example, the k-Nearest Neighbors algorithm calculates the Euclidean distance between two observations, which requires numerical input. So the input should be numerical before we feed data to such an algorithm. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones. This process is called feature encoding.

There are two different types of categorical variables:

  • Ordinal Data: Data that comprises a finite set of discrete values with an order or level of preference. Example: [Low, Medium, High], [Positive, Negative], [True, False]
  • Nominal Data: Data that comprises a finite set of discrete values with no relationship between them. Example: [“India”, “America”, “England”], [“Lion”, “Monkey”, “Zebra”]

For ordinal data, after encoding the data and training the model, we may need to transform the encoded values back to their original form so the predictions can be interpreted properly. For nominal data this is not required, since there is no ordering to preserve; we just need the information.
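For example, here is a minimal sketch (on a made-up Series) of encoding an ordinal variable with an explicit order and then mapping the encoded values back to their original labels:

import pandas as pd

s = pd.Series(["Low", "High", "Medium", "Low"])
order = {"Low": 0, "Medium": 1, "High": 2}                # explicit ordinal ranking
encoded = s.map(order)                                    # 0, 2, 1, 0
decoded = encoded.map({v: k for k, v in order.items()})   # back to the original labels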

NOTE: While coding, you may come across categorical variables that look numerical. Although they are categorical, we don’t apply encoding to these variables since they are already in numerical form.

For example, the ‘Sex’ variable below is a categorical variable even though it looks numerical. Therefore, we don’t need to apply any encoding.

Male:0 , Female: 1

Label Encoding

Label encoding is very simple: it converts each non-numerical value in a column to a number. We use label encoding for binary data and ordinal data. For example, a binary variable like [Male, Female] becomes [1, 0], and an ordinal variable like [Low, Medium, High] becomes [0, 1, 2].

Let’s do some coding to understand better!

from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)





def load():
    data = pd.read_csv("titanic.csv")
    return data

Encoding actually means changing: when you encode something, you change its representation. For example, there are two sexes in the data, male and female. You can write 1 and 0 instead of “male” and “female”. We do this because, most of the time, working with numbers is easier for machine learning algorithms.

df = load()
#Sex is a binary variable since it consists of Male and Female.
print(df["Sex"].head())
'''
0 male
1 female
2 female
3 female
4 male
Name: Sex, dtype: object
'''





#Now, encode Sex column by using label encoder.
#For this, we will use the LabelEncoder() class and the fit_transform() method.
le = LabelEncoder()
print(le.fit_transform(df["Sex"])[0:5]) # [1 0 0 0 1] (1: male, 0: female)





#If you don't know what those 0s and 1s represent, you can recover their
#meanings with the inverse_transform() method.
print(le.inverse_transform([0, 1])) # ['female' 'male']

NOTE: Encoding is done alphabetically. The word “female” is represented by 0 because it comes before “male” alphabetically.
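You can confirm this ordering through the encoder's classes_ attribute, which stores the sorted labels:

print(le.classes_)  # ['female' 'male'] -> index 0 is 'female', index 1 is 'male'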

What is the difference between dataframe[column].nunique() and len(dataframe[column].unique())?

Let’s say the ‘Embarked’ column consists of ‘S’, ‘Q’ and ‘C’, but also contains some NaN values. In this case, the unique() method will treat NaN as a category. For example,

print(df['Embarked'].unique()) # ['S' 'C' 'Q' nan]


#Therefore, len(df['Embarked'].unique()) will be 4
print(len(df['Embarked'].unique())) # 4

On the other hand, nunique() will not treat NaN as a category. It only counts ‘S’, ‘C’ and ‘Q’, so it returns 3.

print(df['Embarked'].nunique())    # 3
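If you want to see how many values are actually missing, you can count them directly:

print(df['Embarked'].isnull().sum())  # 2 (the classic Titanic data has two missing ports)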

A generalized label encoder function

As you know, we can apply label encoding to binary or ordinal variables. Therefore, we can write a function for that purpose.

def label_encoder(dataframe, binary_or_ordinal_col):
    labelencoder = LabelEncoder()
    dataframe[binary_or_ordinal_col] = labelencoder.fit_transform(dataframe[binary_or_ordinal_col])
    return dataframe




#Let's use this function on our Titanic dataset. In the Titanic dataset,
#we don't have any ordinal columns.
print(df.head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
'''





#But we have a binary non-numerical column, which is 'Sex'.
binary_cols = [col for col in df.columns if df[col].dtype not in ['int64', 'float64']
               and df[col].nunique() == 2]

print(binary_cols) # ['Sex']





#Apply label encoding to the binary non-numerical columns.
for col in binary_cols:
    label_encoder(df, col)





#Let's see these binary columns after encoding.
print(df[binary_cols].head())
'''
Sex
0 1
1 0
2 0
3 0
4 1
'''

Now, let’s try Label Encoder on a different dataset.

def load_application_train():
    data = pd.read_csv("application_train.csv")
    return data




#Now, let's import application_train data.
dff = load_application_train()



#Since we don't have any ordinal variables, we will deal with binary non-
#numerical variables only.
binary_cols = [col for col in dff.columns if dff[col].dtype not in ['int64', 'float64']
               and dff[col].nunique() == 2]




#These are binary columns.
print(binary_cols)
# ['NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'EMERGENCYSTATE_MODE']
print(dff[binary_cols].head())
'''
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE
0 Cash loans N Y No
1 Cash loans N N No
2 Revolving loans Y Y NaN
3 Cash loans N Y NaN
4 Cash loans N Y NaN
'''





#Use label encoding for these binary columns.
for col in binary_cols:
    label_encoder(dff, col)

NOTE: Unless you handle them separately, NaN values are also encoded as their own category.

#For example,
#EMERGENCYSTATE_MODE is a binary column but it now contains 0s, 1s and 2s.
#The 2s represent NaN values for this column.
print(dff[binary_cols].head())
'''
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE
0 0 0 1 0
1 0 0 0 0
2 1 1 1 2
3 0 0 1 2
4 0 0 1 2
'''
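If you would rather keep the NaN values as NaN, one option (a minimal sketch, not the only way) is to fit the encoder on the non-missing rows only:

tmp = load_application_train()
mask = tmp["EMERGENCYSTATE_MODE"].notna()
tmp.loc[mask, "EMERGENCYSTATE_MODE"] = LabelEncoder().fit_transform(tmp.loc[mask, "EMERGENCYSTATE_MODE"])
print(tmp["EMERGENCYSTATE_MODE"].head())  # 0, 0, NaN, NaN, NaN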

One Hot Encoding

When you have nominal data, one-hot encoding maps each category to a binary (0 or 1) indicator. In other words, each category is transformed into its own binary numerical variable. These new variables are called dummy variables. For example, take a nominal Team column whose categories are GS, FB and BJK.

If we apply one-hot encoding to it, we get three new columns, GS, FB and BJK, where each column holds 1 for the rows that belong to that category and 0 otherwise.

Additionally, in order to prevent collinearity issues, you can drop the first of these columns, which is the GS column.

However, you don’t always need to drop the first column. Most of the time, collinearity does not cause a serious problem.
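The redundancy is easy to see: in rows without missing values, the dummy columns always sum to one, so any one of them is fully determined by the others. A quick check with the Titanic ‘Embarked’ column:

emb_dummies = pd.get_dummies(df["Embarked"].dropna())
#Every row has exactly one 1 across the dummy columns,
#so any single column is implied by the rest.
print((emb_dummies.sum(axis=1) == 1).all())  # True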

NOTE: If a nominal variable contains some NaN values, you can create a separate NaN column by setting the dummy_na parameter to True during one-hot encoding. You will see how to do it just below.

One-hot encoding is useful for data whose categories have no relationship to each other. Machine learning algorithms treat the order of numbers as meaningful: they will read a higher number as better or more important than a lower one. While this is helpful for ordinal situations, for nominal data it can lead to prediction issues and poor performance. Now, let’s do some coding!

#Let's look at the titanic's Embarked column. 
#It has 3 categories: 'S' , 'C' and 'Q'
df = load()
print(df["Embarked"].value_counts())
'''
S 644
C 168
Q 77
Name: Embarked, dtype: int64
'''





#Now we will convert 'Embarked' column to a numerical column. Embarked column
#has 3 categories. Therefore, each category will be a column. We will not drop
#the first column, and won't add any NaN column. Therefore, drop_first=False
#and dummy_na=False
print(pd.get_dummies(df, columns=["Embarked"], drop_first=False, dummy_na=False).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 1 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 0 1
'''






#If we set drop_first=True, the Embarked_C column will be dropped. Let's see.
print(pd.get_dummies(df, columns=["Embarked"], drop_first=True, dummy_na=False).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 1
'''






#If we set dummy_na=True, there will be an extra column for NaN values.
print(pd.get_dummies(df, columns=["Embarked"], drop_first=True, dummy_na=True).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_Q Embarked_S Embarked_nan
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 1 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 0 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 1 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 1 0
'''

NOTE: By default, dummy_na=False and drop_first=False in the get_dummies() function.

print(pd.get_dummies(df, columns=["Embarked"]).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 1 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 0 1
'''






#Let's set drop_first=True.
#The Embarked_C column will be dropped.
print(pd.get_dummies(df, columns=["Embarked"], drop_first=True).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 1
'''

NOTE: You can also apply one-hot encoding to nominal numerical variables. For example, say you have a Pclass variable that consists of 1s, 2s and 3s. If these numbers are just nominal category labels, then you can apply one-hot encoding here too.
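A quick sketch of that case, using the Pclass values shown in the head above (3, 1, 3, 1, 3):

print(pd.get_dummies(df["Pclass"], prefix="Pclass").head())
'''
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
'''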

A generalized one hot encoder function

Now let’s write a function that does what we have done so far. But before that, we need to grab the columns to which one-hot encoding can be applied.

def one_hot_encoder(dataframe, categorical_cols, drop_first=True, dummy_na=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first, dummy_na=dummy_na)
    return dataframe





#Let's find the columns to apply one hot encoding to.
#We will ignore cardinal columns and binary columns by
#setting thresholds. One hot encoding is not applied to
#cardinal columns, and for binary columns we typically use
#label encoding instead of one hot encoding. So we will
#apply one hot encoding to categorical columns that are
#neither binary nor cardinal.
ohe_cols = [col for col in df.columns if 10 >= df[col].nunique() > 2]
print(ohe_cols) # ['Pclass', 'SibSp', 'Parch', 'Embarked']






#After applying one hot encoding, let's see the new columns.
print(one_hot_encoder(df, ohe_cols).head())
'''
PassengerId Survived Name Sex Age Ticket Fare Cabin Pclass_2 Pclass_3 SibSp_1 SibSp_2 SibSp_3 SibSp_4 SibSp_5 SibSp_8 Parch_1 Parch_2 Parch_3 Parch_4 Parch_5 Parch_6 Embarked_Q Embarked_S
0 1 0 Braund, Mr. Owen Harris male 22.0 A/5 21171 7.2500 NaN 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 3 1 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
4 5 0 Allen, Mr. William Henry male 35.0 373450 8.0500 NaN 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
'''

Rare Encoding

Rare encoding is the process of grouping labels that appear in only a small number of observations into a new category, “Rare”. We need to specify a frequency threshold below which a category is considered rare (a toy sketch of the idea follows the steps below).

There are 3 steps for rare encoding:

1) Analyze the frequency of the categorical variables.

2) Analyze the relationship between the rare categories and the dependent variable.

3) Write a rare encoder function.
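First, the promised toy sketch of the core idea on a made-up Series, with a 5% threshold:

import pandas as pd

s = pd.Series(["A"] * 97 + ["B"] * 2 + ["C"])
ratios = s.value_counts() / len(s)           # A: 0.97, B: 0.02, C: 0.01
rare_labels = ratios[ratios < 0.05].index    # ['B', 'C']
s = s.where(~s.isin(rare_labels), "Rare")    # keep frequent labels, pool the rest
print(s.value_counts())                      # A: 97, Rare: 3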

#Let's import application_train data. This data set indicates the 
#characteristics of individuals and whether they have paid their bank debts.
dff = load_application_train()




#We need to work on categorical columns. So we will use grab_col_names()
#function.
def grab_col_names(dataframe, cat_th=10, car_th=20):
    '''
    Returns the categorical column list, numerical column list and categorical-but-cardinal column list.

    Parameters
    ----------
    dataframe: dataframe
        main dataframe
    cat_th: int, float
        threshold on the number of unique values for a column that looks numerical but is actually categorical
    car_th: int, float
        threshold on the number of unique values for a column that looks categorical but is actually cardinal

    Returns
    -------
    cat_cols: list
        list of categorical columns
    num_cols: list
        list of numerical columns
    cat_but_car: list
        list of cardinal columns

    Notes
    ------
    -> cat_cols + num_cols + cat_but_car = the number of columns of dataframe
    -> cat_cols includes num_but_cat
    -> Categorical variables with a numerical appearance are also included in the categorical variables.

    Examples
    ------
    import seaborn as sns
    df = sns.load_dataset("iris")
    print(grab_col_names(df))
    '''

    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O" and col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")

    return cat_cols, num_cols, cat_but_car





#Let's use grab_col_names() function.
cat_cols, num_cols, cat_but_car = grab_col_names(dff)
'''
Observations: 307511
Variables: 122
cat_cols: 54
num_cols: 67
cat_but_car: 1
num_but_cat: 39
'''

We will apply rare encoding to the non-numerical columns. In other words, both the categorical columns and the cardinal columns are considered.

cat_car_cols = cat_cols + cat_but_car





#This function shows us the count and ratio of every category.
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show()





for col in cat_car_cols:
    cat_summary(dff, col)
'''
NAME_CONTRACT_TYPE Ratio
Cash loans 278232 90.478715
Revolving loans 29279 9.521285
##########################################
CODE_GENDER Ratio
F 202448 65.834393
M 105059 34.164306
XNA 4 0.001301
##########################################
FLAG_OWN_CAR Ratio
N 202924 65.989184
Y 104587 34.010816
##########################################
ORGANIZATION_TYPE Ratio
Business Entity Type 3 67992 22.110429
XNA 55374 18.007161
Self-employed 38412 12.491260
Other 16683 5.425172
Medicine 11193 3.639870
Business Entity Type 2 10553 3.431747
Government 10404 3.383294
School 8893 2.891929
Trade: type 7 7831 2.546576
Kindergarten 6880 2.237318
Construction 6721 2.185613
Business Entity Type 1 5984 1.945947
Transport: type 4 5398 1.755384
Trade: type 3 3492 1.135569
Industry: type 9 3368 1.095245
Industry: type 3 3278 1.065978
Security 3247 1.055897
Housing 2958 0.961917
Industry: type 11 2704 0.879318
Military 2634 0.856555
Bank 2507 0.815255
Agriculture 2454 0.798020
Police 2341 0.761274
Transport: type 2 2204 0.716722
Postal 2157 0.701438
Security Ministries 1974 0.641928
Trade: type 2 1900 0.617864
Restaurant 1811 0.588922
Services 1575 0.512177
University 1327 0.431529
Industry: type 7 1307 0.425025
Transport: type 3 1187 0.386002
Industry: type 1 1039 0.337874
Hotel 966 0.314135
Electricity 950 0.308932
Industry: type 4 877 0.285193
Trade: type 6 631 0.205196
Industry: type 5 599 0.194790
Insurance 597 0.194139
Telecom 577 0.187636
Emergency 560 0.182107
Industry: type 2 458 0.148938
Advertising 429 0.139507
Realtor 396 0.128776
Culture 379 0.123248
Industry: type 12 369 0.119996
Trade: type 1 348 0.113167
Mobile 317 0.103086
Legal Services 305 0.099183
Cleaning 260 0.084550
Transport: type 1 201 0.065364
Industry: type 6 112 0.036421
Industry: type 10 109 0.035446
Religion 85 0.027641
Industry: type 13 67 0.021788
Trade: type 4 64 0.020812
Trade: type 5 49 0.015934
Industry: type 8 24 0.007805
...
...
Output is truncated.
'''

Now, let’s look at a specific categorical column:

#Look at "NAME_INCOME_TYPE" column
print(dff["NAME_INCOME_TYPE"].value_counts())
'''
Working 158774
Commercial associate 71617
Pensioner 55362
State servant 21703
Unemployed 22
Student 18
Businessman 10
Maternity leave 5
Name: NAME_INCOME_TYPE, dtype: int64
'''
#As can be seen, the "Unemployed", "Student", "Businessman" and "Maternity leave"
#categories have very low counts. So we could create a new category called
#RARE which consists of these categories.






#BUT, let's look at the relationship between target column("TARGET") and
#"NAME_INCOME_TYPE". Remember, if TARGET=0 then the debt has been paid,
#if TARGET=1 then the debt has not been paid.
print(dff.groupby("NAME_INCOME_TYPE").agg({'TARGET':'mean'}))
'''
TARGET
NAME_INCOME_TYPE
Businessman 0.000000
Commercial associate 0.074843
Maternity leave 0.400000
Pensioner 0.053864
State servant 0.057550
Student 0.000000
Unemployed 0.363636
Working 0.095885
'''
#As you can see, all businessmen have paid their debts, but 36% of unemployed
#people have not paid theirs. So if we create a RARE category and put both
#the Unemployed and Businessman categories into it, we may do something wrong,
#because there is a huge difference between businessmen and unemployed people
#with regard to their ability to pay the debt.
#In other words, we should be careful before applying rare encoding.

A generalized rare encoder function

Let’s write a function that does what we have done so far!

def rare_analyser(dataframe, target, cat_cols):
    for col in cat_cols:
        print(col, " total number of categories : ", len(dataframe[col].value_counts()))
        print(pd.DataFrame({"COUNT": dataframe[col].value_counts(),
                            "RATIO": dataframe[col].value_counts() / len(dataframe),
                            "TARGET_MEAN": dataframe.groupby(col)[target].mean()}), end="\n\n\n")




rare_analyser(dff, "TARGET", cat_car_cols)
'''
NAME_CONTRACT_TYPE total number of categories : 2
COUNT RATIO TARGET_MEAN
Cash loans 278232 0.904787 0.083459
Revolving loans 29279 0.095213 0.054783


CODE_GENDER total number of categories : 3
COUNT RATIO TARGET_MEAN
F 202448 0.658344 0.069993
M 105059 0.341643 0.101419
XNA 4 0.000013 0.000000


FLAG_OWN_CAR total number of categories : 2
COUNT RATIO TARGET_MEAN
N 202924 0.659892 0.085002
Y 104587 0.340108 0.072437


ORGANIZATION_TYPE total number of categories : 58
COUNT RATIO TARGET_MEAN
Advertising 429 0.001395 0.081585
Agriculture 2454 0.007980 0.104727
Bank 2507 0.008153 0.051855
Business Entity Type 1 5984 0.019459 0.081384
Business Entity Type 2 10553 0.034317 0.085284
Business Entity Type 3 67992 0.221104 0.092996
Cleaning 260 0.000845 0.111538
Construction 6721 0.021856 0.116798
Culture 379 0.001232 0.055409
Electricity 950 0.003089 0.066316
Emergency 560 0.001821 0.071429
Government 10404 0.033833 0.069781
Hotel 966 0.003141 0.064182
Housing 2958 0.009619 0.079446
Industry: type 1 1039 0.003379 0.110683
Industry: type 10 109 0.000354 0.064220
Industry: type 11 2704 0.008793 0.086538
Industry: type 12 369 0.001200 0.037940
Industry: type 13 67 0.000218 0.134328
Industry: type 2 458 0.001489 0.072052
Industry: type 3 3278 0.010660 0.106162
Industry: type 4 877 0.002852 0.101482
Industry: type 5 599 0.001948 0.068447
Industry: type 6 112 0.000364 0.071429
Industry: type 7 1307 0.004250 0.080337
Industry: type 8 24 0.000078 0.125000
Industry: type 9 3368 0.010952 0.066805
Insurance 597 0.001941 0.056951
Kindergarten 6880 0.022373 0.070349
Legal Services 305 0.000992 0.078689
Medicine 11193 0.036399 0.065845
Military 2634 0.008566 0.051253
Mobile 317 0.001031 0.091483
Other 16683 0.054252 0.076425
Police 2341 0.007613 0.049979
Postal 2157 0.007014 0.084376
Realtor 396 0.001288 0.106061
Religion 85 0.000276 0.058824
Restaurant 1811 0.005889 0.117062
School 8893 0.028919 0.059148
Security 3247 0.010559 0.099784
Security Ministries 1974 0.006419 0.048632
Self-employed 38412 0.124913 0.101739
Services 1575 0.005122 0.066032
Telecom 577 0.001876 0.076256
Trade: type 1 348 0.001132 0.089080
Trade: type 2 1900 0.006179 0.070000
Trade: type 3 3492 0.011356 0.103379
Trade: type 4 64 0.000208 0.031250
Trade: type 5 49 0.000159 0.061224
Trade: type 6 631 0.002052 0.045959
Trade: type 7 7831 0.025466 0.094496
Transport: type 1 201 0.000654 0.044776
Transport: type 2 2204 0.007167 0.078040
Transport: type 3 1187 0.003860 0.157540
Transport: type 4 5398 0.017554 0.092812
University 1327 0.004315 0.048983
XNA 55374 0.180072 0.053996
...
Output is truncated.
'''






#Now we can finally write the rare encoder function.
#We should determine a rarity threshold (rare_perc);
#if a category shows up less often than this threshold,
#we add it to the new RARE category.
def rare_encoder(dataframe, rare_perc):
    temp_df = dataframe.copy()

    rare_columns = [col for col in temp_df.columns if temp_df[col].dtypes == 'O'
                    and (temp_df[col].value_counts() / len(temp_df) < rare_perc).any(axis=None)]

    for var in rare_columns:
        tmp = temp_df[var].value_counts() / len(temp_df)
        rare_labels = tmp[tmp < rare_perc].index
        temp_df[var] = np.where(temp_df[var].isin(rare_labels), 'Rare', temp_df[var])

    return temp_df




new_df = rare_encoder(dff, 0.01)
#Now some of the categories have been replaced with 'Rare'.
#For example, let's look at the 'ORGANIZATION_TYPE' column, which is actually
#a cardinal column.
print(new_df[['ORGANIZATION_TYPE']].head())
'''
ORGANIZATION_TYPE
0 Business Entity Type 3
1 School
2 Government
3 Business Entity Type 3
4 Rare
'''





#Use rare_analyser() again to see the effect of the 'Rare' category on
#the TARGET variable.
rare_analyser(new_df, "TARGET", cat_car_cols)
'''
NAME_CONTRACT_TYPE total number of categories : 2
COUNT RATIO TARGET_MEAN
Cash loans 278232 0.904787 0.083459
Revolving loans 29279 0.095213 0.054783


CODE_GENDER total number of categories : 3
COUNT RATIO TARGET_MEAN
F 202448 0.658344 0.069993
M 105059 0.341643 0.101419
Rare 4 0.000013 0.000000


FLAG_OWN_CAR total number of categories : 2
COUNT RATIO TARGET_MEAN
N 202924 0.659892 0.085002
Y 104587 0.340108 0.072437


ORGANIZATION_TYPE total number of categories : 18
COUNT RATIO TARGET_MEAN
Business Entity Type 1 5984 0.019459 0.081384
Business Entity Type 2 10553 0.034317 0.085284
Business Entity Type 3 67992 0.221104 0.092996
Construction 6721 0.021856 0.116798
Government 10404 0.033833 0.069781
Industry: type 3 3278 0.010660 0.106162
Industry: type 9 3368 0.010952 0.066805
Kindergarten 6880 0.022373 0.070349
Medicine 11193 0.036399 0.065845
Other 16683 0.054252 0.076425
Rare 41808 0.135956 0.076182
School 8893 0.028919 0.059148
Security 3247 0.010559 0.099784
Self-employed 38412 0.124913 0.101739
Trade: type 3 3492 0.011356 0.103379
Trade: type 7 7831 0.025466 0.094496
Transport: type 4 5398 0.017554 0.092812
XNA 55374 0.180072 0.053996
...
Output is truncated.
'''
#As you can see above, every label that was pooled into 'Rare' had a
#ratio below our 0.01 threshold, as intended.






#Let's see which categories were moved into the 'Rare' category
#for "OCCUPATION_TYPE". First, look at the dataset before
#rare encoding:
print(dff["OCCUPATION_TYPE"].value_counts())
'''
Laborers 55186
Sales staff 32102
Core staff 27570
Managers 21371
Drivers 18603
High skill tech staff 11380
Accountants 9813
Medicine staff 8537
Security staff 6721
Cooking staff 5946
Cleaning staff 4653
Private service staff 2652
Low-skill Laborers 2093
Waiters/barmen staff 1348
Secretaries 1305
Realty agents 751
HR staff 563
IT staff 526
Name: OCCUPATION_TYPE, dtype: int64
'''






#And after rare encoding, we have this (output of rare_analyser()):
'''
...
OCCUPATION_TYPE total number of categories : 12
COUNT RATIO TARGET_MEAN
Accountants 9813 0.031911 0.048303
Cleaning staff 4653 0.015131 0.096067
Cooking staff 5946 0.019336 0.104440
Core staff 27570 0.089655 0.063040
Drivers 18603 0.060495 0.113261
High skill tech staff 11380 0.037007 0.061599
Laborers 55186 0.179460 0.105788
Managers 21371 0.069497 0.062140
Medicine staff 8537 0.027762 0.067002
Rare 9238 0.030041 0.098181
Sales staff 32102 0.104393 0.096318
Security staff 6721 0.021856 0.107424
...
'''

Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization. We will cover three main feature scaling methods.

1) Standardization: a technique that rescales values so that the mean becomes zero and the standard deviation becomes one:

z = (x - mean) / std

2) Robust Scaler: Although it is not a very common method, robust scaling is one of the best scaling techniques when we have outliers in our dataset. It scales the data according to the interquartile range (IQR = 75th percentile - 25th percentile):

x_scaled = (x - median) / IQR

3) Min-Max Normalization: a scaling technique in which the values are rescaled to the range 0 to 1 by default. To normalize our data, we use MinMaxScaler from the scikit-learn library and apply it to our dataset. After applying MinMaxScaler, the minimum value will be zero and the maximum value will be one:

x_scaled = (x - min) / (max - min)

Let’s do some coding!

####################################
#Standardization
####################################

df = load()
ss = StandardScaler()
df["Age_standard_scaler"] = ss.fit_transform(df[["Age"]])
print(df[['Age', 'Age_standard_scaler']].head())
'''
Age Age_standard_scaler
0 22.0 -0.530377
1 38.0 0.571831
2 26.0 -0.254825
3 35.0 0.365167
4 35.0 0.365167
'''
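As a sanity check, we can reproduce the first value by hand. Note that StandardScaler uses the population standard deviation (ddof=0):

#(22.0 - mean(Age)) / population std(Age)
age = df["Age"].dropna()
print((22.0 - age.mean()) / age.std(ddof=0))  # -0.530377..., matching row 0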






####################################
#Robust Scaler
####################################
rs = RobustScaler()
df["Age_robust_scaler"] = rs.fit_transform(df[["Age"]])
print(df[['Age','Age_robust_scaler']].head())
'''
Age Age_robust_scaler
0 22.0 -0.335664
1 38.0 0.559441
2 26.0 -0.111888
3 35.0 0.391608
4 35.0 0.391608
'''
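The same by-hand check works here with the median and IQR of Age (28.0 and 38.0 - 20.125, visible in df.describe() below):

print((22.0 - 28.0) / (38.0 - 20.125))  # -0.335664..., matching row 0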







####################################
#Min-Max Normalization
####################################
mms = MinMaxScaler()
df["Age_min_max_scaler"] = mms.fit_transform(df[["Age"]])
print(df[['Age','Age_min_max_scaler']].head())
'''
Age Age_min_max_scaler
0 22.0 0.271174
1 38.0 0.472229
2 26.0 0.321438
3 35.0 0.434531
4 35.0 0.434531
'''
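And once more with the min and max of Age (0.42 and 80.0):

print((22.0 - 0.42) / (80.0 - 0.42))  # 0.271174..., matching row 0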






#-------------------------------------------------------------
#-------------------------------------------------------------






#Look at Age_standard_scaler, Age_robust_scaler and Age_min_max_scaler below.
#The mean of the standard-scaled column is almost 0,
#the median (50%) of the robust-scaled column is 0,
#and the min and max of the min-max-scaled column are 0 and 1, respectively.
print(df.describe().T)
'''
count mean std min 25% 50% 75% max
PassengerId 891.0 4.460000e+02 257.353842 1.000000 223.500000 446.000000 668.500000 891.000000
Survived 891.0 3.838384e-01 0.486592 0.000000 0.000000 0.000000 1.000000 1.000000
Pclass 891.0 2.308642e+00 0.836071 1.000000 2.000000 3.000000 3.000000 3.000000
Age 714.0 2.969912e+01 14.526497 0.420000 20.125000 28.000000 38.000000 80.000000
SibSp 891.0 5.230079e-01 1.102743 0.000000 0.000000 0.000000 1.000000 8.000000
Parch 891.0 3.815937e-01 0.806057 0.000000 0.000000 0.000000 0.000000 6.000000
Fare 891.0 3.220421e+01 49.693429 0.000000 7.910400 14.454200 31.000000 512.329200
Age_standard_scaler 714.0 2.388379e-16 1.000701 -2.016979 -0.659542 -0.117049 0.571831 3.465126
Age_robust_scaler 714.0 9.505553e-02 0.812671 -1.542937 -0.440559 0.000000 0.559441 2.909091
Age_min_max_scaler 714.0 3.679206e-01 0.182540 0.000000 0.247612 0.346569 0.472229 1.000000
'''

Data Binning

Data binning is a pre-processing technique for summarizing and analyzing data that groups continuous values into discrete bins or categories. In other words, it converts numerical data into categorical data. It offers several benefits, such as simplifying data analysis and mitigating the impact of outliers. Let’s see an example:

#Do equal-frequency binning with 5 bins.

df["Age_qcut"] = pd.qcut(df['Age'], 5)
print(df[['Age','Age_qcut']].head(10))
'''
Age Age_qcut
0 22.0 (19.0, 25.0]
1 38.0 (31.8, 41.0]
2 26.0 (25.0, 31.8]
3 35.0 (31.8, 41.0]
4 35.0 (31.8, 41.0]
5 NaN NaN
6 54.0 (41.0, 80.0]
7 2.0 (0.419, 19.0]
8 27.0 (25.0, 31.8]
9 14.0 (0.419, 19.0]
'''
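qcut creates equal-frequency bins, so each bin holds roughly the same number of observations. If you want equal-width bins instead, pd.cut is the usual tool; a quick sketch:

#Equal-width binning: 5 bins of equal width over the Age range.
df["Age_cut"] = pd.cut(df['Age'], 5)
print(df[['Age','Age_cut']].head())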

Thanks for reading…
