Feature Encoding

Deniz Gunay
19 min read · Aug 17, 2023

Although some machine learning models can deal with categorical (non-numerical) values, most can only work with numerical values. For example, the k-Nearest Neighbors algorithm calculates the Euclidean distance between two observations, which requires numerical input. So the input should be numerical before we feed data to such an algorithm. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones. This process is called feature encoding.

There are two different types of categorical variables:

  • Ordinal Data: Data that comprises a finite set of discrete values with an order or level of preference. Example: [Low, Medium, High], [Positive, Negative], [True, False]
  • Nominal Data: Data that comprises a finite set of discrete values with no relationship between them. Example: [“India”, “America”, “England”], [“Lion”, “Monkey”, “Zebra”]

For ordinal data, after encoding the data and training the model, we may need to transform the encoded values back to their original form so the predictions can be interpreted properly. For nominal data this is not required, since there is no ordering to preserve; we just need the information.
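For example, here is a minimal sketch (on a made-up Series) of encoding an ordinal variable with an explicit order and then mapping the encoded values back to their original labels:

import pandas as pd

s = pd.Series(["Low", "High", "Medium", "Low"])
order = {"Low": 0, "Medium": 1, "High": 2}                # explicit ordinal ranking
encoded = s.map(order)                                    # 0, 2, 1, 0
decoded = encoded.map({v: k for k, v in order.items()})   # back to the original labels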

NOTE: While coding, you may come across categorical variables that look numerical. Although they are categorical, we don’t apply encoding to these variables since they are already in numerical form.

For example, the ‘Sex’ variable below is a categorical variable even though it looks numerical. Therefore, we don’t need to apply any encoding.

Male:0 , Female: 1

Label Encoding

Label encoding is very simple: it converts each non-numerical value in a column to a number. We use label encoding for binary data and ordinal data. For example, a binary variable like [Male, Female] becomes [1, 0], and an ordinal variable like [Low, Medium, High] becomes [0, 1, 2].

Let’s do some coding to understand better!

from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)





def load():
    data = pd.read_csv("titanic.csv")
    return data

Encoding actually means changing: when you encode something, you change its representation. For example, there are two sexes in the data, male and female. You can write 1 and 0 instead of “male” and “female”. We do this because, most of the time, working with numbers is easier for machine learning algorithms.

df = load()
#Sex is a binary variable since it consists of Male and Female.
print(df["Sex"].head())
'''
0 male
1 female
2 female
3 female
4 male
Name: Sex, dtype: object
'''





#Now, encode Sex column by using label encoder.
#For this, we will use the LabelEncoder() class and the fit_transform() method.
le = LabelEncoder()
print(le.fit_transform(df["Sex"])[0:5]) # [1 0 0 0 1] (1: male, 0: female)





#If you don't know what those 0s and 1s represent, you can recover their
#meanings with the inverse_transform() method.
print(le.inverse_transform([0, 1])) # ['female' 'male']

NOTE: Encoding is done alphabetically. The word “female” is represented by 0 because it comes before “male” alphabetically.
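You can confirm this ordering through the encoder's classes_ attribute, which stores the sorted labels:

print(le.classes_)  # ['female' 'male'] -> index 0 is 'female', index 1 is 'male'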

What is the difference between dataframe[column].nunique() and len(dataframe[column].unique())?

Let’s say the ‘Embarked’ column consists of ‘S’, ‘Q’ and ‘C’, but also contains some NaN values. In this case, the unique() method will treat NaN as a category. For example,

print(df['Embarked'].unique()) # ['S' 'C' 'Q' nan]


#Therefore, len(df['Embarked'].unique()) will be 4
print(len(df['Embarked'].unique())) # 4

On the other hand, nunique() will not treat NaN as a category. It only counts ‘S’, ‘C’ and ‘Q’, so it returns 3.

print(df['Embarked'].nunique())    # 3
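If you want to see how many values are actually missing, you can count them directly:

print(df['Embarked'].isnull().sum())  # 2 (the classic Titanic data has two missing ports)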

A generalized label encoder function

As you know, we can apply label encoding to binary or ordinal variables. Therefore, we can write a function for that purpose.

def label_encoder(dataframe, binary_or_ordinal_col):
    labelencoder = LabelEncoder()
    dataframe[binary_or_ordinal_col] = labelencoder.fit_transform(dataframe[binary_or_ordinal_col])
    return dataframe




#Let's use this function on our Titanic dataset. In the Titanic dataset,
#we don't have any ordinal columns.
print(df.head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
'''





#But we have a binary non-numerical column, which is 'Sex'.
binary_cols = [col for col in df.columns if df[col].dtype not in ['int64', 'float64']
               and df[col].nunique() == 2]

print(binary_cols) # ['Sex']





#Apply label encoding to the binary non-numerical columns.
for col in binary_cols:
    label_encoder(df, col)





#Let's see these binary columns after encoding.
print(df[binary_cols].head())
'''
Sex
0 1
1 0
2 0
3 0
4 1
'''

Now, let’s try Label Encoder on a different dataset.

def load_application_train():
    data = pd.read_csv("application_train.csv")
    return data




#Now, let's import application_train data.
dff = load_application_train()



#Since we don't have any ordinal variables, we will deal with binary non-
#numerical variables only.
binary_cols = [col for col in dff.columns if dff[col].dtype not in ['int64', 'float64']
               and dff[col].nunique() == 2]




#These are binary columns.
print(binary_cols)
# ['NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'EMERGENCYSTATE_MODE']
print(dff[binary_cols].head())
'''
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE
0 Cash loans N Y No
1 Cash loans N N No
2 Revolving loans Y Y NaN
3 Cash loans N Y NaN
4 Cash loans N Y NaN
'''





#Use label encoding for these binary columns.
for col in binary_cols:
    label_encoder(dff, col)

NOTE: Unless you handle them separately, NaN values are also encoded as their own category.

#For example,
#EMERGENCYSTATE_MODE is a binary column but it now contains 0s, 1s and 2s.
#The 2s represent NaN values for this column.
print(dff[binary_cols].head())
'''
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE
0 0 0 1 0
1 0 0 0 0
2 1 1 1 2
3 0 0 1 2
4 0 0 1 2
'''
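If you would rather keep the NaN values as NaN, one option (a minimal sketch, not the only way) is to fit the encoder on the non-missing rows only:

tmp = load_application_train()
mask = tmp["EMERGENCYSTATE_MODE"].notna()
tmp.loc[mask, "EMERGENCYSTATE_MODE"] = LabelEncoder().fit_transform(tmp.loc[mask, "EMERGENCYSTATE_MODE"])
print(tmp["EMERGENCYSTATE_MODE"].head())  # 0, 0, NaN, NaN, NaN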

One Hot Encoding

When you have nominal data, one-hot encoding maps each category to a binary (0 or 1) indicator. In other words, each category is transformed into its own binary numerical variable. These new variables are called dummy variables. For example, take a nominal Team column whose categories are GS, FB and BJK.

If we apply one-hot encoding to it, we get three new columns, GS, FB and BJK, where each column holds 1 for the rows that belong to that category and 0 otherwise.

Additionally, in order to prevent collinearity issues, you can drop the first of these columns, which is the GS column.

However, you don’t always need to drop the first column. Most of the time, collinearity does not cause a serious problem.
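The redundancy is easy to see: in rows without missing values, the dummy columns always sum to one, so any one of them is fully determined by the others. A quick check with the Titanic ‘Embarked’ column:

emb_dummies = pd.get_dummies(df["Embarked"].dropna())
#Every row has exactly one 1 across the dummy columns,
#so any single column is implied by the rest.
print((emb_dummies.sum(axis=1) == 1).all())  # True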

NOTE: If a nominal variable contains some NaN values, you can create a separate NaN column by setting the dummy_na parameter to True during one-hot encoding. You will see how to do it just below.

One-hot encoding is useful for data whose categories have no relationship to each other. Machine learning algorithms treat the order of numbers as meaningful: they will read a higher number as better or more important than a lower one. While this is helpful for ordinal situations, for nominal data it can lead to prediction issues and poor performance. Now, let’s do some coding!

#Let's look at the titanic's Embarked column. 
#It has 3 categories: 'S' , 'C' and 'Q'
df = load()
print(df["Embarked"].value_counts())
'''
S 644
C 168
Q 77
Name: Embarked, dtype: int64
'''





#Now we will convert 'Embarked' column to a numerical column. Embarked column
#has 3 categories. Therefore, each category will be a column. We will not drop
#the first column, and won't add any NaN column. Therefore, drop_first=False
#and dummy_na=False
print(pd.get_dummies(df, columns=["Embarked"], drop_first=False, dummy_na=False).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 1 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 0 1
'''






#If we set drop_first=True, the Embarked_C column will be dropped. Let's see.
print(pd.get_dummies(df, columns=["Embarked"], drop_first=True, dummy_na=False).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 1
'''






#If we set dummy_na=True, there will be an extra column for NaN values.
print(pd.get_dummies(df, columns=["Embarked"], drop_first=True, dummy_na=True).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_Q Embarked_S Embarked_nan
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 1 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 0 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 1 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 1 0
'''

NOTE: By default, dummy_na=False and drop_first=False in the get_dummies() function.

print(pd.get_dummies(df, columns=["Embarked"]).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 1 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 0 1
'''






#Let's set drop_first=True.
#The Embarked_C column will be dropped.
print(pd.get_dummies(df, columns=["Embarked"], drop_first=True).head())
'''
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 0 1
'''

NOTE: You can also apply one-hot encoding to nominal numerical variables. For example, say you have a Pclass variable that consists of 1s, 2s and 3s. If these numbers are just nominal category labels, then you can apply one-hot encoding here too.
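A quick sketch of that case, using the Pclass values shown in the head above (3, 1, 3, 1, 3):

print(pd.get_dummies(df["Pclass"], prefix="Pclass").head())
'''
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
'''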

A generalized one hot encoder function

Now let’s write a function that does what we have done so far. But before that, we need to grab the columns to which one-hot encoding can be applied.

def one_hot_encoder(dataframe, categorical_cols, drop_first=True, dummy_na=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first, dummy_na=dummy_na)
    return dataframe





#Let's find the columns to apply one hot encoding to.
#We will ignore cardinal columns and binary columns by
#setting thresholds. One hot encoding is not applied to
#cardinal columns, and for binary columns we typically use
#label encoding instead of one hot encoding. So we will
#apply one hot encoding to categorical columns that are
#neither binary nor cardinal.
ohe_cols = [col for col in df.columns if 10 >= df[col].nunique() > 2]
print(ohe_cols) # ['Pclass', 'SibSp', 'Parch', 'Embarked']






#After applying one hot encoding, let's see the new columns.
print(one_hot_encoder(df, ohe_cols).head())
'''
PassengerId Survived Name Sex Age Ticket Fare Cabin Pclass_2 Pclass_3 SibSp_1 SibSp_2 SibSp_3 SibSp_4 SibSp_5 SibSp_8 Parch_1 Parch_2 Parch_3 Parch_4 Parch_5 Parch_6 Embarked_Q Embarked_S
0 1 0 Braund, Mr. Owen Harris male 22.0 A/5 21171 7.2500 NaN 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 3 1 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
4 5 0 Allen, Mr. William Henry male 35.0 373450 8.0500 NaN 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
'''

Rare Encoding

Rare encoding is the process of grouping labels that appear in only a small number of observations into a new category, “Rare”. We need to specify a frequency threshold below which a category is considered rare (a toy sketch of the idea follows the steps below).

There are 3 steps for rare encoding:

1) Analyze the frequency of the categorical variables.

2) Analyze the relationship between the rare categories and the dependent variable.

3) Write a rare encoder function.
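First, the promised toy sketch of the core idea on a made-up Series, with a 5% threshold:

import pandas as pd

s = pd.Series(["A"] * 97 + ["B"] * 2 + ["C"])
ratios = s.value_counts() / len(s)           # A: 0.97, B: 0.02, C: 0.01
rare_labels = ratios[ratios < 0.05].index    # ['B', 'C']
s = s.where(~s.isin(rare_labels), "Rare")    # keep frequent labels, pool the rest
print(s.value_counts())                      # A: 97, Rare: 3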

#Let's import application_train data. This data set indicates the 
#characteristics of individuals and whether they have paid their bank debts.
dff = load_application_train()




#We need to work on categorical columns. So we will use grab_col_names()
#function.
def grab_col_names(dataframe, cat_th=10, car_th=20):
    '''
    Returns the categorical column list, numerical column list and categorical-but-cardinal column list.

    Parameters
    ----------
    dataframe: dataframe
        main dataframe
    cat_th: int, float
        threshold on the number of unique values for a column that looks numerical but is actually categorical
    car_th: int, float
        threshold on the number of unique values for a column that looks categorical but is actually cardinal

    Returns
    -------
    cat_cols: list
        list of categorical columns
    num_cols: list
        list of numerical columns
    cat_but_car: list
        list of cardinal columns

    Notes
    ------
    -> cat_cols + num_cols + cat_but_car = the number of columns of dataframe
    -> cat_cols includes num_but_cat
    -> Categorical variables with a numerical appearance are also included in the categorical variables.

    Examples
    ------
    import seaborn as sns
    df = sns.load_dataset("iris")
    print(grab_col_names(df))
    '''

    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O" and col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")

    return cat_cols, num_cols, cat_but_car





#Let's use grab_col_names() function.
cat_cols, num_cols, cat_but_car = grab_col_names(dff)
'''
Observations: 307511
Variables: 122
cat_cols: 54
num_cols: 67
cat_but_car: 1
num_but_cat: 39
'''

We will apply rare encoding to the non-numerical columns. In other words, both the categorical columns and the cardinal columns are considered.

cat_car_cols = cat_cols + cat_but_car





#This function shows us the count and ratio of every category.
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show()





for col in cat_car_cols:
    cat_summary(dff, col)
'''
NAME_CONTRACT_TYPE Ratio
Cash loans 278232 90.478715
Revolving loans 29279 9.521285
##########################################
CODE_GENDER Ratio
F 202448 65.834393
M 105059 34.164306
XNA 4 0.001301
##########################################
FLAG_OWN_CAR Ratio
N 202924 65.989184
Y 104587 34.010816
##########################################
ORGANIZATION_TYPE Ratio
Business Entity Type 3 67992 22.110429
XNA 55374 18.007161
Self-employed 38412 12.491260
Other 16683 5.425172
Medicine 11193 3.639870
Business Entity Type 2 10553 3.431747
Government 10404 3.383294
School 8893 2.891929
Trade: type 7 7831 2.546576
Kindergarten 6880 2.237318
Construction 6721 2.185613
Business Entity Type 1 5984 1.945947
Transport: type 4 5398 1.755384
Trade: type 3 3492 1.135569
Industry: type 9 3368 1.095245
Industry: type 3 3278 1.065978
Security 3247 1.055897
Housing 2958 0.961917
Industry: type 11 2704 0.879318
Military 2634 0.856555
Bank 2507 0.815255
Agriculture 2454 0.798020
Police 2341 0.761274
Transport: type 2 2204 0.716722
Postal 2157 0.701438
Security Ministries 1974 0.641928
Trade: type 2 1900 0.617864
Restaurant 1811 0.588922
Services 1575 0.512177
University 1327 0.431529
Industry: type 7 1307 0.425025
Transport: type 3 1187 0.386002
Industry: type 1 1039 0.337874
Hotel 966 0.314135
Electricity 950 0.308932
Industry: type 4 877 0.285193
Trade: type 6 631 0.205196
Industry: type 5 599 0.194790
Insurance 597 0.194139
Telecom 577 0.187636
Emergency 560 0.182107
Industry: type 2 458 0.148938
Advertising 429 0.139507
Realtor 396 0.128776
Culture 379 0.123248
Industry: type 12 369 0.119996
Trade: type 1 348 0.113167
Mobile 317 0.103086
Legal Services 305 0.099183
Cleaning 260 0.084550
Transport: type 1 201 0.065364
Industry: type 6 112 0.036421
Industry: type 10 109 0.035446
Religion 85 0.027641
Industry: type 13 67 0.021788
Trade: type 4 64 0.020812
Trade: type 5 49 0.015934
Industry: type 8 24 0.007805
...
...
Output is truncated.
'''

Now, let’s look at a specific categorical column:

#Look at "NAME_INCOME_TYPE" column
print(dff["NAME_INCOME_TYPE"].value_counts())
'''
Working 158774
Commercial associate 71617
Pensioner 55362
State servant 21703
Unemployed 22
Student 18
Businessman 10
Maternity leave 5
Name: NAME_INCOME_TYPE, dtype: int64
'''
#As can be seen, the "Unemployed", "Student", "Businessman" and "Maternity leave"
#categories have very low counts. So we could create a new category called
#RARE which consists of these categories.






#BUT, let's look at the relationship between target column("TARGET") and
#"NAME_INCOME_TYPE". Remember, if TARGET=0 then the debt has been paid,
#if TARGET=1 then the debt has not been paid.
print(dff.groupby("NAME_INCOME_TYPE").agg({'TARGET':'mean'}))
'''
TARGET
NAME_INCOME_TYPE
Businessman 0.000000
Commercial associate 0.074843
Maternity leave 0.400000
Pensioner 0.053864
State servant 0.057550
Student 0.000000
Unemployed 0.363636
Working 0.095885
'''
#As you can see, all businessmen have paid their debts, but 36% of unemployed
#people have not paid theirs. So if we create a RARE category and put both
#the Unemployed and Businessman categories into it, we may do something wrong,
#because there is a huge difference between businessmen and unemployed people
#with regard to their ability to pay the debt.
#In other words, we should be careful before applying rare encoding.

A generalized rare encoder function

Let’s write a function that does what we have done so far!

def rare_analyser(dataframe, target, cat_cols):
    for col in cat_cols:
        print(col, " total number of categories : ", len(dataframe[col].value_counts()))
        print(pd.DataFrame({"COUNT": dataframe[col].value_counts(),
                            "RATIO": dataframe[col].value_counts() / len(dataframe),
                            "TARGET_MEAN": dataframe.groupby(col)[target].mean()}), end="\n\n\n")




rare_analyser(dff, "TARGET", cat_car_cols)
'''
NAME_CONTRACT_TYPE total number of categories : 2
COUNT RATIO TARGET_MEAN
Cash loans 278232 0.904787 0.083459
Revolving loans 29279 0.095213 0.054783


CODE_GENDER total number of categories : 3
COUNT RATIO TARGET_MEAN
F 202448 0.658344 0.069993
M 105059 0.341643 0.101419
XNA 4 0.000013 0.000000


FLAG_OWN_CAR total number of categories : 2
COUNT RATIO TARGET_MEAN
N 202924 0.659892 0.085002
Y 104587 0.340108 0.072437


ORGANIZATION_TYPE total number of categories : 58
COUNT RATIO TARGET_MEAN
Advertising 429 0.001395 0.081585
Agriculture 2454 0.007980 0.104727
Bank 2507 0.008153 0.051855
Business Entity Type 1 5984 0.019459 0.081384
Business Entity Type 2 10553 0.034317 0.085284
Business Entity Type 3 67992 0.221104 0.092996
Cleaning 260 0.000845 0.111538
Construction 6721 0.021856 0.116798
Culture 379 0.001232 0.055409
Electricity 950 0.003089 0.066316
Emergency 560 0.001821 0.071429
Government 10404 0.033833 0.069781
Hotel 966 0.003141 0.064182
Housing 2958 0.009619 0.079446
Industry: type 1 1039 0.003379 0.110683
Industry: type 10 109 0.000354 0.064220
Industry: type 11 2704 0.008793 0.086538
Industry: type 12 369 0.001200 0.037940
Industry: type 13 67 0.000218 0.134328
Industry: type 2 458 0.001489 0.072052
Industry: type 3 3278 0.010660 0.106162
Industry: type 4 877 0.002852 0.101482
Industry: type 5 599 0.001948 0.068447
Industry: type 6 112 0.000364 0.071429
Industry: type 7 1307 0.004250 0.080337
Industry: type 8 24 0.000078 0.125000
Industry: type 9 3368 0.010952 0.066805
Insurance 597 0.001941 0.056951
Kindergarten 6880 0.022373 0.070349
Legal Services 305 0.000992 0.078689
Medicine 11193 0.036399 0.065845
Military 2634 0.008566 0.051253
Mobile 317 0.001031 0.091483
Other 16683 0.054252 0.076425
Police 2341 0.007613 0.049979
Postal 2157 0.007014 0.084376
Realtor 396 0.001288 0.106061
Religion 85 0.000276 0.058824
Restaurant 1811 0.005889 0.117062
School 8893 0.028919 0.059148
Security 3247 0.010559 0.099784
Security Ministries 1974 0.006419 0.048632
Self-employed 38412 0.124913 0.101739
Services 1575 0.005122 0.066032
Telecom 577 0.001876 0.076256
Trade: type 1 348 0.001132 0.089080
Trade: type 2 1900 0.006179 0.070000
Trade: type 3 3492 0.011356 0.103379
Trade: type 4 64 0.000208 0.031250
Trade: type 5 49 0.000159 0.061224
Trade: type 6 631 0.002052 0.045959
Trade: type 7 7831 0.025466 0.094496
Transport: type 1 201 0.000654 0.044776
Transport: type 2 2204 0.007167 0.078040
Transport: type 3 1187 0.003860 0.157540
Transport: type 4 5398 0.017554 0.092812
University 1327 0.004315 0.048983
XNA 55374 0.180072 0.053996
...
Output is truncated.
'''






#Now we can finally write the rare encoder function.
#We should determine a rarity threshold (rare_perc);
#if a category shows up less often than this threshold,
#we add it to the new RARE category.
def rare_encoder(dataframe, rare_perc):
    temp_df = dataframe.copy()

    rare_columns = [col for col in temp_df.columns if temp_df[col].dtypes == 'O'
                    and (temp_df[col].value_counts() / len(temp_df) < rare_perc).any(axis=None)]

    for var in rare_columns:
        tmp = temp_df[var].value_counts() / len(temp_df)
        rare_labels = tmp[tmp < rare_perc].index
        temp_df[var] = np.where(temp_df[var].isin(rare_labels), 'Rare', temp_df[var])

    return temp_df




new_df = rare_encoder(dff, 0.01)
#Now some of the categories have been replaced with 'Rare'.
#For example, let's look at the 'ORGANIZATION_TYPE' column, which is actually
#a cardinal column.
print(new_df[['ORGANIZATION_TYPE']].head())
'''
ORGANIZATION_TYPE
0 Business Entity Type 3
1 School
2 Government
3 Business Entity Type 3
4 Rare
'''





#Use rare_analyser() again to see the effect of the 'Rare' category on
#the TARGET variable.
rare_analyser(new_df, "TARGET", cat_car_cols)
'''
NAME_CONTRACT_TYPE total number of categories : 2
COUNT RATIO TARGET_MEAN
Cash loans 278232 0.904787 0.083459
Revolving loans 29279 0.095213 0.054783


CODE_GENDER total number of categories : 3
COUNT RATIO TARGET_MEAN
F 202448 0.658344 0.069993
M 105059 0.341643 0.101419
Rare 4 0.000013 0.000000


FLAG_OWN_CAR total number of categories : 2
COUNT RATIO TARGET_MEAN
N 202924 0.659892 0.085002
Y 104587 0.340108 0.072437


ORGANIZATION_TYPE total number of categories : 18
COUNT RATIO TARGET_MEAN
Business Entity Type 1 5984 0.019459 0.081384
Business Entity Type 2 10553 0.034317 0.085284
Business Entity Type 3 67992 0.221104 0.092996
Construction 6721 0.021856 0.116798
Government 10404 0.033833 0.069781
Industry: type 3 3278 0.010660 0.106162
Industry: type 9 3368 0.010952 0.066805
Kindergarten 6880 0.022373 0.070349
Medicine 11193 0.036399 0.065845
Other 16683 0.054252 0.076425
Rare 41808 0.135956 0.076182
School 8893 0.028919 0.059148
Security 3247 0.010559 0.099784
Self-employed 38412 0.124913 0.101739
Trade: type 3 3492 0.011356 0.103379
Trade: type 7 7831 0.025466 0.094496
Transport: type 4 5398 0.017554 0.092812
XNA 55374 0.180072 0.053996
...
Output is truncated.
'''
#As you can see above, every label that was pooled into 'Rare' had a
#ratio below our 0.01 threshold, as intended.






#Let's see which categories were moved into the 'Rare' category
#for "OCCUPATION_TYPE". First, look at the dataset before
#rare encoding:
print(dff["OCCUPATION_TYPE"].value_counts())
'''
Laborers 55186
Sales staff 32102
Core staff 27570
Managers 21371
Drivers 18603
High skill tech staff 11380
Accountants 9813
Medicine staff 8537
Security staff 6721
Cooking staff 5946
Cleaning staff 4653
Private service staff 2652
Low-skill Laborers 2093
Waiters/barmen staff 1348
Secretaries 1305
Realty agents 751
HR staff 563
IT staff 526
Name: OCCUPATION_TYPE, dtype: int64
'''






#And after rare encoding, we have this (output of rare_analyser()):
'''
...
OCCUPATION_TYPE total number of categories : 12
COUNT RATIO TARGET_MEAN
Accountants 9813 0.031911 0.048303
Cleaning staff 4653 0.015131 0.096067
Cooking staff 5946 0.019336 0.104440
Core staff 27570 0.089655 0.063040
Drivers 18603 0.060495 0.113261
High skill tech staff 11380 0.037007 0.061599
Laborers 55186 0.179460 0.105788
Managers 21371 0.069497 0.062140
Medicine staff 8537 0.027762 0.067002
Rare 9238 0.030041 0.098181
Sales staff 32102 0.104393 0.096318
Security staff 6721 0.021856 0.107424
...
'''

Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization. We will cover three main feature scaling methods.

1) Standardization: a technique that rescales values so that the mean becomes zero and the standard deviation becomes one:

z = (x - mean) / std

2) Robust Scaler: Although it is not a very common method, robust scaling is one of the best scaling techniques when we have outliers in our dataset. It scales the data according to the interquartile range (IQR = 75th percentile - 25th percentile):

x_scaled = (x - median) / IQR

3) Min-Max Normalization: a scaling technique in which the values are rescaled to the range 0 to 1 by default. To normalize our data, we use MinMaxScaler from the scikit-learn library and apply it to our dataset. After applying MinMaxScaler, the minimum value will be zero and the maximum value will be one:

x_scaled = (x - min) / (max - min)

Let’s do some coding!

####################################
#Standardization
####################################

df = load()
ss = StandardScaler()
df["Age_standard_scaler"] = ss.fit_transform(df[["Age"]])
print(df[['Age', 'Age_standard_scaler']].head())
'''
Age Age_standard_scaler
0 22.0 -0.530377
1 38.0 0.571831
2 26.0 -0.254825
3 35.0 0.365167
4 35.0 0.365167
'''
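As a sanity check, we can reproduce the first value by hand. Note that StandardScaler uses the population standard deviation (ddof=0):

#(22.0 - mean(Age)) / population std(Age)
age = df["Age"].dropna()
print((22.0 - age.mean()) / age.std(ddof=0))  # -0.530377..., matching row 0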






####################################
#Robust Scaler
####################################
rs = RobustScaler()
df["Age_robust_scaler"] = rs.fit_transform(df[["Age"]])
print(df[['Age','Age_robust_scaler']].head())
'''
Age Age_robust_scaler
0 22.0 -0.335664
1 38.0 0.559441
2 26.0 -0.111888
3 35.0 0.391608
4 35.0 0.391608
'''
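The same by-hand check works here with the median and IQR of Age (28.0 and 38.0 - 20.125, visible in df.describe() below):

print((22.0 - 28.0) / (38.0 - 20.125))  # -0.335664..., matching row 0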







####################################
#Min-Max Normalization
####################################
mms = MinMaxScaler()
df["Age_min_max_scaler"] = mms.fit_transform(df[["Age"]])
print(df[['Age','Age_min_max_scaler']].head())
'''
Age Age_min_max_scaler
0 22.0 0.271174
1 38.0 0.472229
2 26.0 0.321438
3 35.0 0.434531
4 35.0 0.434531
'''
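And once more with the min and max of Age (0.42 and 80.0):

print((22.0 - 0.42) / (80.0 - 0.42))  # 0.271174..., matching row 0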






#-------------------------------------------------------------
#-------------------------------------------------------------






#Look at Age_standard_scaler, Age_robust_scaler and Age_min_max_scaler below.
#The mean of the standard-scaled column is almost 0,
#the median (50%) of the robust-scaled column is 0,
#and the min and max of the min-max-scaled column are 0 and 1, respectively.
print(df.describe().T)
'''
count mean std min 25% 50% 75% max
PassengerId 891.0 4.460000e+02 257.353842 1.000000 223.500000 446.000000 668.500000 891.000000
Survived 891.0 3.838384e-01 0.486592 0.000000 0.000000 0.000000 1.000000 1.000000
Pclass 891.0 2.308642e+00 0.836071 1.000000 2.000000 3.000000 3.000000 3.000000
Age 714.0 2.969912e+01 14.526497 0.420000 20.125000 28.000000 38.000000 80.000000
SibSp 891.0 5.230079e-01 1.102743 0.000000 0.000000 0.000000 1.000000 8.000000
Parch 891.0 3.815937e-01 0.806057 0.000000 0.000000 0.000000 0.000000 6.000000
Fare 891.0 3.220421e+01 49.693429 0.000000 7.910400 14.454200 31.000000 512.329200
Age_standard_scaler 714.0 2.388379e-16 1.000701 -2.016979 -0.659542 -0.117049 0.571831 3.465126
Age_robust_scaler 714.0 9.505553e-02 0.812671 -1.542937 -0.440559 0.000000 0.559441 2.909091
Age_min_max_scaler 714.0 3.679206e-01 0.182540 0.000000 0.247612 0.346569 0.472229 1.000000
'''

Data Binning

Data binning is a pre-processing technique for summarizing and analyzing data that groups continuous values into discrete bins or categories. In other words, it converts numerical data into categorical data. It offers several benefits, such as simplifying data analysis and mitigating the impact of outliers. Let’s see an example:

#Do equal-frequency binning with 5 bins.

df["Age_qcut"] = pd.qcut(df['Age'], 5)
print(df[['Age','Age_qcut']].head(10))
'''
Age Age_qcut
0 22.0 (19.0, 25.0]
1 38.0 (31.8, 41.0]
2 26.0 (25.0, 31.8]
3 35.0 (31.8, 41.0]
4 35.0 (31.8, 41.0]
5 NaN NaN
6 54.0 (41.0, 80.0]
7 2.0 (0.419, 19.0]
8 27.0 (25.0, 31.8]
9 14.0 (0.419, 19.0]
'''
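qcut creates equal-frequency bins, so each bin holds roughly the same number of observations. If you want equal-width bins instead, pd.cut is the usual tool; a quick sketch:

#Equal-width binning: 5 bins of equal width over the Age range.
df["Age_cut"] = pd.cut(df['Age'], 5)
print(df[['Age','Age_cut']].head())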

Thanks for reading…
