Handling Categorical Features using Encoding Techniques in Python

sawan saxena · Published in Analytics Vidhya · Sep 6, 2020 · 6 min read

In this post we are going to discuss categorical features in machine learning and how to handle them using two of the most effective encoding methods.

Categorical Features

In machine learning, features can be broadly classified into two main categories:

  • Numerical features (age, price, area etc.)
  • Categorical features (gender, marital-status, occupation etc.)

Features that take their values from a fixed set of categories are known as categorical features. Categorical features can be classified into two major types:

  1. Nominal
  2. Ordinal

Nominal features are those having two or more categories, with no specific order. For example, if Gender has two values, male and female, it can be considered a nominal feature.

Ordinal features, on the other hand, have categories in a particular order. For example, if we have a feature named Level with values high, medium and low, it will be considered an ordinal feature, because the order matters here.

Handling Categorical Features

So the first question that arises is: why do we need to handle categorical features separately? Why don't we simply pass them as inputs to our model just like the numerical features? The answer is that unlike humans, machines, and specifically in this case machine learning models, do not understand text data. We need to convert the text values into meaningful numbers before feeding them into our model.

This process of converting categories into numbers is called encoding. Two of the most effective and widely used encoding methods are:

  1. Label Encoding
  2. One Hot Encoding

Label Encoding

Label encoding is the process of assigning a numeric label to each category in the feature. If N is the number of categories, each category is assigned a unique number from 0 to N-1.

If we have a feature named Colors, with values red, blue, green and yellow, it can be converted to a numeric mapping as follows:

Category : Label
"red" : 0
"blue" : 1
"green" : 2
"yellow" : 3

Note: As we can see here, the labels produced for the categories are not normalized, i.e. not between 0 and 1. Because of this limitation, label encoding should not be used with linear models, where the magnitude of a feature plays an important role. Since tree-based algorithms do not need feature normalization, label encoding can be easily used with models such as:

  • Decision trees
  • Random forest
  • XGBoost
  • LightGBM

We can implement label encoding using scikit-learn’s LabelEncoder class. We will see the implementation in the next section.
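As a quick toy sketch of the idea (using the Colors example above, not the Kaggle dataset used later), this is what LabelEncoder does. Note that scikit-learn assigns labels in alphabetical order of the categories, so the exact numbers differ from the illustrative mapping above.

from sklearn import preprocessing

#toy list of color values
colors = ['red', 'blue', 'green', 'yellow', 'blue']
#fit_transform learns the categories and replaces each value with its integer label
lbl = preprocessing.LabelEncoder()
encoded = lbl.fit_transform(colors)
print(list(lbl.classes_))  #['blue', 'green', 'red', 'yellow']
print(list(encoded))       #[2, 0, 1, 3, 0]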

One Hot Encoding

The limitation of label encoding can be overcome by binarizing the categories, i.e. representing them using only 0s and 1s. Here we represent each category by a vector of size N, where N is the number of categories in the feature. Each vector has a single 1 and all other values are 0, hence the name one-hot encoding.

Suppose we have a column named temperature. It has four values: Freezing, Cold, Warm and Hot. Each category will be represented as follows:

Category        Encoded vector
Freezing 0 0 0 1
Cold 0 0 1 0
Warm 0 1 0 0
Hot 1 0 0 0

As you can see, each category is represented by a vector of length 4, since there are 4 unique categories in the feature. Each vector has a single 1 and all other values are 0.
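Here is a quick toy sketch of the same idea with scikit-learn's OneHotEncoder. Note that the encoder orders categories alphabetically, so the column order differs from the illustrative table above.

import pandas as pd
from sklearn import preprocessing

#toy column with the four temperature categories
temps = pd.DataFrame({'temperature': ['Freezing', 'Cold', 'Warm', 'Hot']})
ohe = preprocessing.OneHotEncoder()
#fit_transform returns a sparse matrix; toarray() gives the dense 0/1 vectors
encoded = ohe.fit_transform(temps[['temperature']]).toarray()
print(ohe.categories_)  #[array(['Cold', 'Freezing', 'Hot', 'Warm'], dtype=object)]
print(encoded)          #one row per value, with a single 1 in each row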

Since one-hot encoding generates normalized features, it can be used with linear models such as:

  • Linear regression
  • Logistic regression

Now that we have a basic understanding of both encoding techniques, let's look at the Python implementation of each for a better understanding.

Implementation in Python

Before applying encoding to the categorical features, it is important to handle NaN values. A simple and effective way is to treat NaN values as a separate category. By doing this, we make sure that we are not losing any important information.
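In pandas this is a one-liner; a tiny sketch with toy data:

import numpy as np
import pandas as pd

#toy column with a missing value
s = pd.Series(['male', 'female', np.nan])
#treat the missing value as its own category
print(s.fillna('NONE').value_counts())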

So the steps that we follow while handling categorical features are:

  1. Fill the NaN values with a new category (such as NONE)
  2. Convert categories to numeric values using label encoding for tree-based models and one-hot encoding for linear models.
  3. Build the model using numeric and encoded features.

We will be using a public dataset named Cat in the Dat on Kaggle. This is a binary classification problem with lots of categorical features.

First, we will create 5 folds for validation using the StratifiedKFold class in scikit-learn. This variant of KFold ensures that each fold has the same ratio of target classes.

import pandas as pd
from sklearn import model_selection

#read training data
df = pd.read_csv('../input/train.csv')
#create column for kfolds and fill it with -1
df['kfold'] = -1
#randomize the rows
df = df.sample(frac=1).reset_index(drop=True)
#fetch the targets
y = df['target'].values
#initiate StratifiedKFold class from model_selection
kf = model_selection.StratifiedKFold(n_splits=5)
#fill the new kfold column with the fold index of each validation split
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f
#save the new csv with kfold column
df.to_csv('../input/train_folds.csv', index=False)
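As an optional sanity check, we can confirm that the stratification worked: the mean of a binary target is the fraction of positive samples, and it should be nearly identical across folds.

#fraction of positive targets in each fold should be roughly equal
print(df.groupby('kfold')['target'].mean())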

Label Encoding

Next, let's define a function to run training and validation on each fold. We will be using LabelEncoder with a random forest for this example.

import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import preprocessing

def run(fold):
    #read training data with folds
    df = pd.read_csv('../input/train_folds.csv')
    #get all relevant features excluding id, target and kfold columns
    features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
    #fill all nan values with NONE and cast everything to string
    for feature in features:
        df.loc[:, feature] = df[feature].fillna('NONE').astype(str)
    #label encode the features
    for feature in features:
        #initiate a LabelEncoder for each feature
        lbl = preprocessing.LabelEncoder()
        #fit the label encoder
        lbl.fit(df[feature])
        #transform data
        df.loc[:, feature] = lbl.transform(df[feature])
    #get training data using folds
    df_train = df[df['kfold'] != fold].reset_index(drop=True)
    #get validation data using folds
    df_valid = df[df['kfold'] == fold].reset_index(drop=True)
    #get training features
    X_train = df_train[features].values
    #get validation features
    X_valid = df_valid[features].values
    #initiate Random forest model
    model = ensemble.RandomForestClassifier(n_jobs=-1)
    #fit the model on train data
    model.fit(X_train, df_train['target'].values)
    #predict the probabilities on validation data
    valid_preds = model.predict_proba(X_valid)[:, 1]
    #get auc-roc score
    auc = metrics.roc_auc_score(df_valid['target'].values, valid_preds)
    #print AUC score for each fold
    print(f'Fold ={fold}, AUC = {auc}')

Finally, let's call the run method for each fold.

if __name__ == '__main__':
    for fold_ in range(5):
        run(fold_)

Executing this code will give output like the following.

Fold =0, AUC = 0.7163772816343564
Fold =1, AUC = 0.7136206487083182
Fold =2, AUC = 0.7171801474337066
Fold =3, AUC = 0.7158938474390842
Fold =4, AUC = 0.7186004462481813

One thing to note here is that we have not done any hyperparameter tuning on the random forest model. You can tweak the parameters to improve the validation score. Another thing to mention about the above code is that we are using the AUC ROC score as the validation metric. This is because the target values are skewed, and metrics such as accuracy would not give us a reliable picture.
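For instance, a minimal sketch of such a tweak inside the run method; the values below are illustrative starting points, not tuned numbers.

#illustrative hyperparameters, not tuned values
model = ensemble.RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    n_jobs=-1
)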

One Hot Encoding

Now let's see the implementation of one-hot encoding with logistic regression.

Below is the modified version of run method for this approach.

import pandas as pd
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing

def run(fold):
    #read training data with folds
    df = pd.read_csv('../input/train_folds.csv')
    #get all relevant features excluding id, target and kfold columns
    features = [feature for feature in df.columns if feature not in ['id', 'target', 'kfold']]
    #fill all nan values with NONE and cast everything to string
    for feature in features:
        df.loc[:, feature] = df[feature].fillna('NONE').astype(str)
    #get training data using folds
    df_train = df[df['kfold'] != fold].reset_index(drop=True)
    #get validation data using folds
    df_valid = df[df['kfold'] == fold].reset_index(drop=True)
    #initiate OneHotEncoder from sklearn
    ohe = preprocessing.OneHotEncoder()
    #fit ohe on training + validation features
    full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
    ohe.fit(full_data[features])
    #transform training data
    X_train = ohe.transform(df_train[features])
    #transform validation data
    X_valid = ohe.transform(df_valid[features])
    #initiate logistic regression
    model = linear_model.LogisticRegression()
    #fit the model on train data
    model.fit(X_train, df_train['target'].values)
    #predict the probabilities on validation data
    valid_preds = model.predict_proba(X_valid)[:, 1]
    #get auc-roc score
    auc = metrics.roc_auc_score(df_valid['target'].values, valid_preds)
    #print AUC score for each fold
    print(f'Fold ={fold}, AUC = {auc}')

The code to loop over all folds remains the same.

if __name__ == '__main__':
    for fold_ in range(5):
        run(fold_)

The output of this code will look like this:

Fold =0, AUC = 0.7872262099199782
Fold =1, AUC = 0.7856877416085041
Fold =2, AUC = 0.7850910855093067
Fold =3, AUC = 0.7842966593706009
Fold =4, AUC = 0.7887711592194284

As we can see here, a simple logistic regression gives us a decent AUC just by applying one-hot encoding to the categorical features.

One difference to note in the implementation of both methods is that LabelEncoder has to be fitted on each categorical feature separately, while OneHotEncoder can be fitted on all the features together.
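If fitting a separate LabelEncoder per column feels clumsy, scikit-learn also provides OrdinalEncoder, which, like OneHotEncoder, accepts all columns at once and produces one integer code per category. A minimal sketch, reusing the df_train, df_valid and features names from the examples above:

import pandas as pd
from sklearn import preprocessing

#OrdinalEncoder works on 2D data, assigning integer codes column by column
ord_enc = preprocessing.OrdinalEncoder()
#fit on training + validation features, mirroring the one-hot example above
full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
ord_enc.fit(full_data)
X_train = ord_enc.transform(df_train[features])
X_valid = ord_enc.transform(df_valid[features])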

Conclusion

In this blog I have discussed what categorical features are in machine learning and why it is important to handle them. We also covered two of the most important methods to encode categorical features into numeric values, along with their implementation.

I hope I have helped you get a better understanding of the topics covered here. Please let me know your feedback in the comments and give it a clap if you liked it. Here is the link to my LinkedIn profile if you wish to connect.

Thanks for reading. :)
