Day 22 of 100DaysofML

Charan Soneji · Published in 100DaysofMLcode · Jul 8, 2020

OneHotEncoder. This is a technique that I have used in a lot of my earlier blogs but never really explained. So what exactly is One Hot Encoding?

OneHotEncoding is basically a technique used to convert your categorical text data into numerical form. But why do we need to do this?
The answer to that is quite straightforward. Machine Learning algorithms work well on numerical data, but they can't really process or understand text data if we give it to the model as it is. I'll try to explain with an example.
Consider a Gender column in a dataset whose values are ['Male', 'Female']. If I pass this column to my model directly, it is not going to be able to make sense of it. This is why we need to encode the data into numerical form, such as 0 and 1. This conversion can also be done with LabelEncoder, but LabelEncoder works on a single column and simply replaces each category with an integer, whereas OneHotEncoder takes the feature columns and creates one binary column per category. Once we have encoded our data, we can proceed with MinMaxScaler(), which scales the features and usually helps the model train more efficiently. I shall put together a short demo just to understand the basics of OneHotEncoding.
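Before the full demo, here is a quick standalone sketch (not part of the Kaggle notebook below; the tiny Gender array is made up) of what the two encodings produce:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

# a tiny made-up Gender column, just to illustrate the two encodings
genders = np.array(['Male', 'Female', 'Female', 'Male'])

# LabelEncoder: one integer per category (categories are ordered alphabetically, so Female -> 0, Male -> 1)
print(LabelEncoder().fit_transform(genders))
# [1 0 0 1]

# OneHotEncoder: one binary column per category (Female column first, then Male)
# note: newer scikit-learn versions use sparse_output=False instead of sparse=False
print(OneHotEncoder(sparse=False).fit_transform(genders.reshape(-1, 1)))
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]
#  [0. 1.]]

Notice that LabelEncoder gives a single integer column, while OneHotEncoder gives one column per category.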

Alright, let us get right to the implementation. I'm running this notebook on Kaggle, so the link to the dataset (the previously used dataset) is mentioned below:

Let's start by importing the required libraries and our dataset.

# import the required libraries
import pandas as pd
import numpy as np

# load the Mall Customers dataset from the Kaggle input directory
data = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
print("The given data has dimensions of {}".format(data.shape))
data.head()
[Output: data.head() of the Mall Customers dataset]

If we observe the output above, we have a column called Gender. It contains only two values, Male and Female, so I'm going to try to convert them to binary values such as 0 and 1.
For the implementation, I'm going to import the encoders we need from sklearn.preprocessing.

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

Let's select the Gender column and store it in a variable.

cat_data = data['Gender']  # the categorical column we want to encode

The next thing is to create an object of the LabelEncoder class we imported and fit our data to it so that the conversion takes place.

# fit the LabelEncoder on the Gender column to map each category to an integer
labelEncoder = LabelEncoder()
integer_encoded = labelEncoder.fit_transform(cat_data)
print(integer_encoded)

We have saved the converted values into a variable called integer_encoded, and when we print them, they look a little like this:

[Output: the integer_encoded array of 0s and 1s]

We may notice that the values have already been converted into numbers, but you will understand the difference between this and the OneHotEncoded values once that conversion has been made. Make sure to reshape integer_encoded into a 2-dimensional array using the reshape() function before passing it to OneHotEncoder.

# reshape to a 2-D column vector, since OneHotEncoder expects 2-D input
integer_encoded = integer_encoded.reshape(-1, 1)

# sparse=False returns a dense array instead of a sparse matrix
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

When we print the OneHotEncoded values, we obtain the following values:

[Output: the onehot_encoded array, one row per entry with two binary columns]

The difference is that for every entry in the dataset, the one-hot encoded row has one column for Female and one column for Male, with a 1 in the column matching that entry's gender and a 0 in the other. Keep in mind that one-hot encoding a variable with only two categories isn't all that useful, since a single binary column would do the job; it is mostly used with multiclass variables.
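To see why it shines with multiclass variables, here is a small sketch with a made-up three-category City column (the column and its values are my own illustration, not from the Mall dataset):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# hypothetical three-category column, purely for illustration
cities = np.array([['Delhi'], ['Mumbai'], ['Chennai'], ['Delhi']])

encoder = OneHotEncoder(sparse=False)
print(encoder.fit_transform(cities))
# [[0. 1. 0.]    Delhi
#  [0. 0. 1.]    Mumbai
#  [1. 0. 0.]    Chennai
#  [0. 1. 0.]]   Delhi
# columns are ordered alphabetically: ['Chennai', 'Delhi', 'Mumbai']

Each category gets its own column, so the model never assumes an ordering like Chennai < Delhi < Mumbai, which a plain integer encoding would imply.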

From this step, you can attach the one-hot encoded columns back to your DataFrame, or use the variable directly as feature columns.
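As a rough sketch of that last step (the column names here are my own choice; their order follows the integer encoding, where Female is 0 and Male is 1):

# build a DataFrame from the encoded array and join it back onto the original data
encoded_df = pd.DataFrame(onehot_encoded, columns=['Gender_Female', 'Gender_Male'], index=data.index)
data_encoded = pd.concat([data.drop('Gender', axis=1), encoded_df], axis=1)
data_encoded.head()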

That’s it for today. Keep Learning.

Cheers.
