One Hot Encoding in Machine Learning

Anjali Dwivedi
3 min readAug 28, 2020


Recently, while working on a Machine Learning project, I came across something called “One Hot Encoding”. The foremost requirement when working with any dataset is pre-processing the data, and encoding is a big part of pre-processing, since it turns the data into a form the computer can understand.

The two popular techniques for this are Label Encoding and One Hot Encoding.

Label Encoding refers to converting the labels into numeric form so as to put them in a machine-readable form. Machine learning algorithms can then better decide how those labels should be handled.
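As a quick sketch of label encoding, scikit-learn's LabelEncoder maps each distinct label to an integer (the height values below are made up for illustration):

```python
# A minimal label-encoding sketch using scikit-learn's LabelEncoder.
# The "height" labels are illustrative, not from a real dataset.
from sklearn.preprocessing import LabelEncoder

heights = ["tall", "medium", "short", "tall", "short"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(heights)

print(list(encoded))           # each label replaced by an integer code
print(list(encoder.classes_))  # classes are sorted alphabetically
```

Note that LabelEncoder assigns codes in alphabetical order of the classes, so "medium" becomes 0, "short" becomes 1, and "tall" becomes 2.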

For example, suppose we have a height column in some random dataset.

[Figure: label encoding representation]

The figure above shows the label encoding for the given height column.

One Hot Encoding is a process that converts categorical variables into a form that ML algorithms can use for better predictions. In simple terms, each label is converted into a list with one element per class: the element at the index of the corresponding class is set to 1, and the remaining elements are set to 0. With 10 classes, for example, each label becomes a list of 10 elements.
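The conversion described above can be sketched in a few lines of plain Python (the function name is my own, chosen for illustration):

```python
# A minimal one-hot sketch: turn a digit label (0-9) into a 10-element
# list with a 1 at the label's index and 0 everywhere else.
def one_hot(label, num_classes=10):
    vec = [0] * num_classes  # start with all zeros
    vec[label] = 1           # mark the index of the class
    return vec

print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```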

For example, consider a dataset of fruits, their corresponding categorical values, and the resulting one-hot encoded data.
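A fruit column like this can be one-hot encoded directly with pandas' get_dummies (the fruit names below are illustrative, not taken from the article's figure):

```python
# One-hot encoding a hypothetical fruit column with pandas get_dummies.
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana", "mango", "apple"]})
one_hot = pd.get_dummies(df["fruit"])  # one column per distinct fruit
print(one_hot)
```

Each distinct fruit becomes its own column, and each row has a single 1 marking the fruit it contains.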

Why is Label Encoding Not Enough?

Comparing the two encodings, label encoding keeps the labels in a single column of rows, while one-hot encoding spreads them across multiple columns. Any numerical variable in the dataset, such as price, stays the same in both. One-hot encoding fixes a particular problem encountered when working with categorical data, so we won't need it in every situation.

Moreover, in experiments like this, the RMSE with one-hot encoding tends to be lower than with label encoding, which indicates better accuracy.

The problem with label encoding is that, since there are different numbers in the same column, the model may misinterpret the data as being in some kind of order: 0 < 1 < 2.

Code for One Hot Encoding

Let’s see how we can code one hot encoding and use it, starting with importing the libraries for pre-processing.

import numpy as np
import pandas as pd

TensorFlow makes this very easy thanks to a helper function in Keras, to_categorical, which we can apply to the labels of both the training set and the test set of any given dataset. Let’s suppose we have some random dataset and load it with pandas’ ‘read_csv’ function:

dataset = pd.read_csv('random.csv')

# y_train and y_test are assumed to hold the integer labels
# extracted from the dataset
from tensorflow.keras.utils import to_categorical
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)

To validate it, we can print the given labels as 10-dimensional vectors, for example:

1 — [0,1,0,0,0,0,0,0,0,0]

2 — [0,0,1,0,0,0,0,0,0,0]

print('y_train_encoded shape:', y_train_encoded.shape)
print('y_test_encoded shape:', y_test_encoded.shape)

To check an encoded label:

y_train_encoded[0]  # suppose this encodes the label 5

will give the encoded label array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]).
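Going the other way, the original label can be recovered from a one-hot vector by taking the index of its maximum value, a common pattern using NumPy's argmax:

```python
# Recover the original label from a one-hot vector: the index of the
# single 1 is the class, which argmax returns.
import numpy as np

encoded = np.array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])
label = int(np.argmax(encoded))
print(label)  # 5
```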

This is how you can perform one hot encoding.
