One hot encoding? What and why?

irvan rahadhian
3 min read · Sep 15, 2019

If you have just jumped on the machine learning "hype train", you may be confused about what one hot encoding actually means, and you will run into this term all over the place, especially when working on multi-class classification problems.

So what is "one hot encoding"? The sklearn documentation says it will "Encode categorical integer features using a one-hot aka one-of-K scheme". Do you get it? When I was just starting to learn about machine learning, that definition was still confusing to me.

After searching a bit more, I found a good explanation of one hot encoding by Jason Brownlee:

“A one hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1”.
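
In plain Python, that two-step process looks roughly like the sketch below (the labels and variable names are made up purely for illustration):

# step 1: map each unique category to an integer
labels = ["red", "green", "blue", "green"]
categories = sorted(set(labels))                    # ['blue', 'green', 'red']
to_int = {c: i for i, c in enumerate(categories)}
integer_encoded = [to_int[label] for label in labels]
print(integer_encoded)                              # [2, 1, 0, 1]
# step 2: represent each integer as a binary vector with a single 1
one_hot = [[1 if i == idx else 0 for i in range(len(categories))]
           for idx in integer_encoded]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]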

Why is one hot encoding needed in machine learning?

Machine learning algorithms cannot work directly with categorical data. Categorical data must first be converted into numbers before it can be fed to an ML algorithm so that it can do a good job of classifying the data, and that is exactly what one hot encoding does.

OK, for example, let's say we have a simple sequence of "animal" labels with the values "otter", "owl" and "cat". We could use integer encoding for these data, but that is not enough, because the categories have no ordinal relationship like "first", "second", "third", etc.

Using integer encoding we can represent the values as "otter" = 0, "owl" = 1 and "cat" = 2. But if we insist on this encoding and allow the model to assume a natural ordering between categories, it may result in poor performance or unexpected results (predictions halfway between categories).
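
To see why, here is a tiny hypothetical sketch: with plain integer codes, arithmetic that is meaningless for animals suddenly looks meaningful to a model.

# hypothetical integer encoding -- the numbers carry no real meaning
encoding = {"otter": 0, "owl": 1, "cat": 2}
# to a model, "cat" now looks twice as far from "otter" as "owl" does,
# and the "average" of an otter and a cat looks exactly like an owl
print((encoding["otter"] + encoding["cat"]) / 2)  # 1.0, i.e. the code for "owl"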

This is where one hot encoding comes in: the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “animals” sequence example, there are 3 values, therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the animal and “0” values for the other animals.

| otter | owl | cat |
|-------|-----|-----|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |

Example of one hot encoding using sklearn:

from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['otter', 'owl', 'cat']
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2 this argument is named sparse_output
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)
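
Note that LabelEncoder assigns integers in alphabetical order, so running this should give you "cat" = 0, "otter" = 1 and "owl" = 2 rather than the 0/1/2 order used in the prose above: integer_encoded comes out as [1 2 0], onehot_encoded as [[0. 1. 0.], [0. 0. 1.], [1. 0. 0.]], and the inverted first example prints ['otter'].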

And here is the same example using Keras:

from numpy import array
from numpy import argmax
from keras.utils import to_categorical
# define example
data = ['otter', 'owl', 'cat']
data = array(data)
convert_data = []
# manually map each label to an integer
for d in data:
    if d == 'otter':
        convert_data.append(0)
    elif d == 'owl':
        convert_data.append(1)
    else:
        convert_data.append(2)
print(convert_data)
# one hot encode
encoded = to_categorical(convert_data)
print(encoded)
# invert encoding
inverted = argmax(encoded[0])
print(inverted)
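
Here to_categorical simply turns the integer list [0, 1, 2] into the matrix [[1. 0. 0.], [0. 1. 0.], [0. 0. 1.]], and argmax(encoded[0]) recovers the integer 0, which we can map back to 'otter' by hand. If you are using TensorFlow 2.x, the same function is also available as tensorflow.keras.utils.to_categorical.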

OK, that is what one hot encoding is and why it is needed in machine learning. I hope this article helps you understand it and clears up the confusion.

And I hope you had as much fun reading this piece as I had writing it.

ENJOY YOUR CODING!
