If you’re into machine learning, then you’ll inevitably come across this thing called “One Hot Encoding”. However, it’s one of those things that are hard to grasp as a beginner to machine learning, since you kind of need to know some things about machine learning to understand it. To help, I figured I would attempt to provide a beginner explanation.
The first thing you do when you’re making any kind of machine learning program is usually pre-processing. By that, I mean preparing data to be analyzed by your program. After all, you can’t just throw a spreadsheet or some pictures into your program and expect it to know what to do. We’re not at that level of AI yet.
A big part of preprocessing is encoding. This means representing each piece of data in a way that the computer can understand, hence the name encode, which literally means “convert to [computer] code”. There are many different ways of encoding, such as Label Encoding, or, as you might have guessed, One Hot Encoding. Label encoding is intuitive and easy to understand, so I’ll explain that first. Hopefully from there you’ll be able to fully understand one hot encoding.
Let’s assume we’re working with categorical data, like cats and dogs. Looking at the name of label encoding, you might be able to guess that it encodes labels, where a label is just a category (i.e. cat or dog), and encoding just means giving each category a number to represent it (1 for cat and 2 for dog). By giving each category a number, the computer now knows how to represent them, since the computer knows how to work with numbers. And with that, we’ve already finished explaining label encoding. But there’s a problem that often makes it a poor fit for categorical data.
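To make that concrete, here’s a minimal sketch of label encoding done by hand (the mapping is the one from above: 1 for cat, 2 for dog):

```python
# Label encoding by hand: each category gets an integer
mapping = {'cat': 1, 'dog': 2}
labels = ['cat', 'dog', 'dog', 'cat']

# Replace every label with its number
encoded = [mapping[label] for label in labels]
print(encoded)  # [1, 2, 2, 1]
```

Libraries like sklearn do exactly this lookup for you, and also remember the mapping so it can be reversed later.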
The problem is that with label encoding, the categories now have natural ordered relationships. The computer has no way of knowing that our numbers are just arbitrary IDs; it treats them as ordinary numeric values, so it will naturally treat 3 as greater than 1 and give the higher numbers higher weights. We can see the problem with this in an example:
- Imagine you had 3 categories of foods: apples, chicken, and broccoli. Using label encoding, you would assign each of these a number to categorize them: apples = 1, chicken = 2, and broccoli = 3. But now, if your model internally needs to calculate the average across categories, it might do (1 + 3) / 2 = 2. This means that according to your model, the average of apples and broccoli together is chicken.
Obviously that line of thinking by your model is going to lead to it getting correlations completely wrong, so we need to introduce one-hot encoding.
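The failure above can be sketched in a couple of lines, using the same made-up labels:

```python
import numpy as np

# The label encoding from the example: apples = 1, chicken = 2, broccoli = 3
apples, chicken, broccoli = 1, 2, 3

# Averaging two label-encoded categories lands exactly on a third,
# unrelated category -- the "ordering" of the labels is pure coincidence
average = np.mean([apples, broccoli])
print(average)  # 2.0, which is the label for chicken
```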
Rather than labeling things as a number starting from 1 and then increasing for each category, we’ll go for more of a binary style of categorizing. You might have seen this coming if you already knew what a one-hot is (the term comes from digital circuits, where a group of bits is “one-hot” if exactly one of them is 1, but don’t worry about that). Let me provide a visualized difference between label and one-hot encoding. See if you can work out the difference:

Label encoding:

| Food | Category # | Calories |
| --- | --- | --- |
| Apple | 1 | 95 |
| Chicken | 2 | 231 |
| Broccoli | 3 | 50 |

One-hot encoding:

| Apple | Chicken | Broccoli | Calories |
| --- | --- | --- | --- |
| 1 | 0 | 0 | 95 |
| 0 | 1 | 0 | 231 |
| 0 | 0 | 1 | 50 |
What’s the difference? Well, our categories were formerly rows, but now they’re columns. Our numerical variable, calories, has stayed the same, however. A 1 in a particular column tells the computer the correct category for that row’s data. In other words, we have created an additional binary column for each category.
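If you want to see this expansion concretely, pandas can do it in one line with .get_dummies (the food names and calorie values here are just made up for illustration):

```python
import pandas as pd

# Made-up data: one categorical column and one numerical column
df = pd.DataFrame({'food': ['apple', 'chicken', 'broccoli'],
                   'calories': [95, 231, 50]})

# One binary column per category; 'calories' is left untouched
encoded = pd.get_dummies(df, columns=['food'])
print(encoded.columns.tolist())
```

Each `food_*` column contains a 1 only in the rows that belong to that category, exactly like the table above.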
It’s not immediately clear why this is better (aside from the problem I mentioned earlier), and that’s because there isn’t a single universal reason. Like many things in machine learning, we won’t be using one hot encoding in every situation; it’s not outright better than label encoding. It just fixes the artificial-ordering problem you’ll encounter when label encoding categorical data that has no natural order.
One Hot Encoding in Code (Get it? It’s a pun)
It’s always helpful to see how this is done in code, so let’s do an example. Normally I’m a firm believer that we should do something without any libraries in order to learn it, but for this tedious pre-processing stuff we don’t really need to. Libraries can make this so simple. We’re going to use numpy, sklearn, and pandas, as you’ll find yourself using those 3 libraries in many of your projects.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
Now that we have the tools, let’s get started. We’ll work with a made-up dataset. Read in the dataset with pandas’ .read_csv function:
dataset = pd.read_csv('made_up_thing.csv')
Hopefully that’s self-explanatory. Next up is a little trickier. The thing about spreadsheets is that you may or may not care about some of the columns. For the sake of simplicity, let’s say we care about everything except the last column. We’re going to use pandas’ .iloc feature, which gets the data at whatever row(s) and column(s) you tell it to:
X = dataset.iloc[:, :-1].values
.iloc actually takes in [rows, columns], so we inputted [:, :-1]. The : means we want all the rows, and :-1 means we want every column except the last. We add the .values to, well, get the values at the segments we have selected. In other words, the first part selects the rows and columns, and .values pulls out the actual values as an array.
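If you’d rather not track down a CSV, here’s the same slicing on a small hand-built DataFrame (the column names and values are hypothetical stand-ins for the made-up file):

```python
import pandas as pd

# A tiny stand-in for the made-up CSV
dataset = pd.DataFrame({'food': ['apple', 'chicken', 'broccoli'],
                        'calories': [95, 231, 50],
                        'target': [0, 1, 0]})

# All rows, every column except the last
X = dataset.iloc[:, :-1].values
print(X.shape)  # (3, 2): 3 rows, 2 of the 3 columns
```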
Now let’s do the actual encoding. Sklearn makes it incredibly easy, but there is a catch. You might have noticed we imported both the LabelEncoder and the OneHotEncoder. Sklearn’s one hot encoder doesn’t actually know how to convert categories to numbers; it only knows how to convert numbers to binary. We have to use the LabelEncoder first.
First, we’ll set up a LabelEncoder just like you would any normal object:
le = LabelEncoder()
Next we have to use sklearn’s .fit_transform function. Let’s say that we need to encode just the first column. We would do:
X[:, 0] = le.fit_transform(X[:, 0])
This function is just a combination of the .fit and .transform commands. .fit takes X (in this case the first column of X because of our X[:, 0]) and learns the mapping from each category to a number. .transform then applies that mapping and converts everything to numerical data.
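Here’s what that looks like end to end on a small made-up array (note that LabelEncoder numbers the classes in sorted order):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: a categorical first column, a numerical second column
X = np.array([['apple', 95], ['chicken', 231], ['apple', 50]], dtype=object)

le = LabelEncoder()
X[:, 0] = le.fit_transform(X[:, 0])
print(X[:, 0])  # apple -> 0, chicken -> 1, so: [0 1 0]
```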
All that’s left is to use the one hot encoder. Thankfully, it’s almost the same as what we just did:
ohe = OneHotEncoder(categorical_features = [0])
X = ohe.fit_transform(X).toarray()
categorical_features is a parameter that specifies which column(s) we want to one hot encode, and since we want to encode the first column, we put [0]. Finally, we fit_transform into binary, and turn it into an array so we can work with it easily going forward.
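A heads-up if you’re on a recent version of scikit-learn: the categorical_features parameter was removed (as of 0.22), and the modern OneHotEncoder can take string categories directly, so no LabelEncoder step is needed. A rough modern equivalent, on made-up data, might look like this:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: a categorical first column, a numerical second column
X = np.array([['apple', 95], ['chicken', 231], ['broccoli', 50]], dtype=object)

# One hot encode column 0; pass the remaining columns through unchanged
ct = ColumnTransformer([('ohe', OneHotEncoder(), [0])],
                       remainder='passthrough')
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)  # (3, 4): three one-hot columns plus calories
```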
And that’s it! It’s pretty simple. One final note though, if you need to do more than just the first column, you do what we just did, but instead of 0 you put whatever column you want. For many columns, you can put it in a for loop:
le = LabelEncoder()  # for 10 columns
for i in range(10):
    X[:, i] = le.fit_transform(X[:, i])
Good luck on your machine learning adventures!