Powering Up Your Pandas Part II — Label Encoding and One Hot Encoding

Handhika Yanuar Pratama
Data Folks Indonesia
Sep 7, 2022

First, I am sorry for continuing this series so late. As wise men say, everything great takes time. Today I would like to pick the series back up. Have you ever heard of label encoding and one hot encoding? Strictly speaking, they are not purely part of pandas; the implementations are provided by Scikit-Learn. But the concepts are fundamental and you should know them, so I see nothing wrong with covering them as the second part of Powering Up Your Pandas.

In the last article I used the well-known Iris dataset; this time, let's move on to a newer one, the Penguins dataset. As the name suggests, it contains data on penguins belonging to three classes.

The Workflow

I know the value of your time, so before you read to the end of this story, take a look at the workflow below.

Yep, in this story I will explain label encoding and one hot encoding and practice both on the Penguins dataset. What makes this article different is that I will keep things simple; my only goal is that you understand. Just don't forget to download the dataset here and save it as "penguins.csv".

The Definition

Okay, the second section of this post is the definition part. Label encoding and one hot encoding are simply techniques. As we know, computers learn from numerical values; they cannot understand string values without some preprocessing. That is where both of these methods come in: they help data scientists, engineers, and analysts prepare their data. Okay, let's go to the first method.

Label Encoding

Label encoding is a technique for converting the categorical values in a column into numerical ones. This method works best on datasets with hierarchical or ordinal data. Here are a few examples of ordinal data.

Likert scale: this kind of data is typically collected by researchers in surveys, for example:

  • Very satisfied
  • Satisfied
  • Indifferent
  • Dissatisfied
  • Very dissatisfied

Ordered categories: the responses have a clear rank relative to one another, for example:

Child, teen, youth, adult, old

Label encoding still works if you apply it to non-hierarchical data, but accuracy often drops sharply because the encoding imposes an order that does not actually exist.
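
To make the ordinal case concrete, here is a minimal sketch (my own illustration, not from the original post) of encoding a Likert-style column by hand, assuming a hypothetical column named "satisfaction":

import pandas as pd

# Hypothetical survey data with an ordinal "satisfaction" column.
levels = ["Very dissatisfied", "Dissatisfied", "Indifferent",
          "Satisfied", "Very satisfied"]
survey = pd.DataFrame({"satisfaction": ["Satisfied", "Indifferent", "Very satisfied"]})

# An ordered Categorical keeps the ranking, so the integer codes 0-4
# follow the scale instead of the alphabetical order a plain LabelEncoder would use.
survey["satisfaction_code"] = pd.Categorical(
    survey["satisfaction"], categories=levels, ordered=True
).codes
print(survey)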

Code — Label Encoding

First, you need to load the dataset; use the code below.

import pandas as pd
df = pd.read_csv("penguins.csv")

After the dataset is loaded, you can look at the first five rows with df.head() to get a feel for the data.
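
A sketch of that inspection, assuming the CSV follows the usual Palmer Penguins layout (the column list is my assumption, since the original screenshot is not reproduced here):

df.head()
# Expected columns (assuming the standard Palmer Penguins CSV):
# species, island, bill_length_mm, bill_depth_mm,
# flipper_length_mm, body_mass_g, sex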

There are three categorical columns in the Penguins dataset: species, island, and sex. In this example, let's explore the island column. You can list its unique values with the code below.

df["island"].unique()
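
The original shows this output as a screenshot; since the column holds the three islands discussed later in this story, the result should look roughly like this:

# array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)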

Next, import LabelEncoder from scikit-learn and assign it to a variable so you can reuse it. After that, you can transform the categorical column into numerical values. Here is an example.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(df["island"])

We assigned LabelEncoder() to the variable le so the same encoder can be reused later on.

The new dataset with the label-encoded values is now created.
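
The encoded dataframe appears only as a screenshot in the original; a minimal sketch of writing the codes back into the dataframe (the column name island_encoded is my own choice, not from the original) could look like this:

# Store the encoded values in a new column next to the original one.
df["island_encoded"] = le.fit_transform(df["island"])
df.head()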

One Hot Encoding

One hot encoding is a technique of creating dummy columns based on the values of a categorical variable. As said before, label encoding works best with hierarchical (ordinal) data, while one hot encoding works best with non-hierarchical data. Perhaps you know non-hierarchical data better as nominal data. Briefly, it means the values have no order relative to one another. An example of nominal data has already appeared in this story: the island column. It contains three values, ‘Torgersen,’ ‘Biscoe,’ and ‘Dream,’ and as you can see, none of them ranks above the others. What makes one hot encoding different from label encoding is that it does not simply change strings into numbers; instead, it creates dummy columns that represent the categorical values. Let's go to the code to gain more understanding.

Code — One Hot Encoding

In one hot encoding, we create one dummy column for each of the original categorical values. If a column has three unique values, it will create three dummy columns. Pandas plays its role here: it can build the dummy data for us. As we already know, the island column has three unique values, ‘Torgersen,’ ‘Biscoe,’ and ‘Dream,’ so it will create a dummy dataset like this.

add_columns = pd.get_dummies(df["island"])
add_columns

For this explanation, look at the first row of the dataset. You will see that the value of the island column is Torgersen.

Now see the output of the dummy column for the first row.
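
The original shows that output as a screenshot; you can reproduce it with something like the sketch below (recent pandas versions may show True/False instead of 1/0):

add_columns.iloc[0]
# Expected output for a Torgersen row:
# Biscoe       0
# Dream        0
# Torgersen    1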

Torgersen is scored 1 and the rest are 0, which means Torgersen is the value present in that row. This is the key idea to remember in one hot encoding. After the dummy columns are created, you can join them to the main dataframe. You can also remove the original column using the `drop` method provided by pandas.

df = df.join(add_columns)
df.drop(["island"], axis=1, inplace=True)
df.head()

The one hot encoding is now complete. Congratulations.
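
As a side note, pandas can also do the dummy creation, join, and drop in a single step. The sketch below assumes a fresh copy of the original dataframe, since df above has already had its island column replaced:

df_raw = pd.read_csv("penguins.csv")
# get_dummies with columns= builds the dummy columns for "island" and drops
# the original column in one call, leaving every other column untouched.
df_onehot = pd.get_dummies(df_raw, columns=["island"])
df_onehot.head()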

When to use?

Most of you will ask: which method is best for the dataset you are working on? The answer is quite simple. Just remember a little lesson from the Assassin's Creed motto.

`Nothing is true, and everything is permitted.`

Yeah, both methods are great, and which one to use depends on your dataset. If you are a complicated person, you could use both of them, but that would mostly just waste your time. 😆

Oops, I was wrong!

There are several pieces of advice on when to use each method. You can read the full explanation here; I will quote from that post.

Use Label Encoding when you have ordinal features in your data to get higher accuracy and when there are too many categorical features present in your data because, in such scenarios, One Hot Encoding may perform poorly due to high memory consumption while creating the dummy variables.

Briefly, it is all about time. If you have a big dataset with many categorical features, you should prefer label encoding over one hot encoding, because one hot encoding greatly inflates the dataset's width while label encoding only replaces the values in the existing column. But again, as I mentioned in the definition part: if your data is hierarchical, use label encoding; if it is non-hierarchical, use one hot encoding instead.
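
To make the size argument concrete, here is a rough sketch (my own illustration, not from the quoted post) comparing the two encodings on a hypothetical high-cardinality column:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with 1,000 distinct categories and 10,000 rows.
cats = pd.Series(np.random.choice([f"cat_{i}" for i in range(1000)], size=10_000))

label_encoded = pd.Series(LabelEncoder().fit_transform(cats))  # stays a single column
one_hot = pd.get_dummies(cats)                                 # explodes into 1,000 columns

print(label_encoded.memory_usage(deep=True))
print(one_hot.memory_usage(deep=True).sum())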

The Conclusions

In this story, we explored the definitions and implementations of label encoding and one hot encoding. Each has its pros and cons; match the method to your problem, and you will probably get better accuracy scores.

The Reference

If you have come across the YouTube channel referenced above, you might recognize the code I ran, and you would be right. I followed her way of explaining things; the difference is that she uses video, while I am writing an article so you can reproduce the steps and learn more easily.

I just wanted to build on what she showed in her videos, since she focused more on practice than theory, and to use a newer dataset in this story. The other references I used are below.

See Part I of the Powering Up Your Pandas series.

Dataset
