One-Hot Encoding

Mathee Prasertkijaphan

Published in

SCB TechX

3 min read · Oct 20, 2023


Introduction:

There was a time I’d look at a dataset and wonder, “How can I get a machine to understand ‘Apple’ or ‘Banana’?” In this vast world of numbers and algorithms, I stumbled upon a gem that changed everything: One-Hot Encoding. If you’ve ever been perplexed about turning words into numbers, join me on this journey through the art and magic of One-Hot Encoding.

In today’s era of digital transformation, data-driven decisions have become crucial for businesses, governments, and even individuals. Machine Learning (ML), a subset of artificial intelligence, is a major force driving these decisions. If you’ve ever wondered how Netflix suggests movies you might like, or how Amazon recommends products for you, ML is often at the core.

Before an ML algorithm can work its magic, the data must be prepared properly. One such data preparation technique is ‘One-Hot Encoding’. Although the term sounds technical, its concept is straightforward. Let’s unravel it step by step.

What is One-Hot Encoding?

One-hot encoding is a method of converting categorical variables into a numeric form that machine learning algorithms can take as input. Categorical variables contain label values rather than numeric values. Think of one-hot encoding as a translator. Imagine having a friend who only understands “yes” and “no” answers. To communicate complex ideas, you’d have to break them down into a series of yes/no questions. That’s essentially what one-hot encoding does for computers.

Why One-Hot Encoding?

Computers, and ML algorithms in particular, work well with numbers. When given textual data such as “cat”, “dog”, or “bird”, most algorithms cannot use it directly the way they use numbers. Now, one might think, “Why not just assign numbers to these categories, like 1 for cat, 2 for dog, and so on?” The problem is that doing so unintentionally introduces an order and a magnitude among the categories. The machine might assume that ‘dog’ (2) is twice ‘cat’ (1), which doesn’t make sense. One-hot encoding avoids this problem by turning each category value into its own column and assigning a 1 or 0 (Yes/No) to each row in that column. This way, no artificial ranking or relationship is formed between categories.
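To make the ranking problem concrete, here is a minimal sketch in plain Python. It assumes “and so on” means bird gets the label 3, and it uses Euclidean distance as one simple way a model might compare values; both choices are illustrative assumptions rather than part of the original example.

```python
import math

categories = ["cat", "dog", "bird"]

# Integer labels: 1 for cat, 2 for dog, 3 for bird (the 3 is an assumption).
label = {name: i + 1 for i, name in enumerate(categories)}

# One-hot vectors: one position per category, with a single 1 marking the category.
one_hot = {
    name: [1 if i == j else 0 for j in range(len(categories))]
    for i, name in enumerate(categories)
}


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# With integer labels, "bird" looks twice as far from "cat" as "dog" does.
label_gap_dog = abs(label["dog"] - label["cat"])
label_gap_bird = abs(label["bird"] - label["cat"])

# With one-hot vectors, every pair of distinct categories is equally far apart.
one_hot_gap_dog = distance(one_hot["dog"], one_hot["cat"])
one_hot_gap_bird = distance(one_hot["bird"], one_hot["cat"])

print("Label gaps:", label_gap_dog, label_gap_bird)
print("One-hot gaps equal:", one_hot_gap_dog == one_hot_gap_bird)
```

The integer labels put the categories on a number line, so some pairs end up closer than others. The one-hot vectors keep all categories the same distance apart.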

Example:

Suppose you have a dataset of the favorite color of five people as follows:

After one-hot encoding, the data will look something like this:

Here, a ‘1’ under a favorite color’s column indicates that it is the person’s favorite color, while ‘0’ means it isn’t.
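The mechanics can also be sketched in a few lines of plain Python. The names and colors below are made up for illustration and are not the data from the original tables.

```python
# Hypothetical data: the people and colors are invented for this sketch.
favorite_colors = [
    ("Person 1", "Red"),
    ("Person 2", "Blue"),
    ("Person 3", "Green"),
    ("Person 4", "Blue"),
    ("Person 5", "Red"),
]

# One column per distinct color, sorted so the column order is stable.
colors = sorted({color for _, color in favorite_colors})


def encode(color):
    """Return a row with 1 in the matching color's column and 0 elsewhere."""
    return [1 if color == c else 0 for c in colors]


encoded = {person: encode(color) for person, color in favorite_colors}

print("Columns:", colors)
for person, row in encoded.items():
    print(person, row)
```

Every row contains exactly one 1, which is where the name “one-hot” comes from.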

Implementing One-Hot Encoding in Python:

1. Using Pandas: The Pandas library, a popular data manipulation tool, offers a function called get_dummies that handles one-hot encoding in a single call.
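A minimal sketch of get_dummies, assuming a hypothetical DataFrame with a single favorite_color column (the column name and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"favorite_color": ["Red", "Blue", "Green", "Blue", "Red"]})

# get_dummies creates one column per distinct value, named "<column>_<value>".
# dtype=int requests 0/1 integers; recent pandas versions default to booleans.
encoded = pd.get_dummies(df, columns=["favorite_color"], dtype=int)

print(encoded)
```

The original favorite_color column is replaced by the new indicator columns, one per color.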

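2. Using Scikit-Learn: Scikit-Learn’s OneHotEncoder is a more robust tool for the job. It learns the set of categories when it is fitted and can then apply the same encoding to new data, so it fits well in machine learning pipelines.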
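A minimal sketch of OneHotEncoder on hypothetical color data. The handle_unknown="ignore" setting is one possible choice for dealing with categories that were not seen during fitting, not the only one.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data; the encoder expects a 2D array (rows x features).
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"], ["Red"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])
print(encoded)

# The fitted encoder can be reused on new data with the same column layout.
new_rows = encoder.transform(np.array([["Green"], ["Purple"]])).toarray()
print(new_rows)
```

Because the encoder remembers the categories it was fitted on, training and prediction data end up with the same columns in the same order.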

3. Using TensorFlow and Keras: If you plan to feed the data into a deep learning model built with TensorFlow, its high-level API, Keras, provides utilities for one-hot encoding.
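One such utility is to_categorical. The sketch below uses hypothetical color data. Because to_categorical works on integer class indices, the colors are first mapped to indices by hand; that mapping is an illustrative assumption, not a prescribed step.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data for illustration.
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# to_categorical expects integer class indices, so map each color to an index.
classes = sorted(set(colors))
index_of = {name: i for i, name in enumerate(classes)}
indices = np.array([index_of[c] for c in colors])

# Each index becomes a row with a single 1 in the matching position.
encoded = tf.keras.utils.to_categorical(indices, num_classes=len(classes))

print(classes)
print(encoded)
```

The result is a NumPy array with one row per sample and one column per class, which can be passed to a Keras model as labels or inputs.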
