One-Hot Encoding — A Brief Explanation

Wojtek Fulmyk, Data Scientist
3 min read · Jul 25, 2023

--

Article level: Beginner

My clients often ask me about the specifics of certain data preprocessing methods, why they’re needed, and when to use them. I will discuss a few common (and not-so-common) preprocessing methods in a series of articles on the topic.

In this preprocessing series:

Data Standardization — A Brief Explanation — Beginner
Data Normalization — A Brief Explanation — Beginner
One-Hot Encoding — A Brief Explanation — Beginner
Ordinal Encoding — A Brief Explanation — Beginner
Missing Values in Dataset Preprocessing — Intermediate
Text Tokenization and Vectorization in NLP — Intermediate
Outlier Detection in Dataset Preprocessing — Intermediate
Feature Selection in Data Preprocessing — Advanced

In this short writeup I will explain what one-hot encoding is generally about. This article is not overly technical, but some understanding of specific terms will be helpful, so I have attached short explanations of the more complicated terminology. Give it a go, and if you need more info, just ask in the comments section!

preprocessing technique — Transforming raw data before modeling to improve performance.

categorical data — Data representing discrete categories or labels rather than numeric quantities.

numeric vectors — Arrays of numbers representing categorical values for modeling.

neural networks — Models that learn complex patterns through layers of interconnected nodes.

regression — Models that predict continuous outcomes from relationships between features.

features — Input variables representing characteristics used by machine learning models to make predictions.

dimensionality — The number of features or variables in a dataset.

One-Hot Encoding

The Why

One-hot encoding is a common preprocessing technique used when working with categorical data in machine learning. It serves two key purposes:

A) Converting categorical values into numeric vectors that algorithms like neural networks and regression can understand. Many models require numeric input.

B) Representing categorical values in a way that captures their distinctness. Without one-hot encoding, an algorithm fed plain integer labels may read an ordering or magnitude into categories that have none, as the sketch below illustrates.
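
To see the problem, consider a naive integer mapping (a toy illustration of my own, not taken from any library):

colors = ['red', 'green', 'blue']

# Naive mapping: red=0, green=1, blue=2
naive = {color: i for i, color in enumerate(colors)}
print(naive)  # {'red': 0, 'green': 1, 'blue': 2}

# A model that consumes these integers directly can infer that
# blue (2) > green (1) > red (0), an ordering the colors never had.

One-hot encoding avoids this by giving each category its own independent 0/1 column.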

The How

To one-hot encode a categorical variable like “color” with 3 categories (red, green, blue), we create 3 new numeric features:

  • Red: 1 if color is red, else 0
  • Green: 1 if color is green, else 0
  • Blue: 1 if color is blue, else 0

So “red” becomes [1, 0, 0], “green” is [0, 1, 0], and “blue” is [0, 0, 1].

Additional Considerations

  1. The number of new feature columns created equals the number of distinct categories in the original data. So, if there were 10 distinct color values, one-hot encoding would produce 10 new columns.
  2. It can greatly expand the dimensionality of your data. If you one-hot encode several categorical variables, each with many possible values, the total number of features can grow very large.
  3. For categories with a logical ordering (like “small”, “medium”, “large”), ordinal encoding is an alternative that maps values to integers (0, 1, 2) rather than binary vectors. This preserves information about the ordinal relationships; see the sketch after this list.
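
Here is a minimal ordinal-encoding sketch using scikit-learn’s OrdinalEncoder (the sizes data is a toy example of mine; passing categories explicitly makes the integer order match the logical order):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data with a natural ordering
sizes = np.array(['small', 'medium', 'large']).reshape(-1, 1)

# Spell out the category order so that small=0, medium=1, large=2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(encoder.fit_transform(sizes))

This will output the following:

[[0.]
 [1.]
 [2.]]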

Useful Python Code

Option 1: Using numpy and some very simple code:

import numpy as np

colors = ['red', 'green', 'blue']

# An identity matrix gives one one-hot row per category
encodings = np.eye(len(colors), dtype=int)

for color, row in zip(colors, encodings):
    print(f"{color.capitalize()} encoding:", row.tolist())

This will output the following:

Red encoding: [1, 0, 0]
Green encoding: [0, 1, 0]
Blue encoding: [0, 0, 1]

Option 2: Using the scikit-learn library (this is the preferred method; note that the sparse_output argument below requires scikit-learn 1.2 or newer, while older versions use sparse=False instead):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = ['red', 'green', 'blue']
# OneHotEncoder expects a 2D array: one row per sample, one column per feature
colors = np.array(colors).reshape(-1, 1)

# sparse_output=False returns a dense numpy array instead of a sparse matrix
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(colors)

print(encoded)

This will output the following:

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

Note that the rows differ from the earlier hand-built example: OneHotEncoder sorts categories alphabetically (blue, green, red), so “red” encodes as [0. 0. 1.] here.
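
The fitted encoder can also map encoded rows back to labels (a small follow-up sketch reusing the encoder object from above):

# The learned column order, one array per input feature
print(encoder.categories_)

# Decode the one-hot rows back to the original labels,
# recovering 'red', 'green', 'blue'
print(encoder.inverse_transform(encoded))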

And that’s all! I will leave you with some “fun” trivia 😊

Trivia

  • One-hot encoding is closely related to “dummy coding” in statistics: the 0/1 indicator columns are known as dummy variables (statisticians often drop one column to avoid redundant information).
  • While the general technique has been around for many decades, the term “one-hot” comes from digital circuit design, where a one-hot state machine keeps exactly one flip-flop set to 1 at a time. Machine-learning researchers borrowed the term, and it gained broad popularity in the 2000s; it has now become standard terminology.

--

Wojtek Fulmyk, Data Scientist

Data Scientist, University Instructor, and Chess enthusiast. ML specialist.