ML Feature Engineering: Dealing with Categorical Features

TC. Lin
3 min read · Sep 27, 2023


Most traditional ML algorithms, that is, algorithms based on statistical equations, work best with numerical values. However, categorical features are often important when building a model to solve a problem.

So, the question is: how do we process these categorical features so that they work with most algorithms?

To answer this question, we need a way to represent categorical features as numbers while preserving their natural meaning.

The types of categorical encoding covered in this article are:

  1. Ordinal Encoding
  2. One-hot Encoding
  3. Binary Encoding

To start, we first need to understand what a categorical feature is.

Suppose that we own a gym and are gathering data about our members, such as height and weight.

  • As height and weight are measured in numbers, such as 180 cm or 80 kg, they are continuous values, so they are numerical features.

Categorical features, by contrast, are features such as a person's gender or the number of doors on a car.

Straightforward?

Ordinal Encoding

There are times when a categorical feature exhibits an ordinal relationship, meaning its values have a natural ordering or ranking.

For example, in a survey, we might ask a person to rate their satisfaction using options like Least, Medium, and High.

To preserve this ordering relationship, we normally encode the values such that:

  • 1 to represent Least
  • 2 to represent Medium
  • 3 to represent High

With this type of encoding, the natural ordering is kept. Thus, the meaning of the feature is not lost.
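
As a minimal sketch, here is how this could look with scikit-learn's OrdinalEncoder; the column name "satisfaction" and the sample data are made up for illustration:

```python
# Ordinal encoding with an explicit category order.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"satisfaction": ["Least", "High", "Medium", "Least"]})

# Pass the categories explicitly so the encoder respects the natural order
# (Least < Medium < High) instead of sorting the labels alphabetically.
encoder = OrdinalEncoder(categories=[["Least", "Medium", "High"]])

# OrdinalEncoder maps to 0, 1, 2; add 1 to match the 1/2/3 scheme above.
df["satisfaction_encoded"] = encoder.fit_transform(df[["satisfaction"]]).ravel() + 1
print(df)
```

Passing the categories explicitly is the key step: without it, the encoder would sort the labels alphabetically and break the ordering we want to keep.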

One-hot Encoding

In contrast to ordinal encoding, one-hot encoding is generally applied to categorical features that have no natural ordering, for example, blood type (A, B, AB, O).

If we were to represent blood type as numbers using one-hot encoding, we would transform it into a 4-dimensional sparse vector, such that:

  • A represented as (1, 0, 0, 0)
  • B represented as (0, 1, 0, 0)
  • AB represented as (0, 0, 1, 0)
  • O represented as (0, 0, 0, 1)

With this, each dimension corresponds to a single category, and the natural meaning is kept.
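
As a quick sketch, pandas.get_dummies produces exactly these 0/1 columns; the column name blood_type and the sample data are illustrative:

```python
# One-hot encoding with pandas.
import pandas as pd

df = pd.DataFrame({"blood_type": ["A", "B", "AB", "O", "A"]})

# get_dummies creates one 0/1 column per category, matching the vectors
# above (note: the output columns are ordered alphabetically: A, AB, B, O).
one_hot = pd.get_dummies(df["blood_type"], prefix="blood")
print(one_hot)
```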

The problem of one-hot encoding

The problem with one-hot encoding arises when a categorical feature has a large number of distinct values: it makes the dataset extremely high-dimensional.

To deal with this problem:

1. We can make use of sparse vectors to save space.

  • In a one-hot encoded vector, only one entry has the value 1, while the rest are 0.
  • We can exploit this property to save space by storing the vector in a sparse format (see the sketch after this list).
  • For example, the vector v = [2, 0, 4, 0] can be represented as (4, [0, 2], [2, 4]), where the first value, 4, is the length of the vector; [0, 2] gives the positions of the non-zero entries; and [2, 4] gives their values.

2. We can use feature selection to reduce the number of dimensions.

  • High dimensionality causes problems in various ways.
  • In K-Nearest Neighbours, distance calculations between data points become less meaningful as the dimensionality increases.
  • In logistic regression, a large number of features makes the model prone to overfitting.
  • Generally, only a limited number of dimensions are useful for classification and prediction.
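
Here is a sketch of the sparse representation described in point 1, using scipy.sparse, which stores exactly the (length, positions, values) triple from the example:

```python
# Sparse storage of a mostly-zero vector.
import numpy as np
from scipy.sparse import csr_matrix

v = np.array([[2, 0, 4, 0]])
sparse_v = csr_matrix(v)

# Only the non-zero entries and their positions are stored.
print(sparse_v.shape[1])  # 4      -> the length of the vector
print(sparse_v.indices)   # [0 2]  -> positions of the non-zero entries
print(sparse_v.data)      # [2 4]  -> the non-zero values themselves
```

In fact, scikit-learn's OneHotEncoder returns this kind of sparse matrix by default, precisely for this reason.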

Binary Encoding

As the name suggests, binary encoding makes use of the idea of binary numbers to perform the encoding.

With binary encoding, the dimensionality is lower than with one-hot encoding, which saves space.

Taking the earlier blood type example (A, B, AB, O), we first assign an ID to each category, then represent each ID in binary:

  • A → ID 1 → 001
  • B → ID 2 → 010
  • AB → ID 3 → 011
  • O → ID 4 → 100

With only 3 dimensions instead of 4, the saving here is modest, but as the number of categories grows, binary encoding needs far fewer dimensions than one-hot encoding: roughly log₂ of the category count, instead of one dimension per category.
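
Below is a hand-rolled sketch of these two steps (not a specific library API); the column names and the 1-4 ID assignment follow the example above:

```python
# Binary encoding: assign IDs, then split each ID into bit columns.
import pandas as pd

df = pd.DataFrame({"blood_type": ["A", "B", "AB", "O"]})

# Step 1: assign an integer ID to each category.
ids = {"A": 1, "B": 2, "AB": 3, "O": 4}
df["id"] = df["blood_type"].map(ids)

# Step 2: write each ID in binary, one column per bit (IDs up to 4 -> 3 bits).
n_bits = int(df["id"].max()).bit_length()
for i in range(n_bits - 1, -1, -1):
    df[f"bit_{i}"] = (df["id"] // 2**i) % 2

print(df)
#   blood_type  id  bit_2  bit_1  bit_0
# 0          A   1      0      0      1
# 1          B   2      0      1      0
# 2         AB   3      0      1      1
# 3          O   4      1      0      0
```

In practice, the category_encoders package provides a BinaryEncoder that performs both steps automatically.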
