Encoding Categories

Bilal Khan
4 min read · Aug 16, 2022


Categorical features are those that contain data in the form of categories. Most machine learning models can handle only numerical data, while categorical features are typically stored as strings. Encoding, the process of converting these categories into numeric values, is therefore essential. Let us discuss encoding in detail.

The traditional techniques of encoding involve:

  1. One-Hot Encoding
  2. Count Frequency Encoding
  3. Ordinal / Label Encoding

Another important concept is the monotonic relationship, where the feature we are dealing with is directly or inversely proportional to the label (outcome). For example, insurance for elderly people is more expensive: there is a direct proportionality between the feature (age group) and the label (price of insurance). We use the following techniques when we come across a monotonic relationship:

  1. Order Label Encoding
  2. Mean Encoding
  3. Weight of Evidence

Traditional Techniques

Let us understand the following techniques in detail:

  1. One-Hot Encoding

One-Hot Encoding uses binary values to represent the presence or absence of categories. It converts each category into a feature of its own, where the presence of that category is denoted by 1 and its absence by 0. Let us understand this with an example.

The categorical feature here is color, and each category is converted into a feature of its own, with its presence and absence marked by binary values.
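A minimal pandas sketch of the idea, assuming a toy DataFrame with a single `color` column (the data is illustrative, not taken from the original example):

```python
import pandas as pd

# Illustrative toy data with one categorical feature.
df = pd.DataFrame({"color": ["red", "yellow", "green", "red", "yellow"]})

# One binary column per category; 1 marks presence, 0 marks absence.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
```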

Advantages: It preserves all of the information in the feature.

Disadvantages: It increases the feature space, since each category becomes a new column.

2. Count Frequency Encoding

In this technique, each category is replaced by the count or frequency of its observations in the dataset. The following example illustrates Count Frequency Encoding.

Each occurrence of the red category is replaced by its count of three; the same is done for yellow and green.
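A minimal pandas sketch, using hypothetical data in which red appears three times, as in the example above:

```python
import pandas as pd

# Illustrative data: red appears three times, yellow twice, green once.
df = pd.DataFrame({"color": ["red", "red", "red", "yellow", "yellow", "green"]})

# Count occurrences of each category and map the counts back onto the column.
counts = df["color"].value_counts()              # red: 3, yellow: 2, green: 1
df["color_count"] = df["color"].map(counts)

# To use relative frequencies instead of raw counts:
freqs = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freqs)
print(df)
```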

This strategy is widely used in Kaggle competitions and works well with ensemble models, but it does not fit linear models well. Its drawback is that two categories with the same frequency collide, since they are mapped to the same value.

3. Ordinal / Label Encoding

In label encoding, the categories are replaced by numbers from 1 to n, where n is the total number of distinct categories. The numbers are assigned based on alphabetical order. The following example illustrates Label Encoding.

The number of distinct categories is three, and since we follow alphabetical order, green is assigned 1, red 2 and yellow 3. This approach does not expand the feature space and works well with ensemble algorithms.
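A minimal pandas sketch that assigns 1 to n in alphabetical order, as described above (the toy data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# Sort the distinct categories alphabetically and assign 1..n.
mapping = {cat: i for i, cat in enumerate(sorted(df["color"].unique()), start=1)}
# mapping == {'green': 1, 'red': 2, 'yellow': 3}
df["color_label"] = df["color"].map(mapping)
print(df)
```

Note that scikit-learn's `LabelEncoder` implements the same idea but numbers the categories from 0 to n-1 rather than 1 to n.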

Encoding for Monotonic Relationship

A monotonic relationship, as established previously, denotes the direct or inverse proportionality of the feature with the label (outcome). Let’s examine the encoding methods.

  1. Order Label Encoding

Integers from 1 to n are used in place of the categories, where n is the number of distinct categories in the variable. The numbering, however, is informed by the mean of the target for each category. Let us understand this with an example.

We have our feature (color) and the target. We calculate the mean of the target for each category: the mean of red is 0.5, as it occurs twice with one success and one failure; the mean of yellow is 1, as it occurs twice with both outcomes being successes; and the mean of green is 0. The categories are then assigned numbers based on their mean, with the highest-mean category being assigned 1 (yellow), followed by red and green.
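A minimal pandas sketch that reconstructs this example with hypothetical rows (red has a target mean of 0.5, yellow of 1, green of 0, so yellow gets 1, red 2 and green 3):

```python
import pandas as pd

# Hypothetical reconstruction of the example: red -> mean 0.5, yellow -> 1.0, green -> 0.0.
df = pd.DataFrame({
    "color":  ["red", "red", "yellow", "yellow", "green", "green"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Mean of the target per category, sorted from highest to lowest.
means = df.groupby("color")["target"].mean().sort_values(ascending=False)

# Assign 1 to the highest-mean category, 2 to the next, and so on.
ordered_labels = {cat: rank for rank, cat in enumerate(means.index, start=1)}
# ordered_labels == {'yellow': 1, 'red': 2, 'green': 3}
df["color_ordered"] = df["color"].map(ordered_labels)
print(df)
```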

With this approach a monotonic relationship is preserved even after encoding, and it can be used in both linear and non-linear models.

2. Mean Encoding

Mean Encoding replaces each category by the mean target value for that category. Let us understand this with an example.

Each category’s mean is computed with respect to the target and substituted for the category. In this instance, red’s mean relative to the target is 0.5; likewise for yellow and green. The limitation of this approach is that two categories with the same mean are mapped to the same value, causing them to overlap and hence a loss of information.
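A minimal pandas sketch, reusing the same hypothetical data as in the previous example:

```python
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", "red", "yellow", "yellow", "green", "green"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Replace each category by the mean of the target for that category.
means = df.groupby("color")["target"].mean()     # green: 0.0, red: 0.5, yellow: 1.0
df["color_mean"] = df["color"].map(means)
print(df)
```

In practice the means should be computed on the training set only and then applied to the test set, to avoid leaking the target into the encoding.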

3. Probability Ratio Encoding

In this technique, the probability of the target being positive for a category, P(1), is calculated. Similarly, the probability of the target being negative for that category, P(0), is calculated. The probability ratio P(1)/P(0) is then computed, and the resulting value is used to encode the category. Taking the natural logarithm of this ratio gives the Weight of Evidence listed earlier.
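A minimal pandas sketch with hypothetical data (note that a category with no negative outcomes would make P(0) zero, so a real implementation needs some guard against division by zero, such as smoothing):

```python
import pandas as pd

# Hypothetical data; values chosen so that no category has P(0) == 0.
df = pd.DataFrame({
    "color":  ["red", "red", "red", "yellow", "yellow", "yellow", "green", "green"],
    "target": [1, 0, 0, 1, 1, 0, 0, 1],
})

# P(target = 1) per category and its complement P(target = 0).
p1 = df.groupby("color")["target"].mean()
p0 = 1 - p1

# Probability ratio used as the encoding.
ratio = p1 / p0                                  # red: 0.5, yellow: 2.0, green: 1.0
df["color_ratio"] = df["color"].map(ratio)
print(df)
```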

Implementation of each technique can be explored here.

Thank you for reading.
