Encoding Categorical Variables

What should we do about them?

Elli Tzini
Geek Culture · May 31, 2021


Wait, what? 😱 What do you mean Machine Learning algorithms do not understand categorical variables? Aren't they supposed to be intelligent?

I don't want to be a mood killer, but… they don't. So, sit tight and let's find out how we can make this problem disappear and possibly even take advantage of those categorical variables.

Table of contents

  • Introduction
  • One Hot Encoding
  • Label Encoding
  • Target Encoding
  • Entity Embeddings
  • Similarity Embeddings
  • Bonus
  • Last words

Introduction

More often than not, datasets include categorical variables: a set of discrete items such as occupations or ingredient names. However, most machine learning algorithms accept only numerical inputs and cannot handle such values directly (exceptions: CatBoost & LightGBM). This is where encoding steps in.

Regardless of the method, all encodings aim to replace the values of a categorical variable with a fixed-length numerical vector. Before moving on to the next section, it is important to know that there are two types of categorical variables:

  1. Nominal → Athens, Cairo, Paris, Tokyo, New Delhi etc.
  2. Ordinal → High School Diploma, BS, MS, PhD

The difference between them is that the first kind has no particular ordering / direction while the other one does.

One Hot Encoding

This method encodes categorical values as binary vectors: 1 means presence, 0 means absence. Here is a before-and-after example:
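A minimal sketch with pandas (the city values below are made up for illustration):

import pandas as pd

df = pd.DataFrame({"city": ["Athens", "Cairo", "Paris", "Tokyo", "Athens"]})
one_hot = pd.get_dummies(df, columns=["city"])   # one binary column per city
print(one_hot)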

From the above, we can see every city becomes its own feature in the resulting vector. But do you see anything strange? Take a closer look at how many of the entries are zero…

Too many zeros, right? From a machine learning perspective, this is not the best way to encode our categorical variables, for various reasons:

  1. It adds unnecessary dimensionality. As the number of categories grows, this method breaks down.
  2. Information loss. The dot product of any two distinct one-hot vectors is zero, so every pair of categories looks equally dissimilar. If we were to use pre-trained word2vec embeddings instead, cities like London and Paris would end up closer to each other than London and Tokyo.
  3. It introduces multicollinearity, i.e. dependency between the independent variables. The truth is that we can easily predict any of the above columns from the rest: if all the other city columns are 0, the remaining one must be 1. This is bad news for Linear or Logistic Regression, which assume the absence of multicollinearity.

In general, One Hot Encoding is not the best approach to encode variables, but it is not the worst either. If you have a small number of categories, it should be fine; and if you worry about multicollinearity, you can read about how to avoid it on this blog.
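One common remedy (not necessarily the one described in that post) is to drop one dummy column per variable; a small sketch with made-up values:

import pandas as pd

df = pd.DataFrame({"city": ["Athens", "Cairo", "Paris", "Tokyo"]})
# Dropping one dummy column removes the exact linear dependency between the
# one-hot features (the so-called dummy variable trap).
one_hot = pd.get_dummies(df, columns=["city"], drop_first=True)
print(one_hot)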

Label Encoding

Also known as Ordinal Encoding, this type of encoding assigns an integer to each value.

As you can see, we have encoded the column colors by assigning an integer to each value in alphabetical order. However, this makes zero sense. Why should black be 0 and red 4? Algorithms will misinterpret this as some sort of hierarchy/order in colors, which is not the case. Therefore, Label Encoding must be avoided if there is no specific order to the values. There are various alternatives to go for, such as One Hot Encoding or Target Encoding. However, if you have values with a natural order, like hate, dislike, neutral, like, love, you can do the following:
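For instance, a small sketch with pandas and a hand-written mapping (the column name is made up):

import pandas as pd

df = pd.DataFrame({"sentiment": ["love", "hate", "neutral", "like", "dislike"]})

# Explicit, meaningful order instead of an arbitrary alphabetical one
order = {"hate": 0, "dislike": 1, "neutral": 2, "like": 3, "love": 4}
df["sentiment_encoded"] = df["sentiment"].map(order)
print(df)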

Though Label Encoding may seem quite straightforward, the order is not always that obvious and the user must be very cautious about it.

Target Encoding

It is admittedly one of the most powerful ways to encode your categorical variables, and an approach followed by many Kagglers. The idea is quite simple: imagine you have a categorical variable weather and a target variable y, which can be either continuous or binary. Target Encoding will replace each weather category with the average of the corresponding values in y.
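A quick sketch of plain mean encoding with pandas (the weather data is invented):

import pandas as pd

df = pd.DataFrame({
    "weather": ["sunny", "rainy", "sunny", "cloudy", "rainy", "sunny"],
    "y":       [1,        0,       1,       0,        1,       0],
})

means = df.groupby("weather")["y"].mean()       # per-category average of the target
df["weather_encoded"] = df["weather"].map(means)
print(df)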

This transformation results in a single feature; not adding to the dimensionality of the dataset. Additionally, the final representation preserves most of the predictive power of the original categorical variable.

The main problem with this technique is over-fitting. Think about it for a minute. We are replacing a categorical value with the mean of the target. What if a category appeared only 3 times, or 2, or even once? When a value occurs very few times in the dataset, the chances of over-fitting are quite high.

One way to remedy this is to use cross-validation and compute the means in each out-of-fold dataset. Another, more sophisticated technique is additive smoothing, which "smooths" the per-category average by blending it with the mean over all samples. In other words, if the number of observations for a categorical value is small, we rely more on the global average of the target variable. You can have a look at the implementation here.
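Below is a minimal sketch of additive smoothing (not the linked implementation); the weight parameter is a hypothetical smoothing strength:

import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, weight=10):
    # weight: larger values pull rare categories harder towards the global mean
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + weight * global_mean) / (stats["count"] + weight)
    return df[cat_col].map(smooth)

# Reusing the df from the previous sketch:
# df["weather_smoothed"] = smoothed_target_encode(df, "weather", "y")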

Entity Embeddings

Sometimes, we wish to capture the meaning behind the categories while encoding them. In such cases, Entity Embeddings are the way to go. Here, the mapping is learned by a neural network during the standard supervised training process. In the end, similar values are mapped close to each other in the embedding space, which preserves the intrinsic properties of the original categorical variable.

In the following example, we will see how we can easily generate entity embeddings. Imagine we have a dataset df with 2 categorical and 7 continuous variables, and a binary target. As a rule of thumb, the number of dimensions of the output embedding is given by: min(unique_values//2, 50)
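Since df itself is not shown here, below is a minimal sketch with Keras (an assumed framework choice); the column names and cardinalities are made up, with two categorical inputs and seven continuous ones to mirror the setup above:

import tensorflow as tf
from tensorflow.keras import layers

card_a, card_b = 12, 30                  # unique values of the 2 categorical columns (assumed)
dim_a = min(card_a // 2, 50)             # rule of thumb from above
dim_b = min(card_b // 2, 50)

cat_a = layers.Input(shape=(1,), name="cat_a")        # integer-encoded categories
cat_b = layers.Input(shape=(1,), name="cat_b")
cont = layers.Input(shape=(7,), name="continuous")    # the 7 continuous features

emb_a = layers.Flatten()(layers.Embedding(card_a, dim_a)(cat_a))
emb_b = layers.Flatten()(layers.Embedding(card_b, dim_b)(cat_b))

x = layers.Concatenate()([emb_a, emb_b, cont])
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)        # binary target

model = tf.keras.Model(inputs=[cat_a, cat_b, cont], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# After training, the weights of the Embedding layers are the learned entity embeddings.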

This method has quite a lot of advantages. It is suggested when you have high-cardinality features, as other methods will tend to over-fit. You should also expect better performance of your machine learning model. Lastly, you can measure the distance of the categorical values based on their corresponding embeddings. This means that you can visualize them or even use them for clustering purposes.

Similarity Embeddings

What if there were a hierarchical relationship between the values of a categorical variable (country-state-city or principal-senior-junior)? Then we might want to use the Python package dirty_cat, which computes string similarity in the following way:

  1. Transform a sequence of items (categorical values) into sets of 3-grams, which are embedded in a vector space, using the following vectorizer: vectorizer = CountVectorizer(analyzer='char', ngram_range=(3,3))
    The authors claim that 3-grams are a good choice for efficient approximate matching.
  2. Measure the similarity between each sample (N) and each category (K), resulting in an NxK similarity matrix. The similarity is the number of shared 3-grams divided by the number of distinct 3-grams across both strings. For instance, 3-grams(Paris) = {Par, ari, ris} and 3-grams(Parisian) = {Par, ari, ris, isi, sia, ian} have three 3-grams in common out of six distinct ones, so sim(Paris, Parisian) = 3/6 (see the sketch below).
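A rough sketch of the two steps above, using only scikit-learn; the helper function and example strings are my own:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
analyze = vectorizer.build_analyzer()          # maps a string to its character 3-grams

def ngram_similarity(a, b):
    # shared 3-grams divided by all distinct 3-grams of the two strings
    ga, gb = set(analyze(a)), set(analyze(b))
    return len(ga & gb) / len(ga | gb)

categories = ["Paris", "Parisian", "Tokyo"]    # the K reference categories
samples = ["Paris", "Parisian"]                # the N samples to encode

# N x K similarity matrix
sim_matrix = np.array([[ngram_similarity(s, c) for c in categories] for s in samples])
print(sim_matrix)                              # sim(Paris, Parisian) = 3/6 = 0.5

In practice, the package's SimilarityEncoder wraps this kind of computation for you.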

Bonus

Just because you are given a dataset and a prediction task, it does not mean that you must use all the features. Therefore, you may also consider dropping the categorical variable(s) altogether.

Another important step to consider is to convert a continuous variable to a categorical one. Yes, you read that correctly. Although it sounds unorthodox, it might be the last piece to complete the puzzle. This feature engineering technique is called binning and it works like this:

  1. Divide the values into N bins. The bin ranges can either be found by inspecting the histogram or chosen based on certain criteria. For example, if we have a price variable, we can categorize prices as low, average, or high based on the ranges 10-200, 201-800, 801-1000.
  2. Use One-Hot-Encoding to generate the new features; of course, we can use any other type of encoding we think is more suitable (see the sketch right after this list).
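A small sketch of both steps with pandas, reusing the illustrative price ranges from the list (the price values themselves are made up):

import pandas as pd

df = pd.DataFrame({"price": [15, 120, 450, 790, 950]})

# Step 1: bin the continuous variable into the ranges from the list above
df["price_bin"] = pd.cut(df["price"], bins=[10, 200, 800, 1000],
                         labels=["low", "average", "high"])

# Step 2: one-hot encode the resulting categories
df = pd.get_dummies(df, columns=["price_bin"])
print(df)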

Last words…

There are many other techniques to encode categorical variables; I mentioned the ones I found interesting and quite popular. Each and every one of these methods has its own advantages and disadvantages. No approach is deemed the best; it entirely depends on your dataset and task. However, there are some guidelines to follow:

  • If a variable has a lot of categories (high cardinality), then a One-Hot-Encoding is out of the picture as it will introduce sparsity and dimensionality to your dataset
  • Say "no" to Label Encoding unless the categorical values follow an order
  • Entity Embeddings appear nice but donā€™t always work out of the box

I hope you found the above article useful 😊
