Categorical Embedder: Encoding Categorical Variables via Neural Networks

Shivanand Roy · Analytics Vidhya · Feb 6, 2020

Before you —

pip install categorical_embedder

Let me talk about it first.

We know that machine learning models love numeric data. We convert our input (text, image, speech, etc.) to numbers before feeding it to ML models. Models like CatBoost and LightGBM do handle categorical variables, but under the hood they also convert them to numbers through different techniques before the actual training starts. Methods like dummy encoding, label encoding, target encoding, and frequency encoding serve our purpose, but can we do better? Is there a way to encode our categorical variables that better explains our target variable?

The answer is Yes! We can use neural networks to better represent our categorical variables in the form of embeddings.

What’s an embedding?

“An embedding is a fixed-length vector representation of a category”

Embedding simply means representing a category with a fixed set of numbers. Let’s say I have a dataset and it has a column ‘subject’ which has 5 unique values: Physics, Chemistry, Biology, Geography, and Mathematics. Then, we can represent these categories with a set of numbers (let’s say 3):
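The mapping might look like this purely illustrative Python dictionary (the numbers below are made up; in practice they are learned from the data):

subject_embeddings = {
    'Physics':     [ 0.91, -0.12,  0.40],
    'Chemistry':   [ 0.87, -0.05,  0.35],
    'Biology':     [ 0.64,  0.20,  0.28],
    'Geography':   [-0.60,  0.72,  0.10],
    'Mathematics': [ 0.15, -0.80,  0.55],
}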

Why 3? I’ll come to that later.

Traditional Approach:

Traditionally, we convert categorical variables into numbers by

  • One hot encoding
  • Label encoding

In one hot encoding, we build as many features as there are unique categories in that column; for every row, we assign a 1 to the feature representing that row’s category and mark the rest of the features 0. This technique becomes problematic when a feature has a lot of categories (unique values), leading to very sparse data. And since every one-hot vector is equidistant from every other, any relationship between categories is lost.
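A quick sketch of one hot encoding with pandas (get_dummies is the standard helper for this):

import pandas as pd

df = pd.DataFrame({'subject': ['Physics', 'Chemistry', 'Biology', 'Geography', 'Mathematics']})
one_hot = pd.get_dummies(df['subject'], prefix='subject')
print(one_hot)
# 5 unique values -> 5 new columns; each row has a single 1 and four 0s,
# and every pair of rows is equally far apart, so no similarity information is kept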

Another approach to converting categorical features to numbers is label encoding, which simply maps each value in a column to an integer. This technique is very simple but induces an artificial ordering between categories, because it assigns them a number sequence. The model might think Chemistry has higher precedence than Physics and, similarly, that Biology carries more weight than Chemistry (which is actually not the case).
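A minimal label encoding sketch with scikit-learn:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'subject': ['Physics', 'Chemistry', 'Biology', 'Geography', 'Mathematics']})
le = LabelEncoder()
df['subject_encoded'] = le.fit_transform(df['subject'])
print(df)
# Biology -> 0, Chemistry -> 1, Geography -> 2, Mathematics -> 3, Physics -> 4
# the model may read this as Physics > Mathematics > ... which is meaningless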

How do categorical embeddings work?

First, each category of a categorical variable is mapped to an n-dimensional vector. This mapping is learned by a neural network during a standard supervised training process. Continuing with our example, if we want to use the ‘subject’ column as a feature, we train a neural network in a supervised manner, obtain a vector for each category, and end up with a 5x3 matrix as below.

After that, we replace each category in our data with its corresponding vector.
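Here is a minimal sketch of that idea with a Keras Embedding layer. This is not the categorical_embedder code, just an illustration; the toy data, target, and architecture are made up:

import numpy as np
import tensorflow as tf

# 'subject' label-encoded to 0..4, with a toy binary target
subject_ids = np.array([0, 1, 2, 3, 4, 0, 2, 4])
target = np.array([1, 0, 0, 1, 1, 1, 0, 1])

inp = tf.keras.layers.Input(shape=(1,))
emb = tf.keras.layers.Embedding(input_dim=5, output_dim=3)(inp)  # 5 categories -> 3-dimensional vectors
x = tf.keras.layers.Flatten()(emb)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(subject_ids, target, epochs=10, verbose=0)

# the learned 5x3 embedding matrix, one row per subject
subject_matrix = model.layers[1].get_weights()[0]
print(subject_matrix.shape)  # (5, 3)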

Why are Categorical Embeddings a better alternative?

  • We limit the number of columns needed to represent a category. This is especially useful when a column has high cardinality.
  • The embeddings learned by the neural network reveal the intrinsic properties of a categorical variable: similar categories end up with similar embeddings (see the sketch below).
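For example, if Physics and Chemistry behave similarly with respect to the target, their learned vectors should end up close together. A quick sketch with made-up vectors:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical learned vectors (illustrative values only)
physics   = np.array([0.91, -0.12, 0.40])
chemistry = np.array([0.87, -0.05, 0.35])
geography = np.array([-0.60, 0.72, 0.10])

print(cosine_similarity(physics, chemistry))  # close to 1 -> similar categories
print(cosine_similarity(physics, geography))  # much lower -> dissimilar categories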

Package: Categorical Embedder

pip install categorical_embedder

You can generate embeddings for categorical variables in your data with the help of this package:

Below is a simple code snippet to generate categorical embeddings:

import pandas as pd
import categorical_embedder as ce
from sklearn.model_selection import train_test_split

df = pd.read_csv('HR_Attrition_Data.csv')
X = df.drop(['employee_id', 'is_promoted'], axis=1)
y = df['is_promoted']

embedding_info = ce.get_embedding_info(X)
X_encoded, encoders = ce.get_label_encoded_data(X)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y)

embeddings = ce.get_embeddings(X_train, y_train,
                               categorical_embedding_info=embedding_info,
                               is_classification=True, epochs=100, batch_size=256)

A more detailed Jupyter notebook can be found here: Categorical Embedder: Example Notebook

What’s inside categorical embedder?

  • ce.get_embedding_info(data, categorical_variables=None): This function identifies all categorical variables in the data and determines their embedding sizes. The embedding size of a column is the smaller of 50 and half the number of its unique values, i.e. embedding size = min(50, # unique values in that column / 2); see the sketch after this list. You can pass an explicit list of categorical variables via the categorical_variables parameter; if None, the function automatically picks up all variables with data type object.
  • ce.get_label_encoded_data(data, categorical_variables=None): This function label encodes (integer encodes) all categorical variables using sklearn.preprocessing.LabelEncoder and returns a label-encoded dataframe for training. Keras/TensorFlow (or any other deep learning library) expects the data in this format.
  • ce.get_embeddings(X_train, y_train, categorical_embedding_info=embedding_info, is_classification=True, epochs=100, batch_size=256): This function trains a shallow neural network and returns the embeddings of the categorical variables. Under the hood, it is a 2-layer neural network with 1000 and 500 neurons and ReLU activation. It takes four required inputs: X_train, y_train, categorical_embedding_info (the output of get_embedding_info), and is_classification (True for classification tasks, False for regression tasks).
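As a quick sketch of the sizing rule from the first bullet (the exact rounding inside the package may differ slightly; this is just the rule as described):

def embedding_size(n_unique):
    # half the number of unique values, capped at 50
    return int(min(50, (n_unique + 1) // 2))

print(embedding_size(5))    # 3  -- which is why the 'subject' example above used 3 numbers
print(embedding_size(200))  # 50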

For classification: loss = 'binary_crossentropy' and metrics = 'accuracy'; for regression: loss = 'mean_squared_error' and metrics = 'r2'.
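Putting the pieces together, here is a hedged sketch of what such a shallow network could look like in Keras. It is not the package's actual source; it assumes embedding_info maps each column to (number of unique values, embedding size), which may differ from the package's internal format:

import tensorflow as tf

def build_embedding_model(embedding_info, is_classification=True):
    # one Embedding input per categorical column
    inputs, embedded = [], []
    for col, (n_unique, emb_size) in embedding_info.items():
        inp = tf.keras.layers.Input(shape=(1,), name=col)
        emb = tf.keras.layers.Embedding(input_dim=n_unique, output_dim=emb_size)(inp)
        inputs.append(inp)
        embedded.append(tf.keras.layers.Flatten()(emb))

    # 2-layer network with 1000 and 500 ReLU neurons, as described above
    x = embedded[0] if len(embedded) == 1 else tf.keras.layers.Concatenate()(embedded)
    x = tf.keras.layers.Dense(1000, activation='relu')(x)
    x = tf.keras.layers.Dense(500, activation='relu')(x)

    if is_classification:
        out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
        loss, metrics = 'binary_crossentropy', ['accuracy']
    else:
        out = tf.keras.layers.Dense(1, activation='linear')(x)
        loss, metrics = 'mean_squared_error', None  # the package reports r2 for regression

    model = tf.keras.Model(inputs, out)
    model.compile(optimizer='adam', loss=loss, metrics=metrics)
    return model

After training such a network, the weights of each Embedding layer are the category vectors that get returned as the embeddings.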

Please find the GitHub repo and example notebook here.
