Using entity embeddings with FastAI (v1 and v2!)

Published in Codon Consulting · May 7, 2020

by Mikael Huss

The FastAI library’s built-in functionality for tabular data classification and regression, based on neural networks with categorical embeddings, makes it possible to experiment rapidly and get good results with little effort.

Intro

In this blog post, which is a follow-up to my Tabular data and deep learning: Where do we stand? blog post, I will talk about the idea of entity embeddings and show how to use them with the FastAI package. The figures are taken from the excellent chapter on tabular data in the upcoming book on FastAI.

Entity (categorical) embeddings

In 2018, I watched a great FastAI lecture by Jeremy Howard where he argued that using neural networks with something called “entity embeddings” was an easy way to get onto Kaggle leaderboards for tabular data problems. I liked the concept and started implementing it in my own projects. (Later, FastAI followed up with a blog post on the same topic.)

Entity embeddings are a way to encode categorical variables, that is, non-numerical variables that take their values from some fixed set. It could be, for instance, cities in France, or weekdays, or car brands. Such variables can be encoded in different ways for use in machine learning systems, and sometimes pose trouble, especially when they have a large number of possible values.
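To make the dimensionality problem concrete, here is a tiny pandas illustration (my own example, not from the original post): one-hot encoding creates one column per distinct value, which quickly becomes unwieldy for high-cardinality variables, whereas an embedding maps each value to a short dense vector regardless of how many distinct values there are.

import pandas as pd

# Hypothetical data frame with a single categorical column
df_demo = pd.DataFrame({"city": ["Paris", "Lyon", "Marseille", "Paris", "Nice"]})

# One-hot encoding: one column per distinct city
one_hot = pd.get_dummies(df_demo["city"])
print(one_hot.shape)  # (5, 4) here; for all cities in France it would be (n, thousands)

# An entity embedding would instead represent each city as, say, a 10-dimensional
# dense vector, independent of how many distinct cities there are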

Make it numerical

The idea behind entity embeddings is to learn to encode each value of the categorical variable as a numerical vector, often with a fairly low dimensionality. The embeddings are learned during training, as a “side effect” of trying to solve a classification or regression problem.

This concept is similar to word vectors and to user or product profiles in collaborative filtering. Think about something like Netflix or Spotify: even though you have watched thousands of movies or listened to tens of thousands of songs, your taste profile can probably be described in far fewer dimensions than that; maybe as some specific positioning of fewer than ten different sliders that encode your genre preferences, a weakness for movies or songs from certain countries, comedy vs. drama or uptempo vs. downtempo, and so on.

Importantly, categorical variables that are similar will get similar embeddings, which can often lead to insights into the properties of the categorical variables when you plot a two-dimensional projection of the embeddings.

For example, in the image shown below (from the FastAI book), the learned embeddings for a categorical variable for cities and areas in Germany turn out to be arranged similarly to their geographical layout in a 2D plot.

The first paper I read that described this explicitly (although the same ideas had been used earlier) was this 2016 paper, which details a successful solution to a Kaggle competition.

How does one use entity embeddings?

The idea of categorical embeddings is already pretty established, and the various deep learning libraries all have their own versions of this. For example, Keras has special Embedding layers, as does PyTorch. These are usually implemented as lookup tables that map integer values to vectors of floats. To use these embedding layers, you first need to encode your categorical variable with integer values. Each of these integers will then correspond to a vector representation of the corresponding category. The exact values in this vector representation are learned via gradient descent while training the whole model on some task.

Each variable that you wish to embed would have its own embedding layer. The resulting representations are then usually concatenated and fed to the next layer (often a dense layer).

Entity embedding layers (image from https://github.com/fastai/fastbook/blob/master/09_tabular.ipynb)

In Keras and PyTorch, the user needs to specify the range of possible integer values that can be input to the embedding layer, as well as the dimensionality of the embedding.

Here is a blog post that shows the general approach for using Keras embedding layers for category embeddings. And here is one for PyTorch.
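As a rough sketch of what this looks like in plain PyTorch (my own illustration, not code from FastAI or the posts linked above): each categorical variable gets its own embedding layer, sized by its number of possible values and a chosen embedding dimension; the looked-up vectors are concatenated and fed to a dense layer.

import torch
import torch.nn as nn

# Two hypothetical categorical variables: "weekday" (7 values) and "city" (1000 values)
weekday_emb = nn.Embedding(num_embeddings=7, embedding_dim=4)
city_emb = nn.Embedding(num_embeddings=1000, embedding_dim=16)

# A batch of two rows, with each categorical variable already integer-encoded
weekday_idx = torch.tensor([0, 5])   # e.g. Monday, Saturday
city_idx = torch.tensor([42, 317])   # indices into the city vocabulary

# Look up the vectors and concatenate them into one feature vector per row
x = torch.cat([weekday_emb(weekday_idx), city_emb(city_idx)], dim=1)  # shape (2, 20)

# The concatenated embeddings are fed to an ordinary dense layer; the embedding
# weights are trained by gradient descent together with the rest of the network
hidden = torch.relu(nn.Linear(20, 50)(x))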

An alternative is to use FastAI, which has built-in support for tabular data and provides sensible defaults that usually give decent results out of the box — this is one of the design principles of the library. FastAI is currently being updated to version 2, which is in some ways quite different — this preprint article explains how learnings from the first version have informed the new one.

Let’s look at some examples of how to use FastAI on tabular data! We will again use the Adult dataset, just as in the posts about TabNet and NODE.

All the code is available in a Colaboratory notebook: https://colab.research.google.com/drive/1UUp5U8KGPYJ7AuJJ6G5BsDd239AJP9wh

FastAI 1

The fastai library is pre-installed on Colab. If you are in a different environment, a pip install fastai should do the trick; if it doesn’t, please check this page for other possibilities.

Most of FastAI’s code examples use star imports (importing many functions at once), which is generally not considered good practice, but as the FastAI authors explain in the paper about FastAI v2: “The library is carefully designed to ensure that importing in this way only imports the symbols that are actually likely to be useful to the user and avoids cluttering the namespace or shadowing important symbols”, so we will not be shy about doing the same.

from fastai.tabular import *

Read a CSV with the Adult dataset and shuffle it:

df = pd.read_csv("https://docs.google.com/uc?id=10eFO2rVlsQBUffn0b7UCAp28n0mkLCy7&export=download")
df = df.sample(frac=1, random_state=42)

Now we need to specify some parameters in order to tell FastAI how to process the data.

dep_var = '<=50K'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

In the first three lines, we define which columns in the data frame are the dependent variable (i.e. the variable to predict, the y), which are the categorical variables, and which are the continuous (numerical) variables.

The fourth and last line defines the preprocessing functions we want to apply to the data. Categorify transforms the columns specified via cat_names to a categorical data type; FillMissing fills in any missing values in continuous variables using one of the available strategies (the default is to fill with the median and add an extra indicator column marking which values were missing; a nice way to get non-lossy imputation); and Normalize applies standardization (mean subtraction and division by the standard deviation).
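To illustrate the FillMissing strategy in isolation, here is a small pandas sketch of the idea (my own illustration, not FastAI’s actual implementation): missing values in a continuous column are replaced by the column median, and an extra boolean column records where they were missing, so no information is thrown away.

import numpy as np
import pandas as pd

# Hypothetical continuous column with missing values
tmp = pd.DataFrame({"age": [25.0, np.nan, 47.0, 33.0, np.nan]})

# Add an indicator column before filling, so we remember which values were missing
tmp["age_na"] = tmp["age"].isna()

# Fill the missing values with the column median
tmp["age"] = tmp["age"].fillna(tmp["age"].median())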

Using a test set

In this blog post, I will also show what I have gleaned about using a test set in FastAI v1 and v2. Using a validation set is well supported throughout the different workflows, but the test set has been handled in different ways in different versions. As far as I can tell, the way to introduce a test set in FastAI v1 is to create two different objects from our data frame. FastAI v1 has a class called TabularList, which can be used to define the training, validation and test sets. For the test set, we can just send in the relevant portion of the data frame and the names of the categorical and continuous variables. For the training and validation sets, we also need to provide the preprocessing functions, a labeling function, and a specification for how to split the data into training and validation. We also add the test set using the add_test() function (important!). Finally, the resulting TabularList is converted into a DataBunch for training.

test_size = 1000
val_size = 10000
test = TabularList.from_df(df.iloc[-test_size:].copy(), path='.', cat_names=cat_names, cont_names=cont_names)
train_val = (TabularList.from_df(df, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(df.shape[0]-test_size-val_size, df.shape[0]-test_size)))
.label_from_df(cols=dep_var)
.add_test(test) # Here is where the test set is added
.databunch())

Note that this version of FastAI uses a “fluent” interface where operators are chained: after a TabularList is created via the from_df() function, it is split with split_by_idx(), the results of which are sent into label_from_df(), and so on. Another way to say it is that results are piped between functions.

Great, now we have an object for the training (and validation) data and an object for the test data. If we want to, we can look at a batch of data with

train_val.show_batch(rows=10)
A sample batch from the dataset

So what about the model?

Training the model

The beauty of the FastAI library is that model training is very simple. In this case we can just invoke tabular_learner and give it our training/validation data as well as the number of layers and nodes we would like in our neural network.

learn = tabular_learner(train_val, layers=[200,100], metrics=accuracy)

Given this, we can use the built-in functionality to find a suitable learning rate:

learn.lr_find()
learn.recorder.plot()

We get something like this:

Choosing a good learning rate from this type of plot is not an exact science. One rule of thumb I have heard is to pick the value at the steepest decline of the loss, which is somewhere around 1e-02 (0.01) in this plot; another is to go for the learning rate at the minimum loss, which would be somewhere around 0.4 here.

Or we can just disregard all of that and choose 0.001 :D
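If you would rather not eyeball the plot at all, FastAI v1’s learning rate finder can also mark a suggested value (the point of steepest descent). As far as I remember, it works roughly like the sketch below; treat the exact attribute names as an assumption on my part rather than gospel.

learn.lr_find()
learn.recorder.plot(suggestion=True)       # marks the point of steepest descent in the plot
suggested_lr = learn.recorder.min_grad_lr  # the suggested value, which could be passed to fit_one_cycle()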

Once you have a tabular_learner object (called learn in our case), you can just use the fit() method to train it, or actually, let’s use fit_one_cycle(). This way of fitting the model uses the 1-cycle policy, which has proved to be an efficient way of training neural networks.

learn.fit_one_cycle(5, 1e-3)

This will print out some training and validation metrics such as the following:

FastAI v1: Metrics for training and validation sets

To see the performance on the test set, we can use get_preds() with DatasetType.Test; we can do this because the test set was explicitly added when we created the train_val object that learn was fitted on. (Yes, the whitespace in front of >50K below is supposed to be there!)

preds = learn.get_preds(ds_type=DatasetType.Test)
y_pred = np.argmax(preds[0], axis=1).numpy()
y_true = [int(i) for i in df.iloc[-test_size:]['<=50K']==' >50K']
sum(y_pred == y_true) / len(y_pred)

This yielded an accuracy of 83.6% in my trial run.

Inspecting the embeddings

Can we visualize some of our embeddings? I confess that this section contains some guesswork on my part. We can list the embedding layers used by our model with:

learn.model.embeds

which prints

ModuleList(
  (0): Embedding(10, 6)
  (1): Embedding(17, 8)
  (2): Embedding(8, 5)
  (3): Embedding(16, 8)
  (4): Embedding(7, 5)
  (5): Embedding(6, 4)
)

These are the input and output sizes of each embedding layer. Typically, the input size is the number of possible values for the variable in the training set plus one; the extra slot is used when encountering a previously unseen value.

We can find the possible values of a categorical feature — say, “occupation” — by using Pandas’ categorical data type (this is what happens in the FastAI preprocessing of tabular data):

list(df['occupation'].astype('category').cat.categories.values)

This will print a list of 15 occupations — so we should have an input size of 16, corresponding to index 3 in the ModuleList above.
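The output sizes, on the other hand, are picked by a FastAI heuristic. If I read the library source correctly, the default rule is roughly the one sketched below (the exact formula may differ between versions), and it does reproduce the sizes printed above.

def emb_sz_rule(n_cat):
    # FastAI's default embedding size heuristic, as I understand it:
    # a sub-linear function of the cardinality, capped at 600 dimensions
    return min(600, round(1.6 * n_cat ** 0.56))

emb_sz_rule(16)  # 8, matching Embedding(16, 8) for "occupation"
emb_sz_rule(10)  # 6, matching Embedding(10, 6) at index 0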

I am omitting some code here — it is all available in the Colab notebook. In any case, given that you know the index you are interested in, you can fish out the embedding matrix by doing something like

emb_mx = to_np(next(learn.model.embeds[ix].parameters()))

The multi-dimensional representations of each category value can then be visualized after projecting them down with t-SNE or PCA (we use the latter here).
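Since I skip the plotting code in the post itself, here is a rough sketch of how the PCA projection could be produced with scikit-learn and matplotlib (my own version; the Colab notebook may do it slightly differently, and I am assuming that row 0 of the embedding matrix corresponds to the extra “unknown” slot).

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the embedding matrix (one row per category value) down to two dimensions
emb_2d = PCA(n_components=2).fit_transform(emb_mx)

# Assumption: row 0 is the extra "unknown/missing" slot, the rest follow the pandas category order
labels = ['#na#'] + list(df['occupation'].astype('category').cat.categories.values)

plt.scatter(emb_2d[:, 0], emb_2d[:, 1])
for (x, y), name in zip(emb_2d, labels):
    plt.annotate(name, (x, y))
plt.show()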

In the plot below, we see how the occupation embeddings, summarized via PCA, relate to each other.

PCA of learned embeddings for the “occupation” variable in the Adult dataset

FastAI 2

I decided to try training the model in FastAI v2 as well. It’s scheduled to be officially released during the summer of 2020 and is expected to be an improvement over v1, with even better callback functionality, modular optimizers, and other things.

As v2 is still under heavy development, the authors recommend doing an editable install. However, this also means that the code below might not work if the code base has diverged too much between the time I published this and the time you try it. So no guarantees!

This is what an editable install could look like in Colaboratory:

!git clone https://github.com/fastai/fastai2
import os
os.chdir('fastai2')
!pip install -e ".[dev]" > /dev/null

Luckily, the initial setup, before creating the data loaders, is pretty similar to how things work in FastAI v1.

from fastai2.basics import *
from fastai2.tabular.all import *
import numpy as np
import pandas as pd
df = pd.read_csv("https://docs.google.com/uc?id=10eFO2rVlsQBUffn0b7UCAp28n0mkLCy7&export=download")
df = df.sample(frac=1, random_state=42)
test_size = 1000
df_main, df_test = df.iloc[:-test_size].copy(), df.iloc[-test_size:].copy()
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

The only significant addition here is the RandomSplitter() function, which we use to create the training/validation split. Note that we also split the original data frame into training and test subsets at an earlier stage than before.

Data loaders in FastAI v2

However, the data loaders in FastAI v2 are defined in a different way than in v1. There is a new class called TabularPandas, which we first use to wrap the tabular data (the actual data loaders are created from it in the next step).

to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="<=50K", splits=splits)

The same functionality (preprocessing, split, labeling) is used, but via function input parameters rather than the “fluent” interface of v1.

Now we can make a tabular_learner (in this case, we also need to call the dataloaders() function on the TabularPandas object; this creates the training and validation data loaders according to the split we defined earlier):

learn = tabular_learner(to.dataloaders(), layers=[200,100], metrics=accuracy, opt_func=ranger)

And as with v1, we can run a learning rate finder to identify a good value for the learning rate. In v2, you don’t need a separate plotting command — a single command will run the LR finder and show the plot (and even recommend some good values to spare you from interpreting the plot).

learn.lr_find()

Let’s train this model with the same learning rate as before:

learn.fit_one_cycle(5, 1e-3)

Now we can bring in the test set. One way to do it is to add a test set to the learn object using the .dls.test_dl() function, which takes a standard Pandas data frame as input. When you do this, the appropriate preprocessing transformations will be applied to the df_test data frame.

dl = learn.dls.test_dl(df_test)

Now we can evaluate the model against the test set in one go using validate():

learn.validate(dl=dl)

This will print out a loss and an accuracy. Just to make sure that validate() is doing what we expect, we can also compute the accuracy a bit more manually, again using get_preds() as in the v1 example:

preds = learn.get_preds(dl=dl)
y_pred = np.argmax(preds[0], axis=1).numpy()
y_true = [int(i) for i in df_test['<=50K']==' >50K']
sum(y_pred == y_true) / len(y_pred)

Indeed, the result is identical to that of validate().

Conclusion

Entity, or categorical, embeddings have enabled neural network models to approach tree ensemble performance on tabular data problems in recent years. While Keras/Tensorflow and PyTorch have the necessary functionality for using entity embeddings, FastAI probably has the most straightforward way of defining and iterating on such models. Whether its performance can match gradient boosting (e.g. xgboost, LightGBM and CatBoost) or the other types of deep learning models for tabular data (NODE, TabNet) remains to be seen, although FastAI lecturer Zach Mueller and others have spent some time benchmarking FastAI and competing approaches (including TabNet) on a couple of tabular datasets here.

About the author, Mikael Huss

Mikael Huss is a senior data scientist and co-founder of Codon Consulting, and holds a PhD in computational neuroscience and an associate professorship in bioinformatics. Mikael works with, and likes to blog about, machine learning and deep learning. Apart from his 15+ years of academic research, he has broad experience applying machine learning in industries such as retail, manufacturing, and medical imaging. Before joining Codon Consulting, he was a data scientist at Peltarion.
