Self-Organizing Maps with fast.ai — Step 4: Handling unsupervised data with Fast.ai DataBunch

Riccardo Sayn
Kirey Group
Aug 5, 2020

This is the fourth part of the Self-Organizing Maps with fast.ai article series.

All the code has been published in this repository and this PyPI library.

Overview

Many datasets come in tabular form. For this reason, Fast.ai offers a handy Tabular subclass of its DataBunch that can natively encode categories and handle both continuous and categorical features.

In this article, we will use TabularDataBunch as a starting point to load our data, and then build a conversion function to move it into our UnsupervisedDataBunch.

The main features we’re going to re-implement are:

  • Normalization
  • Categorical feature encoding
  • Export of the SOM codebook into a Pandas DataFrame

Normalization

We will use a separate Normalizer class to perform per-feature normalization. Let’s define a base class:
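
Something along these lines should do (a minimal sketch; the exact signatures in the published library may differ, and normalize_by is an illustrative addition for reusing training statistics on other data):

import torch

class Normalizer:
    "Base class for per-feature normalization strategies."

    def normalize(self, data: torch.Tensor) -> torch.Tensor:
        "Fit normalization statistics on `data` and return its normalized version."
        raise NotImplementedError

    def normalize_by(self, reference: torch.Tensor, data: torch.Tensor) -> torch.Tensor:
        "Normalize `data` using statistics computed on `reference`."
        raise NotImplementedError

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        "Map normalized `data` back into the original value range."
        raise NotImplementedError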

While comparing different normalizers, I found that normalizing by mean and standard deviation helps a lot with SOM convergence, so we’ll extend our Normalizer class into a VarianceNormalizer:
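
For example (a sketch; the small epsilon guards against zero-variance features):

class VarianceNormalizer(Normalizer):
    "Normalizes each feature to zero mean and unit standard deviation."

    def normalize(self, data):
        self.mean, self.std = data.mean(dim=0), data.std(dim=0)
        return (data - self.mean) / (self.std + 1e-8)

    def normalize_by(self, reference, data):
        return (data - reference.mean(dim=0)) / (reference.std(dim=0) + 1e-8)

    def denormalize(self, data):
        return data * (self.std + 1e-8) + self.mean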

Note that we also implemented the denormalization function. Since we are normalizing our data, the trained SOM codebook will contain normalized data points: we will need denormalization in order to retrieve values in the initial data range.

Let’s add the normalizer to our UnsupervisedDataBunch:
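
Here is a simplified stand-in for the DataBunch we built in the previous articles, showing where the normalizer hooks in (only the relevant parts are sketched):

from torch.utils.data import DataLoader, TensorDataset

class UnsupervisedDataBunch:
    "Simplified stand-in for the DataBunch from the previous articles."

    def __init__(self, train_x, valid_x=None, bs=64, normalizer=None):
        self.normalizer = normalizer if normalizer is not None else VarianceNormalizer()
        # Fit the normalizer on the training data
        norm_train = self.normalizer.normalize(train_x)
        self.train_dl = DataLoader(TensorDataset(norm_train, norm_train),
                                   batch_size=bs, shuffle=True)
        if valid_x is not None:
            # Reuse the training statistics for the validation set
            norm_valid = self.normalizer.normalize_by(train_x, valid_x)
            self.valid_dl = DataLoader(TensorDataset(norm_valid, norm_valid),
                                       batch_size=bs)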

Handling categorical features

Another important preprocessing step for Self-Organizing Maps is transforming categorical features into continuous ones; this can be done either by One-Hot encoding the features or by using embeddings. Since One-Hot encoding is the easiest to implement, we’ll start with that, although embeddings generally perform better.

Both methods require a mixed distance function to compare actual continuous features and converted categoricals independently, but we will skip this step for simplicity’s sake. If you’re interested in how a mixed distance function can be implemented, feel free to have a look at the code on Github.

As we did for normalizers, we will start by defining a base CatEncoder class:
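
Again, a minimal sketch:

class CatEncoder:
    "Base class for categorical feature encoders."

    def fit(self, data):
        "Learn encoding parameters from the training set."
        raise NotImplementedError

    def make_continuous(self, data):
        "Encode categorical `data` into continuous features."
        raise NotImplementedError

    def make_categorical(self, data):
        "Decode continuous features back into categories."
        raise NotImplementedError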

And subclass it into a OneHotCatEncoder:
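
Along these lines (a sketch; `data` is expected to be a tensor of integer category codes, one column per feature):

import torch
import torch.nn.functional as F

class OneHotCatEncoder(CatEncoder):
    "One-hot encodes each categorical feature independently."

    def fit(self, data):
        # Store the number of categories of each feature
        self.n_categories = [int(data[:, i].max()) + 1 for i in range(data.shape[1])]

    def make_continuous(self, data):
        # One-hot encode each column and concatenate the results side by side
        return torch.cat([F.one_hot(data[:, i], n).float()
                          for i, n in enumerate(self.n_categories)], dim=1)

    def make_categorical(self, data):
        # Split the encoded block into per-feature chunks and take each argmax
        splits = torch.split(data, self.n_categories, dim=1)
        return torch.stack([s.argmax(dim=1) for s in splits], dim=1)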

All we’re doing here is using torch.nn.functional.one_hot to perform one-hot encoding of our input variables: during fit we store the number of categories of each feature in the training set, and we then use this information to encode with make_continuous and decode with make_categorical.

Importing Pandas DataFrames

One feature we might want for our UnsupervisedDataBunch is the ability to be created from a Pandas DataFrame. As mentioned in the overview above, we will leverage Fast.ai’s TabularDataBunch to do the data loading and preprocessing for us, and then import the data into our own databunch.

A TabularDataBunch is usually created as follows:
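
Something like the following, using the fastai v1 tabular API (here `df`, `cat_names`, `cont_names` and `dep_var` are assumed to be already defined):

from fastai.tabular import TabularList, FillMissing, Categorify

procs = [FillMissing, Categorify]

data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_df(cols=dep_var)
        .databunch(bs=128))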

Creation of a TabularDataBunch from a Pandas DataFrame

The code above does the following:

  • Load the dataframe into a TabularList
  • Split the list into training and validation sets
  • Fill in missing values
  • Turn all categorical features into ordinal categories

Now we can write a conversion function to transform the TabularDataBunch into an UnsupervisedDataBunch. This is where things get hacky: we need to retrieve categorical and continuous features separately, process categorical features using OneHotCatEncoder and then concatenate everything into a single Tensor.
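
A sketch of what this could look like (the way tensors are pulled out of the TabularDataBunch is illustrative and may vary across fastai versions):

import torch
from fastai.tabular import TabularDataBunch

def to_unsupervised_databunch(data, bs=64):
    "Converts a TabularDataBunch into an UnsupervisedDataBunch."
    def extract(ds):
        # Each processed tabular item exposes [category codes, continuous values]
        items = [ds[i][0].data for i in range(len(ds))]
        cats = torch.stack([torch.as_tensor(x[0]) for x in items]).long()
        conts = torch.stack([torch.as_tensor(x[1]) for x in items]).float()
        return cats, conts

    train_cats, train_conts = extract(data.train_ds)
    valid_cats, valid_conts = extract(data.valid_ds)

    # Fit the encoder on the training categoricals, then encode both splits
    encoder = OneHotCatEncoder()
    encoder.fit(train_cats)
    train_x = torch.cat([encoder.make_continuous(train_cats), train_conts], dim=1)
    valid_x = torch.cat([encoder.make_continuous(valid_cats), valid_conts], dim=1)

    # Optional targets (see the updated UnsupervisedDataBunch below)
    train_y = torch.as_tensor(data.train_ds.y.items)
    valid_y = torch.as_tensor(data.valid_ds.y.items)

    return UnsupervisedDataBunch(train_x, valid_x=valid_x,
                                 train_y=train_y, valid_y=valid_y, bs=bs)

# Attach it to TabularDataBunch as an extension method
TabularDataBunch.to_unsupervised_databunch = to_unsupervised_databunch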

TabularDataBunch -> UnsupervisedDataBunch conversion

Since the TabularDataBunch can have a target variable, we are going to add the optional train_y and valid_y arguments to our UnsupervisedDataBunch:
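
Sketching the change on the simplified class from above:

from torch.utils.data import DataLoader, TensorDataset

class UnsupervisedDataBunch:
    "As before, but with optional train_y / valid_y targets."

    def __init__(self, train_x, valid_x=None, train_y=None, valid_y=None,
                 bs=64, normalizer=None):
        self.normalizer = normalizer if normalizer is not None else VarianceNormalizer()
        norm_train = self.normalizer.normalize(train_x)
        # Fall back to using the inputs as their own targets when no y is given
        train_ds = TensorDataset(norm_train, train_y if train_y is not None else norm_train)
        self.train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
        if valid_x is not None:
            norm_valid = self.normalizer.normalize_by(train_x, valid_x)
            valid_ds = TensorDataset(norm_valid, valid_y if valid_y is not None else norm_valid)
            self.valid_dl = DataLoader(valid_ds, batch_size=bs)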

UnsupervisedDataBunch with optional targets

We can now convert any TabularDataBunch by simply using the extension function:
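
Assuming `data` is the TabularDataBunch built earlier:

udb = data.to_unsupervised_databunch(bs=256)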

The next step is testing everything we’ve done so far on an actual dataset.

Training on House Prices dataset

I chose the House Prices dataset, since it is well known and has a good number of categorical features that we can use to test our data workflow. You can find the dataset on Kaggle, among other places.

Let’s start from a CSV file containing the training set and go from there:
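
Putting everything together might look roughly like this (SomLearner comes from the previous articles; the map size and number of epochs are illustrative):

import pandas as pd
from fastai.tabular import TabularList, FillMissing, Categorify

df = pd.read_csv('train.csv')

dep_var = 'SalePrice'
cat_names = df.select_dtypes(include='object').columns.tolist()
cont_names = [c for c in df.columns if c not in cat_names + [dep_var, 'Id']]

data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                            procs=[FillMissing, Categorify])
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_df(cols=dep_var)
        .databunch())

udb = data.to_unsupervised_databunch(bs=256)
learn = SomLearner(udb, size=(10, 10))
learn.fit(100)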

Pretty neat, right? In just about 40 lines of code we got ourselves a trained Self-Organizing Map 😊

Here’s the loss plot:

Loss plot on House Prices dataset

Creating a DataFrame from the SOM codebook

One of the best things about Self-Organizing Maps is the ability to run another model (trained on the same dataset) over the codebook elements, and then plot the predicted values / classes for each item on the map.

To do so, we could write a codebook_to_df function inside our SomLearner:
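
A sketch of the idea, assuming the learner keeps references to the databunch’s normalizer, encoder and original column names (the attribute names here are illustrative):

import pandas as pd
import torch

def codebook_to_df(self):
    "Exports the SOM codebook as a Pandas DataFrame."
    # Flatten the (rows, cols, n_features) codebook into (rows * cols, n_features)
    weights = self.model.weights.view(-1, self.model.weights.shape[-1]).cpu()
    # Undo input normalization to get back to the original data range
    weights = self.data.normalizer.denormalize(weights)
    # Split the one-hot block from the continuous features and decode it
    n_onehot = sum(self.data.cat_encoder.n_categories)
    cats = self.data.cat_encoder.make_categorical(weights[:, :n_onehot])
    conts = weights[:, n_onehot:]
    # Note: categories are still ordinal codes here; mapping them back to the
    # original labels is left out for brevity
    values = torch.cat([cats.float(), conts], dim=1).numpy()
    return pd.DataFrame(values, columns=self.cat_names + self.cont_names)

SomLearner.codebook_to_df = codebook_to_df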

Creating a DataFrame from SOM codebook elements

Now we need a model to use for prediction. As a starting point, let’s use this Kaggle submission of House Prices regression with a Fast.ai Tabular learner:
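
Roughly, following the usual fastai v1 tabular workflow (hyperparameters are illustrative):

from fastai.tabular import (TabularList, FillMissing, Categorify, Normalize,
                            FloatList, tabular_learner)
from fastai.metrics import exp_rmspe

tab_data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                procs=[FillMissing, Categorify, Normalize])
            .split_by_rand_pct(valid_pct=0.2)
            .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
            .databunch())

tab_learn = tabular_learner(tab_data, layers=[200, 100], metrics=exp_rmspe)
tab_learn.fit_one_cycle(5)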

Training a tabular model

Running regression on SOM codebook

Now that we have a trained regressor, let’s generate the DataFrame of the SOM codebook and use it as a test set:
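
One simple (if slow) way is to run the model row by row over the exported DataFrame; the (10, 10) grid matches the map size used above:

import torch

codebook_df = learn.codebook_to_df()

# Predict a price for each codebook element (one row per SOM unit)
preds = torch.stack([tab_learn.predict(row)[2] for _, row in codebook_df.iterrows()])

# Reshape the flat predictions back onto the map grid
preds = preds.view(10, 10)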

Regression on SOM codebook

Now we can use plt.imshow() on predictions to get a visualization of house price distribution over the SOM 😊
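
For example:

import matplotlib.pyplot as plt

plt.imshow(preds.numpy(), cmap='viridis')
plt.colorbar()
plt.show()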

Tabular model predictions on SOM codebook

This is cool, right? In the next article we will complete our SOM toolkit by adding a whole lot of visualization and interpretation utilities, basing our API on Fast.ai’s ClassificationInterpretation class.

Note: the library code for UnsupervisedDataBunch has since been rewritten to use Fast.ai’s TabularDataBunch with additional transforms. This article builds the DataBunch from scratch instead, and has been left untouched for easier understanding.
