Self-Organizing Maps with Fast.ai - Step 4: Handling unsupervised data with Fast.ai DataBunch

Riccardo Sayn
Published in Kirey Group
Aug 5, 2020

This is the fourth part of the Self-Organizing Maps with Fast.ai article series.

All the code has been published in this repository and this PyPI library.


Many datasets come in tabular form. For this reason, Fast.ai has a handy Tabular subclass for its DataBunch that can natively perform categorization and handle continuous and categorical features.

In this article, we will use TabularDataBunch as a starting point to load our data, and then build a conversion function to move it into our UnsupervisedDataBunch.

The main features we’re going to re-implement are:

  • Normalization
  • Categorical feature encoding
  • Export of the SOM codebook into a Pandas DataFrame


Normalization

We will use a separate Normalizer class to perform per-feature normalization. Let’s define a base class:

While comparing different normalizers, I found out that normalizing by mean and standard deviation helps a lot with SOM convergence, so we’ll extend our Normalizer class into a VarianceNormalizer:

Note that we also implemented the denormalization function. Since we are normalizing our data, the trained SOM codebook will contain normalized data points: we will need denormalization in order to retrieve values in the initial data range.
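As a rough illustration of the two classes described above, here is a minimal sketch in plain PyTorch. The method names (fit, normalize, denormalize) and the eps parameter are assumptions made for this example, not necessarily the ones used in the library.

```python
import torch


class Normalizer:
    """Base interface: subclasses learn statistics in fit() and apply them later."""

    def fit(self, x: torch.Tensor) -> "Normalizer":
        raise NotImplementedError

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def denormalize(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class VarianceNormalizer(Normalizer):
    """Per-feature standardization: (x - mean) / std, computed column by column."""

    def __init__(self, eps: float = 1e-8):
        self.eps = eps  # assumption: small constant to avoid dividing by zero
        self.mean = None
        self.std = None

    def fit(self, x):
        # Statistics are computed over the sample dimension, one value per feature.
        self.mean = x.mean(dim=0, keepdim=True)
        self.std = x.std(dim=0, keepdim=True)
        return self

    def normalize(self, x):
        return (x - self.mean) / (self.std + self.eps)

    def denormalize(self, x):
        # Exact inverse of normalize(), used to map codebook values back
        # into the original data range.
        return x * (self.std + self.eps) + self.mean


# Toy data: 3 samples, 2 features on very different scales.
train = torch.tensor([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
normalizer = VarianceNormalizer().fit(train)
normalized = normalizer.normalize(train)
restored = normalizer.denormalize(normalized)
```

After normalization each column has zero mean, and denormalize brings the values back to their initial range, which is what we need later for the SOM codebook.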

Let’s add the normalizer to our UnsupervisedDataBunch:
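The snippet below is a simplified stand-in for that idea, not the library's UnsupervisedDataBunch: SimpleUnsupervisedData and MeanStdNormalizer are hypothetical names, and plain PyTorch DataLoaders are used instead of a Fast.ai DataBunch. It only shows where the normalizer could plug in: fit on the training split, then apply the same statistics to both splits.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class MeanStdNormalizer:
    """Tiny normalizer, included only to make this example runnable on its own."""

    def fit(self, x):
        self.mean = x.mean(dim=0, keepdim=True)
        self.std = x.std(dim=0, keepdim=True)
        return self

    def normalize(self, x):
        return (x - self.mean) / self.std


class SimpleUnsupervisedData:
    """Hypothetical stand-in for UnsupervisedDataBunch (not the library class)."""

    def __init__(self, train, valid=None, bs=64, normalizer=None):
        self.normalizer = normalizer
        if normalizer is not None:
            # Fit on the training split only, then reuse the statistics
            # for the validation split to avoid leaking information.
            normalizer.fit(train)
            train = normalizer.normalize(train)
            if valid is not None:
                valid = normalizer.normalize(valid)
        self.train_dl = DataLoader(TensorDataset(train), batch_size=bs, shuffle=True)
        self.valid_dl = None
        if valid is not None:
            self.valid_dl = DataLoader(TensorDataset(valid), batch_size=bs)


train = torch.tensor([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
valid = torch.tensor([[2.0, 20.0]])
data = SimpleUnsupervisedData(train, valid, bs=4, normalizer=MeanStdNormalizer())
train_batch = next(iter(data.train_dl))[0]
valid_batch = next(iter(data.valid_dl))[0]
```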

Handling categorical features

Another important preprocessing step for Self-Organizing Maps is transforming categorical features into continuous ones; this can be done either by One-Hot encoding the features or by using embeddings. Since One-Hot encoding is the easiest to implement, we’ll start with that, although embeddings generally perform better.

Both methods require a mixed distance function to compare actual continuous features and converted categoricals independently, but we will skip this step for simplicity’s sake. If you’re interested in how a mixed distance function can be implemented, feel free to have a look at the code on GitHub.

As we did for normalizers, we will start by defining a base CatEncoder class:

And subclass it into a OneHotCatEncoder:

All we’re doing here is using torch.nn.functional.one_hot to perform one-hot encoding of our input variables, storing the number of categories for each feature in the training set during fit, and then using this information to perform encoding with make_continuous and decoding with make_categorical.
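To make that description concrete, here is a minimal sketch of such an encoder pair. It follows the method names mentioned above (fit, make_continuous, make_categorical); the n_categories attribute and the assumption that inputs are ordinal codes starting at 0 are choices made for this example.

```python
import torch
import torch.nn.functional as F


class CatEncoder:
    """Base interface for turning categorical codes into continuous features."""

    def fit(self, cats: torch.Tensor) -> "CatEncoder":
        raise NotImplementedError

    def make_continuous(self, cats: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def make_categorical(self, conts: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class OneHotCatEncoder(CatEncoder):
    def __init__(self):
        self.n_categories = []  # one entry per categorical feature

    def fit(self, cats):
        # cats: LongTensor of shape (n_samples, n_cat_features), codes >= 0.
        # The number of categories is inferred from the largest code seen.
        self.n_categories = [int(cats[:, i].max()) + 1 for i in range(cats.shape[1])]
        return self

    def make_continuous(self, cats):
        # One-hot encode each column with its own width, then concatenate.
        encoded = [
            F.one_hot(cats[:, i], num_classes=n)
            for i, n in enumerate(self.n_categories)
        ]
        return torch.cat(encoded, dim=1).float()

    def make_categorical(self, conts):
        # Split back into per-feature blocks and take the strongest entry.
        blocks = torch.split(conts, self.n_categories, dim=1)
        return torch.stack([block.argmax(dim=1) for block in blocks], dim=1)


cats = torch.tensor([[0, 2], [1, 0], [1, 1]])  # 2 features with 2 and 3 categories
encoder = OneHotCatEncoder().fit(cats)
encoded = encoder.make_continuous(cats)
decoded = encoder.make_categorical(encoded)
```

Using argmax for decoding means make_categorical also works on codebook vectors, where the one-hot blocks are no longer exact zeros and ones.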

Importing Pandas DataFrames

One feature we might want for our UnsupervisedDataBunch is the ability to be created from a Pandas DataFrame. As mentioned in the overview above, we will leverage TabularDataBunch to do the data loading and preprocessing for us, then we’ll import the data into our own databunch.

A TabularDataBunch is usually created as follows:

Creation of a TabularDataBunch from a Pandas DataFrame

The code above does the following:

  • Load the dataframe into a TabularList
  • Split the list into training and validation sets
  • Fill in missing values
  • Turn all categorical features into ordinal categories
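To see what these steps amount to, here is the same sequence sketched in plain pandas. This is a stand-in that avoids depending on Fast.ai, not the TabularList pipeline itself: the rows are made-up toy values, the median fill and the "0 means missing" convention are assumptions of this sketch, and the split is a fixed slice rather than a random one.

```python
import pandas as pd

# Toy rows using a few House Prices column names; the values are invented.
df = pd.DataFrame({
    "LotArea": [8450.0, None, 11250.0, 9550.0, 14260.0],
    "Neighborhood": ["CollgCr", "Veenker", None, "Crawfor", "CollgCr"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
cat_names, cont_names = ["Neighborhood"], ["LotArea"]

# 1-2. Load the dataframe and split it into training and validation sets.
train_df = df.iloc[:4].copy()
valid_df = df.iloc[4:].copy()

# 3. Fill missing continuous values with the training median.
for name in cont_names:
    median = train_df[name].median()
    train_df[name] = train_df[name].fillna(median)
    valid_df[name] = valid_df[name].fillna(median)

# 4. Turn categorical features into ordinal codes, using the categories
#    seen in the training set. Missing or unseen values become 0.
for name in cat_names:
    categories = sorted(train_df[name].dropna().unique())
    for part in (train_df, valid_df):
        codes = pd.Categorical(part[name], categories=categories).codes
        part[name] = codes + 1  # pandas uses -1 for missing, so shift by one
```

The important detail is that both the fill value and the category list come from the training split only, and are then reused on the validation split.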

Now we can write a conversion function to transform the TabularDataBunch into an UnsupervisedDataBunch. This is where things get hacky: we need to retrieve categorical and continuous features separately, process categorical features using OneHotCatEncoder and then concatenate everything into a single Tensor.

TabularDataBunch -> UnsupervisedDataBunch conversion
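The core of such a conversion can be sketched as follows, assuming the categorical codes and the continuous values have already been pulled out of the TabularDataBunch as two tensors. The function name to_unsupervised_tensor and the "continuous columns first" layout are assumptions of this example.

```python
import torch
import torch.nn.functional as F


def to_unsupervised_tensor(cats, conts, n_categories):
    """Concatenate continuous columns with one-hot encoded categorical columns.

    cats:         LongTensor (n_samples, n_cat_features) of ordinal codes
    conts:        FloatTensor (n_samples, n_cont_features)
    n_categories: number of categories for each categorical feature
    """
    one_hot = [
        F.one_hot(cats[:, i], num_classes=n).float()
        for i, n in enumerate(n_categories)
    ]
    # Layout choice for this sketch: continuous features first, then each
    # one-hot block in the order of the categorical columns.
    return torch.cat([conts.float()] + one_hot, dim=1)


cats = torch.tensor([[0, 2], [1, 0]])
conts = torch.tensor([[0.5, -1.0], [1.5, 2.0]])
x = to_unsupervised_tensor(cats, conts, n_categories=[2, 3])
```

Keeping a fixed, known layout matters because the same offsets are needed later to split codebook vectors back into continuous and categorical parts.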

Since the TabularDataBunch can have a target variable, we are going to add the optional train_y and valid_y arguments to our UnsupervisedDataBunch:

UnsupervisedDataBunch with optional targets

We can now convert any TabularDataBunch by simply using the extension function:

The next step is testing everything we’ve done so far on an actual dataset.

Training on House Prices dataset

I chose the House Prices dataset, since it is well-known and it has a good number of categorical features that we can use to test our data workflow. You can find the dataset on Kaggle, among other places.

Let’s start from a CSV file containing the training set and go from there:

Pretty neat, right? In just about 40 lines of code we got ourselves a trained Self-Organizing Map 😊

Here’s the loss plot:

Figure: Loss plot on House Prices dataset

Creating a DataFrame from the SOM codebook

One of the best things about Self-Organizing Maps is the ability to run predictions of another model (trained on the same dataset) over the codebook elements, and then plot prediction values / classes for each item on the map.

To do so, we could write a codebook_to_df function inside our SomLearner:

Creating a DataFrame from SOM codebook elements
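As an illustration of the idea, here is a standalone sketch written as a free function rather than a SomLearner method. Every parameter is an assumption of this example: the codebook is taken to have shape (rows, cols, n_features) with continuous features first followed by one-hot blocks, and only the continuous part is assumed to have been normalized with the given mean and std.

```python
import pandas as pd
import torch


def codebook_to_df(codebook, map_size, cont_names, cat_names, n_categories, mean, std):
    """Turn SOM codebook vectors into a DataFrame in the original data range."""
    rows, cols = map_size
    flat = codebook.detach().cpu().reshape(rows * cols, -1)

    # Undo the mean/std normalization on the continuous features.
    n_cont = len(cont_names)
    conts = flat[:, :n_cont] * std + mean
    df = pd.DataFrame(conts.numpy(), columns=cont_names)

    # Decode each one-hot block back into an ordinal category code.
    offset = n_cont
    for name, n in zip(cat_names, n_categories):
        df[name] = flat[:, offset:offset + n].argmax(dim=1).numpy()
        offset += n

    # Keep track of where each codebook element sits on the map.
    df["som_row"] = torch.arange(rows).repeat_interleave(cols).numpy()
    df["som_col"] = torch.arange(cols).repeat(rows).numpy()
    return df


# 2x2 map, 1 continuous feature and 1 categorical feature with 3 categories.
codebook = torch.tensor([
    [[0.0, 0.9, 0.1, 0.0], [1.0, 0.2, 0.7, 0.1]],
    [[-1.0, 0.0, 0.1, 0.9], [0.5, 0.6, 0.3, 0.1]],
])
df = codebook_to_df(
    codebook, (2, 2), ["LotArea"], ["Neighborhood"], [3],
    mean=torch.tensor([10.0]), std=torch.tensor([2.0]),
)
```

The som_row and som_col columns are an addition of this sketch; they make it easy to place each prediction back on the map afterwards.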

Now we need a model to use for prediction. Let’s use this Kaggle submission of House Prices regression with a Tabular learner as a starter:

Training a tabular model

Running regression on SOM codebook

Now that we have a trained regressor, let’s generate the DataFrame of the SOM codebook and use it as a test set:

Regression on SOM codebook

Now we can use plt.imshow() on predictions to get a visualization of house price distribution over the SOM 😊
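A small sketch of that last step, assuming the predictions come back as a flat sequence with one value per codebook element, in row-major order. The helper names and the toy prices are made up for this example.

```python
import numpy as np


def predictions_to_grid(predictions, map_size):
    """Reshape flat per-element predictions into the SOM's (rows, cols) layout."""
    rows, cols = map_size
    return np.asarray(predictions, dtype=float).reshape(rows, cols)


def plot_prediction_map(grid, title="Predicted price over the SOM"):
    # Imported here so the reshape helper has no plotting dependency.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(grid, cmap="viridis")
    fig.colorbar(image, ax=ax)
    ax.set_title(title)
    return fig


# Toy predictions for a 2x3 map.
grid = predictions_to_grid(
    [120000, 150000, 180000, 210000, 240000, 270000], map_size=(2, 3)
)
```

Calling plot_prediction_map(grid) followed by plt.show() then displays the map, with each cell colored by the predicted value of the corresponding codebook element.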

Figure: Tabular model predictions on SOM codebook

This is cool, right? In the next article we will complete our SOM toolkit by adding a whole lot of visualization and interpretation utilities, basing our API on the ClassificationInterpretation class.

Note: the library code for UnsupervisedDataBunch has been rewritten by using TabularDataBunch with additional transforms. This article builds the DataBunch from scratch, and it was left untouched for easier understanding.

