Self-Organizing Maps with fast.ai — Step 4: Handling unsupervised data with Fast.ai DataBunch
This is the fourth part of the Self-Organizing Maps with fast.ai article series.
All the code has been published in this repository and this PyPi library.
- Overview: Self-Organizing Maps with Fast.ai
- Step 1: Implementing a SOM with PyTorch
- Step 2: Training the SOM Module with a Fast.ai Learner
- Step 3: Updating SOM hyperparameters with Fast.ai Callbacks
Overview
Many datasets come in tabular form. For this reason, Fast.ai provides a handy `TabularDataBunch` subclass of its `DataBunch` that can natively perform categorization and handle both continuous and categorical features.
In this article, we will use `TabularDataBunch` as a starting point to load our data, and then build a conversion function to move it into our `UnsupervisedDataBunch`.
The main features we’re going to re-implement are:
- Normalization
- Categorical feature encoding
- Export of the SOM codebook into a Pandas DataFrame
Normalization
We will use a separate `Normalizer` class to perform per-feature normalization. Let’s define a base class:
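The original snippet is missing here; a minimal sketch of such a base class could look like this (the method names are assumptions, not necessarily those of the published library):

```python
import torch


class Normalizer:
    """Base class for per-feature normalization strategies (interface assumed)."""

    def normalize(self, data: torch.Tensor) -> torch.Tensor:
        "Fit statistics on `data` and return its normalized version."
        raise NotImplementedError

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        "Map normalized data back into the original value range."
        raise NotImplementedError
```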
While comparing different normalizers, I found out that normalizing by mean and standard deviation helps a lot with SOM convergence, so we’ll extend our `Normalizer` class into a `VarianceNormalizer`:
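The code block was lost here; a sketch of the idea follows. It is shown standalone for self-containedness (in the series it would subclass `Normalizer`), and the stored attribute names are assumptions:

```python
import torch


class VarianceNormalizer:
    """Normalizes each feature to zero mean and unit variance.
    Standalone sketch; in the series this extends the `Normalizer` base class."""

    def normalize(self, data: torch.Tensor) -> torch.Tensor:
        # Store per-feature statistics so we can denormalize later
        self.mean = data.mean(dim=0, keepdim=True)
        self.std = data.std(dim=0, keepdim=True) + 1e-8  # avoid division by zero
        return (data - self.mean) / self.std

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        # Inverse transform: recover values in the original data range
        return data * self.std + self.mean
```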
Note that we also implemented the denormalization function. Since we are normalizing our data, the trained SOM codebook will contain normalized data points: we will need denormalization in order to retrieve values in the initial data range.
Let’s add the normalizer to our `UnsupervisedDataBunch`:
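The original code is missing; the sketch below shows one plausible shape for it, assuming the `UnsupervisedDataBunch` from earlier in the series simply wraps train/validation tensors in DataLoaders, and that the normalizer exposes `mean`/`std` after fitting (both are assumptions about the series' code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class UnsupervisedDataBunch:
    """Minimal sketch: wraps unlabeled tensors and an optional normalizer."""

    def __init__(self, train: torch.Tensor, valid: torch.Tensor = None,
                 bs: int = 64, normalizer=None):
        self.normalizer = normalizer
        if normalizer is not None:
            train = normalizer.normalize(train)
            if valid is not None:
                # Normalize the validation set with training-set statistics
                valid = (valid - normalizer.mean) / normalizer.std
        self.train_dl = DataLoader(TensorDataset(train), batch_size=bs, shuffle=True)
        self.valid_dl = None if valid is None else DataLoader(TensorDataset(valid), batch_size=bs)

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        "Map normalized points (e.g. trained SOM weights) back to input space."
        return data if self.normalizer is None else self.normalizer.denormalize(data)
```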
Handling categorical features
Another important preprocessing step for Self-Organizing Maps is transforming categorical features into continuous ones; this can be done either by One-Hot encoding features or by using embeddings. Since One-Hot encoding is the easiest to implement, we’ll start with that, although embeddings generally perform better.
Both methods require a mixed distance function to compare actual continuous features and converted categoricals independently, but we will skip this step for simplicity’s sake. If you’re interested in how a mixed distance function can be implemented, feel free to have a look at the code on Github.
As we did for normalizers, we will start by defining a base `CatEncoder` class:
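The class definition is missing from this copy of the article; a sketch of the interface (method names taken from the surrounding prose, the rest assumed) could be:

```python
import torch


class CatEncoder:
    """Base class for categorical feature encoders (interface assumed)."""

    def fit(self, data: torch.Tensor) -> None:
        "Learn per-feature category counts from the training set."
        raise NotImplementedError

    def make_continuous(self, data: torch.Tensor) -> torch.Tensor:
        "Encode ordinal categorical features into continuous ones."
        raise NotImplementedError

    def make_categorical(self, data: torch.Tensor) -> torch.Tensor:
        "Decode continuous encodings back into ordinal categories."
        raise NotImplementedError
```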
And subclass it into a `OneHotCatEncoder`:
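The original implementation was stripped from this copy; the sketch below follows the description in the next paragraph. It is shown standalone (in the series it would subclass `CatEncoder`), and the attribute name `n_categories` is an assumption:

```python
import torch
import torch.nn.functional as F


class OneHotCatEncoder:
    """One-hot encoder for ordinal categorical features (standalone sketch)."""

    def fit(self, data: torch.Tensor) -> None:
        # Store the number of categories for each feature in the training set
        self.n_categories = [int(data[:, i].max().item()) + 1
                             for i in range(data.shape[1])]

    def make_continuous(self, data: torch.Tensor) -> torch.Tensor:
        # One-hot encode each feature and concatenate along the feature axis
        return torch.cat([F.one_hot(data[:, i].long(), n).float()
                          for i, n in enumerate(self.n_categories)], dim=1)

    def make_categorical(self, data: torch.Tensor) -> torch.Tensor:
        # Split back into per-feature blocks and take the argmax of each
        cols, start = [], 0
        for n in self.n_categories:
            cols.append(data[:, start:start + n].argmax(dim=1))
            start += n
        return torch.stack(cols, dim=1)
```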
All we’re doing here is using `torch.nn.functional.one_hot` to perform one-hot encoding of our input variables, storing the number of categories for each feature in the training set during `fit`, and then using this information to perform encoding with `make_continuous` and decoding with `make_categorical`.
Importing Pandas DataFrames
One feature we might want for our `UnsupervisedDataBunch` is the ability to be created from a Pandas DataFrame. As mentioned in the overview above, we will leverage Fast.ai’s `TabularDataBunch` to do the data loading and preprocessing for us, then we’ll import the data into our own DataBunch.
A `TabularDataBunch` is usually created as follows:
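The snippet is missing here; a typical construction with the fastai v1 data-block API looks roughly like this (`df`, `cat_names` and `cont_names` are assumed to come from your dataset):

```python
# Sketch of building a TabularDataBunch with the fastai v1 data-block API.
from fastai.tabular import TabularList, FillMissing, Categorify

procs = [FillMissing, Categorify]  # fill NaNs, ordinal-encode categoricals
data = (TabularList.from_df(df, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_rand_pct(valid_pct=0.2)  # train/validation split
        .label_empty()                     # no target: unsupervised
        .databunch(bs=64))
```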
The code above does the following:
- Load the dataframe into a TabularList
- Split the list into training and validation sets
- Fill in missing values
- Turn all categorical features into ordinal categories
Now we can write a conversion function to transform the `TabularDataBunch` into an `UnsupervisedDataBunch`. This is where things get hacky: we need to retrieve categorical and continuous features separately, process categorical features using `OneHotCatEncoder`, and then concatenate everything into a single Tensor.
TabularDataBunch -> UnsupervisedDataBunch conversion
Since the `TabularDataBunch` can have a target variable, we are going to add the optional `train_y` and `valid_y` arguments to our `UnsupervisedDataBunch`:
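The conversion code is missing from this copy. Its core step can be sketched as the helper below, which encodes the categorical codes and concatenates them with the continuous features; in the series this would be wrapped in an extension function that also pulls the feature tensors and optional targets out of the `TabularDataBunch` datasets (those wiring details are assumed, not shown):

```python
import torch


def tabular_to_tensor(cats: torch.Tensor, conts: torch.Tensor, encoder) -> torch.Tensor:
    """Core of the TabularDataBunch -> UnsupervisedDataBunch conversion:
    one-hot encode categorical codes, then concatenate with continuous features.
    `encoder` is expected to expose fit / make_continuous (interface assumed)."""
    encoder.fit(cats)
    return torch.cat([encoder.make_continuous(cats), conts.float()], dim=1)
```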
We can now convert any TabularDataBunch by simply using the extension function:
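The usage snippet is missing; it would look something like the fragment below (the extension-function name `to_unsupervised_databunch` is taken from the series’ library and is an assumption here):

```python
# Hypothetical usage: convert a fastai TabularDataBunch into our own DataBunch.
udb = data.to_unsupervised_databunch(bs=128)
```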
The next step is testing everything we’ve done so far on an actual dataset.
Training on House Prices dataset
I chose the House Prices dataset, since it is well-known and has a good number of categorical features that we can use to test our data workflow. You can find the dataset on Kaggle, among other places.
Let’s start from a CSV file containing the training set and go from there:
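The training script is missing from this copy of the article. An end-to-end sketch could look like the following, assuming the fastai v1 API plus the `SomLearner`, `to_unsupervised_databunch` extension and hyperparameters from earlier in this series (file name, map size and learning rate are all hypothetical):

```python
import pandas as pd
from fastai.tabular import TabularList, FillMissing, Categorify

# Load the training CSV and split features into categorical / continuous
df = pd.read_csv("train.csv")
cat_names = df.select_dtypes(include="object").columns.tolist()
cont_names = [c for c in df.columns if c not in cat_names + ["Id", "SalePrice"]]

# Preprocess with fastai, then convert to our UnsupervisedDataBunch
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                            procs=[FillMissing, Categorify])
        .split_by_rand_pct(valid_pct=0.2)
        .label_empty()
        .databunch(bs=64))
udb = data.to_unsupervised_databunch(bs=64)   # extension from this article

# Train a 10x10 SOM (SomLearner signature assumed from earlier articles)
learn = SomLearner(udb, size=(10, 10))
learn.fit(100, lr=0.6)
learn.recorder.plot_losses()
```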
Pretty neat, right? In just about 40 lines of code we got ourselves a trained Self-Organizing Map 😊
Here’s the loss plot:
Creating a DataFrame from the SOM codebook
One of the best things about Self-Organizing Maps is the ability to run predictions of another model (trained on the same dataset) over the codebook elements, and then plot prediction values / classes for each item on the map.
To do so, we could write a `codebook_to_df` function inside our `SomLearner`:
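The function body is missing; a standalone sketch of the idea follows. It flattens the `(rows, cols, features)` codebook into one DataFrame row per map unit; categorical decoding and the learner wiring are omitted, and the `denorm` hook plus the coordinate column names are assumptions:

```python
import pandas as pd
import torch


def codebook_to_df(codebook: torch.Tensor, feature_names, denorm=None) -> pd.DataFrame:
    """Flatten a (rows, cols, features) SOM codebook into a DataFrame,
    optionally denormalizing weights back into the original data range."""
    rows, cols, n_features = codebook.shape
    flat = codebook.view(-1, n_features)
    if denorm is not None:
        flat = denorm(flat)  # back to the original data range
    df = pd.DataFrame(flat.numpy(), columns=feature_names)
    # Keep each unit's map coordinates for later visualization
    df["som_row"] = torch.arange(rows).repeat_interleave(cols).numpy()
    df["som_col"] = torch.arange(cols).repeat(rows).numpy()
    return df
```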
Now we need a model to use for prediction. Let’s use this Kaggle submission of House Prices regression with a Fast.ai Tabular learner as a starter:
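The learner snippet is missing here; a typical fastai v1 tabular regressor on `SalePrice` looks roughly like this (layer sizes and epoch count are assumptions, and `df`/`cat_names`/`cont_names` come from the training snippet earlier):

```python
# Sketch of a fastai v1 tabular regression learner on log SalePrice.
from fastai.tabular import (TabularList, FillMissing, Categorify, Normalize,
                            FloatList, tabular_learner)

reg_data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                procs=[FillMissing, Categorify, Normalize])
            .split_by_rand_pct(valid_pct=0.2)
            .label_from_df(cols="SalePrice", label_cls=FloatList, log=True)
            .databunch())
reg_learn = tabular_learner(reg_data, layers=[200, 100])
reg_learn.fit_one_cycle(5)
```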
Running regression on SOM codebook
Now that we have a trained regressor, let’s generate the DataFrame of the SOM codebook and use it as a test set:
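The snippet is missing from this copy. The reshaping step can be sketched as a small helper, with the fastai prediction loop shown as hypothetical usage (in fastai v1, `Learner.predict` returns an `(item, tensor, raw)` triple per row):

```python
import numpy as np


def predictions_to_map(preds, rows: int, cols: int) -> np.ndarray:
    "Reshape per-unit predictions back onto the SOM grid for plotting."
    return np.asarray(preds, dtype=float).reshape(rows, cols)


# Hypothetical usage with the snippets above (names assumed):
# codebook_df = learn.codebook_to_df()
# preds = [reg_learn.predict(row)[1].item() for _, row in codebook_df.iterrows()]
# price_map = predictions_to_map(preds, 10, 10)
```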
Now we can use `plt.imshow()` on the predictions to get a visualization of the house price distribution over the SOM 😊
This is cool, right? In the next article we will complete our SOM toolkit by adding a whole lot of visualization and interpretation utilities, basing our API on the Fast.ai `ClassificationInterpretation` class.
Note: the library code for `UnsupervisedDataBunch` has since been rewritten to use Fast.ai’s `TabularDataBunch` with additional transforms. This article builds the DataBunch from scratch, and was left untouched for easier understanding.