AutoEmbedder — Training embedding layers on unsupervised tasks

Riccardo Sayn
Kirey Group
Jan 27, 2021

How AutoEncoders can be used to train unsupervised entity embeddings

3D Visualization of Word2Vec embeddings from Tensorflow Projector

Embeddings have become the standard way to represent categorical features in Machine Learning. The ability to encode words, entities or category values into meaningful, dense vector representations, and to perform numerical operations and comparisons on them, has driven a great deal of progress in the field in recent years.

In this post I’d like to give a quick overview of various embedding strategies (mainly supervised vs. unsupervised) and present the AutoEmbedder, a model I’m using to train embedding layers in unsupervised learning tasks.

Word Embeddings

Word Embeddings originated in the field of Natural Language Processing as a statistical approach to representing words as vectors based on their co-occurrence in sentences. The main advantage of these representations is the high correlation between word-sense similarity and embedding similarity.

Years later, Google’s Word2Vec (2013) and Facebook’s FastText (2016) introduced new approaches to embedding generation based on machine learning:

  • Continuous Bag-of-Words (Word2Vec): predicts the middle word based on surrounding context words
  • Continuous Skip-Gram (Word2Vec): predicts the words in a range around the current word
  • Subword information (FastText): uses sub-word level features (groups of characters) to generate embeddings

All these approaches can be trained on large corpora of unlabeled data, because they rely on the underlying sentence structure (and word structure, in the case of FastText) to learn about word relationships.
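As a quick illustration (my own hedged example, not from the original post; I’m assuming the gensim library and its 4.x parameter names), this is roughly how such models are trained on an unlabeled corpus:

```python
from gensim.models import Word2Vec, FastText

# Unlabeled corpus: each "sentence" is just a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# Word2Vec: sg=0 uses Continuous Bag-of-Words, sg=1 uses Skip-Gram.
w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)

# FastText adds subword information via character n-grams (min_n..max_n).
ft = FastText(corpus, vector_size=50, window=3, min_count=1, min_n=3, max_n=6)

print(w2v.wv["cat"].shape)  # (50,)
print(ft.wv["cat"].shape)   # (50,)
```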

Entity Embeddings

All the embedding strategies listed above can also work on entities, given that we encode them properly. After all, sentences are lists of strings, and an entity made of categorical properties can be represented in the same way!
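As a toy sketch of this idea (my own illustrative example), a row of categorical properties can be flattened into a “sentence” of feature=value tokens and fed to the same models:

```python
# A hypothetical entity described only by categorical properties.
car = {"brand": "fiat", "fuel": "diesel", "color": "red"}

# Flatten it into a "sentence": one token per (feature, value) pair.
sentence = [f"{feature}={value}" for feature, value in car.items()]
print(sentence)  # ['brand=fiat', 'fuel=diesel', 'color=red']

# A list of such sentences can then be used as a corpus for Word2Vec / FastText.
```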

Training models like Word2Vec or FastText on entities to create embeddings is possible, but there are two issues:

  1. Embeddings generated by these models are “on the same plane”, meaning that the models do not account for the fact that each “word” is actually a value of a different feature; as a result, all categorical features are encoded in the same latent space
  2. When embeddings are used in supervised learning tasks, it’s generally best to train them on the same target as the task, i.e. to train task-specific embeddings.

Both problems can be overcome by using neural embedding layers. These layers are trained with back-propagation on-task, and can be used to feed categorical data into neural networks.

Usually, one embedding module is used for each categorical feature; thus, each feature is encoded in a different latent space (which means that comparing embeddings from different features doesn’t make much sense).
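A minimal PyTorch sketch of this setup (illustrative, not taken from any specific library) looks like this:

```python
import torch
import torch.nn as nn

# One embedding module per categorical feature -> one latent space per feature.
cardinalities = [10, 5, 7]  # number of distinct values for each feature
emb_dims = [4, 3, 3]        # embedding size chosen for each feature

embeddings = nn.ModuleList(
    [nn.Embedding(card, dim) for card, dim in zip(cardinalities, emb_dims)]
)

x_cat = torch.randint(0, 5, (8, 3))   # a batch of 8 integer-encoded rows
vectors = [emb(x_cat[:, i]) for i, emb in enumerate(embeddings)]
features = torch.cat(vectors, dim=1)  # shape (8, 4 + 3 + 3), ready for the network
```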

In case you’re interested in exploring this topic, here are the links to a paper and some implementations:

AutoEmbedder

Neural embeddings work great on supervised learning tasks. But what about unsupervised tasks? We don’t have any target on which to train the embedding layers! We could resort to Word2Vec / FastText models but, as noted above, this would leave us with a single latent space for all of our categorical features…

I explored this problem over the last few weeks and came up with this idea: what if we could pass the embedding vectors generated by the layers into a model that needs no explicit target to be trained, like an AutoEncoder?

This could remove the need for an explicit target feature, since the AutoEncoder loss is computed by comparing the network input with the network output.

The resulting architecture is the following:

AutoEmbedder architecture

Implementation

Fastai was the perfect playground, since it already provides a Tabular toolkit that takes care of data loading and splits categorical / continuous features before they are passed to the model.

I had to hack their standard TabularModel a bit in order to fit an AutoEncoder in it, and I ended up replacing it altogether.

The final result is the following piece of code:
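The original gist is not reproduced here, but in broad strokes the model looks something like the sketch below. This is my own approximation: the class and function names are illustrative, a plain encoder/decoder stands in for the SymmetricalAutoEncoder, and I assume the reconstruction target is the original category indices (the actual implementation may differ):

```python
import torch
import torch.nn as nn

class AutoEmbedder(nn.Module):
    """Sketch: per-feature embedding layers feeding a bottleneck autoencoder."""

    def __init__(self, cardinalities, emb_dims, hidden_dim):
        super().__init__()
        # One embedding per categorical feature, as in a supervised tabular model.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, dim) for card, dim in zip(cardinalities, emb_dims)]
        )
        total_dim = sum(emb_dims)
        # Stand-in for the SymmetricalAutoEncoder: encoder -> bottleneck -> decoder.
        self.encoder = nn.Sequential(nn.Linear(total_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, total_dim)
        # One classification head per feature to decode back into categories.
        self.heads = nn.ModuleList(
            [nn.Linear(total_dim, card) for card in cardinalities]
        )

    def forward(self, x_cat):
        embedded = torch.cat(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )
        reconstructed = self.decoder(self.encoder(embedded))
        # Per-feature logits, compared against the original inputs by the loss.
        return [head(reconstructed) for head in self.heads]

def autoembedder_loss(logits_per_feature, x_cat):
    """Sum of per-feature cross-entropy reconstruction losses."""
    return sum(
        nn.functional.cross_entropy(logits, x_cat[:, i])
        for i, logits in enumerate(logits_per_feature)
    )
```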

Note: the SymmetricalAutoEncoder class was omitted, but you can find all the code here.

Evaluating the model


Model evaluation was performed by measuring a reconstruction score and a prediction score, both defined below.

The evaluation datasets were the Adult sample dataset (using the Fastai built-in), the House Prices dataset, the Used Cars catalog and the World Cities Database.

The AutoEmbedder was trained 20 times on each dataset, for 20 cycles each (see more about the one-cycle policy here).
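A hypothetical version of that training protocol with Fastai might look as follows, reading “20 cycles” as 20 one-cycle epochs per run and reusing the illustrative autoembedder_loss from the sketch above (make_dls and make_autoembedder are placeholders for the library’s actual data-loading and model-construction code):

```python
from fastai.tabular.all import *

for run in range(20):                # 20 independent trainings per dataset
    dls = make_dls()                 # placeholder: build the tabular DataLoaders
    model = make_autoembedder()      # placeholder: fresh AutoEmbedder for each run
    learn = Learner(dls, model, loss_func=autoembedder_loss)
    learn.fit_one_cycle(20)          # 20 epochs with the one-cycle policy
```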

Reconstruction Score

The reconstruction score measures the amount of confusion between different values of each variable.

It is computed in three steps, sketched in code below:

  • Encoding the test dataset into embedding vectors
  • Decoding the vectors back into categories
  • Comparing original values vs. reconstructed values
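
In code, these three steps might look roughly like the sketch below (assuming the illustrative AutoEmbedder from the earlier snippet, which returns per-feature logits):

```python
import torch

def reconstruction_score(model, x_cat):
    """Share of correctly reconstructed values for each categorical feature."""
    model.eval()
    with torch.no_grad():
        logits_per_feature = model(x_cat)  # encode + decode in a single pass
    scores = {}
    for i, logits in enumerate(logits_per_feature):
        decoded = logits.argmax(dim=1)     # decoded category for feature i
        scores[i] = (decoded == x_cat[:, i]).float().mean().item()
    return scores
```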

The score is computed as a percentage of correct reconstructions for each feature. The table below shows the scores for all four datasets, averaged over 20 training cycles:

Reconstruction Scores

The embedding generation shows little to no category confusion for most features. The exceptions are category values that appear only once or twice in the dataset.

Prediction Score

The prediction score measures the meaningfulness of the embedding vectors.

It is obtained by comparing prediction performance between two models trained on the same supervised task. The only difference between the models is that one uses supervised embeddings while the other uses unsupervised pre-trained embeddings.

The prediction score was computed on the Adult Sample and House Prices datasets over 20 training cycles, using accuracy and exponential RMSPE as metrics, respectively. The model score for each cycle is the metric value of the last training epoch.

The comparison was performed with a Student’s t-test under the hypothesis of equal means between same-size populations.
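With scipy, the test looks roughly like this (the score lists are illustrative placeholders for the 20 per-run metric values of each model):

```python
from scipy.stats import ttest_ind

# Placeholder values: in practice these are the 20 per-run scores of each model.
scores_supervised = [0.842, 0.838, 0.845, 0.840, 0.843] * 4
scores_unsupervised = [0.839, 0.841, 0.836, 0.844, 0.840] * 4

t_stat, p_value = ttest_ind(scores_supervised, scores_unsupervised)

# A p-value above the chosen significance level means we cannot reject the
# hypothesis that the two models perform equally well on average.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```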

The null hypothesis of equal means could not be rejected at the 0.05 significance level on either dataset, meaning that the unsupervised embeddings trained with the AutoEmbedder can be successfully used for inference with a negligible performance drop.

Conclusions

The AutoEmbedder is a viable strategy for training neural embeddings in an unsupervised learning context. It generates embeddings that belong to separate latent spaces (one per feature), which sets it apart from other strategies such as NLP-based embeddings.

A working implementation of the AutoEmbedder can be found in this Python library, together with a FastText wrapper and various other categorical encoding methods that I wrapped in a common interface.

The library also contains Fastai bindings for Tabular tasks, so if you’re already using Fastai you can train unsupervised embeddings for your model by simply adding the CategoryEncode tabular transform.
