Simplifying embeddings with embedder
Deep learning on structured data using a utility library, embedder, to learn representations of categorical variables.
Deep learning is taking the world by storm and achieving human-level results in a diverse range of tasks we previously thought to be highly challenging for computers, such as image recognition, machine translation, and language generation. Much has been said on the topic by individuals with much deeper expertise than my own. I want to address a smaller but nonetheless important misconception about neural networks: that they are not particularly useful for machine learning tasks on structured data when compared to more traditional approaches, such as ensembles of decision trees.
There is one area where neural networks can be superior or, in fact, complementary to gradient boosting machines: learning feature representations for categorical variables, which are commonly observed in structured data. As an example, consider the problem of predicting a new graduate’s wages from a set of features including her university. Clearly, university is a categorical variable: a particular student in the data would have one of hundreds of categories, such as ‘Stanford University’. Prior to training a model, we would need to find a numerical representation for this variable.
The beauty of neural networks is that, as part of learning a mapping from inputs to outputs, they also learn representations of the data. Those representations are hidden layers in the network. Well-known examples include convolutional layers of ConvNets or word embeddings learnt by the word2vec algorithm. Similarly, we can learn representations of categorical variables that are relevant for a specific task by embedding their one-hot encodings in a lower-dimensional vector space. This is what embedder aims to simplify.
Entity embeddings are fixed-size continuous vectors that represent a categorical variable numerically. There is a unique vector for each category in the variable (e.g. for each university), and the stack of these vectors is called the embedding matrix. Specifically, we will focus on embeddings that a neural network learns from data. The best introduction to the ideas behind this concept is Guo and Berkhahn (2016); for a beginner-friendly introduction check out this post.
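To make the definition concrete, here is a minimal NumPy sketch that treats the embedding matrix as a lookup table. The university names are made up for illustration, and random values stand in for weights a network would actually learn; this is not embedder's own code.

```python
import numpy as np

# Hypothetical categories; in practice these come from the training data.
categories = ["Stanford University", "MIT", "Harvard University"]
category_to_index = {name: i for i, name in enumerate(categories)}

# Embedding matrix: one row per category, one column per embedding dimension.
# Random values stand in for weights a network would learn.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))


def embed(values):
    """Return the stacked embedding vectors for a sequence of category names."""
    indices = [category_to_index[v] for v in values]
    return embedding_matrix[indices]


vectors = embed(["MIT", "Stanford University", "MIT"])
print(vectors.shape)  # (3, 4)
```

Every occurrence of the same category maps to the same row, so the first and third vectors above are identical.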
embedder is a simple utility I wrote to make the process of working with entity embeddings easier — and I will use it to demonstrate some examples. At its heart, embedder is a wrapper around Keras that exposes a scikit-learn-like interface to fit a neural network on training data and subsequently transform categorical variables into learnt vectors.
Entity embeddings are conceptually simple, but powerful and relevant for many data science applications.
- They provide a continuous representation of categorical variables, which makes it possible to train more powerful neural networks.
- They provide a way to add metadata to deep learning on unstructured data, including images, video, natural language, or time series data.
- They can improve the performance of other machine learning algorithms, including XGBoost and Random Forest.
- Using transfer learning, one can train embeddings on a large dataset and subsequently use them for problems with much smaller training data. This can be very helpful in industry.
- Because we learn vectors for categorical variables in Euclidean space, it becomes straightforward to cluster observations on those variables.
- Information retrieval and recommender system problems commonly use embeddings.
Entity embeddings have been used to win Kaggle competitions with very little feature engineering and can be useful for data scientists in almost any industry, since categorical variables are extremely prevalent. While entity embeddings are not particularly difficult to code up using Keras, I have found the process of pre-processing the data, extracting the embeddings and transforming the input data sufficiently cumbersome and repetitive to benefit from simplification.
For demonstration purposes, I will use the embedder library and the dataset explored in Guo and Berkhahn (2016), the Rossmann drug stores data. The dataset contains historical sales data for 1115 Rossmann stores, and each store is located in a specific German state. Learning embeddings for this dataset using embedder can be done in as few as 5 lines of code (not counting loading the data).
For further details on what the library does at each of these steps, see the repository. Note that the pre-processing steps are only meant to simplify the process of preparing the data prior to training a neural network, but are not mandatory.
Two things to point out. First, by default, embedder will train a feedforward network with two hidden layers, which is a sensible default, although it may not be optimal for every application. The desired architecture can be passed as JSON at class instantiation. Second, on a regression task a mean squared error loss function is used by default (and a cross-entropy loss for classification tasks); again, a sensible default for the vanilla applications that embedder aims to simplify.
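To illustrate what learning embeddings under a mean squared error loss involves, here is a deliberately simplified NumPy sketch. It assumes a single categorical variable and replaces the two hidden layers with one linear output layer, so it is not embedder's architecture or code; the toy data, learning rate and variable names are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one integer-encoded categorical variable with 4 categories.
n_categories, emb_dim, n_samples = 4, 2, 200
codes = rng.integers(0, n_categories, size=n_samples)
true_effect = np.array([1.0, -2.0, 0.5, 3.0])  # hidden per-category effect
y = true_effect[codes] + 0.1 * rng.normal(size=n_samples)

# Parameters: embedding matrix E and a linear output layer (w, b).
E = 0.1 * rng.normal(size=(n_categories, emb_dim))
w = 0.1 * rng.normal(size=emb_dim)
b = 0.0


def mse(E, w, b):
    """Mean squared error of the current model on the toy data."""
    return float(np.mean((E[codes] @ w + b - y) ** 2))


initial_loss = mse(E, w, b)
lr = 0.05
for _ in range(2000):
    emb = E[codes]                    # look up one vector per sample
    err = emb @ w + b - y             # prediction error
    grad_out = 2.0 * err / n_samples  # gradient of the loss w.r.t. predictions
    grad_w = emb.T @ grad_out
    grad_b = grad_out.sum()
    grad_E = np.zeros_like(E)
    # Each sample contributes only to the row of E for its own category.
    np.add.at(grad_E, codes, np.outer(grad_out, w))
    E -= lr * grad_E
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = mse(E, w, b)
print(initial_loss, final_loss)
```

The rows of `E` after training are the learnt embeddings; in a real network the same gradient flow happens through the hidden layers as well.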
Transforming the data
While embedder allows you to pass arbitrary topologies and search for the network that minimizes the loss, that is not its primary objective. For that task, it would clearly make more sense to modify the layers and tune hyperparameters directly in a deep learning framework such as Keras. Instead, the primary purpose of embedder is to learn and extract feature representations for categorical variables.
Suppose the goal of training a neural network is only to learn good feature representations, that is, to replace strings like ‘Stanford University’ with a numeric vector learnt by gradient descent. Subsequently, the user fits a completely different model on the transformed data, for example a linear model. One reason for this may be a business preference for the easy interpretability of the coefficients on the other, non-categorical variables. A linear model can easily overfit when one-hot encoding drastically increases the dimensionality of the feature space, whereas embeddings keep that dimensionality low. Guo and Berkhahn (2016) also provide evidence that the performance of gradient boosting machines and random forests can be improved by the use of embeddings.
With embedder, the transformation step is very easy and intuitive to anyone who has used feature preprocessing in scikit-learn.
The benefit of exposing a scikit-learn interface is that now entity embeddings can be obtained as part of a feature pre-processing and feature selection pipeline. Such a pipeline would also be useful for serving models in production when the raw incoming data needs to be processed before it is scored by the model.
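As a rough illustration of the fit/transform convention, the sketch below defines a small transformer that swaps categorical columns for vectors and passes numeric columns through. The class name `EmbeddingLookupTransformer` and its constructor argument are hypothetical and are not part of embedder or scikit-learn; the vectors are hand-written stand-ins for learnt embeddings.

```python
import numpy as np


class EmbeddingLookupTransformer:
    """Hypothetical transformer that swaps categories for pre-learnt vectors."""

    def __init__(self, embeddings):
        # embeddings: {column index: {category: 1-D vector}}
        self.embeddings = embeddings

    def fit(self, X, y=None):
        # Nothing to learn here; the vectors are assumed to be learnt already.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        blocks = []
        for j in range(X.shape[1]):
            if j in self.embeddings:
                table = self.embeddings[j]
                # Replace each category in column j with its vector.
                blocks.append(np.stack([table[v] for v in X[:, j]]))
            else:
                # Numeric columns are passed through unchanged.
                blocks.append(X[:, [j]].astype(float))
        return np.hstack(blocks)


# Made-up vectors and data: column 0 is the university, column 1 a grade.
embeddings = {
    0: {
        "Stanford University": np.array([0.9, -0.1]),
        "MIT": np.array([0.8, 0.2]),
    }
}
X = [["Stanford University", 3.7], ["MIT", 3.9]]
X_t = EmbeddingLookupTransformer(embeddings).fit(X).transform(X)
print(X_t.shape)  # (2, 3)
```

The transformed matrix is purely numeric, so it can be handed to any downstream estimator, such as a linear model or a gradient boosting machine.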
Extracting and examining embeddings
A particularly interesting use case for embeddings, when the goal is not simply to fit the best neural network on training data, is transfer learning. In my experience, applications of this idea in a business setting are not hard to come by. Returning to our graduate income prediction, the network may be trained to learn continuous representations of universities from a different, much larger dataset (e.g., historical data from the government), prior to fitting a different model on a smaller dataset of interest (e.g., the specific population of a fresh edtech startup). Intuitively, this is similar to how we may wish to train word embeddings on very large corpora such as Wikipedia before using them for other tasks such as IMDb sentiment classification.
Conceptually, embeddings are the weights going from the one-hot encoded categorical variable to a hidden layer (which is the embedding layer). Hence, extracting embeddings can be achieved by accessing the appropriate weights in the network. However, it can be cumbersome to figure out which embedding layer corresponds to which categorical variable — and embedder simplifies this by returning a dictionary where the key is the correct variable and the value is the embedding matrix.
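The claim that embeddings are the weights between the one-hot input and the embedding layer can be checked numerically. In the sketch below, the dictionary layout mirrors the description above, but the variable names, shapes and random values are illustrative assumptions rather than actual embedder output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical result of extraction: variable name -> embedding matrix.
# Random values stand in for learnt weights; names and shapes are illustrative.
embeddings = {
    "Store": rng.normal(size=(6, 3)),  # 6 categories, 3 dimensions
    "State": rng.normal(size=(4, 2)),  # 4 categories, 2 dimensions
}

state_matrix = embeddings["State"]
category_index = 2

# One-hot encode the category...
one_hot = np.zeros(state_matrix.shape[0])
one_hot[category_index] = 1.0

# ...and multiply by the weight matrix: the result is simply one row.
via_matmul = one_hot @ state_matrix
via_lookup = state_matrix[category_index]
print(np.allclose(via_matmul, via_lookup))  # True
```

This equivalence is why frameworks implement embedding layers as index lookups rather than as explicit matrix products.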
Once you have this embedding matrix, you can use it as the initial weights for the same categorical variables in a different dataset, or as feature representations for training new models.
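One practical wrinkle when reusing embeddings as initial weights is that the new dataset may order its categories differently or contain categories the source data never saw. The sketch below shows one way to handle this, assuming made-up category lists and a random stand-in for the learnt matrix: shared categories copy their learnt row, and unseen ones keep a small random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for embeddings learnt on the large source dataset.
source_categories = ["Stanford University", "MIT", "Harvard University"]
source_matrix = rng.normal(size=(len(source_categories), 4))
source_index = {name: i for i, name in enumerate(source_categories)}

# The smaller target dataset has its own ordering and one unseen category.
target_categories = ["MIT", "Stanford University", "Imperial College London"]

# Start from small random values, then copy rows for shared categories.
initial_weights = 0.01 * rng.normal(
    size=(len(target_categories), source_matrix.shape[1])
)
for row, name in enumerate(target_categories):
    if name in source_index:
        initial_weights[row] = source_matrix[source_index[name]]

print(initial_weights.shape)  # (3, 4)
```

The resulting matrix can then be used to initialize the embedding layer of a new network, or its rows can be used directly as features.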
Finally, it is important to visualize embeddings to assess what the neural network has learnt. The typical way to do this is to reduce the dimensionality of the embedding matrix using t-SNE (a non-linear dimensionality reduction technique that tends to keep embeddings that are close in the original space close in the projection) and subsequently visualize all embeddings on the same 2D plot. embedder reduces this operation to a single function call.
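The reduce-then-plot workflow can be sketched with NumPy alone. Note that this example swaps in PCA, a simpler linear projection, as a stand-in for t-SNE, and uses a random matrix in place of learnt embeddings; it shows the shape of the workflow rather than what embedder does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learnt embedding matrix: 12 categories, 10 dimensions.
embedding_matrix = rng.normal(size=(12, 10))

# PCA via SVD: centre the rows, then project onto the top two directions.
centred = embedding_matrix - embedding_matrix.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ vt[:2].T

print(coords_2d.shape)  # (12, 2)
```

Each row of `coords_2d` gives the 2D position of one category and can be drawn on a scatter plot with its category label. To use t-SNE instead, the SVD step could be replaced with scikit-learn's `TSNE` class.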
Embeddings are a highly applicable concept that can significantly improve machine learning on structured data. They are a reminder that the power of neural networks lies not only in learning complex functions that achieve near-human predictive performance, but also in learning transferable feature representations.
I have shown how to train and extract vector representations of categorical variables succinctly using embedder. It can be useful on its own or as part of larger feature pre-processing pipelines, which are commonly implemented in production environments around scikit-learn and other tools with a similar API. I hope these ideas will benefit other data scientists and machine learning engineers.
Many thanks to Tuan Anh Le for feedback and discussions on this topic.
Guo, C. and Berkhahn, F., 2016. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737.
 By structured data I simply mean tabular data that lives in spreadsheets or relational databases, in contrast to unstructured data such as images or video. I would argue that it is this kind of data that a data scientist working in industry is more likely to encounter.
 A quick look at Kaggle competitions suggests that, faced with a supervised learning task on structured data, the dominant approach would be to use a gradient boosting machine, especially its highly efficient implementations such as XGBoost or LightGBM. In my experience, beating XGBoost, even out-of-the-box with very little feature pre-processing and parameter tuning, is a time-consuming and challenging task for a fully-connected feedforward neural network, which is generally the architecture of choice for structured data with no sequential dimension. I’ll leave the deeper reason behind this for another post.
The two most common approaches are one-hot encoding and integer encoding. Both suffer from fundamental drawbacks. One-hot encodings, in particular, are not conducive to training neural networks as they are binary features. See Guo and Berkhahn (2016).
 If you wish to compile the model to minimize a different loss with a different architecture, all you need to do is inherit from the Base class and override the fit method.