Gentle Introduction to Embedding

I have been wanting to blog for a very long time. I guess this is the moment. Currently I am taking a wonderful deep learning course taught by @jeremyphoward.

Let's start with some background and motivation for why embedding is so awesome. All machine learning models require input data to be numerical. Unfortunately, real-world data is a mix of numerical and categorical values (considering structured data).

Examples of categorical data could be something like below:

Original Data

Here we have two categorical variables (let's ignore User Id for now): Favourite Colour (FC) and T-shirt Size (TS). We could represent our input data using the following methods:

Label Encoding

Representing FC as integer values is incorrect. Why? If I add red two times (1 + 1), will it add up to blue (2)? No, this does not make sense. Encoded this way, the numbers imply relationships between colours that do not exist, and we lose the real meaning of this variable.

Representing TS as a numerical value is also incorrect. Why? If I add small and medium (1 + 2), will it add up to large (3)? No. The codes do keep the order of the sizes, but the arithmetic between them is not meaningful, so this again misrepresents the variable.

Label Encoding
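To make the problem concrete, here is a minimal sketch of label encoding in plain Python. The rows are hypothetical stand-ins for the table above, and the code for green (3) is an assumption; only red = 1, blue = 2 and the size codes come from the text.

```python
# Hypothetical rows standing in for the original table.
favourite_colour = ["red", "blue", "green", "red"]
tshirt_size = ["small", "large", "medium", "small"]

# Integer codes: red=1 and blue=2 follow the text, green=3 is assumed.
colour_codes = {"red": 1, "blue": 2, "green": 3}
size_codes = {"small": 1, "medium": 2, "large": 3}

encoded_colour = [colour_codes[c] for c in favourite_colour]
encoded_size = [size_codes[s] for s in tshirt_size]

print(encoded_colour)  # [1, 2, 3, 1]
print(encoded_size)    # [1, 3, 2, 1]

# A model is free to do arithmetic on these codes, even though
# "red + red = blue" means nothing for colours.
red_plus_red_is_blue = colour_codes["red"] + colour_codes["red"] == colour_codes["blue"]
print(red_plus_red_is_blue)  # True
```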

One Hot Encoding

A better idea would be one-hot encoding. It's just a simple way to represent categorical data as a sparse vector. For example, something like below:

One Hot Encoding

Representing FC by one-hot encoding is a good idea. We represent Red as {1,0,0}, Blue as {0,1,0} and Green as {0,0,1}. This means that each level (red, green or blue) is equidistant from the others.

However, representing TS the same way is not a very good idea. We know that small < medium < large, but this ordering information is lost: all levels of the variable are treated as equidistant from each other. Also, what would happen if we had 1000 levels instead of 3? Our matrix would become large and sparse.
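Both points can be checked with a few lines of NumPy. This is only a sketch: the helper `one_hot` is a hypothetical name written for this post, not a library function.

```python
import numpy as np

levels = ["red", "blue", "green"]
index = {level: i for i, level in enumerate(levels)}

def one_hot(level, index):
    """Return a vector of zeros with a single 1 at the level's position."""
    vec = np.zeros(len(index), dtype=int)
    vec[index[level]] = 1
    return vec

red, blue, green = (one_hot(level, index) for level in levels)
print(red, blue, green)  # [1 0 0] [0 1 0] [0 0 1]

# Every pair of levels is the same Euclidean distance apart (sqrt(2)),
# so no ordering or similarity between levels is encoded.
pairs = [(red, blue), (red, green), (blue, green)]
distances = [float(np.linalg.norm(a - b)) for a, b in pairs]
print(distances)

# With 1000 levels, every row needs 1000 columns and holds a single 1.
wide = np.eye(1000, dtype=int)
print(wide.shape)  # (1000, 1000)
```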


Let's say we want to represent an input variable with three levels as 2-dimensional data. An embedding layer, trained by the underlying automatic differentiation engine (e.g., TensorFlow or PyTorch), maps each of the three levels to a 2-dimensional vector.

From Left to Right: Input Data; Representation of Input Data by Label Encoding; Embedded Data

The input data needs to be represented by its index. This can be easily achieved by label encoding. This is your input to the embedding layer.
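Conceptually, an embedding layer is a lookup table: the index selects a row of a weight matrix. The NumPy sketch below illustrates the idea and is not how TensorFlow or PyTorch implement it internally. It assumes zero-based indices (0, 1, 2), which is what these libraries expect, rather than the 1-based codes used in the label encoding example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_levels, n_dims = 3, 2
embedding_matrix = rng.normal(size=(n_levels, n_dims))  # one row per level

# Label-encoded input, zero-based.
indices = np.array([0, 2, 1, 0])

# The "embedding" of each input is simply the matching row.
embedded = embedding_matrix[indices]
print(embedded.shape)  # (4, 2)

# The same result comes from multiplying one-hot vectors by the matrix,
# which is why an index lookup can replace the sparse one-hot input.
one_hot = np.eye(n_levels)[indices]
same_result = bool(np.allclose(one_hot @ embedding_matrix, embedded))
print(same_result)  # True
```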

Initially the weights are randomly initialised. They are then optimised using stochastic gradient descent, which yields good representations of this data in 2 dimensions. This idea is especially powerful when we have hundreds of levels and want to represent them in, say, 50 dimensions.
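Here is a toy sketch of that training loop, written by hand in NumPy so the updates are visible. Everything in it is an assumption made for illustration: the targets are made up, the model is a single linear output on top of the embedding, and a real framework would compute the gradients automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random starting point: one 2-dimensional row per level.
embedding_matrix = rng.normal(scale=0.5, size=(3, 2))
w = rng.normal(scale=0.5, size=2)  # weights of a toy linear output

indices = np.array([0, 1, 2])        # zero-based levels
targets = np.array([1.0, 2.0, 3.0])  # made-up values to predict

def mean_squared_error():
    predictions = embedding_matrix[indices] @ w
    return float(np.mean((predictions - targets) ** 2))

initial_loss = mean_squared_error()

learning_rate = 0.05
for epoch in range(200):
    for i in rng.permutation(len(indices)):  # one sample at a time
        row = indices[i]
        error = embedding_matrix[row] @ w - targets[i]
        # Gradients of the squared error for this single sample.
        grad_row = 2 * error * w
        grad_w = 2 * error * embedding_matrix[row]
        # Only the row belonging to this level is updated.
        embedding_matrix[row] -= learning_rate * grad_row
        w -= learning_rate * grad_w

final_loss = mean_squared_error()
print(initial_loss, final_loss)
```

The point to notice is that each sample only touches the embedding row of its own level, so every level gradually gets its own learned vector.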

Rossmann Challenge

This strategy has been used by many Kagglers to obtain great representations of their categorical data. (The team who proposed this idea came third in this competition.)

You can observe that after one-hot encoding the input data, they embed each categorical variable into a lower dimension.

The outputs of these embeddings are concatenated and fed into a two-layer neural network.
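The data flow can be sketched in NumPy as below. This is not the team's actual code: the number of levels, the embedding sizes, the hidden layer width and the ReLU activation are all hypothetical choices, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cardinalities and embedding sizes, for illustration only.
day_embedding = rng.normal(size=(7, 2))     # day of week -> 2 dimensions
state_embedding = rng.normal(size=(12, 3))  # state -> 3 dimensions

# A batch of four label-encoded rows (zero-based indices).
day_idx = np.array([0, 5, 6, 2])
state_idx = np.array([3, 3, 11, 0])

# Look up each variable's embedding, then concatenate along the feature axis.
features = np.concatenate(
    [day_embedding[day_idx], state_embedding[state_idx]], axis=1
)
print(features.shape)  # (4, 5)

# Two layers: a hidden layer with ReLU, then a linear output.
w1, b1 = rng.normal(size=(5, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

hidden = np.maximum(features @ w1 + b1, 0)
prediction = hidden @ w2 + b2
print(prediction.shape)  # (4, 1)
```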

(Right) Embedded representation of the variable week; (Left) Embedded representation of the variable state

To our left we see an embedded representation of the day-of-the-week variable in two dimensions.

It's amazing to see how the embedding has managed to figure out that weekend sales are different from weekday sales.

The embedded representation of the states variable alone closely resembles the states' actual positions on the map.

Here is a simple example using an embedding layer in Keras.
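What follows is a minimal sketch rather than a tuned model. It assumes a working Keras installation (`import keras`), three levels embedded into 2 dimensions as in the example above, and a made-up numeric target; the layer sizes, optimiser and number of epochs are arbitrary choices.

```python
import numpy as np
import keras

# Three levels (e.g. red, blue, green) embedded into 2 dimensions.
embedding = keras.layers.Embedding(input_dim=3, output_dim=2)

model = keras.Sequential([
    keras.Input(shape=(1,), dtype="int32"),  # one categorical index per row
    embedding,                               # holds a 3 x 2 weight matrix
    keras.layers.Flatten(),
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# Made-up training data: zero-based level indices and a numeric target.
x = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([[1.0], [2.0], [3.0], [1.0], [2.0], [3.0]])
model.fit(x, y, epochs=50, verbose=0)

# The learned 2-dimensional representation of each level.
embedding_matrix = embedding.get_weights()[0]
print(embedding_matrix.shape)  # (3, 2)
```

Each row of `embedding_matrix` is the learned vector for one level, and these rows are what you would plot to get pictures like the ones above.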