On learning embeddings for categorical data using Keras

Mayank Satnalika
May 22, 2018

(This is a breakdown and explanation of Joe Eddy's solution to Kaggle's Safe Driver Prediction Challenge ( Kernel-Link ))

Traditionally, categorical data has been encoded in two common ways:

  • A label encoding, where each unique category is assigned a unique integer label.
  • A one-hot encoding, where the categorical variable is broken into as many features as there are unique categories; for every row, the feature representing that row's category is set to 1 and the rest are set to 0.
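For reference, here is a quick sketch of both classical encodings using pandas (the color column is made-up illustrative data):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Label encoding: each unique category gets an integer label
df['color_label'] = df['color'].astype('category').cat.codes

# One-hot encoding: one 0/1 column per unique category
one_hot = pd.get_dummies(df['color'], prefix='color')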

An embedding learns to map each unique category to an N-dimensional vector of real numbers. This method was used in a Kaggle competition, where it won 3rd place with a relatively simple approach, and was popularised in Jeremy Howard's Fast.ai course. ( Link to paper )

The advantage of using embeddings is that we can choose how many dimensions represent a categorical feature, as opposed to one-hot encoding, where the feature must be broken into as many columns as it has unique values.

Also, like word vectors, entity embeddings can be expected to learn intrinsic properties of the categories and group similar categories together.

What we are trying to do is learn a set of weights for each of the categorical columns; these weights will be used to get the embedding for any value of that column. So we define a model for each categorical column present in the data-set:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Reshape, Dense, Activation

models = []
for categorical_var in categorical_vars:
    model = Sequential()
    no_of_unique_cat = df_train[categorical_var].nunique()
    # Embedding size heuristic from the Fast.ai course:
    # half the cardinality, capped at 50
    embedding_size = int(min(np.ceil(no_of_unique_cat / 2), 50))
    vocab = no_of_unique_cat + 1
    model.add(Embedding(vocab, embedding_size, input_length=1))
    # Flatten the (1, embedding_size) output to (embedding_size,)
    model.add(Reshape(target_shape=(embedding_size,)))
    models.append(model)

In the above code, we define an embedding model for each categorical variable present in the data-set. The embedding size is set according to the rule of thumb given in the Fast.ai course, and we reshape the model output to a single 1-D array of size equal to the embedding size.

For the remaining non-categorical columns, we simply feed them to the model as we would for any regular network. Since the networks above are built individually for each categorical column, we define one more network for the other columns and add it to our models list.

model_rest = Sequential()
# input_dim should equal the number of non-categorical columns fed to this network
model_rest.add(Dense(16, input_dim=1))
models.append(model_rest)

Merging the models:

Once we have these (n_cat + 1) different models, we join them end to end using

from keras.layers import Merge  # Keras 1.x API

full_model = Sequential()
full_model.add(Merge(models, mode='concat'))

Now this has been deprecated, and from Keras v2.x onwards merging models via the Sequential API is no longer allowed, but I found this approach easier to understand. What the concat mode does is join the model outputs one after the other.
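For newer Keras versions, the same idea can be expressed with the functional API and the Concatenate layer. A minimal sketch (embedding_sizes and n_other are illustrative names, not from the original kernel):

from keras.models import Model
from keras.layers import Input, Embedding, Reshape, Dense, Concatenate

inputs, flat_embeddings = [], []
for vocab, emb_size in embedding_sizes:  # one (vocab, size) pair per categorical column
    inp = Input(shape=(1,))
    emb = Embedding(vocab, emb_size, input_length=1)(inp)
    flat_embeddings.append(Reshape(target_shape=(emb_size,))(emb))
    inputs.append(inp)

other_inp = Input(shape=(n_other,))  # all continuous columns together
flat_embeddings.append(Dense(16)(other_inp))
inputs.append(other_inp)

merged = Concatenate()(flat_embeddings)  # same effect as mode='concat'
full_model_fn = Model(inputs=inputs, outputs=merged)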

Some info on merge models:

If we try merging the models using the sum mode:

The message below is good for understanding what happens when we merge using the sum mode: the sum mode does an element-wise addition, while concat appends the outputs one after another.


Code:
full_model = Sequential()
full_model.add(Merge(models, mode='sum'))

O/p:
ValueError: Only layers of same output shape can be merged using sum mode. Layer shapes: [(None, 2), (None, 1), (None, 16)]

Here the lengths of the three vectors to be added are 2, 1 and 16, so element-wise addition cannot be done.

Other available modes are dot and mul, which take the dot product of and element-wise multiply the model outputs, respectively.

What the concat mode does is append the outputs one after another into a single array. So the final output length of our full_model network up to this point would be e1 + e2 + ... + e_n_cat + 16, where the e's are the embedding sizes of the categorical models and 16 is the number of outputs of the dense layer in the model_rest model. For the three shapes in the error message above, for example, concat would give an output of shape (None, 2 + 1 + 16) = (None, 19).

Input format for the merged network:

We’ll pass a list of inputs: each list except the last one holds the values of a single categorical column for all rows of the batch, and the last list holds the values of all the other continuous columns.

If there are N columns, with n_cat of them categorical and n_other of them other variables, and M instances of data, the input is structured as follows (see the sketch after this list):

  • The input is a list of length (n_cat + 1), i.e. the total number of categorical columns plus one.
  • Each of the first n_cat entries is itself a list of size M (the number of instances); the m-th value of list i equals the value of the i-th categorical column for the m-th data instance, where i goes from 1 to n_cat.
  • The last entry is also a list of size M, but each of its values is again a list of size n_other, holding the m-th instance's values for the remaining columns.
  • The first n_cat lists feed the embedding networks made for each category, and the last list is the input for the final network handling all the other columns.
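A minimal sketch of assembling this input list from the training DataFrame (the column names here are hypothetical):

categorical_vars = ['cat_col_1', 'cat_col_2']   # the n_cat categorical columns
other_vars = ['cont_col_1', 'cont_col_2']       # the n_other remaining columns

input_list = []
for c in categorical_vars:
    input_list.append(df_train[c].values)       # shape (M,) each
input_list.append(df_train[other_vars].values)  # shape (M, n_other)
# len(input_list) == n_cat + 1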

The input the model needs can also be read off from the error message itself:

...Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 3 (which is n_cat + 1) array(s), but instead got....

See the following data-set and the corresponding input shape for a better idea:

The data-set has 15 rows (M = 15), 2 categorical columns (n_cat = 2) and 2 continuous columns.
The corresponding input is of length (n_cat + 1) = 3, and each element of it is a list.
Elements 1 and 2 are 1-dimensional lists: list 1 has the 15 values of the first categorical column and list 2 has the 15 values of the second categorical column. The last list is a 2-D list: it has 15 elements, and each element has 2 values (the values of the 2 continuous columns).

Remember that for each embedding network we had set input_length = 1: we take one value each from all the lists (except the last one) and send it to the combined network for training. For the last list, each value is itself a list holding the other columns' values, and this is sent to the model_rest network.
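To make the shapes concrete, a toy input matching the example above could be built like this (random values, purely illustrative):

import numpy as np

M = 15
X = [
    np.random.randint(0, 3, size=M),  # 15 values of categorical column 1
    np.random.randint(0, 2, size=M),  # 15 values of categorical column 2
    np.random.rand(M, 2),             # 15 rows of the 2 continuous columns
]
# len(X) == 3 == n_cat + 1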

Training the network:

From the Keras docs:

…Multiple Sequential instances can be merged into a single output via a Merge layer. The output is a layer that can be added as first layer in a new Sequential model…

So once we have the individual models merged into a full model, we can add layers on top of the network and train it.

full_model.add(Dense(1024))
full_model.add(Activation('relu'))
full_model.add(Dense(256))
full_model.add(Activation('sigmoid'))
full_model.add(Dense(2))
full_model.add(Activation('sigmoid'))

full_model.compile(loss='binary_crossentropy', optimizer='adam')
# data is the (n_cat + 1)-length input list described above,
# values the corresponding targets
full_model.fit(data, values)
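Once training finishes, the learned embedding matrix for any categorical column can be read back from its sub-model; a minimal sketch, where i is the index of that column's model in the models list:

# Row k of this matrix is the learned vector for category label k
embedding_matrix = models[i].layers[0].get_weights()[0]
# shape: (vocab, embedding_size)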

Entity embeddings look like a good and easy way to make data directly ready for input to neural nets, with no feature engineering involved.

Find the complete code here: Link

Fast.ai has also updated their Python package with modules to handle categorical data with embeddings. ( Link )

Link to the original code from Joe Eddy: Link

Do comment if you find any mistakes in the code or the post.
