Learning embeddings for your machine learning model

Matias Aravena Gamboa · spikelab · Feb 16, 2019

How to learn embedding representations for categorical variables.

Pokemon types (source)

Encoding categorical variables in machine learning

Choosing the correct encoding for categorical data can improve a model's results significantly. This feature engineering step can be crucial, depending on your problem and your machine learning algorithm.

Two of the most well-known ways to convert categorical variables are LabelEncoding and One Hot Encoding. The first converts the string labels into k integer values. The second creates k columns and sets each one to 1 or 0 depending on the category. There are many other encodings as well.
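For a concrete picture, here is a minimal sketch of both encodings with pandas and scikit-learn (the color column is just an illustrative example, not part of the dataset used later):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# LabelEncoding: k categories -> integers 0..k-1 (here blue=0, green=1, red=2)
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One Hot Encoding: k categories -> k binary columns
one_hot = pd.get_dummies(df["color"], prefix="color")
print(df.join(one_hot))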

The problem with LabelEncoding is that it can impose an artificial order on the classes: the algorithm may treat class 1 as less important than class 2 just because 2 is greater than 1. On the other hand, One Hot Encoding can create too many columns if you have many distinct classes, and some algorithms don't handle that well.

Over the last couple of days, I have been reading about another way to encode categorical variables: learning embeddings.

So, what are embeddings?

A simple definition of an embedding from the people at TensorFlow:

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

The individual dimensions in these vectors typically have no inherent meaning. Instead, it’s the overall patterns of location and distance between vectors that machine learning takes advantage of.
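Concretely, for a categorical variable an embedding is just a lookup table that maps each category to a small dense vector. A toy illustration (these values are made up; the real ones are learned during training):

# Toy illustration only: the actual vectors are learned by the model
embedding = {
    "Fire":  [ 0.12, -0.80,  0.33],
    "Water": [ 0.09, -0.75,  0.41],  # ends up close to "Fire" if the two behave similarly
    "Ghost": [-0.64,  0.22, -0.10],
}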

One way to create embeddings is to train (or use a pre-trained) model like word2vec. For example, if we train an embedding on text and plot the results, we can get a projection like this:

Embedding model (source)

In the image, we can see that a word like ideas is closer to perspective than to history. This is the basic idea behind learned embeddings. Now let's see how we can learn embeddings for our categorical variables.

Learn embeddings from Pokemon types

For this post, we are going to use the Pokemon with stats dataset from Kaggle (just because it seems like a fun idea). The dataset includes information about 800 Pokemon: name, type, HP, Attack, and other stats, plus a Total column which is the sum of all stats.

The data looks like this:

Pokemon Dataset

The first step is to load the data:

import pandas as pd

# Load the dataset, drop the index column, and clean up the type columns
pokemon = pd.read_csv("Pokemon.csv")
pokemon.drop(labels="#", axis=1, inplace=True)
pokemon.fillna(value="No Type", inplace=True)
pokemon.rename({'Type 1': 'type'}, inplace=True, axis=1)

Now, let's check how many distinct Pokemon types we have:

n_types = pokemon['type'].nunique()
print("We have:", n_types, "different Pokemon types")
We have: 18 different Pokemon types

So we have 18 different Pokemon types. This number is important, so we store it in n_types.

We use a LabelEncoder to convert our string values into integers and a MinMaxScaler to scale the Total column (which will be our target variable) into the range [0, 1]. This will help when training our embedding model.

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

encoder = LabelEncoder()
scaler = MinMaxScaler()
pokemon['encoded_type'] = encoder.fit_transform(pokemon['type'])
pokemon['scaled_total'] = scaler.fit_transform(pokemon[['Total']])
types = pokemon['encoded_type']
total = pokemon['scaled_total']

Now our data looks like this:

Labeled and scaled data

It's time to create our embedding model. For this, we're going to use Keras. The first step is to define the embedding size. Jeremy Howard suggests using the following rule of thumb, which in our case gives an embedding size of 9:

embedding_size = min(np.ceil((no_of_unique_cat)/2),50)
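Plugging in our 18 types gives 9 (a quick check of the formula, not code from the original post):

import numpy as np

no_of_unique_cat = 18  # number of distinct Pokemon types
print(int(min(np.ceil(no_of_unique_cat / 2), 50)))  # -> 9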

I'm going to use 3 as the embedding size instead, because I want to plot the results without having to use PCA or t-SNE. So let's define our model:

from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

emb_size = 3  # embedding size chosen above

model = Sequential()
model.add(Embedding(input_dim=n_types, output_dim=emb_size, input_length=1, name="poke_embedding"))
model.add(Flatten())
model.add(Dense(30, activation="relu"))
model.add(Dense(15, activation="relu"))
model.add(Dense(1, activation="linear"))

We define an Embedding layer, where input_dim corresponds to the size of our vocabulary (18), output_dim is the size of our embedding, and input_length is 1 because we pass only one category per example.
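As a quick sanity check of those shapes (a sketch, not from the original post): each input is a single integer type id, and the embedding layer maps it to a 3-dimensional vector before Flatten passes it to the dense layers.

# Hypothetical check: inspect the embedding layer's output shape
print(model.get_layer("poke_embedding").output_shape)  # (None, 1, 3)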

To compile the model, we use Adam as the optimizer, and our loss function is mse because the Total variable is continuous.

model.compile(optimizer="adam", loss="mse")

We train the model until the loss converges to a stable value. For this, I'm going to train it for 30 epochs (maybe too much):

model.fit(x=types.values, y=total.values, epochs=30)
loss during training

After 9 epochs, the loss converges to 0.0367 and then stabilizes at this value. We can get the weights of the embedding layer and store them in a DataFrame.

# Each row of the DataFrame is the learned 3-dimensional vector for one Pokemon type
embedding_layer = model.get_layer(name="poke_embedding")
embedding_layer = pd.DataFrame(embedding_layer.get_weights()[0])
embedding_layer.columns = ['C1', 'C2', 'C3']

Plotting the results

We are going to plot the weights and see how the different embeddings are related.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # needed for the 3D projection

# Recover the original type names from the LabelEncoder
types_names = list(encoder.inverse_transform([x for x in range(0, n_types)]))

fig = plt.figure(figsize=(8, 4))
ax = fig.add_subplot(111, projection='3d')
for index, embedding in embedding_layer.iterrows():
    x = embedding['C1']
    y = embedding['C2']
    z = embedding['C3']
    ax.scatter(x, y, z, color='b')
    ax.text(x, y, z, '%s' % (types_names[index]), size=9, zorder=1, color='k')
plt.show()

Running the previous code we get:

Embeddings results

From another angle:

Embeddings results

We can see in the image that some types are closer to each other; this could mean that Pokemon of those types have similar stats (I'm not a Pokemon expert). This gives us a vector representation of the different types. We can use these vectors to replace the categorical variable in the original dataset and train other ML algorithms with it.
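As a sketch of how that replacement could look (this code is not from the original post), we can join the learned vectors back onto the dataset through the encoded type and drop the original categorical columns:

# Hypothetical follow-up: attach the learned vectors to each Pokemon
pokemon_emb = pokemon.merge(embedding_layer, left_on='encoded_type', right_index=True)
pokemon_emb = pokemon_emb.drop(columns=['type', 'encoded_type'])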

Conclusions

Using embeddings is a great way to represent categorical variables and reduce the dimensionality of the categories. The next step (maybe for another post) is to use this vector representation to train machine learning models and compare it to other categorical encoding methods. In my experiments, using embeddings has always brought better results, but this representation of the categories makes the model less interpretable.

I'm working on my own package to do this more easily; you can check it out here.
