# Learning embeddings for your machine learning model

How to learn embedding representations for categorical variables.

## Encoding categorical variables in machine learning

Choosing the right encoding for categorical data can significantly improve a model's results. How much this feature engineering step matters depends on your problem and on your machine learning algorithm.

Two of the best-known ways to convert categorical variables are *LabelEncoding* and *One Hot Encoding*, although a wide variety of other encodings exist. The first maps each of the *k* distinct string labels to an integer value. The second creates *k* columns and sets each one to 1 or 0 depending on the category.

The problem with *LabelEncoding* is that it can impose an artificial order on the classes. For example, an algorithm may treat class 1 as less important than class 2 just because 2 is bigger than 1. On the other hand, *One Hot Encoding* can create too many columns when a variable has many distinct classes, and some algorithms do not handle that well.
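
To make the difference concrete, here is a minimal sketch of both encodings in plain Python. The toy list of types is made up for illustration, and the code does not use scikit-learn; it only mimics what the two encoders produce.

```python
# Toy data, made up for illustration only.
toy_types = ["Fire", "Water", "Grass", "Water"]

# Label encoding: sort the distinct classes and number them 0..k-1.
classes = sorted(set(toy_types))
label_of = {name: code for code, name in enumerate(classes)}
label_encoded = [label_of[t] for t in toy_types]

# One-hot encoding: one column per class, with a 1 in the matching column.
one_hot_encoded = [[1 if t == name else 0 for name in classes] for t in toy_types]

print(classes)          # ['Fire', 'Grass', 'Water']
print(label_encoded)    # [0, 2, 1, 2]
print(one_hot_encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Note how the label-encoded values suggest that Water (2) is "bigger" than Fire (0), while the one-hot version needs one column per class.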

Over the last couple of days, I have been reading about another way to encode categorical variables: learning *embeddings*.

## So, what are embeddings?

A simple definition of an embedding, from the people at TensorFlow:

> An *embedding* is a mapping from discrete objects, such as words, to vectors of real numbers. The individual dimensions in these vectors typically have no inherent meaning. Instead, it’s the overall patterns of location and distance between vectors that machine learning takes advantage of.

One way to create embeddings is to train a model such as word2vec (or to use a pre-trained one). For example, if we train an embedding on text and plot the results, we can get a projection like this:

In the image, we can see that words like *ideas* are closer to *perspective* than to *history*. This is the basic idea behind learning embeddings. Now let’s see how we can learn embeddings from our categorical variables.
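
As a rough sketch of what "closer" means, the snippet below compares vectors with cosine similarity. The three word vectors are invented for this example and do not come from a trained model; cosine similarity is just one common choice of measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional word vectors, invented for this example.
word_vectors = {
    "ideas":       [0.9, 0.8, 0.1],
    "perspective": [0.8, 0.9, 0.2],
    "history":     [0.1, 0.2, 0.9],
}

sim_perspective = cosine_similarity(word_vectors["ideas"], word_vectors["perspective"])
sim_history = cosine_similarity(word_vectors["ideas"], word_vectors["history"])
print(sim_perspective > sim_history)  # True for these made-up vectors
```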

## Learn embeddings from Pokemon types

For this post, we are going to use the Pokemon with stats dataset from Kaggle (just because it seems like a fun idea). The dataset includes information about 800 Pokemon, including *name*, *type*, *HP*, *Attack*, and other stats, as well as a *Total* that is the sum of all stats.

The data looks like this:

The first step is to load the data:

    import pandas as pd

    pokemon = pd.read_csv("Pokemon.csv")
    pokemon.drop(labels="#", axis=1, inplace=True)  # drop the "#" column
    pokemon.fillna(value="No Type", inplace=True)  # placeholder for missing values
    pokemon.rename({'Type 1': 'type'}, inplace=True, axis=1)

Now, let’s check how many distinct Pokemon types we have:

    n_types = pokemon['type'].nunique()
    print("We have:", n_types, "different Pokemon types")

Output:

    We have: 18 different Pokemon types

So we have 18 different Pokemon types. This number is important, so we keep it in `n_types`.

We need a *LabelEncoder* to convert our string values into integers, and a *MinMaxScaler* to scale *Total* (which will be our target variable) into the range [0, 1]. This helps when training our embedding model.

    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

    encoder = LabelEncoder()
    scaler = MinMaxScaler()

    pokemon['encoded_type'] = encoder.fit_transform(pokemon['type'])
    pokemon['scaled_total'] = scaler.fit_transform(pokemon[['Total']])

    types = pokemon['encoded_type']
    total = pokemon['scaled_total']
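
For intuition, min-max scaling to [0, 1] can be sketched by hand as below. This is a simplified stand-in for what *MinMaxScaler* does with its default range, written in plain Python; the totals are made up, and the sketch assumes the values are not all equal (otherwise it would divide by zero).

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up totals, not taken from the dataset.
toy_totals = [200, 350, 500, 800]
print(min_max_scale(toy_totals))  # [0.0, 0.25, 0.5, 1.0]
```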

Now our data looks like this:

It’s time to create our embedding model, and for this we’re going to use Keras. The first step is to define the embedding size. Jeremy Howard suggests the following rule of thumb, which in our case gives an embedding size of 9:

`embedding_size = min(np.ceil((no_of_unique_cat)/2),50)`
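
As a quick check of the arithmetic, here is the same rule as a small function using only the standard library. The function name and the `cap` parameter are my own naming for this sketch.

```python
import math

def suggested_embedding_size(n_categories, cap=50):
    """Half the number of categories, rounded up, but never above `cap`."""
    return min(math.ceil(n_categories / 2), cap)

print(suggested_embedding_size(18))   # 9
print(suggested_embedding_size(500))  # 50
```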

I’m going to use 3 as the embedding size instead, because I want to plot the results without needing PCA or t-SNE. So let’s define our model:

    from keras.models import Sequential
    from keras.layers import Dense, Embedding, Flatten

    emb_size = 3  # embedding size chosen above

    model = Sequential()
    model.add(Embedding(input_dim=n_types, output_dim=emb_size, input_length=1, name="poke_embedding"))
    model.add(Flatten())
    model.add(Dense(30, activation="relu"))
    model.add(Dense(15, activation="relu"))
    model.add(Dense(1, activation="linear"))

We define an Embedding layer, where `input_dim` corresponds to the size of our vocabulary (18), `output_dim` is the size of our embedding, and `input_length` is 1 because each input is a single category (just one "word").
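
Conceptually, an embedding layer is a lookup table with one row per category. The sketch below illustrates that idea in plain Python, with random numbers standing in for learned weights. It is not how Keras implements the layer, only a picture of what the lookup computes, and all names in it are hypothetical.

```python
import random

random.seed(0)
n_categories, emb_dim = 18, 3

# Stand-in for the layer's weights: one row of `emb_dim` numbers per category.
# A real layer learns these values; here they are just random.
weights = [[random.uniform(-0.05, 0.05) for _ in range(emb_dim)]
           for _ in range(n_categories)]

def embed(category_index):
    """Look up the vector for an integer-encoded category."""
    return weights[category_index]

def embed_via_one_hot(category_index):
    """Same result, computed as one_hot(category) times the weight matrix."""
    one_hot = [1.0 if i == category_index else 0.0 for i in range(n_categories)]
    return [sum(one_hot[i] * weights[i][d] for i in range(n_categories))
            for d in range(emb_dim)]

print(embed(4) == embed_via_one_hot(4))  # True
```

Seen this way, the embedding is a one-hot encoding followed by a small dense layer, except that the lookup skips the multiplication by all those zeros.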

To compile the model, we are going to use `Adam` as the optimizer and `mse` (mean squared error) as the loss function, because the `Total` variable is continuous.

`model.compile(optimizer="adam", loss="mse")`
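
For reference, the `mse` loss is just the average squared difference between targets and predictions. Here is a hand-written sketch with made-up numbers; Keras computes this internally, so the function is only for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up scaled targets and predictions, for illustration only.
y_true = [0.0, 0.5, 1.0, 0.5]
y_pred = [0.0, 0.0, 0.5, 0.5]
print(mean_squared_error(y_true, y_pred))  # 0.125
```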

We should train the model until the `loss` converges to a stable value. I’m going to train it for 30 epochs, which may be more than needed:

`model.fit(x=types.values, y=total.values, epochs=30)`

After 9 epochs, the loss converges to 0.0367 and stays at that value. We can now get the `weights` of the embedding layer and store them in a `DataFrame`:

    embedding_layer = model.get_layer(name="poke_embedding")
    embedding_layer = pd.DataFrame(embedding_layer.get_weights()[0])
    embedding_layer.columns = ['C1', 'C2', 'C3']

## Plotting the results

We are going to plot the weights and see how the different embeddings are related.

    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d, Axes3D  # enables the 3D projection

    # Row i of the embedding matrix belongs to the type that was encoded as i.
    types_names = list(encoder.inverse_transform(list(range(n_types))))

    fig = plt.figure(figsize=(8, 4))
    ax = fig.add_subplot(111, projection='3d')

    for index, embedding in embedding_layer.iterrows():
        x = embedding['C1']
        y = embedding['C2']
        z = embedding['C3']
        ax.scatter(x, y, z, color='b')
        # Label each point with the type name, not the encoded value.
        ax.text(x, y, z, '%s' % (types_names[index]), size=9, zorder=1, color='k')

    plt.show()

Running the previous code we get:

From another angle:

We can see in the image that some types are closer to each other, which could mean that Pokemon of those types have similar *stats* (I’m not a Pokemon expert). This gives us a vector representation of the different types. We can use these vectors to replace the categorical variable in the original dataset and train other ML algorithms with them.
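
As a minimal sketch of that replacement step, the snippet below swaps a `type` field for three embedding columns. The rows and the vectors are invented for illustration; real vectors would come from the trained embedding layer, and the helper name is hypothetical.

```python
# Hypothetical learned vectors, keyed by type name. Real values would come
# from the trained embedding layer; these numbers are invented.
type_vectors = {
    "Fire":  [0.12, -0.03, 0.08],
    "Water": [0.10, -0.01, 0.07],
    "Bug":   [-0.09, 0.04, -0.06],
}

# A few made-up rows standing in for the original dataset.
rows = [
    {"name": "A", "type": "Fire", "HP": 39},
    {"name": "B", "type": "Bug", "HP": 45},
]

def replace_type_with_embedding(row, vectors):
    """Return a copy of `row` with `type` swapped for columns C1, C2, C3."""
    new_row = {key: value for key, value in row.items() if key != "type"}
    for position, component in enumerate(vectors[row["type"]], start=1):
        new_row[f"C{position}"] = component
    return new_row

encoded_rows = [replace_type_with_embedding(row, type_vectors) for row in rows]
print(encoded_rows[0])
# {'name': 'A', 'HP': 39, 'C1': 0.12, 'C2': -0.03, 'C3': 0.08}
```

With the actual data, the same idea could probably be expressed as a pandas join between the dataset's encoded type column and the embedding `DataFrame`, whose row index matches the encoded values.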

## Conclusions

Using embeddings is a great way to represent categorical variables and reduce the dimensionality of categories. The next step (maybe for another post) is to use this vector representation to train machine learning models and compare it with other categorical encoding methods. In my experiments, using embeddings has always given better results, but this representation of the categories makes the model less interpretable.

I’m working on my own package to make this easier; you can check it here.