TensorFlow Tutorial — Part 3

In Part 1 and Part 2 of this tutorial, I introduced a bit of TensorFlow and Scikit Flow and showed how to build various models on the Titanic dataset.

In this part, let’s make a more complex model, something that can handle categorical variables.

In machine learning, handling categorical variables usually requires creating a one-hot vector for each category. In deep learning, there is an alternative: distributed representations, or embeddings.

Using embeddings, you can represent each category as a vector of floats of the desired size, which can be used as features for the rest of the model. Note that because TensorFlow components (like those of other deep learning frameworks) are fully differentiable, the model can “train” the representation that works best for your task. This has proven to be one of the most powerful tools in the deep learning toolkit, as it removes much of the need for manual feature engineering.

This brings the most value when you have a lot of categories or discrete sparse values, e.g. hundreds or thousands. You then get compression of the input from N categories to a fixed-size embedding.
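To make the compression idea concrete, here is a minimal NumPy sketch (not TensorFlow or skflow code). The sizes `n_classes = 1000` and `embedding_size = 8` are made-up values, and the embedding matrix is random where a real model would learn it. It also shows that looking up a row is the same as multiplying a one-hot vector by the embedding matrix.

```python
import numpy as np

n_classes = 1000      # hypothetical number of distinct categories
embedding_size = 8    # hypothetical size of each embedding vector

rng = np.random.default_rng(0)
# One row per category. In a real model these values are trained; here they are random.
embeddings = rng.normal(size=(n_classes, embedding_size))

ids = np.array([3, 17, 3, 999])   # a batch of category ids

# Embedding lookup: pick the row that belongs to each id.
looked_up = embeddings[ids]

# The same result via one-hot vectors: (4, 1000) matrix times (1000, 8) matrix.
one_hot = np.eye(n_classes)[ids]
via_one_hot = one_hot @ embeddings

print(looked_up.shape)                       # (4, 8)
print(np.allclose(looked_up, via_one_hot))   # True
```

Each input goes from a 1000-wide one-hot vector to 8 floats, and the lookup avoids building the large one-hot matrix at all.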

When you have a small number of categories, it still works: the embedding space ends up modeling one-hot vectors per category (e.g. by spreading the categories apart without any semantic meaning).

Let’s continue with our Titanic dataset and try a simple example that uses just the Embarked field as a categorical variable for prediction:

First, we select only the “Embarked” column as our features. We then follow the usual train/test split, holding out 20% of the data for testing.
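As a rough illustration of this step, here is a plain-Python sketch. The rows are a made-up stand-in for the Titanic CSV, and `split_rows` is a hypothetical helper written for this example; in practice a library function such as scikit-learn's `train_test_split` does the same job.

```python
import random

# Hypothetical stand-in rows: (Embarked, Survived). The real data comes from the Titanic CSV.
rows = [("S", 0), ("C", 1), ("S", 1), ("Q", 0), ("S", 0),
        ("C", 1), (None, 1), ("S", 0), ("Q", 1), ("C", 0)]

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(rows)

# Keep only the Embarked value as the feature, and Survived as the target.
X_train = [embarked for embarked, _ in train]
y_train = [survived for _, survived in train]
X_test = [embarked for embarked, _ in test]
y_test = [survived for _, survived in test]

print(len(X_train), len(X_test))   # 8 2
```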

It’s always useful to analyze what kinds of values the features have:

Embarked has next classes: array(['S', 'C', 'Q', nan])

This column is passed to CategoricalProcessor, a helper class that maps categorical variables to ids. In this case it will create a vocabulary of S->1, C->2, Q->3 and unknown/nan->0, and remap the column to integers.
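The mapping itself is easy to picture. Below is a small plain-Python sketch of the idea. It is not the CategoricalProcessor implementation; `is_missing`, `fit_vocabulary` and `transform` are hypothetical names used only for illustration.

```python
import math

def is_missing(value):
    """Treat None and float NaN as a missing category."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fit_vocabulary(values):
    """Assign ids 1, 2, 3, ... in order of first appearance; id 0 is reserved."""
    vocab = {}
    for value in values:
        if not is_missing(value) and value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab

def transform(values, vocab):
    """Map each value to its id, using 0 for missing or unseen values."""
    return [0 if is_missing(v) else vocab.get(v, 0) for v in values]

embarked = ["S", "C", "S", "Q", float("nan"), "C"]
vocab = fit_vocabulary(embarked)

print(vocab)                         # {'S': 1, 'C': 2, 'Q': 3}
print(transform(embarked, vocab))    # [1, 2, 1, 3, 0, 2]
print(transform(["X"], vocab))       # [0]
```

Reserving id 0 means a value that never appeared in training still maps to a valid row of the embedding matrix.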

The final model is simple: it leverages another helper function, skflow.ops.categorical_variable, which creates an embedding matrix of size n_classes by embedding_size and looks up the input ids in it. This is similar to skflow.ops.one_hot_matrix, but instead of fixed one-hot vectors it returns learnable distributed representations for the given categories.
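To show what such a model computes, here is a forward pass written in plain NumPy rather than with the skflow helpers. The value `embedding_size = 3`, the softmax classifier on top, and the random initial weights are all assumptions made for this sketch, and no training happens here.

```python
import numpy as np

n_classes = 4        # ids 0..3: unknown, S, C, Q
embedding_size = 3   # assumed value for this sketch
n_outputs = 2        # did not survive / survived

rng = np.random.default_rng(0)
# These three arrays are the parameters a real model would learn.
embedding_matrix = rng.normal(scale=0.1, size=(n_classes, embedding_size))
weights = rng.normal(scale=0.1, size=(embedding_size, n_outputs))
bias = np.zeros(n_outputs)

def predict_proba(ids):
    """Look up one embedding per id, then apply a softmax classifier on top."""
    features = embedding_matrix[ids]              # shape (batch, embedding_size)
    logits = features @ weights + bias            # shape (batch, n_outputs)
    logits -= logits.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

ids = np.array([1, 2, 3, 0, 1])   # S, C, Q, unknown, S
probs = predict_proba(ids)
print(probs.shape)   # (5, 2)
```

Because the lookup is just row selection, gradients flow back into the selected rows of `embedding_matrix`, which is what makes the representation learnable.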

Finally, train the model and predict on the test dataset, and voilà, we have a model using distributed representations for categorical variables.

Accuracy: 0.625698324022
ROC: 0.610550615595
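For readers unfamiliar with the two numbers above, here is a plain-Python sketch of how accuracy and ROC AUC are defined. The labels and predictions are made up for illustration and have nothing to do with the Titanic results; libraries such as scikit-learn provide `accuracy_score` and `roc_auc_score` for real use.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and hard 0/1 predictions, only to exercise the functions.
y_true = [1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

print(accuracy(y_true, y_pred))   # 0.75
print(roc_auc(y_true, y_pred))
```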

For comparison with embeddings, here is a simple model that uses a one-hot vector representation. It maps classes 1, 2, 3 into a vector with a one at the position of the class and zeros everywhere else. E.g. class C (2) will be mapped into the vector [0, 0, 1, 0].
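The mapping can be written in a couple of lines of plain Python. The helper name `one_hot` is hypothetical and is not the skflow function; the vector has four positions because id 0 is reserved for unknown values.

```python
def one_hot(class_id, n_classes):
    """Return a list with a 1 at position class_id and 0 everywhere else."""
    return [1 if i == class_id else 0 for i in range(n_classes)]

# ids: 0 = unknown, 1 = S, 2 = C, 3 = Q
print(one_hot(2, n_classes=4))   # [0, 0, 1, 0]
```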

In this case, given only one feature with 3 classes for prediction, the results end up being exactly the same as with embeddings.
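One plausible way to see why: a model that only sees the category id can output just one prediction per category, whichever representation it uses internally. The sketch below, with made-up ids and labels, computes the per-category majority label, which is the best table of predictions any such model can reach on that data.

```python
from collections import Counter, defaultdict

def majority_label_per_category(ids, labels):
    """For each category id, return the label that occurs most often."""
    counts = defaultdict(Counter)
    for category_id, label in zip(ids, labels):
        counts[category_id][label] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}

# Made-up data: category ids (0 = unknown, 1 = S, 2 = C, 3 = Q) and 0/1 labels.
ids =    [1, 1, 1, 2, 2, 2, 3, 3, 0]
labels = [0, 0, 1, 1, 1, 0, 0, 0, 1]

table = majority_label_per_category(ids, labels)
print(table)   # {1: 0, 2: 1, 3: 0, 0: 1}
```

If both the embedding model and the one-hot model end up at this same table, their accuracy on the test set is identical.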

Coming up…

You can learn how to combine categorical and continuous values in Part 4.

Additionally, as I mentioned above, most of the value of distributed representations comes from categorical features with a large number of classes. In Part 5 you can use this method of representing categorical variables for Natural Language Understanding tasks, like document classification.