How to serve an embedding trained with Estimators

By Lak Lakshmanan, Technical Lead, Google Cloud Platform

When you have a sparse categorical variable (a variable that can take many possible values), it can be helpful to embed it into a lower dimension. The best-known form of embedding is word embedding (as in word2vec or GloVe), where every word in the language is represented by a vector of, say, 50 elements. The idea is that similar words are close to each other in the 50-dimensional space. You can do the same thing with your categorical variables: train the embedding on one problem, then reuse that embedding in related problems instead of just one-hot encoding the categorical variable. Because the lower-dimensional space of your embedding is continuous, it can also serve as the input to a clustering algorithm, so you can find natural groupings of the categorical variable.

Embeddings help you see the forest and not just the trees.

To serve an embedding trained with an Estimator, you can send out the lower-dimensional representation of your categorical variable along with your normal prediction outputs. Embedding weights are saved in the SavedModel, and one option is to share that file itself. Alternatively, you can serve the embedding on demand to clients of your machine learning team. This may be more maintainable, because those clients are then only loosely coupled to your choice of model architecture. They will get an updated embedding every time your model is replaced by a newer, better version.

In this article, I will show you how to:

  1. Create an embedding as part of a regression/classification model
  2. Represent categorical variables in different ways
  3. Do math with feature columns
  4. Serve out the embedding along with the outputs of the original model

The entire code of this article is on GitHub and it contains much more context. I’m only showing you key snippets here.

Model to predict bicycle demand

Let’s build a simple demand forecasting model to predict the number of bicycle rentals at a station, given the day of the week and whether it is a rainy day. The data for this comes from a public dataset of New York City bicycle rentals and NOAA weather data.

The inputs to the model are:

  • The day of week (integerized, since it is 1–7).
  • The station id (hashed into buckets, since we don’t know the full vocabulary; the dataset has about 650 unique values, so we’ll use a much larger number of hash buckets and then embed the result into a lower dimension).
  • Whether it is rainy (true/false).

The label we’ll want to predict is num_trips.

We can create the dataset by running this query in BigQuery to join the bicycle and weather datasets and do the necessary aggregations:

Writing the model using an Estimator

To write the model, we will use a custom estimator in TensorFlow. Although this is just a linear model, we cannot use the LinearRegressor because the LinearRegressor hides all the underlying feature column arithmetic. We need access to an intermediate output (the output of the embedding feature column), and so we will write the linear model explicitly.

To implement a custom estimator, you have to write a model function and pass it into the Estimator constructor:

The model function in a custom estimator has 5 parts:

1. Define the model:

We are taking the station column and putting it into a bucket based on its hash code. This is a trick to avoid having to build a full vocabulary. There are only about 650 bicycle rental stations in New York, so with 5000 hash buckets a collision between any two given stations is unlikely. By then embedding the station id into a smaller number of dimensions, we also get to learn which stations are like each other, at least in the context of rainy-day rentals. Ultimately, every station id is represented by just a 2-dimensional vector. The number 2 controls how accurately the lower-dimensional space represents the information in the categorical variable. My choice of 2 here was arbitrary; realistically, we will need to tune this hyperparameter for best performance.
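To make the hash-then-embed idea concrete, here is a minimal sketch in plain Python rather than TensorFlow. The names `hash_bucket`, `embed_station` and `embedding_table` are hypothetical, MD5 is only a stand-in for whatever hash function TensorFlow applies internally, and the random table stands in for weights that would normally be learned during training.

```python
import hashlib
import random

NUM_BUCKETS = 5000  # number of hash buckets, as in the text
EMBED_DIM = 2       # embedding size, as in the text


def hash_bucket(station_id, num_buckets=NUM_BUCKETS):
    """Map a station id to a bucket index in [0, num_buckets)."""
    # Assumption: MD5 is a stable stand-in hash, not TensorFlow's own function.
    digest = hashlib.md5(str(station_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# One row per bucket. Random values stand in for trained weights.
rng = random.Random(0)
embedding_table = [
    [rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_BUCKETS)
]


def embed_station(station_id):
    """Look up the 2-dimensional vector for a station id."""
    return embedding_table[hash_bucket(station_id)]


station_ids = [435, 521, 3221, 3237]
buckets = [hash_bucket(s) for s in station_ids]
vectors = [embed_station(s) for s in station_ids]
```

Two stations that happen to hash to the same bucket would share one row of the table, which is why the bucket count is kept well above the number of stations.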

The other two categorical columns are created using their actual vocabulary and then one-hot encoded (the indicator column one-hot encodes the data).
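As an illustration of what one-hot encoding produces for these two inputs, here is a small plain-Python sketch. The bucket count of 8 for the day of week (so that values 1 to 7 can be used directly as indices, leaving index 0 unused) and the vocabulary `["false", "true"]` for the rainy flag are assumptions made for this example.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    if not 0 <= index < depth:
        raise ValueError("index %d is outside [0, %d)" % (index, depth))
    return [1.0 if i == index else 0.0 for i in range(depth)]


DAY_BUCKETS = 8                  # assumption: days are 1-7, index 0 is unused
RAINY_VOCAB = ["false", "true"]  # assumption: the flag arrives as a string


def encode_day_of_week(day):
    """One-hot encode an integer day of week."""
    return one_hot(day, DAY_BUCKETS)


def encode_rainy(rainy):
    """One-hot encode the rainy flag by its position in the vocabulary."""
    return one_hot(RAINY_VOCAB.index(rainy), len(RAINY_VOCAB))


day_vec = encode_day_of_week(4)
rainy_vec = encode_rainy("true")
```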

The two sets of inputs are concatenated to create one wide input layer and then passed into a dense layer with one output node. This is how you program a linear model at a relatively low level. This is equivalent to writing a LinearRegressor as:

Note that input_layer, indicator_column, etc. are all hidden away by the LinearRegressor. I am, however, exposing them because I want access to the station’s embeddings.
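The concatenate-then-dense step described above can be sketched in plain Python as follows. This is not the TensorFlow code. The function name `dense_one_output` is hypothetical, and the embedding values, weights and bias are made-up numbers chosen only to show the arithmetic.

```python
def dense_one_output(inputs, weights, bias):
    """A dense layer with one output node and no activation: a dot product plus bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Made-up values for one instance.
embed_part = [0.5, -0.25]                                # 2-dimensional station embedding
day_part = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]      # one-hot day of week
rainy_part = [0.0, 1.0]                                  # one-hot rainy flag

# Concatenation yields one wide input vector of width 2 + 8 + 2 = 12.
all_inputs = embed_part + day_part + rainy_part

weights = [0.1] * len(all_inputs)  # made-up weights, one per input
bias = 2.0                         # made-up bias

prediction = dense_one_output(all_inputs, weights, bias)
```

Because the embedding is computed as its own intermediate vector before concatenation, it remains available to return alongside the prediction.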

2. Use a regression head to set up an estimator spec.

For a regression problem, we can minimize the mean squared error using the Ftrl optimizer (this is the default one used by the LinearRegressor, so I’m also using it):
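As a rough illustration of the quantity being minimized, here is a plain-Python sketch of mean squared error for a linear model. It is not the TensorFlow code, and it uses one step of ordinary gradient descent as a simpler stand-in for the FTRL optimizer. The function names, the tiny batch and the learning rate are all made up for this example.

```python
def predict(inputs, weights, bias):
    """Linear model: dot product of inputs and weights, plus bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def mean_squared_error(predictions, labels):
    """Average of squared differences between predictions and labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


def gradient_step(batch, labels, weights, bias, learning_rate):
    """One step of plain gradient descent on the MSE (a stand-in, not FTRL)."""
    n = len(batch)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for inputs, y in zip(batch, labels):
        err = predict(inputs, weights, bias) - y
        for j, x in enumerate(inputs):
            grad_w[j] += 2.0 * err * x / n
        grad_b += 2.0 * err / n
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - learning_rate * grad_b


# Made-up batch of two instances with two inputs each.
batch = [[1.0, 0.0], [0.0, 1.0]]
labels = [3.0, 1.0]
weights, bias = [0.0, 0.0], 0.0

loss_before = mean_squared_error([predict(x, weights, bias) for x in batch], labels)
weights, bias = gradient_step(batch, labels, weights, bias, learning_rate=0.1)
loss_after = mean_squared_error([predict(x, weights, bias) for x in batch], labels)
```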

3–4. Create a dictionary of outputs

Normally, we will send out only the predictions, but in our case, we want to send back both the predictions and the embeddings:

The ability to change the export_outputs is the other reason that we need to use a custom estimator here.

5. Send back an EstimatorSpec with the predictions and export outputs replaced:
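The replace-and-return pattern in steps 3 to 5 can be sketched with a named tuple from the standard library. This is a sketch under stated assumptions: `FakeSpec` is a hypothetical stand-in for the real spec object, its fields are chosen for illustration, and all numeric values are made up. It relies only on the fact that named tuples are immutable and offer `_replace`, which returns a copy with the given fields swapped.

```python
from collections import namedtuple

# Hypothetical stand-in for the spec returned by the regression head.
FakeSpec = namedtuple("FakeSpec", ["mode", "loss", "predictions", "export_outputs"])

# What the head might hand back: predictions only (made-up value).
spec = FakeSpec(
    mode="predict",
    loss=None,
    predictions=[12.5],
    export_outputs={"predict": [12.5]},
)

# Steps 3-4: a dictionary carrying both the prediction and the embedding.
predictions = {
    "predicted": [12.5],               # made-up prediction
    "station_embed": [[0.0, 0.008]],   # made-up 2-dimensional embedding
}
export_outputs = {"predictions": predictions}

# Step 5: return a copy of the spec with the two fields replaced.
new_spec = spec._replace(predictions=predictions, export_outputs=export_outputs)
```

The original `spec` is left untouched; only the copy carries the embedding, and every other field (such as the loss and mode set up by the head) is preserved.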

Now, we train the model as normal.

Invoking predictions

The exported model can then be served using TensorFlow Serving, or optionally deployed to Cloud ML Engine (which is essentially hosted TF Serving), and then invoked for predictions. You can also invoke the local model using gcloud (which provides a more convenient interface for this purpose than saved_model_cli):

What’s in test.json?

{"day_of_week": 4, "start_station_id": 435, "rainy": "true"}

{"day_of_week": 4, "start_station_id": 521, "rainy": "true"}

{"day_of_week": 4, "start_station_id": 3221, "rainy": "true"}

{"day_of_week": 4, "start_station_id": 3237, "rainy": "true"}

As you can see, I am sending 4 instances, corresponding to stations 435, 521, 3221 and 3237.
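The file holds one JSON object per line. As a quick check of that format, here is a minimal Python sketch that parses the four instances; the helper name `parse_instances` is hypothetical, and the content is embedded as a string so the example does not depend on a file on disk. Note that a JSON parser requires straight double quotes around keys and string values.

```python
import json

# The four instances from test.json, one JSON object per line.
TEST_JSON = """\
{"day_of_week": 4, "start_station_id": 435, "rainy": "true"}
{"day_of_week": 4, "start_station_id": 521, "rainy": "true"}
{"day_of_week": 4, "start_station_id": 3221, "rainy": "true"}
{"day_of_week": 4, "start_station_id": 3237, "rainy": "true"}
"""


def parse_instances(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


instances = parse_instances(TEST_JSON)
station_ids = [inst["start_station_id"] for inst in instances]
```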

The first two stations are in Manhattan, in an area where rentals are quite frequent (and serve both commuters and tourists). The last two stations are in Long Island, in an area where rentals are somewhat less common (and perhaps only on weekends). The resulting output contains both the predicted number of trips (our labels) and the embedding for the stations:

In this output, the first dimension of the embedding is almost zero for every station, so a one-dimensional embedding would have been enough. Looking at the second dimension, it is quite clear that the Manhattan stations have positive values (0.0081, 0.0011) whereas the Long Island stations have negative values (-0.0025, -0.0031).
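The grouping described above can be reproduced with a few lines of Python. The sketch uses only the four second-dimension values quoted in the text; pairing each value with a station id in the order the stations were sent is an assumption, and the helper `group_by_sign` is hypothetical.

```python
# Second embedding dimension per station, using the values quoted in the text.
# Assumption: values are listed in the same order as the stations were sent.
second_dim = {435: 0.0081, 521: 0.0011, 3221: -0.0025, 3237: -0.0031}


def group_by_sign(values):
    """Split station ids into groups by the sign of their embedding value."""
    groups = {"positive": [], "negative": []}
    for station_id, value in values.items():
        # A value of exactly zero would land in the negative group;
        # none of the values here are zero.
        key = "positive" if value > 0 else "negative"
        groups[key].append(station_id)
    return groups


groups = group_by_sign(second_dim)
```

With more stations, the same values could instead be fed to a clustering algorithm rather than split on a fixed threshold.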

This was learned purely by the machine learning model looking at bicycle rentals on different days at the two locations! If you have categorical variables in your TensorFlow models, try serving out the embeddings from them. Perhaps they will lead to new insights!