Reg2Vec: Learning Embeddings for High Cardinality Customer Registration Features

Dave King · Building Ibotta · Sep 25, 2019

Motivation

When a new customer registers for the Ibotta app we collect several pieces of customer metadata such as zip code, state, email domain, device model, and age. These features are useful in making predictions for our users, including recommendations, but they are difficult for some machine learning (ML) models to ingest as-is. The challenge stems from the fact that most of these features are high cardinality categorical variables. Zip code, for example, has roughly 37,000 unique categories, which makes one-hot encoding impractical. Other solutions such as ordinal encoding or target encoding limit model choice or require significant feature engineering.

As Ibotta adopts a more service-oriented ML architecture, the portability of our features becomes increasingly important to provide low-latency predictions for our users. A previous post explained how IbottaML has built latent aggregate features that encode user behavior across a variety of contexts into a low-dimensional space. In that same spirit, Reg2Vec encodes customer registration features in a lightweight, vectorized format for use in downstream ML applications with little to no additional preprocessing.

Embeddings

The lightweight features we build from customer registration data are also known as embeddings. Embeddings are simply mappings of discrete objects to vectors of real numbers. Word2Vec, for example, is a popular framework for creating embeddings of English words. One way to build these embeddings is as a byproduct of training a standard neural network on a supervised task. This technique has been popularized for creating entity embeddings of categorical variables in other applications, and we follow a similar methodology with Reg2Vec.
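As a minimal illustration of what such a mapping looks like in practice (the cardinality and embedding dimension below are arbitrary examples, not our production values), PyTorch's nn.Embedding implements exactly this lookup from discrete categories to dense vectors:

```python
import torch
import torch.nn as nn

# Hypothetical example: map ~37,000 zip codes to 16-dimensional vectors.
# Each row of the weight matrix is the learned embedding for one category.
zip_embedding = nn.Embedding(num_embeddings=37_000, embedding_dim=16)

# Look up the embeddings for a batch of integer-encoded zip codes.
zip_ids = torch.tensor([101, 2048, 36999])
vectors = zip_embedding(zip_ids)  # shape: (3, 16)
```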

The Reg2Vec Process

Model Architecture

The Ibotta app rewards users with cash back on purchases made at hundreds of retail chains, restaurants, movie theaters, convenience stores, home improvement centers, pet stores, and pharmacies nationwide. The retailers our users engage with provide a valuable signal of their unique tastes and preferences.

With this in mind, the supervised task we used to learn these embeddings was a multi-label classification task: predicting which retailers a user shopped at in a given window of time. We experimented with several other supervised tasks, but found retailer prediction to produce the highest quality, most expressive embeddings. We leveraged the PyTorch framework and trained the model via Amazon SageMaker.

The first layer of the network contains an embedding for each categorical feature. These embeddings are initialized randomly but updated via gradient descent during training. The embeddings are then concatenated and passed through a fully connected layer that fans out to an output layer the same size as the number of retailers. This is followed by a sigmoid activation because the task is multi-label classification: a user can shop at any number of retailers.

Reg2Vec Model Architecture
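A minimal PyTorch sketch of this kind of architecture might look like the following (the feature cardinalities, embedding size, hidden size, and retailer count are placeholder values, not our production configuration):

```python
import torch
import torch.nn as nn

class Reg2VecNet(nn.Module):
    def __init__(self, cardinalities, emb_dim=16, hidden_dim=128, n_retailers=300):
        super().__init__()
        # One embedding table per registration feature (zip code, state, etc.).
        self.embeddings = nn.ModuleDict({
            name: nn.Embedding(card, emb_dim)
            for name, card in cardinalities.items()
        })
        self.hidden = nn.Linear(emb_dim * len(cardinalities), hidden_dim)
        self.output = nn.Linear(hidden_dim, n_retailers)

    def forward(self, features):
        # features: dict mapping feature name -> integer-encoded category ids.
        # Look up each feature's embedding, concatenate, and fan out to retailers.
        embedded = [self.embeddings[name](features[name]) for name in self.embeddings]
        x = torch.cat(embedded, dim=1)
        x = torch.relu(self.hidden(x))
        # Sigmoid per retailer because this is multi-label, not multi-class.
        return torch.sigmoid(self.output(x))
```

A binary cross-entropy loss over the retailer labels pairs naturally with the sigmoid outputs during training.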

Unlike most models, where we're interested in the outputs, in this case we're interested in the inputs. More precisely, we're trying to extract the learned representations of the inputs from the embedding layer of the network. We can easily pull these out of the weights stored in the trained model's state_dict().
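Continuing the sketch above, extracting the learned state embeddings is just a dictionary lookup; the key names follow however the module was defined, and the cardinalities here are again placeholders:

```python
# Placeholder cardinalities for illustration only.
cardinalities = {"zip": 37_000, "state": 51, "email_domain": 1_000, "device_model": 500}
model = Reg2VecNet(cardinalities)

# ... train the model on the retailer prediction task ...

# Each embedding table lives in the state_dict under "embeddings.<feature>.weight".
state_vectors = model.state_dict()["embeddings.state.weight"]
print(state_vectors.shape)  # torch.Size([51, 16]): one 16-dim vector per state
```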

Results

One of the most interesting applications of the resulting embeddings is to examine the distance between different categories within this new latent space using a standard similarity score like cosine similarity. Take state for example — since we used retailer prediction as the supervised task to train these embeddings and retailers are sometimes regional, we might expect geographically close states to have similar embeddings. We can confirm this by selecting some example states and looking at their top 5 nearest neighbors by cosine similarity.
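A rough sketch of how such a lookup could be done, assuming state_vectors is the matrix extracted above and state_names is a hypothetical list mapping row indices to state abbreviations:

```python
import torch
import torch.nn.functional as F

def top_neighbors(state_vectors, state_names, query_idx, k=5):
    # Cosine similarity between the query state's embedding and every state's embedding.
    sims = F.cosine_similarity(state_vectors[query_idx].unsqueeze(0), state_vectors)
    sims[query_idx] = -1.0  # exclude the query state itself
    top = torch.topk(sims, k).indices
    return [(state_names[i], round(sims[i].item(), 3)) for i in top.tolist()]
```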

Example State Similarities

We can see that the model has done a good job of orienting neighboring states closer together in the embedding space. Remember, the model is not given any geographic information whatsoever — these embeddings are learned representations of states as a byproduct of training our retailer prediction task which makes these results pretty exciting. We found similar relationships among the other features.

One of our main goals with Reg2Vec was to produce lightweight representations of these high cardinality categorical variables while maintaining (or ideally improving) their predictive power in downstream ML models. In order to evaluate the quality of these new features, we compared the performance of using the embeddings vs. a benchmark encoding on some common classification tasks we have here at Ibotta. Across the tasks we tested, we saw a 2–3% improvement in AUC using the Reg2Vec embeddings.
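The comparison itself is straightforward. A hypothetical version, assuming the embedding-based and benchmark-encoded feature matrices and labels already exist (they are not defined in this sketch), might look like:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def holdout_auc(X_train, X_test, y_train, y_test):
    # Train the same downstream classifier on a given feature representation
    # and score it on the held-out set.
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# X_emb_* use the Reg2Vec embeddings; X_bench_* use the benchmark encoding.
auc_emb = holdout_auc(X_emb_train, X_emb_test, y_train, y_test)
auc_bench = holdout_auc(X_bench_train, X_bench_test, y_train, y_test)
print(f"Reg2Vec AUC: {auc_emb:.3f}  benchmark AUC: {auc_bench:.3f}")
```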

Conclusion

Reg2Vec encodes customer registration features in a lightweight, vectorized format for use in downstream ML applications with little to no additional preprocessing. These new features have been shown to outperform benchmark encodings, with the added benefit of being extremely portable, making them ideally suited for real-time prediction tasks. This framework could feasibly be applied to any situation where you are dealing with high cardinality categorical features, provided you have a supervised task where the learned embeddings would be useful in other models across your organization.

We’re Hiring

Ibotta is hiring Machine Learning Engineers, so if you're interested in working on challenging machine learning problems like the one described in this article, give us a shout through Ibotta's careers page.
