Better, faster, smaller features with word2vec

James Foley
Published in Building Ibotta
Sep 13, 2019

Motivation

IbottaML is moving toward a service-oriented architecture, where we deliver ML services to stakeholders as opposed to features to our data lake. The feature requirements of these models are different from those of our batch prediction modeling frameworks; unlike our traditional large, sparse feature spaces, these new features need to be compressed for portability while maintaining predictive power. To meet these needs, we designed a next-generation feature engineering framework that incorporates more sophisticated methods to produce smaller, better features.

Our original feature store consists of both dense features (aggregated features, e.g., total number of clicks) and sparse features (lower-level features, e.g., number of clicks by category). Both feature types are interpretable, analytics-ready, and can be read into an encoder pipeline as-is. The sparse features, which are easily consumable by a DictVectorizer, are stored in the data lake as Spark MapTypes:

{"apparel_clicks": 2, "produce_clicks": 6, "electronics_clicks": 1, …}
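As a rough illustration of how such map features get one-hot-style encoded, here is a minimal sketch using scikit-learn's DictVectorizer; the feature names and counts below are made up, not real Ibotta data:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-user sparse click counts, mirroring the MapType column above.
user_features = [
    {"apparel_clicks": 2, "produce_clicks": 6, "electronics_clicks": 1},
    {"produce_clicks": 3, "pet_clicks": 4},
]

vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(user_features)

# Each map key becomes its own column, so the width grows with the vocabulary.
print(vectorizer.get_feature_names_out())
print(X.toarray())
```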

But when we encode these sparse features across our user base and each key becomes a column, the dimensionality can explode in the wide direction to tens of thousands. This isn't a problem for our models that predict in batch: minimizing preprocessing time isn't essential there, so we can include a dimensionality reduction step in the preprocessing pipeline. That step is more challenging for our real-time models, which need to pick up features and deliver predictions as quickly as possible. We needed to build user-level features offline so that they're off-the-shelf ready for model input with little to no processing, and ideally hold more predictive power than a simple dimensionality reduction transformation.

The Framework

The framework we built learns how to represent a user given their action history. Take for example product purchases: the model takes as input the ordered sequence of user purchases and learns how products relate to one another purely with regard to adjacent items in the user's "shopping basket". Once these "latent product embeddings" are learned, we can aggregate the embeddings of the products purchased by a given user to represent that user's purchase history in the latent vector space. We build these latent aggregate features across various customer-level contexts including transactions, in-app behavior, and search terms, which make up the LatentAF product suite.
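As a rough sketch of that aggregation step, here is one way it could look; the pooling function is an assumption (mean-pooling is a common choice), and the product IDs and vectors are made up:

```python
import numpy as np

# Hypothetical learned product embeddings: product ID -> latent vector.
product_embeddings = {
    "prod_123": np.array([0.12, -0.40, 0.33]),
    "prod_456": np.array([0.05, -0.31, 0.29]),
    "prod_789": np.array([-0.22, 0.18, -0.10]),
}

def user_vector(purchase_history, embeddings):
    """Aggregate the embeddings of a user's purchased products into one dense vector.

    Mean-pooling is assumed here; other pooling schemes (sum, recency-weighted, etc.)
    are equally possible.
    """
    vectors = [embeddings[p] for p in purchase_history if p in embeddings]
    if not vectors:
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vectors, axis=0)

print(user_vector(["prod_123", "prod_456"], product_embeddings))
```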

The LatentAF framework uses Amazon SageMaker's BlazingText algorithm, an optimized implementation of word2vec (see this IbottaML article for more details on how we use BlazingText). With all training jobs, transform jobs, and model artifacts managed by AWS, we've significantly simplified the portion of the production pipeline we need to maintain. We use word2vec's skip-gram model, but instead of training it to learn word embeddings, we feed it sequences of generic tokens, e.g., a sequence of purchased product IDs representing a user's recent "shopping trip(s)", so that the learned embeddings represent products. These product embeddings are similar to word embeddings in that similar and complementary products sit closer to one another in the latent vector space, giving us a way to represent more dimensions of a product than just its metadata. Aggregating these embeddings up to the user level also lets us capture a time component as well as complementary product-level relationships that don't exist in our old sparse feature space.
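A minimal sketch of what launching a BlazingText training job like this with the SageMaker Python SDK might look like; the bucket paths, role ARN, instance type, and hyperparameter values below are placeholders, not our production settings:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# BlazingText expects plain-text training data: one "sentence" per line,
# here a space-separated sequence of product IDs per user shopping trip.
train_s3_uri = "s3://my-bucket/latentaf/product-sequences/"  # placeholder

estimator = Estimator(
    image_uri=image_uris.retrieve("blazingtext", session.boto_region_name),
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.2xlarge",  # placeholder
    output_path="s3://my-bucket/latentaf/models/",  # placeholder
    sagemaker_session=session,
)

# Skip-gram word2vec over product-ID "sentences"; hyperparameter values are illustrative.
estimator.set_hyperparameters(
    mode="skipgram",
    vector_dim=25,
    window_size=5,
    min_count=5,
    epochs=10,
)

estimator.fit({"train": train_s3_uri})
```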

Evaluation

We did a thorough evaluation of these new features, comparing them to our old feature space by training dozens of models that estimate a user's propensity to buy certain products. The new latent features outperformed the old with an average AUC lift of 8%. In addition to being more predictive, the LatentAF vector dimensionality is 1/100th of the old sparse feature space, making these features more portable and faster for models to read. We also did some tuning around latent vector size (see table below), and word2vec performs well even when condensing a feature space of ~10,000 dimensions down to as few as 25:

Average model performance across various embedding sizes.
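For completeness, a toy sketch of this kind of feature-space comparison is shown below. Everything here is synthetic placeholder data and the model choice is an assumption; none of the numbers reported above come from this sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data: X_sparse stands in for the old wide sparse features
# (the real space is ~10,000 wide), X_latent for the compact LatentAF vectors,
# and y for a binary "bought the product" label.
rng = np.random.default_rng(0)
n_users = 5000
X_sparse = rng.poisson(0.05, size=(n_users, 2000))
X_latent = rng.normal(size=(n_users, 100))
y = rng.integers(0, 2, size=n_users)

def evaluate(X, y):
    """Train a simple propensity model and return its holdout AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("sparse AUC:", evaluate(X_sparse, y))
print("latent AUC:", evaluate(X_latent, y))
```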

A by-product of these customer-level features is the product-level latent embeddings themselves, which are fun to play with and a good way to prove that word2vec is learning effective product representations. Focusing on the product category embedding space (representing the categories of products that users purchase, e.g., hair care, snack bars, pasta, etc.), we calculate the cosine similarity between each pair of product category embeddings in our vocabulary to represent how "close" each category is to another in the latent vector space. Given a particular product category of interest, we can then see the most similar categories according to cosine similarity. Let's take for example the category "ketchup"; the five most similar category vectors are (a rough sketch of this lookup follows the list):

  1. mayonnaise
  2. mustard
  3. sauces
  4. hot dogs
  5. frozen potatoes
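A minimal sketch of that nearest-neighbor lookup; the category names and vectors below are made up, whereas in practice the vectors come from the trained model:

```python
import numpy as np

# Hypothetical category embeddings: category -> latent vector.
category_embeddings = {
    "ketchup": np.array([0.90, 0.10, 0.30]),
    "mayonnaise": np.array([0.80, 0.20, 0.25]),
    "mustard": np.array([0.85, 0.15, 0.35]),
    "hot dogs": np.array([0.70, 0.30, 0.40]),
    "hair care": np.array([-0.50, 0.90, -0.20]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(category, embeddings, top_n=5):
    """Rank the other categories by cosine similarity to the query category."""
    query = embeddings[category]
    scores = {
        other: cosine_similarity(query, vec)
        for other, vec in embeddings.items()
        if other != category
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(most_similar("ketchup", category_embeddings))
```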

The interesting thing about these “similar” categories is that they’re not necessarily similar, but highly complementary. For example, “hot dogs” and “frozen potatoes” aren’t as similar to “ketchup” as categories like “mayonnaise” or “mustard”, but they’re very complementary to “ketchup” because people often put ketchup on hot dogs and potatoes/hash browns. This strong association between complementary categories makes sense when we consider how these embeddings are learned, which is solely with regard to adjacent products purchased on users’ “shopping trips”. In other words, people tend to buy mayonnaise, mustard, sauces, hot dogs, and frozen potatoes alongside ketchup.

The plot below visualizes a variety of product category embeddings in 2-dimensional space (actual embeddings, not fake data). The distance between data points reflects the closeness of their embeddings in the latent vector space, and the color groupings represent distinct clusters of products.

t-SNE plot of various product embeddings represented in 2-dimensional space.
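A plot along these lines can be produced with scikit-learn's t-SNE. This is only a sketch: the vectors below are random placeholders rather than real embeddings, and the perplexity and plotting choices are assumptions, not our production setup.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical category embeddings (category -> 25-dim latent vector).
rng = np.random.default_rng(0)
categories = ["grape", "banana", "apple", "tomato", "avocado",
              "carrot", "pizza", "cookie", "cereal"]
vectors = np.array([rng.normal(size=25) for _ in categories])

# Project the latent vectors down to 2 dimensions for plotting.
# Perplexity must be smaller than the number of points.
coords = TSNE(n_components=2, perplexity=3, random_state=0,
              init="random").fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, categories):
    ax.annotate(label, (x, y))
plt.show()
```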

As seen above, grape, banana, and apple are very similar as they’re all fruits. While tomatoes and avocados are technically fruits, they fall within the vegetable realm in most people’s minds and apparently according to people’s buying behavior too, as tomato, avocado, and carrot are close to the fruit cluster but seemingly distinct. The third cluster of products, pizza, cookie, and cereal, are clearly different from the fruit/vegetable clusters and seem to represent a variety of dry goods/less healthy food. Again, it’s very cool to see the model learn sensible relative representations of products given only ordered product IDs of users’ shopping trips.

Conclusion

With latent embeddings learned across various contexts, we're able to fully represent a user's purchasing behavior, app usage, etc. in a dense vector. We write these features to our on-demand feature store so our real-time models have easy access to them, and because these LatentAF features are significantly compressed, our models can quickly consume complete user-level information to make real-time predictions with minimal feature processing. Easy portability, high predictive power, and a lighter-weight production pipeline make for a feature engineering service that enables IbottaML to build next-generation models for a smarter, more efficient mobile app.

We’re Hiring

Ibotta is hiring Machine Learning Engineers, so if you're interested in working on challenging machine learning problems like the one described in this article, give us a shout. You can find open roles on Ibotta's careers page.
