Ibotta’s Recommender System

James Foley
Published in Building Ibotta · Mar 28, 2019

written by Matt Johnson and James Foley

Our users are presented with a plethora of savings opportunities when they open the Ibotta app, and the variety of available content is constantly growing. This leads to the interesting challenge of how to organize all that content. Is there an optimal arrangement of content for each user? Do we have enough signal in our data to model users' shopping preferences? How do we scale a solution across millions of users? Product recommendation is one of many challenges addressed by IbottaML, and in this piece we discuss the development of our latest recommender system.

Evolution of recommendations at Ibotta

Our goal as a Machine Learning team is to develop production-quality features and scalable, high-impact ML solutions across the organization. The Recommender system we've developed to serve personalized content to users is an ensemble of feature engineering and modeling components that embodies the standards we set for production-quality, fully managed, large-scale machine learning frameworks. This blog post outlines how IbottaML developed the Recommender pipeline used today, from both technical and product perspectives.

Image from thedatascientist.com

Presenting users with the content most relevant to them results in a more rewarding app experience: we save users time because they can launch the app and immediately see offers and products they like without having to search for them. We also have the opportunity to surface novel offers, encouraging users to try new things they will likely enjoy. In our situation, though, effectively predicting which offers users will like is tricky:

  • Not all offers are available to all users.
  • New offers are introduced into the app each week, so we have limited information on those offers.
  • New users are constantly joining the app, so we have limited information on their preferences.
  • Scaling these predictions is also a major challenge given our commitment to delivering daily recommendations to millions of users.

Over time we’ve developed better ways of addressing these challenges, as we learn from each iteration of the recommender product.

The initial approach by the Analytics team to recommending content was tested through email communications. Clustering and PCA were used to identify which types of items interest users, which determined the set of curated items surfaced to users in an email. The test yielded immediate positive results, proving to the team that there was value in building out personalized offer recommendations, so the Analytics team moved on to developing a system to power offer recommendations in the app.

The first version of the Recommender system to be productionized blended various recommendation scores based on user purchase histories, overall offer popularity, and brand similarity mappings to create a set of recommended offers personalized for the user. These recommendations were calculated in batch, written to a database, then picked up by a service that served the content to a specific module/offer gallery in the app. A major milestone from this work, in addition to the incremental lift generated by the new recommender engine, was the development of the production pipeline that allowed us to directly insert offers into a user's app experience. This pipeline allowed for quick iteration and testing of Recommender system variants.

Our second iteration of the Recommender was set up as a supervised learning problem where we predict the propensity that a user will purchase a given offer. We engineered features at the user and offer levels and combined those with output from collaborative filtering methods, feeding all features into a final supervised model. From this model we generated predictions for all user/offer pairs (representing a user's propensity to interact with that offer), letting us easily pick the top-ranked offers for every user. This machine-learning-based approach utilized more user- and offer-level metadata, allowing for a deeper level of personalization as well as better recommendations for offers not previously purchased.

The evolution of the Recommender highlights an effective approach to developing and improving a product. We didn't start by applying machine learning and developing a complex feature engineering pipeline; we instead built a simpler proof of concept, tested it, and then pursued more sophisticated solutions once the lift was proven.

The Pipeline

Training a recommender model requires features that are predictive of a user engaging with an offer. Ibotta fortunately has a large feature bank that allows us to reuse features instead of having to start from scratch for each project (for more on Ibotta feature engineering, see here). For the Recommender system we used hundreds of features, some of which were already available in our feature bank. In addition to these already available features, we constructed new ones that were hypothesized to be relevant to recommendations. These new features are now available for others at Ibotta to consume (via a Hive metastore). Some of these new features include:

  • ALS features (more on what this feature is in a moment)
  • Ratio of redemptions to impressions of an offer
  • Percent through lifecycle
  • Purchase Cadence

Some of these features are hypothesized to interact. For example, suppose a user purchased trash bags three days ago, so days_since_last_purchase is low. Combine that with purchase cadence, the expected time between purchases of the item: because the user bought the item recently and the item is repurchased infrequently, the two features together lower the predicted relevancy of this offer right now.
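As a rough illustration, this interaction can be captured in a single derived feature: the fraction of the item's expected purchase cycle that has elapsed. The function and feature names below are hypothetical sketches, not Ibotta's actual schema.

```python
def cadence_ratio(days_since_last_purchase: float,
                  purchase_cadence_days: float) -> float:
    """Fraction of the item's expected purchase cycle that has elapsed.

    Values near 0 mean the user just bought the item (the offer is likely
    less relevant right now); values near or above 1 mean the user is due
    to repurchase.
    """
    if purchase_cadence_days <= 0:
        # No reliable cadence estimate; treat as "just purchased".
        return 0.0
    return days_since_last_purchase / purchase_cadence_days

# Trash bags: bought 3 days ago, typically repurchased every ~30 days.
ratio = cadence_ratio(3, 30)  # 0.1 -> very early in the cycle
```

A downstream model can then learn that low values of this ratio should suppress the offer's predicted relevancy.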

The ALS features have the most power in predicting user engagements with offers. ALS (Alternating Least Squares) is a matrix factorization technique (for a deep dive into how ALS works, see here). At a high level, we construct a matrix where each row represents a user, each column an item, and each cell value depends on the item context and the level of engagement. (An item can be the offer's brand, the offer's category, or the offer itself. Engagements are unlocks and redemptions; a user will unlock an offer to show intent to purchase, and redeem an offer once purchased.) For example, with a brand-level context we calculate how many times a user purchased that brand in a given time period, whereas for the offer-level context we determine how many times a user purchased that specific offer. We construct four such matrices:

  • Row: User Id, Column: Offer Id, Cell Value: Redemptions
  • Row: User Id, Column: Offer Id, Cell Value: Unlocks
  • Row: User Id, Column: Brand Id, Cell Value: Unlocks
  • Row: User Id, Column: Category Id, Cell Value: Unlocks

Now a lot of these entries will be missing, since a typical user interacts with only a small fraction of items; a matrix with many missing entries is called sparse. Matrix factorization decomposes each matrix into the product of two smaller matrices, and multiplying those factors back together fills in the missing cells. These inferred values are useful approximations of how we would expect users to interact with items they haven't actually interacted with, and we use them as features in their own right for the user/offer combination in context.
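To make the factorization concrete, here is a minimal ALS sketch in NumPy on a toy engagement matrix. This only illustrates the technique; the production system runs ALS at scale on Spark, and the matrix, rank, and regularization below are arbitrary assumptions.

```python
import numpy as np

def als(R, mask, k=2, reg=0.1, iters=20, seed=0):
    """Factor R (users x items) into U @ V.T, fitting only observed cells.

    mask[u, i] is 1 where an engagement count was observed, 0 where missing.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(iters):
        # Solve a small ridge regression per user, holding items fixed...
        for u in range(n_users):
            obs = mask[u] == 1
            A = V[obs].T @ V[obs] + reg * np.eye(k)
            U[u] = np.linalg.solve(A, V[obs].T @ R[u, obs])
        # ...then per item, holding users fixed (hence "alternating").
        for i in range(n_items):
            obs = mask[:, i] == 1
            A = U[obs].T @ U[obs] + reg * np.eye(k)
            V[i] = np.linalg.solve(A, U[obs].T @ R[obs, i])
    return U, V

# 3 users x 3 offers; cells with mask == 0 were never observed.
R = np.array([[5., 3., 0.], [4., 0., 1.], [0., 1., 5.]])
mask = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
U, V = als(R, mask)
imputed = U @ V.T  # every cell now holds a predicted engagement level
```

The previously missing cells of `imputed` are the ALS features fed downstream.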

All of these features are then fed into a gradient boosted tree model that predicts the propensity of a strong engagement for each user/offer pair, where a strong engagement is defined as an offer unlock or redemption.
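This final supervised step might look like the following sketch, using scikit-learn's gradient boosting as a stand-in (the post does not name the training library) and entirely synthetic features and labels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
# Three illustrative features per user/offer pair: an ALS score, the
# offer's redemption/impression ratio, and a purchase-cadence ratio.
X = rng.random((n, 3))
# Synthetic label: a strong engagement (unlock or redemption) is more
# likely when the ALS score and redemption ratio are both high.
y = (X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(n) > 1.0).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
propensity = model.predict_proba(X)[:, 1]  # P(strong engagement) per pair
```

Ranking offers by `propensity` per user yields the recommendation list.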

Evaluation

A natural question to ask after building a recommender model is: how well does this model perform? There are two common approaches to evaluation: offline and online metrics.

Offline Metrics

This technique is observational: we take already-collected data (e.g., which offers users have historically engaged with), "hide" some of it from the model, and see whether the model can predict it. This is an easy way to "grade" the performance of your model. A problem, however, is that these offline metrics are often not well correlated with how your model will perform in the real world. So while offline evaluation is a rough tool that helps drive early development, it's not a rigorous technique on which one should report business metrics.
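One common offline metric for this "hide and predict" setup is precision@k: hold out some known engagements, rank offers for each user, and measure how many of the top-k recommendations hit the held-out set. The metric choice and data here are illustrative assumptions, not Ibotta's reported methodology.

```python
def precision_at_k(ranked_offers, held_out, k=5):
    """Fraction of the top-k recommended offers the user actually engaged with."""
    top_k = ranked_offers[:k]
    hits = sum(1 for offer in top_k if offer in held_out)
    return hits / k

# Model-ranked offers for one user; offers 2 and 7 were hidden engagements.
ranked = [2, 9, 7, 4, 1, 8]
held_out = {2, 7}
score = precision_at_k(ranked, held_out, k=5)  # 2 hits in the top 5 -> 0.4
```

Averaging this over users gives a single offline score to compare model variants during development.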

Online Metrics

This technique is interventional, and interventions are powerful in data science. In an intervention you typically have a treatment and a control (an A/B test). The control is the baseline: the offer recommender algorithm producing the content Ibotta users are currently exposed to. The treatment is the new recommender algorithm, whose content arrangement users have not yet been exposed to. A random proportion of users is assigned the new recommender output, while the rest continue getting the current output. We let this experiment run over a few weeks and then calculate several metrics using a technique developed here at Ibotta.

We actually tested several different recommender models (12 in total) against the control. Some of these model variants included modifying the time windows we look over, how we aggregate redemptions and unlocks, etc. The model that was selected maximized metrics across several different outcomes (e.g. increased user activity, highest revenue, etc.). As an example, our internal MVT framework (Multivariate Testing Framework, which generalizes an A/B test to A/B/C/D/…) automatically published the graphic below:

For each treatment, probability that treatment is best

This bar chart reports the probability that each treatment performs best on the given metric (in this example, unlocks). The tallest bar (around 60%) corresponds to the model we ultimately selected.
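One common way to produce a "probability each treatment is best" chart like this is Bayesian: place a Beta posterior on each variant's rate and count how often each variant wins across Monte Carlo draws. The post doesn't describe the internals of Ibotta's MVT framework, so this is an assumed approach with synthetic counts.

```python
import numpy as np

def prob_best(successes, trials, n_samples=100_000, seed=0):
    """P(each variant has the highest underlying rate), one value per variant."""
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes)
    trials = np.asarray(trials)
    # Beta(1 + successes, 1 + failures) posterior for each variant's rate,
    # starting from a uniform Beta(1, 1) prior.
    draws = rng.beta(1 + successes, 1 + trials - successes,
                     size=(n_samples, len(successes)))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(successes)) / n_samples

# Unlock counts for a control and two hypothetical treatments.
p = prob_best(successes=[120, 150, 135], trials=[1000, 1000, 1000])
```

Each entry of `p` is the height of one bar in a chart like the one above.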

Productionization

Delivering over one billion offer recommendations daily is no small feat; building feature sets, training models, making predictions, and loading recommendations into the app, all while keeping computation efficient, requires a robust, monitored production pipeline. This pipeline executes every day because we want to ensure we're making fresh offer recommendations for our users. So as new offers are introduced, as new users join the app, and as ongoing users' purchase histories and app behaviors update, we can incorporate this new information to generate the most up-to-date recommendations each day.

Retraining the model daily requires amalgamating various features from our data lake and defining target values, for which we use Spark. Joining together dozens of tables to create a final set of offer level and customer level features is not a simple process. Some challenges we dealt with were data skew, optimal file sizing, Spark application parameter tuning, and generally munging and shipping hundreds of gigabytes of features each day. Training data is passed to the model, which is also fit via Spark where grid searching is executed in parallel, model performance metrics are logged, and the model is saved to S3.

Once the new model is trained we pass it the most recent set of features to make predictions, which we distribute and parallelize using Spark. At this point we have a table containing the propensity that a user will engage with an offer, for every eligible user:offer combination. We then take the top n offers for each user and ship that data to a production data store that feeds the app, so that when any user opens the app a lookup is made to surface their recommended offers in what we call the For You gallery.
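The final "top n offers per user" step can be sketched on plain Python data as below (in production this runs in Spark); the ids and scores are synthetic.

```python
import heapq
from collections import defaultdict

def top_n_offers(predictions, n=2):
    """predictions: iterable of (user_id, offer_id, propensity) rows.

    Returns {user_id: [offer_id, ...]} holding each user's n
    highest-propensity offers, best first.
    """
    by_user = defaultdict(list)
    for user_id, offer_id, score in predictions:
        by_user[user_id].append((score, offer_id))
    return {
        user: [offer for _, offer in heapq.nlargest(n, pairs)]
        for user, pairs in by_user.items()
    }

preds = [("u1", "o1", 0.9), ("u1", "o2", 0.4), ("u1", "o3", 0.7),
         ("u2", "o1", 0.2), ("u2", "o2", 0.8)]
recs = top_n_offers(preds, n=2)  # {"u1": ["o1", "o3"], "u2": ["o2", "o1"]}
```

The resulting per-user lists are what gets written to the production data store.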

Our latest Recommender had a significant impact on the business. It's an example of how machine learning is not only interesting, but highly practical. It has helped move key metrics for our business, which means our users are having more efficient and personalized experiences within the app.

We’re Hiring

IbottaML is hiring Machine Learning Engineers, so if you're interested in working on challenging machine learning problems like the ones described in this article, give us a shout. You can find Ibotta's careers page here.
