Improving Ad Targeting with Starbucks
Discovering how Starbucks mobile app users respond to offers
I. Introduction
1. Background
As consumers, we are constantly bombarded with advertisements on every channel of our digital lives from email to social media. As you’ve probably noticed, a lot of these offers are highly relevant to us, while others are a complete miss. Companies spend millions of dollars on advertising every year, and these misses are not exactly an efficient use of that money.
This is a customer segmentation problem for ad targeting. Since customers respond differently to different types of offers, there is no one offer type that is universally best for all customers.
In this article, we will analyze mobile app data from Starbucks in order to discover how users respond to different offers and segment them accordingly. With a better understanding of customer behavior, Starbucks would be able to improve how advertisements are sent to customers.
2. Overview
The purpose here is to improve ad targeting, so I am proposing a two-part solution:
- Segment customers based on demographics and behavior on the mobile app — these segments will give us a visual understanding of the customers who are or are not responding to offers
- Build a classifier that predicts whether or not a user will respond to a certain offer — this classifier will help us determine whether an offer should be sent to a particular user
You can follow along with the Jupyter notebooks in my GitHub repository. With the solution in mind, this will be the sequence of our workflow:
- Exploratory data analysis — preprocess the raw data and explore the different offers and users in the data
- Customer segmentation — using quantile analysis and k-means clustering
- Predictive modeling — with logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), decision tree, random forest, and light gradient boosting machine (LightGBM)
The classifier is binary: it predicts whether a user will complete a certain offer. The models are evaluated on prediction accuracy and the F1 score, and our final classifier will be the model with the best performance.
3. Data
The dataset we are looking at contains 1 month of simulated data that mimics customer behavior on the Starbucks rewards mobile app. This is a simplified version of the real Starbucks app because the underlying simulator only has 1 product whereas Starbucks actually sells dozens of products.
Every few days, Starbucks sends out an offer to mobile app users. Some users might not receive any offers during certain weeks and not all users receive the same offer. There are 3 types of offers:
- BOGO (buy one get one free) — spend amount `A` in ONE purchase before the offer expires to get a reward `R` of equal value to `A`
- Discount — spend amount `A` in ONE OR MORE purchases before the offer expires to get a discount `D` of equal or lesser value to `A` (all purchases within the validity period accumulate toward the required amount `A`)
- Informational — only provides information about a product
For discount and BOGO offers, the required spending amount, reward, and validity period all vary. As for informational offers, there is no required spending amount and no reward, but there is still a validity period. In these cases, it is assumed that the customer is feeling the influence of the offer during this period.
Customers do not opt into the offers they receive. In other words, a user can receive an offer, never actually view the offer, and still complete it. While these offers were recorded as completed, they really had no influence on the customer because they were not viewed.
There are 3 associated datasets:
1. Portfolio (10 offers x 6 fields) — metadata for each offer
- `id` — offer ID
- `offer_type` — BOGO, discount, or informational
- `difficulty` — required spending amount to complete the offer
- `reward` — reward for completing the offer
- `duration` — validity period in days (the offer expires after this period)
- `channels` — web, email, mobile, social
2. Profile (17,000 users x 5 fields) — demographic data for each user
- `age` — missing values were encoded as 118
- `became_member_on` — date on which the customer created an account
- `gender` — "M" for male, "F" for female, and "O" for other
- `id` — customer ID
- `income` — annual income of the customer
3. Transcript (306,534 events x 4 fields) — records of events that occurred during the month
- `event` — transaction, offer received, offer viewed, or offer completed
- `person` — customer ID
- `time` — number of hours since the start of the test (begins at time t=0)
- `value` — details of the event (offer metadata for offer-related events and amount for transactions)
Now that you understand the problem at hand and the data we’ll be using, let’s begin!
II. Exploratory Data Analysis
1. Data preprocessing
Before performing any analysis, we must ensure that the data is in a usable format. First off, here’s what the raw data looks like.
There are a couple of things here that need to be addressed:
- 2,175 users were missing all demographic information in the `profile` set. Since demographics are a large part of what we want to analyze, these users and all of their events were dropped.
- There were 374 duplicated events in the `transcript` set, all of which were dropped as well.
- The `channels` feature of the `portfolio` set (figure 2A) was expanded into 3 binary features that indicate whether the offer was sent via that channel — `web`, `mobile`, and `social`.
- The `value` feature of the `transcript` set (figure 1B) was also expanded into 3 additional features — `offer_id` and `reward` for offer-related events and `amount` for transaction events.
- Offer IDs and customer IDs were stored as long, meaningless hash strings, so let's make our lives easier by mapping them to integers:
  - Customer IDs were mapped to integers in the order they appeared
  - Offers were mapped to integers in order of overall difficulty (figure 2B)
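As a sketch of the last two steps, here is roughly how the `channels` expansion and the ID mapping could look in pandas. The rows and hash strings below are made up, and offers are mapped in order of appearance rather than by difficulty, purely for illustration:

```python
import pandas as pd

# Hypothetical mini-portfolio; the real set stores channels as a list per offer
portfolio = pd.DataFrame({
    "id": ["ae264e...", "4d5c57..."],  # made-up hash strings
    "channels": [["web", "email", "mobile"],
                 ["web", "email", "mobile", "social"]],
})

# Expand channels into 3 binary features (email is implied for every offer)
for ch in ["web", "mobile", "social"]:
    portfolio[ch] = portfolio["channels"].apply(lambda lst, c=ch: int(c in lst))

# Map the hash-string IDs to small integers
id_map = {h: i + 1 for i, h in enumerate(portfolio["id"])}
portfolio["id"] = portfolio["id"].map(id_map)
```

The same mapping dictionary can then be applied to the `transcript` set so that both tables refer to offers by the same integer IDs.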
2. Exploring offers
A. How many offers were viewed and/or completed?
A little over 11,000 offers were sent out on each of days 1, 8, 15, 18, 22, and 25, for a total of 66,501 offers. There is a relatively balanced number of each of the 10 different offers, ranging from 6,576 to 6,726 each.
When a customer receives an offer, there are 4 possibilities, which we will use to group the offers (figure 3):
- Group 1 (9,296 offers) — offers that were neither viewed nor completed
- Group 2 (17,866 offers) — offers that were viewed, but not completed
- Group 3 (11,497 offers) — offers that were completed, and then viewed afterward or not viewed at all (either way, the customer was not aware of the offer and made the purchase(s) anyway)
- Group 4 (27,842 offers) — offers that were viewed, and then completed
Out of the 66k offers that were sent out, we can see that almost 40k were completed (figure 3), which is not bad. But we also see that 11k of those completed offers (group 3) were wasted on customers who didn’t even know there was an offer and spent the money regardless.
B. How many of each offer was completed?
Putting aside group 3 offers for now, let’s break down the offer completions. Discount offers had both the highest and lowest completion rates (figure 4).
It doesn't seem like the `reward` is as important a factor as the `difficulty` or the `duration` in predicting whether a customer will complete an offer.
You may have also noticed that the top 2 discount offers both had as many days in `duration` as dollars in `difficulty`, while the bottom 2 had fewer days. You might say that it's fair to give customers 1 day for every dollar required to complete a discount offer.
3. Exploring users
A. User demographics of completed versus incomplete offers
There are a couple of differences here (Figure 5):
- There are twice as many male customers as female with incomplete offers, while they are almost equal with completed offers. Are female customers more likely to complete offers?
- There is a higher number of users under the `age` of 40 with incomplete offers than with completed offers. Are younger customers less likely to complete offers?
- With incomplete offers, there is a higher number of low earners and a lower number of high earners, which makes sense as `income` is positively related to spending.
- With completed offers, there was a steep drop in signups from 2017 to 2018, but this wasn't the case with incomplete offers. It could be that 2018 users are newer to the app, so they are less inclined to spend money in unfamiliar territory.
B. Patterns in user spending
`Age` and `income` were both grouped into 5 quantiles in order to visualize their relationship to spending habits. Quantile 1 is the lowest and 5 is the highest.
Spending increases with increasing `income`, but what's interesting is that spending only increases up to `age` group 3 (about age 50) and then remains the same for groups 4 and 5 (figure 6).
Another thing is that female customers are spending more money than male customers in every age group and in 4 out of 5 income groups. This might explain why we saw a greater proportion of female users with completed offers than incomplete offers (figure 5).
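The quantile grouping behind figure 6 can be sketched with `pd.qcut`. The `users` frame below is randomly generated stand-in data, not the actual Starbucks profiles:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Randomly generated stand-in for the joined profile + spending table
users = pd.DataFrame({
    "age": rng.integers(18, 90, 500),
    "income": rng.integers(30_000, 120_000, 500),
    "spending": rng.gamma(2.0, 50.0, 500),
})

# 5 quantile groups each for age and income (1 = lowest, 5 = highest);
# ranking first breaks ties so qcut always finds unique bin edges
for col in ["age", "income"]:
    users[f"{col}_group"] = pd.qcut(users[col].rank(method="first"), 5, labels=range(1, 6))

# Average spending per income group, mirroring the bars in figure 6
avg_by_income = users.groupby("income_group", observed=True)["spending"].mean()
```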
III. Customer Segmentation
Using 2 different methods, we will now attempt to segment customers to help us better understand how customers respond to offers.
1. Quantile analysis — frequency, monetary value, tenure (FMT)
This is a take on the popular RFM (`recency`, `frequency`, `monetary` value) analysis. But since we are looking at only 1 month of data, `recency` is of no use, so we will look at `tenure` instead. For each customer, we will calculate:
- `Frequency` — how often the user made transactions
- `Monetary` value — how much money the user spent
- `Tenure` — how long the user has been using the app
`Frequency` was grouped into 6 quantiles, `monetary` value into 8 quantiles, and `tenure` into 3 quantiles (snippet 1). This is similar to assigning feature weights. The weights were assigned arbitrarily, but the idea is that `monetary` value is the most important, followed by `frequency` and then `tenure`.
With a quantile “score” for each of these 3 features, adding up all 3 would yield the customer’s total score. Customers were then divided into 3 segments based on this total score (figure 7):
1. `Bronze` tier — total score between 3 and 7
2. `Silver` tier — total score between 8 and 12
3. `Gold` tier — total score between 13 and 17
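A minimal sketch of the FMT scoring, assuming a per-user table with the three behavioral features (the data below is randomly generated and the column names are illustrative, not the ones from snippet 1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Randomly generated stand-in for the per-user behavior table
fmt = pd.DataFrame({
    "frequency": rng.poisson(8, 1000),       # number of transactions
    "monetary": rng.gamma(2.0, 60.0, 1000),  # total amount spent
    "tenure": rng.integers(30, 2000, 1000),  # days since signup
})

# Quantile scores: 6 bins for frequency, 8 for monetary value, 3 for tenure
# (ranking first breaks ties among the discrete counts)
fmt["f_score"] = pd.qcut(fmt["frequency"].rank(method="first"), 6, labels=range(1, 7)).astype(int)
fmt["m_score"] = pd.qcut(fmt["monetary"], 8, labels=range(1, 9)).astype(int)
fmt["t_score"] = pd.qcut(fmt["tenure"].rank(method="first"), 3, labels=range(1, 4)).astype(int)

# Total score runs from 3 to 17; cut it into the three tiers
fmt["total"] = fmt[["f_score", "m_score", "t_score"]].sum(axis=1)
fmt["tier"] = pd.cut(fmt["total"], bins=[2, 7, 12, 17], labels=["bronze", "silver", "gold"])
```

Giving `monetary` value 8 bins and `tenure` only 3 means a big spender can move the total score far more than a long-tenured user, which is how the implicit weighting works.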
As expected, `bronze` customers do not complete many offers and `gold` customers complete the most offers (figure 8). Even with the easiest offers (discount offers 3 and 4), `bronze` tier's completion rate was lower than 40%, while `gold` tier's rate was as high as 97%!
As we go from `bronze` to `silver` to `gold`, we see a greater proportion of female users, a decrease in the number of users under the `age` of 40, a decrease in low earners, and an increase in high earners.
2. K-means clustering
In this section, we'll be segmenting customers using k-means clustering on users' `gender`, `age`, `income`, `frequency`, `monetary` value, and `tenure`. The previous method only accounted for user behavior on the app, but now we're throwing demographics into the mix.
I simplified `gender` into a binary feature, `male`, that indicates whether the user is male. In other words, female and other-gendered users were grouped together, since male was the majority gender in the data.
As this is still technically a categorical feature, we’ll be using PCA (principal component analysis) to create continuous components that capture the variance in the data. This does make the analysis less interpretable, but we can get an idea of how each component was constructed by looking at the feature coefficients (figure 10A).
Although dimensionality reduction was not the intention, I only created 5 components, which still explain almost 95% of the variation in the 6 original features. These components were then used to create 4 clusters (figure 10B). As mentioned earlier, PCA does make interpreting the clusters more convoluted so let’s examine the 6 original features of each cluster.
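The scaling, PCA, and clustering steps can be sketched with scikit-learn. The feature matrix below is random stand-in data, so its explained-variance ratio will not match the ~95% seen on the real features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Random stand-in for the 6 features: male, age, income, frequency, monetary, tenure
X = rng.normal(size=(1000, 6))

# Scale the features, then project onto 5 principal components
pipe = make_pipeline(StandardScaler(), PCA(n_components=5, random_state=0))
components = pipe.fit_transform(X)
explained = pipe.named_steps["pca"].explained_variance_ratio_.sum()

# Fit 4 clusters on the components
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)
```

The feature coefficients mentioned above (figure 10A) correspond to `pipe.named_steps["pca"].components_`, which shows how much each original feature contributes to each component.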
Average values (figure 11) may not be representative of an entire cluster, but we can use them to give each cluster a general description:
- `Cluster 1` customers are entirely `male` with below-average `age` and `income`. They are among the newest users and spend the least money.
- `Cluster 2` customers are almost all female with a few other-gendered (no `male`). Both their `age` and `income` are above average, but they also spend relatively little.
- `Cluster 3` customers are largely `male` and are the youngest group, but have been using the app the longest. Here's where it gets interesting: they have the LOWEST `income`, yet the HIGHEST spending `frequency` and average spending `amount`, which means they are probably making frequent but small purchases.
- `Cluster 4` customers are the biggest spenders by a huge margin (in `amount`, not `frequency`). There are more female customers than `male` ones in this segment. Their `age` and `income` are both well above average.
Let’s look at how each of these clusters responds to offers.
Since the clusters were assigned their numbers in ascending order of their average spending, we can infer that a customer’s spending habit is one of the most important factors in our prediction goal.
3. Recommendations
Since segments from the two segmentation methods share similar offer-completion profiles, we can make recommendations for each matching pair:
- `Bronze` tier and cluster 1
- `Silver` tier and clusters 2/3
- `Gold` tier and cluster 4
As `bronze` users do not spend a lot of money, they're not very likely to respond to offers, so it would be a good idea to either stop sending them offers or to only send them offers that are easy to complete.
`Silver` users spend quite a bit more than `bronze` users, so it's actually worth sending them offers. They completed a good portion of discount offers 3 and 4, so focusing on the easier offers or lowering the difficulty of the harder ones would likely increase their completion rate.
`Gold` users consistently have a high rate of offer completion, so it would actually benefit Starbucks to increase the difficulty of offers sent to these users. As they are highly likely to respond, a higher difficulty would likely increase the amount these customers spend.
Again, the recommendations are for the FMT tier segments, but the same would apply to their cluster counterparts.
IV. Predictive Modeling
Now, we will build a classifier that predicts whether a user will respond to an offer. This classifier will help us decide whether we should send users a particular offer.
The input of this classifier includes an offer’s metadata and a user’s demographic and behavioral features. It will produce a binary output that predicts if the user will complete the offer.
1. Preprocessing
Before creating any models, there is still a little bit of preprocessing to do. We begin by extracting the data that will go into the model: the 66,501 received offers from the `transcript` set.
Next, we use one-hot encoding to expand the categorical feature `offer_type` into 3 binary numeric features — `info_offer`, `disc_offer`, and `bogo_offer`. We then create the binary target label as defined earlier: `1` if the offer was viewed and then completed, `0` otherwise.
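A toy sketch of the encoding and labeling steps in pandas. The rows are made up, and the real label additionally requires the view to occur before the completion, which this AND-only check simplifies:

```python
import pandas as pd

# Made-up slice of the received offers
offers = pd.DataFrame({
    "offer_type": ["bogo", "discount", "informational", "bogo"],
    "viewed": [1, 1, 0, 0],
    "completed": [1, 0, 0, 1],
})

# One-hot encode offer_type into 3 binary numeric features
onehot = pd.get_dummies(offers["offer_type"]).astype(int)
offers["bogo_offer"] = onehot["bogo"]
offers["disc_offer"] = onehot["discount"]
offers["info_offer"] = onehot["informational"]

# Target label: 1 only if the offer was both viewed and completed
offers["label"] = ((offers["viewed"] == 1) & (offers["completed"] == 1)).astype(int)
```

Note that the last row (completed but never viewed) gets label `0` — that's a group 3 offer, which we do not want the classifier to count as a response.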
There are 15 features in our final feature set: `reward`, `difficulty`, `duration`, `mobile`, `social`, `web`, `age`, `income`, `frequency`, `monetary`, `tenure`, `info_offer`, `disc_offer`, `bogo_offer`, and `male`.
As seen in the correlation heatmap (figure 14), there is high multicollinearity among the following features: `duration`, `mobile`, `info_offer`, `disc_offer`, and `bogo_offer`.
Since linear models assume the absence of multicollinearity, we will drop these 5 features from the feature set when running logistic regression.
We finish up by doing a 60–20–20 data split for the training, validation, and test sets, and normalizing the features in all 3 sets. And now we are ready to begin building models!
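The split and normalization can be sketched as two chained `train_test_split` calls plus a `StandardScaler` fitted on the training set only (the data below is a random stand-in for the 15-feature set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 15))  # random stand-in for the 15 features
y = rng.integers(0, 2, 1000)     # random stand-in binary labels

# 60-20-20: carve off 20% for test, then 25% of the remainder for validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Fit the scaler on the training set only, then transform all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```

Fitting the scaler on the training split alone keeps information from the validation and test sets from leaking into training.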
2. Modeling
To give an overview, I trained 6 different machine learning models — logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), decision tree, random forest, and LightGBM.
The hyperparameters of the first 5 models were tuned using a grid search. Since LightGBM has a lot more hyperparameters to tune, performing a grid search is not feasible so I opted for Optuna instead, which uses Bayesian optimization to tune hyperparameters.
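As a sketch of the grid-search step, here is how one of the first five models could be tuned with scikit-learn's `GridSearchCV` on toy data. The grid values are illustrative, not the ones actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data in place of the real feature set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Exhaustively try every value in the (deliberately tiny) grid,
# scoring each candidate by cross-validated F1
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

Optuna replaces this exhaustive enumeration with Bayesian optimization, which is why it scales to LightGBM's much larger hyperparameter space.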
The details of the machine learning algorithms and the hyperparameter tuning process are beyond the scope of this article, but if you are interested, I discuss them more in the last notebook in my GitHub repository.
Logistic regression, k-nearest neighbors, and support vector machine
Logistic regression will be our baseline model for prediction performance. After the grid search, we get a logistic regression model that made predictions on the validation set with an F1 score of 0.70 (figure 15).
If we inspect the coefficients, `monetary` value, spending `frequency`, and `social` (which indicates whether the offer was sent through social media) had the greatest predictive power in logistic regression. We have not looked into how channels play a role in this, but perhaps customers respond more to offers sent via social media.
With the baseline established, let’s see if we can do better with k-nearest neighbors (KNN) and support vector machine (SVM).
Both KNN and SVM did a few percentage points better than logistic regression (Figure 16). But we can see that there is an imbalance in the predictions made by SVM. Let’s take a closer look at these predictions.
False positives outnumber false negatives at a ratio of about 3 to 2 (figure 17).
The bias towards the positive class is tolerable in this case because (1) it’s not a drastic difference and (2) we would rather send offers to non-responsive customers than to miss out on sending offers to responsive customers.
Tree-based classifiers
Next, we will be looking at the results for decision tree and LightGBM.
LightGBM was the only one that used Optuna to tune its hyperparameters, but surprisingly, it actually did worse than decision tree, despite the Bayesian optimization in the hyperparameter tuning process and the gradient boosting optimization in the algorithm itself.
And unfortunately, neither decision tree nor LightGBM was able to do better than SVM (figure 18). So let’s talk about what did have better predictions.
3. Final classifier
Our final classifier is…RANDOM FOREST. The final parameters (snippet 3) are the result of a lot of trial and error with grid search.
The random forest classifier made predictions on the validation set with an F1 score of 0.74 and on the test set with an F1 score of 0.75 (figure 19A).
Random forest also had a similar imbalance in predicted classes as SVM (figure 19B), but not as pronounced. As mentioned earlier, a small bias towards the positive class is not a bad thing in this case.
Looking at the feature importances, we find the same top 3 features — `monetary`, `frequency`, and `social` — as in logistic regression. So we can conclude that a customer's spending habits and the channel through which the offer was sent are the most important factors in predicting whether a customer will complete an offer.
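To sketch the final model's training and evaluation, here is a random-forest fit on synthetic data. The hyperparameters are illustrative, not the tuned values from snippet 3:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 15-feature offer/user data
X, y = make_classification(n_samples=2000, n_features=15, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters, not the tuned values from snippet 3
rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
rf.fit(X_train, y_train)
score = f1_score(y_test, rf.predict(X_test))

# Rank features by importance (indices here are synthetic columns)
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
```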
V. Conclusion
1. Summary
In this article, we walked through an in-depth analysis of simulated mobile app data from Starbucks in order to discover how customers respond to different offers.
Our goals were to (1) segment customers based on demographics and behavior and (2) build a classifier that predicts whether a user will respond to an offer. We have successfully done both and in the process, we uncovered a few key insights about the data:
- Offers sent via social media get a better response from customers
- The longer users have been using the app, the more comfortable they are spending money on it
- Female customers tend to spend more money than male customers within the same `age` or `income` group
- Younger customers tend to make frequent but small transactions
- As `age` increases, both average `income` and average spending follow a very similar pattern for both `genders`: an increase up to about `age` 50 and then no change for the upper age groups
- As `income` increases, spending increases
- The more customers spend, the more likely they are to respond to offers
From the segmentation analysis, we found that there are generally 3 classes of customers — the lower, middle, and upper segments, which I referred to as `bronze`, `silver`, and `gold` customers respectively. To reiterate my recommendations for improving ad targeting:
- `Bronze` users are not very likely to respond to offers, so it would be a good idea to stop sending them offers or to only send them offers that are easy to complete.
- `Silver` users completed a lot of the easier offers, so focusing on the easier offers or lowering the difficulty of the harder ones would likely improve their response.
- `Gold` users consistently have a high rate of offer completion, so it would actually benefit Starbucks to increase the difficulty of offers sent to these users.
2. Extending the project
Although this was a very in-depth analysis, there is still a lot of room for improvement. If you’d like to dig further into the data, here are some ideas:
- Explore channels — see how different channels affect the way customers respond to offers and the demographics of customers who do respond on certain channels.
- Analyze attempted offers — there are a lot of cases where the customer viewed the offer after receiving it and made purchases, but ultimately failed to complete the offer. These cases may give deeper insight into how offers can be retargeted or otherwise restructured.
- Train different classifiers — XGBoost or neural networks may give better results.
You can find the full code in my GitHub repository. I hope this has been an insightful read!