Improving Ad Targeting with Starbucks
Discovering how Starbucks mobile app users respond to offers
I. Introduction
1. Background
As consumers, we are constantly bombarded with advertisements on every channel of our digital lives from email to social media. As you’ve probably noticed, a lot of these offers are highly relevant to us, while others are a complete miss. Companies spend millions of dollars on advertising every year, and these misses are not exactly an efficient use of that money.
This is a customer segmentation problem for ad targeting. Since customers respond differently to different types of offers, there is no one offer type that is universally best for all customers.
In this article, we will analyze mobile app data from Starbucks in order to discover how users respond to different offers and segment them accordingly. With a better understanding of customer behavior, Starbucks would be able to improve how advertisements are sent to customers.
2. Overview
The purpose here is to improve ad targeting, so I am proposing a two-part solution:
- Segment customers based on demographics and behavior on the mobile app — these segments will give us a visual understanding of the customers who are or are not responding to offers
- Build a classifier that predicts whether or not a user will respond to a certain offer — this classifier will help us determine whether an offer should be sent to a particular user
You can follow along with the Jupyter notebooks in my GitHub repository. With the solution in mind, this will be the sequence of our workflow:
- Exploratory data analysis — preprocess the raw data and explore the different offers and users in the data
- Customer segmentation — using quantile analysis and k-means clustering
- Predictive modeling — with logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), decision tree, random forest, and light gradient boosting machine (LightGBM)
The classifier is binary: it predicts whether a user will complete a certain offer. The models are evaluated on prediction accuracy and the F1 score, and our final classifier will be the model with the best performance.
3. Data
The dataset we are looking at contains 1 month of simulated data that mimics customer behavior on the Starbucks rewards mobile app. This is a simplified version of the real Starbucks app because the underlying simulator only has 1 product whereas Starbucks actually sells dozens of products.
Every few days, Starbucks sends out an offer to mobile app users. Some users might not receive any offers during certain weeks and not all users receive the same offer. There are 3 types of offers:
- BOGO (buy one get one free) — spend amount `A` in ONE purchase before the offer expires to get a reward `R` of equal value to `A`
- Discount — spend amount `A` in ONE OR MORE purchases before the offer expires to get a discount `D` of equal or lesser value to `A` (all purchases within the validity period accumulate toward the required amount `A`)
- Informational — only provides information about a product
For discount and BOGO offers, the required spending amount, reward, and validity period all vary. As for informational offers, there is no required spending amount and no reward, but there is still a validity period. In these cases, it is assumed that the customer is feeling the influence of the offer during this period.
Customers do not opt into the offers they receive. In other words, a user can receive an offer, never actually view the offer, and still complete it. While these offers were recorded as completed, they really had no influence on the customer because they were not viewed.
There are 3 associated datasets:
1. Portfolio (10 offers x 6 fields) — metadata for each offer
- `id` — offer ID
- `offer_type` — BOGO, discount, or informational
- `difficulty` — required spending amount to complete the offer
- `reward` — reward for completing the offer
- `duration` — validity period in days (the offer expires after this period)
- `channels` — web, email, mobile, social
2. Profile (17,000 users x 5 fields) — demographic data for each user
- `age` — missing values were encoded as 118
- `became_member_on` — date on which the customer created an account
- `gender` — "M" for male, "F" for female, and "O" for other
- `id` — customer ID
- `income` — annual income of the customer
3. Transcript (306,534 events x 4 fields) — records of events that occurred during the month
- `event` — transaction, offer received, offer viewed, or offer completed
- `person` — customer ID
- `time` — number of hours since the start of the test (begins at time t=0)
- `value` — details of the event (offer metadata for offer-related events and amount for transactions)
Now that you understand the problem at hand and the data we’ll be using, let’s begin!
II. Exploratory Data Analysis
1. Data preprocessing
Before performing any analysis, we must ensure that the data is in a usable format. First off, here’s what the raw data looks like.
There are a couple of things here that need to be addressed:
- 2,175 users were missing all demographic information in the `profile` set. Since demographics are a large part of what we want to analyze, these users and all of their events were dropped.
- There were 374 duplicated events in the `transcript` set, all of which were dropped as well.
- The `channels` feature of the `portfolio` set (figure 2A) was expanded into 3 binary features that indicate whether the offer was sent via that channel — `web`, `mobile`, and `social`.
- The `value` feature of the `transcript` set (figure 1B) was also expanded into 3 additional features — `offer_id` and `reward` for offer-related events and `amount` for transaction events.
- Offer IDs and customer IDs were stored as long, meaningless hash strings, so let's make our lives easier by mapping them to integers:
  - Customer IDs were mapped to integers in the order they appeared
  - Offers were mapped to integers in order of overall difficulty (figure 2B)
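As a sketch of the last two steps, here is roughly how the `channels` expansion and the ID mapping could look in pandas. The rows and hash strings below are made up, and offers are mapped in order of appearance rather than by difficulty, purely for illustration:

```python
import pandas as pd

# Hypothetical mini-portfolio; the real set stores channels as a list per offer
portfolio = pd.DataFrame({
    "id": ["ae264e...", "4d5c57..."],  # made-up hash strings
    "channels": [["web", "email", "mobile"],
                 ["web", "email", "mobile", "social"]],
})

# Expand channels into 3 binary features (email is implied for every offer)
for ch in ["web", "mobile", "social"]:
    portfolio[ch] = portfolio["channels"].apply(lambda lst, c=ch: int(c in lst))

# Map the hash-string IDs to small integers
id_map = {h: i + 1 for i, h in enumerate(portfolio["id"])}
portfolio["id"] = portfolio["id"].map(id_map)
```

The same mapping dictionary can then be applied to the `transcript` set so that both tables refer to offers by the same integer IDs.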
2. Exploring offers
A. How many offers were viewed and/or completed?
A little over 11,000 offers were sent out on each of days 1, 8, 15, 18, 22, and 25, for a total of 66,501 offers. There is a relatively balanced number of each of the 10 different offers, ranging from 6,576 to 6,726 each.
When a customer receives an offer, there are 4 possibilities, which we will use to group the offers (figure 3):
- Group 1 (9,296 offers) — offers that were neither viewed nor completed
- Group 2 (17,866 offers) — offers that were viewed, but not completed
- Group 3 (11,497 offers) — offers that were completed, and then viewed afterward or not viewed at all (either way, the customer was not aware of the offer and made the purchase(s) anyway)
- Group 4 (27,842 offers) — offers that were viewed, and then completed
Out of the 66k offers that were sent out, we can see that almost 40k were completed (figure 3), which is not bad. But we also see that 11k of those completed offers (group 3) were wasted on customers who didn’t even know there was an offer and spent the money regardless.
B. How many of each offer was completed?
Putting aside group 3 offers for now, let’s break down the offer completions. Discount offers had both the highest and lowest completion rates (figure 4).
It doesn't seem like the `reward` is as important a factor as the `difficulty` or the `duration` in predicting whether a customer will complete an offer.
You may have also noticed that the top 2 discount offers both had as many days in `duration` as dollars in `difficulty`, while the bottom 2 had fewer days. You might say that it's fair to give customers 1 day for every dollar required to complete a discount offer.
3. Exploring users
A. User demographics of completed versus incomplete offers
There are a couple of differences here (Figure 5):
- There are twice as many male customers as female with incomplete offers, while they are almost equal with completed offers. Are female customers more likely to complete offers?
- There is a higher number of users under the `age` of 40 with incomplete offers than with completed offers. Are younger customers less likely to complete offers?
- With incomplete offers, there is a higher number of low earners and a lower number of high earners, which makes sense as `income` is positively related to spending.
- With completed offers, there was a steep drop in signups from 2017 to 2018, but this wasn't the case with incomplete offers. It could be that 2018 users are newer to the app, so they are less inclined to spend money in unfamiliar territory.
B. Patterns in user spending
`Age` and `income` were both grouped into 5 quantiles in order to visualize their relationship to spending habits. Quantile 1 is the lowest and 5 is the highest.
Spending increases with increasing `income`, but what's interesting is that spending only increases up to `age` group 3 (about age 50) and then remains the same for groups 4 and 5 (figure 6).
Another thing is that female customers are spending more money than male customers in every age group and in 4 out of 5 income groups. This might explain why we saw a greater proportion of female users with completed offers than incomplete offers (figure 5).
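The quantile grouping behind figure 6 can be sketched with `pd.qcut`. The `users` frame below is randomly generated stand-in data, not the actual Starbucks profiles:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Randomly generated stand-in for the joined profile + spending table
users = pd.DataFrame({
    "age": rng.integers(18, 90, 500),
    "income": rng.integers(30_000, 120_000, 500),
    "spending": rng.gamma(2.0, 50.0, 500),
})

# 5 quantile groups each for age and income (1 = lowest, 5 = highest);
# ranking first breaks ties so qcut always finds unique bin edges
for col in ["age", "income"]:
    users[f"{col}_group"] = pd.qcut(users[col].rank(method="first"), 5, labels=range(1, 6))

# Average spending per income group, mirroring the bars in figure 6
avg_by_income = users.groupby("income_group", observed=True)["spending"].mean()
```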
III. Customer Segmentation
Using 2 different methods, we will now attempt to segment customers to help us better understand how customers respond to offers.
1. Quantile analysis — frequency, monetary value, tenure (FMT)
This is a take on the popular RFM (`recency`, `frequency`, `monetary` value) analysis. But since we are looking at only 1 month of data, `recency` is of no use, so we will look at `tenure` instead. For each customer, we will calculate:
- `Frequency` — how often the user made transactions
- `Monetary` value — how much money the user spent
- `Tenure` — how long the user has been using the app
`Frequency` was grouped into 6 quantiles, `monetary` value into 8 quantiles, and `tenure` into 3 quantiles (snippet 1). This is similar to assigning feature weights. The weights were assigned arbitrarily, but the idea is that `monetary` value is the most important, followed by `frequency` and then `tenure`.
With a quantile “score” for each of these 3 features, adding up all 3 would yield the customer’s total score. Customers were then divided into 3 segments based on this total score (figure 7):
1. `Bronze` tier — total score between 3 and 7
2. `Silver` tier — total score between 8 and 12
3. `Gold` tier — total score between 13 and 17
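A minimal sketch of the FMT scoring, assuming a per-user table with the three behavioral features (the data below is randomly generated and the column names are illustrative, not the ones from snippet 1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Randomly generated stand-in for the per-user behavior table
fmt = pd.DataFrame({
    "frequency": rng.poisson(8, 1000),       # number of transactions
    "monetary": rng.gamma(2.0, 60.0, 1000),  # total amount spent
    "tenure": rng.integers(30, 2000, 1000),  # days since signup
})

# Quantile scores: 6 bins for frequency, 8 for monetary value, 3 for tenure
# (ranking first breaks ties among the discrete counts)
fmt["f_score"] = pd.qcut(fmt["frequency"].rank(method="first"), 6, labels=range(1, 7)).astype(int)
fmt["m_score"] = pd.qcut(fmt["monetary"], 8, labels=range(1, 9)).astype(int)
fmt["t_score"] = pd.qcut(fmt["tenure"].rank(method="first"), 3, labels=range(1, 4)).astype(int)

# Total score runs from 3 to 17; cut it into the three tiers
fmt["total"] = fmt[["f_score", "m_score", "t_score"]].sum(axis=1)
fmt["tier"] = pd.cut(fmt["total"], bins=[2, 7, 12, 17], labels=["bronze", "silver", "gold"])
```

Giving `monetary` value 8 bins and `tenure` only 3 means a big spender can move the total score far more than a long-tenured user, which is how the implicit weighting works.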
As expected, `bronze` customers do not complete many offers and `gold` customers complete the most offers (figure 8). Even with the easiest offers (discount offers 3 and 4), `bronze` tier's completion rate was lower than 40%, while `gold` tier's rate was as high as 97%!
As we go from `bronze` to `silver` to `gold`, we see a greater proportion of female users, a decrease in the number of users under the `age` of 40, a decrease in low earners, and an increase in high earners.
2. K-means clustering
In this section, we'll be segmenting customers using k-means clustering on users' `gender`, `age`, `income`, `frequency`, `monetary` value, and `tenure`. The previous method only accounted for user behavior on the app, but now we're throwing demographics into the mix.
I simplified `gender` into a binary feature, `male`, that indicates whether the user is male. In other words, female and other-gendered users were grouped together, since male was the majority gender in the data.
As this is still technically a categorical feature, we’ll be using PCA (principal component analysis) to create continuous components that capture the variance in the data. This does make the analysis less interpretable, but we can get an idea of how each component was constructed by looking at the feature coefficients (figure 10A).
Although dimensionality reduction was not the intention, I only created 5 components, which still explain almost 95% of the variation in the 6 original features. These components were then used to create 4 clusters (figure 10B). As mentioned earlier, PCA does make interpreting the clusters more convoluted so let’s examine the 6 original features of each cluster.
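The scaling, PCA, and clustering steps can be sketched with scikit-learn. The feature matrix below is random stand-in data, so its explained-variance ratio will not match the ~95% seen on the real features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Random stand-in for the 6 features: male, age, income, frequency, monetary, tenure
X = rng.normal(size=(1000, 6))

# Scale the features, then project onto 5 principal components
pipe = make_pipeline(StandardScaler(), PCA(n_components=5, random_state=0))
components = pipe.fit_transform(X)
explained = pipe.named_steps["pca"].explained_variance_ratio_.sum()

# Fit 4 clusters on the components
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)
```

The feature coefficients mentioned above (figure 10A) correspond to `pipe.named_steps["pca"].components_`, which shows how much each original feature contributes to each component.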
Average values (figure 11) may not be representative of an entire cluster, but we can use them to give each cluster a general description:
- `Cluster 1` customers are entirely `male` with below-average `age` and `income`. They are among the newest users and spend the least money.
- `Cluster 2` customers are almost all female with a few other-gendered (no `male`). Both their `age` and `income` are above average, but they also spend relatively little.
- `Cluster 3` customers are largely `male` and are the youngest group, but have been using the app the longest. Here's where it gets interesting: they have the LOWEST `income`, yet the HIGHEST spending `frequency` and average spending `amount`, which means they are probably making frequent but small purchases.
- `Cluster 4` customers are the biggest spenders by a huge margin (in `amount`, not `frequency`). There are more female customers than `male` ones in this segment. Their `age` and `income` are both well above average.
Let’s look at how each of these clusters responds to offers.
Since the clusters were assigned their numbers in ascending order of their average spending, we can infer that a customer’s spending habit is one of the most important factors in our prediction goal.
3. Recommendations
Since segments from the two segmentation methods share similar offer-completion profiles, we can make recommendations for each matching pair:
- `Bronze` tier and cluster 1
- `Silver` tier and clusters 2/3
- `Gold` tier and cluster 4
As `bronze` users do not spend a lot of money, they're not very likely to respond to offers, so it would be a good idea to either stop sending them offers or to only send them offers that are easy to complete.
`Silver` users spend quite a bit more than `bronze` users, so it's actually worth sending them offers. They completed a good portion of discount offers 3 and 4, so focusing on the easier offers or lowering the difficulty of the harder ones would likely increase their completion rate.
`Gold` users consistently have a high rate of offer completion, so it would actually benefit Starbucks to increase the difficulty of offers sent to these users. As they are highly likely to respond, a higher difficulty would likely increase the amount these customers spend.
Again, the recommendations are for the FMT tier segments, but the same would apply to their cluster counterparts.
IV. Predictive Modeling
Now, we will build a classifier that predicts whether a user will respond to an offer. This classifier will help us decide whether we should send users a particular offer.
The input of this classifier includes an offer’s metadata and a user’s demographic and behavioral features. It will produce a binary output that predicts if the user will complete the offer.
1. Preprocessing
Before creating any models, there is still a little bit of preprocessing to do. We begin by extracting the data that will go into the model: the 66,501 received offers from the `transcript` set.
Next, we use one-hot encoding to expand the categorical feature `offer_type` into 3 binary numeric features — `info_offer`, `disc_offer`, and `bogo_offer`. We then create the binary target label as defined earlier: `1` if the offer was viewed and then completed, `0` otherwise.
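A toy sketch of the encoding and labeling steps in pandas. The rows are made up, and the real label additionally requires the view to occur before the completion, which this AND-only check simplifies:

```python
import pandas as pd

# Made-up slice of the received offers
offers = pd.DataFrame({
    "offer_type": ["bogo", "discount", "informational", "bogo"],
    "viewed": [1, 1, 0, 0],
    "completed": [1, 0, 0, 1],
})

# One-hot encode offer_type into 3 binary numeric features
onehot = pd.get_dummies(offers["offer_type"]).astype(int)
offers["bogo_offer"] = onehot["bogo"]
offers["disc_offer"] = onehot["discount"]
offers["info_offer"] = onehot["informational"]

# Target label: 1 only if the offer was both viewed and completed
offers["label"] = ((offers["viewed"] == 1) & (offers["completed"] == 1)).astype(int)
```

Note that the last row (completed but never viewed) gets label `0` — that's a group 3 offer, which we do not want the classifier to count as a response.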
There are 15 features in our final feature set: `reward`, `difficulty`, `duration`, `mobile`, `social`, `web`, `age`, `income`, `frequency`, `monetary`, `tenure`, `info_offer`, `disc_offer`, `bogo_offer`, and `male`.
As seen in the correlation heatmap (figure 14), there is high multicollinearity among the following features: `duration`, `mobile`, `info_offer`, `disc_offer`, and `bogo_offer`.
Since linear models assume the absence of multicollinearity, we will drop these 5 features from the feature set when running logistic regression.
We finish up by doing a 60–20–20 data split for the training, validation, and test sets, and normalizing the features in all 3 sets. And now we are ready to begin building models!
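The split and normalization can be sketched as two chained `train_test_split` calls plus a `StandardScaler` fitted on the training set only (the data below is a random stand-in for the 15-feature set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 15))  # random stand-in for the 15 features
y = rng.integers(0, 2, 1000)     # random stand-in binary labels

# 60-20-20: carve off 20% for test, then 25% of the remainder for validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Fit the scaler on the training set only, then transform all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```

Fitting the scaler on the training split alone keeps information from the validation and test sets from leaking into training.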
2. Modeling
To give an overview, I trained 6 different machine learning models — logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), decision tree, random forest, and LightGBM.
The hyperparameters of the first 5 models were tuned using a grid search. Since LightGBM has a lot more hyperparameters to tune, performing a grid search is not feasible so I opted for Optuna instead, which uses Bayesian optimization to tune hyperparameters.
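As a sketch of the grid-search step, here is how one of the first five models could be tuned with scikit-learn's `GridSearchCV` on toy data. The grid values are illustrative, not the ones actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data in place of the real feature set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Exhaustively try every value in the (deliberately tiny) grid,
# scoring each candidate by cross-validated F1
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

Optuna replaces this exhaustive enumeration with Bayesian optimization, which is why it scales to LightGBM's much larger hyperparameter space.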
The details of the machine learning algorithms and the hyperparameter tuning process are beyond the scope of this article, but if you are interested, I discuss them more in the last notebook in my GitHub repository.
Logistic regression, k-nearest neighbors, and support vector machine
Logistic regression will be our baseline model for prediction performance. After the grid search, we get a logistic regression model that made predictions on the validation set with an F1 score of 0.70 (figure 15).
If we inspect the coefficients, `monetary` value, spending `frequency`, and `social` (which indicates whether the offer was sent through social media) had the greatest predictive power in logistic regression. We have not looked into how channels play a role in this, but perhaps customers respond more to offers sent via social media.
With the baseline established, let’s see if we can do better with k-nearest neighbors (KNN) and support vector machine (SVM).
Both KNN and SVM did a few percentage points better than logistic regression (Figure 16). But we can see that there is an imbalance in the predictions made by SVM. Let’s take a closer look at these predictions.
False positives outnumber false negatives at a ratio of about 3 to 2 (figure 17).
The bias towards the positive class is tolerable in this case because (1) it’s not a drastic difference and (2) we would rather send offers to non-responsive customers than to miss out on sending offers to responsive customers.
Tree-based classifiers
Next, we will be looking at the results for decision tree and LightGBM.
LightGBM was the only one that used Optuna to tune its hyperparameters, but surprisingly, it actually did worse than decision tree, despite the Bayesian optimization in the hyperparameter tuning process and the gradient boosting optimization in the algorithm itself.
And unfortunately, neither decision tree nor LightGBM was able to do better than SVM (figure 18). So let’s talk about what did have better predictions.
3. Final classifier
Our final classifier is…RANDOM FOREST. The final parameters (snippet 3) are the result of a lot of trial and error with grid search.
The random forest classifier made predictions on the validation set with an F1 score of 0.74 and on the test set with an F1 score of 0.75 (figure 19A).
Random forest also had a similar imbalance in predicted classes as SVM (figure 19B), but not as pronounced. As mentioned earlier, a small bias towards the positive class is not a bad thing in this case.
Looking at the feature importances, we find the same top 3 features — `monetary`, `frequency`, and `social` — as in logistic regression. So we can conclude that a customer's spending habits and the channel through which the offer was sent are the most important factors in predicting whether a customer will complete an offer.
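To sketch the final model's training and evaluation, here is a random-forest fit on synthetic data. The hyperparameters are illustrative, not the tuned values from snippet 3:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 15-feature offer/user data
X, y = make_classification(n_samples=2000, n_features=15, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters, not the tuned values from snippet 3
rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
rf.fit(X_train, y_train)
score = f1_score(y_test, rf.predict(X_test))

# Rank features by importance (indices here are synthetic columns)
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
```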
V. Conclusion
1. Summary
In this article, we walked through an in-depth analysis of simulated mobile app data from Starbucks in order to discover how customers respond to different offers.
Our goals were to (1) segment customers based on demographics and behavior and (2) build a classifier that predicts whether a user will respond to an offer. We have successfully done both and in the process, we uncovered a few key insights about the data:
- Offers sent via social media get a better response from customers
- The longer users have been using the app, the more comfortable they are spending money on it
- Female customers tend to spend more money than male customers within the same `age` or `income` group
- Younger customers tend to make frequent but small transactions
- As `age` increases, both average `income` and average spending follow a very similar pattern for both `genders`: an increase up to about `age` 50 and then no change for the upper age groups
- As `income` increases, spending increases
- The more customers spend, the more likely they are to respond to offers
From the segmentation analysis, we found that there are generally 3 classes of customers — the lower, middle, and upper segments, which I referred to as `bronze`, `silver`, and `gold` customers respectively. To reiterate my recommendations for improving ad targeting:
- `Bronze` users are not very likely to respond to offers, so it would be a good idea to stop sending them offers or to only send them offers that are easy to complete.
- `Silver` users completed a lot of the easier offers, so focusing on the easier offers or lowering the difficulty of the harder ones would likely improve their response.
- `Gold` users consistently have a high rate of offer completion, so it would actually benefit Starbucks to increase the difficulty of offers sent to these users.
2. Extending the project
Although this was a very in-depth analysis, there is still a lot of room for improvement. If you’d like to dig further into the data, here are some ideas:
- Explore channels — see how different channels affect the way customers respond to offers and the demographics of customers who do respond on certain channels.
- Analyze attempted offers — there are a lot of cases where the customer viewed the offer after receiving it and made purchases, but ultimately failed to complete the offer. These cases may give deeper insight into how offers can be retargeted or otherwise restructured.
- Train different classifiers — XGBoost or neural networks may give better results.
You can find the full code in my GitHub repository. I hope this has been an insightful read!