How to target promotional offers at Starbucks to increase ROI?

Yap Fantasy
11 min read · Sep 23, 2019


Photo by Omar Lopez on Unsplash

If you have an office job, you probably have a cafe membership app on your phone, where you top up regularly for daily coffees and receive promotional offers in return. How would the cafe, on the other end of the line, optimise their Return on Promotional Offers by tailoring their offers to you?

This is precisely the problem I was trying to solve for the Starbucks Capstone Assignment in the Data Science Nanodegree. The two main goals here are to encourage customers to (1) buy more often and (2) pay more. I approached these from the following angles:

1. In a broader sense, which groups of customers are more likely to complete an offer and increase their basket size? Which types of offers are more likely to elicit this kind of behaviour?

2. From a customisation perspective, what offer should I send to a particular customer (if any) to maximise the return?

Let us delve into this step by step…

Understanding more about the Business Context

Starbucks sends three types of offers with varying levels of minimum spend, reward, effective period and distribution channels. They are:

  • BOGO (buy-one-get-one-free offer)
  • Discount (e.g. $5 off if you make a $10 purchase)
  • Informational offer (e.g. Summer is here! Come get a Mocha Frappuccino to shake off the heat)

Starbucks randomly sends the offers to different customers on the app at different points in time, and records whether each customer has viewed an offer or completed it (that is, the offer triggered the customer to make a transaction). The data also contains transactions from this group of customers that are not related to offers.

Cleansing and Exploring the Datasets

For this exercise, I was provided with three datasets:

  • profile.json — demographic data on customers who have signed up for the Starbucks app and have been sent offers during the duration of this experiment
  • portfolio.json — outlines the details of the three types of offers with ten different spend and reward variations
  • transcript.json — records unique events (offer received, viewed or completed, or transaction made) and their timestamps for the registered customers

There are a few problems in the datasets that I tried to solve with data cleansing:

  • Nulls in profile data
    There are 2,175 customers, out of a total of 17,000 in the profile dataset, with no demographic data (age, income or gender). As this is more than 10% of the total data, I imputed the numerical columns (age, income) with the mean and gender with the mode. I also created a separate column to flag this group of customers, as their reluctance to provide personal data might suggest a particular type of purchase behaviour (a code sketch follows this list).
  • Difficulty in matching transaction events with offer events
    I joined all the datasets and summarised the transcript data so that every row is a unique combination of customer and offer received. Other columns on the same row flag whether the offer has been viewed or completed, what the associated transaction amount was, and the number of previous purchases made.
    This was challenging because the transcript dataset flags an offer as completed even when it has not been viewed, and there is no data on whether informational offers have been completed. I therefore assumed that only a viewed offer can be completed, and that if a customer received an informational offer and made a transaction within its effective period, it is considered completed. Any offer that is successfully completed is flagged with a one in a separate column (the matching logic is sketched after this list).
Final dataframe combining all datasets — each row is a unique customer-offer received combination
  • Lack of control group
    There is no data prior to the experimentation period in which customers' behaviour without offers could be tracked. There are also only 6 customers amongst the 17,000 in the dataset who did not receive any offers during the experiment. In other words, there is no control group against which we can compare the treatment results of different offers. As a result, I could only compare the outcomes of those receiving different treatments.
  • Potential bias in treatment target selection
    As the mechanism by which customers are selected to receive different offers is unknown, I compared the distribution of demographic and behavioural metrics for each group to see if any statistical matching (for example, propensity score matching) was needed. In this exercise, all the distributions are nearly identical, so no matching was required.
The number of customers receiving each type of offer is similar
Distribution of Age (left), Income (centre), Gender (right) of customers who received the ten types of offers
Distribution of Average amount of purchase made per order without offers prior to the offer received (left), and Percentage of offer completed prior to the offer received (right)
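To make the cleansing steps concrete, here is a minimal sketch, assuming the column names below (they are illustrative; the actual schema may differ):

```python
import pandas as pd

profile = pd.read_json('profile.json', orient='records', lines=True)

# Flag customers who withheld demographics BEFORE imputing, since the
# reluctance itself may carry signal about purchase behaviour
profile['no_demographics'] = profile['income'].isna().astype(int)

# Impute the numerical columns with the mean and gender with the mode
profile['age'] = profile['age'].fillna(profile['age'].mean())
profile['income'] = profile['income'].fillna(profile['income'].mean())
profile['gender'] = profile['gender'].fillna(profile['gender'].mode()[0])

def in_effective_window(offer_row, t):
    """True if a transaction at time t can be attributed to this offer:
    it must fall inside the offer's effective window and after the offer
    was viewed (an unviewed offer cannot be completed)."""
    start = offer_row['time_received']
    end = start + offer_row['duration']
    viewed = (pd.notna(offer_row['time_viewed'])
              and offer_row['time_viewed'] <= t)
    return viewed and start <= t <= end
```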

1. Which groups of customers are more likely to complete an offer and increase their basket size? Which types of offers are more likely to elicit this behaviour?

To answer this question, I carried out the following analysis.

Compare Average Treatment Effects

Traditionally, the treatment effect is defined as the difference between the expected outcome with and without the treatment: ATE = E[Y(1)] − E[Y(0)]. Two issues make this definition hard to apply directly here.

Firstly, there is no data on customers who did not receive offers (a control set), so it would be inaccurate to conclude whether a customer is more likely to purchase with or without a promotional offer. Some people might not have completed an offer because they had recently bought a Starbucks coffee without one and did not feel the need to buy another. And even if they viewed and completed the offer, the decision to purchase might still be independent of viewing it, as buying could simply be part of their habit. Without control data, it is impossible to find the treatment effect on the likelihood of purchase.

Therefore, I simply compared the percentage of those who completed a particular type of offer out of those who viewed it. I chose viewers as the denominator because they are the ones who were truly exposed to, and affected by, the offer. This is an approach outlined in a whitepaper co-published by Facebook and the Kellogg School [1].
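As a minimal sketch of this metric, assuming the summarised dataframe from the cleansing step has one row per customer-offer pair with 0/1 `viewed` and `completed` flags (column names are illustrative):

```python
# Completion rate per offer, with viewers only as the denominator
viewers = summary_df[summary_df['viewed'] == 1]
completion_rate = viewers.groupby('offer_id')['completed'].mean()
```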

Secondly, to compare how much more or less a customer spends with an offer, the transactions that were not prompted by offers come in handy as a proxy control dataset. Unlike the comparison of purchase propensity, I could only compare the spend appetite of those who have made purchases, for whom I have all the data. A customer's spend is also likely to be influenced by their income, their perceived value of Starbucks drinks and their price sensitivity. A Starbucks promotional offer directly influences the perceived value of Starbucks products, while the offer reward affects what customers earn on their drinks and hence their price sensitivity.

For the above reasons, the following logic is used in the analysis to compare the effects of offers:
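The original post presents this logic as an image. One plausible reading, reusing the `viewers` frame from the sketch above, where `amount` is the spend associated with the offer, `reward` its reward cost, and `baseline_spend` the customer's average order value without offers (all names illustrative):

```python
# Expected net revenue uplift per offer amongst viewers:
# (spend with the offer - reward actually paid out) - spend without an offer
viewers['net_revenue'] = viewers['amount'] - viewers['reward'] * viewers['completed']
viewers['net_uplift'] = viewers['net_revenue'] - viewers['baseline_spend']
expected_uplift = viewers.groupby('offer_id')['net_uplift'].mean()
```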

The insights are as follows:

- Most offers yield an average expected uplift in net revenue amongst customers who actually viewed the offer. This is likely encouraged by the minimum spend requirement (the difficulty level)

- The top two offers with the highest expected uplift are both discount-type offers distributed via four communication channels, including social media

Compare the responses of different demographic groups

A simple EDA was carried out to understand how much different demographic groups are willing to pay to complete offers.

Insights

- Males tend to spend just above the required minimum, whereas females spend on average 15 dollars on most offers.

Distribution of Female Spend (Left) and Male Spend (Right) in response to different types of offers

- Those who did not enter any personal details almost exclusively spend below $20, and only spend the bare minimum needed to meet an offer's redemption requirement.

Distribution of Spend of those with NULL personal data fields in response to offers

- Those with lower incomes tend to spend just above the required minimum, compared with those earning more. For those earning above $70,000, the mean spend is $20 regardless of which offer they are redeeming.

Distribution of Spend of those earning below $35,000 (left), between $35,000 and $70,000 (centre), and above $70,000 (right)

2. From a customisation perspective, what offer should I send to a particular customer (if any) to maximise the return?

Modelling and Offer Recommendation Engine

For this question, I built several machine learning models to predict a customer’s expected net revenue uplift using the same logic as I had described above for Question 1.

  1. A classification model is fitted to the data to predict whether a customer will complete an offer. The features include the type of offer received, the customer's demographic and behavioural metrics, and interactions between the two.
  2. Regression models are fitted only on the rows flagged as successful completions. In total, ten regression models are trained, one per offer, each predicting the purchase amount associated with that offer, since the distribution of purchase amounts varies across offers.
  3. A single regression model is fitted on the purchase data not associated with any offer, to predict how much a customer would spend without one (a sketch follows this list).
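A condensed sketch of this three-part structure, assuming the feature matrices and training frames have already been built (variable names such as `X_train`, `success_df` and `feature_cols` are illustrative):

```python
from xgboost import XGBClassifier, XGBRegressor

# (1) Will a customer complete a given offer?
clf = XGBClassifier(n_estimators=300, max_depth=5)
clf.fit(X_train, y_completed)

# (2) One spend regressor per offer, fitted on successful completions only
spend_models = {}
for offer_id, grp in success_df.groupby('offer_id'):
    reg = XGBRegressor(objective='reg:gamma')  # suits right-skewed spend
    reg.fit(grp[feature_cols], grp['amount'])
    spend_models[offer_id] = reg

# (3) Baseline spend on transactions with no offer attached
baseline_model = XGBRegressor(objective='reg:gamma')
baseline_model.fit(no_offer_df[feature_cols], no_offer_df['amount'])
```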

For the classification model, ensemble methods like Random Forest and Gradient Boosting, as well as logistic regression, were fitted to the data to compare results. After choosing the best-performing one, I carried out hyper-parameter tuning.

For the regression models, I tried a number of algorithms, such as XGBoost (which offers objectives for skewed distributions), Random Forest, Linear Regression and L2-regularised Linear Regression (Ridge). As the spend distribution is skewed to the right, there is a long tail of high spenders; for example, a customer might make a 900-dollar purchase in one visit, which is very uncommon. To eliminate these outliers, I removed values above the 90th percentile (see the snippet below).
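The trimming step might look like this, assuming a spend column named `amount`:

```python
# Keep only spends at or below the 90th percentile to trim the long tail
cap = train_df['amount'].quantile(0.90)
train_df = train_df[train_df['amount'] <= cap]
```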

Last but not least, the above models were combined to generate offer recommendations tailored to each customer.

Users can provide a customer id (for existing customers), customer details (for new customers) or no details at all (for new customers). In the first two cases, the offers that would generate a positive net revenue uplift (at most three) are recommended to the customer. In the last case, the top three default offers with the highest expected net revenue uplift, as calculated for Question 1, are pushed to the customer.
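A hedged sketch of how the models above could combine into such a recommender; `build_features` and `offer_rewards` are hypothetical helpers rather than part of the original code:

```python
def recommend_offers(customer, top_n=3):
    """Score every offer's expected net revenue uplift for one customer
    and return up to top_n offers with a positive uplift."""
    baseline = baseline_model.predict(build_features(customer))[0]
    scores = {}
    for offer_id, reg in spend_models.items():
        x = build_features(customer, offer_id)  # hypothetical feature builder
        p = clf.predict_proba(x)[0, 1]          # completion probability
        spend = reg.predict(x)[0]               # predicted spend if completed
        scores[offer_id] = p * (spend - offer_rewards[offer_id]) - baseline
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [oid for oid, uplift in ranked[:top_n] if uplift > 0]
```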

Model Evaluation

In classification, XGBoost performs second best, with results similar to Gradient Boosting (less than 1% lower); in regression, it performs the best across all models.

Classification

Precision, recall and ROC AUC are chosen as the classifier metrics instead of accuracy as there is class imbalance — there are more offers that have not been successfully completed than those that have been.
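These metrics can be computed with scikit-learn, for example:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]
print(f"precision={precision_score(y_test, y_pred):.2f}",
      f"recall={recall_score(y_test, y_pred):.2f}",
      f"roc_auc={roc_auc_score(y_test, y_prob):.2f}")
```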

The XGBoost classifier achieved precision of 0.70, ROC AUC of 0.79 and recall of 0.51 on the test set after hyper-parameter tuning. Recall is lower than precision, meaning that nearly half of the positives (offer completions) are classified as negatives, while fewer negatives (non-completed offers) are wrongly classified as positives. This problem exists across all models, and re-sampling techniques might be required to address the class imbalance.

Training scores are only 1–2% higher than test scores, so there is no overfitting issue.

Regression

The standard regression metrics R² and MSE are used to evaluate the regression models. The XGBoost regressor achieved the highest training R² (ranging from 0.2 to 0.67, with most models scoring above 0.55) and the lowest MSE (ranging from 9 to 21, with most models below 15).

The regression results are mediocre: only around 55% of the variation in spend can be explained by the regressors, and the models typically err by about $4 (around 40% of the sample's average spend of $10). This is likely because of the non-standard distribution of spend; some distributions are not only skewed but bimodal.

Why XGBoost models have the best performance

  • Randomised feature selection and regularisation reduce overfitting while still yielding relatively high precision and recall scores.
  • Quicker model training compared with all the other models tried.
  • For classification, it does not assume linearity of variables, as logistic regression does.
  • For regression, the package allows skewed distributions to be fitted; in this case, gamma and Tweedie objectives were used (see the snippet after this list).
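The skew-friendly objectives mentioned in the last point are exposed like this (a sketch; the Tweedie power value is illustrative):

```python
from xgboost import XGBRegressor

# Objectives suited to non-negative, right-skewed targets such as spend
reg_gamma = XGBRegressor(objective='reg:gamma')
reg_tweedie = XGBRegressor(objective='reg:tweedie', tweedie_variance_power=1.5)
```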

Limitations and Potential Improvements

  • Class Imbalance and High False Negatives: The limited recall is likely caused by the imbalance between the success classes and by insufficient insight into purchase behaviour prior to offers being distributed. To mitigate this, balancing techniques like sub-sampling, over-sampling or SMOTE can be used (a sketch follows this list). More transaction information from before the offer distribution could also be extracted.
  • Bimodal spend distributions: The regression models performed less well for some offers because some of the distributions are bimodal. Gaussian mixture models could potentially be a better fit in this situation. Also, there might be insufficient prior transaction data to help predict a customer's purchase frequency and behaviour.
  • Limited to the 10 Offer Variations: In this analysis, I treated the 10 variations of offers as distinct offer types: 10 dummy variables were created for classification and 10 models were generated for regression. Alternatively, I could model only the three types of offers (BOGO, discount and informational), one-hot encoding the distribution channels and treating difficulty and reward as continuous numerical variables. In this way, the models could potentially recommend new offer variations that further increase customers' net revenue uplift.
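As a sketch of the class-balancing idea from the first point, using imbalanced-learn's SMOTE (an assumption on my part; the original code does not include this step):

```python
from imblearn.over_sampling import SMOTE

# Synthesise minority-class (completed-offer) samples, then refit
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_completed)
clf.fit(X_res, y_res)
```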

TL;DR

Who to target with an offer: Targeting females and high-income individuals is likely to increase the expected net revenue (after deducting the cost of the reward).

What offer to push: Discount offers distributed on social media have the highest redemption rate and the highest uplift in expected net revenue (after deducting the reward).

Modelling and Customised Offer Recommendation: XGBoost performs the best for both classification and regression. Nevertheless, class-balancing techniques, mixture models and alternative ways of handling the treatment variables might improve the outcomes.

My code can be found here on GitHub.

Reference materials in this analysis:

[1] Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2017). A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. SSRN Electronic Journal. doi: 10.2139/ssrn.3033144

[2] Gelman, A., & Hill, J. (n.d.). Causal inference using regression on the treatment variable. Data Analysis Using Regression and Multilevel/Hierarchical Models, 167–198. doi: 10.1017/cbo9780511790942.012


Yap Fantasy

Strategy Consultant specialised in Pricing Strategy and passionate about Data Science