Target Marketing through A/B Testing and Machine Learning — Starbucks Promotion Strategy Analysis

Zacks Shen
9 min read · Apr 21, 2023


Photo by quan le on Unsplash

Motivation

Traditional A/B tests, such as testing a new UX design for a website, incur no additional cost once the experiment is finished and the change has shipped. But what if the strategy that wins the A/B test carries an ongoing expense? Is the positive change in your business worth the additional cost?

In this case, some potential customers are only willing to purchase when they receive a promotion. Every promotion costs $0.15, while every product sold earns $10. If we send a promotion to an inactive customer who will never purchase, we lose $0.15. Therefore, the best approach is to identify the receptive customers and send promotions to as many of them as possible.
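To make the trade-off concrete, here is a minimal sketch of the break-even arithmetic, using only the two figures quoted above ($0.15 per promotion, $10 per sale). The function name and the example uplift values are illustrative assumptions, not part of the original analysis.

```python
PROMO_COST = 0.15    # cost of sending one promotion (figure from the article)
SALE_REVENUE = 10.0  # revenue from one sold product (figure from the article)


def expected_gain_per_promotion(uplift):
    """Expected net gain from sending one promotion.

    `uplift` is the extra purchase probability caused by the promotion.
    It is a hypothetical input: the article does not give per-customer uplifts.
    """
    return SALE_REVENUE * uplift - PROMO_COST


# A promotion pays for itself only if it raises the purchase
# probability by more than cost / revenue, i.e. about 1.5 percentage points.
break_even_uplift = PROMO_COST / SALE_REVENUE

print(break_even_uplift)
print(expected_gain_per_promotion(0.01))  # below break-even: negative
print(expected_gain_per_promotion(0.05))  # above break-even: positive
```

Under these assumptions, a customer whose purchase probability barely moves is a net loss to promote, which is why targeting matters.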

Note: This article is written in a human-readable way. To read the professional report or review the code, please check the GitHub Repo or the Kaggle Notebook.

Dataset Summary

The data for this exercise consists of about 120,000 data points split in a 2:1 ratio between training and test files. In the experiment simulated by the data, an advertising promotion was tested to see whether it would bring more customers to purchase a specific product priced at $10. Since it costs the company $0.15 to send out each promotion, it would be best to limit the promotion to the customers who are most receptive to it. Each data point includes one column indicating whether or not an individual was sent a promotion for the product, and one column indicating whether or not that individual eventually purchased the product. Each individual also has seven additional features, which are provided abstractly as V1-V7.

Starbucks Purchase Funnel

KPIs

Starbucks defines two KPIs, IRR and NIR:

Incremental Response Rate: IRR depicts how many more customers purchased the product with the promotion compared to if they didn’t receive the promotion.

Net Incremental Revenue: NIR depicts how much is made (or lost) by sending out the promotion.

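Incremental Response Rate:
IRR = k2 / n2 - k1 / n1

Net Incremental Revenue:
NIR = (10 * k2 - 0.15 * n2) - 10 * k1

Here k2 is the number of purchasers and n2 the number of customers in the group that received the promotion, while k1 and n1 are the same counts for the group that did not.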
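The two KPIs can be sketched as plain functions of four counts. This follows the definitions above and the baseline computation later in the article; the function names, parameter names, and example counts are illustrative assumptions.

```python
def incremental_response_rate(k_treat, n_treat, k_ctrl, n_ctrl):
    """Purchase rate with the promotion minus purchase rate without it."""
    return k_treat / n_treat - k_ctrl / n_ctrl


def net_incremental_revenue(k_treat, n_treat, k_ctrl, price=10.0, promo_cost=0.15):
    """Revenue from promoted purchasers, minus promotion costs,
    minus revenue from purchasers who bought without a promotion."""
    return price * k_treat - promo_cost * n_treat - price * k_ctrl


# Hypothetical counts, chosen only to illustrate the arithmetic:
# 1000 promoted customers with 30 purchases, 1000 control customers with 10.
irr = incremental_response_rate(30, 1000, 10, 1000)
nir = net_incremental_revenue(30, 1000, 10)
print(round(irr, 4), round(nir, 2))  # 0.02 50.0
```

In this made-up example the promotion adds 2 percentage points of purchase rate, and the extra $200 of revenue more than covers the $150 spent on promotions.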

Questions

  1. Do promotions bring more purchases and therefore more revenue?
  2. If so, how can we maximize purchases to achieve a better IRR and NIR?

Therefore, there are two main parts:

- Part I: Applying A/B Testing to answer the 1st question.

- Part II: Implementing Machine Learning to answer the 2nd question.

Part I: A/B Testing


I made this General A/B Testing Flowchart as a guideline for this research. Since this is a simplified study, we don't have to follow the flowchart exactly, but the main steps still apply.

Note: the control group is also called Group A or g1; the treatment group is also called Group B or g2 in this research.

First, we need to get Group A and Group B.

# define Group A as g1, Group B as g2
g1 = train_data[(train_data.Promotion == "No")]
g2 = train_data[(train_data.Promotion == "Yes")]

Checking the Invariant Metric: A/A Testing. Second, we need to check the invariant metric, the number of customers. The Null Hypothesis is that there is no difference in the number of customers between Group A and Group B; the Alternative Hypothesis is that there is a difference. If a statistically significant difference is detected, it means the promotions were not assigned to customers randomly with a probability of 50%.

from statsmodels.stats.proportion import proportions_ztest

# get the sample size of g1 & g2
n1 = g1.shape[0] # 42170
n2 = g2.shape[0] # 42364
n = n1 + n2 # total number of customers in both groups

# One-proportion z-test for Invariant Metric number of customers (n1 & n2)
zstat, pval = proportions_ztest(
    count=n1, nobs=n, value=0.5, alternative="two-sided", prop_var=False
)
zstat, pval
# (-0.6672478204244043, 0.5046138502146766)

With a p-value of 0.5, we fail to reject the Null Hypothesis at the 0.05 level of significance. Therefore, there is not enough evidence to suggest that the number of customers in the treatment group significantly differs from that in the control group.
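As a sanity check, the same statistic can be reproduced by hand with the standard library. This is a minimal sketch of a two-sided one-proportion z-test that, like `prop_var=False`, uses the sample proportion for the variance; the helper name is hypothetical, and the group sizes are the ones reported in the code comments above.

```python
import math


def one_proportion_ztest(count, nobs, p0=0.5):
    """Two-sided one-proportion z-test with variance from the sample proportion."""
    p_hat = count / nobs
    standard_error = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p0) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = one_proportion_ztest(42170, 42170 + 42364)
print(round(z, 4), round(p, 4))  # -0.6672 0.5046
```

The control group holds 42170 of 84534 customers, about 49.9%, which is well within what random 50/50 assignment would produce.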

Checking the Evaluation Metric: A/B Testing. Third, we need to check the evaluation metric, the purchase rate. The Null Hypothesis is that there is no difference in purchase rate between Group A and Group B, which means IRR = 0. The Alternative Hypothesis is that the purchase rate in Group B is greater than the purchase rate in Group A, which means IRR > 0. If a statistically significant difference is detected, it means the promotion increases the purchase rate.

# get the number of purchase of g1 & g2
k1 = g1[g1.purchase == 1].shape[0]
k2 = g2[g2.purchase == 1].shape[0]

# Two-proportion pooled z-test for Evaluation Metric - Purchase Rate
d0 = 0 # assume there is no difference between two groups
count = [k2, k1]
nobs = [n2, n1]

zstat, pval = proportions_ztest(
    count,
    nobs,
    value=d0, # null hypothesis
    alternative="larger",
    prop_var=False, # pooled
)
zstat, pval
# (12.468449461599388, 5.548209627035781e-36)

With an extremely small p-value of 5.55×10^-36, we reject the Null Hypothesis that there is no difference between the two groups. The difference in purchase rate between the control and treatment groups is statistically significant even at the 0.01 level, which means that the promotion has a significantly positive impact on the purchase rate. We can therefore infer that the promotion increased the purchase rate compared to the control group, where no promotion was provided, and we can recommend continuing the promotion strategy to increase sales.
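For readers who want to see what a pooled two-proportion z-test computes, here is a minimal standard-library sketch of the one-sided ("larger") version. The function name is hypothetical, and the counts in the example are made up for illustration; they are not the article's data.

```python
import math


def two_proportion_ztest_larger(k_treat, n_treat, k_ctrl, n_ctrl):
    """Pooled two-proportion z-test of H1: treatment rate > control rate."""
    p_treat = k_treat / n_treat
    p_ctrl = k_ctrl / n_ctrl
    # Under the null hypothesis both groups share one purchase rate,
    # so the variance is estimated from the pooled sample.
    p_pool = (k_treat + k_ctrl) / (n_treat + n_ctrl)
    standard_error = math.sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))
    z = (p_treat - p_ctrl) / standard_error
    # Upper-tail probability P(Z > z) of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60 purchases among 2000 promoted customers
# versus 30 purchases among 2000 control customers.
z, p = two_proportion_ztest_larger(60, 2000, 30, 2000)
print(z > 0, p < 0.05)  # True True
```

A large positive z means the treatment group's purchase rate sits many standard errors above the control group's, which is the situation the article reports.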

Part II: Machine Learning


I made this General Machine Learning Flowchart as a guideline for this research. Since this is a simplified study, we don't have to follow the flowchart exactly, but the main steps still apply.

Random-promo Strategy vs. All-promo Strategy. We now know that randomly assigned promotions did increase the purchase rate. However, remember that every promotion costs $0.15, which suggests we should only send promotions to some of the customers. What happens if we send promotions to all customers?

import numpy as np

# IRR baseline
IRR = k2 / n2 - k1 / n1 # 0.009454547819772702
# NIR baseline
NIR = (k2 * 10) - n2 * 0.15 - k1 * 10 # -2334.5999999999995

# if we send out promotions to all of the customers
def promotion_strategy(df):
    promotion = np.array(["Yes"] * df.shape[0])

    return promotion

# test_results is the scoring function provided in src/test_results.py
all_irr, all_nir = test_results(promotion_strategy)
all_irr, all_nir
# (0.009593158278250108, -1132.1999999999998)

The baseline IRR is 0.0095, and the baseline NIR is -$2334.60. If every customer receives a promotion, the IRR is almost the same at 0.0096, which is not surprising since promotions were randomly assigned in the sample dataset. The NIR rises to -$1132.20, but it is still negative, and we should not run a promotion strategy that loses money. So can Machine Learning give us a better promotion strategy with a higher IRR and NIR?

Feature Selection. Technically speaking, we should draw a heatmap of correlation coefficients or mutual information for feature selection. However, src/test_results.py shows that the test function takes all features. To simplify the problem, we use every feature for Machine Learning.

Model Selection. The target variable purchase is labeled, so supervised learning is the natural choice. It is also a boolean variable, so this is a classification problem and we need a classifier. I took three commonly used supervised classification algorithms as model baselines, with default arguments and a fixed random_state. The untuned XGBoost Classifier achieved an IRR of 0.0228 and a NIR of $137.65, which are much better than the random-promotion strategy and the all-promotion strategy.
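One way to turn a fitted classifier into a promotion strategy is to promote only the customers it predicts will purchase. The sketch below is an assumption about how the pieces fit together, not the author's exact code: `make_promotion_strategy` and `ThresholdStub` are hypothetical names, and the stub stands in for a real fitted model with a `predict` method.

```python
def make_promotion_strategy(model):
    """Build a strategy that promotes only customers predicted to purchase.

    `model` is any fitted classifier exposing predict(features) -> 0/1 labels.
    """
    def promotion_strategy(features):
        predictions = model.predict(features)
        return ["Yes" if int(label) == 1 else "No" for label in predictions]

    return promotion_strategy


class ThresholdStub:
    """Toy stand-in for a fitted classifier (hypothetical, for illustration only)."""

    def predict(self, rows):
        # Pretend that a large first feature signals a likely purchaser.
        return [1 if row[0] > 0.5 else 0 for row in rows]


strategy = make_promotion_strategy(ThresholdStub())
print(strategy([[0.9, 1], [0.1, 2], [0.7, 3]]))  # ['Yes', 'No', 'Yes']
```

The all-promotion strategy shown earlier returns a NumPy array, so in practice the list would likely be wrapped in `np.array` before being scored.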

Confusion Matrix. The Confusion Matrix is crucial for analyzing the test results from the Machine Learning outputs. It can help us to find out the best metric as a scorer for measuring the performance of Machine Learning. A good metric can significantly improve the model tuning efficiency.

Starbucks Confusion Matrix
  • Type I: True Positive. We predict this customer will purchase, so we are sending out a promotion. And this customer indeed purchases the product. Although we pay $0.15 for the promotion, it can increase the purchase rate (we proved this in the A/B Test part). So it is still a good deal.
  • Type II: False Positive. We predict this customer will purchase, so we are sending out a promotion. But this customer makes no purchase. FP (Type I Error) is the metric we really should decrease since it hurts Starbucks' Net Profit.
  • Type III: False Negative. We predict this customer will not purchase, so we don’t send out any promotions. But this customer actually purchases the product. It’s good to have this kind of loyal customer since they purchase without any promotion. Still, we had better identify FN (Type II Error) cases for further marketing strategies that turn them into long-term loyal customers.
  • Type IV: True Negative. We predict this customer will not make a purchase, so we don’t send out any promotions. And this customer indeed makes no purchase.
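The four outcomes above can be counted directly from true labels and predictions. This is a minimal pure-Python sketch with a hypothetical helper name; it returns the cells in the same order (tn, fp, fn, tp) that the article's later code unpacks, and the example labels are made up.

```python
def confusion_cells(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels, where 1 means purchase."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # promoted, and the customer purchased
        elif predicted == 1 and actual == 0:
            fp += 1  # promoted, but no purchase: $0.15 wasted
        elif predicted == 0 and actual == 1:
            fn += 1  # not promoted, but purchased anyway
        else:
            tn += 1  # not promoted, no purchase
    return tn, fp, fn, tp


# Hypothetical labels for six customers.
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
print(confusion_cells(y_true, y_pred))  # (2, 1, 1, 2)
```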

Metric Selection. In this case, either increasing the TP rate or lowering the FP rate benefits Starbucks' key metrics, IRR and NIR. Therefore, we have three directions for tuning the model: Recall, Precision, or Threat Score. I will also build a metric baseline with commonly used metrics such as accuracy, roc_auc, and f1 score.
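For reference, the three candidate metrics can be written in terms of confusion-matrix counts. This is a small sketch using the standard definitions, with the threat score (also called the critical success index) taken as TP / (TP + FP + FN); the function names and example counts are illustrative.

```python
def precision(tp, fp):
    """Share of promoted customers who actually purchased."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual purchasers who were promoted."""
    return tp / (tp + fn)


def threat_score(tp, fp, fn):
    """Hits divided by hits plus both kinds of error; ignores true negatives."""
    return tp / (tp + fp + fn)


# Hypothetical counts: 2 true positives, 1 false positive, 1 false negative.
print(round(precision(2, 1), 3))  # 0.667
print(round(recall(2, 1), 3))     # 0.667
print(threat_score(2, 1, 1))      # 0.5
```

Precision penalizes wasted promotions (FP), recall penalizes missed purchasers (FN), and the threat score penalizes both, which is why all three are plausible tuning targets here.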

Notice that the IRR and NIR formulas differ from the commonly used scorers. I made three custom scorers for this project to compete with the scorers above: irr_score, nir_score, and irr_nir_score. They are not exactly the IRR and NIR formulas; they can only approximate them. If they were identical to the evaluation formulas, there would be data leakage.

from sklearn.metrics import confusion_matrix

# define irr score
def irr_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fp) - fn / (fn + tn)

# define nir score
def nir_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (10 * tp - 0.15 * (tp + fp)) - 10 * fn

# combine irr score and nir score together
def irr_nir_score(y_true, y_pred):
    irr = irr_score(y_true, y_pred)
    nir = nir_score(y_true, y_pred)
    return irr * nir
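To see what these scorers reward, it helps to evaluate the same formulas on explicit counts. The sketch below restates them as functions of (tn, fp, fn, tp); the function names, the example counts, and the zero-division guard are my own additions for illustration, not part of the original scorers.

```python
def irr_from_counts(tn, fp, fn, tp):
    """Purchase rate among predicted buyers minus purchase rate among the rest."""
    # Guard (an added assumption): treat an empty group as a rate of 0.
    promoted_rate = tp / (tp + fp) if (tp + fp) else 0.0
    unpromoted_rate = fn / (fn + tn) if (fn + tn) else 0.0
    return promoted_rate - unpromoted_rate


def nir_from_counts(tn, fp, fn, tp):
    """Revenue from predicted buyers, minus promotion costs, minus missed buyers."""
    return (10 * tp - 0.15 * (tp + fp)) - 10 * fn


# Hypothetical counts for 100 customers.
counts = dict(tn=90, fp=5, fn=2, tp=3)
print(round(irr_from_counts(**counts), 4))  # 0.3533
print(round(nir_from_counts(**counts), 2))  # 8.8
```

Functions with a (y_true, y_pred) signature like the ones above can be wrapped with scikit-learn's make_scorer so that a hyperparameter search optimizes them directly; whether the author did exactly that is an assumption here.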

Model Tuning. My tuned model, with properly chosen hyperparameters, significantly increases the IRR and NIR. Therefore, Starbucks should apply this ML model as a promotion strategy that improves both metrics. For reference, the strategy provided by Starbucks in src/test_results.py has an IRR of 0.0188 and a NIR of $189.45.

  • In comparison to the basic IRR (Random-promotion strategy), my solution increased IRR by 0.0109, or 114.73%;
  • In comparison to the Starbucks IRR (Starbucks-promotion strategy), my solution increased IRR by 0.0016, or 8.51%;
IRR Line Plot
  • In comparison to the basic NIR (Random-promotion strategy), my solution increased NIR by $2853.50;
  • In comparison to the Starbucks NIR (Starbucks-promotion strategy), my solution increased NIR by $329.45, or 173.90%;
NIR Line Plot
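The figures above are mutually consistent, which can be checked with a few lines of arithmetic. The tuned model's IRR and NIR are not stated directly, so the values below are derived from the rounded numbers quoted in the article and should be read as approximations.

```python
# Figures quoted in the article (rounded).
baseline_nir = -2334.60
starbucks_irr, starbucks_nir = 0.0188, 189.45
irr_gain_vs_starbucks = 0.0016
nir_gain_vs_baseline = 2853.50
nir_gain_vs_starbucks = 329.45

# Implied results of the tuned model (derived, not stated in the article).
tuned_irr = starbucks_irr + irr_gain_vs_starbucks
tuned_nir = starbucks_nir + nir_gain_vs_starbucks
print(round(tuned_irr, 4), round(tuned_nir, 2))  # 0.0204 518.9

# Both NIR comparisons point to the same tuned NIR.
print(round(baseline_nir + nir_gain_vs_baseline, 2))  # 518.9

# Relative improvements over the Starbucks strategy, in percent.
print(round(100 * irr_gain_vs_starbucks / starbucks_irr, 2))  # 8.51
print(round(100 * nir_gain_vs_starbucks / starbucks_nir, 2))  # 173.9
```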

Conclusions

In this research, I explored how to combine A/B Testing & Machine Learning to solve a real-world target marketing problem. By analyzing the samples with randomly assigned promotions, we concluded that promotions positively impact purchases. Then we trained and tuned Machine Learning models to maximize the IRR & NIR.

Breaking down a complicated problem into small pieces with statistical and Machine Learning solutions is an efficient way to contribute to the company's revenue, not to mention how it saves the data science and marketing departments time, budget, and human resources.
