# Target Marketing through A/B Testing and Machine Learning — Starbucks Promotion Strategy Analysis

# Motivation

Traditional A/B tests, such as UX design for a website, have no additional cost once the experiment and the website upgrade are done. But what if you have extra and continuous expenses from the new strategy, which comes from the positive feedback of the A/B test? Is this positive change in your business worth the additional costs?

In this case, some potential customers will only purchase with a promotion. **Every promotion costs $0.15, while every sold product earns $10.** If we send promotions to inactive customers who will never purchase, we lose **$0.15** each time. Therefore, the best strategy is to identify receptive customers and send promotions to as many of them as possible.
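To make the trade-off concrete, the break-even conversion rate follows directly from the two numbers above; this quick arithmetic check is mine, not from the original analysis:

```python
# Break-even analysis (my illustrative arithmetic, not from the original study)
PROMO_COST = 0.15  # dollars per promotion sent
REVENUE = 10.0     # dollars per product sold

# Sending a promotion pays off only if the extra purchase probability it
# creates exceeds cost / revenue.
break_even_rate = PROMO_COST / REVENUE
print(f"break-even incremental purchase rate: {break_even_rate:.1%}")  # 1.5%
```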

Note: This article is written in a human-readable way. To read the professional report or review the code, please check the **GitHub Repo** or the Kaggle Notebook.

# Dataset Summary

The data for this exercise consists of about 120,000 data points split in a 2:1 ratio between training and test files. In the experiment simulated by the data, an advertising promotion was tested to see if it would bring more customers to purchase a specific product priced at $10. Since it costs the company $0.15 to send out each promotion, it would be best to limit that promotion only to those who are most receptive to it. Each data point includes one column indicating whether or not an individual was sent a promotion for the product, and one column indicating whether or not that individual eventually purchased it. Each individual also has seven additional features associated with them, provided abstractly as V1-V7.

# KPIs

Starbucks defines two KPIs, IRR and NIR, as follows:

Incremental Response Rate: IRR measures how many more customers purchased the product with the promotion, compared to if they hadn't received it.

Net Incremental Revenue: NIR measures how much revenue is made (or lost) by sending out the promotion.
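These two KPIs can be written as short functions of the purchase counts (k) and group sizes (n) used in Part I. The group sizes below come from the training data; the purchase counts are back-solved from the baseline IRR/NIR reported later, so treat them as illustrative:

```python
# IRR / NIR written out as functions (a sketch following the article's formulas)
def irr(k_treat, n_treat, k_ctrl, n_ctrl):
    """Incremental Response Rate: treatment purchase rate minus control purchase rate."""
    return k_treat / n_treat - k_ctrl / n_ctrl

def nir(k_treat, n_treat, k_ctrl, price=10.0, promo_cost=0.15):
    """Net Incremental Revenue: extra revenue minus total promotion cost."""
    return k_treat * price - n_treat * promo_cost - k_ctrl * price

# Group sizes from the training data; purchase counts back-solved from the
# reported baselines (illustrative, not quoted directly in the article)
n1, n2 = 42170, 42364  # control, treatment
k1, k2 = 319, 721      # purchasers in control, treatment

print(irr(k2, n2, k1, n1))  # ~0.0095, the baseline IRR
print(nir(k2, n2, k1))      # ~-2334.6, the baseline NIR
```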

# Questions

- Do promotions bring more purchases, leading to revenue increment?
- If so, how to maximize the purchases, resulting in better IRR and NIR?

Therefore, there are two main parts:

- Part I: Applying A/B Testing for answering the 1st question.

- Part II: Implementing Machine Learning to answer the 2nd question.

# Part I: A/B Testing

I made this General A/B Testing Flowchart as a guideline for this research. Since this is a simplified study, we don't need to follow the flowchart exactly, but the main steps still apply.

*Note: the control group is also called Group A or g1; the treatment group is also called Group B or g2 in this research.*

First, we need to get Group A and Group B.

```python
# define Group A as g1, Group B as g2
g1 = train_data[train_data.Promotion == "No"]
g2 = train_data[train_data.Promotion == "Yes"]
```

**Checking the Invariant Metric: A/A Testing.** Second, we need to check the invariant metric — the number of customers. The *Null Hypothesis* is that there is no difference in the number of customers between Group A and Group B; the *Alternative Hypothesis* is that there is a difference. If a statistically significant difference were detected, it would mean the promotions were not sent out randomly with equal (50%) probability.

```python
from statsmodels.stats.proportion import proportions_ztest

# get the sample size of g1 & g2
n1 = g1.shape[0]  # 42170
n2 = g2.shape[0]  # 42364

# One-proportion z-test for the invariant metric: number of customers (n1 & n2)
zstat, pval = proportions_ztest(
    count=n1, nobs=n1 + n2, value=0.5, alternative="two-sided", prop_var=False
)
zstat, pval
# (-0.6672478204244043, 0.5046138502146766)
```

**With a p-value of 0.5, we fail to reject the Null Hypothesis at the 0.05 level of significance.** Therefore, there is not enough evidence to suggest that the number of customers in the treatment group significantly differs from that in the control group.

**Checking the Evaluation Metric: A/B Testing.** Third, we need to check the evaluation metric — the purchase rate. The *Null Hypothesis* is that there is no difference in purchase rate between Group A and Group B, i.e., *IRR = 0*; the *Alternative Hypothesis* is that the purchase rate in Group B is greater than in Group A, i.e., *IRR > 0*. If a statistically significant difference is detected, it means the promotion increased the purchase rate.

```python
# get the number of purchasers in g1 & g2
k1 = g1[g1.purchase == 1].shape[0]
k2 = g2[g2.purchase == 1].shape[0]

# Two-proportion pooled z-test for the evaluation metric: purchase rate
d0 = 0  # assume there is no difference between the two groups
count = [k2, k1]
nobs = [n2, n1]
zstat, pval = proportions_ztest(
    count,
    nobs,
    value=d0,  # null hypothesis
    alternative="larger",
    prop_var=False,  # pooled
)
zstat, pval
# (12.468449461599388, 5.548209627035781e-36)
```

**With a p-value of 5.55×10^-36, which is extremely small, we reject the Null Hypothesis** that there is no difference between the two groups. We are far more than 99% confident that there is a statistically significant difference in purchase rate between the control and treatment groups; in other words, the promotion has a significantly positive impact on the purchase rate compared to the control group, where no promotion was provided. We can therefore recommend continuing the promotion strategy to increase sales.
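Beyond the hypothesis test, a confidence interval shows how large the lift plausibly is. Below is a minimal sketch of a 95% normal-approximation interval for the difference in purchase rates; the purchase counts are back-solved from the reported baselines and are illustrative:

```python
import math

# 95% confidence interval for the lift in purchase rate (normal approximation).
# Counts are back-solved from the article's reported baseline IRR/NIR, so they
# are illustrative rather than quoted directly.
n1, k1 = 42170, 319  # control: size, purchasers
n2, k2 = 42364, 721  # treatment: size, purchasers

p1, p2 = k1 / n1, k2 / n2
diff = p2 - p1  # the IRR point estimate
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"IRR = {diff:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

The interval lies entirely above zero, which is consistent with the one-sided z-test rejecting the null.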

# Part II: Machine Learning

I made this General Machine Learning Flowchart as a guideline for this research. Since this is a simplified study, we don't need to follow the flowchart exactly, but the main steps still apply.

**Random-promo Strategy vs. All-promo Strategy.** Now we know that randomly assigned promotions did increase the purchase rate. However, remember that every promotion costs $0.15, so we should only send promotions to some of the customers. What happens if we send promotions to all of them?

```python
import numpy as np

# IRR baseline
IRR = k2 / n2 - k1 / n1  # 0.009454547819772702

# NIR baseline
NIR = k2 * 10 - n2 * 0.15 - k1 * 10  # -2334.5999999999995

# if we send out promotions to all of the customers
def promotion_strategy(df):
    promotion = np.array(["Yes"] * df.shape[0])
    return promotion

all_irr, all_nir = test_results(promotion_strategy)
all_irr, all_nir
# (0.009593158278250108, -1132.1999999999998)
```

The baseline IRR is **0.0095**, and the baseline NIR is **$-2334.60**. If every customer gets a promotion, the IRR is almost unchanged at **0.0096** — not surprising, since promotions were randomly assigned in the sample dataset. The NIR does improve to **$-1132.20**, but it is still negative, and a promotion strategy that loses money is not worth running. So can we apply a better promotion strategy with Machine Learning for a higher IRR and NIR?

**Feature Selection.** Technically speaking, we should draw a heatmap of correlation coefficients or mutual information for feature selection. However, `src/test_results.py` shows that the test function takes all features. To simplify the problem, we use every feature for Machine Learning.

**Model Selection.** The target variable `purchase` is labeled, so supervised learning is the right choice; `purchase` is also a boolean variable, so we need a classifier for this classification problem. I took three commonly used supervised classification algorithms as model baselines, with default arguments and a fixed `random_state`. As a result, the untuned XGBoost Classifier has an IRR of **0.0228** and a NIR of **$137.65**, much better than both the *random-promotion strategy* and the *all-promotion strategy*.
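The article does not name the two baseline algorithms besides XGBoost, so the sketch below is an assumption: it compares three common untuned classifiers with a fixed `random_state` on synthetic stand-in data, using scikit-learn's `GradientBoostingClassifier` as a dependency-free stand-in for `XGBClassifier` from the `xgboost` package:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the V1-V7 features and the `purchase` label
X, y = make_classification(n_samples=2000, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three untuned baselines with a fixed random_state (model choices are my
# assumption; the article's winner was the XGBoost Classifier)
baselines = {
    "logreg": LogisticRegression(max_iter=1000, random_state=42),
    "forest": RandomForestClassifier(random_state=42),
    "boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

In the real project, accuracy would be replaced with the custom IRR/NIR scorers discussed below.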

**Confusion Matrix.** The Confusion Matrix is crucial for analyzing the test results of the Machine Learning outputs. It helps us find the best metric to use as a scorer for measuring model performance, and a good metric can significantly improve model-tuning efficiency.

- Type I: True Positive. We predict this customer will purchase, so we send a promotion, and the customer indeed purchases the product. Although we pay $0.15 for the promotion, it increases the purchase rate (as proved in the A/B Test part), so it is still a good deal.
- Type II: False Positive. We predict this customer will purchase, so we send a promotion, but the customer makes no purchase. FP (Type I Error) is the metric we most need to decrease, since it directly hurts Starbucks' net profit.
- Type III: False Negative. We predict this customer will not purchase, so we send no promotion, but the customer actually purchases the product. It's good to have this kind of loyal customer, since they purchase without any promotion. Still, we had better identify FN (Type II Error) for further marketing strategies and turn them into long-term loyal customers.
- Type IV: True Negative. We predict this customer will not purchase, so we send no promotion, and the customer indeed makes no purchase.

**Metric Selection.** In this case, either increasing the TP rate or lowering the FP rate benefits Starbucks' key metrics — IRR and NIR. Therefore, we have three directions for tuning the model — Recall, Precision, or Threat Score. I will also make a metric baseline with commonly used metrics such as accuracy, roc_auc, and f1 score.

Notice that the IRR and NIR formulas use different quantities than the commonly used scorers. I made three customized scorers for this project to compete with the scorers above — `irr_score`, `nir_score`, and `irr_nir_score`. They are not exactly the IRR and NIR formulas: they can be close to the formulas, but they cannot be identical, because that would cause data leakage.

```python
from sklearn.metrics import confusion_matrix

# define irr score
def irr_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fp) - fn / (fn + tn)

# define nir score
def nir_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (10 * tp - 0.15 * (tp + fp)) - 10 * fn

# combine irr score and nir score together
def irr_nir_score(y_true, y_pred):
    irr = irr_score(y_true, y_pred)
    nir = nir_score(y_true, y_pred)
    return irr * nir
```
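To plug such scorers into hyperparameter search, they can be wrapped with scikit-learn's `make_scorer`; the `GridSearchCV` wiring shown in the comment is a sketch of one possible setup, not the article's exact code (`irr_score` is repeated so the snippet is self-contained):

```python
from sklearn.metrics import confusion_matrix, make_scorer

# repeated from above so this snippet runs on its own
def irr_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fp) - fn / (fn + tn)

# make_scorer turns a plain (y_true, y_pred) function into a scorer object
# that GridSearchCV / cross_val_score can maximize, e.g.:
#   GridSearchCV(estimator, param_grid, scoring=irr_scorer, cv=5)
irr_scorer = make_scorer(irr_score, greater_is_better=True)

# sanity check on a tiny hand-built example: tn=2, fp=1, fn=1, tp=2
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(irr_score(y_true, y_pred))  # 2/3 - 1/3 = 1/3
```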

**Model Tuning.** My tuned model with proper hyperparameters significantly increases both IRR and NIR, so Starbucks should adopt this ML model as its promotion strategy. For comparison, the reference strategy provided by Starbucks in `src/test_results.py` has an IRR of 0.0188 and a NIR of $189.45.

- Compared to the basic IRR (random-promotion strategy), my solution increased IRR by **0.0109**, or **114.73%**;
- Compared to the Starbucks IRR (Starbucks-promotion strategy), my solution increased IRR by **0.0016**, or **8.51%**;
- Compared to the basic NIR (random-promotion strategy), my solution increased NIR by **$2,853.50**;
- Compared to the Starbucks NIR (Starbucks-promotion strategy), my solution increased NIR by **$329.45**, or **173.90%**.

# Conclusions

In this research, I explored how to combine A/B Testing and Machine Learning to solve a real-world target-marketing problem. By analyzing samples with randomly assigned promotions, we concluded that promotions positively impact purchases. We then trained and tuned Machine Learning models to maximize the IRR and NIR.

Breaking a complicated problem into small pieces with statistical and Machine Learning solutions is an efficient way to contribute to the company's revenue, not to mention how it saves the data science and marketing departments' time, budget, and human resources.