Starbucks mobile app offer analysis from a data perspective

Phong Vo Ngoc
14 min read · May 23, 2023


Introduction

With over 60 million customers served each week, Starbucks is the leader in coffee sales, selling an impressive four million cups of coffee per day. Starbucks encourages customer engagement by occasionally rewarding its customers with special offers. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). Some users might not receive any offers during certain weeks, and not all users receive the same offer.

I combine transaction, demographic, and offer data to explore how effective offers are for customers, and to build a model that predicts whether an offer will be wasted.

To achieve this goal, we will split the exploration into a few steps:

Understand what offers Starbucks provides and who its customers are

  • What kind of offers does Starbucks provide? What characteristics does each offer type have?
  • What are the demographic characteristics of members?
  • How differently do people react to the different promotion offers (transaction characteristics)?

Explore the offer effectiveness

  • How many offers did we send to customers?
  • Which type of offer is more attractive to customers?
  • How long do people take to respond to an offer?
  • How many offers are wasted?
  • What kind of people are more likely to waste an offer?
  • Build a prediction engine to determine if the customer will waste an offer or not.

Now, let’s start the analysis.

Data Exploration

In this section, we conduct exploratory data analysis to help understand the data and generate insights.

Data understanding

We have three datasets (a short loading sketch follows the list):

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.). This dataset contains 10 rows and 6 columns.
Portfolio dataset review
  • profile.json — demographic data for each customer. This dataset contains 17000 rows and 5 columns.
Profile dataset review
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed. This dataset contains 306534 rows and 4 columns.
Transcript dataset review
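
As a minimal sketch, the three files can be loaded with pandas. The data/ paths and the one-record-per-line JSON layout are assumptions about how the files are stored:

import pandas as pd

# Read the three JSON files; assume each file stores one JSON record per line
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# Confirm the shapes noted above
print(portfolio.shape)   # (10, 6)
print(profile.shape)     # (17000, 5)
print(transcript.shape)  # (306534, 4)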

Observations and Insights from exploratory data analysis

Let us first understand the three types of offers that Starbucks is looking to potentially send its customers:

  • Buy-One-Get-One (BOGO): In this particular offer, a customer is given a reward that enables them to receive an extra, equal product at no cost. The customer must spend a certain threshold in order to make this reward available.
  • Discount: With this offer, a customer is given a reward that knocks a certain percentage off the original cost of the product they are choosing to purchase, subject to limitations.
  • Informational: With this final offer type, there isn’t necessarily a reward; it simply informs the customer about a product they can purchase. (This might be something like letting customers know that the Pumpkin Spice Latte is becoming available again toward the beginning of autumn.)

The customer demographic overview of the dataset shows around 9000 male customers and around 6000 female customers, with a small number undeclared.

Across all age groups except 70+, the number of male customers is higher than the number of females. The group that visits Starbucks most frequently is aged 40–70. When it comes to income, an interesting finding is that the higher-income group (80k+) has more females than males. In terms of Starbucks memberships, we can see gradual growth each year until 2018.

We can see that there are about 140000 transactions, and the number of offers received is about half of that; the counts of offers viewed and completed diminish accordingly. Most people received 2–5 offers in the experiment, with an average of 4 offers.

We can also find a similar distribution of customer events in relation to income level and gender, with the largest population falling in the 50k–80k income group, and more males than females across the board.

The most frequently sent offer type is BOGO and the least frequent is Informational. From the offer distribution perspective, we can see the three offer types (BOGO, Discount, and Informational) are sent to relatively similar groups of people.

From the graphs of spending behaviors, we can see an increase from age 20 to 50, with spending leveling off after that. Females tend to spend more, while males and others spend less.

Exploratory data analysis findings and insights

What kind of offers does Starbucks provide? What characteristics does each offer type have?

  • There are three types of offers: BOGO, discount, and informational offer.
  • There are more BOGO and discount offers than informational offers.
  • An Informational offer requires no spending and carries no reward as a result, and it lasts the shortest time, about 3 days.
  • BOGO is the best promotion offered to customers: it costs less than a discount offer while yielding more reward. It is usually available for about 6 days, which is shorter than discount offers.

What are the demographic characteristics of members?

  • Gender: There are more male than female members.
  • Age: Most members are between the ages of 21–70, with an especially large group between 50–60. Only a small portion of members are under 20. The numbers of males and females in their 20s and 30s are quite close.
  • Income: People whose annual income is between 50k–80k seem more likely to become Starbucks members. While male members greatly outnumber females among those making less than 80k a year, female members outnumber males above 80k a year. The income distribution of members under 40 is almost the same, compared to the older members.
  • Membership: There is no obvious difference in average membership duration among age groups, though female members have a greater membership duration on average. As the number of members grows year over year, male members always outnumber female members, except in 2016, when slightly more females joined Starbucks membership.

How differently do people react to the different promotion offers (transaction characteristics)?

  • Most people received 2–5 offers in the experiment, with an average of 4 offers.
  • BOGO and Discount offers are sent to a similar number of customers. It turns out that more people completed discount offers than BOGO offers, while BOGO offers are viewed by more people than discount offers.
  • A smaller percentage of people in the 30–39 age group complete offers.
  • Males spent much less money than females but still gained similar rewards.
  • Discount offers take the longest for customers to react to.

Data Preparation

A lot of data cleaning and data wrangling has been done in the previous Data Understanding section. At the end of the exploratory data analysis, we combined the three datasets into a new data frame (Starbucks) for integrated analysis. In the Data Preparation part, we will focus on collecting all useful information to explore the offer effectiveness and prepare input data for modeling. We will break this work into several sub-tasks.

Data Processing

From the above data understanding, we know that there are four types of events: offer completed, offer received, offer viewed, and transaction; three types of offers: BOGO, Discount, and Informational.

Among the three types of offers, BOGO and Discount offers require customers to spend a certain amount of money in order to receive the reward, so there is an offer_id tied to the transaction to record the usage of the offer. An Informational offer requires no spending and generates no reward. To sum up, if we want to know the effectiveness of an offer, we need to determine whether the offer is wasted or not. The conversion path differs by offer type, as shown below (a sketch of this labeling rule follows the list):

  • BOGO and Discount offer: offer received -> offer viewed -> offer completed (offer effective). Note: if an offer is used without viewing, it should be treated as not effective since the customer didn’t notice the offer.
  • Informational offer: offer received -> offer viewed (offer effective).
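
A minimal sketch of this labeling rule, assuming each sent offer has already been assembled into one row of a hypothetical offers dataframe with boolean viewed and completed flags (the frame and column names are illustrative, not the original analysis code; the full analysis would also check event timestamps to confirm the offer was viewed before being completed):

def is_effective(row):
    # Informational offers have no completion event; a view counts as effective
    if row['offer_type'] == 'informational':
        return bool(row['viewed'])
    # BOGO/Discount must be viewed and then completed; completing an offer
    # without viewing it is treated as not effective
    return bool(row['viewed'] and row['completed'])

# Label each offer: wasted is the opposite of effective
offers['wasted'] = ~offers.apply(is_effective, axis=1).astype(bool)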

How many offers did we send to customers?

We sent about 167184 offers to customers: about 43% are BOGO type, 42% are Discount type, and 16% are Informational type.

On average, each customer received more than 4 BOGO and Discount offers and around 2 Informational offers.
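
One way to arrive at these counts is a groupby over the received-offer events. The received frame and its customer_id/offer_type columns are hypothetical names for the cleaned data:

# Hypothetical 'received' frame: one row per 'offer received' event,
# already joined with portfolio to attach 'offer_type'
per_customer = received.groupby(['customer_id', 'offer_type']).size()

# Average number of offers per customer, by offer type
print(per_customer.groupby(level='offer_type').mean())

# Distribution of total offers received per customer (most get 2-5, mean ~4)
print(received.groupby('customer_id').size().describe())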

Which type of offer is more attractive to customers?

BOGO is the kind of offer that received the highest percentage of views, while more people completed the discount offer type.

Among 30499 received BOGO offers, 83.44% were viewed, the highest view rate of any offer type, but only 50.82% of BOGO offers were used.

Among 30543 received Discount offers, 70.21% were viewed, about 12 percentage points lower than BOGO, but a higher percentage of these offers were used.

71.09% of Informational offers were viewed by customers.

How long do people take to respond to an offer?

There are 6 waves of offer events in this experiment. Usually, offers are viewed one day after the customer receives the offer. We can also see that as the offer frequency goes up, people are more likely to view and complete offers.
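
The view delay can be estimated by pairing each received event with its matching viewed event. This sketch assumes hypothetical received/viewed frames with a shared time column in hours, and ignores repeated sends of the same offer for simplicity:

# Pair each received event with the matching viewed event
merged = received.merge(viewed, on=['customer_id', 'offer_id'],
                        suffixes=('_received', '_viewed'))

# Response delay in days ('time' is assumed to be hours since experiment start)
merged['response_days'] = (merged['time_viewed'] - merged['time_received']) / 24
print(merged['response_days'].median())  # roughly 1 day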

How many offers are wasted?

An offer may be wasted because it was used without being viewed, or because it was never redeemed. BOGO offers are wasted the most, at more than 53%. Half of discount offers are wasted, and 26% of informational offers are wasted.
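
These waste rates fall out of the wasted flag from the earlier labeling sketch, again using the hypothetical offers frame:

# Share of wasted offers per offer type
waste_rate = offers.groupby('offer_type')['wasted'].mean()
print((waste_rate * 100).round(1))  # e.g. BOGO > 53%, Discount ~50%, Informational ~26%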

What kind of people are more likely to waste an offer? What offer characteristics will lead to waste?

  • People side: Categorize customers by demographic characteristics from the ‘profile’ dataset (age, gender, income, membership_in_years, etc.) and by transaction history from the ‘transcript’ dataset (e.g., average spend).
  • Offer side: Offer information from the ‘portfolio’ dataset (offer type, difficulty, channels, reward, etc.) and transaction history recording offer usage (whether the offer has been wasted or not).

Customers younger than 30 years old are much more likely to waste an offer, since more than half of their offers went unused.

Customers with income lower than 50k are much more likely to waste an offer. On the other hand, Starbucks offers are most effective for customers whose income level is between 80k and 100k.

Male customers are more likely to waste an offer.

The offer with 0 difficulty, 0 reward, and a 3-day duration is the informational offer, which has a relatively low waste rate.

The offer with a difficulty of $7 to qualify for the reward, the middle level of difficulty, is surprisingly the least likely to be wasted. Offers requiring more than $10 to qualify have a much higher waste rate.

Offer duration has a slight influence on whether an offer will be wasted. A 5-day duration is slightly more effective as an offer validity period.

Offers with a $3 reward are least likely to be wasted, while offers with a $5 reward are most likely to be wasted.

Modeling

Feature Selection — Correlation Matrix Heatmap

Correlation states how the features are related to each other or to the target variable. Correlation can be positive (an increase in a feature’s value increases the value of the target variable) or negative (an increase in a feature’s value decreases the value of the target variable).

A heatmap makes it easy to identify which features are most related to the target variable; we will plot the heatmap of correlated features using the Seaborn library.
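
A minimal sketch of the heatmap, assuming the combined modeling frame is the model_df used in the code later in the article:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of all numeric columns, including the target 'wasted'
corr = model_df.corr(numeric_only=True)

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()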

With the help of the correlation matrix, especially the first vertical column showing the correlation between the target variable ‘wasted’ and the other input variables, we select the features below as input variables (a sketch of assembling them follows the list):

  • ‘spent_total’
  • ‘spent_avg’
  • ‘gained_total’
  • ‘difficulty’
  • ‘duration’
  • ‘age’
  • ‘income’
  • ‘membership_in_years’
  • ‘mobile’
  • ‘social’
  • ‘web’
  • ‘gender_F’
  • ‘gender_M’
  • ‘gender_O’
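
One plausible way to assemble these columns into the model_df used below; the starbucks frame stands for the combined dataframe mentioned earlier, and the gender one-hot encoding is what produces the gender_F/gender_M/gender_O columns:

# One-hot encode gender, producing gender_F / gender_M / gender_O
gender_dummies = pd.get_dummies(starbucks['gender'], prefix='gender')

features = ['spent_total', 'spent_avg', 'gained_total', 'difficulty',
            'duration', 'age', 'income', 'membership_in_years',
            'mobile', 'social', 'web']

# Combine selected features, gender dummies, and the target into one frame
model_df = pd.concat([starbucks[features], gender_dummies,
                      starbucks['wasted']], axis=1)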

The goal is to predict whether customers will waste an offer or not. The output should be a discrete variable (0 = unwasted, 1 = wasted). Therefore, we choose classification models. Here, I will run three different supervised models and pick the most optimized one.

I will use three ML algorithms: Decision Tree, Random Forest, and K-Nearest Neighbors (KNN).

Since I use classification models, there are several metrics provided by the Scikit-learn library to evaluate classification model performance; a short usage sketch follows the list of metrics below.

Metrics:

  1. Confusion Matrix (not a metric but a tool): a tabular visualization of the model predictions versus the ground-truth labels. In scikit-learn’s convention, each row of the confusion matrix represents the instances in an actual class and each column represents the instances in a predicted class. The diagonal elements of this matrix denote the correct predictions for the different classes, while the off-diagonal elements denote the samples that are misclassified.
  2. Accuracy Score: the percent of correct predictions, which equals the number of correct predictions divided by the total number of predictions.
  3. Precision: the fraction of samples predicted as a given class that actually belong to that class. There are many cases in which classification accuracy is not a good indicator of model performance. One of these scenarios is when the class distribution is imbalanced (one class is more frequent than others). In this case, even if we predict all samples as the most frequent class we would get a high accuracy rate, which does not make sense at all (because the model is not learning anything, and is just predicting everything as the top class).
  4. Recall: the fraction of samples from a class that is correctly predicted by the model, out of the whole number of elements of this class. Recall reflects a classifier’s performance with respect to false negatives (how many positives we missed), while precision reflects its performance with respect to false positives (how many predicted positives were wrong).
  5. F1 Score: combines precision and recall into a single number: the harmonic mean of Precision and Recall. F1 = 2 * (precision * recall) / (precision + recall).
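
A short sketch of these scikit-learn metrics on a tiny set of hypothetical labels (0 = unwasted, 1 = wasted):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical ground-truth and predicted labels
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))    # 0.75: 6 of 8 predictions correct
print(precision_score(y_true, y_pred))   # 0.8: of 5 predicted 1s, 4 are truly 1
print(recall_score(y_true, y_pred))      # 0.8: of 5 true 1s, 4 were caught
print(f1_score(y_true, y_pred))          # 0.8: harmonic mean of precision and recall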

Decision Tree is a tree-based algorithm; a decision tree is simply a series of sequential decisions made to reach a specific result. We use the default classifier settings with random_state=40.

Random Forest is a tree-based algorithm that leverages the power of multiple decision trees for making decisions. Here we start with random_state=40 and max_depth=3 for the random forest classifier.

K-Nearest Neighbors (KNN) examines the classes/values of the points around it (its neighbors) to determine the value of the point of interest. KNN classifies new data points based on a similarity measure against earlier stored data points. Here, we use the KNN classifier with n_neighbors=10 and leaf_size=30.
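
A sketch of fitting the three candidates with the settings described above, assuming the X_train/X_test split shown in the modeling code further down:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# The three candidate models with the settings described above
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=40),
    'Random Forest': RandomForestClassifier(random_state=40, max_depth=3),
    'KNN': KNeighborsClassifier(n_neighbors=10, leaf_size=30),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f'{name}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}')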

Evaluation & Refinement

Model Evaluation

From the above models’ performance results, Random Forest is the winner with the leading performance overall in this case.

Both the Decision Tree and Random Forest perform better, at around 77% accuracy, than the KNN model’s 72%. The Decision Tree model has 100% accuracy in training but drops to 77% on the testing dataset. This is most likely due to overfitting of the training data. Compared to other machine learning models, overfitting happens easily in decision trees, because the learning algorithm can produce large and complex trees that perfectly fit the training instances.

Why does Random Forest outperform the Decision Tree? Random Forest is a collection of decision trees, and the average/majority vote of the forest is selected as the predicted output. It does not rely on the feature importance given by a single decision tree. In other words, it is more robust and accurate than a single Decision Tree.

As for the K-Nearest Neighbors model in this scenario, not only is its accuracy worse than Random Forest’s, but another drawback of the KNN model is the cost of time. KNN essentially explores the neighborhood of a test point, assumes the test data point is similar to its neighbors, and derives the output via a majority vote over the k nearest data points, which requires much more runtime on large samples of data.

Therefore, I choose the Random Forest algorithm to proceed with further optimization.

Refinement — Hyperparameter Tuning

Grid Search Cross-Validation

Grid Search Cross-Validation exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter and evaluates all combinations we defined.

We will try adjusting the following set of hyperparameters:

  • n_estimators = number of trees in the forest
  • max_depth = max number of levels in each decision tree

K-Fold Cross Validation

So far we have divided the input data into two parts and run training and testing respectively with the classification model. This method may not be reliable, as the training and testing data do not always have the same kind of variation as the original data, which affects the accuracy of the model. Cross Validation addresses this by shuffling the dataset to remove any ordering and by dividing the input data into multiple groups instead of only two.

K-Fold Cross Validation is one of the common cross-validation methods. Here we divide the data into K=5 folds, giving 5 sets of data to train and test our model. In other words, the model is trained and tested 5 times, and each time we use one fold as test data and all the rest as training data. What makes it effective is that for every iteration, the data in the training and test folds changes.

# Imports (pandas and scikit-learn)
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score

# Input variables and target variable
X = model_df.drop('wasted', axis=1)
y = model_df['wasted']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

# Define KFold
kf = KFold(n_splits=5, shuffle=True, random_state=40)

# Random Forest Classifier
rf = RandomForestClassifier(random_state=40, n_estimators=100)

# Scorers for cross-validation (weighted averages over both classes)
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score, average='weighted'),
           'recall': make_scorer(recall_score, average='weighted')}

# Hyperparameter grid to optimize over
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
}

# Grid Search over the parameter grid with 5-fold CV, scored on accuracy
grid_no_up = GridSearchCV(rf, param_grid=params, cv=kf, scoring='accuracy').fit(X_train, y_train)
y_pred_test = grid_no_up.predict(X_test)
print(grid_no_up.best_estimator_)

Grid Search Cross-Validation returns the best parameter set: RandomForestClassifier(max_depth=15, n_estimators=200, random_state=40).

Then we apply this set of parameters to train the random forest model with K-Fold Cross Validation.

# Additional imports for cross-validation and the confusion matrix plot
from sklearn.model_selection import cross_validate
from sklearn.metrics import ConfusionMatrixDisplay

# Define KFold
kf = KFold(n_splits=5, shuffle=True, random_state=40)

# Train model with best parameters
rf_opt = RandomForestClassifier(max_depth=15, random_state=40, n_estimators=200)

# Cross validation with k-fold, reusing the scorers defined above
cv_opt = cross_validate(rf_opt, X_train, y_train, cv=kf, scoring=scoring)
cv_opt_df = pd.DataFrame.from_dict(cv_opt)

# Confusion matrix for the grid-search predictions on the test set
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test);

print('Random Forest K-Fold Cross Validation Scores:')
display(cv_opt_df)

print('Average Score of K Fold Scores: \n{}'.format(cv_opt_df.mean()[2:].round(2)))

Modeling Conclusion

Through hyperparameter tuning with Grid Search Cross-Validation, we found the best parameters for our Random Forest model: RandomForestClassifier(max_depth=15, n_estimators=200, random_state=40).

With the best parameters, we then applied K-Fold Cross Validation to train the model. K-Fold Cross Validation reduces underfitting, as most of the data is used for training (fitting), and also reduces overfitting, as most of the data is also used across the validation folds.

As a result, accuracy improves noticeably, from 77% to 83%.

A Random Forest classification model with K = 5 fold Cross Validation method is able to predict whether an offer will be wasted or not with an accuracy of 83%.
