Udacity Course — Data scientist: Starbucks Capstone Challenge

Yuqi Zhang
8 min read · Dec 12, 2023


Section 1: Project Definition

Project Overview

This project analyzes Starbucks coupon distribution events. By analyzing user information, transaction records, and coupon details, we aim to predict how effective a given coupon will be for different user groups, so that targeted coupon distribution strategies can be implemented.

Technical and data support for this project is provided by Udacity.

Dataset

The data is contained in three files:

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

Problem Statement

Coupons are a means to attract consumer spending through information transmission or discounts. Such methods have been proven effective in multiple industries. However, there are costs involved with these methods: on one hand, there is the cost of marketing, and on the other hand, there is the economic cost resulting from discounts. Therefore, the implementation of coupon policies needs to be precise.

The current coupon policy relies on random distribution, or on simple rules based on the amount spent, and these methods often fall short of the best results. User characteristics are complex, so random or rule-based distribution rarely meets the objective; the coupons themselves vary widely in start time, discount amount, and distribution channel, which makes them hard to coordinate; and the factors influencing transactions are tangled enough that the best policy cannot be read directly from the outcomes.

Therefore, the key question is: how can we predict the effectiveness of giving specified coupons to specified users?

Metrics

Accuracy

Accuracy is the proportion of coupons whose effectiveness the model predicts correctly when given user and coupon information. It serves as the business-facing evaluation metric.

Confusion Matrix

The confusion matrix is a critical tool for evaluating binary classification models: it records true/false positives and negatives, from which precision, recall, F1-score, and support (reported in the classification report) are derived.
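As a quick illustration, here is a minimal scikit-learn sketch showing how both metrics are computed (the label arrays here are hypothetical):

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical true and predicted labels, for illustration only
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))         # proportion of correct predictions
print(confusion_matrix(y_true, y_pred))       # [[TN, FP], [FN, TP]]
print(classification_report(y_true, y_pred))  # precision, recall, F1-score, support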

Section 2: Analysis

Data Exploration

profile.json

# Gender Distribution
M 8484
F 6129
O 212
# Age Distribution
0 - 9 0
10 - 19 205
20 - 29 1369
30 - 39 1526
40 - 49 2309
50 - 59 3541
60 - 69 2991
70 - 79 1782
80 - 89 831
90 - 99 254
100 - 109 17
110+ 2175
# Income Description
count 14825.000000
mean 65404.991568
std 21598.299410
min 30000.000000
25% 49000.000000
50% 64000.000000
75% 80000.000000
max 120000.000000
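These counts can be reproduced with pandas; a minimal sketch (the file path and the 10-year age binning are assumptions) is:

import pandas as pd

# Path as laid out in the Udacity workspace (assumption)
profile = pd.read_json('data/profile.json', orient='records', lines=True)

print(profile['gender'].value_counts())      # gender distribution

age_bins = pd.cut(profile['age'], bins=range(0, 121, 10), right=False)
print(age_bins.value_counts().sort_index())  # age distribution in 10-year buckets

print(profile['income'].describe())          # income summary statistics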

portfolio.json

# Channel Distribution
email 10
mobile 9
web 8
social 6
# Offer_type Distribution
bogo 4
discount 4
informational 2
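Because the channels column holds lists, it needs to be exploded before counting; a minimal sketch (path assumed as above):

import pandas as pd

portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

print(portfolio.explode('channels')['channels'].value_counts())  # channel distribution
print(portfolio['offer_type'].value_counts())                    # offer type distribution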

Merge 3 datasets

This picture shows the merged dataset for the person with id “005500a7188546ff8a767329a2f7c76a”. Each row records one event for this person.
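The merge itself is not shown in the article; a minimal pandas sketch, assuming events are joined to demographics on the person id and to offer metadata on the offer id, could look like this:

import pandas as pd

portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# Pull the offer id out of the nested value dict (the key name differs by event type)
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))

# Join events to customer demographics and to offer metadata
merged = (transcript
          .merge(profile, left_on='person', right_on='id', how='left')
          .merge(portfolio, left_on='offer_id', right_on='id', how='left',
                 suffixes=('_person', '_offer')))

print(merged[merged['person'] == '005500a7188546ff8a767329a2f7c76a'])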

Data Visualization

As can be seen from the figure, the vast majority of users received 3–6 coupons, with 5 coupons being the most common (about 30% of users). This suggests the data is of good quality and not heavily skewed.

This graph depicts the number of successful transactions by gender and age group. The 50–60 age group has the highest transaction volume, and women account for somewhat more transactions than men.

Compared with the figure above, this graph shows only the purchases made using coupons. Across age, the distribution can still be viewed roughly as normal. Across gender, the gap between male and female consumption narrows, and male consumption in the 50–60 and 60–70 age groups even exceeds that of women.

Section 3: Methodology

Data Preprocessing

[576, 0.0, 2.0, 10.0, 1.0, 1.0, 1.0, 0.0, 1.0, 33, 72000.0, 2017, 4, 21, 0]
[168, 0.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0, 0.5, 118, 65404.9915682968, 2018, 4, 25, 0]
[576, 0.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0, 0.5, 118, 65404.9915682968, 2018, 4, 25, 0]
[0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.5, 40, 57000.0, 2018, 1, 9, 0]
[168, 0.0, 3.0, 7.0, 1.0, 1.0, 1.0, 1.0, 0.5, 40, 57000.0, 2018, 1, 9, 1]
[336, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.5, 40, 57000.0, 2018, 1, 9, 0]
[408, 0.0, 5.0, 20.0, 1.0, 1.0, 0.0, 0.0, 0.5, 40, 57000.0, 2018, 1, 9, 1]

This is an example of the preprocessed dataset: the last column is the label and the other columns are features. The rows are first split into a feature matrix X and a label vector y (see the sketch below), and then standardized as follows.
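A minimal sketch of that split, assuming the preprocessed rows are held in a plain Python list:

import numpy as np

# Two of the preprocessed rows shown above; the last column is the 0/1 label
rows = [
    [576, 0.0, 2.0, 10.0, 1.0, 1.0, 1.0, 0.0, 1.0, 33, 72000.0, 2017, 4, 21, 0],
    [168, 0.0, 3.0, 7.0, 1.0, 1.0, 1.0, 1.0, 0.5, 40, 57000.0, 2018, 1, 9, 1],
]

data = np.array(rows)
X = data[:, :-1]   # feature columns
y = data[:, -1]    # label column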

# standardization (imports added for completeness)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# split into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

Implementation & Refinement

I tried MLP, Decision Tree, Random Forest, and SVM classifiers, and used GridSearchCV to optimize their hyperparameters.

from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    'MLPClassifier': {
        'model': MLPClassifier(max_iter=1000),
        'params': {
            'hidden_layer_sizes': [(50,), (100,)],
            'activation': ['tanh', 'relu'],
            'alpha': [0.0001, 0.01]
        }
    },
    'DecisionTreeClassifier': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini', 'entropy'],
            'max_depth': [10, 20, 30]
        }
    },
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [10, 50, 100],
            'criterion': ['gini', 'entropy'],
            'max_depth': [10, 20, 30]
        }
    },
    'SVC': {
        'model': SVC(),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['rbf', 'linear']
        }
    }
}
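The grid-search loop itself is not shown above; a minimal sketch of running GridSearchCV over this dictionary (assuming X_train and y_train from the earlier split, with 5-fold cross-validation) is:

from sklearn.model_selection import GridSearchCV

best_models = {}
for name, cfg in classifiers.items():
    # Cross-validated search over the parameter grid for each classifier
    grid = GridSearchCV(cfg['model'], cfg['params'], cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)
    best_models[name] = grid.best_estimator_
    print(f"{name} - Best Score: {grid.best_score_} - Best Params: {grid.best_params_}")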

The best cross-validation scores and parameters are shown below.

MLPClassifier - Best Score: 0.6816249517825786 - Best Params: {'activation': 'tanh', 'alpha': 0.01, 'hidden_layer_sizes': (50,)}
DecisionTreeClassifier - Best Score: 0.6681774894304027 - Best Params: {'criterion': 'gini', 'max_depth': 10}
RandomForestClassifier - Best Score: 0.6813627426702373 - Best Params: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 50}
SVC - Best Score: 0.67823496166749 - Best Params: {'C': 1, 'kernel': 'rbf'}

Section 4: Results

Model Evaluation and Validation
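The reports below come from evaluating each tuned model on the held-out test set; a minimal sketch (assuming the best_models dictionary from the grid-search sketch above) is:

from sklearn.metrics import classification_report, confusion_matrix

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    print(f"Classification Report for {name}:")
    print(classification_report(y_test, y_pred))
    print(f"Confusion Matrix for {name}:")
    print(confusion_matrix(y_test, y_pred))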

Classification Report for MLPClassifier:
precision recall f1-score support

0 0.70 0.67 0.69 11671
1 0.67 0.70 0.69 11213

accuracy 0.69 22884
macro avg 0.69 0.69 0.69 22884
weighted avg 0.69 0.69 0.69 22884

Confusion Matrix for MLPClassifier:
[[7866 3805]
[3345 7868]]


Classification Report for DecisionTreeClassifier:
precision recall f1-score support

0 0.70 0.64 0.67 11671
1 0.66 0.72 0.68 11213

accuracy 0.68 22884
macro avg 0.68 0.68 0.68 22884
weighted avg 0.68 0.68 0.68 22884

Confusion Matrix for DecisionTreeClassifier:
[[7457 4214]
[3184 8029]]


Classification Report for RandomForestClassifier:
precision recall f1-score support

0 0.71 0.65 0.68 11671
1 0.66 0.72 0.69 11213

accuracy 0.68 22884
macro avg 0.69 0.69 0.68 22884
weighted avg 0.69 0.68 0.68 22884

Confusion Matrix for RandomForestClassifier:
[[7607 4064]
[3149 8064]]


Classification Report for SVC:
precision recall f1-score support

0 0.70 0.66 0.68 11671
1 0.67 0.70 0.69 11213

accuracy 0.68 22884
macro avg 0.68 0.68 0.68 22884
weighted avg 0.68 0.68 0.68 22884

Confusion Matrix for SVC:
[[7739 3932]
[3308 7905]]

Justification

Model Selection Comparison Process

When choosing a model for the business objective of stimulating consumption using coupons, our primary focus is on the recall metric for label 1, which represents the successful use of a coupon. The reason for this focus is to maximize the identification of all potential customers who would respond positively to the coupon incentive, thus reducing the chance of missing out on potential sales.

Among the models above, three stand out as candidates, each with its classification report and confusion matrix. Here’s a brief comparison:

MLPClassifier shows the highest overall accuracy at 69% and shares the highest F1-score for label 1 with the RandomForestClassifier. Its precision and recall are balanced for both classes.

DecisionTreeClassifier has a slightly lower accuracy at 68% but stands out with the highest recall for label 1 at 72%. This indicates it is the best at identifying customers who will use the coupon, albeit at the expense of a higher false positive rate.

RandomForestClassifier also has an accuracy of 68% but has a lower recall for label 1 compared to the DecisionTreeClassifier. It provides a balanced approach but does not excel in the recall for label 1.

Model Conclusion

Given the business requirement is to maximize the identification of customers who will use coupons (label 1), the DecisionTreeClassifier is the recommended model. Its recall of 72% for label 1 is the highest among the models, suggesting that it will be the most effective at capturing potential coupon users. While this may result in a higher number of false positives (customers who won’t use the coupon but are predicted to do so), this is acceptable within our strategic framework, as the cost of misidentifying non-responsive customers is likely lower than the opportunity cost of missing out on responsive ones.

Section 5: Conclusion

Reflection

I used machine learning methods to find the best model and hyperparameter combination for the classification task. The model predicts the consumption behavior of a given user when offered a given type of coupon at a given time; the classification result is simply whether consumption happens. The interesting part of this model is the added time dimension: even if the same offer is given to the same user, the effect is not necessarily the same. In real life, for example, pushing a coupon at 3 a.m. and at 10 a.m. will have different effects. The difficulty lies in the data processing: how to define whether a coupon was effective, and how to process the data accordingly.

Improvement

The next step is to further expand the user features with consumption-behavior features. From three months of consumption records (a total time range of 720 hours), the timeline is divided into 120 time steps, and each time step carries two dimensions: the offer involved and the offer status. For example, at time 0 the second type of offer might be in the “viewed” state. From this a sequence feature matrix is constructed, a feature vector is abstracted from it with an LSTM, and that vector is combined with the user’s own attribute vector to form a new user representation. This is essentially a multimodal modeling approach; a rough sketch is given below.
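This idea is not implemented in the project; as a rough PyTorch sketch (all layer sizes are hypothetical, and the 14 static features correspond to the tabular features used above), the architecture might look like:

import torch
import torch.nn as nn

class MultimodalCouponModel(nn.Module):
    """Hypothetical sketch: LSTM over the 120-step behavior sequence,
    concatenated with the user's static attribute vector."""
    def __init__(self, seq_features=2, static_features=14, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(seq_features, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + static_features, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, seq, static):
        # seq: (batch, 120, 2) -- offer type and offer status per time step
        # static: (batch, 14) -- the existing user/offer attribute features
        _, (h_n, _) = self.lstm(seq)
        combined = torch.cat([h_n[-1], static], dim=1)
        return torch.sigmoid(self.head(combined)).squeeze(1)

# Usage sketch with random tensors
model = MultimodalCouponModel()
seq = torch.randn(4, 120, 2)
static = torch.randn(4, 14)
probs = model(seq, static)   # predicted probability of coupon use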

Acknowledgments

Thanks to Udacity for course support.
