Data Scientist Nanodegree Capstone — Starbucks Project

Nguyễn Minh Hùng
12 min read · Dec 1, 2023

--

Starbucks Coffee

Section 1: Project Definition

Project Overview

This project uses data that simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.

As a simplification, there are no explicit products to track. Only the amounts of each transaction or offer are recorded.

There are three types of offers that can be sent: buy-one-get-one (BOGO), discount, and informational. In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount. In a discount, a user gains a reward equal to a fraction of the amount spent. In an informational offer, there is no reward, but neither is there a requisite amount that the user is expected to spend. Offers can be delivered via multiple channels.

The basic task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present each type of offer.

Data Dictionary

profile.json

Rewards program users (17000 users x 5 fields)

  • gender: (categorical) M, F, O, or null
  • age: (numeric) missing value encoded as 118
  • id: (string/hash)
  • became_member_on: (date) format YYYYMMDD
  • income: (numeric)

portfolio.json

Offers sent during 30-day test period (10 offers x 6 fields)

  • reward: (numeric) money awarded for the amount spent
  • channels: (list) web, email, mobile, social
  • difficulty: (numeric) money required to be spent to receive reward
  • duration: (numeric) time for offer to be open, in days
  • offer_type: (string) bogo, discount, informational
  • id: (string/hash)

transcript.json

Event log (306648 events x 4 fields)

  • person: (string/hash)
  • event: (string) offer received, offer viewed, transaction, offer completed
  • value: (dictionary) different values depending on event type
      ◦ offer id: (string/hash) not associated with any “transaction”
      ◦ amount: (numeric) money spent in “transaction”
      ◦ reward: (numeric) money gained from “offer completed”
  • time: (numeric) hours after start of test
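For reference, here is a minimal sketch of loading the three files with pandas (the `data/` paths are an assumption; the files are line-delimited JSON):

```python
import pandas as pd

# Each file is line-delimited JSON, hence lines=True.
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

print(portfolio.shape)   # expected (10, 6)
print(profile.shape)     # expected (17000, 5)
print(transcript.shape)  # expected (306648, 4)
```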

Problem Statement

The primary objective of this project is to identify key determinants and construct a predictive machine learning model that can anticipate the factors influencing a customer’s completion of an offer, irrespective of the offer category. I am particularly interested in exploring demographic variables, as I believe they have a greater impact compared to the types of offers.

To begin with, we will conduct an Exploratory Data Analysis (EDA) to uncover the data’s patterns and attributes. This initial phase will provide deeper insights into the data and involve data-cleansing procedures. Subsequently, we will address the questions above using visualizations and a variety of machine learning models, trained on a merged dataset comprising the portfolio, profile, and transcript information.

Evaluation Metrics

In this project, the accuracy metric is used as a method of evaluation to assess the performance of the model. Accuracy is a commonly used metric in classification tasks, and it measures the proportion of correctly predicted instances out of the total number of instances in the dataset.

In the provided classification report, the accuracy is reported as 90.83%. This means that the model correctly predicted the outcome for approximately 90.83% of the instances in the dataset.

Accuracy is a suitable choice for evaluation in this project because it provides a comprehensive overview of the model’s overall performance. However, it is important to consider other evaluation metrics such as precision, recall, and F1-score, especially when dealing with imbalanced datasets or when the cost of false positives and false negatives differs significantly.

In this case, the classification report also includes precision, recall, and F1-score for each class (0 and 1), along with their respective support values. These metrics provide insights into the model’s performance for individual classes, allowing for a more detailed assessment of its predictive capabilities.
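Both numbers come straight from scikit-learn; a minimal sketch with illustrative labels (in the project, `y_test` and `y_pred` come from the test split and the trained model):

```python
from sklearn.metrics import accuracy_score, classification_report

# Illustrative labels only; in the project these come from the test split.
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 1]

# Accuracy: proportion of correctly predicted instances.
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')

# Per-class precision, recall, F1-score, and support.
print(classification_report(y_test, y_pred))
```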

Section 2: Analysis

  1. Data Preprocessing

In order to gain a comprehensive understanding of the datasets, it is necessary to conduct an exploratory analysis. This involves various tasks such as examining missing values, visualizing data distributions, and more. During this stage, my objective is to delve into the data and determine which features are significant for supporting the implementation of the model.

By carefully analyzing the data, I aim to uncover insights that can guide the selection of important features. This process involves investigating patterns, relationships, and trends within the dataset, ultimately enabling us to make informed decisions regarding the inclusion of specific features in the model.

Through this exploratory analysis, we can harness the power of the data to inform our feature selection strategy, enhancing the overall effectiveness and performance of the model implementation.

Portfolio dataset

portfolio before preprocessing

Convert the column ‘channels’ into 4 different columns: email, mobile, social and web.

Rename the column ‘id’ to ‘offer_id’.

portfolio after preprocessing
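A minimal pandas sketch of these two steps, continuing from the loading snippet above (not necessarily the exact code used in the project):

```python
# Expand the list-valued 'channels' column into four indicator columns.
for channel in ['email', 'mobile', 'social', 'web']:
    portfolio[channel] = portfolio['channels'].apply(lambda ch: int(channel in ch))
portfolio = portfolio.drop(columns='channels')

# Rename 'id' to 'offer_id' so the merge key is unambiguous later.
portfolio = portfolio.rename(columns={'id': 'offer_id'})
```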

Profile dataset

profile before preprocessing

Change the datatype of the ‘became_member_on’ column from int64 to datetime64.

Change the column name ‘id’ to ‘customer_id’.

profile after preprocessing
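Again as a sketch, continuing from the loaded `profile` frame:

```python
import pandas as pd

# Parse the YYYYMMDD integers into a proper datetime64 column.
profile['became_member_on'] = pd.to_datetime(
    profile['became_member_on'], format='%Y%m%d'
)

# Rename 'id' to 'customer_id' to match the transcript data.
profile = profile.rename(columns={'id': 'customer_id'})
```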

Transcript dataset

transcript before preprocessing

Replace the spaces in the ‘event’ column with ‘-’.

Change the column name from ‘person’ to ‘customer_id’.

Convert the column ‘event’ into 4 different columns: offer-completed, offer-received, offer-viewed, transaction.

Convert the column ‘value’ into 2 different columns: offer_id and amount.

transcript after preprocessing
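A sketch of these transcript steps. One wrinkle worth noting: the `value` dictionaries use the key ‘offer id’ for received/viewed events and ‘offer_id’ for completed ones, so both are checked:

```python
# Normalize event names and align the customer key with the profile data.
transcript['event'] = transcript['event'].str.replace(' ', '-')
transcript = transcript.rename(columns={'person': 'customer_id'})

# One indicator column per event type: offer-completed, offer-received,
# offer-viewed, transaction.
transcript = pd.concat([transcript, pd.get_dummies(transcript['event'])], axis=1)

# Pull 'offer_id' and 'amount' out of the 'value' dictionaries.
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id'))
)
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
transcript = transcript.drop(columns='value')
```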

Merge dataset

After preprocessing the 3 datasets, we can merge them into one.

Swap the key-value pairs in the mapping used for the ‘offer_id’ column, so each offer hash maps to a short integer label.

Swap the key-value pairs in the mapping used for the ‘event’ columns in the same way.

Then export the dataset to a data.csv file, ready for the next step.
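A sketch of the merge and export, assuming the variable names from the earlier snippets:

```python
# Left-join the customer profile and the offer metadata onto the event log.
data = (
    transcript
    .merge(profile, on='customer_id', how='left')
    .merge(portfolio, on='offer_id', how='left')
)

# Persist the merged dataset for the modeling step.
data.to_csv('data.csv', index=False)
```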

2. Data Exploration & Data Visualization

Gender count: from the bar graph, we can see that males make up more than 50% of all users. Over 150,000 of the collected records are from male customers, while about 110,000 are from female customers, and a few customers identified as Other.

Gender count

Income distribution: the income distribution is slightly right-skewed. The highest counts fall around $50,000–$70,000, while high incomes (above $110,000) account for only a small share.

Income distribution

Age distribution: we can see outliers in the `age` data: a cluster of people whose age is recorded as 118, the encoding for missing values. Most customers are concentrated around 50–70 years old, and aside from the outliers the age distribution looks roughly normal.

Age distribution

Offer_type: for this part, I used the Plotly library to make the chart more attractive and easier to follow.
- The percentage of `Bogo` offers viewed is 83.44% (30499 users received the offer and 25449 viewed it)
- The percentage of `Bogo` offers completed is 51.47% (30499 users received the offer and 15699 completed it)
- The percentage of `Discount` offers viewed is 70.21% (30543 users received the offer and 21445 viewed it)
- The percentage of `Discount` offers completed is 58.64% (30543 users received the offer and 17910 completed it)

Different types of offers received by the users

Offer_id:
- We notice that the number of `offer_received` events is fairly uniform across `offer_id`s.

- The highest numbers of `offer_viewed` events are for `offer_id` 4, 8, 9, and 10; the lowest are for `offer_id` 0, 5, 6, and 7.

- The highest numbers of `offer_completed` events are for `offer_id` 8 and 10; lower are `offer_id` 3, 5, 7, and 9.

Event counts by offer_id for the different types of offers

Offer received by users:

  • BOGO Offer Received by User: receiving 2–6 BOGO offers is the most common range; the single most common count is 3 offers received (2892 users).

BOGO Offer Received by User

  • Informational Offer Received by User: receiving 0–2 informational offers is the most common range; the single most common count is 2 offers received (5131 users).

Informational Offer Received by User

  • Discount Offer Completed by User: completing 2–6 discount offers is the most common range; the single most common count is 2 offers completed (3188 users).

Discount Offer Completed by User

Section 3: Methodology

  1. Data Preprocessing (II)

The second preprocessing part aims to get the data ready for the modeling step. I perform a few cleaning steps in this part:

  • One-hot encode the ‘gender’ and ‘offer_type’ columns.
  • Drop all null values, and fill N/A values in the ‘income’ column with 0, since a completed offer can’t be $0.
  • Create a subset for data modeling, choosing all the variables that I think affect a customer’s decision to complete an offer.
  • After that, use MinMaxScaler to normalize the range of the independent variables/features (see the sketch below).
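A minimal sketch of these cleaning steps; the exact feature subset below is an assumption for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('data.csv')

# One-hot encode the categorical columns.
data = pd.get_dummies(data, columns=['gender', 'offer_type'])

# Fill missing income with 0, then drop the remaining nulls.
data['income'] = data['income'].fillna(0)
data = data.dropna()

# Normalize the selected numeric features into the [0, 1] range.
feature_cols = ['age', 'income', 'reward', 'difficulty', 'duration', 'time']
data[feature_cols] = MinMaxScaler().fit_transform(data[feature_cols])
```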

2. Data Modeling

After the data is prepared for modeling, I will use several algorithms to build the model: LogisticRegression, DecisionTree, RandomForest, KNeighbors, AdaBoost, and XGBoost.
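A sketch of the comparison loop, under the assumption that the features and the `offer-completed` target come from the preprocessed data above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

data = pd.read_csv('data.csv')
feature_cols = ['age', 'income', 'reward', 'difficulty', 'duration', 'time']  # assumption
X = data[feature_cols]
y = data['offer-completed']

# Hold out a test set; the 80/20 split ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'KNeighbors': KNeighborsClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier(),
}

# Fit each model and report its test-set accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}')
```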

Here we can see that AdaBoost and XGBClassifier have the highest accuracy, with 89.26% and 90.53% respectively. So I will focus on the XGBClassifier and tune this model to get higher accuracy.

3. Refinement

After selecting the XGBClassifier as our model, the next step is to perform model tuning to improve its performance. To achieve this, we will utilize the GridSearchCV technique along with a parameter grid specifically designed for the XGBClassifier.

The objective of model tuning is to identify the optimal values for the hyperparameters of the XGBClassifier. To accomplish this, we employ an exhaustive grid search approach, where we systematically try out every possible combination of hyperparameter values. Additionally, we incorporate cross-validation into the process to ensure robustness and prevent overfitting. In this case, we have chosen a cross-validation value of 5.

While the exhaustive grid search is an effective method for determining the best hyperparameters, it is important to note that as the number of parameters and cross-validations increases, the search process can become time-consuming. Therefore, it is crucial to strike a balance between the level of granularity in parameter exploration and the computational resources available, to ensure an efficient and effective tuning process.
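A sketch of the search, reusing `X_train`/`y_train` from the comparison step; the grid values are assumptions, except that they include the parameters the search ultimately selected:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid: 3 x 3 x 3 = 27 combinations x 5 folds = 135 fits.
param_grid = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 500, 1000],
}

grid = GridSearchCV(
    XGBClassifier(),
    param_grid,
    scoring='accuracy',
    cv=5,        # 5-fold cross-validation, as described above
    n_jobs=-1,   # parallelize; the exhaustive search is expensive
)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best score:', grid.best_score_)
```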

Best parameters: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 1000}
Best score: 0.9089104488742545

Section 4: Results

Model Evaluation after tuning

The results after I tuned the model:

The accuracy improved from 90.53% to 90.83%. While the improvement isn’t large, the XGBClassifier model is still the most accurate one we have in terms of accuracy.

Feature Importance

I used the XGBClassifier model to find the importance of each feature. Feature importance refers to techniques that calculate a score for every input feature of a given model. The result of my model:

Based on the result, we can see that the features `reward` and `time` are the most important ones (83.42% and 12.14%).
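A sketch of how the scores can be read off the tuned model from the grid search above:

```python
import pandas as pd

# feature_importances_ holds one score per input feature.
best_model = grid.best_estimator_
importances = pd.Series(
    best_model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)

print(importances)  # in this project's run, 'reward' and 'time' dominate
```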

Justification

I put together a model-comparison section to compare the models against each other.

Based on the comparison table above, the XGBClassifier and AdaBoost appear to be the top-performing models in terms of accuracy. Here are the reasons why these two models are considered among the best:

1. XGBClassifier: The XGBClassifier is known for its exceptional performance in various machine learning tasks. It utilizes an ensemble of decision trees and employs a gradient boosting framework to optimize model performance. Some reasons for its effectiveness include:

- High accuracy: The XGBClassifier achieved an accuracy of 0.908345, which is the highest among the models I evaluated.
- Robustness to outliers: XGBoost is designed to handle outliers and missing values effectively, making it a reliable choice for datasets with such characteristics.
- Regularization techniques: XGBoost incorporates regularization techniques such as L1 and L2 regularization, which help prevent overfitting and improve generalization ability.
- Feature importance: XGBoost provides insights into feature importance, allowing for better understanding of the underlying patterns and relationships in the data.

2. AdaBoost: AdaBoost, short for Adaptive Boosting, is another popular boosting algorithm that combines multiple weak classifiers to create a strong ensemble model. Here’s why AdaBoost stands out:

- High accuracy: The AdaBoost model achieved an accuracy of 0.892642, which is among the top performers in the comparison.
- Robustness to noise: AdaBoost is known for its capability to handle noisy data and outliers effectively, leading to improved model performance.
- Sequential learning: AdaBoost assigns higher weights to misclassified instances, focusing on the more challenging data points during subsequent iterations. This enhances the model’s ability to learn from difficult examples.
- Versatility: AdaBoost is suitable for various classification tasks and performs well across different domains.

Both XGBClassifier and AdaBoost are popular choices for solving classification problems. XGBClassifier is often favored for its high accuracy, speed, and ability to handle complex datasets. AdaBoost, on the other hand, is known for its robustness to noise and versatility in handling different types of data. Ultimately, the choice between these models depends on the specific requirements of the problem at hand and the characteristics of the dataset.

Section 5: Conclusion

Reflection

My solution to this scenario was split into 3 steps:
- First, I preprocessed the data.
- Second, I performed EDA to understand the data and what it means.
- Finally, I used the cleaned data to build the models, chose the best one to tune, and then evaluated it.

Improvement

Here are some points that I think I can consider for improving my model:

  • Feature Engineering: Explore and analyze the existing features in my dataset to identify potential transformations, interactions, or combinations that could enhance the predictive power of my model. Feature engineering techniques such as scaling, encoding categorical variables, creating new features, or extracting meaningful information from existing features can often lead to improved performance.
  • Hyperparameter Tuning: Fine-tune the hyperparameters of my model to optimize its performance. This can be achieved through techniques like grid search, random search, or Bayesian optimization. Adjusting parameters such as learning rate, regularization strength, maximum depth, or number of estimators can significantly impact the model’s accuracy and generalization ability.
  • Model Evaluation Metrics: Consider using additional evaluation metrics beyond accuracy, such as precision, recall, F1-score, or area under the ROC curve (AUC-ROC). These metrics provide a more comprehensive understanding of the model’s performance, especially in cases where the dataset is imbalanced or the cost of false positives and false negatives differs significantly.

The specific techniques and approaches for improving my model may vary depending on the nature of my dataset and the problem I’m solving. It is essential to experiment, iterate, and continuously evaluate the impact of each improvement to find the most effective strategies for enhancing my model’s performance.
