Starbucks Capstone Project

Iris Yao
Nov 6 · 9 min read

This is the capstone project of the Udacity Data Scientist Nanodegree. The data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. A customer who receives an offer is rewarded when their cumulative spending reaches the offer's threshold. The data set includes three files:

  • portfolio — metadata for each offer type
  • profile — customer demographic profiles
  • transcript — event records of when a person received, viewed, or completed an offer, plus their transactions
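These files can be loaded with pandas. The sketch below assumes the usual layout for this capstone, where each file is JSON-lines (one record per line); a tiny inline record stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# With the real files, loading would look like (paths are assumptions):
# portfolio  = pd.read_json("portfolio.json", orient="records", lines=True)
# profile    = pd.read_json("profile.json", orient="records", lines=True)
# transcript = pd.read_json("transcript.json", orient="records", lines=True)

# Tiny stand-in record so the snippet runs without the real files:
sample = io.StringIO(
    '{"id": "abc", "offer_type": "bogo", "difficulty": 10, "reward": 10, "duration": 7}\n'
)
portfolio = pd.read_json(sample, orient="records", lines=True)
print(portfolio.columns.tolist())
```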

I started by cleaning the dataset, then answered several business questions about user demographics, and finally built a machine learning model to predict whether an offer would be completed.

Problem Statement

For each offer type, BOGO (buy one get one free) and Discount, I tried to answer the questions below:

  • Which demographics are more likely to complete it?
  • Which demographics spend the most?
  • Can we predict who will view and complete the offer?

Solution

To answer the first two questions, I split the offer data into four behavior types:

  1. viewed and completed the offer within the required duration
  2. viewed but did not complete the offer within the required duration
  3. did not view but still completed the offer
  4. neither viewed nor completed the offer

Then I joined these behavior types with the cleaned profile and portfolio data to see how demographics differ across consumption behaviors.

The following steps build the model that answers question 3:

  1. Gather offer and client data
  2. Clean and transform the data (preprocess for machine learning)
  3. Train a classifier to predict which offer will be viewed and completed.

Data Exploration and Cleaning

1. Portfolio Data

  • id (string) — offer id
  • offer_type (string) — type of offer ie BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time the offer remains open, in days
  • channels (list of strings) — e.g. web, email, mobile, social

There are three types of offers, BOGO, discount, and informational, sent to customers through channels such as mobile, email, social, and web. Each offer has a different difficulty and reward. For example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer; if the customer then accumulates at least 10 dollars in purchases, the offer is completed and the reward is given. Informational offers are slightly different: they carry no reward, so no "offer completed" records appear for them in the transcript file.

2. Profile Data

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

The profile file contains 17,000 customers whose personal information was recorded when they signed up for the app. Some of it is inaccurate: 2,175 people are listed as 118 years old with no gender or income information, so I dropped these rows.
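A minimal sketch of this cleaning step on stand-in data: in the raw profile data, the rows with missing gender and income all carry the placeholder age 118, so filtering on age removes all three problems at once.

```python
import numpy as np
import pandas as pd

# Stand-in profile data mirroring the schema above; 118 is the placeholder
# age that marks rows with missing gender and income.
profile = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "age": [35, 118, 52, 118],
    "gender": ["F", None, "M", None],
    "income": [72000.0, np.nan, 58000.0, np.nan],
})

# Dropping the placeholder-age rows also drops every missing gender/income.
clean = profile[profile["age"] != 118].reset_index(drop=True)
print(len(clean))  # 2 rows remain
```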

After cleaning, the 14,825 remaining customers are 41.3% female, 57.2% male, and 1.4% other genders. Over 60% are between 45 and 80 years old, and their average income is around 65,000 per year.

I then grouped members by age and income. The two charts below show that 80% of Starbucks customers are over 45 years old, about 75% have more than 50k in annual income, and 24% earn more than 80k.
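One way to build such age groups is with pandas binning; the bin edges below are illustrative placeholders, not the exact ones used in the charts.

```python
import pandas as pd

# Illustrative age bins; pd.cut assigns each age to its bracket label.
ages = pd.Series([22, 37, 51, 64, 79])
age_group = pd.cut(
    ages,
    bins=[17, 30, 45, 60, 80, 101],
    labels=["18-30", "31-45", "46-60", "61-80", "81+"],
)
print(age_group.tolist())  # ['18-30', '31-45', '46-60', '61-80', '61-80']
```

The same `pd.cut` call with income thresholds (e.g. 50k, 80k) yields the income groups.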

3. Transcript Data

  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

The 306,534 records in this file come from the app. Each person may receive each type of offer several times. We capture four different consumption types based on action time:

  1. viewed and completed the offer within the required duration
  2. viewed but did not complete the offer within the required duration
  3. did not view but still completed the offer
  4. neither viewed nor completed the offer

I split the data into three event types: offer received, offer viewed, and offer completed, then aggregated by offer and person id according to action time.

Eventually, I tagged 70,158 independent offer records:

  • viewed and completed offers: 34.0%
  • viewed but not completed offers: 41.1%
  • didn’t view but still completed offers: 7.2%
  • neither viewed nor completed offers: 15.4%

All offer records: 70,158
viewed and completed in time: 23,864 (34.0%)
viewed but didn't complete: 28,859 (41.1%)
didn't view but completed: 5,049 (7.2%)
neither viewed nor completed: 10,780 (15.4%)
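The tagging rule above can be sketched as a small function. This is an illustration of the logic, not the author's exact implementation: `viewed_at`/`completed_at` are `None` when the event never happened, and `expires_at` is the receipt time plus the offer duration (in hours).

```python
# Classify one delivered offer into the four behavior types.
def tag_offer(viewed_at, completed_at, expires_at):
    viewed = viewed_at is not None and viewed_at <= expires_at
    completed = completed_at is not None and completed_at <= expires_at
    if viewed and completed:
        return "viewed_and_completed"
    if viewed:
        return "viewed_not_completed"
    if completed:
        return "completed_not_viewed"
    return "neither"

print(tag_offer(6, 60, 168))    # viewed_and_completed
print(tag_offer(6, None, 168))  # viewed_not_completed
```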

Business Questions about the Type 1 and Type 3 Groups

In my analysis, I focused on income and age for the Type 1 and Type 3 groups, because they represent the most valuable clients Starbucks can identify within the limited time window.

  1. Among Type 1 (viewed and completed the offer), what are the demographics?

Income:

The charts below show that people with higher income (>65k) are more likely to view and complete offers, and they show no strong preference between BOGO and Discount offers.

Lower-income (30–50k) groups are less likely to complete offers, and they prefer Discount over BOGO.

Age:

The younger generation (18–30) is less likely to view and complete offers, while the >46 groups are more likely to do so. There is no obvious difference in offer-type preference across age groups.

2. Among Type 3 (didn’t view but still completed the offer), what are the demographics?

This group may be the most valuable to Starbucks because of their demonstrated purchase intention and ability.

Income:

From the left chart below, people with less than 65k annual income are more likely to complete offers without viewing them. Unfortunately, these purchases were not influenced by the offer, since the customers never saw it.

Age:

The difference by age is not obvious, but younger groups (<45) are somewhat more likely to purchase without viewing offers.

Modeling

  • Prepare the data for modeling: one-hot encoding and normalization.
  • Split the data set into three parts.
  • Apply Logistic Regression, Random Forest, and AdaBoost classifiers.

I used the eligible offers table (70,158 rows) for modeling.

Prepare Data:

Normalize columns

Scaling attributes to a specified range (usually 0–1) can be done with the sklearn.preprocessing.MinMaxScaler class.

Using this method has two purposes:

1. It stabilizes attributes with very small variance.

2. It preserves zero entries in sparse matrices.

Algorithm & Techniques

With preparations complete, we can start training models. First, we need to choose an algorithm. To select the most suitable model, we run a test: train multiple ML models on three data sets, then compare each model's training speed and average accuracy. To refine the results, we then train the selected model on the whole data set and analyze its performance.

Here we use SelectKBest for feature selection, to understand which features influence offer completion, and train the data with three algorithms:

Logistic Regression

RandomForestClassifier

AdaboostClassifier

I chose to use a Pipeline for training. A Pipeline implements a streaming workflow over all steps, which makes it easy to reuse a parameter set on new data, such as a test set.

A Pipeline chains many steps in series, such as feature extraction, normalization, and classification, to form a typical machine learning workflow. There are two main benefits:

1. Calling fit and predict once trains and predicts through every step in the pipeline.

2. Parameters of all steps can be tuned together with grid search.

In short, pipelines are very useful and help prevent data leakage. We can easily use one to combine feature selection, data splitting, grid search, prediction, cross-validation, and model evaluation. Let’s build it.

  • Build a pipeline for feature selection, training, prediction, timing, scoring, and cross-validation.
  • Save the accuracy, F1 score, and elapsed time of each model to a dictionary.
[Code screenshots: pipeline, logistic regression, random forest, AdaBoost classifier]
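The pipeline-plus-grid-search pattern described above can be sketched as follows. The synthetic data, parameter grid, and step names here are illustrative placeholders, not the project's actual values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 200 samples, 6 features, labels driven by 2 of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Feature selection chained into a classifier; grid search tunes both steps
# via the "<step>__<param>" naming convention.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {"select__k": [2, 4], "clf__n_estimators": [50, 100]}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Swapping `RandomForestClassifier` for `LogisticRegression` or `AdaBoostClassifier` reuses the same pipeline unchanged, which is what makes the three-model comparison easy.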

Metrics

Accuracy — accuracy is the most intuitive performance metric. It is simply the ratio of correctly predicted observations to total observations. One might think that high accuracy means the model is the best. Accuracy is indeed a good measure, but only when the data set is symmetric and false positives and false negatives are roughly balanced. Otherwise, you must look at other metrics to evaluate the model.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

F1 score — the F1 score is the harmonic mean of precision and recall, so both false positives and false negatives are taken into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful, especially when classes are unevenly distributed.

F1 score = 2 * (precision * recall) / (precision + recall)

So I combined the two metrics to judge the models.
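A quick check of both metrics on a toy prediction vector shows how they diverge on imbalanced labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 3 positives out of 8; the model misses one positive and
# raises one false alarm (TP=2, FP=1, FN=1, TN=4).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)  # (2 + 4) / 8 = 0.75
f1 = f1_score(y_true, y_pred)         # 2 * (2/3 * 2/3) / (2/3 + 2/3) = 2/3
print(acc, f1)
```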

Results

Model Evaluation and Validation

From the result table above, all three models achieve accuracy above 80%. AdaBoostClassifier performs slightly better than the other two, but its processing time is five times longer than Random Forest's. Although AdaBoost did not take too long here, applying it to large datasets in a real industrial setting would place a significant burden on daily data science and analytics work. So in the end, I think RandomForestClassifier is the best choice.

Refinement

After training Random Forest on the whole eligible-offer dataset, we get the result below:

After retraining, the random forest's accuracy is around 78% with an F1 score of 65%, and the model runs very fast, so I consider it a good choice.

Conclusion:

In the business analytics part, older (>46) and higher-income (>65k) customers are more likely to view and complete offers. Lower-income (30–50k) groups are less likely to complete offers, and they prefer Discount over BOGO. Younger (<45) and lower-income (<65k) groups are more likely to purchase without viewing offers.

In the machine learning part, RandomForestClassifier is the best choice because it has not only decent accuracy but also remarkable processing speed.

Improvement:

As for further improvement, I think the model can be refined in two aspects:

  1. Further investigation of metrics and thorough feature engineering. For example, we could add the hour an offer was received, season information, and so on.

  2. Further investigation of channel information: why didn’t people view offers, and is there a better way to reach them?
