How much do you spend at Starbucks?

Gautam Bhatheja
Published in The Startup · 12 min read · Sep 22, 2019

Do you like Starbucks? How much do you spend there? Do you use the offers Starbucks sends you?

I was intrigued by these questions, so I decided to explore them myself. Mainly, I was interested in how much people spend at Starbucks, so I decided to build a model that predicts the amount a customer will purchase.

Metric Used:

I will be using Root Mean Squared Error (RMSE) as the metric to evaluate my model's performance.

Why did I choose RMSE as the metric?

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.

Some advantages in using RMSE:

1) It is simple and intuitive to understand, even for a layman, as it has the same units as the target value. It simply tells how far our predictions are from the actual values.

2) It is easy and not computationally expensive to calculate.

3) Since the errors are squared, positive and negative errors do not cancel out.

4) Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors compared to small errors.

5) RMSE is the default metric of many models because a loss function defined in terms of squared error is smoothly differentiable, which makes mathematical operations such as gradient computation easier.
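As a quick illustration, RMSE is only a few lines of code (a minimal sketch using NumPy):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Residuals are -2, 2 and -3, so RMSE = sqrt((4 + 4 + 9) / 3)
print(rmse([10, 20, 30], [12, 18, 33]))
```

Note that the result is in the same units as the target, which is the first advantage listed above.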

I looked at some other options like Mean Absolute Error (MAE), Mean Bias Error (MBE), Mean Squared Error(MSE), R-Squared; but found RMSE better suited to my purpose because:

1) Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. This means MAE gives equal weight to large and small errors. RMSE is therefore more useful when large errors are particularly undesirable, which was my case. Also, MAE relies on the absolute value, which is inconvenient in many mathematical calculations.

2) Mean Bias Error (MBE): If the absolute value is not taken in MAE (the signs of the errors are not removed), the average error becomes the Mean Bias Error (MBE), which is usually intended to measure average model bias. MBE can convey useful information, but it should be interpreted cautiously because positive and negative errors cancel out, which is not desirable for our purpose.

3) Mean Squared Error (MSE): RMSE is just the square root of MSE. MSE thus does not have the same units as the target value, but their square, so I did not use it.

4) R-squared: R-squared says nothing about prediction error. We cannot use R-squared to determine whether the coefficient estimates and predictions are biased, which is why one must assess the residual plots. R-squared also does not necessarily indicate whether a regression model provides an adequate fit to your data: a good model can have a low R-squared value, while a biased model can have a high one. Therefore, I did not use R-squared as the metric.

You can find the full code for this project on my GitHub profile at the link below:

So, let’s get started…

For this project I am going to follow the CRISP-DM process, which has the following stages:
1) Business Understanding
2) Data Understanding
3) Prepare Data
4) Data Modeling
5) Evaluate the Results
6) Deploy

1) Business Understanding

Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Also, not all users receive the same offer, and that is the challenge to solve.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. Informational offers also have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, it is assumed that the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

2) Data Understanding

The data I will be using is available from Udacity Starbucks Capstone Project. It contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

The data is contained in three data sets:

  1. portfolio — containing offer ids and meta data about each offer (duration, type, etc.)
  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

2. profile — demographic data for each customer

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

3. transcript — records for transactions, offers received, offers viewed, and offers completed

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

I performed Exploratory Data Analysis (EDA) to understand the data better and got the following results:

  1. Distribution of customer ages: Here was an interesting finding. So many people aged 118! This means either Starbucks has found the Fountain of Youth or there is an error in the data. After some research and exploration I found that 118 is the code for missing values. We will deal with it later.

2. Income level of customers: People from all income groups drink at Starbucks. Good for Starbucks!

3. Membership date of customers: 2017 was a good year for Starbucks membership program it seems.

3) Prepare Data

This is the most difficult and time consuming part of any data science project. And this project was no different.

First, we will be cleaning and pre-processing the data; then we will combine transaction, demographic and offer data; after that some more cleaning, aggregating and feature engineering; and finally feed the data to the model.

  1. For portfolio data:
  • We make separate columns for each channel (web, email, mobile, social). The column has value 1 if the offer runs on that channel, otherwise it is 0.

Cleaning function:
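The actual function is in the notebook on GitHub; a minimal sketch of this step, assuming pandas and the column names listed above, might look like:

```python
import pandas as pd

def clean_portfolio(portfolio):
    """One-hot encode the channels list into web/email/mobile/social columns."""
    portfolio = portfolio.copy()
    for channel in ['web', 'email', 'mobile', 'social']:
        # 1 if the offer runs on this channel, otherwise 0
        portfolio[channel] = portfolio['channels'].apply(
            lambda chans: 1 if channel in chans else 0)
    return portfolio.drop(columns=['channels'])

# Tiny illustrative example
demo = pd.DataFrame({'id': ['offer_1'], 'channels': [['web', 'email']]})
print(clean_portfolio(demo))
```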

Resultant Data:

2. For profile data:

  • As noted earlier, 118 is the code for missing values in the age column, so I deleted those rows.
  • I also created a new column, “memberdays”, i.e. the number of days the user has been a member of Starbucks. This is more intuitive than the “became_member_on” column.

Cleaning function:
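A minimal sketch of this step, again assuming pandas (here “memberdays” is measured up to today; the actual notebook may anchor it to the data’s end date instead):

```python
import pandas as pd

def clean_profile(profile):
    """Drop rows where age == 118 (the missing-value code) and add memberdays."""
    profile = profile[profile['age'] != 118].copy()
    # became_member_on is an int like 20170715, so parse it as a YYYYMMDD date
    joined = pd.to_datetime(profile['became_member_on'].astype(str),
                            format='%Y%m%d')
    profile['memberdays'] = (pd.Timestamp.today().normalize() - joined).dt.days
    return profile
```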

Resultant Data:

3. For transcript data:

  • The “value” column holds {‘amount’, ‘offer id’, ‘offer_id’, ‘reward’} entries. In the data, ‘offer id’ and ‘offer_id’ appear as separate keys, but we know they are one and the same thing. I made an “offer_id” column combining the ‘offer id’ and ‘offer_id’ values; it contains the respective offer id if present, otherwise NaN.
  • I made two separate columns, “amount” and “reward”, from the “value” column. These columns contain their respective amounts if present, otherwise NaN.

Cleaning function:
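A minimal sketch of this step, assuming pandas and that each “value” entry is a Python dict:

```python
import pandas as pd

def clean_transcript(transcript):
    """Split the 'value' dict column into offer_id, amount and reward columns."""
    transcript = transcript.copy()
    # 'offer id' and 'offer_id' are the same thing under two keys; merge them
    transcript['offer_id'] = transcript['value'].apply(
        lambda v: v.get('offer id', v.get('offer_id')))
    transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
    transcript['reward'] = transcript['value'].apply(lambda v: v.get('reward'))
    return transcript.drop(columns=['value']).drop_duplicates()
```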

Resultant Data:

I also removed the duplicates present in the transcript data.

After this, I merged the three data frames into a single consolidated data frame.

Finally, I aggregated the data so that each customer has a single row, with that customer’s features as the columns.
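A hedged sketch of the merge-and-aggregate step, assuming pandas; the aggregated features shown (total spend plus two demographics) are illustrative, as the actual notebook builds more:

```python
import pandas as pd

def merge_and_aggregate(transcript, profile, portfolio):
    """Join the three frames, then collapse to one row per customer."""
    df = (transcript
          .merge(profile, left_on='person', right_on='id', how='inner')
          .merge(portfolio, left_on='offer_id', right_on='id',
                 how='left', suffixes=('', '_offer')))
    # One row per customer: total amount spent plus their demographics
    return (df.groupby('person')
              .agg(total_amount=('amount', 'sum'),
                   age=('age', 'first'),
                   income=('income', 'first'))
              .reset_index())
```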

This whole data preparation process required a lot of exploration and experimentation, and accounted for a large share of the time needed to complete the project.

4) Data Modeling

In this stage we will train our model to predict the amount of product a customer will purchase.

I did the following steps in this stage:

  1. Created new features ‘year_joined’ and ‘month_joined’ from the ‘became_member_on’ column, and dropped the ‘became_member_on’ column.
  2. One hot encoded the gender column.
  3. Split the data into features and a target label, and then into training and testing data sets. I made a function for this.
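The split function from step 3 might look like this (a sketch using scikit-learn; the target column name ‘total_amount’ is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df, target='total_amount', test_size=0.2, seed=42):
    """Separate features from the target, then split into train/test sets."""
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed)

# Example on a tiny frame: 80% train, 20% test
demo = pd.DataFrame({'total_amount': range(10), 'age': range(10)})
X_train, X_test, y_train, y_test = split_data(demo)
```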

4. Made a baseline model using the Random Forest algorithm with default parameters. Using Root Mean Squared Error (RMSE) as the metric, I got the following results:

Baseline Random Forest model performance:
Train RMSE error: 43.40822633356865,
Test RMSE error: 100.12625339540412

5. Made a pipeline with MinMaxScaler() and the XGBoost regressor as our model.

6. Used GridSearchCV to tune the hyperparameters of our model for the best results. With the same RMSE metric, I got the following results:

Tuned XGBoost model performance:
Train RMSE error: 86.93099660022715,
Test RMSE error: 84.0456223669343

Discussion about the final predictor (model):

  1. I made a pipeline of MinMaxScaler() and the XGBoost regressor. I fitted this pipeline on the training data and then used it to make final predictions on both the testing and training data. Finally, I calculated the RMSE of my predictions. The testing error decreased compared to our baseline model.
  2. Making a pipeline makes our model really robust and easy to use. It has the following advantages:
    i) We can fit and predict on the pipeline, so scaling and the final model prediction happen in one step, instead of first scaling the data and then using the model to make predictions.
    ii) When MinMaxScaler() is used inside a pipeline, it is fitted on the training data and transforms it, but it only transforms (and does not refit on) the test data at prediction time. On one hand, we don’t have to call fit and transform on the train data and transform on the test data separately, which saves effort; on the other hand, it prevents information from the test data leaking into training when we use GridSearchCV() to find the best parameters. This makes our model robust to information-leakage errors.
  3. I used MinMaxScaler() to scale my data into the (0, 1) range so that all features share the same scale and do not adversely affect the model. Features on very different scales can severely degrade model performance; with scaling in place, the model handles small changes in the data set well.
  4. The estimator in my pipeline is the XGBoost regressor, an implementation of gradient boosted decision trees designed for speed and performance. It has several advantages that make it very robust:
    i) Regularization: XGBoost has built-in L1 (lasso) and L2 (ridge) regularization, which helps prevent the model from overfitting.
    ii) Handling missing values: XGBoost handles missing values natively. When it encounters a missing value at a node, it tries both the left and right split and learns which direction yields the lower loss for that node; it then applies the learned direction to the test data.
    iii) Effective tree pruning: A GBM stops splitting a node when it encounters a negative loss reduction, making it a greedier algorithm. XGBoost, on the other hand, makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain.
    iv) Parallel processing: XGBoost utilizes parallel processing across multiple CPU cores, which is why it is much faster than GBM.
  5. I used GridSearchCV() to find the best parameters for my model, with its default 3-fold cross-validation. This makes my pipeline easily adjustable to changes in the data, since the model is re-optimized for whatever data it receives. Combined with the pipeline, this makes the whole process robust.
  6. The robustness of the model can be evaluated by observing that the testing RMSE decreased compared to our baseline results.

Parameters that were fine tuned:

I fine-tuned the following parameters of the XGBoost regressor:

  • n_estimators in the range [50, 75, 100, 125]
  • max_depth in the range [3, 4, 5]
  • learning_rate in the range [0.03, 0.06, 0.1]

5) Evaluate the Results

From the model performance above we can see that our tuned model performs better than our baseline, which shows that the fine-tuning works well.

Baseline Random Forest model performance:

  • Train RMSE error: 43.40822633356865
  • Test RMSE error: 100.12625339540412

Tuned XGBoost model performance:

  • Train RMSE error: 86.93099660022715
  • Test RMSE error: 84.0456223669343

Though the train error has increased in the tuned model, the test error has decreased. This indicates that our baseline model was overfitting to some extent, while our tuned model is not.

Justification of the full process and steps followed:

  • In this project I followed the CRISP-DM process, which is a process you can follow to solve any data science problem. It gives you a proper framework; otherwise you may get confused about what steps to take next.
  • I made functions where required, following the DRY principle, and reused them. I suggest doing the same, as it saves a lot of time.
  • I first explored the data and along the way found some problems with it. It was unclean, with many missing values.
  • I found age = 118 in the data, and a lot of rows with it. At first I was astonished, but after thinking and researching a bit I found that it was the code for missing values.
  • After cleaning the data, I had to combine the three data frames, which was a bit difficult because of the different ways the data was stored. It took some thinking and experimentation to find the right way to do it.
  • After merging the data, I aggregated and processed it further, and created some features that were more intuitive.
  • Once the data was ready to be fed into the model, I split it into training and testing sets, as is standard practice in any data science project.
  • A baseline prediction result was obtained, after which I improved on it by building a pipeline and using GridSearchCV to tune the model’s hyperparameters.
  • Finally, I obtained the prediction results on the training and test data listed above.

Reflection:

  • It was an interesting and relevant problem. I enjoyed the full CRISP-DM process; following it step by step really helped me proceed in the right direction.
  • I found merging and consolidating the three data sets in the right manner quite challenging; it required a lot of experimenting and exploring. It is rightly said that data cleaning and pre-processing form the most important part of any data science project.

Improvement Scope:

  • One aspect of the implementation that could be improved is the feature engineering part of the project.
  • If more relevant features could be created, by collecting extra data or applying intuitive business domain knowledge, the prediction error could be reduced further.
  • It would also have led to some additional interesting findings about the customers.

6) Deploy

The last stage of the process is this blog, where I have showcased my findings. I hope you enjoyed reading it!
