Recommender System for Starbucks

Starbucks Capstone

Michael Khumalo
13 min read · Jan 9, 2022
Image from stories.starbucks.com

Project Definition

Project Overview

As part of the Data Scientist Nanodegree at Udacity, my capstone project is building a Recommender System using the Starbucks dataset (provided as part of the program). The Starbucks Capstone project is exciting not only because it's the last project of the program but also because Starbucks is a success story for the ages. The dataset is quite comprehensive, with demographic, portfolio (promotions/offers) and transaction data. The data is contained in three files:

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record
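As a quick sketch of getting started (the inline sample record stands in for the actual files, which load the same way from their paths), the line-delimited JSON loads directly with pandas:

```python
import io
import pandas as pd

# Inline one-record sample standing in for portfolio.json; in the project,
# pass the file path instead of the StringIO buffer.
sample = ('{"id": "o1", "offer_type": "bogo", "difficulty": 10, '
          '"reward": 10, "duration": 7, "channels": ["email", "mobile"]}\n')
portfolio = pd.read_json(io.StringIO(sample), orient="records", lines=True)
```

The same call with profile.json and transcript.json yields the other two dataframes.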

I analyzed the Starbucks dataset in detail through data exploration and visualization, data preprocessing, implementation of the FunkSVD algorithm, and model evaluation, validation and refinement, with the goal of building a Recommender System that recommends which offer or promotion should be given to which customers.

I successfully implemented my Recommender System and achieved an accuracy of 95% on the test dataset for this project.

Problem Statement

Once every few days, Starbucks sends out an offer to users of its mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

My task is to combine transaction, demographic and offer data to determine which customers respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You’ll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

We are given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

We must keep in mind that there are nuances in customer behavior. For example, a customer might make a purchase through the app without having received or viewed an offer. To illustrate, a user could receive a "buy 10 dollars, get 2 dollars off" discount offer on Monday that is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view it, and still complete it. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer but never open it during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed it. To simplify the analysis, we are only interested in the full workflow: offer received, then offer viewed, then offer completed.

My objective is to build a Recommender System that can be used to better target users with offers. I could also have stated my problem as a simple classification problem and used decision trees to achieve a similar result.

Metrics

Algorithms need metrics to measure their performance and to determine whether they are solving the underlying problem efficiently and effectively. To do so, an algorithm must be tuned for the specific problem and dataset, and this tuning is performed on its hyperparameters. The algorithm together with the selected set of hyperparameters produces the desired solution. For our Recommender System we use the FunkSVD algorithm, whose hyperparameters are the number of iterations, the number of latent features and the learning rate. Recommender System problems generally use the Mean Squared Error (MSE) and Sum of Squared Errors (SSE) to measure performance. FunkSVD, which is an optimization algorithm, uses the MSE internally to measure convergence toward the desired solution, which is to minimize the MSE. When we test the algorithm on the test dataset we use the SSE metric, and our goal is to select the set of hyperparameters that gives the best SSE. The SSE serves as a proxy for accuracy.

Analysis

Data Exploration

Here I perform some data engineering and cleaning tasks starting with the profile dataset.

Original profile dataset

We immediately see three obvious problems here: None values in the gender column, invalid ages (118) and NaN values in the income column. It turned out that profiles with None in the gender column also had NaN in the income column, so we need to fix this. I also replaced all the invalid ages with NaN. Finally, I dropped all rows containing NaNs.

We also have a date when a profile became a member. It would be useful later to understand if the number of days as a member has an impact on the member’s behavior. So we engineer a new feature called member_since_days.

Clean profile dataset
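The cleaning and feature-engineering steps above can be sketched as follows (the toy rows and the snapshot-date choice for member_since_days are my assumptions, not the project's exact code):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for profile.json records.
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "gender": ["F", None, "M"],
    "age": [35, 118, 60],
    "income": [72000.0, np.nan, 51000.0],
    "became_member_on": [20170715, 20180101, 20160210],
})

# Age 118 is the dataset's missing-value placeholder; treat it as NaN.
profile.loc[profile["age"] == 118, "age"] = np.nan
# Rows with None gender also carry NaN income, so one dropna clears them all.
profile = profile.dropna(subset=["gender", "age", "income"]).reset_index(drop=True)

# member_since_days measured from the latest membership date in the data
# (the snapshot-date choice is an assumption).
joined = pd.to_datetime(profile["became_member_on"].astype(str), format="%Y%m%d")
profile["member_since_days"] = (joined.max() - joined).dt.days
```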

Now we look at the portfolio dataset. The portfolio has 10 promotions/offers which differ in terms of:

  • Duration — the period for which the offer is valid in days
  • Reward — The reward given for completing the offer
  • Difficulty — Minimum spend required to complete an offer
  • Channels — The channels on which the offer is valid
Original portfolio dataset

We can see that we can easily perform one-hot encoding on both channels and offer_type features to produce a dataset that’s ready for analysis and modeling.

Clean portfolio dataset
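The one-hot encoding step might look like this (toy rows; the real portfolio has 10 offers). Because channels holds lists rather than plain categories, it needs an explode before encoding:

```python
import pandas as pd

portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "discount"],
    "channels": [["email", "mobile"], ["web", "email"]],
    "difficulty": [10, 7],
    "reward": [10, 3],
    "duration": [7, 7],
})

# offer_type is a plain categorical column: get_dummies handles it directly.
portfolio = pd.get_dummies(portfolio, columns=["offer_type"])
# channels holds lists, so explode then collapse back to 0/1 indicator columns.
channel_dummies = (
    portfolio["channels"].explode().str.get_dummies().groupby(level=0).max()
)
portfolio = portfolio.drop(columns="channels").join(channel_dummies)
```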

Finally I turn to my transcript dataset.

Original transcript dataset

The transcript dataset is very interesting. The event column indicates whether the record is an offer received, offer viewed, offer completed or a transaction. We want to extract an offer_id column for our FunkSVD algorithm later so that we can build a user-item table. We have two options for dealing with the event column: we can pivot on person and event so that we count the events we want, or we can one-hot encode event. For now, just for the analysis, I will do a one-hot encoding.

Clean transcript dataset
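A sketch of the offer id extraction and the one-hot encoding of event (the differing "offer id"/"offer_id" key names reflect a quirk of the raw transcript data worth verifying against your copy):

```python
import pandas as pd

# Toy records standing in for transcript.json rows.
transcript = pd.DataFrame({
    "person": ["u1", "u1", "u2"],
    "event": ["offer received", "offer viewed", "transaction"],
    "time": [0, 6, 12],
    "value": [{"offer id": "o1"}, {"offer id": "o1"}, {"amount": 3.5}],
})

# The value dict keys vary by event type, so check both spellings;
# transactions have neither and get None.
transcript["offer_id"] = transcript["value"].apply(
    lambda v: v.get("offer id", v.get("offer_id"))
)
transcript = pd.get_dummies(transcript, columns=["event"])
```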

The last thing I do is combine the portfolio, profile and transcript data into a single dataset that has all the features from cleaning and engineering. I also created a single data_cleaning_pipeline method that chains all the above methods and merges the datasets into a single dataframe.

Data Visualization

Now we perform some exploratory data analysis and visualization.

The obvious distributions which give us a good glimpse of the data are distribution by income, distribution by member days, distribution by age, distribution of income by age, distribution by gender and distribution by event.

Distribution of Income
Distribution of member days
Distribution of Age
Distribution of Income by Age
Distribution of Gender
Distribution of Events

These distributions are very insightful and we can already start asking some questions in our minds as well as identifying some observations:

  • Looking at the distribution of events we can see that the number of offers received is greater than the number of offers viewed which is greater than the number of offers completed.
  • In terms of gender, males appear to lead females in the coffee drinking charts, but does this mean that males are more likely to respond to offers than females?
  • The majority of our profiles are just below 60 and interestingly that age group has the highest income.
  • The majority of our profiles sit at around 1,500 days or fewer as members, which suggests that most of our customers are relatively new coffee addicts.

Methodology

Data Preprocessing

Data preprocessing is performed in six steps, including saving the processed data to a sqlite database, in my process_data.py script.

The data_cleaning_pipeline function integrates all preprocessing activities into a single pipeline. load_data loads the three json files into pandas dataframes so that I can perform data cleaning and visualization as above. clean_portfolio, clean_profile and clean_transcript clean the respective datasets and ready them for machine learning, while save_data saves everything to a sqlite database for later use with the FunkSVD machine learning pipeline. data_cleaning_pipeline returns a pandas dataframe that is generic enough to be used by any machine learning algorithm.
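The save_data step might look like this minimal sketch (the database and table names here are illustrative, not the project's exact ones):

```python
import sqlite3
import pandas as pd

def save_data(df: pd.DataFrame, database_path: str,
              table_name: str = "starbucks") -> None:
    """Persist the cleaned dataframe to a sqlite database table."""
    conn = sqlite3.connect(database_path)
    try:
        # Replace any existing table so reruns of the pipeline are idempotent.
        df.to_sql(table_name, conn, if_exists="replace", index=False)
    finally:
        conn.close()

save_data(pd.DataFrame({"person": ["u1"], "offer_id": ["o1"]}), "starbucks.db")
```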

Implementation

Here we perform matrix factorization using the FunkSVD algorithm to make recommendations. FunkSVD is appropriate here because some offers have no user interactions, and factorizing the matrix lets us fill in those blanks. FunkSVD factorizes the user-item matrix into two matrices: a user-by-latent-feature matrix and a latent-feature-by-item matrix. Here our items are the offers in the portfolio and our users are the users from profile who appear in the transcript dataset. What's really cool about FunkSVD is that you don't need explicit information about the users (age, income, etc.); it works out the latent features itself.

The full algorithm is implemented in a number of steps:

Data loading

Here we load our data from the database from the data cleaning pipeline. The transcript dataset provides the core data we need for producing the user-item matrix.

Creating user-item matrix

Here we create the user-item matrix. There will be a user-item matrix for both the training and test data. The intersection of a user and an offer will be the number of times a user performed our desired offer received -> offer viewed -> offer completed workflow. The rest of the entries where the desired workflow was not performed will be 0’s. This is the problem we are solving using FunkSVD.

User-item matrix
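Building the matrix reduces to a group-count and pivot; the toy workflow rows below are illustrative stand-ins for the counted received -> viewed -> completed sequences:

```python
import pandas as pd

# Each row represents one completed workflow for a (user, offer) pair.
workflows = pd.DataFrame({
    "person": ["u1", "u1", "u2", "u1"],
    "offer_id": ["o1", "o2", "o1", "o1"],
})

# Count workflows per (user, offer); pairs with no workflow become 0.
user_item = workflows.groupby(["person", "offer_id"]).size().unstack(fill_value=0)
```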

Splitting data into training and test dataset

We need to split our data into a training and a test set because we want to test our algorithm on data it hasn't seen before. Performance on an unseen dataset tells us how well our algorithm will perform in the real world.
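A minimal sketch of such a split, assuming a simple random row split at the 70% fraction mentioned later in the write-up:

```python
import pandas as pd

def train_test_split_df(df: pd.DataFrame, train_frac: float = 0.7, seed: int = 42):
    """Shuffle the rows and cut them into train and test portions."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]
```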

The FunkSVD algorithm

Here I implement the basic form of FunkSVD without regularization, which takes our user-item matrix, the number of latent features, a learning rate and the number of iterations. We iterate through the matrix to fill in the missing values, then return a user-by-latent-feature matrix and a latent-feature-by-offer matrix. This is what we will use to make predictions.
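A sketch of the unregularized algorithm (names and the small demo defaults are mine, not the project's exact code; the project used a learning rate of 0.0001 and 250 iterations):

```python
import numpy as np

def funk_svd(ratings, latent_features=4, learning_rate=0.005, iters=500):
    """Basic FunkSVD without regularization.

    ratings: 2-D array where np.nan marks user/offer cells to be predicted.
    Returns a user-by-latent matrix U and a latent-by-offer matrix V.
    """
    n_users, n_items = ratings.shape
    rng = np.random.default_rng(42)
    U = rng.random((n_users, latent_features))
    V = rng.random((latent_features, n_items))

    for _ in range(iters):
        for i in range(n_users):
            for j in range(n_items):
                if np.isnan(ratings[i, j]):
                    continue  # learn only from observed cells
                u_row = U[i].copy()
                err = ratings[i, j] - u_row @ V[:, j]
                # Gradient step on both factors to shrink the squared error.
                U[i] += learning_rate * 2 * err * V[:, j]
                V[:, j] += learning_rate * 2 * err * u_row
    return U, V
```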

Fit

The fit step merely combines all the above steps into one. Given a transcript and portfolio dataframe, it splits the data into a train and test dataset, creates a train and test user-item matrix, runs FunkSVD on the train user-item matrix and returns the results.

Refinement

The FunkSVD algorithm does not have grid search capabilities out of the box, so I had to write my own loop to try out different numbers of latent features, in this case from 1 to 50, to see which produced the lowest MSE. I did the same on the test dataset to understand which number of latent features produced the best sum of squared errors.
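The hand-rolled sweep can be sketched generically; fit_fn here is any factorizer returning (U, V), a parameterization I introduce so the loop is independent of the FunkSVD internals:

```python
import numpy as np

def grid_search_latent(ratings, fit_fn, candidates=range(1, 51)):
    """Sweep latent-feature counts, scoring the MSE on observed cells.

    fit_fn(ratings, k) must return factor matrices (U, V); the 1-to-50 range
    mirrors the sweep described in the text, and the best k minimizes the MSE.
    """
    mask = ~np.isnan(ratings)
    results = {}
    for k in candidates:
        U, V = fit_fn(ratings, k)
        preds = U @ V
        results[k] = float(np.mean((ratings[mask] - preds[mask]) ** 2))
    best_k = min(results, key=results.get)
    return best_k, results
```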

Results

Model Evaluation and Validation

I performed model evaluation and validation using three primary functions:

Model evaluation

We evaluate the model obtained from the fit function by testing it on the test dataset. The evaluate_model function returns the SSE, which we use to see how well our model performs; the SSE serves as our proxy for accuracy.
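A minimal sketch of what evaluate_model computes (the signature is my approximation of the function described here):

```python
import numpy as np

def evaluate_model(test_matrix, U, V):
    """Sum of squared errors between predictions and the observed test cells.

    test_matrix uses np.nan for cells with no observation; U and V are the
    factor matrices from the fitted model.
    """
    mask = ~np.isnan(test_matrix)
    preds = (U @ V)[mask]
    return float(np.sum((test_matrix[mask] - preds) ** 2))
```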

  • I ran my fit method above with different hyperparameters to find the set that gives the best results, using the Mean Squared Error (MSE) as the training performance metric. I ran FunkSVD with a fixed learning rate of 0.0001 and 250 iterations. The variable I changed was the number of latent features, running it from 1 to 50.
Mean Squared Errors vs Latent Features
  • Interestingly, per the above figure, the MSE always decreases as the number of latent features increases. It decreases rapidly up to about 15 latent features and much more slowly thereafter.
Validation dataset accuracy vs Latent features
  • For each number of latent features I then ran the FunkSVD model on the test set to see which one produces the best sum of squared errors. The best accuracy obtainable is 0.95, at 18 latent features (interestingly, I noticed that I get slightly different numbers when I run in the notebook versus in the console). After that, the sum of squared errors oscillates and never exceeds 0.95.

Making offers

Here I start making offers using the selected model and our make_offer function. Given a user id, I return all the offers with their computed scores in descending order.

Recommendation for a user.

The recommendation function returns all the offers ordered by score, as above. For the above user, it's recommending discounts first.
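The scoring logic behind such a function can be sketched like this (the names are my approximations of the project's make_offer, with U and V coming from the fitted model):

```python
import numpy as np

def make_offer(user_id, user_ids, offer_ids, U, V):
    """Score every offer for one user and return (offer, score) pairs best-first."""
    i = user_ids.index(user_id)
    scores = U[i] @ V  # predicted workflow counts for every offer
    order = np.argsort(scores)[::-1]  # descending by score
    return [(offer_ids[j], float(scores[j])) for j in order]
```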

Justification

The final model I selected had 18 latent features and produced an SSE of 0.95, our proxy for 95% accuracy. These models take hours to run, but the results are worth it. Based on the latent feature selection and the evaluation results above, I am confident that this model is usable in practice. Adding regularization might improve the results slightly. I didn't try other methods such as pure classification using decision trees, so I cannot compare the results; however, I do think there would be merit in doing so.

Conclusion

Reflection

This was a huge project with significant practical implications. Recommender Systems are a growing field and the Starbucks project gave me lots of hands-on and practical insights on data science.

I started off with data cleaning and engineering, and this took a long time to complete. But the process was worth it because it gave me a good understanding of the data and a clear indication of which model to use for the recommendations I needed.

I then went on to perform data visualization. Once data is clean and appropriately engineered, visualization becomes a simple task. I was able to see insights such as the distributions by income, gender, event type and member days. People with incomes between 50k and 70k are the largest consumers. Males put through more transactions than females. The age group driving the most coffee gulping is around 60 years.

Deciding to build a Recommender System was intuitive, but I know I could have spent more time evaluating other techniques to solve the problem. For example, I could have used pure classification techniques, which perhaps would have given me more information on the importance of the different features in the dataset.

I implemented the basic FunkSVD algorithm used in class without regularization. This is because I stated the problem as a matrix factorization problem to fill in missing values in a user-item matrix. I also split the data into training and test set using a train/test split of 70%.

Having built my algorithm, I ran it with different hyperparameters and eventually selected 18 latent features. I found it surprising that the algorithm gave different results across runs: for example, while I got 18 latent features in the jupyter notebook, I got 27 in the Anaconda console. What I also found interesting is that as the number of latent features increased, the SSE never turned back upward as it often does with other classification solutions. This suggests that FunkSVD is perhaps more robust to overfitting than other algorithms, so the basic form without regularization was ok. My algorithm achieved an accuracy of 95%.

Improvements

  • The basic FunkSVD method above does not have regularization. Adding regularization would reduce overfitting and is good practice for machine learning models.
  • The biggest problem with FunkSVD is that it won't work for new users, as it hasn't seen them before. I would need to develop rank-based recommendations based on, say, revenue generation (income), popularity or other demographic information.
  • My user-item matrix can be improved as well. Currently I am just counting the number of times a user executed our desired workflow: offer received -> offer viewed -> offer completed. The validity of these workflows needs further testing.
  • I could also restate this problem as a pure classification problem and run other models to see what the results would look like. Solving it directly with, for example, decision trees would let me analyze features more closely to understand whether there are underlying response patterns based on gender, income, number of days as a member, the nature of the offer and other details.

