Starbucks Capstone Udacity Data Science Nanodegree

Radhika Jangi
Nov 21, 2019

Project and Problem

Many companies want to understand what drives a customer's decision to make a purchase. Which factors lead a potential customer to consider an offer from the company and then buy something as a result? This project attempts to answer this question with a simulated data set of customer and offer information from Starbucks. Some of the customers receive certain offers through the Starbucks app, while others receive other offers, or none at all.

There are three provided files. The portfolio file describes offer characteristics including its duration and the amount a customer must spend to complete it (difficulty). The profile file contains customer demographics and when they became a member of the Starbucks rewards program. The transcript file describes customer purchases. It also details whether an offer is viewed, received, or completed.

In order to answer the question above, I built a machine learning model that predicts whether an offer will lead to a transaction, using a combination of all of the data provided. To evaluate the model, I used the F1 score, which is the harmonic mean of precision and recall. I used a random forest classifier, then ran a grid search to tune its hyperparameters and improve the F1 score. I also calculated feature importances to see which feature affected the success of an offer the most.

Data Exploration/Cleaning

Initial portfolio data set

For the portfolio data, I one-hot encoded channels and offer_type. Offer_type could be handled with the get_dummies() method, but because each entry in channels is a list, I used MultiLabelBinarizer() to break the lists down. I also renamed id to offer_id to match the offer_id column in transcript.
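The portfolio encoding described above can be sketched roughly as follows. This is a minimal illustration on invented rows, not the original notebook code; the sample values and the name portfolio_clean are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the portfolio file; these rows are invented for illustration.
portfolio = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["email", "web"], ["email", "mobile", "social"], ["email"]],
    "difficulty": [10, 7, 0],
    "duration": [7, 10, 3],
})

# Each channels entry is a list, so get_dummies() cannot expand it directly.
mlb = MultiLabelBinarizer()
channel_dummies = pd.DataFrame(
    mlb.fit_transform(portfolio["channels"]),
    columns=mlb.classes_,
    index=portfolio.index,
)

# offer_type holds one label per row, so get_dummies() is enough.
type_dummies = pd.get_dummies(portfolio["offer_type"])

# Replace the raw columns with their encoded versions and rename id.
portfolio_clean = pd.concat(
    [portfolio.drop(columns=["channels", "offer_type"]), channel_dummies, type_dummies],
    axis=1,
).rename(columns={"id": "offer_id"})
```

Each channel and each offer type ends up as its own 0/1 column, which is the shape a tree-based classifier can consume directly.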

Initial profile data set

The first thing I noticed was that a few customers had their age listed as 118, which seems highly unlikely. These rows also had no listed gender and NaN values for income. I ended up dropping these rows because the only null values in this data set appeared to be linked to the 118-year-old “customers”. I also one-hot encoded gender and converted became_member_on from an integer column into a datetime column.
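A minimal sketch of these profile cleaning steps, on invented rows, might look like this. It assumes that became_member_on stores dates as YYYYMMDD integers; that format is an assumption, since the post only says the column is an integer.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the pattern described above.
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "age": [118, 45, 62],
    "gender": [None, "F", "M"],
    "income": [np.nan, 82000.0, 64000.0],
    "became_member_on": [20170212, 20180426, 20160715],
})

# Verify that nulls only occur on the age-118 rows before dropping them.
placeholder_age = profile["age"] == 118
assert profile.loc[~placeholder_age].notna().all().all()
profile_clean = profile.loc[~placeholder_age].copy()

# Assumed YYYYMMDD integer format for the membership date.
profile_clean["became_member_on"] = pd.to_datetime(
    profile_clean["became_member_on"].astype(str), format="%Y%m%d"
)

# One-hot encode gender and drop the original column.
profile_clean = pd.concat(
    [profile_clean.drop(columns="gender"), pd.get_dummies(profile_clean["gender"])],
    axis=1,
)
```

Checking the null pattern first makes the decision to drop rows explicit instead of relying on a visual impression of the data.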

I then decided to get an idea of the spread of some customer demographics.

Distribution of Income by Gender

From these histograms, there appear to be many more male customers than female customers, but the female customers earn slightly more: their median income is just under $80,000, compared with a median of $70,000 for male customers.

Another thing I was interested in was how membership changed over time. As time goes on, the number of users joining the app increases, which makes sense as more people choose mobile ordering over traditional in-store ordering.

Initial transcript data set

For transcript, I first renamed person to customer_id to clarify which type of id this is. I then dropped all the rows in transcript with a customer id that didn’t appear in profile, since transaction and offer data is not useful unless it can be traced to a customer profile. Next, I one-hot encoded the event column into offer viewed, offer received, and offer completed, along with a transaction indicator. I then looked at the value column, which contains an offer id if the row is associated with an offer, or an amount if the row is associated with a transaction. Some rows also have rewards associated. I extracted the offer ids and the amounts from the value column into separate columns, then dropped the now-redundant event and value columns. In this data set, 45.4% of the records are transactions and 54.6% are offers.
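The value-column extraction could be sketched as below. The dictionary keys ("offer id", "offer_id", "amount", "reward") and the sample rows are assumptions made for illustration; the real keys should be checked against the file.

```python
import numpy as np
import pandas as pd

# Invented rows; the dictionary keys in value are assumed, not confirmed.
transcript = pd.DataFrame({
    "person": ["u1", "u1", "u1", "u2"],
    "event": ["offer received", "offer viewed", "transaction", "offer completed"],
    "value": [{"offer id": "a1"}, {"offer id": "a1"}, {"amount": 12.5},
              {"offer_id": "b2", "reward": 5}],
    "time": [0, 6, 30, 48],
}).rename(columns={"person": "customer_id"})

# Pull the offer id (under either assumed spelling) and the amount out of each dict.
transcript["offer_id"] = [d.get("offer id", d.get("offer_id")) for d in transcript["value"]]
transcript["amount"] = pd.Series(
    [d.get("amount", np.nan) for d in transcript["value"]],
    index=transcript.index, dtype="float64",
)

# One-hot encode the event type, using underscores in the new column names.
event_dummies = pd.get_dummies(transcript["event"])
event_dummies.columns = [c.replace(" ", "_") for c in event_dummies.columns]

# Drop the columns that are now redundant.
transcript_clean = pd.concat(
    [transcript.drop(columns=["event", "value"]), event_dummies], axis=1
)

# Share of records that are transactions rather than offer events.
transaction_share = transcript_clean["transaction"].mean()
```

The same mean over the transaction indicator is one way to arrive at a split like the 45.4% / 54.6% figure quoted above.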

Combining

At this point, I decided to split transcript into two dataframes, offers and transactions, for further processing. For both, the time variable was converted into days to match the duration variable in portfolio. I then combined the four dataframes into a master dataframe called starbucks_df, which includes a column that is 0 if an offer is unsuccessful and 1 if it is successful. The combine_dfs() method determines success by taking a customer id and pulling all of the offers and transactions associated with it. For each offer, it looks at the window of time during which the offer is valid, searches for transactions that fall within that window, and assigns a 0/1 in a Boolean array for that offer. The offers for that customer are arranged into a list of dictionaries containing the success variable.

This method loops through all of the unique customer ids in profile to create the new combined master starbucks_df.
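The windowing idea can be illustrated with a simplified sketch for a single customer. The function label_offers_for_customer and the column name time_days are hypothetical; this is not the combine_dfs() code itself, and it only checks whether any transaction falls inside the offer's validity window.

```python
import pandas as pd

def label_offers_for_customer(offers, transactions):
    """Return one dict per received offer with a 0/1 success flag.

    offers: columns offer_id, time_days (day received), duration (days valid).
    transactions: columns time_days and amount.
    """
    tx_times = transactions["time_days"].to_numpy()
    rows = []
    for offer in offers.itertuples(index=False):
        start = offer.time_days
        end = start + offer.duration
        # Boolean array: which transactions fall inside the validity window?
        in_window = (tx_times >= start) & (tx_times <= end)
        rows.append({
            "offer_id": offer.offer_id,
            "time_days": start,
            "success": int(in_window.any()),
        })
    return rows

# Invented data for one customer.
offers = pd.DataFrame({"offer_id": ["a1", "b2"], "time_days": [0.0, 14.0], "duration": [7, 5]})
transactions = pd.DataFrame({"time_days": [3.5, 10.0], "amount": [12.5, 4.0]})

labelled = label_offers_for_customer(offers, transactions)
```

Here the first offer is valid from day 0 to day 7 and has a purchase on day 3.5, so it is marked successful; the second has no purchase between day 14 and day 19. Calling the function once per customer id and concatenating the results would give a table with one row per offer.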

Portion of the starbucks_df data frame. Contains success variable

Model

I wanted to see if a machine learning approach can predict whether an offer will be used for a transaction, based on training and testing data derived from the starbucks_df dataframe I generated above. The first step is to split the dataframe into feature data and target data. The target is whether an offer is successful or not. For the features, I omitted customerid, offerid, and became_member_on because those columns do not seem likely to contribute heavily to an offer's success.
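As a small sketch, the feature/target split might look like this. The rows are invented, and the target column name success is an assumption; the other column names follow the ones mentioned in this post.

```python
import pandas as pd

# Invented rows standing in for starbucks_df.
starbucks_df = pd.DataFrame({
    "customerid": ["u1", "u2", "u3"],
    "offerid": ["a1", "b2", "a1"],
    "became_member_on": pd.to_datetime(["2017-02-12", "2018-04-26", "2016-07-15"]),
    "income": [82000.0, 64000.0, 51000.0],
    "duration": [7, 10, 7],
    "totalamount": [120.5, 18.0, 42.3],
    "success": [1, 0, 1],
})

# Identifier-like columns are left out of the feature matrix.
omitted = ["customerid", "offerid", "became_member_on"]
X = starbucks_df.drop(columns=omitted + ["success"])
y = starbucks_df["success"]
```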

I decided to measure the strength of my model using F1-scoring, which is based on the recall and precision of the model.

F1 Score formula
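For reference, the standard definition of the F1 score in terms of precision and recall is given below. This is the textbook formula, written out here rather than transcribed from the original figure.

```latex
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\qquad
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}
```

Here TP, FP, and FN are the counts of true positives, false positives, and false negatives.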

I was debating between using a DecisionTreeClassifier() and a RandomForestClassifier(), so I calculated their F1 scores.
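A comparison like this can be sketched as follows. The data comes from make_classification purely so the example runs on its own, so the resulting scores say nothing about the Starbucks data; the split ratio and random seeds are also arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature and target data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model and record its F1 score on the training and test splits.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": f1_score(y_train, model.predict(X_train)),
        "test": f1_score(y_test, model.predict(X_test)),
    }
```

Reporting both training and test scores makes it easier to spot overfitting, which unpruned decision trees are prone to.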

F1 scores for both classifiers

Since the RandomForestClassifier() had a higher F1 score, I went with that classifier for my model. I ran a grid search, refit the model, and rechecked the F1 scores, which improved greatly:

F1 scores for training and test data
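One common way to run this kind of tuning is scikit-learn's GridSearchCV with scoring="f1". The sketch below uses synthetic data and a small parameter grid chosen only for illustration; the grid actually used in this project is not listed in the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature and target data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid; these values are assumptions, not the author's settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # pick the combination with the best cross-validated F1
    cv=3,
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training split by default.
best_model = search.best_estimator_
train_f1 = f1_score(y_train, best_model.predict(X_train))
test_f1 = f1_score(y_test, best_model.predict(X_test))
```

The chosen settings are available in search.best_params_, and comparing train_f1 with test_f1 shows how well the tuned model generalizes.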

I also calculated the feature importance for this data and plotted them out.

Plot of importance by feature

It looks like the total amount spent in transactions affects offer success the most. This feature accounts for the majority of the importance, followed by income and offer duration.
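Importances like these can be read from a fitted random forest's feature_importances_ attribute. In the sketch below the data is synthetic and the feature names are only borrowed as labels, so the resulting ranking is meaningless; it just shows the mechanics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels are illustrative; the values behind them are random synthetic data.
feature_names = ["totalamount", "income", "duration", "difficulty", "age"]
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=feature_names)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total. Impurity-based importances can favor continuous features with many distinct values, which is worth keeping in mind when a spending total dominates.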

Conclusion

Based on the model above, it looks like the most influential factor in determining whether an offer will be successful or not was the amount of money a customer has spent on transactions. Logically, this makes sense, as a customer will most likely be thinking about how much money they have spent on previous purchases. However, the model doesn’t tell us whether high spenders or budgeters are more likely to make a transaction based on an offer. In the future, a feature categorizing high to moderate to low spenders based on the totalamount column could help elucidate this. Starbucks could even implement this feature to target high spending or low spending customers.
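One possible way to build such a spending-tier feature is pandas' qcut, which splits a column into equal-sized quantile bins. The rows below are invented, the column name success is an assumption, and three tiers is an arbitrary choice.

```python
import pandas as pd

# Invented rows; totalamount follows the column name used in this post.
df = pd.DataFrame({
    "totalamount": [5.0, 20.0, 35.0, 60.0, 120.0, 300.0],
    "success": [0, 0, 1, 1, 1, 1],
})

# Split customers into three equal-sized tiers by total spend.
df["spend_tier"] = pd.qcut(
    df["totalamount"], q=3, labels=["low", "moderate", "high"]
)

# Offer success rate within each tier.
success_by_tier = df.groupby("spend_tier", observed=True)["success"].mean()
```

Comparing success rates across tiers would show the direction of the relationship, which the feature importance alone does not reveal.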

The data itself could be diversified as well. Whether users purchase items via the app or in store could be an interesting variable for determining whether offers are being used in transactions. Regional data could also be useful, as some Starbucks locations don’t accept the same offers as others. A customer who receives an offer and tries to use it at such a location would still be counted as an unsuccessful offer. Regional differences may also be reflected in the customer profile data.

Overall, the data allowed us to make some interesting discoveries about what influences whether an offer leads to a successful transaction.
