I Analyzed Starbucks Data. Here’s What I Found

Ebubechi E. Ezenwanne
Published in Analytics Vidhya
6 min read · Apr 30, 2020
Photo by Hans Vivek on Unsplash

Data speaks. It sure had some recommendations for Starbucks. All I did was listen to it.

Project Overview

The dataset for this project is a simplified version of the real Starbucks rewards mobile app. This project used transaction, demographic, and offer data to analyze how different customers responded to different offers, and then to build a model that predicts whether a customer will respond to an offer.

Problem Statement

An offer can be buy-one-get-one (BOGO), discount, or informational. In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount. In a discount, a user gains a reward equal to a fraction of the amount spent. In an informational offer, there is no reward, but neither is there a required amount that the user is expected to spend. Offers can be delivered via multiple channels.

We are interested in answering:

  1. How do different factors such as demographics affect response to offers?
  2. Which factors, based on the data available, have the most impact on offer completion?
  3. What steps can we take, given the answers to question 2, to increase offer completion rates?

Data Sets

The data is contained in three files:

  • portfolio.json: offer ids and metadata about each offer (duration, type, etc.)
  • profile.json: demographic data for each customer
  • transcript.json: records for transactions, offers received, offers viewed, and offers completed

Photo by Zoe Holling on Unsplash

Strategy

Here’s how I approached the project:

  • Clean, process, and combine the offer portfolio, customer profile, and transaction data. Each row of the combined dataset describes the customer’s demographic data, the offer’s attributes, and whether the offer was successful. A customer can complete an offer without ever viewing it, so this case has to be handled: transactions should be counted only when the customer both viewed and completed the offer.
  • Build a model to predict offer success from the customer demographics and the offer attributes. Compare logistic regression, a random forest classifier, and a KNeighbors classifier. (The model with the best performance is then fine-tuned to produce the final model.)
  • Obtain the important feature columns that influence the success of an offer and leverage data visualization to answer the questions that were framed above.
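
The success rule from the first step can be sketched as below. This is a minimal illustration under stated assumptions, not code from the notebook: the function name `offer_succeeded` and its arguments are hypothetical, times are assumed to be in days, and an offer is assumed to count only if it was viewed no later than it was completed and completed within its duration.

```python
from typing import Optional


def offer_succeeded(received: float,
                    viewed: Optional[float],
                    completed: Optional[float],
                    duration: float) -> bool:
    """Return True when an offer was viewed and then completed in time.

    All times are in days (an assumption). `viewed` or `completed`
    is None when that event never happened.
    """
    if viewed is None or completed is None:
        return False  # never viewed, or never completed
    if viewed > completed:
        return False  # completed before the customer saw the offer
    return received <= viewed and completed <= received + duration


print(offer_succeeded(0, 1, 3, duration=7))     # viewed, then completed in time
print(offer_succeeded(0, None, 3, duration=7))  # completed without viewing
print(offer_succeeded(0, 5, 3, duration=7))     # viewed only after completing
```

Keeping the rule in one small function makes it easy to change the definition of success later, for example if informational offers need a different treatment.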

Metrics

The success of the project would be determined by the accuracy and F1-score of the model built. The model should have accuracy and F1-scores of above 75% so we ensure it performs well on new data sets.
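
For reference, both metrics can be computed from the four confusion-matrix counts. The sketch below uses plain Python and made-up labels purely for illustration; the function name `accuracy_and_f1` is hypothetical, and a library such as scikit-learn would normally be used instead.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels (1 = offer successful)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Made-up labels, not results from the project.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Reporting F1 alongside accuracy guards against a model that looks accurate only because one class dominates the data.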

Photo by Lukas Blazek on Unsplash

Exploratory Data Analysis

The Portfolio data set was relatively small and had no missing values.

The Profile data set had 12% of values missing in gender and income. Males made up about 50% of the data set. The mean age was 62 years.

Figure 1: Age Distribution

From the age distribution, most customers were between 40 and 70 years old. A large number of missing ages were recorded as 118 years old in the dataset.
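
Because 118 is a placeholder rather than a real age, it can distort summary statistics if it is left in. The sketch below shows one way to filter it out; the rows are made up, the field names are assumed to follow profile.json, and `clean_profiles` is a hypothetical helper rather than the notebook's code.

```python
MISSING_AGE = 118  # sentinel the dataset uses for "age not provided"

# Made-up rows in the assumed shape of profile.json.
profiles = [
    {"id": "a1", "age": 55, "gender": "F", "income": 72000},
    {"id": "b2", "age": 118, "gender": None, "income": None},
    {"id": "c3", "age": 41, "gender": "M", "income": 58000},
]


def clean_profiles(rows):
    """Drop rows whose age is the 118 sentinel or whose income is missing."""
    return [r for r in rows
            if r["age"] != MISSING_AGE and r["income"] is not None]


cleaned = clean_profiles(profiles)
mean_age = sum(r["age"] for r in cleaned) / len(cleaned)
print(len(cleaned), mean_age)
```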

Figure 2: Income Distribution

The mean income was $65,000. Two percent of customers had outlier incomes but accounted for about half of the transactions.

Figure 3: Gender Distribution

The analysis showed that female customers were, on average, older and earned more. Customers who did not enter their age also did not enter their income, and these rows were dropped.

The Transcript data had no missing values and contained 4 event types: Transaction, Offer Received, Offer Viewed, and Offer Completed.

Figure 4: Event Distribution

Data Preparation Steps

Portfolio Data

* One-hot encode channels and offer type, then append the new columns to the dataset.
* Drop original columns after encoding.
* Rename columns.
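
The portfolio steps above can be sketched in plain Python as follows. The rows are made up, the field names (`id`, `offer_type`, `channels`) are assumed to match portfolio.json, and the output column names and the `one_hot_offers` helper are hypothetical. In a notebook, pandas would be the more usual tool for this.

```python
# Made-up offers in the assumed shape of portfolio.json.
offers = [
    {"id": "o1", "offer_type": "bogo",
     "channels": ["email", "mobile", "social"]},
    {"id": "o2", "offer_type": "discount",
     "channels": ["web", "email"]},
    {"id": "o3", "offer_type": "informational",
     "channels": ["email", "mobile"]},
]


def one_hot_offers(rows):
    """Replace `channels` and `offer_type` with 0/1 indicator columns."""
    channels = sorted({c for r in rows for c in r["channels"]})
    types = sorted({r["offer_type"] for r in rows})
    encoded = []
    for r in rows:
        out = {"offer_id": r["id"]}  # renamed from `id`
        for c in channels:
            out[f"channel_{c}"] = int(c in r["channels"])
        for t in types:
            out[f"type_{t}"] = int(r["offer_type"] == t)
        encoded.append(out)  # the original columns are not copied over
    return encoded


for row in one_hot_offers(offers):
    print(row)
```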

Profile Data

* One-hot encoding for categorical columns such as gender.
* Convert data types
* Drop null values (118 is used to represent null for age)
* Drop original columns after encoding.
* Rename columns.

Transcript Data

* No null values to drop.
* One-hot encoding for event column.
* Encode value column and split it into different columns.
* Merge offer id and offer_id columns by assigning values of offer id column to offer_id column.
* Assign zeros to null values in offer amounts and rewards
* Change time column from hours to days.
* Rename columns.
* Remove customer ids that are not available in profile dataset.
* Drop original columns that have been encoded.
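
Several of the transcript steps (splitting the value column, merging the two offer id spellings, filling zeros, and converting hours to days) can be sketched together. The events are made up, the field names are assumed to follow transcript.json, and `flatten_event` and the output column names are hypothetical.

```python
# Made-up events in the assumed shape of transcript.json.
events = [
    {"person": "a1", "event": "offer received", "time": 0,
     "value": {"offer id": "o1"}},
    {"person": "a1", "event": "transaction", "time": 30,
     "value": {"amount": 12.5}},
    {"person": "a1", "event": "offer completed", "time": 30,
     "value": {"offer_id": "o1", "reward": 5}},
]


def flatten_event(e):
    """Split `value` into columns and convert time from hours to days."""
    v = e["value"]
    return {
        "customer_id": e["person"],
        "event": e["event"],
        # The two spellings of the key are merged into one column.
        "offer_id": v.get("offer_id", v.get("offer id")),
        "amount": v.get("amount", 0),  # zero when there is no amount
        "reward": v.get("reward", 0),  # zero when there is no reward
        "time_days": e["time"] / 24,
    }


flat = [flatten_event(e) for e in events]
for row in flat:
    print(row)
```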

Next, I created a transactions data frame that holds the receipt, view, and completion of each offer for each customer in a single row. Previously, these events were spread across different rows.
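
A rough sketch of this reshaping is shown below. It is a simplification and not the notebook's code: the events are made up, `one_row_per_offer` is a hypothetical helper, and it keeps only the first occurrence of each event, so it ignores the case where a customer receives the same offer more than once.

```python
from collections import defaultdict

# Made-up flattened events: (customer_id, offer_id, event, time in days).
events = [
    ("a1", "o1", "offer received", 0.0),
    ("a1", "o1", "offer viewed", 0.5),
    ("a1", "o1", "offer completed", 1.25),
    ("c3", "o1", "offer received", 0.0),
    ("c3", "o1", "offer completed", 2.0),
]


def one_row_per_offer(rows):
    """Collapse events into one record per (customer, offer) pair."""
    table = defaultdict(lambda: {"received": None, "viewed": None,
                                 "completed": None})
    for customer, offer, event, time in rows:
        key = event.replace("offer ", "")  # "offer viewed" -> "viewed"
        record = table[(customer, offer)]
        if record[key] is None:  # keep the first occurrence only
            record[key] = time
    return dict(table)


rows = one_row_per_offer(events)
for pair, record in rows.items():
    print(pair, record)
```

In this toy data, customer `c3` completed the offer without ever viewing it, which is exactly the case that has to be excluded when labeling an offer as successful.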

Model Training and Evaluation

Before training the model, I had to drop some columns:
1. I dropped reward_gained since it was the same as offer_reward.
2. I dropped offer_completed and transaction complete.
3. I dropped a column for each set of one-hot encoded columns to avoid multicollinearity
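
The third step deserves a short illustration. Within one set of one-hot columns, the indicators in each row always sum to 1, so any one column is fully determined by the others. Dropping one column removes that exact linear dependence without losing information. The sketch below uses made-up rows and a hypothetical helper, `drop_first_dummy`; the column names are assumptions.

```python
def drop_first_dummy(rows, prefix):
    """Remove the first (alphabetical) indicator column with this prefix."""
    dummy_cols = sorted(k for k in rows[0] if k.startswith(prefix))
    dropped = dummy_cols[0]
    reduced = [{k: v for k, v in r.items() if k != dropped} for r in rows]
    return reduced, dropped


# Made-up one-hot rows for the offer type.
rows = [
    {"type_bogo": 1, "type_discount": 0, "type_informational": 0},
    {"type_bogo": 0, "type_discount": 1, "type_informational": 0},
    {"type_bogo": 0, "type_discount": 0, "type_informational": 1},
]

reduced, dropped = drop_first_dummy(rows, "type_")
print(dropped)
for r in reduced:
    print(r)
```

After the drop, a row with zeros in both remaining columns still identifies a BOGO offer, so the dropped category becomes the implicit baseline.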

Next, I merged the 3 cleaned data frames and split the result into training and test sets.

I tried 3 models: Logistic Regression, Random Forest Classifier, and KNeighbors Classifier. Random Forest Classifier produced the best results with Precision, Recall and F1-Score of 79%.

I fine-tuned the parameters and trained the final model which had Precision, Recall and F1-Score of 80%.

Figure 5: Most Predictive Features

Feature Importance

From our analysis, we can see the five most important features to predict offer success are:

  1. Offer difficulty: the amount the customer must spend to complete the offer.
  2. Income.
  3. Age of customer.
  4. The reward for Offer.
  5. Offer Duration

Figure 6: Important Features

Conclusion

We surpassed our goal of 75% accuracy and F1-score as we had 80% accuracy and F1-score.

From the Analysis, we can answer the questions posed at the beginning:

  1. We can see that customer demographics such as income and age play a huge role in predicting the success of an offer. Gender does not play such a huge role though.
  2. Besides income and age, the difficulty of an offer, the reward it offers the user, and how long it lasts are key factors that determine the success of the offer. Hence, attention must be paid to these 3 factors. Offers with higher rewards would have a higher completion rate. Also, offers should have a longer duration and lower difficulty for higher completion rates.
  3. From the second plot, we also see that the social media channel carries more weight and has a higher impact on offer completion than other channels. Mobile and email channels contribute far less because they are present for all kinds of promotions and thus do not provide additional information.

Future Improvements

  • Test other machine learning models with different parameters.
  • Leverage model in a web app for Starbucks to predict offer completion based on a customer’s profile.
  • Build a model to test how much a user would spend in response to different offer types.

What next?

You can check out my notebook here.

If you enjoyed reading this article, please recommend and share it to help others find it!
