How Can Starbucks Determine the Effectiveness of Their Campaign Offers?

Published in The Startup · 11 min read · Jul 2, 2020

This article is a capstone project for the Udacity Data Science Nanodegree program. Look forward to interesting insights.

Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

The task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You’ll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

We have been given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

Business questions

Here are the questions that we plan to answer with this data:

1) Determine which demographic groups, i.e. age, gender and income groups, respond best to which offer types.

2) Assess the characteristics of the demographic groups that respond the least to offers.

3) Build a model to predict a user’s response to an offer and identify which features affect it.

Data description

The data is made up of three datasets:

portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
profile.json — demographic data for each customer
transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

id (string) — offer id
offer_type (string) — type of offer, i.e. BOGO, discount, informational
difficulty (int) — minimum required spend to complete an offer
reward (int) — reward given for completing an offer
duration (int) — time for offer to be open, in days
channels (list of strings)

profile.json

age (int) — age of the customer
became_member_on (int) — date when customer created an app account
gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
id (str) — customer id
income (float) — customer’s income

transcript.json

event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
person (str) — customer id
time (int) — time in hours since start of test. The data begins at time t=0
value (dict of strings) — either an offer id or transaction amount depending on the record

Data transformation

Here are the transformation procedures for the portfolio data

· Changed the name of the id column to offer_id

· One-Hot encoded the channels column into dummy variables

· Transformed offer duration days to hours

· Dropped the channels column
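The portfolio steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the two sample rows are made up, and the helper name clean_portfolio is hypothetical.

```python
import pandas as pd

# Made-up stand-in for portfolio.json (values are illustrative only).
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "informational"],
    "difficulty": [10, 0],
    "reward": [10, 0],
    "duration": [7, 3],  # days
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})


def clean_portfolio(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the four portfolio steps."""
    out = df.rename(columns={"id": "offer_id"})
    # One-hot encode the list-valued channels column: one row per
    # (offer, channel), dummy-encode, then collapse back to one row per offer.
    exploded = out["channels"].explode()
    channel_dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
    out = out.join(channel_dummies)
    # Convert the offer duration from days to hours.
    out["duration"] = out["duration"] * 24
    return out.drop(columns=["channels"])


portfolio_clean = clean_portfolio(portfolio)
print(portfolio_clean)
```

Expressing duration in hours puts it on the same unit as the time column in the transcript data, which makes later comparisons against offer end times straightforward.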

Here are the transformation procedures for the profile data

· Changed the name of the id column to customerid

· Converted the outlier age value 118 to N/A in the age column

· Dropped all rows with missing values, which includes the rows whose age had been set to N/A

· Extracted the year that users became members and created a new column called start_year

· Created a new column called membership_days, which is the number of days between the date the user signed up and the current date (June 2020)

· Dropped the became_member_on column
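A rough pandas sketch of the profile steps is shown below. It assumes became_member_on is stored as an integer in YYYYMMDD form and uses 2020-06-30 as the reference date; the article only says "June 2020", so the exact day is an assumption. The sample rows and the helper name clean_profile are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for profile.json (values are illustrative only).
profile = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "age": [118, 35, 62],
    "gender": [None, "M", "F"],
    "income": [np.nan, 52000.0, 88000.0],
    "became_member_on": [20170212, 20180426, 20151111],  # assumed YYYYMMDD
})


def clean_profile(df: pd.DataFrame, reference_date: str = "2020-06-30") -> pd.DataFrame:
    """Hypothetical helper that mirrors the profile steps."""
    out = df.rename(columns={"id": "customerid"})
    # Replace the outlier age 118 with NaN, then drop incomplete rows.
    out["age"] = out["age"].mask(out["age"] == 118)
    out = out.dropna().copy()
    # Parse the sign-up date and derive start_year and membership_days.
    joined = pd.to_datetime(out["became_member_on"].astype(str), format="%Y%m%d")
    out["start_year"] = joined.dt.year
    out["membership_days"] = (pd.Timestamp(reference_date) - joined).dt.days
    return out.drop(columns=["became_member_on"])


profile_clean = clean_profile(profile)
print(profile_clean)
```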

Here are the transformation procedures for the transcript data

· Changed the name of the person column to customerid

· Removed customer ids that are not in the profile data frame

· Expanded the value column to create offer_id, amount and reward columns

· Dropped the value column

· Applied a helper function to combine the two offer id columns (offer id and offer_id) into a single offer_id_new column

· Dropped the offer id and offer_id columns and renamed the offer_id_new column to offer_id

Transforming the offer id columns into one column
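As a rough sketch of the transcript steps above, the value dictionaries can be expanded into columns and the two offer id spellings merged into one. The sample rows and the helper name clean_transcript are hypothetical, and pandas' combine_first is used here in place of the project's own helper function.

```python
import pandas as pd

# Made-up stand-in for transcript.json (values are illustrative only).
transcript = pd.DataFrame({
    "person": ["c2", "c2", "c2", "c9"],
    "event": ["offer received", "transaction", "offer completed", "offer received"],
    "time": [0, 6, 6, 0],
    "value": [
        {"offer id": "offer_a"},
        {"amount": 12.5},
        {"offer_id": "offer_a", "reward": 10},
        {"offer id": "offer_b"},
    ],
})
known_customers = {"c2", "c3"}  # ids present in the cleaned profile data


def clean_transcript(df: pd.DataFrame, customer_ids: set) -> pd.DataFrame:
    """Hypothetical helper that mirrors the transcript steps."""
    out = df.rename(columns={"person": "customerid"})
    # Keep only customers that survived the profile cleaning.
    out = out[out["customerid"].isin(customer_ids)].reset_index(drop=True)
    # Expand the value dictionaries into separate columns.
    expanded = pd.json_normalize(out["value"].tolist())
    expanded = expanded.reindex(columns=["offer id", "offer_id", "amount", "reward"])
    # Merge the two spellings of the offer id key into a single column.
    expanded["offer_id_new"] = expanded["offer id"].combine_first(expanded["offer_id"])
    expanded = expanded.drop(columns=["offer id", "offer_id"])
    expanded = expanded.rename(columns={"offer_id_new": "offer_id"})
    return pd.concat([out.drop(columns=["value"]), expanded], axis=1)


transcript_clean = clean_transcript(transcript, known_customers)
print(transcript_clean)
```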

Data Visualization

Visualizing the profile data helps us better understand the demographic qualities of the Starbucks app users. The bar chart below shows the year in which each Starbucks user joined the app. It shows that 2017 was the year in which the most users became members, while 2013 was the year with the fewest new members.

Examining gender, the graph suggests that there are more male app users than female app users.

In the transcript data, transaction was the most frequent event type while offer completed was the least frequent. This is because not all members who use the Starbucks app receive offers, and among those who do, some do not complete the offer during its validity period.

More data cleaning and exploration…

To derive more insights, we merge the three datasets. First, we merge the transcript data with the portfolio data using offer_id. Second, we merge the result with the profile data using customerid.

The next step in this data exploration is to create dummy variables from the event column. Lastly, we perform label encoding on the gender column where Male is mapped as 1, Female is mapped as 0, and Other is mapped as 2. We want to look at the interaction between offer types and event.
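The merge and encoding steps described above might look roughly like this in pandas. The three small frames are made up for illustration, and the use of left joins is an assumption, since the article does not state the join type.

```python
import pandas as pd

# Made-up, already-cleaned stand-ins for the three datasets.
transcript = pd.DataFrame({
    "customerid": ["c2", "c2", "c3"],
    "event": ["offer received", "transaction", "offer viewed"],
    "time": [0, 6, 12],
    "offer_id": ["offer_a", None, "offer_b"],
})
portfolio = pd.DataFrame({
    "offer_id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "discount"],
    "difficulty": [10, 7],
})
profile = pd.DataFrame({
    "customerid": ["c2", "c3"],
    "gender": ["M", "F"],
    "age": [35, 62],
})

# Left joins (an assumption) keep transactions, which carry no offer_id.
merged = transcript.merge(portfolio, on="offer_id", how="left")
merged = merged.merge(profile, on="customerid", how="left")

# Dummy variables for the event column, one 0/1 column per event type.
event_dummies = pd.get_dummies(merged["event"], dtype=int)
merged = pd.concat([merged, event_dummies], axis=1)

# Label-encode gender with the mapping given in the text.
merged["gender"] = merged["gender"].map({"M": 1, "F": 0, "O": 2})
print(merged)
```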

From the visualization above, we can see that users received more discount and BOGO offers than informational offers. BOGO offers were viewed more than discount offers, while informational offers were viewed the least. A likely reason is that BOGO offers give users more incentive to purchase Starbucks products. Informational offers have no completed status. Discount offers were completed the most.

Assumptions for creating the class variable

An offer is successfully completed if the following occurs:

1. For informational offers: After the offer is received, it should be viewed within the validity period of the offer. Then, the customer should spend the required amount for the completion of that offer before the offer end time. Note that the transactions should occur after the offer is viewed.

2. For discount and BOGO offers: After the offer is received, it should be viewed within the validity period of the offer. There should be an offer completed entry, meaning the customer spent the required amount for the completion of that offer before the offer end time. Note that the transactions should occur after the offer is viewed.

Different flag variables were created to capture the time for each event such as time at which an offer was received, viewed etc. Similarly, flag variables were created to ensure that each type of event occurred before the end of an offer i.e. variables such as viewed on time flag and completed on time flag.
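To make the rule for discount and BOGO offers concrete, here is a minimal sketch with one row per offer sent. The column names received_time, viewed_time, completed_time and duration_hours are hypothetical, as are the sample values; only viewed_on_time_flag, completed_on_time_flag and offersuccessful follow names used in this article, and their exact definitions in the project may differ.

```python
import numpy as np
import pandas as pd

# One row per offer instance; times are hours since the start of the test.
offers = pd.DataFrame({
    "offer_instance": ["c2_offer_a_1", "c3_offer_a_1", "c3_offer_b_1"],
    "received_time": [0, 0, 168],
    "viewed_time": [6, 30, np.nan],   # NaN means the offer was never viewed
    "completed_time": [24, 12, 180],
    "duration_hours": [120, 120, 168],
})

offer_end = offers["received_time"] + offers["duration_hours"]

# Viewed between receipt and the end of the validity period.
offers["viewed_on_time_flag"] = (
    offers["viewed_time"].between(offers["received_time"], offer_end).astype(int)
)
# Completed between receipt and the end of the validity period.
offers["completed_on_time_flag"] = (
    offers["completed_time"].between(offers["received_time"], offer_end).astype(int)
)
# Completion only counts if it happened after the offer was viewed.
completed_after_view = (offers["completed_time"] >= offers["viewed_time"]).astype(int)

offers["offersuccessful"] = (
    offers["viewed_on_time_flag"]
    & offers["completed_on_time_flag"]
    & completed_after_view
)
print(offers[["offer_instance", "offersuccessful"]])
```

In this toy data only the first offer counts as successful: the second was completed before it was viewed, and the third was never viewed.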

A unique identifier was created for each offer sent because some offers were sent more than once to the same customer. Here are the columns created.

We can begin to answer some business questions now that our data is clean.

1) Determine which demographic groups, i.e. age, gender and income groups, respond best to the offer types.

The Male group had more successful offers than the Female group. We will ignore the Other gender group since it does not have enough data to support a clear conclusion. This suggests that Male users are more likely to respond to offers than their Female counterparts.

We see that the 60–80K income range had the most successful offers while the income range 100K+ had the least successful offers.

We see that the 40–60 age range had the most successful offers while the age range less than 20 years had the least successful offers. Teenage users may not consume much coffee compared to other age groups.
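A group comparison like the ones above can be sketched with pd.cut and a groupby. The bin edges below are assumptions apart from the "less than 20" and "40–60" bands named in the text, and the sample rows are made up. The sketch also reports a success rate next to the raw count, since raw counts depend on how many offers each group received.

```python
import pandas as pd

# Made-up offer outcomes, one row per offer sent.
offers = pd.DataFrame({
    "age": [18, 25, 45, 52, 58, 67, 83],
    "offersuccessful": [0, 1, 1, 1, 0, 1, 0],
})

# Age bands; only "<20" and "40-60" are named in the text, the rest are assumed.
age_bins = [0, 20, 40, 60, 80, 120]
age_labels = ["<20", "20-40", "40-60", "60-80", "80+"]
offers["age_group"] = pd.cut(
    offers["age"], bins=age_bins, labels=age_labels, right=False
)

summary = offers.groupby("age_group", observed=False)["offersuccessful"].agg(
    successful="sum", sent="count"
)
summary["success_rate"] = summary["successful"] / summary["sent"]
print(summary)
```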

2) Assess characteristics of the demographic groups who respond the least to offers

From the graphs above, we can see that the Female gender group responds the least to offers, as do the 100K+ income range and the less than 20 age range.

Data preprocessing

The data preprocessing stage involves creating the target and feature variables. Our target variable is offersuccessful, which indicates whether a user successfully responded to an offer or not. To select the feature variables, we use the correlation matrix as a selection tool. Since logistic regression is one of the models used in this project, removing highly correlated features reduces the risk of multicollinearity. In this stage, we eliminate one variable from each highly correlated pair. A correlation greater than 0.6 is considered high.
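One way to list the highly correlated pairs is sketched below. It assumes the 0.6 threshold applies to the absolute value of the correlation, so that strong negative correlations are also flagged; the helper name high_correlation_pairs and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up feature values (illustrative only).
features = pd.DataFrame({
    "start_year": [2013, 2014, 2015, 2016, 2017, 2018],
    "membership_days": [2600, 2200, 1900, 1500, 1100, 800],
    "age": [30, 50, 30, 50, 30, 50],
})

pairs = high_correlation_pairs(features)
print(pairs)
```

Each reported pair is a candidate for dropping one of its two members; which one to keep remains a judgment call.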

In the figure above, the correlation matrix suggests that the following pairs were highly correlated: viewed flag and viewed on time flag, start year and membership days, completed flag and completed on time flag, BOGO and reward, discount and duration, informational and duration, informational and difficulty, difficulty and mobile, difficulty and duration, BOGO and discount, offer successful and completed valid, and completed on time flag and completed valid. We will drop the following features: start year, viewed flag, completed valid, reward, discount, informational, mobile, and completed flag.

Here is the rationale behind the choice of variables to drop. Membership days was kept instead of start year because it gives a numerical representation of how long a user has been a member, whereas years would have to be encoded to be meaningful. Completed on time flag and viewed on time flag are better measures of the success of offers than viewed flag and completed flag. BOGO was kept instead of discount because BOGO offers had more views. Reward and offer types had similar effects on users; since BOGO had already been selected, reward was dropped. Informational and mobile were both highly correlated with difficulty, so both were dropped.

When we check the correlation matrix for the second time to confirm that the highly correlated variables have been removed, the correlation matrix below shows that duration and difficulty were highly correlated. We will drop duration since it has been accounted for in flag variables such as completed on time flag and viewed on time flag.

In preparation for modeling, numerical variables such as difficulty, age, total_amount, time, income, and membership_days were scaled using the MinMaxScaler.
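The scaling step can be sketched with scikit-learn's MinMaxScaler, which maps each column to the range [0, 1]. The column names follow the list above; the three sample rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values (illustrative only).
features = pd.DataFrame({
    "difficulty": [0.0, 7.0, 20.0],
    "age": [22.0, 45.0, 80.0],
    "total_amount": [3.5, 18.0, 42.0],
    "time": [0.0, 336.0, 714.0],
    "income": [30000.0, 65000.0, 120000.0],
    "membership_days": [700.0, 1200.0, 2500.0],
})
numeric_cols = ["difficulty", "age", "total_amount", "time", "income", "membership_days"]

scaler = MinMaxScaler()
features[numeric_cols] = scaler.fit_transform(features[numeric_cols])
print(features)
```

A common precaution, not stated in the article, is to fit the scaler on the training split only and reuse it to transform the test split, so that no information from the test data leaks into training.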

Modeling

A model directory was created to store the three classifiers used to determine whether a user responds to an offer or not. After the data was split into training and testing sets in the preprocessing stage, we created a helper function to evaluate each model using accuracy and F1 scores. The examined models, Logistic Regression, Random Forest and XGBoost classifiers, were tuned to determine the best parameters.
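An evaluation helper and a tuning step along these lines could look like the sketch below. It runs on synthetic data from make_classification rather than the Starbucks data, the function name evaluate_model and the parameter grid are hypothetical, and XGBoost is left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model and report accuracy and F1 on both splits."""
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    return {
        "train_accuracy": accuracy_score(y_train, train_pred),
        "train_f1": f1_score(y_train, train_pred),
        "test_accuracy": accuracy_score(y_test, test_pred),
        "test_f1": f1_score(y_test, test_pred),
    }


# Small illustrative grid; the grid actually used in the project is not given.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

scores = evaluate_model(search.best_estimator_, X_train, y_train, X_test, y_test)
print(search.best_params_)
print(scores)
```

Comparing the training and test scores side by side is a quick check for overfitting: a large gap suggests the tuned model has memorized the training data.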

The final model was Random Forest. Random Forest and XGBoost had very similar accuracy and F1 scores. Since the data was not highly imbalanced, both models produced very similar results.

Tuned result of the Random Forest classifier for the trained data

Result of the Random Forest classifier for the test data

The best tuned parameters for the final model: Random Forest

The Random Forest Feature Importance estimates that the following variables were most influential in predicting the class label offersuccessful.

The top 10 most important variables are displayed below:

The most influential features, those with an importance greater than 0.005, were viewed_on_time_flag, completed_on_time_flag, membership_days, total_amount, BOGO, social and difficulty.
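Ranking features by importance and applying the 0.005 cut-off can be sketched as follows. The model is trained on synthetic data and the feature names are placeholders, so the resulting ranking says nothing about the Starbucks features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances from the fitted forest; they sum to 1.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Keep the top 10 features whose importance exceeds the 0.005 cut-off.
top_features = importances[importances > 0.005].head(10)
print(top_features)
```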

Conclusion: A call to action for Starbucks Marketing team

In summary, here are the insights from the data that Starbucks can use to create a marketing strategy for its future campaigns:

The 40–60 age range, Male, and the 60–80K income range respond best to offers
BOGO was the most significant offer type that determined the success of an offer
The total amount of a transaction plays a key role in determining whether a customer responds to an offer or not
Social media was the most significant channel that determined the success of an offer
The minimum required spend to complete an offer and membership days also determine the success of an offer
The <20 age range, Female and the 100K+ income range respond the least to offers

The data has been successful in defining a target demographic that responds best to offers, an offer type that is most effective for campaigns, a campaign channel that is suitable for the targeted audience, and other influential factors such as transaction amount, membership days, and difficulty.

Improvement

For areas of further study, building a regression model to quantify how strongly transaction amount, income, age, difficulty, and membership days affect the success of an offer is a good place to start. One can further explore the label differences in categorical variables such as income, age and gender. These changes can then be deployed to the web.

For the source code and a detailed explanation of this project, go to my GitHub profile.