Starbucks Offers Data Science Approach

Jul 11, 2020

(https://www.diegocoquillat.com/en/starbucks-apuesta-por-el-big-data-y-la-inteligencia-artificial-para-luchar-contra-el-estancamiento-economico/)

Introduction

As part of my capstone project for the Data Scientist Nanodegree at Udacity, I chose the Starbucks Offers project. In this project, I analyzed a data set containing simulated data that mimics customer behavior on the Starbucks rewards mobile app.

Starbucks sends out an offer to users of the mobile app every other day. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free).

Not all users receive the same offer, and that is the challenge to solve with this data set.

My task is to combine transaction, demographic, and offer data to determine whether a user will respond to an offer.

Business Understanding

The questions I am trying to answer are:

What are the distributions of gender, age, and income of customers in general?
What are the distributions of offer types, both alone and broken down by gender, age, and income, for customers who completed the offers?
Will a customer respond to an offer?

Data Understanding

The data is contained in three files:

portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
profile.json — demographic data for each customer
transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

1- portfolio.json

id (string) — offer id
offer_type (string) — type of offer, i.e. BOGO, discount, informational
difficulty (int) — minimum required spend to complete an offer
reward (int) — reward given for completing an offer
duration (int) — time for offer to be open, in days
channels (list of strings)

2- profile.json

age (int) — age of the customer
became_member_on (int) — date when customer created an app account
gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
id (str) — customer id
income (float) — customer’s income

3- transcript.json

event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
person (str) — customer id
time (int) — time in hours since start of test. The data begins at time t=0
value (dict of strings) — either an offer id or transaction amount depending on the record
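As a rough sketch, files with this schema can be read into pandas dataframes as shown here. I am assuming the files are line-delimited JSON (one record per line); the helper name `read_records` and the sample records are made up for illustration, and an in-memory buffer stands in for the real transcript.json.

```python
import io

import pandas as pd


def read_records(source):
    """Read line-delimited JSON (one record per line) into a DataFrame.

    `source` can be a file path or a file-like object.
    """
    return pd.read_json(source, orient="records", lines=True)


# Hypothetical stand-in for transcript.json, kept in memory for the example.
sample = io.StringIO(
    '{"person": "u1", "event": "offer received", "value": {"offer id": "o1"}, "time": 0}\n'
    '{"person": "u1", "event": "transaction", "value": {"amount": 9.5}, "time": 6}\n'
)
transcript = read_records(sample)
print(transcript.dtypes)
```

With the real files, the same helper would be called with a path, for example `read_records("portfolio.json")`, adjusted to wherever the data actually lives.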

Data Preparation

To answer the questions in the problem statement, we first need to prepare the data for analysis and modeling. This is done by cleaning each dataset and then merging the cleaned datasets on the proper ids.

Each dataset was cleaned as follows.

1- For the portfolio dataset, the following was performed:

Copied the dataframe to wrangle the data.
Created binary columns from each channel in the ‘channels’ column.
Created binary columns from each offer type in the ‘offer_type’ column, then merged them with the cleaned original dataframe.
Dropped the ‘offer_type’ and ‘channels’ columns from the cleaned original dataframe.
Dropped duplicate rows, if any.
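The portfolio steps above can be sketched roughly like this. It is a minimal illustration on made-up rows, not the exact notebook code; the function name `clean_portfolio` is my own.

```python
import pandas as pd


def clean_portfolio(portfolio):
    """Return a cleaned copy with binary channel and offer-type columns."""
    df = portfolio.copy()
    # Join each channel list into "a|b|c", then split it into 0/1 columns.
    channels = df["channels"].str.join("|").str.get_dummies(sep="|")
    # One 0/1 column per offer type.
    offer_types = pd.get_dummies(df["offer_type"], dtype=int)
    df = pd.concat([df, channels, offer_types], axis=1)
    df = df.drop(columns=["offer_type", "channels"])
    return df.drop_duplicates().reset_index(drop=True)


# Toy rows, invented for the example.
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "informational"],
    "difficulty": [10, 0],
    "reward": [10, 0],
    "duration": [7, 3],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})
cleaned_portfolio = clean_portfolio(portfolio)
print(cleaned_portfolio)
```

Dropping the list-valued ‘channels’ column before `drop_duplicates` matters, because lists are not hashable and would make the duplicate check fail.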

2- For the profile dataset, the following was performed:

Copied the data frame to wrangle the data.
Converted the ‘became_member_on’ column type to date.
Created binary columns from each gender in the ‘gender’ column, then merged them with the cleaned original data.
Set missing data in the ‘age’ column to NaN instead of 118.
Categorized customers into their proper age category: [18–35] young adults, [36–55] adults, and 56 and above elderly.
Created binary columns from each age category in the ‘age_category’ column, then merged them with the cleaned original data.
Dropped duplicate rows, if any.
Renamed ‘id’ column to ‘user_id’.
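Here is a small sketch of the profile steps on invented rows. I am assuming the membership date is stored as an integer in YYYYMMDD form; the function name `clean_profile` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_profile(profile):
    """Return a cleaned copy with dates parsed and gender/age dummies added."""
    df = profile.copy()
    # Assumption: dates are integers such as 20170715.
    df["became_member_on"] = pd.to_datetime(
        df["became_member_on"].astype(str), format="%Y%m%d"
    )
    # 118 is the placeholder for a missing age, so turn it into NaN.
    df["age"] = df["age"].mask(df["age"] == 118)
    # Bins are right-inclusive: (17, 35], (35, 55], (55, inf).
    df["age_category"] = pd.cut(
        df["age"],
        bins=[17, 35, 55, np.inf],
        labels=["young-adult", "adult", "elderly"],
    )
    gender = pd.get_dummies(df["gender"], prefix="gender", prefix_sep="__", dtype=int)
    age_cat = pd.get_dummies(df["age_category"], dtype=int)
    age_cat.columns = age_cat.columns.astype(str)  # plain string column labels
    df = pd.concat([df, gender, age_cat], axis=1)
    return df.drop_duplicates().rename(columns={"id": "user_id"})


# Toy rows, invented for the example.
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "gender": ["F", "M", None],
    "age": [30, 60, 118],
    "became_member_on": [20170715, 20180101, 20160520],
    "income": [50000.0, 72000.0, None],
})
cleaned_profile = clean_profile(profile)
print(cleaned_profile)
```

Rows with a missing age get zeros in all three age-category columns, since there is no category to assign them to.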

3- For the transcript dataset, the following was performed:

Copied the dataframe to wrangle the data.
Created new ‘offer_id’ and ‘transaction_amount’ columns with null values.
Extracted the ‘transaction_amount’ column values from the ‘value’ column.
Extracted the ‘offer_id’ column values from the ‘value’ column.
Created binary columns from each transcript event type in the ‘event’ column, then merged them with the cleaned original dataframe.
Dropped the ‘value’ and ‘event’ columns from the cleaned original dataframe.
Dropped duplicate rows, if any.
Renamed the ‘person’ column to ‘user_id’.
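The transcript steps might look something like the sketch below. The dictionary keys (`offer id`, `offer_id`, `amount`) are my assumption about how the ‘value’ column is laid out, and the rows are made up; check the keys in your own copy of the data before relying on this.

```python
import pandas as pd


def clean_transcript(transcript):
    """Return a cleaned copy with offer ids, amounts and event dummies."""
    df = transcript.copy()
    # Assumed keys: "offer id" or "offer_id" for offers, "amount" for transactions.
    df["offer_id"] = df["value"].apply(lambda v: v.get("offer id", v.get("offer_id")))
    df["transaction_amount"] = df["value"].apply(lambda v: v.get("amount"))
    # One 0/1 column per event type.
    events = pd.get_dummies(df["event"], dtype=int)
    df = pd.concat([df, events], axis=1).drop(columns=["value", "event"])
    return df.drop_duplicates().rename(columns={"person": "user_id"})


# Toy rows, invented for the example.
transcript = pd.DataFrame({
    "person": ["u1", "u1", "u1"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "o1"}, {"amount": 9.5}, {"offer_id": "o1", "reward": 10}],
    "time": [0, 6, 6],
})
cleaned_transcript = clean_transcript(transcript)
print(cleaned_transcript)
```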

Then I merged the cleaned data sets to carry out the Data Analysis / Modeling.
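One way to do such a merge is sketched below, using left joins so that every transcript row is kept. The function name `merge_datasets`, the renaming of the portfolio ‘id’ column to ‘offer_id’, and the tiny frames are assumptions for illustration only.

```python
import pandas as pd


def merge_datasets(transcript, profile, portfolio):
    """Attach customer and offer attributes to every transcript row."""
    # Assumption: the portfolio key is still called "id" at this point.
    portfolio = portfolio.rename(columns={"id": "offer_id"})
    merged = transcript.merge(profile, on="user_id", how="left")
    return merged.merge(portfolio, on="offer_id", how="left")


# Toy frames, invented for the example.
transcript = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "offer_id": ["o1", None],  # the second row is a plain transaction
    "time": [0, 6],
})
profile = pd.DataFrame({"user_id": ["u1"], "income": [50000.0]})
portfolio = pd.DataFrame({"id": ["o1"], "difficulty": [10]})

df_starbucks_master = merge_datasets(transcript, profile, portfolio)
print(df_starbucks_master)
```

Transaction rows have no offer id, so their offer columns come out as NaN after the second join.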

Data Analysis / Modeling

In this section, I will attempt to answer the problem statement questions stated in the Business Understanding section, using the dataset that resulted from the Data Preparation section.

The questions are:

What are the distributions of the gender, age, and income of customers in general?
What is the distribution of the offer types alone and based on the gender and age category of customers that completed the offers?
Will a customer respond to an offer?

Data Analysis

To answer the first two questions, I will use data analysis and visualizations.

- What are the distributions of the gender, age, and income of customers in general?

The gender distribution in the profile dataset shows that male profiles outnumber female profiles, and that ‘other’ represents 1.4% of the values.

The age category distribution in the profile dataset shows that a little over half (51.9%) of the profiles belong to the elderly category, followed by 32.9% in the adults category, while the young-adults category represents 15.2% of the values.

The age distribution in the profile dataset shows that the average age is 54, with the majority of profile ages clustered below the average.

The income distribution in the profile dataset shows that the average income is ~65,400, with the majority of profile incomes clustered below the average.
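Figures like these can be computed with a few pandas calls. The sketch below uses four invented profiles, so its numbers are not the ones reported above.

```python
import pandas as pd

# Toy profiles, invented for the example.
profile = pd.DataFrame({
    "gender": ["M", "M", "F", "O"],
    "income": [40000.0, 60000.0, 80000.0, 60000.0],
})

# Share of each gender, as a percentage of all profiles.
gender_share = profile["gender"].value_counts(normalize=True).mul(100).round(1)

# Mean income and the fraction of profiles that fall below it.
mean_income = profile["income"].mean()
share_below_mean = (profile["income"] < mean_income).mean()

print(gender_share)
print(mean_income, share_below_mean)
```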

- What are the distributions of offer types alone and based on gender, age, and income of customers that completed the offers?

**Distributions of Offer Types in Completed Offers**

You can see that the most popular offers are BOGO (buy one get one free), followed closely by Discount offers and then Informational offers.

**Distributions of Offer Types Based on Age Category in Completed Offers**

As you can see, the elderly category has the highest offer completion counts, with BOGO offers being the most popular, followed closely by Discount offers and then Informational offers.

The adults come next in offer completion counts, followed by the young adults; both have the same offer type distribution as the elderly.

**Distributions of Offer Types Based on Gender in Completed Offers**

As you can see, males have the highest offer completion counts, with Discount offers being the most popular, followed closely by BOGO offers and then Informational offers.

Females come next in offer completion counts, followed by ‘other’; both have the same offer type distribution, with BOGO offers being the most popular, followed closely by Discount offers and then Informational offers.
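Counts like the ones behind these charts can be produced with a cross-tabulation of completed offers. This is a sketch on invented rows; the column names `gender`, `offer_type`, and `offer completed` are assumed to exist in the merged data.

```python
import pandas as pd

# Toy merged rows, invented for the example.
master = pd.DataFrame({
    "gender": ["M", "M", "F", "M", "F", "F"],
    "offer_type": ["discount", "discount", "bogo", "bogo", "bogo", "discount"],
    "offer completed": [1, 1, 1, 1, 1, 0],
})

# Keep only completed offers, then count offer types per gender.
completed = master[master["offer completed"] == 1]
counts_by_gender = pd.crosstab(completed["gender"], completed["offer_type"])

# Total completions per gender.
totals = counts_by_gender.sum(axis=1)
print(counts_by_gender)
print(totals)
```

Swapping `gender` for an age category column gives the corresponding table by age group.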

Data Modeling

To answer the last question, I will use data modeling.

To answer the question:

- Will a customer respond to an offer?

We will look into the transcript records where the customer has viewed the offer in the merged data set (df_starbucks_master) and use ‘offer completed’ as the target for our model, thus eliminating any transaction-related columns and rows.

We start by fetching the offer completed and offer viewed rows, then merge them together.

Then we remove data columns that are not related to the model or that have been replaced by other columns:

Removed data columns: gender, age_category, age, offer_type, event, offer received, offer viewed, transaction, transaction_amount, user_id, offer_id, and became_member_on.

Then we fill the missing values in the income column with the column mean.

The remaining features:

time: time in hours since start of test.
difficulty: minimum required spend to complete an offer.
duration: time for the offer to be open, in days
reward: reward given for completing an offer.
social: if the offer was sent through the social media channel or not.
web: if the offer was sent through the web channel or not.
mobile: if the offer was sent through the mobile channel or not.
email: if the offer was sent through the email channel or not.
bogo: if the offer is of type BOGO or not.
discount: if the offer is of type Discount or not.
informational: if the offer is of type Informational or not.
income: customer’s income.
gender__F: if the customer’s gender is female or not.
gender__M: if the customer’s gender is male or not.
gender__O: if the customer’s gender is other or not.
young-adult: if the customer’s age is between 18 and 35.
adult: if the customer’s age is between 36 and 55.
elderly: if the customer’s age is 56 and above.
offer completed: (the target column) if the customer has completed the offer or not.
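The preparation of the model inputs can be sketched as follows. This is one possible reading of the steps (keep viewed or completed rows, drop unused columns, fill income with its mean), shown on invented rows; the function name `build_model_frame` is hypothetical.

```python
import pandas as pd


def build_model_frame(master, drop_cols):
    """Return features X and target y for the offer-completion model."""
    # Keep only rows where an offer was viewed or completed.
    rows = master[(master["offer viewed"] == 1) | (master["offer completed"] == 1)]
    # Drop columns that are unrelated to the model or replaced by dummies.
    rows = rows.drop(columns=[c for c in drop_cols if c in rows.columns])
    # Fill missing income with the mean of the remaining rows.
    rows = rows.assign(income=rows["income"].fillna(rows["income"].mean()))
    X = rows.drop(columns=["offer completed"])
    y = rows["offer completed"]
    return X, y


# Toy merged rows, invented for the example.
master = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "offer received": [1, 0, 0, 0],
    "offer viewed": [0, 1, 0, 1],
    "offer completed": [0, 0, 1, 0],
    "transaction": [0, 0, 0, 0],
    "income": [50000.0, 50000.0, None, 70000.0],
    "difficulty": [10, 10, 5, 5],
})
drop_cols = ["user_id", "offer received", "offer viewed", "transaction"]
X, y = build_model_frame(master, drop_cols)
print(X)
print(y)
```

The row filter has to run before the drop, because ‘offer viewed’ is itself one of the dropped columns.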

Evaluation

In this section, I’m going to evaluate the results of the models.

Metrics:

As we have a simple classification problem, Accuracy is the metric I chose to evaluate my models. This is because we want to see how well our models are performing by evaluating the ratio of correctly classified predictions from the total predictions of the model.
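To make the metric concrete, here is a tiny worked example of accuracy as the ratio of correct predictions to total predictions. The labels are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Five made-up predictions, three of which are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)
print(acc)
```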

From the table above, you can see that KNeighbors, Decision Tree, Random Forest, and SVM show a large difference between train and test accuracy, which means these models are overfitting. So I chose the AdaBoost classifier, as it had the best results, and searched for the best parameters to improve it further.
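A comparison of this kind can be set up roughly as below. The sketch uses synthetic data from scikit-learn and default model settings, so its numbers will not match the ones from this project; it only shows how a gap between train and test accuracy can be measured.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared Starbucks features and target.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "KNeighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large positive gap suggests the model is overfitting.
    results[name] = (train_acc, test_acc, train_acc - test_acc)

for name, (train_acc, test_acc, gap) in results.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```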

After using Grid Search with AdaBoost I managed to get slightly better results as shown here:

Best parameters from AdaBoost:
{'algorithm': 'SAMME.R', 'learning_rate': 0.3, 'n_estimators': 200}
Training Accuracy:  0.81055199756
Testing Accuracy:  0.803964623361

That is almost a 0.03% increase in accuracy.
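For reference, a grid search over AdaBoost parameters can be sketched like this. The data is synthetic and the grid is a small made-up one, so the best parameters it finds are not the ones reported above. I left out the `algorithm` parameter because, as far as I know, recent scikit-learn releases have deprecated the 'SAMME.R' option.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared Starbucks features and target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative grid; the values are assumptions, not the grid used in the project.
param_grid = {
    "learning_rate": [0.3, 1.0],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# After fitting, the search object scores with the best estimator it found.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Training Accuracy:", train_acc)
print("Testing Accuracy:", test_acc)
```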

Conclusion

Finally, after analyzing the data, I found that:

BOGO offers are the most popular, with the highest count of completed offers, followed by Discount offers, so Starbucks might want to focus on sending offers of these types.
Males have a higher count of completed offers, so Starbucks might want to send more offers to them.
The elderly age category has a higher count of completed offers, so Starbucks might want to send more offers to them.

Also, the question I wanted to answer was whether a customer will respond to an offer or not. My technique for answering it went through the following steps:

Fetching the offer completed and offer viewed rows, then merging them together.
Removing data columns that are not related to the model or had been replaced by other columns.
Splitting the dataset into train and test datasets.
Testing several algorithms on the train and test datasets and assessing them based on model accuracy.
Choosing the best model, then refining its parameters using a grid search.

The result was that the algorithm with the best classification result was AdaBoost, with a test accuracy of 80.39%.

Improvements

To make our results even better, more customer information would help. I feel the information about customers was limited, since we only had age, gender, and income. To find which demographic groups respond best to offers, it would be better to have more customer features.