Udacity Data Science Capstone: Starbucks Project
Section 1: Project Definition
High-Level Project Overview: The purpose of this project is to use customer data to help Starbucks become more profitable. The basic task is to identify which demographic groups of customers are most responsive to each type of offer, and how best to present each type of offer, so that Starbucks knows which customers to send each offer to. The data is simulated to mimic customer behavior on the Starbucks rewards mobile app.
Description of Input Data: The 3 datasets are:
- portfolio— contains information about the types of offers offered by Starbucks
- profile — contains demographic data for each customer
- transcript — contains transactional data (i.e. records for offers received, offers viewed, transactions, and offers completed)
Users have the following attributes:
· age (int) — age of the customer
· became_member_on (int) — date when customer created an app account
· gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
· income (float) — customer’s income
There are 3 offer types:
· bogo
· informational
· discount
Problem Statement: Identify which offer type should be presented to each demographic group to be most profitable for the business.
Strategy for solving the problem: I take a heuristic approach to determine which offer to send to each customer.
Metrics: I will compare the percentage of customers who respond to each offer across demographic groups (e.g. if 60 percent of 25-year-old male customers responded to offer A versus 40 percent of the same demographic to offer B, send offer A).
A user responds to an offer when they view the offer and then make a transaction that completes it.
It is important to note that for informational offers, there is no “offer completed” event associated with it.
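The response definition above can be sketched in pandas. This is a minimal illustration on a toy event log, not the project's actual code: the column names (person, offer_id, event, time) and the helper first_time are assumptions, and the sketch ignores details such as the same offer being sent to a customer more than once.

```python
import pandas as pd

# Toy event log; column names and values are illustrative assumptions.
events = pd.DataFrame(
    {
        "person": ["a", "a", "a", "b", "b", "c", "c"],
        "offer_id": ["o1"] * 7,
        "event": [
            "offer received", "offer viewed", "offer completed",  # a: viewed, then completed
            "offer received", "offer completed",                  # b: completed without viewing
            "offer received", "offer viewed",                     # c: viewed, never completed
        ],
        "time": [0, 6, 12, 0, 18, 0, 24],
    }
)


def first_time(df, event_name):
    """Earliest time of the given event for each (person, offer_id) pair."""
    subset = df[df["event"] == event_name]
    return subset.groupby(["person", "offer_id"])["time"].min()


viewed = first_time(events, "offer viewed").rename("viewed_at")
completed = first_time(events, "offer completed").rename("completed_at")

# Start from the pairs that viewed the offer; completed_at is NaN if never completed.
pairs = viewed.to_frame().join(completed, how="left")

# A response means the completion happened at or after the view.
pairs = pairs.assign(responded=pairs["completed_at"] >= pairs["viewed_at"])
```

Customer b completed the offer without viewing it, so they are not counted at all, while customer c viewed the offer but did not respond.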
Discussion of the expected solution: Since I am taking a heuristic approach, I will be using visualizations and percentage values associated with the visualizations to determine which offers to present to each demographic.
Section 2: Analysis
Exploratory Data Analysis: In order to familiarize myself with the data, I did some data exploration and took a look at the distributions of various columns of interest:
One issue I noticed in the profile dataset was that there were many erroneous rows with age=118, gender=None, and income=NaN. I removed these rows from the dataset, and the age graph on the top left no longer contains the outliers. This removal is also part of the data preprocessing step below.
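A minimal sketch of that filter, assuming a profile table with the columns listed earlier plus an id column (the sample rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy profile table; the rows are invented for illustration.
profile = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3", "p4"],
        "age": [34, 118, 61, 118],
        "gender": ["F", None, "M", None],
        "income": [52000.0, np.nan, 88000.0, np.nan],
    }
)

# Flag rows that carry the placeholder age together with missing gender and income.
is_placeholder = (
    (profile["age"] == 118)
    & profile["gender"].isna()
    & profile["income"].isna()
)

clean_profile = profile[~is_placeholder].reset_index(drop=True)
```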
Section 3: Methodology
Data Preprocessing: To begin working with the data, I combined the 3 datasets using the offer ID and person ID as keys. As described above, I removed the outliers from the data.
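One way such a combination can look is two left merges onto the transcript. The column names here (person, offer_id, and an id column in both profile and portfolio) are assumptions for the sketch, and it presumes the offer ID is already available as its own column in the transcript.

```python
import pandas as pd

# Toy tables; column names are assumptions for this sketch.
transcript = pd.DataFrame(
    {
        "person": ["p1", "p2"],
        "offer_id": ["o1", "o2"],
        "event": ["offer viewed", "offer completed"],
    }
)
profile = pd.DataFrame({"id": ["p1", "p2"], "age": [34, 61]})
portfolio = pd.DataFrame({"id": ["o1", "o2"], "offer_type": ["bogo", "discount"]})

# Rename each table's id column to match the transcript key, then left-merge
# so every transcript row is kept.
merged = (
    transcript
    .merge(profile.rename(columns={"id": "person"}), on="person", how="left")
    .merge(portfolio.rename(columns={"id": "offer_id"}), on="offer_id", how="left")
)
```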
Next, to make the channels column from the portfolio dataset and the value column from the transcript dataset easier to work with, I expanded each of them into multiple columns. For the channels column this was done with the get_dummies function.
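As a hedged sketch of the channels expansion: assuming each cell of channels holds a list of channel names (the values below are invented), the lists can be exploded to one row per channel, passed to get_dummies, and collapsed back to one row per offer.

```python
import pandas as pd

# Toy portfolio; the list-valued channels column is an assumption about the layout.
portfolio = pd.DataFrame(
    {
        "id": ["o1", "o2", "o3"],
        "channels": [["email", "mobile", "social"], ["web", "email"], ["email"]],
    }
)

# One row per (offer, channel), keeping the original row index.
exploded = portfolio["channels"].explode()

# One indicator column per channel, then collapse back to one row per offer.
channel_dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

portfolio_wide = portfolio.drop(columns="channels").join(channel_dummies)
```

Each offer ends up with a 0/1 column per channel, which is easier to filter and group on than a list.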
For the value column, this was done by calling apply(pd.Series) on the column.
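A minimal sketch of the value expansion, assuming each cell is a dictionary. The keys used below (offer_id, amount, reward) are illustrative assumptions, not a statement of the real keys.

```python
import pandas as pd

# Toy transcript; the dictionary keys are assumptions for this sketch.
transcript = pd.DataFrame(
    {
        "person": ["p1", "p1", "p2"],
        "event": ["offer received", "transaction", "offer completed"],
        "value": [
            {"offer_id": "o1"},
            {"amount": 9.5},
            {"offer_id": "o1", "reward": 5},
        ],
    }
)

# Each dictionary becomes a row; keys missing from a dictionary become NaN.
value_columns = transcript["value"].apply(pd.Series)

transcript_wide = pd.concat(
    [transcript.drop(columns="value"), value_columns], axis=1
)
```

On a large table apply(pd.Series) can be slow; pd.json_normalize(transcript["value"].tolist()) is a commonly used alternative that produces a similar wide table.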
I also created quartile buckets for the income and age columns to group customers more easily by demographic. The groupings are as follows:
Age (in years):
- young: (17.999, 41.0]
- middle_aged: (41.0, 55.0]
- old: (55.0, 66.0]
- oldest: (66.0, 101.0]
Income (in $):
- low_income: (29999.999, 48000.0]
- medium_income: (48000.0, 62000.0]
- high_income: (62000.0, 78000.0]
- highest_income: (78000.0, 120000.0]
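Interval notation like (17.999, 41.0] is what pandas prints for quantile-based bins, so the buckets above can be sketched with pd.qcut. The ages below are made up, so the resulting edges differ from the ones listed; the same call with the income labels would bucket income.

```python
import pandas as pd

# Invented ages, listed in ascending order for readability.
ages = pd.Series([18, 25, 33, 41, 47, 52, 55, 60, 66, 72, 85, 101])

age_labels = ["young", "middle_aged", "old", "oldest"]

# q=4 splits the values at the quartiles; retbins=True also returns the bin edges.
age_group, age_edges = pd.qcut(ages, q=4, labels=age_labels, retbins=True)
```

Because the edges are quartiles of the data, each bucket holds roughly a quarter of the customers, which keeps the groups comparable in size.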
In addition, I used the existing gender values but excluded “O” (other) because the sample was too small, so gender is either “M” or “F” in the analysis.
In order to analyze the data, I split the final dataset into 3 by filtering on the offer_type. The 3 separate datasets are for the 3 types of offers: BOGO, discount, and informational.
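The split can be sketched as a dictionary of DataFrames keyed by offer type. The table name combined and its columns are assumptions for this example.

```python
import pandas as pd

# Toy combined table; names and rows are invented for illustration.
combined = pd.DataFrame(
    {
        "person": ["p1", "p2", "p3", "p4"],
        "offer_type": ["bogo", "discount", "informational", "bogo"],
        "event": ["offer viewed", "offer viewed", "offer viewed", "offer completed"],
    }
)

# One DataFrame per offer type.
by_offer_type = {
    name: group.reset_index(drop=True)
    for name, group in combined.groupby("offer_type")
}

bogo = by_offer_type["bogo"]
```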
On initial analysis, I realized that looking at one column at a time (e.g. only age) was not sufficient because it ignores the other characteristics. For example, if the univariate distributions showed that young people, females, and low-income people each have the highest offer response rate, this would not necessarily mean that young females with low income have the highest response rate. It is important to consider all of the factors at the same time.
To refine the analysis, I examined age, income, and gender together. Because age has 4 categories, income has 4 categories, and gender has 2 categories, this led to 4 x 4 x 2 = 32 categories to examine.
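Grouping on all three columns at once can be sketched as follows. The boolean viewed and completed columns, and the group column names, are assumptions about how the per-offer rows might be laid out.

```python
import pandas as pd

# Toy rows, one per offer sent to a customer; values are invented.
offers = pd.DataFrame(
    {
        "age_group": ["young", "young", "young", "old", "old", "old"],
        "income_group": ["low_income"] * 3 + ["high_income"] * 3,
        "gender": ["F", "F", "F", "M", "M", "M"],
        "viewed": [True, True, True, True, True, False],
        "completed": [True, True, False, True, False, False],
    }
)

# Only viewed offers count toward the response rate.
viewed = offers[offers["viewed"]]

# The mean of a boolean column is the share of True values, i.e. the response rate.
response_rate = (
    viewed.groupby(["age_group", "income_group", "gender"])["completed"]
    .mean()
    .rename("response_rate")
)
```

With the real buckets this yields one response rate for each of the 32 demographic combinations that appears in the data.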
Section 4: Results
The following graphs visualize the offer response rate. The blue shows the total number of offers viewed, while the orange shows the number of those offers that were completed (i.e. the orange is a fraction of the blue). The more orange in a bar, the higher the offer response rate. There are 8 graphs for each of BOGO, discount, and informational offers. The actual percentage values of the bars can be seen in “Section 5: Conclusion.”
BOGO: For example, we can see that in general, low_income earners had higher offer response rates while high and highest income earners had lower offer response rates.
Discount: Here we see that younger people seem to have higher offer response rates.
Informational: Compared to BOGO and discount, there is a lot more blue in the bars, suggesting that fewer people respond to informational offers.
Section 5: Conclusion
The following image summarizes the graphs from above:
The “Max Response Rate” shows us the highest response rate for the given demographic and the offer type (i.e. BOGO, discount, or informational) that was associated with it. There are 32 total demographic categories. Some highlights include:
- Across the 32 demographics, the offer type with the highest response rate was discount for 18, BOGO for 13, and informational for 1. This tells us informational offers are not very popular compared to discount and BOGO.
- The overall highest response rate of 88% came from young females with low income.
- The overall lowest response rate of 67% was shared by highest income males who are young, middle_aged, and old. In general, it seems high income customers are less inclined to respond to offers.
In conclusion, the above table tells us which offer type to send to each demographic type. For example, the only time you would send an informational offer is to young females with highest income.
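A summary table of this kind can be sketched by pivoting the response rates and taking the best offer type per row. The demographic labels and the rates below are made up purely to show the mechanics; they are not the project's results.

```python
import pandas as pd

# Invented response rates in long format: one row per (demographic, offer type).
rates = pd.DataFrame(
    {
        "demographic": ["group_1"] * 3 + ["group_2"] * 3,
        "offer_type": ["bogo", "discount", "informational"] * 2,
        "response_rate": [0.5, 0.7, 0.2, 0.6, 0.4, 0.1],
    }
)

# Wide table: one row per demographic, one column per offer type.
table = rates.pivot(index="demographic", columns="offer_type", values="response_rate")

# For each demographic, the offer type with the highest rate and that rate.
summary = pd.DataFrame(
    {
        "best_offer": table.idxmax(axis=1),
        "max_response_rate": table.max(axis=1),
    }
)

# How often each offer type comes out on top.
wins = summary["best_offer"].value_counts()
```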
Takeaways
One takeaway I learned is how important, tricky, and time-consuming the data cleaning and processing step is. The data exploration is vital to familiarize yourself with the data and get an understanding of it. This will allow you to recognize weird values/outliers and clean the data. Getting an understanding of the data will also allow you to perform reconciliation checks on the data.
Improvements: A possible improvement to this analysis would be to consider more parts of the data. Due to time constraints, I was not able to examine the channels column, but it would be interesting to see whether the channel through which an offer was communicated had an impact on whether the offer was viewed or completed.
Acknowledgements: I would like to thank Udacity for giving me the opportunity to work on this project!