Published in Analytics Vidhya

How Starbucks users respond to offers.

Udacity’s Data Science Nanodegree Capstone Project.

Introduction :

This is an analysis of how Starbucks users respond to offers. For this analysis, Starbucks has provided a dataset of simulated data that mimics customer behavior on the Starbucks rewards mobile app.

Project Overview :

In this dataset, Starbucks sends offers to its users over a given period of time and records each event that happens in a log file. An offer can be informational, discount, or BOGO. Some users receive the same offer more than once, and some do not.

Types of offers that Starbucks sends.

  • Discount: a discount on a purchase of some amount
  • BOGO: buy one get one free offer
  • Informational: just an advertisement

They have provided three datasets for this purpose :

  1. portfolio - containing offer ids and metadata about each offer (duration, type, etc.)
  2. profile - demographic data for each customer.
  3. transcript - records for transactions, offers received, offers viewed, and offers completed.
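As a rough sketch, files like these can be read with pandas. The snippet assumes the files are line-delimited JSON (one record per line), which is an assumption about the file layout and not something stated above. The helper name `load_json_lines` and the sample ids are made up for illustration.

```python
import io

import pandas as pd


def load_json_lines(source):
    """Read a line-delimited JSON file (or file-like object) into a DataFrame.

    Assumption: each line of the file holds one JSON record.
    """
    return pd.read_json(source, orient="records", lines=True)


# Tiny in-memory stand-in for transcript.json; the ids are invented.
sample = io.StringIO(
    '{"person": "u1", "event": "offer received", "value": {"offer id": "o1"}, "time": 0}\n'
    '{"person": "u1", "event": "transaction", "value": {"amount": 4.5}, "time": 6}\n'
)
transcript = load_json_lines(sample)
print(transcript.shape)
print(transcript.dtypes)
```

With the real files, the same helper would be called with a path such as `load_json_lines("transcript.json")`, assuming the file sits in the working directory.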

Problem Statement :

The basic task in this analysis is to find which demographic group is more responsive to which offer. Since the three datasets contain information about each offer, each user, and each event that happened, I am going to combine them into one big dataset and do some feature engineering to extract features that are helpful for the analysis.

My strategy to achieve this is :

  • First, do some preliminary analysis of each dataset.
  • Process all three DataFrames into a form that is appropriate for analysis.
  • Extract some features.
  • Combine all three DataFrames into a big one.
  • Split the combined DataFrame based on each event.
  • Analyze the data with visualizations of the offer viewed and offer completed categories.

Once I have the two datasets of offers viewed and offers completed, I am going to split them into demographic groups based on age, income, gender, and the year in which the user became a member.

Metrics :

As the main task is to find which demographic group is more responsive towards which offer, the metric that I am going to use here is the ratio of offers completed to offers viewed (number of offers completed / number of offers viewed).

This metric gives the proportion of viewed offers that were completed. I’ll apply it to all the demographic groups and compare their results.
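To make the metric concrete, here is a minimal sketch that computes it per group from an event log. The function name `completion_ratio` and the sample rows are invented for illustration. The `event` column and its values follow the dataset description.

```python
import pandas as pd


def completion_ratio(events, group_col):
    """Percentage of viewed offers that were completed, per group."""
    viewed = events[events["event"] == "offer viewed"].groupby(group_col).size()
    completed = events[events["event"] == "offer completed"].groupby(group_col).size()
    # Groups with views but no completions get a count of zero.
    completed = completed.reindex(viewed.index, fill_value=0)
    return (completed / viewed * 100).round(2)


# Invented sample: 2 views and 1 completion for F, 3 views and 1 completion for M.
events = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M"],
    "event": [
        "offer viewed", "offer viewed", "offer completed",
        "offer viewed", "offer viewed", "offer viewed", "offer completed",
    ],
})
ratios = completion_ratio(events, "gender")
print(ratios)
```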

Description of DataSet:

The description of each column in the dataset is as follows :

1. portfolio.json

  • id (string) - offer id
  • offer_type (string) - a type of offer, i.e. BOGO, discount, informational
  • difficulty (int) - the minimum required to spend to complete an offer
  • reward (int) - the reward that is given for completing an offer
  • duration (int) - time for the offer to be open, in days
  • channels (list of strings)

2. profile.json

  • age (int) — age of the customer
  • became_member_on (int) — the date on which the customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer-id
  • income (float) — customer’s income

3. transcript.json

  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer-id
  • time (int) — time in hours since the start of the test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

Data Exploration and Visualization :

I did some preliminary analysis on each dataset, and the findings are listed in the points below.

DataSet 1 - portfolio.json :

Portfolio DataSet (figure 1)
  1. DataFrame has 10 rows and 6 columns.

2. There is no null value in the DataFrame.

3. DataType of each column is

DataTypes of each Column (figure 2)

4. Multiple channels have been used to send the offer.

The proportion of each offer (figure 3)

5. There are three types of offer, namely BOGO, informational, and discount.

6. The total number of offers is 10: 4 are discounts, 4 are BOGO (buy one get one free), and 2 are informational (advertisement).

7. The reward and difficulty columns contain values in dollars.

8. The unit of duration is days.

9. The channels, offer_type, and id columns contain categorical values.

The portfolio.json file has an id column that needs to be cleaned, as it contains values in the form of hashes. It also has a channels column that holds the different channels as a list, which needs to be cleaned as well.

DataSet 2 - profile.json :

Head of DataSet 2 (figure 4)
  1. There are 17000 rows in the DataFrame.

2. There are 5 columns in the DataFrame.

3. The columns that contain categorical values are gender and id.

4. The columns that contain missing values are gender, income, and age (missing age values are encoded as 118).

5. DataType and non-null count of each column are shown below.

DataTypes of each column. (figure 5)

6. The maximum value of age is 118 and the minimum value of age is 18.

7. The maximum income is 120,000 and the minimum income is 30,000.

8. The proportion of each gender is Male - 57.2%, Female - 41.3%, and Others - 1.4%.

(figure 5)

The id column contains hash values that need to be mapped to integers so that the data becomes easier to work with.

The data type of the became_member_on column is int64, which needs to be changed to a date data type.

Some users have an age of 118, None in the gender column, and NaN as income. According to the dataset, users who have not given their age are encoded as 118. I consider these rows bad data, as only the user id hash has been provided, and I am not going to include them in further analysis. However, I am not removing them right now, since I have to map them to the transcript DataFrame, and it will be a lot easier to remove them after combining the two data frames.

DataSet 3 - transcript.json :

  1. The shape of the transcript DataFrame is 306534 rows and 4 columns.

2. There are four columns in the data frame.

3. The data type of each column is

DataTypes of each column (figure 6)

4. There are no null values in the dataset.

5. The columns that contain categorical values are [‘person’, ‘event’, ‘value’]

6. The count of each event in the transcript DataFrame is

Count of each event (figure 7)

Here, the person column needs to be cleaned, as it holds hash values that should be changed to integers to make the rest of the work easier.

The value column contains a dictionary with offer_id, amount, or reward, depending on the event that happened. Separate columns need to be created for these values.

Data Preprocessing:

1. Portfolio DataFrame :

(figure 8)

Steps that I have followed.

  • Map each hash id to an integer.
  • Create a column for each channel by using one-hot encoding.
  • Drop the channels column.
  • Rename the id column to offer_id and reward to offer_reward.

After the above steps, DataFrame looks like the image below.

(figure 9)
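The portfolio steps above can be sketched roughly as follows. This is a minimal illustration on invented rows, not the project's actual code. The hash strings are placeholders, and the `channel_` prefix is taken from the merged column names listed later in the article.

```python
import pandas as pd

# Invented stand-in for the portfolio DataFrame; hashes are placeholders.
portfolio = pd.DataFrame({
    "id": ["hash_a", "hash_b", "hash_c"],
    "offer_type": ["bogo", "informational", "discount"],
    "reward": [10, 0, 2],
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email"],
        ["web", "email", "mobile"],
    ],
})

# Map each hash id to an integer (1, 2, 3, ... in order of appearance).
id_map = {h: i for i, h in enumerate(portfolio["id"].unique(), start=1)}
portfolio["id"] = portfolio["id"].map(id_map)

# One-hot encode the channels list: one 0/1 column per channel.
for channel in ["email", "mobile", "social", "web"]:
    portfolio[f"channel_{channel}"] = portfolio["channels"].apply(
        lambda chans, c=channel: int(c in chans)
    )

# Drop the original list column and rename as described in the steps.
portfolio = portfolio.drop(columns="channels").rename(
    columns={"id": "offer_id", "reward": "offer_reward"}
)
print(portfolio)
```

Keeping `id_map` around is useful, because the same mapping is needed later to translate the offer hashes in the transcript.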

2. Profile DataFrame :

(figure 10)

Steps that I have followed.

  • Map each hash id to an int.
  • Divide the age into age groups with an interval of 10 years.
  • Change the data type of the became_member_on column to datetime.
  • Rename the id column to person_id.
  • Create income groups with intervals of $30,000.
  • Create a column for the year in which the user became a member.
(figure 11)
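A minimal sketch of the profile steps above, on invented rows. Two things are assumptions here: that became_member_on stores dates as YYYYMMDD integers, and that each bin includes its lower edge. The article does not specify either.

```python
import pandas as pd

# Invented stand-in for the profile DataFrame.
profile = pd.DataFrame({
    "id": ["hash_x", "hash_y", "hash_z"],
    "age": [25, 55, 118],  # 118 marks a missing age
    "became_member_on": [20170212, 20180726, 20150101],  # assumed YYYYMMDD
    "income": [45000.0, 95000.0, None],
    "gender": ["M", "F", None],
})

# Map each hash id to an int and rename the column.
id_map = {h: i for i, h in enumerate(profile["id"].unique(), start=1)}
profile["id"] = profile["id"].map(id_map)
profile = profile.rename(columns={"id": "person_id"})

# Age groups of 10 years: 10-19, 20-29, ..., 100-109 (118 falls outside).
age_edges = list(range(10, 111, 10))
age_labels = [f"{low}-{low + 9}" for low in age_edges[:-1]]
profile["age_group"] = pd.cut(
    profile["age"], bins=age_edges, labels=age_labels, right=False
)

# Convert the integer date to datetime and extract the membership year.
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
)
profile["become_member_in"] = profile["became_member_on"].dt.year

# Income groups of $30,000; the upper edge is 120001 so that 120,000 is kept.
profile["income_group"] = pd.cut(
    profile["income"],
    bins=[30000, 60000, 90000, 120001],
    labels=["Low", "Average", "High"],
    right=False,
)
print(profile)
```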

3. Transcript DataFrame :

(figure 12)

Steps that I have followed.

  • Map each hash user id to an int.
  • Rename person column to user_id.
  • Create separate columns from the “value” column, depending on the event.
  • Remove the value column.

Now the DataFrame looks like this.

(figure 13)
  • Map each offer id hash to its int id.

Eventually, the data frame looks like the image below.

(figure 14)
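The transcript steps above might look something like this sketch. The helper name `expand_value` and the sample rows are invented. The sketch also assumes that the offer id key in the value dictionary may be spelled either `offer_id` or `offer id`, which is an assumption about the raw data.

```python
import pandas as pd

# Invented stand-in for the transcript DataFrame.
transcript = pd.DataFrame({
    "person": ["hash_x", "hash_x", "hash_x", "hash_x"],
    "event": ["offer received", "offer viewed", "transaction", "offer completed"],
    "time": [0, 6, 12, 12],
    "value": [
        {"offer id": "hash_a"},
        {"offer id": "hash_a"},
        {"amount": 4.5},
        {"offer_id": "hash_a", "reward": 2},
    ],
})


def expand_value(frame):
    """Split the value dictionaries into offer_id, amount, and reward columns."""
    out = frame.rename(columns={"person": "user_id"})
    # Assumption: the key is either "offer_id" or "offer id".
    out["offer_id"] = out["value"].apply(
        lambda v: v.get("offer_id", v.get("offer id"))
    )
    out["amount"] = out["value"].apply(lambda v: v.get("amount"))
    out["reward"] = out["value"].apply(lambda v: v.get("reward"))
    return out.drop(columns="value")


cleaned = expand_value(transcript)
print(cleaned)
```

The hash-to-int mapping for `user_id` and `offer_id` would then reuse the same dictionaries that were built for the profile and portfolio frames.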

Merging the three data frames into one big data frame.

(figure 15)

Columns that this data frame contains are :

['user_id', 'event', 'time', 'offer_id', 'amount','reward','gender',
'age', 'became_member_on', 'income', 'age_group', 'income_group',
'become_member_in', 'offer_reward', 'difficulty', 'duration',
'offer_type', 'channel_email', 'channel_mobile', 'channel_social',
'channel_web']
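One way to perform such a merge is with two left joins, sketched below on invented rows. The preprocessing steps rename the profile id to person_id, while the merged frame uses user_id. This sketch simply assumes that the transcript and profile frames share the key name `user_id` at merge time.

```python
import pandas as pd

# Invented, already-cleaned stand-ins for the three frames.
transcript = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["offer viewed", "transaction", "offer viewed"],
    "time": [0, 6, 0],
    "offer_id": [1.0, None, 2.0],  # transactions carry no offer id
})
profile = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["M", "F"],
    "age_group": ["20-29", "50-59"],
})
portfolio = pd.DataFrame({
    "offer_id": [1, 2],
    "offer_type": ["bogo", "discount"],
    "duration": [7, 10],
})

# Left joins keep every transcript row, even when no offer matches.
merged = (
    transcript
    .merge(profile, on="user_id", how="left")
    .merge(portfolio, on="offer_id", how="left")
)
print(merged)
```

Using `how="left"` keeps transaction rows, which have no offer id, so they are still available for later analysis.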

Implementation :

For this analysis, I opted to present my findings through data visualization.

I am interested in the people who completed an offer out of those who viewed it.

The approach I am going to use is to divide the final merged dataset into four different DataFrames based on the value in the event column, namely offer received, offer viewed, transaction, and offer completed.

Then, I use the offer viewed and offer completed data frames to get the counts of offers viewed and offers completed by the users in each demographic group.

Finally, using the metric (number of offers completed / number of offers viewed), I analyze the results and plot visualizations of them.
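Splitting the merged frame by event can be sketched as below. The sample rows are invented, and the dictionary-of-frames layout is just one convenient choice.

```python
import pandas as pd

# Invented stand-in for the merged DataFrame.
merged = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2],
    "event": [
        "offer received", "offer viewed", "transaction",
        "offer completed", "offer viewed",
    ],
    "age_group": ["20-29", "20-29", "20-29", "20-29", "50-59"],
})

# One DataFrame per distinct value in the event column.
frames = {
    event: frame.reset_index(drop=True)
    for event, frame in merged.groupby("event")
}
offer_viewed = frames["offer viewed"]
offer_completed = frames["offer completed"]

# Counts per demographic group, ready for the completed / viewed ratio.
print(offer_viewed.groupby("age_group").size())
print(offer_completed.groupby("age_group").size())
```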

Complications occurred during the coding process :

The main complication I faced during the analysis was cleaning the offer_completed data frame, as it contained users who had not viewed the offer but completed it by making some transactions.

In theory, these people were not influenced by the offer, and they had to be removed because they might affect the end results.

Above is the code that I used to remove the users who did not view the offer.
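One possible way to write such a filter is sketched here. It is not necessarily the code used in the project. It keeps a completion only if the same user viewed the same offer at or before the completion time, and it uses the earliest view per user and offer, which is a simplification when an offer is sent to the same user more than once. The function name and sample rows are invented.

```python
import pandas as pd


def completed_after_view(viewed, completed):
    """Keep completions that were preceded by a view of the same offer."""
    first_view = (
        viewed.groupby(["user_id", "offer_id"], as_index=False)["time"]
        .min()
        .rename(columns={"time": "first_view_time"})
    )
    # The inner join drops completions of offers the user never viewed.
    joined = completed.merge(first_view, on=["user_id", "offer_id"], how="inner")
    # Drop completions that happened before the first view.
    kept = joined[joined["first_view_time"] <= joined["time"]]
    return kept.drop(columns="first_view_time").reset_index(drop=True)


# Invented sample: user 1 views then completes, user 2 completes before
# viewing, and user 3 completes without ever viewing.
viewed = pd.DataFrame({
    "user_id": [1, 2], "offer_id": [1, 1], "time": [6, 30],
})
completed = pd.DataFrame({
    "user_id": [1, 2, 3], "offer_id": [1, 1, 2], "time": [12, 24, 10],
})
influenced = completed_after_view(viewed, completed)
print(influenced)
```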

Refinement :

Initially, I chose to plot my whole analysis based on the ratio of offers completed to offers viewed for the different demographic groups.

However, I think an analysis broken down by offer type gives more information.

So, instead of just calculating the ratio of offers completed to offers viewed for each demographic group, I am now going to calculate the results for each demographic group and each type of offer (i.e. informational, BOGO, discount), as this gives deeper insight into the type of offer people are interested in.
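The refined calculation can be sketched as a two-level group-by, with one row per demographic group and one column per offer type. The function name and the sample rows are invented for illustration.

```python
import pandas as pd


def ratio_by_group_and_offer(viewed, completed, group_col):
    """Completed / viewed percentage per demographic group and offer type."""
    viewed_counts = viewed.groupby([group_col, "offer_type"]).size()
    completed_counts = completed.groupby([group_col, "offer_type"]).size()
    # Combinations that were viewed but never completed count as zero.
    completed_counts = completed_counts.reindex(viewed_counts.index, fill_value=0)
    ratio = (completed_counts / viewed_counts * 100).round(2)
    return ratio.unstack("offer_type")


# Invented sample: two BOGO views with one completion, and one
# informational view with no completion.
viewed = pd.DataFrame({
    "age_group": ["20-29", "20-29", "20-29"],
    "offer_type": ["bogo", "bogo", "informational"],
})
completed = pd.DataFrame({
    "age_group": ["20-29"],
    "offer_type": ["bogo"],
})
table = ratio_by_group_and_offer(viewed, completed, "age_group")
print(table)
```

The resulting table can be passed straight to a grouped bar plot, for example with `table.plot(kind="bar")`.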

Analysis of the combined DataSet.

Based on age group :

The above graph shows how many people were influenced by the offers. It is a stacked bar plot in which the dark blue bar represents the count of people who viewed the offer and the light blue bar represents the count of those who completed the offer.

From the graph, it turns out that people in the age group 50–59 have completed the most offers.

Ratio (offer completed/ offer viewed)

Here, on the left, the table shows the ratio of offers completed to offers viewed for each age group.

The number of views for the age group 20 - 29 is very low (i.e. 44.66), but their ratio of offers completed to offers viewed is high.

The ratio of offers completed to offers viewed is low for the age group 10 - 19.

The ratio of offer completed to offer viewed based on the age group of each type of offer :

Note : As no one has completed the informational offer its percentage is zero.

The plot below represents how many people have viewed each type of offer.

From the above plot, it turns out that people of all age groups view the BOGO (buy one get one free) offer the most, followed by the discount offer, with the informational offer last.

The plot below represents how many people have completed each type of offer.

In the above graph, people of all age groups complete the discount offer more often than the BOGO offer. The most interesting thing I noticed is that no one has completed the informational offer. Although it is an advertisement rather than an offer, people still were not influenced by it.

Comparing the plots of completed and viewed offers, I noticed one more thing: people of all age groups view the BOGO offer the most, yet the bar for discount offers is taller than the bar for BOGO offers in the completed offer plot (i.e. in each age group, people complete discount offers more than BOGO offers).

Based on gender :

The plot shows how many people were influenced by the offers based on gender. The dark blue bar represents the viewed offers, and the light blue bar represents the completed offers.

Count of offers viewed based on gender.

The table on the left shows the count of offers viewed based on gender, (Male - 28301, Female - 28301, Others - 773).

The table on the left shows the count of offers completed based on gender, (Male - 13814, Female - 12654, others - 453).

The ratio of offer completed to offer viewed based on gender :

Here, Females have a high conversion rate of 60.87%, Males have a conversion rate of 48.8%, and Others have 58.6%.

Note : By conversion rate, I mean that people saw the offer, made some transactions, and completed the offer. The ratio of the number of offers completed to the number of offers viewed is therefore the conversion rate, as these people were influenced by the offer.

The ratio of offer completed to offer viewed based on the gender of each type of offer :

Note : As no one has completed the informational offer its percentage is zero.

The above plots show the distribution of offer viewed and offer completed based on the type of offer.

Based on income :

Here, I have divided the income group based on the following intervals.

  • Low income - $30000 to $60000
  • Average income - $60000 to $90000
  • High income - $90000 to $120000

People with low and average income viewed a similar number of offers, but the completion rate of low-income people is lower than that of average-income people.

Count of offers viewed based on income group.

Here, on the left, is the count of offers viewed based on income group (Average - 21239, High - 7643, Low - 20978).

Here, on the left, is the count of offers completed based on income group (Average - 12306, High - 4980, Low - 9635).

The ratio of offer completed to offer viewed based on income group :

Here, people with low income have a conversion rate of 45.9%, people with average income have 57.9%, and people with high income have 65.15%.
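As a quick arithmetic check, the conversion rates can be recomputed from the viewed and completed counts quoted above. The results agree with the quoted percentages to within rounding.

```python
# Counts per income group, copied from the tables quoted above.
viewed = {"Low": 20978, "Average": 21239, "High": 7643}
completed = {"Low": 9635, "Average": 12306, "High": 4980}

# Conversion rate = offers completed / offers viewed, as a percentage.
rates = {
    group: round(completed[group] / viewed[group] * 100, 2)
    for group in viewed
}
for group, rate in rates.items():
    print(f"{group}: {rate}%")
```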

The ratio of offer completed to offer viewed based on income group of each type of offer :

Note : As no one has completed the informational offer its percentage is zero.

The above bar plot shows that the BOGO offer is the most popular in each income group, the discount offer is second, and informational offers are third.

Based on membership year :

The above plot shows that people become inactive as time passes, as both the offer view rate and the offer completion rate decrease for older members.

People who became members in 2017 have completed the most offers.

For the year 2018, the figure shows a short bar, possibly because the data was collected while 2018 was still in progress and only a few people had registered by that time.

Count of offers viewed based on membership year.

The ratio of offer completed to offer viewed based on membership year:

The ratio of offer completed to offer viewed based on membership year of each offer type:

Note : As no one has completed the informational offer its percentage is zero.

Out of all the offers, BOGO (buy one get one free) is the one that people have viewed the most.

Justification :

The whole point of my analysis was to measure the influence that offers have on people’s purchasing habits, that is, whether people actually go on to complete an offer after seeing it.

The metric that I chose to answer this was the ratio of offers completed to offers viewed.

The results that I got from the analysis represent the proportion of viewers who completed the offer.

This clearly reflects the interest of the different demographic groups: the higher the ratio, the more interested they are in completing the offer.

Conclusion :

Reflection :

In this article, I’ve presented my observations based on the analysis of the Starbucks dataset.

In this project, I tried to analyze how people respond to each type of offer.

I started with a preliminary analysis of the three datasets.

Then, I processed all the data frames into a form appropriate for analysis.

Then, I extracted the features that are important for the analysis.

Then, I combined all the data frames into one big data frame.

Finally, I analyzed the data with visualizations of the offer viewed and offer completed categories.

Results that look interesting to me are :

  • The most viewed offer is BOGO, but the most completed offer is Discount.
  • No one has completed the Informational offer.
  • As time passes, people become inactive towards offers.

For me, the analysis part was interesting, as it gave me some results that I was not expecting.

Improvement :

In this analysis, I opted to present my findings through visualization. One improvement would be to analyze the transaction data frame: how many transactions people in the different demographic groups make to complete an offer, and over what time interval. This could give information about what happens if the duration of the offers is increased: will it result in more offer completions?

The analysis is available on my GitHub profile here.
