DATA SCIENCE STARBUCKS CAPSTONE PROJECT — UDACITY

Yash Singh
Aug 26, 2020

Introduction :

I want to thank Udacity for the opportunity to work on a challenging project using Starbucks data.

The aim of my project is to find out which customers respond to which types of offers, and how they respond. Starbucks provides 3 types of offers: BuyOneGetOne (BOGO), Discount and Informational. How a customer responds to an offer depends on the customer's demographics, which include income, age, gender and the date of becoming a member.

The response of a customer to an offer also depends on properties of the offer itself, such as reward, difficulty, time and duration.

The data set contains 3 files :

  1. Portfolio
  2. Transcripts
  3. Profile

DATA CLEANING and Modification

Cleaning the data is a pretty tricky part. To gather useful data and filter out unnecessary information that does not serve our target, I defined several functions that clean and store the data properly. NaN values and any unusual values are removed from the columns. The next task is to assess which factors to take into consideration when modelling the data. While training on various factors, I found that including the reward and difficulty columns made the models overfit, giving an accuracy of 99% on all algorithms, so I had to drop those columns.

DATA EXPLORATION :

Data exploration was done on the profile data, comparing the following:

Males vs Females vs Others (count)
Date of becoming a member (counts). Year 2017 had the highest new member count.
Age distribution and income distribution of customers. The median income is approximately in the 70–75k range, whereas the median age of customers is 50–60.

Correlation plot between the various factors after cleaning the data, to find which factors relate most strongly to completed BOGO and Discount offers.

From the above heat-map we can see that:

  1. Duration is highly correlated with BOGO and Discount offers, that is, the longer the duration, the more likely the customer is to respond to the offer.
  2. Customers with higher income are more likely to respond to a BOGO offer than to a Discount offer.
  3. The correlation between BOGO offers and reward is pretty high.
  4. Customers prefer to choose a BOGO offer if there is a reward associated with it.
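The correlations behind a heat-map like this can be computed with pandas. The following is only a minimal sketch: the rows are made-up toy values, and the column names (other than new_bogo and new_discount, which are defined later in this post) are assumptions about how the cleaned data is laid out. Plotting is left out.

```python
import pandas as pd

# Toy rows standing in for the cleaned, merged data; all values are made up.
new_df = pd.DataFrame({
    "duration": [5, 7, 10, 7, 5, 10],
    "income": [40000, 85000, 90000, 52000, 61000, 78000],
    "reward": [5, 10, 10, 2, 3, 5],
    "new_bogo": [0, 1, 1, 0, 0, 1],
    "new_discount": [0, 0, 0, 1, 1, 0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = new_df.corr()

# Keep only how each factor relates to the two completed-offer columns.
target_corr = corr[["new_bogo", "new_discount"]].drop(
    index=["new_bogo", "new_discount"]
)
print(target_corr)
```

The full corr matrix is what a heat-map function would take as input; target_corr is just a narrower view for reading off the factors that matter for each offer type.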

1. Cleaning Portfolio data: The Offer_type and Channel columns are one-hot encoded, creating a new column for every unique channel type and offer type.

Splitting columns using one-hot encoding
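As a rough illustration of this step, here is a minimal pandas sketch. The toy rows, the exact column names (offer_id, offer_type, channels) and the assumption that each channels entry is a list are mine, not taken from the project code.

```python
import pandas as pd

# Toy stand-in for the portfolio file; names and values are assumptions.
portfolio = pd.DataFrame({
    "offer_id": ["a", "b", "c"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["web", "email", "mobile"], ["web", "email"], ["mobile", "social"]],
})

# One indicator column per unique offer type.
offer_dummies = pd.get_dummies(portfolio["offer_type"]).astype(int)

# Each row holds a list of channels, so join the list into one string
# and split it back into one indicator column per channel.
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies(sep="|")

portfolio_clean = pd.concat(
    [portfolio.drop(columns=["offer_type", "channels"]), offer_dummies, channel_dummies],
    axis=1,
)
print(portfolio_clean)
```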

2. Cleaning Profile data:

Extracted a new column, days, by calculating the difference between the current time and the membership_start_date. Dropped null values associated with income. Mapped gender (M, F, O) to the values 1, 0, -1 respectively. Checked for any unusual ages and dropped them. Generated membership_year based on the membership_start_date.
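The profile steps above could look roughly like the sketch below. The toy rows are made up, the age cutoff of 100 is an assumption (the post does not say which ages count as unusual), and a fixed reference date replaces the current time so the result is reproducible.

```python
import pandas as pd

# Toy stand-in for the profile file; all values are made up.
profile = pd.DataFrame({
    "gender": ["M", "F", None, "O"],
    "age": [35, 62, 118, 47],
    "income": [72000.0, 55000.0, None, 90000.0],
    "membership_start_date": ["2017-03-15", "2016-11-02", "2018-01-20", "2015-07-09"],
})

# Drop rows with missing income, then rows with an unusual age.
profile_clean = profile.dropna(subset=["income"]).copy()
profile_clean = profile_clean[profile_clean["age"] <= 100].copy()  # cutoff is an assumption

# Days of membership relative to a reference date.
# The post uses the current time; a fixed date is assumed here instead.
start = pd.to_datetime(profile_clean["membership_start_date"])
reference_date = pd.Timestamp("2020-08-26")
profile_clean["days"] = (reference_date - start).dt.days
profile_clean["membership_year"] = start.dt.year

# Encode gender as described in the post: M -> 1, F -> 0, O -> -1.
profile_clean["gender"] = profile_clean["gender"].map({"M": 1, "F": 0, "O": -1})
print(profile_clean)
```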

3. Cleaning Transcript data:

A new column is created for each type of event, giving 3 separate columns via one-hot encoding:

  1. Offer Viewed
  2. Offer Completed
  3. Offer Received
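A minimal sketch of the event encoding follows. The toy rows and the event strings ("offer received" and so on) are assumptions about how the transcript file labels events.

```python
import pandas as pd

# Toy stand-in for the transcript file; names and values are assumptions.
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p2"],
    "event": ["offer received", "offer viewed", "offer completed", "offer received"],
})

# One indicator column per event type.
event_dummies = pd.get_dummies(transcript["event"]).astype(int)

# Replace spaces so the new columns are easier to reference later.
event_dummies.columns = [c.replace(" ", "_") for c in event_dummies.columns]

transcript_clean = pd.concat([transcript.drop(columns="event"), event_dummies], axis=1)
print(transcript_clean)
```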

Combining the Cleaned datasets :

A new dataframe (new_df) is created which stores the cleaned transcript, profile and portfolio data in a single dataframe.

new_df formed by combining profile, transcripts and portfolio data
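One way to build such a combined frame is with two merges. This is a sketch under assumptions: the key columns person and offer_id are hypothetical names, the rows are made up, and inner joins are my choice, so customers removed during profile cleaning also drop out of new_df.

```python
import pandas as pd

# Toy cleaned frames; key names (person, offer_id) and values are assumptions.
transcript_clean = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p3"],
    "offer_id": ["o1", "o2", "o1", "o2"],
    "offer_completed": [1, 0, 1, 0],
})
profile_clean = pd.DataFrame({
    "person": ["p1", "p2"],  # p3 is assumed to have been dropped during cleaning
    "age": [35, 62],
    "income": [72000.0, 55000.0],
})
portfolio_clean = pd.DataFrame({
    "offer_id": ["o1", "o2"],
    "bogo": [1, 0],
    "discount": [0, 1],
    "duration": [7, 10],
})

# Attach customer attributes, then offer attributes, to every event row.
new_df = (
    transcript_clean
    .merge(profile_clean, on="person", how="inner")
    .merge(portfolio_clean, on="offer_id", how="inner")
)
print(new_df)
```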

The next task is to check how many BOGO and Discount offers are completed by the users and store them in separate columns, which will be the target variables for the model.

For this task we compare BOGO offers with offer completed and Discount offers with offer completed; the resulting value is 1 or 0 depending on whether the respective offer was completed. The new_bogo and new_discount columns created this way are our target variables.
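Since both inputs are 0/1 indicator columns, one simple way to express this comparison is an element-wise product. The sketch below assumes indicator columns named bogo, discount and offer_completed; only new_bogo and new_discount are names from this post.

```python
import pandas as pd

# Toy rows; bogo, discount and offer_completed are assumed 0/1 indicator columns.
new_df = pd.DataFrame({
    "bogo": [1, 1, 0, 0],
    "discount": [0, 0, 1, 1],
    "offer_completed": [1, 0, 1, 0],
})

# The target is 1 only when the row is that offer type AND the offer was completed.
new_df["new_bogo"] = new_df["bogo"] * new_df["offer_completed"]
new_df["new_discount"] = new_df["discount"] * new_df["offer_completed"]
print(new_df)
```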

DATA MODELLING :

  1. The data is split into train and test sets, where new_bogo and new_discount are the target variables and age, income, gender, days, channel types (web, mobile, social), duration and time are the input features for our model.
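The split can be sketched with scikit-learn's train_test_split. The data here is randomly generated, and the 25% test share, the random seed and the stratified split are assumptions; the post does not state which settings were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for new_df; every value here is synthetic.
rng = np.random.default_rng(0)
n = 200
new_df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "income": rng.integers(30000, 120000, n),
    "gender": rng.choice([1, 0, -1], n),
    "days": rng.integers(100, 2000, n),
    "web": rng.integers(0, 2, n),
    "mobile": rng.integers(0, 2, n),
    "social": rng.integers(0, 2, n),
    "duration": rng.choice([3, 5, 7, 10], n),
    "time": rng.integers(0, 700, n),
    "new_bogo": rng.integers(0, 2, n),
})

feature_cols = ["age", "income", "gender", "days", "web", "mobile", "social", "duration", "time"]
X = new_df[feature_cols]
y_bogo = new_df["new_bogo"]

# Hold out 25% of the rows for testing (split size and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y_bogo, test_size=0.25, random_state=42, stratify=y_bogo
)
print(X_train.shape, X_test.shape)
```

The same call would be repeated with new_discount as the target to get the second pair of train and test sets.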

MODEL EVALUATION :

I trained 4 models with default parameters:

  1. RandomForestRegressor
  2. Logistic Regression
  3. K Neighbors Classifier
  4. DecisionTree Classifier

All the above models are evaluated on accuracy, as this is a classification problem and accuracy measures the share of correctly predicted values out of all data points. The following is the accuracy chart of all the models on the train and test data for BOGO.

The Decision Tree classifier has the highest accuracy on the training and test data, with 99% and 96% accuracy respectively.
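A comparison like this can be sketched as a loop over the models. Two caveats: the data below is synthetic (from make_classification), so the scores it prints say nothing about the Starbucks results, and RandomForestClassifier is swapped in for the RandomForestRegressor named above, because accuracy_score expects class labels rather than continuous predictions. The random_state values are only there for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 9 features (not the Starbucks data).
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default parameters apart from random_state.
models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record accuracy on both the train and the test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }

for name, s in scores.items():
    print(f"{name}: train={s['train']:.3f} test={s['test']:.3f}")
```

Reporting train and test accuracy side by side makes it easier to spot a model that scores well on the training data but generalizes less well to unseen rows.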
