Introduction
The project is a capstone challenge from the Udacity Data Science Nanodegree program, using simulated data from the Starbucks rewards mobile app. Once every few days, Starbucks sends out offers to users through its mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Problem Statement
The project aims to build and deploy a machine learning model that predicts whether a customer will complete an offer, and to identify the features that are most important to completion. Three datasets are provided, each related to the others in some way. The aim is to find how they can be combined meaningfully to extract the information the models need.
The Data Sets
portfolio.json — This dataset contains the unique offer types with their respective IDs. It contains the channels through which offers are sent, the reward for each offer, its duration, and its difficulty (the minimum amount a customer must spend to complete the offer).
profile.json — The profile dataset gives us the demographic information of the customers: the customer ID, age, income, gender and when the membership started.
transcript.json — This also contains the customer ID, along with event, value and time columns. The event column contains transactions, offers received, offers viewed and offers completed. The value column contains either an offer ID or a transaction amount, depending on the record. Overall it is very messy and needs to be tidied for ease of use.
Metric Used
Root Mean Square Error (RMSE) is the metric used to evaluate the models. RMSE is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
The RMSE statistic provides information about the short-term performance of a model by allowing a term-by-term comparison of the actual difference between the estimated and the measured value. The smaller the value, the better the model’s performance.
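As a minimal illustration (assuming NumPy and scikit-learn are available; the values below are made up), RMSE can be computed directly from a model's predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root Mean Squared Error: the standard deviation of the residuals."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Toy example with made-up values
print(rmse([3.0, 2.0, 4.0], [2.5, 2.0, 5.0]))  # ~0.6455
```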
Data Assessment and Cleaning
Each of the 3 datasets was assessed for quality and tidiness issues. After assessment and cleaning, the datasets can be analyzed individually and then together.
Portfolio cleaning — The issues observed and cleaned in this dataset include (a pandas sketch follows the list):
- Moving the id column to the first position
- Renaming id to offer_id
- One-hot encoding the channels and offer_type columns
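A minimal sketch of these steps, assuming the raw file is loaded as published (the exact encoding choices in the notebook may differ):

```python
import pandas as pd

portfolio = pd.read_json('portfolio.json', orient='records', lines=True)

# Rename id to offer_id and move it to the first column
portfolio = portfolio.rename(columns={'id': 'offer_id'})
portfolio = portfolio[['offer_id'] + [c for c in portfolio.columns if c != 'offer_id']]

# One-hot encode channels (a list-valued column) and offer_type
channels = portfolio['channels'].str.join('|').str.get_dummies()
offer_types = pd.get_dummies(portfolio['offer_type'])
portfolio = pd.concat(
    [portfolio.drop(columns=['channels', 'offer_type']), channels, offer_types],
    axis=1,
)
```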
Profile cleaning — The issues observed and cleaned in this dataset are (sketched in code after the list):
- Renaming id to customer_id
- Moving the id column to the first position
- Converting became_member_on from integer to date type
- Dummying the gender column
- Removing the 2,175 rows with missing values in the gender and income columns
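A similar sketch for the profile steps (column names follow the published dataset):

```python
import pandas as pd

profile = pd.read_json('profile.json', orient='records', lines=True)

# Rename id to customer_id and move it to the first column
profile = profile.rename(columns={'id': 'customer_id'})
profile = profile[['customer_id'] + [c for c in profile.columns if c != 'customer_id']]

# Parse the integer date (e.g. 20170812) into a proper datetime
profile['became_member_on'] = pd.to_datetime(
    profile['became_member_on'].astype(str), format='%Y%m%d'
)

# Drop rows with missing demographics, then dummy the gender column
profile = profile.dropna(subset=['gender', 'income'])
profile = pd.concat([profile, pd.get_dummies(profile['gender'], prefix='gender')], axis=1)
```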
Transcript cleaning — The issues observed and cleaned were (see the sketch after the list):
- Renaming person to customer_id
- Moving the customer_id column to the first position
- Separating the value column into offer_id, amount and reward
- One-hot encoding the event column into its 4 categories
- Converting time from hours to days and renaming the column
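And a sketch for transcript, assuming the value dicts use the keys found in the published data ('offer id' for received/viewed events, 'offer_id' for completions, plus 'amount' and 'reward'):

```python
import pandas as pd

transcript = pd.read_json('transcript.json', orient='records', lines=True)

# Rename person to customer_id and move it to the first column
transcript = transcript.rename(columns={'person': 'customer_id'})
transcript = transcript[['customer_id'] + [c for c in transcript.columns if c != 'customer_id']]

# Expand the value dicts; merge the two offer-id key spellings into one column
values = pd.json_normalize(transcript['value'].tolist())
values['offer_id'] = values['offer_id'].combine_first(values['offer id'])
transcript = pd.concat(
    [transcript.drop(columns='value'), values[['offer_id', 'amount', 'reward']]],
    axis=1,
)

# One-hot encode the 4 event types and convert time from hours to days
transcript = pd.concat([transcript, pd.get_dummies(transcript['event'])], axis=1)
transcript['days'] = transcript['time'] / 24
```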
Analysis
The ages of the customers range from a minimum of 18 to a maximum of 101; the age 118 was found to be a code for missing values, as reported by Starbucks. The dataset is made up of 57% males, 41% females and roughly 1% other genders. The income of the customers falls between $30,000 and $120,000.
When age is plotted against income in a line plot, it can be seen that income increases with age, as expected. The number of members has also been increasing year on year; however, after 2018 it starts to drop off.
Female members earn more on average than male and other-gender members, despite being fewer in number than the males.
From the transcript data, as can be seen in the bar plot below, transactions are the most frequent event, and the counts fall gradually from offers received to offers viewed and eventually to offers completed.
At this point we do not know which offers were sent to each customer, which customer viewed which offer, or which offers were eventually completed. To get that, we will have to merge all 3 datasets.
Merging Datasets
First the clean transcript and profile datasets are merged on the customer_id column. The clean portfolio is then merged in on offer_id. This creates a dataset with features from all 3 datasets above.
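A sketch of the merge, assuming the cleaned frames from the earlier sketches; plain transaction rows simply keep NaNs in the portfolio columns:

```python
# Attach demographics to every event, then attach offer metadata where present
df = transcript.merge(profile, on='customer_id', how='left')
df = df.merge(portfolio, on='offer_id', how='left')
```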
As seen in the completed-offers bar chart, the most completed offer is disc_3 while the least completed is bogo_2.
The plot of completed offers by gender shows that more males have completed offers than females. The other category shows no significant differences between the offers. The gap between male and female completions may simply reflect the larger number of male customers. The offers disc_2 and disc_3 have the most conversions.
The income histogram of completed offers is normally distributed. Those earning between $50,000 and $80,000 have the highest number of completed offers.
Modeling
Preprocessing — Before feeding the dataset's features into a model, further preprocessing was needed. The offer_id, gender and year columns were dummied so they could be used in the model. The data was then aggregated so that each row represents one customer together with all the features related to that customer.
After aggregation, some columns were found to have null values; since these were few, they were filled with the column mean. The time, income, days_as_member, duration, age and amount columns, which had wide ranges of values, were normalized using minmax_scale.
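A hedged sketch of this step, assuming the merged df from above plus year and days_as_member columns derived from became_member_on; the aggregation spec is illustrative rather than the notebook's exact choice:

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# One-hot encode the remaining categoricals (gender was already dummied above)
df = pd.get_dummies(df, columns=['offer_id', 'year'], dtype=int)

# Aggregate to one row per customer: sum the indicator/count columns,
# keep a single value for the per-customer demographics
agg_spec = {col: 'sum' for col in df.select_dtypes(include=['number', 'bool']).columns}
for col in ('age', 'income', 'days_as_member'):
    agg_spec[col] = 'first'
customer_df = df.groupby('customer_id').agg(agg_spec).reset_index()

# Fill the few remaining nulls with the column mean
customer_df = customer_df.fillna(customer_df.mean(numeric_only=True))

# Scale the wide-range columns to [0, 1]
wide_cols = ['time', 'income', 'days_as_member', 'duration', 'age', 'amount']
customer_df[wide_cols] = minmax_scale(customer_df[wide_cols])
```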
Model
The following steps were taken to build the model (a condensed sketch follows the list):
1. The data was split into training and test sets using the train_test_split function with a test size of 0.3 and a random state of 42.
2. The algorithm used for the first model is scikit-learn's Random Forest Regressor.
3. A function was created to evaluate the models using RMSE. Baseline Random Forest performance: Train RMSE 0.0651, Test RMSE 0.1382.
4. To further tune the model and improve performance, the Pearson correlation coefficient was used to find the variables best correlated with the target variable.
5. The correlated features were then used as inputs to the Random Forest Regressor and the other models. Correlated Random Forest performance: Train RMSE 0.0684, Test RMSE 0.1544.
6. The model with the best RMSE and cross-validation results was the XGB Regressor. To improve it further, it was tuned using GridSearchCV fitted in a pipeline, which searches for the best-performing parameters. 5 parameters were used for the tuning:
- max_depth = [3, 5, 6] → 6 is optimal
- booster = ['gbtree', 'gblinear'] → gbtree is optimal
- num_parallel_tree = [1, 2] → 1 is optimal
- model__learning_rate = [0.03, 0.06, 0.1] → 0.1 is optimal
- model__n_estimators = [50, 100, 150, 200] → 200 is optimal
7. Tuned XGBoost performance: Train RMSE 0.0577, Test RMSE 0.1084.
8. The cross-validation test score was 0.1173.
9. Finally, the feature importances were plotted in the bar chart seen below.
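A condensed sketch of these steps, assuming the aggregated customer_df from the preprocessing section and a numeric target column (named target here purely for illustration; the correlation threshold is also an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

X = customer_df.drop(columns=['customer_id', 'target'])
y = customer_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def evaluate(model, name):
    """Fit a model and report train and test RMSE."""
    model.fit(X_train, y_train)
    for split, X_, y_ in (('Train', X_train, y_train), ('Test', X_test, y_test)):
        rmse = np.sqrt(mean_squared_error(y_, model.predict(X_)))
        print(f'{name} {split} RMSE: {rmse:.4f}')

# Steps 1-3: baseline Random Forest
evaluate(RandomForestRegressor(random_state=42), 'Baseline Random Forest')

# Steps 4-5: keep only features well correlated with the target (threshold assumed)
corr = customer_df.corr(numeric_only=True)['target'].abs()
selected = corr[corr > 0.1].index.drop('target')

# Step 6: tune XGBoost inside a pipeline with GridSearchCV
pipe = Pipeline([('model', XGBRegressor(random_state=42))])
param_grid = {
    'model__max_depth': [3, 5, 6],
    'model__booster': ['gbtree', 'gblinear'],
    'model__num_parallel_tree': [1, 2],
    'model__learning_rate': [0.03, 0.06, 0.1],
    'model__n_estimators': [50, 100, 150, 200],
}
search = GridSearchCV(pipe, param_grid, scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train[selected], y_train)
print(search.best_params_)
```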
Conclusion
1. FEATURE ANALYSIS
From the visualization above, these 3 features are the most important, in order:
1. reward_x
2. channel_email
3. discount
Now let's take a look at each of the above features on its own merits:
1. reward_x appears to be the most significant feature when it comes to offer completion. It's possible that the higher the reward, the higher the chance of the offer being completed. The rewards create value by motivating Starbucks customers to try a product by completing the offer. Starbucks can use rewards to create a program that rewards customer loyalty. Delivering increased value in the form of rewards to profitable customers turns them into loyal customers, and loyal customers become even more profitable over time.
2. channel_email doesn't appear to be as influential a feature as reward_x, but it does show that email is the most effective delivery channel for offer completions. Starbucks can use this by sending more offers through email and further fine-tuning the process to find out why the channel is effective.
3. The 3rd feature in order of importance is discount, which is an offer type. This is not surprising, as customers are known to respond positively to discounts. Giving customers a discount might be just the thing to draw them in and turn them into recurring customers. New customers also mean new opportunities for cross-sells and up-sells, meaning more revenue in the long run as well. Starbucks might therefore look at how to give more discounts to its customers instead of other offers.
2. RESULTS EVALUATION
All the models above were evaluated using the Root Mean Square Error (RMSE), defined in the Metric Used section, along with cross-validation.
The RMSE results from the models above are:
- Baseline Model: Train RMSE 0.0646, Test RMSE 0.1364.
- XGB Regressor: Train RMSE 0.0521, Test RMSE 0.1205.
- Random Forest Model: Train RMSE 0.0637, Test RMSE 0.1509.
- Decision Tree: Train RMSE 0.0, Test RMSE 0.2178.
- GridSearchCV-tuned XGBoost: Train RMSE 0.0577, Test RMSE 0.1084.
Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets. The best cross-validation score among the untuned models came from the XGB Regressor at 0.142.
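For reference, a minimal sketch of how such a score can be produced with scikit-learn, assuming the split from the modeling sketch above (the fold count here is an assumption):

```python
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

scores = cross_val_score(
    XGBRegressor(random_state=42), X_train, y_train,
    scoring='neg_root_mean_squared_error', cv=5,
)
print(f'Cross-validated RMSE: {-scores.mean():.4f}')
```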
As can be seen above, the GridSearchCV-tuned model performed best, with the lowest errors on both the train and test sets as well as in cross-validation. The correlated-features Random Forest model performed slightly better than the Baseline model on the training data but worse on the test set, which points to overfitting in that model. The Decision Tree's train RMSE of 0.0 paired with the worst test RMSE is an even clearer sign of overfitting.
The XGB Regressor also performed best among the untuned models, with a cross-validation error of 0.142. Even though its train RMSE was a bit lower than the tuned model's, the tuned model's test RMSE was better, and its cross-validation score improved to 0.1173.
3. IMPROVEMENTS AND CONCLUSION
- The first improvement that can be made is in the preprocessing of the data. It was quite challenging to work out how to put all the data together meaningfully. I chose to aggregate the data so that each customer's offer events and transactions sit together in one row. There may be other ways to combine the data that keep the individuality of each transaction intact.
- Someone with deeper coffee retail industry knowledge might be able to select better target and feature variables or even come up with new ones that will make it easier to predict which customers will complete offers.
- We can also go further by finding out what the optimal rewards and discounts are to get more customers to complete offers.
- A proper experiment can be set up, for example, A/B testing to determine the effects of sending offers through emails or other platforms. Different reward systems and discounts can also be tested for effectiveness.
- It will definitely be useful to dig deeper and find features that affect specific demographics to find what makes completions successful. Income demographics and age are definitely interesting routes that can be explored.
In conclusion, Starbucks can send more of its discount offers through the email channel and maintain a good reward system: a rewards program that keeps customers loyal over a long period of time while also encouraging them to spend more on Starbucks products.
If you are interested in the code for this work, it can be found on GitHub.