This project is my Capstone Challenge for Udacity's Data Scientist Nanodegree, in collaboration with Starbucks, who provided simulated data that mimics customer behavior on the Starbucks rewards app. An offer can be purely informational, a discount, or a BOGO (buy one, get one free) deal.
From the data we received, it appears that Starbucks sent 10 different offers to its customers via a variety of different channels.
For this project, we received 3 datasets —
- Portfolio — dataset describing the characteristics of each offer, including its type, difficulty, and duration.
- Profile — dataset containing information regarding customer demographics including age, gender, income, and the date they created an account for Starbucks Rewards.
- Transcript — dataset containing all the instances when a customer made a purchase, viewed an offer, received an offer, and completed an offer. It's important to note that if a customer completed an offer but never actually viewed the offer, then this does not count as a successful offer as the offer could not have changed the outcome.
The purpose of this project is to conduct an exploratory analysis to determine which demographic group responds best to which offer type. I will also create and compare different predictive models to evaluate which features contribute to a successful offer.
The performance of each trained predictive model was measured on a test dataset. As this is a binary classification problem, I used AUC, accuracy, f1 score, and confusion matrices as the performance metrics.
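As a sketch, these metrics can all be computed with scikit-learn; the labels and predictions below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical test labels and model predictions (1 = successful offer)
y_test = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
```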
Profile Distributions — Age and Income by Gender
The two images above show the KDE plots for income and age for both men and women. Overall, the median customer age in the dataset was 60, with a similar distribution for both genders.
The overall mean income was $65,404; however, the income distributions differ between men and women, with women in the dataset having a slightly higher average income.
Merging the Data
To take the analysis further, we need to merge the datasets together and determine whether each offer was successful.
Above is an image showing the data in the Transcript dataset for a single customer after it has been manipulated. It's possible for a customer to complete an offer without ever viewing it, in which case the offer was not successful.
To solve this problem, I transformed the dataset so that each combination of customer_id and offer_id appears only once, with a count of how many times each event occurred. If the counts for 'offer viewed' and 'offer completed' are both greater than or equal to 1, I treat the offer as successful. However, this method does not differentiate between offers that were successful multiple times and offers that were successful only once.
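A minimal sketch of that transformation with pandas, using a toy slice of transcript-style data (the column names here are assumptions, not the exact schema):

```python
import pandas as pd

# Toy long-format transcript: one row per event
transcript = pd.DataFrame({
    "person":   ["a", "a", "a", "b", "b"],
    "offer_id": ["o1", "o1", "o1", "o1", "o1"],
    "event":    ["offer received", "offer viewed", "offer completed",
                 "offer received", "offer completed"],
})

# Count each event per (customer, offer) combination
counts = (transcript
          .groupby(["person", "offer_id"])["event"]
          .value_counts()
          .unstack(fill_value=0))

# An offer is successful only if it was both viewed and completed
counts["successful"] = ((counts.get("offer viewed", 0) >= 1) &
                        (counts.get("offer completed", 0) >= 1)).astype(int)
```

Customer "b" completed the offer without viewing it, so it is not counted as successful.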
Profile and Offer Distributions by Offer Outcome
In the dataset, there were 8464 Men, 6149 Women, and 212 users with gender ‘Other’. As we can see there was a roughly even number of each offer type being sent out. Due to the gender class imbalance in the dataset, we can see that the count of men receiving offers is higher than the count for women. This appears to be in proportion to the gender imbalance present in the dataset.
Offers 2 and 3 performed the worst, achieving no successful offers at all. Looking at the Portfolio dataset, these are both informational offers. From brief inspection, it appears that offers 4, 5, and 7 performed fairly well.
Looking at those offers that were successful- do we see a similar gender class imbalance?
The graphic above shows the successful offer types by gender. We know that the gender distribution for each offer type was fairly consistent, with men receiving the most offers. Yet for offer types 9 and 10 there were more successful offers among women than men, even though fewer women received offers.
Was there a difference in income between successful and unsuccessful offers?
There appears to be quite a noticeable difference in the distribution of incomes between successful and unsuccessful offers.
The median income is lower for unsuccessful offers, and that distribution has a right skew, whilst the distribution of income for successful offers has a higher median and is closer to a Gaussian.
Can we see a difference in the characteristics of offers?
There do appear to be some differences between successful and unsuccessful offers. Offers with a difficulty of 0 are informational offers, which we saw previously did not produce any successful offers, and the graphic shows that offers with a difficulty of 20 are more likely to be unsuccessful.
Looking at duration, offers lasting 4 days or fewer did not produce a single successful offer, while the probability of a successful offer peaks at a duration of 7 days.
We will now use 4 different models to try and predict whether an offer will be successful using offer and customer characteristics. Each of these models was trained on a training dataset and evaluated on a testing dataset to avoid overfitting and to see how the models would perform on unseen data.
Logistic Regression models are great for binary classification problems. They rest on a few underlying assumptions: the observations should be independent of each other, and there should be little or no collinearity between the independent variables; essentially, the features should not be highly correlated with one another.
Here we can see that the feature 'email' is highly correlated with every other independent variable, so it was not included in the predictive modeling with Logistic Regression.
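A quick sketch of why such a feature is dropped: if 'email' appears in every offer, the column is effectively constant, so it carries no signal and its correlations are undefined. The column names below are assumptions for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical one-hot encoded channel columns; 'email' is 1 for every offer
channels = pd.DataFrame({
    "email":  [1, 1, 1, 1],
    "web":    [1, 0, 1, 0],
    "mobile": [1, 1, 0, 0],
})

corr = channels.corr()
# A constant column has zero variance, so its correlations come out NaN,
# and it cannot help the model discriminate between classes; drop it
features = channels.drop(columns=["email"])
```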
Looking at our data it’s clear that we have a class imbalance. This is an issue because our model could just predict that the offer will not be successful (denoted as class 0 on the graphic) and could still achieve a reasonable score for accuracy.
We can address this with random oversampling: we randomly sample data points from the successful offers, with replacement, until we have as many as there are unsuccessful offers.
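One way to sketch this is with scikit-learn's `resample` utility; the tiny arrays below stand in for the real feature matrix and labels:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)          # imbalanced: 7 unsuccessful, 3 successful

# Oversample the minority class with replacement up to the majority count
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y == 0).sum()), random_state=42)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```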
Before we can train a Logistic Regression model, we need to handle missing values, which were present for both income and age. These were imputed with the medians of their respective columns; because the distributions have a slight right skew, imputing the mean would have shifted them.
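Median imputation is a one-liner per column in pandas; the toy profile frame below is illustrative, not the real data:

```python
import pandas as pd
import numpy as np

profile = pd.DataFrame({
    "age":    [25, 60, np.nan, 70],
    "income": [40000, np.nan, 65000, 90000],
})

# Impute with the median: more robust to the right skew than the mean
for col in ["age", "income"]:
    profile[col] = profile[col].fillna(profile[col].median())
```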
I decided to use MinMaxScaler from scikit-learn to normalize the features, hopefully improving the accuracy of the classifier.
Initially, the model was trained using the liblinear solver and it achieved an accuracy of 76% and an f1 score of 0.77.
As you can see the model has performed reasonably well — it shows good sensitivity (the ability to classify a positive outcome — true positive rate) however the specificity (the ability to correctly classify a negative outcome- true negative rate) is letting the model down.
To try to improve the classifier, I used a grid search to tune the hyperparameters, varying the penalty and the inverse regularization strength C.
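A grid search over those two hyperparameters can be sketched with `GridSearchCV`; the candidate values here are assumptions, not the grid actually used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# liblinear supports both l1 and l2 penalties
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
best = search.best_params_
```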
The tuned Logistic Regression model achieved an accuracy of 75.8% and an f1 score of 0.77 on the testing dataset. This means that tuning the model has not given us an increase in performance.
Support Vector Machines
Support Vector Machines (SVMs) find the decision boundary that maximizes the margin between classes, defined by the support vectors. Sometimes the classes cannot be separated by a linear hyperplane, so SVMs use kernel functions to transform the space in which the data exists, making the classes easier to separate.
It’s important to ensure that the features are appropriately scaled for SVM models — I decided to use the StandardScaler method.
I initially trained the model using a linear kernel function. This model achieved an accuracy of 73.5% and an f1 score of 0.77. This has not performed as well as the Logistic Regression model.
As we can see here the model has a better sensitivity than the Logistic Regression model, however, has a lower specificity. The number of false positives (where the model predicted an offer would be successful when it was actually unsuccessful) was 3320 compared to 2363 for the Logistic Regression model.
I initially attempted a GridSearch to tune the parameters, but it took too long to complete, so to try and improve the performance of the model I manually changed a few parameters, starting by switching the kernel function from linear to RBF (Radial Basis Function).
This model achieved an accuracy of 78% and an f1 score of 0.79. This is an improvement from both the previous SVM model and the Logistic Regression model.
The RBF kernel has improved the specificity of the model and there are fewer false positives compared to using the linear kernel, however, the number of false negatives (where the model has predicted the offer would be unsuccessful when in fact it was successful) has increased meaning the sensitivity of the model has decreased.
To try and further improve the performance of the SVM model I will change the regularization parameter C from 1 to 100 whilst still using the RBF kernel.
This SVM model achieved an accuracy of 79% and an f1 score of 0.8, which is an improvement from the previous SVM model for both accuracy and f1 score.
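The final SVM configuration (RBF kernel, C=100, standard-scaled features) can be sketched as follows, again with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# StandardScaler: zero mean, unit variance, fitted on the training split
scaler = StandardScaler().fit(X_train)

# RBF kernel with the larger regularization parameter C used in the final model
svm = SVC(kernel="rbf", C=100)
svm.fit(scaler.transform(X_train), y_train)
acc = accuracy_score(y_test, svm.predict(scaler.transform(X_test)))
```

A larger C penalizes misclassified training points more heavily, trading a wider margin for fewer training errors.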
Looking at the confusion matrix for this model, the number of false positives has been reduced, improving the specificity, whilst the number of false negatives has increased ever so slightly. Overall, it's clear this has led to an improvement in performance.
Linear Discriminant Analysis (LDA)
LDA is a dimensionality reduction technique that estimates the probability that a given set of inputs belongs to a given class. The output class is the class with the highest probability.
LDA is more sensitive to outliers than the previous models, so I used the Tukey method to remove outliers from the features corresponding to income and the number of days a user has been a member.
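Tukey's method flags any point more than 1.5 times the interquartile range beyond the quartiles; a minimal sketch on a made-up income column:

```python
import pandas as pd

income = pd.Series([30000, 45000, 50000, 55000, 60000, 70000, 300000])

# Tukey's fences: keep points within 1.5 * IQR of the quartiles
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
mask = (income >= q1 - 1.5 * iqr) & (income <= q3 + 1.5 * iqr)
income_clean = income[mask]
```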
The model was trained using the solver ‘lsqr’ and achieved an accuracy of 76% and an f1 score of 0.78. This model achieved the same accuracy as the Logistic Regression model but achieved a slightly better f1 score.
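Fitting LDA with the 'lsqr' solver and reading off per-feature coefficients looks roughly like this, with synthetic data standing in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=5, random_state=7)

lda = LinearDiscriminantAnalysis(solver="lsqr")
lda.fit(X, y)

# One weight per feature; the sign shows the direction of the effect
coefficients = lda.coef_[0]
```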
Below we can see the coefficients the LDA model has given to each feature we used to predict whether an offer would be successful.
Here we can see that the classifier has larger coefficients for users with gender ‘Other’ — this is very surprising considering the large imbalance for gender in the dataset. Discount offers had the largest coefficient compared to other offer types including BOGO potentially indicating they were more successful. It’s not surprising that informational offers received a negative coefficient.
It appears that offers 1,4 and 7 all had positive coefficients showing they have a positive effect towards an offer being successful. This model gave ‘social’ the largest coefficient, this is referring to the channels in which the offer is advertised.
This is the final model we will use to predict whether an offer would be successful.
AdaBoost is an ensemble algorithm that combines many weak learners, typically shallow decision trees, and doesn't require scaled data. However, it is sensitive to outliers, so again I removed them using the Tukey method.
I performed GridSearch to find the best parameters which turned out to be a learning rate of 0.2 and the number of estimators as 1500.
This achieved an accuracy of 77% and an f1 score of 0.78.
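The fitted configuration can be sketched as below; note `n_estimators` is reduced here from the tuned value of 1500 just to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=6, random_state=3)

# Tuned learning rate from the grid search; estimator count reduced for speed
ada = AdaBoostClassifier(learning_rate=0.2, n_estimators=100, random_state=3)
ada.fit(X, y)
acc = accuracy_score(y, ada.predict(X))
```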
Looking at the ROC curve for each of these models, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), the AUC again shows that SVM (SVC) performed the best.
On the graphic, the dotted line represents an AUC score of 0.5, which would indicate that the model cannot distinguish between the classes at all; the closer the ROC curve hugs the upper left corner, the better the model has performed.
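AUC is computed from predicted probabilities rather than hard class labels; a tiny illustrative sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)
# 0.5 = no better than random guessing, 1.0 = perfect ranking
```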
As we are predicting whether an offer would be successful, we care equally about how the model performs on both classes, so I am going to prioritize accuracy when comparing the models. SVM with an RBF kernel and a regularization constant of 100 scored highest for both accuracy and f1 score, achieving 79% and 0.8 respectively. A larger dataset could have yielded more accurate results.
As offers 4 and 7 had positive coefficients in the LDA model and produced more successful offers than unsuccessful ones, I would say these offers were quite effective in comparison with the others.
Duration also received a positive coefficient from the LDA model; this makes intuitive sense considering offers with a duration of 5 days or less resulted in no successful offers.
The feature ‘become_member_on’, which corresponds to the year that the customer signed up for Starbucks Rewards, has a negative coefficient. This implies that offers had less of an effect on newer members.
Improvements and Reflection
My method in manipulating the transcript data meant that offers that were successful multiple times could not be differentiated from offers that were successful once. With a different method where this differentiation is taken into account, this could result in more accurate predictive models.
My analysis did not include information regarding the transaction amounts. To improve it could be useful to use predictive models to predict the transaction amount for a user in response to different offers. This information would be very useful to use in evaluating the different offers as some offers could have poor conversion rates but they could produce large transactional amounts.
The way I determined whether an offer was successful meant informational offers could never be counted as successful: by definition, they have no data points with the event 'offer completed'. An informational offer can certainly be successful, but it's hard to construct a success metric that accurately evaluates whether the offer resulted in transactions. As informational offers have a negligible cost, it would be interesting to see how they affect the rate of transactions; I can imagine that sending informational offers too frequently could backfire and result in fewer transactions.
For the predictive models, we were limited by the number of features we had for the customer. If more features were present we could find more optimal demographics and could aid in better classification results.
- Performed an exploratory data analysis on the datasets looking at how the different demographics of Starbucks Rewards users responded to the different offer types.
- Preprocessed the data to ensure it was appropriate for the predictive algorithms.
- Used various models to predict whether an offer would be successful based on a variety of features about the offer and the user.
You can find the Jupyter notebook that contains the analysis here.