Predicting Influence in Starbucks Offer/Customer Datasets

A novice’s journey in the Udacity Data Science Nanodegree Capstone Project

erik james mason
The Startup
Jan 20, 2021 · 15 min read


Photo by Laura Chouette on Unsplash (reframed by author)

The Starbucks Capstone Project:

This article details the capstone project from the Udacity Data Science Nanodegree program, which is composed of simulated data containing offer portfolio details, customer profile details, and transcripts of interactions from the Starbucks rewards mobile app.

The task is to combine the available datasets and determine which demographic groups respond best to which offer types.

Please see the notebook's opening section (1. Udacity's Introduction) for more details from Udacity concerning the project.

Outline for this article:

  1. Data
  2. EDA (exploratory data analysis)
  3. Preprocessing/Engineering
  4. Modeling
  5. Issues & Conclusions

Potential Business Questions:

  1. What offers and offer types tend to perform well and why?
  2. Which, if any, demographic groups can be identified in relation to responsiveness to offers?
  3. Inversely, which offers/demographics do not perform/respond well?
  4. If an appropriate model can be created, which metric should be addressed for optimization?

Project Aim/Solution outline:

  1. Work through the data and start to construct a product (DataFrame) that can be used for modeling
    - We have a few datasets with values that need to be altered/adjusted
    - Then we will have to join these datasets together to get correlated insight
  2. Develop the product to specify or create features that will enable successful modeling
    - Many of the features are informative, but perhaps don't have predictive information innately. We will have to enhance the product to establish better predictive information among the features
  3. Create a model
    - This is a supervised machine learning task for classification
    - There will likely be an imbalance in the target variable, so we will need to address the imbalance either in the training data or in the classifier.
    - We have the option to use many variations including ensemble classifiers, and gradient boosting classifiers
  4. Fit and evaluate the model/Tune the model for best performance

Data

Photo by Christopher Burns on Unsplash

The data for this project was comprised of 3 different .json files: portfolio.json, profile.json, and transcript.json

Portfolio.json

The first file is portfolio.json, which contains offer ids and metadata (reward, channels, difficulty, offer_type).

Profile.json

The second file is profile.json, which contains demographic data by id (gender, age, became_member_on, income).

Transcript.json

The last file is transcript.json, which contains records by id for event, value, and time (the amount of time elapsed since the start of the study).

The Processed Data:

Cleaned and Merged Dataset (Panoramic)

If you have a wider computer monitor, you may be able to see the processed dataframe in the image above, otherwise -

  1. The portfolio dataset splits offer_type and channels into dummy variables, which are concatenated to the original portfolio dataframe. Columns are renamed for ease of visual recognition.
  2. The profile dataframe's “became_member_on” is converted to datetime format. Columns are renamed for ease of visual recognition and merging.
  3. The transcript dataframe's “value” column is parsed for “amount” and “offer id/offer_id”, which are split out into their own columns.
  4. The transcript and profile dataframes are merged to include all values.
  5. The merged dataframe from the previous step is merged with the portfolio dataframe to include all values (a minimal code sketch of these steps follows this list).
  6. Resulting columns: ‘customer_id’, ‘event’, ‘time’, ‘amount’, ‘offer_id’, ‘event_instance_days’, ‘gender’, ‘age’, ‘became_member_on’, ‘income’, ‘offer_reward’, ‘channels’, ‘offer_difficulty’, ‘offer_duration’, ‘offer_type’, ‘offer_type__bogo’, ‘offer_type__discount’, ‘offer_type__informational’, ‘channels_email’, ‘channels_mobile’, ‘channels_social’, ‘channels_web’
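For readers who prefer code, here is a minimal sketch of those steps under a few assumptions (the file paths, the line-delimited JSON format, and the exact rename mapping are illustrative; see the notebook for the actual implementation):

```python
import pandas as pd

# Read the three raw files (assumed to be line-delimited JSON)
portfolio = pd.read_json('portfolio.json', orient='records', lines=True)
profile = pd.read_json('profile.json', orient='records', lines=True)
transcript = pd.read_json('transcript.json', orient='records', lines=True)

# 1. Dummy-code offer_type and the channels lists, then concatenate and rename
offer_dummies = pd.get_dummies(portfolio['offer_type'], prefix='offer_type_')
channel_dummies = (pd.get_dummies(portfolio['channels'].explode())
                     .groupby(level=0).max()
                     .add_prefix('channels_'))
portfolio = pd.concat([portfolio, offer_dummies, channel_dummies], axis=1)
portfolio = portfolio.rename(columns={'id': 'offer_id',
                                      'reward': 'offer_reward',
                                      'difficulty': 'offer_difficulty',
                                      'duration': 'offer_duration'})

# 2. Convert became_member_on to datetime and rename the id column
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'].astype(str),
                                             format='%Y%m%d')
profile = profile.rename(columns={'id': 'customer_id'})

# 3. Parse the nested "value" dicts into amount / offer_id columns
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))
transcript = transcript.rename(columns={'person': 'customer_id'}).drop(columns='value')

# 4./5. Merge transcript with profile, then with portfolio, keeping all values
merged = (transcript.merge(profile, on='customer_id', how='outer')
                    .merge(portfolio, on='offer_id', how='outer'))
```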

From this processed dataset, we can effectively explore the data.

EDA (exploratory data analysis)

Photo by L B on Unsplash

Firstly, a look at the offers and offer types:

As suspected, the combined and processed dataset reveals that, of all the events (offer received, offer viewed, offer completed, or transaction), many offers are received, fewer are viewed, and even fewer are completed. Additionally, the number of transactions is fairly balanced against the number of offers.

Currently, we do not have any information as to which of these transactions are related to an offer - but we can tell that, more than likely, a number of the transactions were directly influenced by an offer.

Total Records of Events by Offer Types and Offer ID

Out of the three Offer Types (Discount, BOGO, and Informational), we can make several observations:

  1. Currently, there is no way to definitively assert that any transaction is associated with an informational offer. As can be seen in the image above, there are no “offer completed” values associated with that offer type.
  2. Discount type offers and BOGO type offers have a similar “offer received” count, but Discount offers have fewer “offer viewed” counts and yet more “offer completed” counts.
  3. Some interesting observations can be made about individual offers from the ratio of “offer viewed” to “offer completed”. For instance, offer “2906b8” did not have the highest “offer completed” count among Discount offer types, but it had nearly the same number of “offer completed” as “offer viewed” events. This sort of observation suggests that this particular offer could be higher performing than others.
  4. Interestingly enough, offer “0b1e1539” has a higher “offer completed” count than “offer viewed” count. This reinforces that an offer can be completed even if it has not been viewed. It may be too soon to say for this particular offer, but the offer itself may be attractive enough to perform well without generating a viewed response.
Visual support for Observation 2 in previous section

For the second main business question, let's look at the demographic groups:

amount of instances by event and gender

An important thing to note during this exploration is that the count of “Male” gender values is much higher than the others. This is pertinent because looking at overall values would make it seem as if the “Male” gender group is much more responsive, when this may not be the case. In fact, the “Male” group's “offer completed” count only slightly outnumbers that of the “Female” group, even though the “offer received” count is much higher for “Male”.

amount of instances by event, faceted by gender

This visualization helps make the “offer completed” amount comparison more apparent.

Total Records of Events by Offer Type: Gender Side by Side Comparison
Total Records of Events by Gender: Side by Side Comparison

We can see that there are some odd observations, such as the “Female” discount counts for “offer viewed” and “offer completed”: the “offer completed” count is higher than the “offer viewed” count. There may not be a consistent explanation for this, other than that the offer was still participated in but perhaps did not need to be viewed for the promotion to be utilized.

From this visualization, we can see that “0b1e1” and “2906b” from discount types and “9b98b” from BOGO types tend to have more “offer completed” regardless of “Gender” group. The only common factor (seemingly) is that the duration is higher for all 3 (7–10 days).

Although a bit disorienting, there is some value to glean from comparing the demographics-oriented columns against themselves:

Distribution of Age and Income by Gender
Total Counts by categorized demographic bins

Though some observations have already been addressed, such as the imbalance of overall counts of “Male” versus “Female”, we can specifically look at which demographic groups of Gender, Age, and Income (split into balanced quartiles) received which type of offers, and also which of the groups tended to interact with the offers.

In the last graph, we can see that Males in the Youngest age bin and Lowest income bin received the most offers (4664 offers) but completed only 1495, a 32% possible interaction with offers (offer completed divided by offer received). For comparison, Females in the Highest income bin of the Oldest age bin had a 61% possible interaction.
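As an illustration, that “possible interaction” rate can be computed with a simple groupby; the age_bin and income_bin quartile columns here are hypothetical names standing in for the binning done in the notebook:

```python
import pandas as pd

# Hypothetical quartile bins over the cleaned, merged DataFrame `merged`
merged['age_bin'] = pd.qcut(merged['age'], 4,
                            labels=['Youngest', 'Young', 'Older', 'Oldest'])
merged['income_bin'] = pd.qcut(merged['income'], 4,
                               labels=['Lowest', 'Low', 'High', 'Highest'])

# Count events per demographic group, then divide completed by received
counts = (merged
          .groupby(['gender', 'age_bin', 'income_bin', 'event'], observed=True)
          .size()
          .unstack('event', fill_value=0))

counts['possible_interaction'] = counts['offer completed'] / counts['offer received']
print(counts['possible_interaction'].sort_values(ascending=False).head())
```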

Preprocessing/Engineering

Photo by Igor Kasalovic on Unsplash

Up until now, much of the data has been termed as “possible interaction” or “related/influence”. But there is another dimension to the data that has not yet been addressed — temporality.

Every “event” is marked by a value in the “time” column, which is defined as the amount of time that has passed since the study was initiated.

We cannot identify an “offer completed” event as an effective influence of an offer unless it occurred in a logical series of events (offer received, then offer viewed, then offer completed). “Offer completed” values can occur before the offer is viewed, indicating that the offer was not effective in influencing the customer's interaction.

To help guide us and create a boundary, we also have the “offer_duration” column, which helps indicate if the logic series of occurrence also occurred within the boundary of the timeframe of the offer.

So, in pseudo-code:
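The gist embedded in the original post is not reproduced here, but a minimal sketch of that logic, assuming one row per received offer with hypothetical received_time, viewed_time, and completed_time columns (in hours) and offer_duration in days, could look like this:

```python
import pandas as pd

def is_valid_influence(received_time, viewed_time, completed_time, duration_days):
    """Return 1 if the offer plausibly influenced its completion, else 0.

    The events must occur in order (received -> viewed -> completed) and the
    completion must fall within the offer's duration window.
    """
    if pd.isna(viewed_time) or pd.isna(completed_time):
        return 0
    in_order = received_time <= viewed_time <= completed_time
    within_window = completed_time <= received_time + duration_days * 24
    return int(in_order and within_window)

# Hypothetical per-offer DataFrame `offers`, one row per received offer
offers['offer_valid'] = offers.apply(
    lambda r: is_valid_influence(r['received_time'], r['viewed_time'],
                                 r['completed_time'], r['offer_duration']),
    axis=1)
```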

This line of logic may help us ascertain which offers were possible valid influencers versus ones that were completed but not influenced.

This gives us something like:

Being able to easily identify which offers were probable influencers will greatly illuminate the performance of the offers.

Now there are some issues created in pursuit of this objective:

“Offer Type” values of “informational” do not register an “offer completed” event with a “time” value. Liberties were taken in assuming that if an individual received and then viewed an informational offer, and a transaction occurred within the same timeframe, the transaction was likely influenced by the informational offer.

From the graph above, you can see that this logic allows a considerable amount of informational offers to be registered as a valid influencer. Is this reasonable? Perhaps, but there is no guarantee.

A possible alternative would be to differentiate the consequent steps by “offer type” to remove the possibility of registering ambiguous events as possible influencers. However, the original intention was to identify which of all offers performed well with which demographic groups (in terms of responsiveness).

Now we can see which demographics are most responsive to offers as a whole. For instance, the income_bracket of “high” seems to generate a lot of responsiveness whereas the income_bracket of “low” generates much less, although the count of “offer_valid” is nearly the same within the “Male” gender group.

It is possible to add specific offer_type values to this graph, but dimensionally they can only be represented in the hover text, hence being excluded from this article at this time.

Modeling

Photo by Drew Graham on Unsplash

Though Preprocessing/Engineering was the title of the previous section, there are still some preprocessing steps available to help curate the appropriate data to train the model:

  1. Scaling/Normalization
  2. Feature Selection
  3. PCA
  4. Oversampling/Combination (Over/Undersampling)

Scaling/Normalization:

Some of the dataset column values are continuous (age, income) and could be treated as such, but would potentially need to be scaled. At the very least, this would help avoid asserting greater importance to the higher values. This step would not be necessary if a binary classification approach (by use of encoding or dummy variables) was determined to be most fitting.
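A minimal sketch of that scaling step, assuming the continuous columns live in hypothetical X_train / X_test splits:

```python
from sklearn.preprocessing import StandardScaler

continuous_cols = ['age', 'income']
scaler = StandardScaler()

# Fit on the training split only, then apply the same transform to the test split
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_test[continuous_cols] = scaler.transform(X_test[continuous_cols])
```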

Feature Selection:

It is unlikely that all features contribute equally as far as correlation or predictive information. For this step, I thoroughly enjoyed and employed the xverse python package for its Weight of Evidence method.
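A sketch of how the xverse Weight of Evidence transformer is typically used, as I understand its README (the import path and attribute names are assumptions and may differ between versions):

```python
from xverse.transformer import WOE  # import path assumed from the xverse README

woe = WOE()
woe.fit(X_train, y_train)       # y_train: the binary offer_valid target

# Information Value per feature; very low IV features are candidates to drop
print(woe.iv_df)                # attribute name assumed
X_train_woe = woe.transform(X_train)  # features re-expressed as WoE values
```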

PCA:

In an effort to reduce dimensionality where it may not be necessary (which could potentially improve model performance and processing efficiency), PCA (Principal Component Analysis) could help increase the interpretability of the dataset.
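A sketch of PCA applied to the scaled features (keeping enough components for ~95% of the variance is an illustrative choice, not the notebook's setting):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95, random_state=42)   # retain ~95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```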

Oversampling/Combination (Over/Undersampling):

As a whole, the dataset is slightly imbalanced in favor of the “offer_valid” value of “0”. While some models perform fine with imbalance or specifically address it (as the imbalanced-learn, or imblearn, Python package intends to accomplish), synthetically oversampling the minority class can help improve model performance with regard to a desired metric. I compared models that address imbalance directly against SMOTE (Synthetic Minority Oversampling TEchnique) and found the results to be comparable.
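A minimal SMOTE sketch with imbalanced-learn, applied to the training split only so the hold-out set keeps its natural class balance:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print('before:', Counter(y_train))
print('after :', Counter(y_train_res))
```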

My input data design:

Ultimately, through vigorous testing of all possible combinations of Feature Selection, PCA, and Oversampling/Combination-Sampling, I found that the most appropriate design was as follows:

  1. Feature Selection:
    - Many of the features themselves have imbalanced distributions. This caused the model to assume that customers who became members in 2016 were more important (significantly more customers joined in 2016 than in any other year).
    - When something like member_start_date is encoded by start year, it gives the model many “red herring” features. Since the model will think that 2016 is very important, it will attempt to draw on that feature, which can and did give poor performance since it is not very related to our problem of valid influence of offers on demographics.
  2. PCA:
    - Initially, I thought PCA would be a good way to address the lack of predictive information in the features once they were constructed into a very wide-dimensional product. As it turns out, I could not easily get the PCA method to decompose the right group of features, and the model's performance dropped drastically.
  3. Oversampling/Combination (Over/Undersampling):
    - Using SMOTE() proved very effective in balancing the model's performance, whereas using combinations like SMOTETomek did not. I assume that the dataset itself is not quite big enough or imbalanced enough to benefit from Combination Sampling.
    - While synthetic sampling proved effective, the imbalance can be addressed with similar effect by adjusting class_weight or using BalancedRandomForestClassifier from imblearn (see the sketch after this list).
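As referenced above, the classifier-side alternatives look roughly like this (the weights and tree counts are illustrative, not the tuned values):

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Option A: under-sample the majority class within each bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=200, random_state=42)

# Option B: keep a standard forest but reweight the classes slightly
rf = RandomForestClassifier(n_estimators=200,
                            class_weight={0: 0.49, 1: 0.51},  # illustrative weights
                            random_state=42)
```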

My model’s design:

I created a few functions to simply loop through numerous “default” estimators to baseline the performance of the input data.

Example Output for Model Performance Iterator Function
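The actual iterator function appears as an image in the original post; a re-creation of the idea (the estimator list and metrics here are illustrative) might look like this:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

def baseline_estimators(X, y, estimators, test_size=0.25, random_state=42):
    """Fit each untuned estimator and report simple hold-out metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state)
    results = {}
    for name, est in estimators.items():
        est.fit(X_tr, y_tr)
        pred = est.predict(X_te)
        results[name] = {'accuracy': accuracy_score(y_te, pred),
                         'precision': precision_score(y_te, pred),
                         'recall': recall_score(y_te, pred)}
    return results

defaults = {'RandomForest': RandomForestClassifier(random_state=42),
            'GradientBoosting': GradientBoostingClassifier(random_state=42),
            'KNeighbors': KNeighborsClassifier()}
# results = baseline_estimators(X, y, defaults)
```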

Using this as a baseline, I created another function to test specific models with specific parameter values.

Example of Individual Model Testing Function

Through research, reading, and similar testing means (plots/tuning improvements), I ended up testing:

  1. RandomForestClassifier
  2. AdaBoostClassifier
  3. GradientBoostingClassifier
  4. ExtraTreesClassifier
  5. KNeighborsClassifier
  6. RidgeClassifierCV
  7. LogisticRegression
  8. BalancedBaggingClassifier
  9. BalancedRandomForestClassifier
  10. XGBClassifier
  11. LGBMClassifier
  12. CatBoostClassifier

Some of the list (AdaBoost, Ridge, LogisticRegression) were unable to perform as well as the others with default settings and were excluded from further testing.

Some showed slightly inferior performance in comparison with something similar (BalancedBaggingClassifier vs. BalancedRandomForestClassifier, BalancedRandomForestClassifier performed better overall in several tests).

Some, like KNeighbors, did not perform well except when combined in a pipeline, voting, or stacking ensemble.

Developing the model's design:

Once I had several models that seemed to improve with some tuning, I used RandomizedSearchCV to tune more appropriately.

Example of RandomizedSearchCV
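An equivalent sketch of that search, with illustrative parameter distributions rather than the ones actually used:

```python
from scipy.stats import randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'n_estimators': randint(100, 500),     # illustrative ranges
              'max_depth': randint(2, 6),
              'learning_rate': [0.01, 0.05, 0.1]}

search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=25,
                            scoring='recall',   # the metric argued for below
                            cv=5,
                            random_state=42,
                            n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```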

I did originally attempt to use GridSearchCV but even after hours (close to days) of tuning, it often yielded that the best parameters were the default or ones I already suspected through previous testing. RandomizedSearchCV is more efficient and yields comparable results.

I found that I preferred different models for different aspects of model performance, which led me to use StackingClassifier. I had attempted other python packages and also VotingClassifier, but the simplicity and efficacy of StackingClassifier made it the obvious choice after experimentation.

Metrics

Understandably, accuracy is a highly desirable metric; it is the proportion of correctly predicted values out of all values. However, between the imbalance in some of the features (“gender”, “age”, “income”) and some of the nuances introduced in attempting to identify valid influencing offers, I considered that the better metric to address our business case could be either:

A. Precision

Precision would give us a good look at how precise or accurate our predicted positives are.

B. Recall

Recall would show us how many of the relevant (actual positive) cases in our dataset the model captures.
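In scikit-learn terms, with y_test / y_pred assumed to come from any fitted classifier's hold-out split:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))   # TP / (TP + FP)
print('recall   :', recall_score(y_test, y_pred))      # TP / (TP + FN)
print(confusion_matrix(y_test, y_pred))
```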

For instance, this is the plotted confusion matrix of a GradientBoostingClassifier that has an accuracy of 70%.

Though getting a higher accuracy is desirable, what is the business question and case that we ought to be striving to address?

Is it more costly to send offers to unresponsive customers? Does receiving unwanted offers affect their loyalty?

Given that an individual's responsiveness may be a bit unpredictable, I suggest that it is most appropriate to find the most relevant points: optimizing the model for recall should ensure that we are not missing opportunities to send offers to individuals who are more likely to be responsive.

Through various methods to test models, my final model was a StackingClassifier, which is comprised of ExtraTreesClassifier, GradientBoostingClassifier, XGBClassifier, KNeighborsClassifier, RidgeClassifierCV, and finally BalancedRandomForestClassifier.

Each individual classifier was tuned using RandomizedSearchCV hyperparameter tuning, where needed.

My model’s general parameter tuning:

Untuned models seemed to perform similarly or worse

The class weights were a bit of a “balancing” act when used in combination with oversampling methods. Even a small adjustment (+/- 0.01) could change the recall/precision drastically.

I found that most estimators did not need explicit training_loss, gamma, or depth parameterization — in fact, many of them suffered because of too much tuning.

Originally, I expected the lower accuracy was due to max_depth and similar values being too low, but I found that many estimators' performance dropped greatly with values greater or less than their defaults.

I did find very similar performance when removing BalancedRandomForestClassifier, using SMOTE(), and adjusting class_weights, but the addition improved performance enough that I decided to ultimately include it.

Model Evaluation/Validation:

My final model was the aforementioned StackingClassifier.

I chose this because I believed that determining the valid influence of offers on demographic groups would be the most beneficial business use, but might require combining the strengths of different estimators. I originally used KNeighborsClassifier and RandomForestClassifier and simply expanded upon those prospects to include Gradient Boosting estimators and Extra Trees/Balanced Forest estimators.

The base estimators were ExtraTreesClassifier, GradientBoostingClassifier, XGBClassifier, KNeighborsClassifier, RidgeClassifierCV, BalancedRandomForestClassifier. This is quite the wide ensemble, but other than minor adjustments, the performance stayed about the same with similar combinations (excluding RidgeClassifier/ExtraTreesClassifier, selecting only GradientBoostingClassifier or XGBClassifier, etc).

The final estimator was the XGBClassifier, which seemed to perform better in testing than the default LogisticRegression.

StackingClassifier has a cross-validation splitting strategy built in to train the final_estimator. The default 5-fold cross-validation was kept, as an increase to 10-fold produced no difference.
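A simplified sketch of that final ensemble (hyperparameters are defaults or illustrative placeholders rather than the tuned values):

```python
from sklearn.ensemble import (StackingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifierCV
from imblearn.ensemble import BalancedRandomForestClassifier
from xgboost import XGBClassifier

base_estimators = [
    ('extra_trees', ExtraTreesClassifier(random_state=42)),
    ('grad_boost', GradientBoostingClassifier(random_state=42)),
    ('xgb', XGBClassifier(random_state=42)),
    ('knn', KNeighborsClassifier()),
    ('ridge', RidgeClassifierCV()),
    ('balanced_rf', BalancedRandomForestClassifier(random_state=42)),
]

stack = StackingClassifier(estimators=base_estimators,
                           final_estimator=XGBClassifier(random_state=42),
                           cv=5,       # default 5-fold CV to train the final estimator
                           n_jobs=-1)
stack.fit(X_train, y_train)
```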

My final model performance:

Issues & Conclusions

Photo by Anirudh Ganapathy on Unsplash

Issues:

There were certainly some issues in this process that need to be addressed.

Because of the effort to include the “informational” offer type, there were many records deemed to be valid influencers whose validity is uncertain. The number of such records is enough to make many of the models believe that this is an important feature. Perhaps it is, and informational offers are truly highly effective, but it is hard to be certain.

The lower accuracy score in the model's performance is certainly a bit concerning at first. However, I feel confident that, rather than just predicting the “event” column (which leads to a “highly accurate” model but adds no real business value), using this approach to ascertain which offers are influential to certain demographic groups is the true business value goal.

I did notice that the wider the dimensionality of the data became, the more sparse and imbalanced the features and their classes became. But without the wider dimensionality, the data suffers from ambiguity, which impacts any supervised learning model's performance.

I had tried implementing a Support Vector Classifier (SVC) with the understanding that SVMs (Support Vector Machines) can efficiently handle wide-dimensional data; however, the fitting time was excessively long. I was able to test the default values and some minor tunings, but the performance was slightly inferior to the StackingClassifier and the fitting time was too costly. I would consider further testing it either on its own or as a replacement within the StackingClassifier.

I had also tried LightGBM & CatBoost as other Gradient Boosting classifiers, with their unique approaches to efficiency, outputs, and categorical value handling. While I very much enjoyed the expanded capabilities and functions of both, they ultimately did not improve the StackingClassifier, nor did they perform better on their own or in other combinations.

Unfortunately, this is the sort of situation where capturing more data would greatly improve efficacy. My only other consideration is reevaluating the logic and processing of “offer_valid”; there may be another way to enhance or smooth this process to create more predictive quality.

However, considering that the previous proportion of positive “offer_valid” records was about 38% of the total, using this model would increase the capture of relevant, potentially responsive customers for influential offers up to 74%. Though I do intend to revisit this project to attempt to improve the overall performance of the model, this would potentially be a very effective increase to customer interaction and responsiveness.

Again, please review my notebook for more detailed information on the processes and data analysis. I very much welcome any critiques and reviews!

Thank you very much for your time and attention!

