Prediction of Customer Shopping Behavior

How can we predict whether a customer will use a discount based on the frequency of their purchases?

Colleen Wang
INST414: Data Science Techniques
5 min read · May 3, 2024

Stakeholders/Decisions Informed:

Marketing predictions are becoming increasingly valuable as the competition for customer attention grows. Studying customer trends and likely purchasing habits can yield significant findings for stakeholders in the marketing industry who hope to increase customer engagement. With data about customer trends, we can predict shopping habits and whether customers are likely to engage with promotions and discounts. A key question that this kind of data can help answer is “How can we predict whether a customer will use a discount based on the frequency of their purchases?”

The answer to this question could inform decisions made by marketing directors and promotions teams within a retail company who are trying to maximize customer engagement. These are the relevant stakeholders because analyzing customer trends can help them allocate resources effectively, target customers more accurately, and tailor marketing efforts to maximize sales and customer satisfaction. In particular, the analysis can inform decisions about discount strategies, promotional campaigns, and pricing optimization.

Data/Ground Truth Labels:

To answer the proposed question, a dataset containing information about customer shopping trends and discount usage is needed. The data should contain metrics about promotions as well as customer information such as location, amount spent, and purchase frequency. These fields are relevant because they can be analyzed to predict whether a customer is likely to use a discount based on their previous purchases. An analysis of these fields can improve forecasting in retail marketing and inform stakeholders' decisions about promotional strategies and customer engagement. I collected a subset of this data from Kaggle, a free resource for open data sets. The fields contained in this data set are:

  • Customer ID
  • Age
  • Gender
  • Item Purchased
  • Category
  • Purchase Amount (USD)
  • Location
  • Size
  • Color
  • Season
  • Review Rating
  • Subscription Status
  • Shipping Type
  • Discount Applied
  • Promo Code Used
  • Previous Purchases
  • Payment Method
  • Frequency of Purchases

The ground truth labels come from the ‘Discount Applied’ column in the dataset, which records whether a discount was applied to each transaction. If a discount was applied, the entry is marked ‘Yes’; otherwise it is marked ‘No’. The model learns from this labeled data to predict whether a customer will use a discount based on their frequency of purchases. The labels are therefore taken directly from the transaction records rather than assigned by hand. This matters for the question because the labels are the basis for understanding the relationship between purchase frequency and the likelihood of a discount being used.
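As a minimal sketch of that labeling step, the Yes/No text can be mapped to 1/0 with pandas. The rows below are invented stand-ins for the Kaggle file (only the column names come from the dataset), and the `label` column name is my own choice for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the Kaggle file; only the column
# names are taken from the dataset described above.
df = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Frequency of Purchases": ["Weekly", "Quarterly", "Fortnightly", "Every 3 Months"],
    "Discount Applied": ["Yes", "No", "Yes", "No"],
})

# Map the Yes/No text to 1/0 so it can serve as the classification target.
label_map = {"Yes": 1, "No": 0}
df["label"] = df["Discount Applied"].map(label_map)

# Any value other than Yes/No maps to NaN, so stop here instead of
# silently training on a missing label.
assert df["label"].notna().all(), "unexpected value in 'Discount Applied'"
df["label"] = df["label"].astype(int)

print(df[["Customer ID", "Discount Applied", "label"]])
```

The check before the cast is there so that a stray value such as ‘yes’ or a blank cell is caught early rather than turning into a missing label.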

Classification Model/Features:

For this analysis, I chose a classification model because the target variable, “Discount Applied,” is binary: “Yes” when a discount was applied and “No” when it was not. The task is therefore to classify customers into two categories: those who are likely to use a discount and those who are not. Classification is the appropriate approach because the predicted variable is categorical.

The only feature used for the supervised model is ‘Frequency of Purchases,’ a categorical variable (with values such as “fortnightly,” “quarterly,” and “every 3 months”) indicating how often a customer makes purchases. It is converted to a numeric representation and serves as the predictor for whether a discount is applied. Other features, such as age, gender, and purchase history, could be incorporated into the model for more comprehensive predictions, but for simplicity only frequency of purchases is considered in this analysis.
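To make the setup concrete, here is a rough sketch of training a decision tree on that single feature with scikit-learn. It is an illustration under assumptions, not necessarily the code in my repository: the toy rows and their labels are invented, and the one-hot encoding, 75/25 split, and `random_state` values are choices made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration; the real analysis would load the
# Kaggle CSV instead. Labels here are arbitrary, not findings.
freqs = ["Weekly", "Fortnightly", "Monthly", "Quarterly", "Every 3 Months", "Annually"]
df = pd.DataFrame({
    "Frequency of Purchases": freqs * 10,
    "Discount Applied": ["Yes", "Yes", "No", "No", "No", "Yes"] * 10,
})

# One-hot encode the categorical predictor: one 0/1 column per frequency value.
X = pd.get_dummies(df["Frequency of Purchases"])
# Binary target: 1 = discount applied, 0 = no discount.
y = (df["Discount Applied"] == "Yes").astype(int)

# Hold out a quarter of the rows for evaluation (split sizes are an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on the toy data: {accuracy:.2f}")
```

Because the toy labels are made up, the accuracy printed here says nothing about the real dataset; the point is only the shape of the pipeline, from one categorical column to a fitted tree.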

Incorrect samples:

The trained model can make classification mistakes for various reasons, such as outliers, inconsistent features, or the influence of external factors that are not part of the analysis.

The prediction model misclassified these samples:

  • Customer 839 was predicted to have no discount applied based on their frequency of purchases of “every 3 months”; however, they used a discount.
  • Customer 321 was predicted to have a discount applied based on their frequency of purchases of “fortnightly”; however, they did not use a discount.
  • Customer 366 was predicted to have no discount applied based on their frequency of purchases of “quarterly”; however, they used a discount.
  • Customer 1096 was predicted to have no discount applied based on their frequency of purchases of “every 3 months”; however, they used a discount.
  • Customer 1659 was predicted to have no discount applied based on their frequency of purchases of “every 3 months”; however, they used a discount.

The misclassifications made by the decision tree classifier likely stem from the complexity of customer shopping behavior and the narrow scope of the model. The prediction relies only on how frequently customers purchase; however, other factors such as individual preferences, seasons, or amount spent might influence discount behavior. The classification report for the decision tree also shows that the model struggles more with predicting instances where there is a discount than with instances where there is “no discount.”
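A small sketch of how misclassified rows and per-class performance can be inspected with scikit-learn is shown below. The customer IDs, frequencies, and predictions in this table are invented for illustration (they are not the samples listed above), and the `actual` and `predicted` column names are assumptions.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical results table: IDs, frequencies and labels are invented.
# 1 = discount applied, 0 = no discount.
results = pd.DataFrame({
    "Customer ID": [101, 102, 103, 104, 105, 106],
    "Frequency of Purchases": [
        "Every 3 Months", "Fortnightly", "Quarterly",
        "Weekly", "Monthly", "Annually",
    ],
    "actual": [1, 0, 1, 1, 0, 0],
    "predicted": [0, 1, 0, 1, 0, 0],
})

# Rows where the prediction disagrees with the ground truth label.
misclassified = results[results["actual"] != results["predicted"]]
print(misclassified)

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm = confusion_matrix(results["actual"], results["predicted"], labels=[0, 1])
print(cm)

# Per-class precision and recall; a lower recall on "discount" than on
# "no discount" is the kind of imbalance described above.
print(classification_report(
    results["actual"], results["predicted"],
    labels=[0, 1], target_names=["no discount", "discount"], zero_division=0,
))
```

In this toy table the “discount” class has the lower recall, which mirrors the pattern in my report, but the numbers themselves are made up.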

Answer:

The analysis of customer shopping trends offers some insight into predicting whether a customer will use a discount. I trained a decision tree classifier with “frequency of purchases” as the predictor variable and “discount applied” as the target variable. Marketing directors and promotions teams can apply insights from these predictions to enhance discount strategies and customer engagement. The model suggests that customers with more frequent purchases are more likely to use a discount. The classification report shows the performance and accuracy metrics of the decision tree, which are important to consider before drawing conclusions from the analysis. While there is some correlation between the frequency of purchases and discount usage, other factors not included in the model likely play a role in determining whether a customer will use a discount.

Data Cleaning/Bugs:

The chosen data set was fairly consistent, and no major cleaning steps were necessary. However, common issues others might encounter with similar data include missing values, outliers, or inconsistencies in categorical variables. For the analysis, some categorical variables, such as “frequency of purchases,” needed to be converted into binary representations so that the decision tree classifier could use them.
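For anyone checking a similar file, a quick sketch of those sanity checks might look like this. The messy rows are invented to show the problems; my data set did not need this cleanup. Treating “Every 3 Months” and “Quarterly” as the same category is an assumption of this example, not something the analysis above did.

```python
import pandas as pd

# Invented rows with deliberate problems: stray whitespace, inconsistent
# casing, and one missing value.
df = pd.DataFrame({
    "Frequency of Purchases": ["Weekly", "quarterly ", "Every 3 Months", None, "Fortnightly"],
    "Discount Applied": ["Yes", "No", "Yes", "No", "Yes"],
})

# Count missing values per column.
missing = df.isna().sum()
print(missing)

# Normalize spelling: trim whitespace and use consistent capitalization.
freq = df["Frequency of Purchases"].str.strip().str.title()

# Assumption for this sketch: "Every 3 Months" and "Quarterly" describe
# the same purchase frequency, so merge them into one category.
freq = freq.replace({"Every 3 Months": "Quarterly"})
df["freq_clean"] = freq

# Drop rows that still have no usable frequency value.
cleaned = df.dropna(subset=["freq_clean"])
print(cleaned["freq_clean"].value_counts())
```

Whether to merge overlapping categories is a judgment call; leaving them separate lets the tree treat them differently even though they describe the same interval.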

Limitations/Bias:

Although the analysis provides insights into the relationship between the frequency of purchases and discount usage, it has several limitations. The first is the narrow scope of the features used to predict customer shopping trends. Other factors may influence discount usage, such as customer demographics, preferences, or external promotions, and leaving them out could lead to biased predictions or an incomplete understanding of customer behavior. Another limitation is that this dataset was generated using ChatGPT. Although it provides a reasonable basis for practicing predictions about customer shopping habits, its synthetic origin limits the reliability and validity of the analysis. The data may also be biased because it was generated by a machine learning program rather than collected from real transactions.

Here is a link to my GitHub repository that contains the code I have developed for this assignment: https://github.com/cwangg/INST414-Modules/tree/main/module-6
