“What drives customer spending on groceries at a wholesale club?”: Understanding customer spending behavior at a wholesale club

Published in TechLabs, Nov 28, 2020

This project was carried out as part of the TechLabs “Digital Shaper Program” in cooperation with the Marketing Center Münster (Term 2020/01).

Abstract: In our project, we want to identify what drives consumers to spend more or less money on their grocery shopping. Does the weekday play a role, or do consumers react to certain discounts or brand types? Using the skills we learned in the TechLabs Data Science R track, we try to answer this question. To do so, we work with a Kaggle dataset from an American retailer that tracks each customer's transaction history over a period of 1.5 years.
After putting a lot of energy into data cleaning and organizing, we concentrate on identifying drivers of customer spending by using a panel regression with fixed effects in R.

Idea / Background of the Project:

As we are all studying marketing, our group is very interested in consumer behavior, especially shopping behavior and whether there are factors that influence spending on a shopping trip. Identifying such drivers with the help of data science and R sounds like an intriguing challenge that can help us learn more about handling and analyzing big data in marketing research.

The Problem and how we want to solve it:

Retailers collect a lot of data about their customers but rarely use it to analyze their strategy or to identify drivers of consumer behavior. Therefore, we want to find out what influences customer spending on groceries. At first, our retailer's dataset seems to fit this kind of research question well. However, after diving deeper into the data, we realize that we need to clean, reorganize and merge a lot of it. Two important insights are that our data was presumably collected not from a regular supermarket but from a wholesale market, and that the information we can derive from the dataset is very limited: for example, brand and item IDs are given, but we have no information about which real-life brand or item belongs to these IDs.
Looking at the relationships between our independent and dependent variables, we find only linear relationships. These linear relationships, combined with the household panel structure of our data, lead us to choose a panel regression with fixed effects as our research method.

The Key-Question:

What drives customer spending on groceries at a wholesale club?

Our Methods and how we used them to solve our problem:

All members of our group are enrolled in the Data Science track with R, so we concentrate on using our newly learned data science skills to answer our research question.
In the beginning, we approach our research question on a rather aggregated company level. We look at the different datasets that we have, combine them into one dataset and do a lot of descriptive analysis to see who shops at the retailer (demographics), which product categories the retailer offers, which categories are bought the most, how total spending is distributed across weekdays, how often people redeem a coupon, how much money they save with coupons per trip, and so on. To create the descriptive statistics and display them as plots, we use the R packages dplyr and ggplot2.
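As a rough illustration of this kind of descriptive analysis, the following sketch aggregates spending by weekday and summarizes coupon usage. It uses base R so that it runs without extra packages (the project itself used dplyr and ggplot2), and the data frame `trips`, its column names and all values are made-up assumptions, not taken from the Kaggle dataset.

```r
# Toy trip-level data; column names and values are illustrative assumptions.
trips <- data.frame(
  customer_id     = c(1, 1, 2, 2, 3),
  weekday         = c("Mon", "Thu", "Thu", "Sat", "Thu"),
  total_spendings = c(40, 60, 25, 35, 80),
  coupon_discount = c(0, 5, 0, 0, 10)
)

# Total spending per weekday (one row per weekday)
spend_by_weekday <- aggregate(total_spendings ~ weekday, data = trips, FUN = sum)

# Share of trips on which at least one coupon was redeemed
coupon_trip_share <- mean(trips$coupon_discount > 0)

# Average coupon saving per trip, counting only trips with a redeemed coupon
avg_saving <- mean(trips$coupon_discount[trips$coupon_discount > 0])

print(spend_by_weekday)
```

A table like `spend_by_weekday` is the kind of summary that can then be passed to a plotting library to compare weekdays visually.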
After looking at the descriptive statistics, we realize that we have to switch to an individual-level approach. To better fit our data to this approach, we include only the product category grocery, delete variables that have no further relevance and create new variables by combining existing ones. For example, we consider not only the variable sum_coupondiscount but also the variable share_coupon_discount, so that our analysis also includes the discount's share of the full price. We also add completely new variables such as weekday and days_between_purchases to gain additional information. We use the R packages plyr and dplyr for merging our data and creating new variables.
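The derived variables described above could be built roughly as follows. This is a minimal base R sketch rather than the project's actual code: the variable names sum_coupondiscount, share_coupon_discount, weekday and days_between_purchases come from the text, while the `full_price` and `date` columns and all values are assumptions made for illustration.

```r
# Toy trip-level data; full_price and date are assumed column names.
trips <- data.frame(
  customer_id        = c(1, 1, 1, 2, 2),
  date               = as.Date(c("2020-01-02", "2020-01-09", "2020-01-20",
                                 "2020-01-03", "2020-01-10")),
  full_price         = c(50, 40, 80, 20, 100),  # basket value before discounts
  sum_coupondiscount = c(5, 0, 20, 0, 10)
)

# Sort so that consecutive rows of a customer are consecutive trips
trips <- trips[order(trips$customer_id, trips$date), ]

# Relative coupon discount: share of the full basket price covered by coupons
trips$share_coupon_discount <- trips$sum_coupondiscount / trips$full_price

# Weekday of each trip, derived from the date in a locale-independent way
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
trips$weekday <- factor(day_names[as.POSIXlt(trips$date)$wday + 1],
                        levels = day_names)

# Days since the same customer's previous trip (NA for the first trip)
trips$days_between_purchases <- ave(
  as.numeric(trips$date), trips$customer_id,
  FUN = function(d) c(NA, diff(d))
)

print(trips)
```

The first trip of each customer has no predecessor, so its days_between_purchases is left as NA here; how such rows are treated in the model is a separate decision.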
After discussing different statistical methods, from a simple OLS to a more complicated spline regression, we finally decide to use a panel regression to account for the individual customers and gain insights on a customer level.
As we consider both total_spendings (absolute value of spending) and deviation_mean_dv_per (relative deviation of each customer from their individual mean) to be insightful measures, we choose both as dependent variables and run one model for each to identify their drivers. The selection of our independent variables is based on content relevance as well as statistical testing. All independent variables and interaction effects we include can be seen in our R output table below. We use the plm package to run our panel regression models and additionally test for robustness (for details on the robustness tests, refer to our code on GitHub).
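To make the fixed-effects idea concrete, the sketch below computes a percentage deviation from each customer's own mean and then estimates a fixed-effects ("within") slope by demeaning the data per customer and running OLS on the demeaned values. It uses base R on synthetic data; the exact definition of deviation_mean_dv_per, the single regressor and all numbers are assumptions for illustration and do not reflect the project's results. For a single regressor like this, plm with `model = "within"` should yield the same coefficient.

```r
# Synthetic panel: 3 customers with 4 trips each (all numbers are made up).
panel <- data.frame(
  customer_id           = rep(1:3, each = 4),
  share_coupon_discount = c(0.0, 0.1, 0.2, 0.3,
                            0.1, 0.2, 0.0, 0.4,
                            0.3, 0.0, 0.2, 0.1)
)

# Spending = customer-specific baseline + common slope * regressor
baseline <- c(100, 60, 150)
panel$total_spendings <-
  baseline[panel$customer_id] - 50 * panel$share_coupon_discount

# Assumed definition: percentage deviation from the customer's own mean
cust_mean <- ave(panel$total_spendings, panel$customer_id)
panel$deviation_mean_dv_per <-
  100 * (panel$total_spendings - cust_mean) / cust_mean

# Within transformation: subtract each customer's mean from every variable,
# which removes the customer-specific baseline (the fixed effect)
demean <- function(v, g) v - ave(v, g)
y_within <- demean(panel$total_spendings, panel$customer_id)
x_within <- demean(panel$share_coupon_discount, panel$customer_id)

# OLS on the demeaned data, no intercept needed after demeaning
fe_fit   <- lm(y_within ~ 0 + x_within)
fe_slope <- unname(coef(fe_fit))
print(fe_slope)
```

The slope recovers the common effect even though the three customers have very different baseline spending levels. Note that the standard errors reported by `lm` on demeaned data do not correct the degrees of freedom for the estimated customer means, which is one reason to use a dedicated panel package in practice.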

**R output table for the two plm models**

Results:

Over the course of the project, we gather knowledge on the drivers of customer spending at a wholesale club. We analyze and discuss the impact of discount types as well as of the different brand types in the shopping basket. Even more importantly, however, we now know a lot more about handling big datasets, including how to manage and describe data, how to select the statistical method that best fits the data and how to find and use the right tools in R to do so. To conclude our project, and based on our findings, we give the following recommendations:

1: Managers should avoid focusing purely on coupon campaigns, as these are not beneficial when the main goal is to raise spending. When coupons are sent out to customers, the coupon format needs to be chosen carefully. For example, a two-for-one discount leads to a comparably high coupon share, but it also ensures that the customer pays a higher amount in store than with a 50% discount on a single item.

2: While other discounts in general increase total spending, it is important that managers keep the ratio of other discounts to the full price of the basket low. The other discount should therefore encourage the customer to buy more items without decreasing the retailer's profit margin too much.

3: Managers should avoid encouraging the use of both discount types at the same time, for example by offering them in different time frames or by adjusting the store layout so that coupon items are not placed close to items that are eligible for other discounts. Adding restrictions to coupons might also be an option to ensure that a coupon cannot be redeemed when other discounted articles are purchased. We recommend additional research into which shares of coupons and other discounts are still profitable for the company and how customer demand changes with different discount values.

4: What matters for a company is not how much a customer spends per shopping trip but how much they spend over a longer period, for example one month, so no action needs to be taken to influence shopping frequency. With regard to monthly spending, the average amount per customer should be observed more thoroughly. Furthermore, we do not recommend encouraging customers to purchase on Thursdays without further analysis. Instead, we suggest that the retailer conduct a customer survey to understand why customers prefer Thursdays for grocery shopping and derive future marketing strategies, such as POS activities, from the results.
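The coupon-format comparison in recommendation 1 can be checked with simple arithmetic. The sketch below assumes a hypothetical item price of 10 and compares a two-for-one coupon with a 50% discount on a single item; the price and variable names are illustrative only.

```r
item_price <- 10  # hypothetical shelf price of one item

# Two-for-one: the basket holds two items and the coupon covers one of them
bogo_full     <- 2 * item_price
bogo_discount <- item_price
bogo_paid     <- bogo_full - bogo_discount
bogo_share    <- bogo_discount / bogo_full   # coupon share of the full price

# 50% off a single item
half_full     <- item_price
half_discount <- 0.5 * item_price
half_paid     <- half_full - half_discount
half_share    <- half_discount / half_full   # coupon share of the full price

print(c(bogo_paid = bogo_paid, half_paid = half_paid))
print(c(bogo_share = bogo_share, half_share = half_share))
```

Under these assumptions both formats produce the same coupon share of the full price, but the customer pays twice as much in store with the two-for-one coupon.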

For further information on our project please contact us via LinkedIn or check our GitHub Repository: https://github.com/techlabsms/ms-st-20-MCM2-Marketing

Name of the final code file: complete R code.R

Dataset: https://www.kaggle.com/vasudeva009/predicting-coupon-redemption-pca

The Team:

Ngoc Ha Vu/ Data Science R/ Descriptive Statistics (LinkedIn)

Miriam Etz/ Data Science R/ Model Formulation

Christina Brinkmann/ Data Science R/ Data Management (LinkedIn)

Frederike Czichowski/ Data Science R/ Hypotheses & Theoretical underpinnings (LinkedIn)