Market Basket Analysis

An Analysis of Grocery Store Products

5 min readAug 14, 2022

Overview

For this project, I will be testing the Market Basket Analysis on a publicly available Kaggle grocery store dataset containing customer purchase orders. It is important to mention that I will be using SQL for this task as I did not see a lot of examples, where SQL was the choice platform to perform this analysis, during my research. The query codes used will be included as a link in the Data Exploration section.

This analysis will answer the following questions

Is there any real association between the products customers buy at a grocery store?
How can the revelation of these associations be effectively applied?

Introduction

Market Basket Analysis is a data mining technique used to discover items that are frequently purchased together and reveal the associations between them, this helps retailers develop marketing strategies that will boost sales.

Although this technique is mostly used by retailers, it can also be applicable to other areas such as; associating purchases with demographic and socio-economic data, fraud detection based on credit card usage, telephone calling patterns, etc. If you watch movies on Netflix or shop on Amazon then you have presumably had items recommended to you based on what people who bought that item or saw that movie are doing. That is a possible use case for this analysis.

Below are some of the important terms to know to make sense of the analysis.

Support: This measures how frequently an association rule happens. That is, how frequently two products, X and Y, are purchased together in transactions. Calculated as XUY count/N, where N is the total count of transactions.

Confidence: This measures how strong an association rule is. It basically lets you know the likelihood of a second product(Y) being present in the basket if the first(X) is. Calculated as XUY/X count.

Expected Confidence: This measures how frequently the second item is purchased regardless of any preconditions. That is, the number of times Y is purchased regardless of if X is present. Calculated as Y count/N, where N is the total count of transactions.

Lift: This calculates the change in the probability of a variable (Y) occurring over two instances; the presence of another variable(X), and the absence of any other variable. It is the ratio between target response(confidence) and average response (expected confidence). Calculated as Confidence/Expected Confidence

If Lift is greater than 1, it means that the target response is more likely than the average response. Therefore, buying X improves the chance of also buying Y.
If Lift is below 1, the target response is less likely than the average response. Therefore, buying X does not improve the chance of also buying Y.
A lift of 1 means that the model (or association rule) does not affect the outcome. Therefore, buying X does not have any impact on the chance of also buying Y.

Data Details

This data set was gotten from Kaggle, it consists of 1 table with 3 columns (member_number, date, and item_description) and 38,765 rows of the purchase orders of people from grocery stores. Date range is 01/01 /2014 to 12/30/2015. One possible data limitation is that it is not current.

Data credit: Groceries Dataset

Data Cleaning and exploration

I used both Microsoft Excel and SQL for these activities, the codes can be found here. Data format is csv and was imported into PostgreSQL via pgadmin, the date style was changed in MS Excel before the import as it was initially incompatible with system settings.

Okay, let’s get into it. Before doing any exploration, I had to make sure there were no null cells or duplicated data that will cause bias or errors. I also checked to make sure that the format for member number was consistent throughout.

For data exploration, I started with finding the most popular item combinations (XUY), here I did a self join of the table on the item_description column to derive this. Self join is basically merging the table with itself. During this step, it’s important to make sure that no two combinations are repeated, and no item is combined with itself. You cannot have combinations like soda and soda, soda and yogurt, and yogurt and soda. If soda and yogurt is already a combination, then yogurt and soda shouldn’t be a combination as well. The next step was to count and rank the itemset to see which combinations were popular amongst shoppers, I created a temporary table to save this result.

Now you know XUY count, you need to also get the individual count of items as well as total count of transactions(N) because you will be using these to calculate the confidence and expected confidence, I used GROUP BY and subqueries largely to achieve this. I grouped the items by their description to get the total occurrence of each one, and grouped by member_number and date to get the number of daily transactions for each customer. To calculate the Expected Confidence and Confidence of each item, you will need to put together all the separate calculations you performed earlier, I was able to achieve this using JOINS and subqueries. I also used this same technique to calculate the Lift.

Findings

There are 14,963 transactions within the data time frame.

Other Vegetable and Whole Milk is the product combination found in most orders, followed by Rolls/Buns and Whole Milk. Whole Milk appears to be a hot item amongst these shoppers!

Lift scores of the top 10 product combinations

From this table, we see that the top 10 product combinations are strongly associated as they all have Lift scores greater than 1.Therefore, we can be very confident that customers will most likely purchase b_item given that they purchased a_item. That is, a customer who purchased Other Vegetables will plausibly purchase Whole Milk, Rolls/Buns, Soda or Yogurt as well.

Recommendations on how to apply analysis result.

Promotions on the associated b_item can be offered at check-out or in-store to boost sales.
Organize associated products within close proximity of each other, this will encourage shoppers to pick b_item if they already picked a_item.
Use cross selling to suggest associated products that are frequently bought together to customers at check-out.