A Beginner’s Guide to Market Basket Analysis

Sushmitha Pulagam
Published in The Startup
7 min read · May 5, 2020

Have you ever wondered why items are arranged the way they are in your local grocery store or supermarket? How online stores suggest other products based on the items in your cart, making online shopping more convenient? How online and offline stores increase their sales by offering promotions, discounts, and coupons?

Well, we can get all these insights through Market Basket Analysis with the Apriori algorithm, which is built on association rule mining.

What are these Association rules?

Association rules describe combinations of items that frequently occur together in transactions. In a nutshell, they allow retailers to trace relationships between the items that people purchase. A rule typically takes the form below.

{Butter, Jam} -> {Bread}

The above rule says that customers who purchase butter and jam together are likely to buy bread as well. One might think such a rule is obvious and wonder why an algorithm is needed to find it. In real life, however, transactional data can be enormous, and it becomes extremely challenging to get insights by manual inspection. We usually need to run these algorithms several times, with different settings, until we discover stronger rules that are beneficial for the retailer.

Types of Association rules

Useful rules -> These are the ones we want to find. They give high-quality, actionable information.

Trivial rules -> These are already known to anyone who is familiar with the business.

Inexplicable rules -> These have no clear explanation and do not suggest any course of action.

There is no automatic way to sort the generated rules into the above categories. That judgment comes from domain knowledge put into practice.

In this article, we will walk through the steps for deriving association rules with the Apriori algorithm, along with the R code and visualizations.

Let us assume you have 1000 receipts on your table. We will discuss how to tackle the questions below.

✦ What are the frequent items that were sold?

✦ What items are customers purchasing together?

✦ What items are customers purchasing along with a specific item (e.g., yogurt)?

To get the dataset, please click here

Each line is a transaction and each cell represents an item.

Snapshot of groceries data from excel
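To make the basket format concrete, here is a minimal base R sketch that splits a few comma-separated lines into transactions, in the spirit of what a basket-format reader does. The three sample lines and all variable names are made up for illustration; they are not taken from the groceries file.

```r
# Hypothetical sample: each string is one receipt, items separated by commas
raw_lines <- c("whole milk,yogurt,rolls/buns",
               "yogurt",
               "whole milk,butter")

# Split every line into a character vector of items
baskets <- strsplit(raw_lines, ",", fixed = TRUE)

# Number of transactions and the distinct items seen across them
n_transactions <- length(baskets)
all_items <- sort(unique(unlist(baskets)))

print(n_transactions)
print(all_items)
```

Each element of `baskets` is one transaction, and its length is the number of items on that receipt.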

Install and load the libraries below.

install.packages("arules")
install.packages("arulesViz")
library(arules)
library(arulesViz)

Read the data as a transaction object.

# file.choose() opens a file picker on any platform (choose.files() is Windows-only)
groceries_data <- read.transactions(file.choose(), format = 'basket', sep = ',')
summary(groceries_data)

From the summary, we can see the below output.

Summary of the groceries data

Let’s dig deeper into the summary output above.

The total number of items sold = transactions * items * density. With the density rounded to 0.0260, this gives 9835 * 169 * 0.0260 ≈ 43215.
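The arithmetic above can be reproduced in a couple of lines of R. The three inputs are the figures quoted from the summary; because the density is a rounded value, the result should be read as an approximation.

```r
# Figures quoted from the summary above (density is rounded to 4 decimals)
n_transactions <- 9835
n_items <- 169
density <- 0.0260

# Density is the share of filled cells in the transactions-by-items matrix,
# so multiplying back gives the approximate number of filled cells
total_item_occurrences <- n_transactions * n_items * density
print(round(total_item_occurrences))
```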

We can see that whole milk appears in the highest number of transactions, followed by other vegetables. The element length distribution tells us how many transactions contain one item, two items, and so on. There are 2159 transactions with a single item, 1643 transactions with 2 items, and so on. Since the mean basket size is a little higher than the median, the distribution is positively skewed.
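As a small illustration of how a length distribution and the mean-versus-median comparison work, here is a base R sketch on a handful of invented baskets. The data and names are hypothetical and only meant to mirror the idea, not the groceries figures.

```r
# Hypothetical baskets of different sizes
baskets <- list(c("milk"), c("milk", "bread"), c("yogurt"),
                c("milk", "bread", "butter", "jam", "eggs", "rice"),
                c("bread"))

sizes <- lengths(baskets)    # items per transaction
size_table <- table(sizes)   # how many transactions have 1 item, 2 items, ...
print(size_table)

# A mean above the median hints at a long right tail (positive skew)
print(mean(sizes))
print(median(sizes))
```

One unusually large basket is enough to pull the mean above the median, which is the pattern described above.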

Now we will plot the top 20 frequent items from the data set.

library(RColorBrewer)
itemFrequencyPlot(groceries_data, topN = 20, type = "absolute", col = brewer.pal(9, "Set3"), main = "Items Frequency Plot")
Frequency plot for the top 20 sold items

Whole milk stands at the top with the highest number of sales followed by other vegetables, rolls/buns, and so on. This graph helps us in identifying the rules for market basket analysis.

How does the Apriori algorithm work?

There are three statistical measures one should be aware of before implementing this model.

Support -> It measures how frequently item A, or the combination of items A and B, appears across all transactions.

ItemSet 1
ItemSet 2

ItemSet 1 will clearly have more support than ItemSet 2, because purchasing Bread and Pen together is rare.
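Support can be computed by hand on a toy dataset, as in the sketch below. The five baskets are invented, Bread and Butter is only an assumed example of a common pair, and the `support` helper is a hypothetical function written for this illustration (it is not part of arules).

```r
# Hypothetical transactions; item names echo the example above
baskets <- list(c("Bread", "Butter"),
                c("Bread", "Butter", "Jam"),
                c("Bread", "Pen"),
                c("Butter"),
                c("Bread", "Butter", "Milk"))

# Support = share of transactions that contain every item in the itemset
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

print(support(c("Bread", "Butter"), baskets))  # 3 of 5 transactions
print(support(c("Bread", "Pen"), baskets))     # 1 of 5 transactions
```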

Confidence -> This measure tells us how often items A and B occur together, relative to the number of times A occurs.

The share of transactions containing Bread that also contain Butter
The share of transactions containing Butter that also contain Bread

Here we have to note that ☟

Confidence({Bread} -> {Butter}) ≠ Confidence({Butter} -> {Bread})
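The asymmetry is easy to see on a small invented dataset. In the sketch below, `count_with` and `confidence` are hypothetical helpers written for illustration, and the baskets are made up so that every Butter basket contains Bread but not the other way round.

```r
# Hypothetical transactions: Bread is in all five, Butter in three
baskets <- list(c("Bread", "Butter"),
                c("Bread"),
                c("Bread", "Jam"),
                c("Bread", "Butter", "Jam"),
                c("Butter", "Bread"))

# Number of transactions that contain all the given items
count_with <- function(items, transactions) {
  sum(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Confidence(lhs -> rhs) = count(lhs and rhs together) / count(lhs)
confidence <- function(lhs, rhs, transactions) {
  count_with(c(lhs, rhs), transactions) / count_with(lhs, transactions)
}

print(confidence("Bread", "Butter", baskets))  # 3 of the 5 Bread baskets
print(confidence("Butter", "Bread", baskets))  # 3 of the 3 Butter baskets
```

The numerator is the same in both directions; only the denominator changes, which is why the two confidences generally differ.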

Support and Confidence are the two tuning parameters of the Apriori algorithm that we need to adjust until we get stronger rules.

Lift -> This measure controls for the support (frequency) of the consequent item while calculating the conditional probability of occurrence of {Butter} given {Bread}.
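Lift is commonly defined as the confidence of the rule divided by the support of the consequent. The sketch below works that out on invented baskets; the data and the `support` helper are hypothetical and exist only for this illustration.

```r
# Hypothetical transactions
baskets <- list(c("Bread", "Butter"),
                c("Bread", "Butter", "Jam"),
                c("Bread"),
                c("Milk"),
                c("Milk", "Butter"),
                c("Bread", "Butter"),
                c("Milk"),
                c("Jam"))

# Share of transactions that contain every item in `items`
support <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

supp_bread  <- support("Bread", baskets)                # 4 of 8
supp_butter <- support("Butter", baskets)               # 4 of 8
supp_both   <- support(c("Bread", "Butter"), baskets)   # 3 of 8

# Confidence of {Bread} -> {Butter}, then divide by how common Butter is overall
conf_bread_butter <- supp_both / supp_bread
lift_bread_butter <- conf_bread_butter / supp_butter

print(lift_bread_butter)
```

A lift above 1 here means Butter shows up in Bread baskets more often than it does in baskets overall. Unlike confidence, lift gives the same value in both directions of the rule.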

Model implementation with default values for parameters

association.rules <- apriori(groceries_data)
Output for the algorithm with default values

By default, apriori() uses a minimum support of 0.1 and a minimum confidence of 0.8. From the above output, we can see that no rules were generated with these thresholds. Now it is time to tweak the parameters to get some rules.
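To get a feel for why a high minimum support can leave nothing to build rules from, here is a simplified base R sketch of the level-wise search behind Apriori. It is not how arules implements it: candidates are built from all items that survived the previous level rather than with the classic join-and-prune step, and the baskets, the `support` helper, and `frequent_itemsets` are all hypothetical.

```r
# Hypothetical transactions
baskets <- list(c("milk", "bread", "butter"),
                c("milk", "bread"),
                c("milk", "yogurt"),
                c("bread", "butter"),
                c("milk", "bread", "butter", "yogurt"))

# Share of transactions that contain every item in `items`
support <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

frequent_itemsets <- function(transactions, min_support) {
  items <- sort(unique(unlist(transactions)))
  # Level 1: single items that meet the threshold
  current <- Filter(function(s) support(s, transactions) >= min_support,
                    as.list(items))
  result <- current
  k <- 2
  while (length(current) > 0) {
    # A frequent k-itemset can only contain items that were still
    # frequent at the previous level, so everything else is dropped
    survivors <- sort(unique(unlist(current)))
    if (length(survivors) < k) break
    candidates <- combn(survivors, k, simplify = FALSE)
    current <- Filter(function(s) support(s, transactions) >= min_support,
                      candidates)
    result <- c(result, current)
    k <- k + 1
  }
  result
}

loose  <- frequent_itemsets(baskets, min_support = 0.5)
strict <- frequent_itemsets(baskets, min_support = 0.9)
print(length(loose))
print(length(strict))
```

With the stricter threshold no single item qualifies, so the search stops at the first level and there is nothing left to form rules from.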

Model Implementation with different values for parameters

association.rules <- apriori(groceries_data, parameter = list(supp=0.001, conf=0.8,maxlen=10))
The output of the algorithm with Support =0.001 and Confidence =0.8
summary(association.rules)
Summary for the 410 rules generated

From the summary of the association rules, we can see that 29 rules contain 3 items (lhs + rhs), and rules with 4 items (lhs + rhs) are the most common, with 229 rules.

All 410 rules (including redundant ones) have a lift of at least 3.131, with a maximum of 11.235. In theory, a lift above 1 means the items on both sides of the rule occur together more often than they would if they were independent, so such rules are considered good rules. We will inspect 15 of these rules for further analysis after removing the redundant ones.

Removing redundant rules

subset.rules<-which(colSums(is.subset(association.rules,association.rules))>1)
length(subset.rules)
subset.asso.rules<-association.rules[-subset.rules]
summary(subset.asso.rules)
Summary for the 319 rules generated

The number of redundant rules is 91. Removing these 91 rules from the original 410 leaves us with 319 rules.
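The subset-and-column-sum logic can be traced on a tiny example in base R. The three rules below are invented, and the sketch assumes that each rule is compared by the full set of items it mentions (lhs plus rhs). It mirrors the idea of the filter, not the internals of arules.

```r
# Hypothetical rules, each stored as the set of all items it mentions (lhs + rhs)
rule_items <- list(r1 = c("rice", "whole milk"),
                   r2 = c("rice", "sugar", "whole milk"),
                   r3 = c("yogurt", "butter"))

n <- length(rule_items)
# subset_matrix[i, j] is TRUE when rule i's items are all contained in rule j
subset_matrix <- matrix(FALSE, n, n,
                        dimnames = list(names(rule_items), names(rule_items)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    subset_matrix[i, j] <- all(rule_items[[i]] %in% rule_items[[j]])
  }
}

# Every rule is a subset of itself, so a column sum above 1 means
# some other rule is also contained in that rule
flagged <- which(colSums(subset_matrix) > 1)
kept <- rule_items[-flagged]

print(names(flagged))
print(names(kept))
```

Here `r2` is flagged because `r1` is contained in it, while `r1` and `r3` are kept.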

Inspecting the first 15 rules out of 319 rules

inspect(subset.asso.rules[1:15])
Inspection of the first 15 rules

For the 10th rule, the confidence is 1, which means that 100% of the transactions containing rice and sugar also contain whole milk.

We can also generate rules for a particular item. For example, below is the code for generating rules that have only yogurt on the right-hand side.

yogurt.asso.rules <- apriori(groceries_data, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10), appearance = list(default = "lhs", rhs = "yogurt"))
inspect(head(yogurt.asso.rules))
Rules generated for yogurt

Visualization Rules

For a better understanding of the rules and their Support, Confidence, and Lift values, let us create a few plots. Here, we keep the rules with confidence > 0.4 and sort them by confidence.

subRules <- subset.asso.rules[quality(subset.asso.rules)$confidence > 0.4]
top10subRules <- head(subRules, n = 10, by = "confidence")
inspect(top10subRules[1:10])
# arulesViz selects the interactive widget output through the engine argument
plot(top10subRules, method = "graph", engine = "htmlwidget")
Inspection of top 10 rules to compare with the below graph
Plot for top 10 rules

In the above plot, the size of a node reflects the rule's Support (larger means higher) and its color reflects the rule's Lift (darker means higher). The lines coming into a rule node are from its LHS items, and the lines going out point to its RHS item.

Below is the Parallel coordinates plot.

subRules1 <- head(subRules, n = 10, by = "lift")
plot(subRules1, method = "paracoord")
Parallel coordinates plot

So if a customer has red blush wine and liquor in the shopping cart, s/he is about 11.23 times (the lift) as likely to buy beer as a customer in general.

Summary

This article is meant to familiarize you with how to perform Market Basket Analysis on sample data. Among all the generated rules, some might be useless. Hence, care needs to be taken while selecting the values for support and confidence in order to get stronger rules.

Association rule mining is not limited to the marketing domain. It can also be used in other domains, such as healthcare, where a certain combination of health risk conditions can lead to other, more complicated diseases, or credit card transactions, where it can help estimate the likelihood of the next purchase based on a customer’s previous purchases, and so on.

You can get the R code from my Github profile here.

Thank you for reading and Happy Learning ☺
