Market Basket Analysis Using RStudio

Hello everyone!! It’s good to be back ^^

Published in The Startup · 5 min read · Jul 13, 2020

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in transaction data using measures of interestingness such as support, confidence, and lift.

An example of Association Rules

Assume there are 100 customers.
10 of them bought milk, 8 bought butter, and 6 bought both.
Rule: bought milk => bought butter
support = P(Milk & Butter) = 6/100 = 0.06
confidence = support/P(Milk) = 0.06/0.10 = 0.6
lift = confidence/P(Butter) = 0.6/0.08 = 7.5
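
The three measures can be checked with a few lines of R. This is a minimal sketch that uses only the counts from the example; the variable names are made up for illustration. For a rule A => B, confidence is the support divided by P(A), and lift is the confidence divided by P(B).

```r
# Counts from the example (100 customers)
n_customers <- 100
n_milk      <- 10   # bought milk
n_butter    <- 8    # bought butter
n_both      <- 6    # bought milk and butter

# Rule: bought milk => bought butter
support    <- n_both / n_customers                   # P(Milk & Butter)
confidence <- support / (n_milk / n_customers)       # P(Butter | Milk)
lift       <- confidence / (n_butter / n_customers)  # confidence / P(Butter)

print(c(support = support, confidence = confidence, lift = lift))
```

A lift well above 1 means milk buyers pick up butter far more often than customers in general do.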

Ok that’s enough for the theory, let’s start practice!!

In this section, you will use a dataset from the UCI Machine Learning Repository. The dataset is called Online-Retail, and you can download it from here. It contains transaction data from 01/12/2010 to 09/12/2011 for a UK-based, registered, non-store online retailer.

Load the packages

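# plyr is loaded before tidyverse/dplyr so that dplyr verbs such as mutate() are not masked by plyr
library(plyr)
library(arules)
library(arulesViz)
library(tidyverse)
library(knitr)
library(ggplot2)
library(lubridate)
library(dplyr)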

Data Pre-processing

retail <- read.csv("E:\\SHAULA DOCUMENT\\SEM 6\\DATA MINING\\MEDIUM\\MBA\\Online Retail.csv")
# complete.cases() returns a logical vector marking the rows with no missing values; use it to keep only complete rows.
retail <- retail[complete.cases(retail), ]
# mutate() is from the dplyr package and edits or adds columns. Assign the result, otherwise the change is discarded.
retail <- retail %>% mutate(Description = as.factor(Description))
retail <- retail %>% mutate(Country = as.factor(Country))
# Store InvoiceDate as a Date in a new column. This assumes InvoiceDate is in a format as.Date() can parse; pass format = otherwise.
retail$Date <- as.Date(retail$InvoiceDate)
# Keep a formatted copy of InvoiceDate. Extracting only the time needs a date-time object and a format such as "%H:%M:%S".
TransTime <- format(retail$InvoiceDate)
# Convert InvoiceNo to numeric
InvoiceNo <- as.numeric(as.character(retail$InvoiceNo))
# cbind() returns a new data frame and does not modify retail in place.
# The two binds are left commented out so that retail keeps only its original columns plus Date.
# retail <- cbind(retail, TransTime)
# retail <- cbind(retail, InvoiceNo)
glimpse(retail)

After preprocessing, the dataset includes 406,829 records and 9 fields: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country, Date.

Association rules for online retailer

Before using any rule mining algorithm, we need to transform the data from the data frame format into transactions, so that all the items bought together are in one row.

library(plyr)
transactionData <- ddply(retail,c("InvoiceNo","Date"),
                         function(df1)paste(df1$Description,
                                            collapse = ","))
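
To see what this grouping step produces, here is a minimal sketch on a made-up data frame. It uses base R's aggregate() in place of ddply() purely to avoid a package dependency; the toy data and the names toy and baskets are assumptions for illustration, not part of the retail dataset.

```r
# Toy stand-in for the retail data frame (values are made up)
toy <- data.frame(
  InvoiceNo   = c(1, 1, 2, 2, 2, 3),
  Description = c("MILK", "BUTTER", "MILK", "BREAD", "JAM", "BREAD"),
  stringsAsFactors = FALSE
)

# Collapse all descriptions that share an invoice into one comma-separated string
baskets <- aggregate(Description ~ InvoiceNo, data = toy,
                     FUN = function(x) paste(x, collapse = ","))
colnames(baskets)[2] <- "items"
print(baskets)
```

Each invoice ends up as a single row whose items column lists everything bought on that invoice.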

Next, as InvoiceNo and Date will not be of any use in the rule mining, you can set them to NULL:

transactionData$InvoiceNo <- NULL
transactionData$Date <- NULL
# Rename column to items
colnames(transactionData) <- c("items")
transactionData

This format for transaction data is called the basket format. Next, you have to store this transaction data into a .csv file.

write.csv(transactionData,"E:/SHAULA DOCUMENT/SEM 6/DATA MINING/MEDIUM/MBA/market_basket_transactions.csv", quote = FALSE, row.names = FALSE)

Now we have our transaction dataset. Let’s have a closer look at how many transactions we have and what they are.

tr <- read.transactions('E:/SHAULA DOCUMENT/SEM 6/DATA MINING/MEDIUM/MBA/market_basket_transactions.csv', format = 'basket', sep=',')
summary(tr)

We see 22,191 transactions, which is also the number of rows. There are 7,876 items; remember that items are the product descriptions in our original dataset. Transactions here are collections, or subsets, of these 7,876 items.
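
The relationship between rows, transactions, and items is easier to see on a tiny file. The sketch below writes a made-up basket file to a temporary path and reads it back. The file contents and the name toy_file are assumptions for illustration, and skip = 1 is an optional addition that the call above does not use: it drops the header line that write.csv() writes, so that line is not read as a one-item transaction.

```r
library(arules)

# Write a tiny basket-format file: a header line, then one transaction per line
toy_file <- tempfile(fileext = ".csv")
writeLines(c("items",
             "MILK,BUTTER",
             "MILK,BREAD,JAM",
             "BREAD"), toy_file)

# skip = 1 drops the header line before the transactions are parsed
toy_tr <- read.transactions(toy_file, format = "basket", sep = ",", skip = 1)

summary(toy_tr)
print(dim(toy_tr))   # number of transactions, number of distinct items
```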

You can use itemFrequencyPlot to create an item frequency bar plot, which shows the distribution of items in an itemMatrix-based object (e.g., transactions, or the items in itemsets and rules). Transactions are our case.

# Create an item frequency plot for the top 20 items
if (!require("RColorBrewer")) {
  # install color package of R
  install.packages("RColorBrewer")
  #include library RColorBrewer
  library(RColorBrewer)
}
itemFrequencyPlot(tr,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Pastel2'),main="Relative Item Frequency Plot")

In this plot, the first argument is the transaction object to be plotted. topN lets you plot the N items with the highest frequency. type can be “absolute” or “relative”. With “absolute”, the plot shows the raw count of transactions that contain each item. With “relative”, it shows that count as a share of all transactions.
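
The numbers behind the two plot types can be inspected directly with itemFrequency() from arules. This is a minimal sketch on four made-up baskets; toy_tr and its contents are assumptions for illustration.

```r
library(arules)

# Four made-up baskets
toy_tr <- as(list(
  c("milk", "butter"),
  c("milk", "bread"),
  c("milk", "butter", "bread"),
  c("jam")
), "transactions")

abs_freq <- itemFrequency(toy_tr, type = "absolute")  # number of baskets containing each item
rel_freq <- itemFrequency(toy_tr, type = "relative")  # share of baskets containing each item

print(abs_freq)
print(rel_freq)
```

Here milk appears in 3 of the 4 baskets, so its absolute frequency is 3 and its relative frequency is 0.75.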

This plot shows that ‘White Hanging Heart T-Light Holder’ and ‘Regency Cakestand 3 Tier’ appear in the most transactions. So to increase the sales of ‘Set of Cake Tins Pantry Design’, the retailer can put it near ‘Regency Cakestand 3 Tier’.

Generating Rules!

We use the Apriori algorithm in the arules package to mine frequent itemsets and association rules.

association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))
summary(association.rules)

The summary of the rules gives us some information:

The number of rules: 49122
The distribution of rules by length: a length of 5 items has the most rules.
The summary of quality measures: ranges of support, confidence, and lift.

Since there are 49122 rules, let’s print only the first 10:

inspect(association.rules[1:10])
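
Indexing with [1:10] picks rules by position rather than by quality. If you would rather see the strongest rules first, one option is to sort before inspecting. The sketch below does this on a made-up set of five baskets; the toy data, the thresholds, and the names toy_tr, toy_rules and by_lift are assumptions for illustration.

```r
library(arules)

# Five made-up baskets
toy_tr <- as(list(
  c("milk", "butter"),
  c("milk", "butter", "bread"),
  c("milk", "bread"),
  c("butter", "bread"),
  c("milk", "butter", "jam")
), "transactions")

# Mine rules with at least two items; verbose = FALSE silences the progress log
toy_rules <- apriori(toy_tr,
                     parameter = list(supp = 0.2, conf = 0.6, minlen = 2),
                     control = list(verbose = FALSE))

# Order the rules from highest to lowest lift, then show the first three
by_lift <- sort(toy_rules, by = "lift")
inspect(head(by_lift, n = 3))
```

The same pattern should work on association.rules, with by = "confidence" or by = "support" if those measures matter more to you.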

Visualizing Association Rules

Scatter Plot

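# The rules were mined with conf = 0.8, so every rule already passes this 0.4 threshold;
# raise the threshold if you want a smaller subset.
subRules<-association.rules[quality(association.rules)$confidence>0.4]
plot(subRules)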

The above plot shows that rules with high lift have low support.

plot(subRules,method="two-key plot")

The two-key plot uses support and confidence on the x- and y-axis respectively. It uses order for coloring, where the order is the number of items in the rule.
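
The order used for coloring can also be read off directly with size() from arules, which counts the items on both sides of each rule. A minimal sketch on made-up baskets follows; the toy data and the names toy_tr, toy_rules and rule_order are assumptions for illustration.

```r
library(arules)

# Five made-up baskets
toy_tr <- as(list(
  c("milk", "butter"),
  c("milk", "butter", "bread"),
  c("milk", "bread"),
  c("butter", "bread"),
  c("milk", "butter", "jam")
), "transactions")

toy_rules <- apriori(toy_tr,
                     parameter = list(supp = 0.2, conf = 0.6, minlen = 2),
                     control = list(verbose = FALSE))

# Number of items in each rule (left-hand side plus right-hand side)
rule_order <- size(toy_rules)
print(table(rule_order))   # how many rules there are of each order
```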

Let’s select 10 rules from subRules having the highest confidence.

top10subRules <- head(subRules, n = 10, by = "confidence")

Plot an interactive graph:

plot(top10subRules, method = "graph",  engine = "htmlwidget")

That’s all for today, hope you guys enjoy it!^^
