Market Basket Analysis In Python using Apriori Algorithm

Bhushan Ikhar
Oct 30, 2020 · 6 min read

In technical terms, Apriori (the algorithm used in Market Basket Analysis) tries to find out which items are bought together. Retailers can then improve the customer experience by placing those items near each other or by suggesting them on the retailer's site, basically to make customers buy more. (Remember those "These items are bought together" suggestions on online shopping sites?)

But how does it work? Of course, there is some theory associated with it, but we will study that as we code. The first thing we need is data. I will be using a data set from Kaggle, which can be found here (https://www.kaggle.com/shazadudwadia/supermarket#GroceryStoreDataSet.csv). It's a small data set of breakfast items bought from a store.

  1. Data Pre-processing: This is the most important step. One thing I have observed in online tutorials is that they start from data that is already in this form (similar to the data we are using).

My concern is that this is never the output of a traditional application that records transactions, nor is it what we get when we write SQL. So we need to reshape it into the form below.

This can be achieved in Python by the method explained here https://medium.com/@sureshssarda/pandas-splitting-exploding-a-column-into-multiple-rows-b1b1d59ea12e

Step 1: We need to split each row so that every product sits on its own row together with its transaction id. For simplicity, you can also do this in Excel.
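
If you prefer to do the split in pandas instead of Excel, here is a minimal sketch. It assumes the raw data sits in a single column of comma-separated item names; the column name `items`, the ids `c1`, `c2`, ... and the three sample rows are illustrative assumptions, not the actual contents of the Kaggle file.

```python
import pandas as pd

# Hypothetical raw data: one row per transaction, items separated by commas.
raw = pd.DataFrame({"items": ["MILK,BREAD,BISCUIT", "BREAD,TEA", "BREAD,MILK"]})

# Assign a transaction id to every row (c1, c2, ...).
raw["id"] = ["c" + str(i + 1) for i in range(len(raw))]

# Split each basket into a list, then explode so each item gets its own row.
long_df = (
    raw.assign(product=raw["items"].str.split(","))
       .explode("product")
       .loc[:, ["id", "product"]]
       .reset_index(drop=True)
)

# Remove stray spaces around item names.
long_df["product"] = long_df["product"].str.strip()

print(long_df)
# long_df.to_csv("mydata.csv", index=False)  # optional: save for the loading step
```

The result has one row per (transaction id, product) pair, which is the shape the later pivot step expects.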

For Apriori, we need the data in the form below so that the algorithm can extract information easily: one column of transaction ids and a separate column for each product. A 1 indicates that the product was part of the transaction. In the image above, c1 includes Milk, Bread, and Biscuit. The same is indicated below by '1'.

To achieve this we need to write a little Python code. I found the pivot function very useful. But first we need some libraries to run the code below. I am using Spyder (Python 3.7). Run the code below.

##Import Libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Run "pip install mlxtend" in the console if you have not installed mlxtend before, then run the commands above again.

We are all set. Load the data into Python. Save the Excel file we prepared in Step 1 in CSV format as mydata.csv and run the command below.

## Load data in Python
d1 = pd.read_csv("mydata.csv")

Now we need to add one column to our data frame. This column marks the items bought in a transaction with the value 1. Run the command below.

# Add a new column with the constant value 1
d1['value'] = 1

Your data frame in Python should look like the one below.

We need to reshape this to look like fig 1. We will use the pandas pivot function. Run the code below.

# Reshape data to pivot form
d2 = pd.pivot_table(d1, index='id', columns='product', values='value')

Your data frame should now look like the one below.

"nan" denotes that a particular item wasn't part of that particular transaction. Ex: "Bournvita" wasn't in c1, so it is shown as "nan". We need to replace these values with zero. Run the code below.

# Replace 'nan' with 0
d3 = d2.fillna(0)
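
To see what these reshaping steps produce, here is a small self-contained sketch on a made-up frame with the same `id` and `product` columns. The five rows are illustrative only, not the Kaggle data.

```python
import pandas as pd

# Made-up long-format data: one row per (transaction id, product).
d1 = pd.DataFrame({
    "id": ["c1", "c1", "c1", "c2", "c2"],
    "product": ["Milk", "Bread", "Biscuit", "Bread", "Tea"],
})
d1["value"] = 1

# One row per transaction, one column per product; missing pairs become NaN.
d2 = pd.pivot_table(d1, index="id", columns="product", values="value")

# Replace NaN with 0 so each cell is 1 (bought) or 0 (not bought).
d3 = d2.fillna(0)
print(d3)
```

Here c1 gets a 1 under Biscuit, Bread and Milk and a 0 under Tea, while c2 gets a 1 only under Bread and Tea.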

2. Running Algo:

It is time to run the Apriori algorithm on this data frame. Run the code below.

# Run Apriori algo
frequent_itemsets = apriori(d3, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

I will explain these parameters (support and lift) later. The output data frame should look like the one below.

3. Theory and Interpretation:

A. There are two parts to each rule: a. antecedents b. consequents. This means that if a person buys Bread (antecedent), then the same person may also buy Milk (consequent) within the same transaction, as per our data.

B. Support: Support = (Number of transactions containing the item) / (Total transactions). So antecedent support means the support for Bread. It can be calculated using the above equation and the pivot of our data below (prepared in Excel for ease of understanding).

Can you calculate the support for the pair Bread & Milk? I am highlighting the cells for better understanding. Notice that the highlighted cells come in pairs, which means the items were bought together.
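
The same calculation can be done by hand in a few lines of Python. This is a minimal sketch on a made-up list of five transactions (not the article's data), and the helper name `support` is just an illustrative choice.

```python
# Made-up transactions, each a set of items bought together.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Tea"},
    {"Bread", "Milk", "Biscuit"},
    {"Tea", "Biscuit"},
    {"Bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Bread"}, transactions))          # Bread is in 4 of 5 transactions
print(support({"Bread", "Milk"}, transactions))  # the pair is in 2 of 5 transactions
```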

You may now understand the support threshold we entered in our Apriori call ( frequent_itemsets = apriori(d3, min_support=0.2, use_colnames=True) ). Basically, we filtered out the itemsets that appear in fewer than 20% of the transactions.
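
To make the filtering idea concrete, here is a simplified pure-Python sketch of the level-wise search behind Apriori. It is only an illustration on made-up transactions, not how mlxtend implements it, and the function name `frequent_itemsets` and the 0.3 threshold are assumptions chosen for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: keep itemsets whose support is >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # candidate 1-itemsets
    result = {}
    while current:
        size = len(current[0]) + 1  # size of the next level's candidates
        # Count the support of each candidate and keep the frequent ones.
        frequent = {}
        for cand in current:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_support:
                frequent[cand] = sup
        result.update(frequent)
        # Join frequent itemsets into candidates one item larger; every
        # subset of a candidate must itself be frequent (Apriori property).
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in frequent for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result

# Made-up transactions for illustration.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Tea"},
    {"Bread", "Milk", "Biscuit"},
    {"Tea", "Biscuit"},
    {"Bread"},
]
result = frequent_itemsets(transactions, min_support=0.3)
for itemset, sup in result.items():
    print(sorted(itemset), sup)
```

With this toy data the pair Bread and Milk survives the threshold, while pairs such as Bread and Tea, which appear together only once, are dropped.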

C. Confidence: This number tells you how often the items are bought together relative to how often the antecedent item is bought at all. If the items are bought together only rarely compared with the antecedent on its own, the association may be insignificant. A higher confidence means a higher chance that a customer who buys the antecedent also buys the consequent. Confidence = (Support for items bought together) / (Support for the antecedent item)

Ex: Confidence of Milk being bought when Bread is bought = 0.2/0.65 (support for Milk and Bread together divided by the support for Bread), i.e. 0.30769, as per our code output in fig 2.
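
This arithmetic is easy to check in Python. A minimal sketch, using a support of 0.2 for Bread and Milk together and 0.65 for Bread (the values that reproduce the 0.30769 in the output); the function name `confidence` is illustrative.

```python
def confidence(support_both, support_antecedent):
    """Confidence of the rule antecedent -> consequent."""
    return support_both / support_antecedent

# Rule "Bread -> Milk": support of the pair divided by the support of Bread.
conf_bread_milk = confidence(0.2, 0.65)
print(round(conf_bread_milk, 5))  # 0.30769
```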

D. Lift: Indicates the strength of a rule, i.e. we derived that Milk and Bread are bought together, but how strong is this association? The lift number tells us exactly that. Lift = (Support for items bought together) / ((Support for one item) * (Support for the other item)). Note that for confidence we divide the combined support by the support of one item only. Here we divide it by the product of both items' supports.

Ex: Lift (Milk, Bread) = Support for Milk-Bread (0.2) / (Support for Bread (0.65) * Support for Milk (0.25)) = 1.23077, same as our Python output.
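
The lift calculation can be checked the same way. A minimal sketch, assuming supports of 0.2 for Bread and Milk together, 0.65 for Bread and 0.25 for Milk (the values that reproduce the 1.23077 in the output); the function name `lift` is illustrative.

```python
def lift(support_both, support_a, support_b):
    """Observed co-occurrence divided by the rate expected under independence."""
    return support_both / (support_a * support_b)

# Bread and Milk: pair support over the product of the individual supports.
lift_bread_milk = lift(0.2, 0.65, 0.25)
print(round(lift_bread_milk, 5))  # 1.23077
```

A value above 1 means the pair shows up together more often than it would if the two items were bought independently.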

So a lift above 1 indicates that the items are bought together more often than we would expect if they were bought independently, and the higher the lift, the stronger the association. There are a few other parameters for analyzing these results, but support, confidence and lift are the most important and interpretable. Also note that Apriori will give you all these numbers, as shown in the output above, but we also need to understand how to interpret them. That is why I have tried to explain them in detail.

Below is the complete code for your reference.

##Import Libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Load data in Python
d1 = pd.read_csv("mydata.csv")

# Add a new column with the constant value 1
d1['value'] = 1

# Reshape data to pivot form
d2 = pd.pivot_table(d1, index='id', columns='product', values='value')

# Replace 'nan' with 0
d3 = d2.fillna(0)

# Run Apriori algo
frequent_itemsets = apriori(d3, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

I think the same algorithm can be used for text analysis, for example on customer feedback (which two words are said together). What do you think?
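
As a rough sketch of that idea, each feedback comment can be treated as a transaction and each word as an item. The comments below are made up, and the pair counting is a simple stand-in for the first two levels of Apriori rather than a full implementation.

```python
from collections import Counter
from itertools import combinations

# Made-up feedback comments; each comment plays the role of one transaction.
feedback = [
    "delivery was late",
    "late delivery and rude staff",
    "great staff",
    "delivery was fast",
]

# Treat each comment as a basket of distinct lowercase words.
baskets = [set(comment.lower().split()) for comment in feedback]

# Count how many comments contain each pair of words.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = comments containing both words / total comments.
pair_support = {pair: count / len(baskets) for pair, count in pair_counts.items()}
print(pair_support[("delivery", "late")])  # the pair appears in 2 of 4 comments
```

In practice you would probably drop very common words such as "was" and "and" first, since they pair with almost everything.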

Let me know your suggestions and feedback. Happy to help.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by Bhushan Ikhar

Data Visualizer & a Data Scientist | IIMC | Zeno | Reliance | instinctb.com | quora.com/q/datasciencelearners | https://www.linkedin.com/in/ikharbhushan/
