MARKET BASKET ANALYSIS IN PYTHON

Reia Natu

Published in Analytics Vidhya

Dec 15, 2020 · 4 min read

What is Market Basket Analysis used for?

Build a recommendation engine.
Improve product recommendations.
Cross-sell products.
Improve inventory management.
Upsell products.

Let us understand it through the following use case, as explained in this article.

#Loading packages
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

#Reading data
retaildata = pd.read_excel('online_retail.xlsx')
retaildata.head()

retaildata.shape

(541909, 8)

This data has 541909 observations recorded for 8 variables.

However, this data needs some pre-processing before it can be used for further analysis.

Pre-processing Steps

This involves:

Removing extra spaces from the item descriptions
Dropping rows that have no invoice number
Converting the invoice number to a string value
Removing the credit transactions

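#Cleaning the data
#Strip leading and trailing spaces from the item descriptions
retaildata['Description'] = retaildata['Description'].str.strip()
#Drop rows that have no invoice number
retaildata.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
#Treat invoice numbers as strings so they can be pattern-matched
retaildata['InvoiceNo'] = retaildata['InvoiceNo'].astype('str')
#Credit transactions have a 'C' in the invoice number; drop them
retaildata = retaildata[~retaildata['InvoiceNo'].str.contains('C')]
retaildata.head()

retaildata.shape

(532621, 8)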

Now the data is ready to be used with 532621 observations for the 8 variables.
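
The cleaning steps above can be tried out on a tiny hand-made frame. This is only a sketch: the rows below are invented for illustration and are not taken from the dataset, and it uses `str.startswith('C')` on the assumption that credit invoices begin with the letter C, which is slightly stricter than matching a C anywhere in the number.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from online_retail.xlsx
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", None, "536366"],
    "Description": [" WHITE MUG ", "WHITE MUG", "RED BOWL", "RED BOWL "],
    "Quantity": [6, -1, 2, 3],
})

# Strip stray spaces so ' WHITE MUG ' and 'WHITE MUG' count as the same item
toy["Description"] = toy["Description"].str.strip()

# Drop rows with no invoice number, then treat the number as text
toy = toy.dropna(subset=["InvoiceNo"]).copy()
toy["InvoiceNo"] = toy["InvoiceNo"].astype(str)

# Assumption: credit invoices start with 'C', so startswith is enough here
cleaned = toy[~toy["InvoiceNo"].str.startswith("C")]

print(cleaned)
```

Only the two ordinary invoices survive: the row without an invoice number and the credit row are both removed.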

retaildata['Country'].value_counts()

United Kingdom          487622
Germany                   9042
France                    8408
EIRE                      7894
Spain                     2485
Netherlands               2363
Belgium                   2031
Switzerland               1967
Portugal                  1501
Australia                 1185
Norway                    1072
Italy                      758
Channel Islands            748
Finland                    685
Cyprus                     614
Sweden                     451
Unspecified                446
Austria                    398
Denmark                    380
Poland                     330
Japan                      321
Israel                     295
Hong Kong                  284
Singapore                  222
Iceland                    182
USA                        179
Canada                     151
Greece                     145
Malta                      112
United Arab Emirates        68
European Community          60
RSA                         58
Lebanon                     45
Lithuania                   35
Brazil                      32
Czech Republic              25
Bahrain                     18
Saudi Arabia                 9
Name: Country, dtype: int64

Checking the row counts by country shows that the United Kingdom accounts for by far the largest share of the records.

Let us now subset the data to include only the 1185 Australian records, as seen below:

#Separating transactions for Australia
transaction_basket = (retaildata[retaildata['Country'] =="Australia"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
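
To see what the groupby and unstack chain produces, here is a minimal sketch on a few invented line items (the invoice numbers and item names are made up). It skips the `reset_index`/`set_index` round trip, since `unstack` already leaves the invoice number as the index.

```python
import pandas as pd

# Toy line items invented for illustration
lines = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2", "A2"],
    "Description": ["MUG", "BOWL", "MUG", "MUG", "PLATE"],
    "Quantity": [2, 1, 3, 4, 6],
})

# Sum quantities per (invoice, item), then spread the items into columns
basket = (lines
          .groupby(["InvoiceNo", "Description"])["Quantity"]
          .sum().unstack().fillna(0))

print(basket)
```

Each row is one invoice and each column one item. The two MUG lines on invoice A1 are summed, and items missing from an invoice become 0.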

Next, we encode the values so that 1 indicates the item is present in the transaction and 0 indicates it is absent, and then proceed to Market Basket Analysis.

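#Converting all positive values to 1 and everything else to 0
def encode_units(x):
    #A quantity of at least 1 means the item is in the basket
    if x >= 1:
        return 1
    #Zero, negative or fractional quantities count as absent
    return 0

transaction_basket_sets = transaction_basket.applymap(encode_units)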

#Removing "postage" as an item
transaction_basket_sets.drop('POSTAGE', inplace=True, axis=1)

#Viewing the transaction basket
transaction_basket.head()
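
The element-wise encoding can also be written as a single comparison on the whole frame. The sketch below uses an invented quantity matrix and assumes quantities are whole numbers; the result is a boolean frame, which `apriori` also accepts.

```python
import pandas as pd

# Toy quantity matrix invented for illustration (rows are invoices)
basket = pd.DataFrame(
    {"MUG": [5.0, 0.0, -2.0], "BOWL": [0.0, 1.0, 3.0], "POSTAGE": [1.0, 1.0, 1.0]},
    index=["A1", "A2", "A3"],
)

# True where at least one unit was bought; zero or negative counts as absent
basket_sets = basket >= 1

# Drop the non-product column, as the article does for 'POSTAGE'
basket_sets = basket_sets.drop(columns="POSTAGE")

print(basket_sets.astype(int))
```

The negative quantity on invoice A3 is encoded as absent, just as a returned item should be.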

Key Steps

Market basket analysis involves:

Constructing association rules
Identifying items frequently purchased together

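An association rule explains the relationship between an antecedent X and a consequent Y, written X → Y: transactions that contain X also tend to contain Y.

The important metrics for evaluating association rules are as follows: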

SUPPORT

It measures how often the product is purchased: support(X) = (number of transactions containing X) / (total number of transactions).

CONFIDENCE

It measures how often items in Y appear in transactions containing X: confidence(X → Y) = support(X and Y together) / support(X).

LIFT

It tells us how much more likely item Y is to be purchased when item X is purchased, compared with how often Y is purchased overall: lift(X → Y) = confidence(X → Y) / support(Y). Lift > 1 indicates that the items are likely to be bought together, meaning the rule predicts the purchase of Y better than simply assuming the two items are unrelated. Conversely, Lift < 1 signifies a poor association rule.
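
The three metrics can be worked through by hand on a few baskets. This is a minimal sketch using the standard definitions (support as a fraction of all baskets, confidence as support of X and Y together divided by support of X, lift as confidence divided by support of Y); the baskets and the helper names `support`, `confidence` and `lift` are invented for illustration.

```python
# Toy baskets invented for illustration; each set is one transaction
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
    {"butter", "jam"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(x, y):
    """Of the baskets containing X, the share that also contain Y."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of X -> Y relative to the baseline support of Y."""
    return confidence(x, y) / support(y)

x, y = {"bread"}, {"butter"}
print(support(x | y), confidence(x, y), lift(x, y))
```

Here bread and butter appear together in 2 of 5 baskets (support 0.4), 2 of the 3 bread baskets also hold butter (confidence about 0.67), and the lift is about 1.11, slightly above 1.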

Market Basket Analysis and making recommendations

In this section, let us find the most frequent itemsets by setting a minimum support threshold of 0.07 and then generate rules using the lift metric, as seen below:

#Generating frequent itemsets
frequent_itemsets = apriori(transaction_basket_sets, min_support=0.07, use_colnames=True)

#Generating rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

#Viewing the top 10 rules
rules.head(10)
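
To make the role of the minimum support threshold concrete, here is a naive level-wise search in plain Python. It is a sketch of the general Apriori idea, not mlxtend's implementation: the baskets and the function name `frequent_itemsets` are invented for illustration, and candidate pruning is simplified.

```python
from itertools import combinations

# Toy baskets invented for illustration
baskets = [
    {"mug", "bowl"},
    {"mug", "bowl", "plate"},
    {"mug"},
    {"plate"},
]

def frequent_itemsets(baskets, min_support):
    """Level-wise search: grow candidates only from frequent smaller sets."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        # Keep the candidates that meet the support threshold
        level = {}
        for cand in current:
            sup = sum(cand <= b for b in baskets) / n
            if sup >= min_support:
                level[cand] = sup
        result.update(level)
        # Join frequent k-sets into (k+1)-set candidates
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == size})
    return result

found = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in found.items():
    print(sorted(itemset), sup)
```

With a threshold of 0.5, all three single items and the pair mug and bowl are kept, while pairs that occur in only one of the four baskets are discarded. Lowering the threshold admits more itemsets, which is why 0.07 is workable on a small subset like the Australian one.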

Let us now consider the first rule, where the antecedent is ‘36 PENCILS TUBE RED RETROSPOT’ and the consequent is ‘RED RETROSPOT CAKE STAND’. Its lift value is greater than 1, which suggests this is a strong rule.

transaction_basket_sets['36 PENCILS TUBE RED RETROSPOT'].sum()

4

transaction_basket_sets['RED RETROSPOT CAKE STAND'].sum()

4

On viewing the transaction basket for this antecedent-consequent rule, it can be said that the 4 transactions that include ‘36 PENCILS TUBE RED RETROSPOT’ are the same ones that include ‘RED RETROSPOT CAKE STAND’.
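
Equal column totals alone do not prove that the same transactions contain both items, so it helps to count the overlap directly. The sketch below does this on an invented one-hot basket; the shortened column names are placeholders, not the dataset's item descriptions.

```python
import pandas as pd

# Toy one-hot basket invented for illustration (1 = item is on the invoice)
sets = pd.DataFrame(
    {"PENCILS": [1, 1, 0, 1, 1], "CAKE STAND": [1, 1, 0, 1, 1], "MUG": [0, 1, 1, 0, 0]},
    index=["A1", "A2", "A3", "A4", "A5"],
)

# Invoices that contain both items, not just the same number of each
both = (sets["PENCILS"] == 1) & (sets["CAKE STAND"] == 1)

# Number of invoices holding both items
print(both.sum())

# Confidence of PENCILS -> CAKE STAND: overlap divided by PENCILS invoices
print(both.sum() / sets["PENCILS"].sum())
```

When the overlap equals the antecedent's own count, the confidence of the rule is 1, which is what the statement above amounts to.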

It can be recommended to place these items together to increase the number of sales.

Visualizing results for one antecedent lhs item against consequents

# Viewing results for one antecedent lhs item against consequents 
import seaborn as sns
import matplotlib.pyplot as plt

# Replacing the sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# Transforming the dataframe of rules into a matrix using the lift metric
pivot = rules.pivot(index = 'antecedents_',columns = 'consequents_', values= 'lift')

# Generating a heatmap
sns.heatmap(pivot, annot = True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

Note: This visualisation can be customised for the other rules where there is >1 antecedent with respect to each of their consequents.
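
One way to do this, sketched here under assumptions, is to filter the rules by the number of antecedent items before pivoting. The toy rules frame below is invented and only mimics the `antecedents`, `consequents` and `lift` columns used above; `n_items` is a hypothetical parameter name.

```python
import pandas as pd

# Toy rules invented for illustration, shaped like the columns used above
rules = pd.DataFrame({
    "antecedents": [frozenset({"MUG"}), frozenset({"MUG", "BOWL"}), frozenset({"BOWL"})],
    "consequents": [frozenset({"PLATE"}), frozenset({"PLATE"}), frozenset({"MUG"})],
    "lift": [1.5, 2.0, 1.2],
})

# Keep rules with exactly two antecedent items
n_items = 2
subset = rules[rules["antecedents"].apply(len) == n_items].copy()

# Sort the items so the label does not depend on set iteration order
subset["antecedents_"] = subset["antecedents"].apply(lambda a: ",".join(sorted(a)))
subset["consequents_"] = subset["consequents"].apply(lambda a: ",".join(sorted(a)))

# Same pivot as before, ready to pass to sns.heatmap
pivot = subset.pivot(index="antecedents_", columns="consequents_", values="lift")
print(pivot)
```

Setting `n_items = 1` instead restricts the heatmap to single-item antecedents, matching the heading of this section.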