MARKET BASKET ANALYSIS IN PYTHON

By Reia Natu
Published in Analytics Vidhya, Dec 15, 2020

What is Market Basket Analysis used for?

  1. Build a recommendation engine.
  2. Improve product recommendations.
  3. Cross-sell products.
  4. Improve inventory management.
  5. Upsell products.

Let us understand it through the use case worked out in this article.

#Loading packages
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
#Reading Data
retaildata = pd.read_excel('online_retail.xlsx')
retaildata.head()
retaildata.shape
(541909, 8)

This data has 541909 observations recorded for 8 variables.

However, this data needs some pre-processing before it can be used for further analysis.

Pre-processing Steps

This involves:

  1. Removing extra spaces from the item descriptions
  2. Removing rows with a missing invoice number
  3. Converting the invoice number to a string value
  4. Removing the credit transactions (invoice numbers containing 'C')
#Cleaning the data
retaildata['Description'] = retaildata['Description'].str.strip()
retaildata.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
retaildata['InvoiceNo'] = retaildata['InvoiceNo'].astype('str')
retaildata = retaildata[~retaildata['InvoiceNo'].str.contains('C')]
retaildata.head()
retaildata.shape
(532621, 8)

Now the data is ready to be used with 532621 observations for the 8 variables.
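
The cleaning steps above can be tried on a small example. This is a minimal sketch, assuming a toy DataFrame whose rows are invented for illustration (they are not taken from online_retail.xlsx); only the column names match the real data.

```python
import pandas as pd

# Toy stand-in for the retail data; every value here is invented.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", None, 536367],
    "Description": [" WHITE METAL LANTERN ", "Discount", "HAND WARMER", "POSTAGE"],
    "Quantity": [6, -1, 2, 1],
})

# Trim stray whitespace from the item descriptions.
toy["Description"] = toy["Description"].str.strip()
# Drop rows with no invoice number, then treat the number as text.
toy = toy.dropna(subset=["InvoiceNo"]).copy()
toy["InvoiceNo"] = toy["InvoiceNo"].astype(str)
# Credit invoices carry a 'C' in the invoice number; keep everything else.
toy = toy[~toy["InvoiceNo"].str.contains("C")]

print(toy)
```

Only the two ordinary invoices should survive: the row without an invoice number and the credit row are both dropped.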

retaildata['Country'].value_counts()
United Kingdom 487622
Germany 9042
France 8408
EIRE 7894
Spain 2485
Netherlands 2363
Belgium 2031
Switzerland 1967
Portugal 1501
Australia 1185
Norway 1072
Italy 758
Channel Islands 748
Finland 685
Cyprus 614
Sweden 451
Unspecified 446
Austria 398
Denmark 380
Poland 330
Japan 321
Israel 295
Hong Kong 284
Singapore 222
Iceland 182
USA 179
Canada 151
Greece 145
Malta 112
United Arab Emirates 68
European Community 60
RSA 58
Lebanon 45
Lithuania 35
Brazil 32
Czech Republic 25
Bahrain 18
Saudi Arabia 9
Name: Country, dtype: int64

Counting the records by country shows that the United Kingdom accounts for by far the largest share.

Let us now subset the data to include only the 1185 Australian records, as seen below:

#Separating transactions for Australia
transaction_basket = (retaildata[retaildata['Country'] == "Australia"]
                      .groupby(['InvoiceNo', 'Description'])['Quantity']
                      .sum().unstack().reset_index().fillna(0)
                      .set_index('InvoiceNo'))
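
To see what the groupby and unstack chain produces, here is a minimal sketch on invented line items (the invoice numbers and item names are hypothetical). Each row of the result is one invoice and each column is one item.

```python
import pandas as pd

# Invented line items: one row per purchase of an item on an invoice.
lines = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1002"],
    "Description": ["CAKE STAND", "PENCILS", "CAKE STAND", "CAKE STAND", "POSTAGE"],
    "Quantity": [1, 2, 3, 1, 1],
})

basket = (lines
          .groupby(["InvoiceNo", "Description"])["Quantity"]
          .sum()        # total quantity of each item per invoice
          .unstack()    # items become columns, invoices stay as rows
          .fillna(0))   # an item absent from an invoice gets quantity 0

print(basket)
```

Invoice 1002 lists CAKE STAND twice, so the two quantities are summed into a single cell.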

Next, we encode the values so that 1 indicates that an item appears in an invoice and 0 indicates that it does not, and then proceed to Market Basket Analysis.

#Converting all positive values to 1 and everything else to 0
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

#Note: pandas 2.1 and later deprecate DataFrame.applymap in favour of DataFrame.map
transaction_basket_sets = transaction_basket.applymap(encode_units)

#Removing "postage" as an item
transaction_basket_sets.drop('POSTAGE', inplace=True, axis=1)
#Viewing the transaction basket
transaction_basket.head()
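
The same encoding can also be written without a cell-by-cell function. This is a sketch of a vectorised alternative on invented quantities, not the approach used in this article; the names quantities, purchased and encoded are hypothetical.

```python
import pandas as pd

# Invented quantities; the negative value stands in for a return.
quantities = pd.DataFrame(
    {"CAKE STAND": [4, 0, 1], "PENCILS": [0, -2, 12]},
    index=["1001", "1002", "1003"],
)

# True where at least one unit was bought, False otherwise.
purchased = quantities > 0
# The same information as 1/0, matching the encode_units convention.
encoded = purchased.astype(int)

print(encoded)
```

Recent mlxtend releases warn when the input is not boolean, so passing the `purchased` frame directly may be the better fit there.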

Key Steps

Market basket analysis involves:

  1. Constructing association rules
  2. Identifying items frequently purchased together

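An association rule describes the relationship between an antecedent X and a consequent Y, read as "if X is purchased, then Y is also likely to be purchased":

X → Y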

The important metrics for evaluating association rules are as follows:

SUPPORT

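It measures how often an item or itemset is purchased, as a fraction of all transactions:

Support(X) = (number of transactions containing X) / (total number of transactions)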

CONFIDENCE

It measures how often items in Y appear in transactions containing X:

Confidence(X → Y) = Support(X and Y together) / Support(X)

LIFT

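It tells us how much more likely item Y is to be purchased when item X is purchased, compared with how often Y is purchased overall. Lift > 1 indicates that the items are bought together more often than would be expected if they were unrelated, so the rule predicts better than simply assuming Y at its usual rate. Lift = 1 indicates no association, and Lift < 1 signifies a poor association rule, since the items are bought together less often than expected.

Lift(X → Y) = Confidence(X → Y) / Support(Y)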
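
The three metrics can be worked through by hand on a tiny example. This is a minimal sketch in plain Python, assuming five invented baskets; the function names support, confidence and lift are written here for illustration and are not mlxtend functions.

```python
# Five invented baskets, each a set of item names.
baskets = [
    {"pencils", "cake stand"},
    {"pencils", "cake stand", "lunch bag"},
    {"cake stand"},
    {"lunch bag"},
    {"pencils", "lunch bag"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(x, y):
    """Of the baskets containing x, the fraction that also contain y."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of x -> y relative to the overall support of y."""
    return confidence(x, y) / support(y)

x, y = {"pencils"}, {"cake stand"}
print(support(x | y), confidence(x, y), lift(x, y))
```

Here pencils and cake stand each appear in 3 of 5 baskets and together in 2 of 5, so the support of the pair is 0.4, the confidence is 2/3 and the lift is about 1.11.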

Market Basket Analysis and making recommendations

In this section let us find the frequent itemsets by setting a minimum support threshold of 0.07 (an itemset must appear in at least 7% of the invoices) and then generate rules using the lift metric with a minimum lift of 1, as seen below:

#Generating frequent itemsets
frequent_itemsets = apriori(transaction_basket_sets, min_support=0.07, use_colnames=True)
#Generating rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
#Viewing the first 10 rules
rules.head(10)
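
To get a feel for what the minimum support threshold does, the sketch below counts itemsets by brute force on five invented baskets. It is not the Apriori algorithm that mlxtend implements, only an illustration of the filtering step; frequent_itemsets and max_size are hypothetical names.

```python
from itertools import combinations

# Five invented baskets, each a set of item names.
baskets = [
    {"pencils", "cake stand"},
    {"pencils", "cake stand", "lunch bag"},
    {"cake stand"},
    {"lunch bag"},
    {"pencils", "lunch bag"},
]

def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets that meet min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            share = sum(set(combo) <= b for b in baskets) / len(baskets)
            if share >= min_support:
                result[frozenset(combo)] = share
    return result

loose = frequent_itemsets(baskets, min_support=0.4)
strict = frequent_itemsets(baskets, min_support=0.7)
print(len(loose), len(strict))
```

With a threshold of 0.4, three single items and two pairs qualify; with 0.7, nothing does. A threshold set too high can therefore leave no itemsets to build rules from.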

Let us now consider the first rule, where the antecedent is ‘36 PENCILS TUBE RED RETROSPOT’ and the consequent is ‘RED RETROSPOT CAKE STAND’. Its lift value is greater than 1, so the two items are bought together more often than chance would suggest, which makes it a promising rule.

transaction_basket_sets['36 PENCILS TUBE RED RETROSPOT'].sum()
4
transaction_basket_sets['RED RETROSPOT CAKE STAND'].sum()
4

Each item appears in 4 Australian invoices. Taken together with the rule above, this suggests that the invoices containing ‘36 PENCILS TUBE RED RETROSPOT’ are largely the same ones that contain ‘RED RETROSPOT CAKE STAND’; the confidence value of the rule gives the exact share.
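
Equal column totals do not by themselves prove that the same invoices are involved, so it helps to count the overlap directly. This is a minimal sketch on an invented one-hot basket with shortened, hypothetical item names.

```python
import pandas as pd

# Invented one-hot basket: rows are invoices, columns are items.
sets = pd.DataFrame(
    {"PENCILS": [1, 1, 0, 1], "CAKE STAND": [1, 1, 1, 0]},
    index=["1001", "1002", "1003", "1004"],
)

# Invoices in which both items appear.
both = (sets["PENCILS"] == 1) & (sets["CAKE STAND"] == 1)
n_both = both.sum()
# Share of PENCILS invoices that also contain CAKE STAND (the confidence).
conf_pencils_to_stand = n_both / sets["PENCILS"].sum()

print(n_both, conf_pencils_to_stand)
```

In this toy table both items appear in 3 invoices, yet only 2 invoices contain both, so the confidence is 2/3 rather than 1.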

It can be recommended to place these items together to increase the number of sales.
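
A rules table can be turned into simple recommendations by looking up the rules whose antecedent contains a given item. The sketch below assumes a hand-built table with invented numbers that mimics the antecedents, consequents, confidence and lift columns returned by association_rules; recommend and rules_demo are hypothetical names.

```python
import pandas as pd

# Hand-built stand-in for the rules table; the numbers are invented.
rules_demo = pd.DataFrame({
    "antecedents": [frozenset({"PENCILS"}), frozenset({"PENCILS"}),
                    frozenset({"LUNCH BAG"})],
    "consequents": [frozenset({"CAKE STAND"}), frozenset({"LUNCH BAG"}),
                    frozenset({"PENCILS"})],
    "confidence": [0.9, 0.4, 0.5],
    "lift": [3.0, 1.2, 1.2],
})

def recommend(rules, item, min_lift=1.0):
    """Consequents of rules whose antecedent contains `item`, best lift first."""
    hits = rules[rules["antecedents"].apply(lambda a: item in a)]
    hits = hits[hits["lift"] >= min_lift].sort_values("lift", ascending=False)
    return [sorted(c) for c in hits["consequents"]]

print(recommend(rules_demo, "PENCILS"))
```

Raising min_lift narrows the list to the strongest associations, which is one way to decide which items to place together.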

Visualizing results for one antecedent lhs item against consequents

# Viewing results for one antecedent lhs item against consequents 
import seaborn as sns
import matplotlib.pyplot as plt

# Replacing the sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# Transforming the dataframe of rules into a matrix using the lift metric
pivot = rules.pivot(index = 'antecedents_',columns = 'consequents_', values= 'lift')

# Generating a heatmap
sns.heatmap(pivot, annot = True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

Note: This visualisation can be adapted to rules that have more than one antecedent item, plotted against each of their consequents.
