Product recommendation using Market Basket Analysis: A practical guide

By Samuel Theophilus
Published in CodeX, Nov 29, 2021 (5 min read)
Image: Unsplash, https://unsplash.com/photos/rWMIbqmOxrY

Insights from McKinsey & Company in 2018 showed that 35% of what consumers purchase on Amazon and 75% of what they watch on Netflix come from product recommendations [1]. Those are large shares, and they show how successful recommendation strategies can be when done the right way. Product recommendations have long been a reliable way to drive sales by offering customers relevant products. Many algorithms can generate product recommendations; this article focuses on one of them, Association Rules (Market Basket Analysis).

What are Association Rules?

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases — Wikipedia.

Association rules are used to spot frequent patterns in datasets. The technique has proven successful in analyzing transactional data to understand customer behavior, and that understanding is then used to increase sales. Given transaction records with multiple items, Association Rules via Market Basket Analysis (MBA) tries to find rules that describe which items are often purchased together.

For instance, Karen Heath in 1992 discovered a strong correlation between beer and diapers [2]: customers who bought diapers were also more likely to purchase beer. The discovery is a good example of how hidden correlations between products can be uncovered with association rules and then used to maximize sales.

Popular Recommendation Strategies

Before we go into the details of how MBA works, let us have an overview of common product recommendation algorithms:

  1. Popular Product Recommendation: Recommends items to customers based on the overall popularity of each item.
  2. Content Filtering: Builds a customer profile from that customer's behavior (e.g., ratings) and generates recommendations based on their own activity patterns.
  3. Collaborative Filtering: Makes recommendations to a customer by collecting preferences or activity patterns from other, similar customers.
  4. Association Rules: This method differs from the others because it tries to answer the question “What items frequently appear together?”

How MBA works

Fig 1: Sample Transactional database

Consider a store's sample transaction records, shown in fig. 1, which contain the following products:

Items = {Milk, Bread, Beer, Diaper, Coke}

  • An Itemset is a collection of one or more items from the product list.

E.g. {Milk, Bread, Diaper} or {Milk}

  • Support is defined as the fraction of transactions that contain an itemset. The higher the support, the more frequently the itemset occurs. Given an itemset A, the support of A is the number of transactions containing A divided by the total number of transactions.

For example, using the table from fig. 1, Support({Milk, Bread, Diaper}) = 2/5

  • Confidence, on the other hand, is the estimated probability that a transaction contains itemset B given that it contains itemset A: Confidence(A → B) = Support_count(A ∪ B) / Support_count(A).

For example, for the rule {Bread} → {Diaper}:

Support_count({Bread}) = 4

Support_count({Bread, Diaper}) = 3

Therefore Confidence({Bread} → {Diaper}) = 3/4

  • A Frequent Pattern or Frequent Itemset is an itemset whose support is greater than or equal to the selected minimum support threshold. For instance, if the minimum support count for the table (fig. 1) is 2, any itemset that appears in fewer than 2 transactions is below the threshold and would not be selected.
  • Lastly, Lift measures how much more often itemsets A and B occur together than would be expected if they were independent: Lift(A → B) = Confidence(A → B) / Support(B). It helps to determine whether combining a product with another improves the chances of making a sale.

Understanding Lift Score

  • If Lift > 1, the rule improves the chances of the outcome: customers who buy A are more likely than average to buy B.
  • If Lift < 1, the rule lessens the chances of the desired outcome: customers who buy A are less likely than average to buy B.
  • If Lift = 1, the rule does not affect the outcome: buying A tells us nothing about buying B.
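
To make the three metrics concrete, here is a minimal pure-Python sketch. The table from fig. 1 is not reproduced in this text, so the transactions below are hypothetical, chosen only so that they agree with the counts quoted above (support 2/5, counts 4 and 3). The helper names are likewise illustrative, not part of any library.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption, not the actual fig. 1 table),
# picked to match the counts quoted in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """Fraction of transactions that contain `itemset`."""
    return Fraction(support_count(itemset), len(transactions))

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return Fraction(support_count(both), support_count(antecedent))

def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Milk", "Bread", "Diaper"}))  # 2/5
print(confidence({"Bread"}, {"Diaper"}))     # 3/4
print(lift({"Bread"}, {"Diaper"}))           # 15/16
```

In this hypothetical table the lift of {Bread} → {Diaper} is slightly below 1, because Diaper already appears in 4 of the 5 transactions, so knowing that a basket contains Bread does not make Diaper any more likely.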

A Practical Experiment using Python

In this section, we explore an introductory notebook on GitHub by Chris Moffitt [3]. Before we dive into the code, let us go over the runtime requirements:

Requirements

Python Version: Python 3

Python Libraries:

pandas: a fast, powerful, flexible data analysis and manipulation tool built on top of the Python programming language.

mlxtend: a Python library of useful tools for day-to-day data science tasks, such as generating association rules.

Getting started

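Now that we know the tools we will be working with, let's install the libraries. The openpyxl package is included because pandas relies on it to read .xlsx files.

pip install pandas
pip install mlxtend
pip install openpyxl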

Import the libraries:

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

There are various algorithms we can use to generate frequent patterns (FP-Growth, ECLAT, Apriori). For this example, we will stick with Apriori, a simple level-wise approach that is generally slower than the other methods on large datasets.
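
To illustrate the level-wise idea behind Apriori, here is a minimal pure-Python sketch. It is not mlxtend's implementation; the function name `apriori_sketch`, the use of a support count rather than a fraction, and the toy transactions are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy transactions, used only to exercise the sketch.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def apriori_sketch(transactions, min_count):
    """Return {itemset: support_count} for every itemset seen in at least
    `min_count` transactions, growing itemsets one item at a time."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that meet the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate can only be frequent if all of its
        # (k-1)-item subsets were frequent at the previous level.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

for itemset, count in sorted(apriori_sketch(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is what separates Apriori from counting every possible combination: an itemset is only counted if all of its smaller subsets were already found to be frequent.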

Next, we load the dataset, clean it, and prepare the data for MBA.

df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')

# Cleaning & prep
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
# Drop cancelled invoices, whose numbers contain 'C'
df = df[~df['InvoiceNo'].str.contains('C')]

# Filter data: keep only transactions from France, one row per invoice
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

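The code above loads transactional data for a UK-based online retail store (retrieved from the UCI Machine Learning Repository) and builds a basket table, with one row per invoice and one column per product, for orders from France.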
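
The apriori call that follows expects a table named basket_sets, which is not defined in the code shown here. A minimal sketch of that step, assuming basket_sets is simply a one-hot (boolean) version of basket, is given below; the referenced notebook may encode it differently. The helper name `to_basket_sets` and the toy invoice and product names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the `basket` table (rows: invoices, columns: products,
# values: summed quantities). All names and numbers here are made up.
toy_basket = pd.DataFrame(
    {"ALARM CLOCK": [2.0, 0.0, 1.0], "LUNCH BAG": [0.0, 6.0, 3.0]},
    index=pd.Index(["A001", "A002", "A003"], name="InvoiceNo"),
)

def to_basket_sets(basket: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean table: True where the invoice contains the product."""
    return basket > 0

toy_sets = to_basket_sets(toy_basket)
print(toy_sets)
```

With the real data this would be `basket_sets = to_basket_sets(basket)`. Quantities are reduced to present or absent because the frequent-itemset step only needs to know whether a product appears in an invoice, not how many units were bought.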

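To generate frequent patterns with a minimum support of 0.07, we can run the code below. Here basket_sets is assumed to be a one-hot encoded version of basket, where each cell indicates whether the invoice contains the product; see the notebook [3] for that step.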

frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

Lastly, we can generate the association rules and filter them based on the desired confidence, support, and lift threshold values.

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

After generating the results, we can filter with threshold values to select rules with stronger lift or confidence. Here we keep rules with a lift of at least 6 and a confidence of at least 0.8:

rules[(rules['lift'] >= 6) &
      (rules['confidence'] >= 0.8)]

The generated rules can be applied to store records and used for product placements and recommendations. For a more comprehensive walkthrough of the code implementation and its results, check the notebook for the complete code.
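
As a rough illustration of how rules might drive recommendations, the sketch below matches a customer's cart against a small list of rules. The rules, their numbers, and the `recommend` helper are hypothetical; in practice the rows of the rules table produced above would play this role, and other ranking choices are possible.

```python
# Hypothetical rules, loosely mirroring the columns of an association
# rules table (antecedents, consequents, confidence, lift).
rules = [
    {"antecedents": frozenset({"Diaper"}), "consequents": frozenset({"Beer"}),
     "confidence": 0.75, "lift": 1.25},
    {"antecedents": frozenset({"Milk", "Diaper"}), "consequents": frozenset({"Coke"}),
     "confidence": 0.67, "lift": 1.67},
    {"antecedents": frozenset({"Bread"}), "consequents": frozenset({"Milk"}),
     "confidence": 0.75, "lift": 0.94},
]

def recommend(cart, rules, min_lift=1.0, top_n=3):
    """Suggest items not yet in the cart, ranked by rule confidence.

    A rule applies when all of its antecedents are in the cart and its
    lift is above `min_lift`.
    """
    cart = set(cart)
    scores = {}
    for rule in rules:
        if rule["lift"] > min_lift and rule["antecedents"] <= cart:
            for item in rule["consequents"] - cart:
                # Keep the best confidence seen for each candidate item.
                scores[item] = max(scores.get(item, 0.0), rule["confidence"])
    # Highest confidence first; ties broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

print(recommend({"Milk", "Diaper"}, rules))  # ['Beer', 'Coke']
print(recommend({"Bread"}, rules))           # []
```

The second call returns nothing because the only rule that matches has a lift below 1, which, as discussed earlier, means the pairing does not improve the chances of a sale.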

References

  1. MacKenzie, I., Meyer, C., & Noble, S., How retailers can keep up with consumers (2018), McKinsey & Company. https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
  2. Steve Swoyer, Beer and Diapers: The Impossible Correlation (2016), Transforming Data with Intelligence.
  3. Chris Moffitt, Market Basket Analysis Introduction (2016), GitHub.

About the author: Samuel Theophilus, writer for CodeX. Machine Learning Engineer || Technical Writer || Data Engineer • Passionate about Computer Vision, NLP & Business Intelligence.