How To Perform Market Basket Analysis in Python

An Implementation of Machine Learning In Retail Business

10 min read · Aug 3, 2020

Introduction

Data science and machine learning apply to a wide variety of fields, and retail is one of them. Imagine you own a retail shop that sells hundreds of items. In a single month, more than a hundred transactions take place in your shop, and each transaction usually contains more than one item.

At the end of the month, you would like to increase sales, so you try to figure out the best way to create promotions or apply discounts to particular items. You start asking your team questions: Which products should be discounted? How do we apply the discounts? Are there any patterns in our sales that would let us target promotions more accurately?

A little about Market Basket Analysis

To answer these questions, you can try a well-known technique in data science and machine learning called Market Basket Analysis. Market Basket Analysis (MBA) describes an accidental transaction pattern in which purchasing some products affects the purchasing of other products. MBA is used to predict which products a customer is interested in (Halim et al., 2019). It is a kind of knowledge discovery in data (KDD), and the technique can be applied in various fields of work (Maitra, 2019). The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers (Chauhan, 2019). By performing this technique, we can address the problem described above: we find patterns that show which items are frequently purchased together, so that we can apply discounts to particular items more accurately.

In this article, I am not going to explain every fundamental concept of MBA; instead, I am going to show you how to perform market basket analysis using Python. If you are looking for the basic concepts of MBA, this article is not for you. I recommend reading the introductory articles listed under Sources instead.

The purpose of this article is to give you a clear example of how to perform MBA in Python. I highly recommend reading up on the basic concepts before you jump into the code, since this article assumes that you already know them. Alright, without wasting any more time, let’s jump right in!

Market Basket Analysis in Python, Step by Step

1. Import Dataset

Here, I will use one of the most commonly used datasets among data scientists: the online retail data from the UK. I have used this dataset for several other projects as well, so you might be familiar with it. You can access the data source here. First things first, let’s have a look at the dataset by importing it.

As you can see, this dataset contains 8 features and 541909 rows. It shows the transactions of an actual UK online retailer from 1 December 2010 until 9 December 2011, so it covers a whole year of transaction data from an actual online retail business. In the code above, I had already downloaded the dataset from its source and saved it in the same folder as the Jupyter notebook file, so that I could simply read it using pandas.
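For readers following along without the screenshots, here is a minimal sketch of the loading step. It reads a tiny in-memory stand-in instead of the real download, so the three rows are illustrative only; the eight column names are my assumption about how the UCI file is laid out, and with the real file you would pass its path (or use pd.read_excel for an .xlsx download) instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the downloaded file (illustrative rows only;
# the real dataset has 541909 rows). Column names are assumed.
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom\n"
    "536370,22728,ALARM CLOCK BAKELIKE PINK,24,2010-12-01 08:45:00,3.75,12583,France\n"
)

# Keep InvoiceNo as text so that non-numeric invoice numbers survive intact.
df = pd.read_csv(sample_csv, dtype={"InvoiceNo": str}, parse_dates=["InvoiceDate"])

print(df.shape)  # (rows, columns)
df.info()        # column types and non-null counts
```

Calling info() right after loading is a quick way to spot columns with missing values before the cleaning steps.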

2. Drop all Null Values

Next, before fitting any machine learning model, we should handle the NaN or null values in our dataset. I know, dropping all the null values is not the best practice for handling missing values. But, in order to keep this article at a reasonable length, I am going to simply drop them. As you can see in the picture, the data now has 406829 entries, down from 541909. That is a decrease of 24.9%, which means that around 24.9% of all rows contained at least one missing value.
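A minimal sketch of this step on a toy frame (the rows and the choice of columns are made up for illustration). The last line redoes the arithmetic for the real row counts quoted above.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four rows have a missing value somewhere.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["WHITE METAL LANTERN", None, "ALARM CLOCK BAKELIKE PINK", "WHITE METAL LANTERN"],
    "Quantity": [6, 2, 24, 3],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
})

print(df.isna().sum())  # missing values per column

# Drop every row that has at least one missing value.
clean = df.dropna()
dropped_share = 1 - len(clean) / len(df)
print(f"{len(df)} -> {len(clean)} rows ({dropped_share:.1%} dropped)")

# The same calculation with the row counts from the article.
article_share = (541909 - 406829) / 541909
print(f"{article_share:.1%}")
```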

3. Using the Positive ‘Quantity’ Values

In this dataset, the Quantity column shows the number of items bought in each transaction (InvoiceNo). Because this is an online retailer, a transaction sometimes gets cancelled. When a particular transaction is cancelled, it is recorded in the Quantity column as a negative value. Since we’re doing market basket analysis, we want to analyze what’s inside the baskets that our customers actually bought. Cancelled items are not part of that, which is why we’re not going to use them. As you can see in the picture above, there are only 397924 entries that are not cancelled.
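The filter itself is a one-liner. The sketch below uses made-up rows; the "C" prefix on the cancelled invoice number is only an illustration and plays no role in the filter.

```python
import pandas as pd

# Toy data: the third row is a cancellation, recorded with a negative Quantity.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536381"],
    "Description": ["WHITE METAL LANTERN", "ALARM CLOCK BAKELIKE PINK",
                    "WHITE METAL LANTERN", "ALARM CLOCK BAKELIKE PINK"],
    "Quantity": [6, 8, -1, 2],
})

# Keep only the rows that represent items actually bought.
purchases = df[df["Quantity"] > 0]
print(len(df), "->", len(purchases), "rows")
```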

4. Create the Basket Data Using UK Transactions Only

As you can see from the visualization above, most of the transactions come from the UK (91.5%). So, to keep this project simpler and more focused, let’s limit the data to transactions that come from the UK. With that settled, we can create the basket data. This basket data contains the Quantity of each item bought per transaction (InvoiceNo). How did I do that?

Using the data with positive quantities and UK transactions only, I grouped the data by transaction (InvoiceNo) and item (Description) and took the Quantity of each item bought. After that, I summed the values and unstacked the result. Lastly, I set the index of the data frame to InvoiceNo so that we can see the quantity of each item bought per InvoiceNo. This data frame is basically the ‘basket’ that our customers ‘carry’ to the cashier in our shop. It shows how many units of a particular item each customer / transaction (InvoiceNo) bought. If the number is 0, the customer didn’t buy that particular item. If it shows another value (12, for instance), the customer bought 12 units of that item.
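The steps described above can be sketched on a toy frame as follows. The rows are made up, and the country label "United Kingdom" is my assumption about how the dataset spells it.

```python
import pandas as pd

# Toy data: one French invoice and a repeated item within invoice 536365.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366", "536370"],
    "Description": ["WHITE METAL LANTERN", "WHITE METAL LANTERN",
                    "ALARM CLOCK", "ALARM CLOCK", "ALARM CLOCK"],
    "Quantity": [6, 6, 2, 4, 24],
    "Country": ["United Kingdom"] * 4 + ["France"],
})

# Keep UK transactions only.
uk = df[df["Country"] == "United Kingdom"]

# One row per invoice, one column per item, cell = total quantity bought.
basket = (
    uk.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()                    # total quantity per (invoice, item) pair
    .unstack()                # items become columns
    .reset_index()
    .fillna(0)                # item not in the invoice -> 0
    .set_index("InvoiceNo")
)
print(basket)
```

On the full dataset this table is wide and mostly zeros, since each invoice contains only a handful of the available items.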

5. Encode The Data

In market basket analysis, how many units of each item were bought is not really important. What matters is whether an item was bought or not, because we only want to know how buying some items is associated with buying others. So, we need to encode the basket data into binary data that shows whether an item was bought (1) or not (0). Here’s how I did that.

Here, I created a function called encode_units that has one particular job: encode the units. If the units value is less than or equal to 0, the function changes it to 0 (not bought). If the units value is greater than or equal to 1, it changes it to 1 (bought). This way, we get a data frame that shows whether a particular item was bought or not.
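A sketch of such an encoder on a small hand-made basket. The function name comes from the text; the way it is applied here (column by column with Series.map) is my choice and may differ from the original notebook.

```python
import pandas as pd

def encode_units(x):
    """Return 1 if at least one unit was bought, otherwise 0."""
    return 1 if x >= 1 else 0

# Toy basket: quantities per invoice and item.
basket = pd.DataFrame(
    {"ALARM CLOCK": [2.0, 4.0], "WHITE METAL LANTERN": [12.0, 0.0]},
    index=pd.Index(["536365", "536366"], name="InvoiceNo"),
)

# Apply the encoder to every cell, one column at a time.
basket_sets = basket.apply(lambda col: col.map(encode_units))
print(basket_sets)
```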

6. Filter the Transactions: More Than 1 Item Only

In market basket analysis, we want to uncover associations between 2 or more items that were bought together according to historical data. A transaction that contains only a single item is therefore less useful. I mean, how could we uncover an association between 2 or more items if only 1 item was bought? Hence, the next step is to keep only the transactions that contain more than 1 item. Here’s how I did it.

According to the result above, there are 15376 transactions that contain more than 1 item. This means that 92.35% of the transactions in the basket data contain more than 1 item.
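One way to express this filter, sketched on a toy encoded basket (the invoice numbers and items are made up): sum each row to count the distinct items in the transaction, then keep the rows with a count of 2 or more.

```python
import pandas as pd

# Toy encoded basket: 1 = bought, 0 = not bought.
basket_sets = pd.DataFrame(
    {"ALARM CLOCK": [1, 1, 0], "WHITE METAL LANTERN": [1, 0, 1], "TEACUP": [0, 0, 1]},
    index=pd.Index(["536365", "536366", "536367"], name="InvoiceNo"),
)

# Number of distinct items per transaction.
items_per_invoice = basket_sets.sum(axis=1)

# Keep only transactions with more than one item.
basket_filtered = basket_sets[items_per_invoice >= 2]

share = len(basket_filtered) / len(basket_sets)
print(f"{len(basket_filtered)} of {len(basket_sets)} transactions kept ({share:.2%})")
```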

7. Apply the Apriori Algorithm

After generating the dataset above, it is time to use the Apriori algorithm. The Apriori algorithm is used to find the frequently bought items in the dataset. In this article, I am not going to explain precisely how the Apriori algorithm works, but if you are curious about it, you can check it here.

To apply the Apriori algorithm, you first have to install a library called “mlxtend”. You can simply type “pip install mlxtend” and you’re ready to go! After installing the package, here’s how I applied the Apriori algorithm.

When applying the Apriori algorithm, we define what counts as frequent by giving a support value. In this case, I define frequently bought items as items that appear in at least 3% of all transactions, which means I set the support value to 0.03. After that, I added another column called length that contains the number of items in each itemset.

As you can see, there are 108 itemsets that are considered frequently bought. The picture shows that White hanging Heart T-Light Holder is the most frequently bought item, with a support value of 0.121358. It means the item appears in 1866 of the transactions.
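To make the role of the support threshold concrete, here is a small plain-Python sketch of a level-wise frequent-itemset search in the spirit of Apriori. It is not mlxtend's implementation, the function name and toy transactions are hypothetical, and it is only meant for tiny inputs.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        # Keep the candidates that appear in enough transactions.
        level = {}
        for cand in candidates:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Build (k+1)-item candidates from the frequent k-itemsets and keep
        # only those whose k-item subsets are all frequent (the Apriori idea).
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in unions
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result

transactions = [
    {"tea", "cup"},
    {"tea", "cup", "spoon"},
    {"tea"},
    {"cup", "spoon"},
]

found = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(found.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    # The last value corresponds to the "length" column described above.
    print(sorted(itemset), support, len(itemset))
```

Raising min_support shrinks the result quickly, which is why a fairly low threshold such as 0.03 is common for retail data with many different items.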
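A support value can be turned back into a transaction count by multiplying it by the number of transactions. The short check below assumes the support is measured against the 15376 transactions that remained after the filtering step.

```python
n_transactions = 15376   # transactions with more than one item (step 6)
support = 0.121358       # support of the most frequent item, as reported above

# support = count / n_transactions, so count = support * n_transactions
count = round(support * n_transactions)
print(count)
```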

8. Finding The Association Between Frequently Bought Items

After applying the Apriori algorithm and finding the frequently bought items, it is time to apply the association rules. From association rules, we can extract information and even discover knowledge about which items are more effective to sell together. That is the whole point of this project. Here’s how I did it.

From the association_rules results above, we can see that ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER have the strongest association with each other, since these two items have the highest “lift” value. The higher the lift value, the stronger the association between the items. If the lift value is more than 1, it is enough for us to say that the two items are associated with each other. In this case, the highest value is 17.717, which is very high. It means these 2 items are very good candidates to be sold together.

Besides that, we can see that the support value of ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER together is 0.0309, which means these 2 items were sold together in 3.09% of all transactions. In absolute numbers, that is 476 times.

From the confidence, we can extract even more information. Remember that the confidence value depends on which item is the antecedent and which is the consequent, so each pair of items produces two rules. If the antecedent value is higher than the consequent value, then the rule to apply is rule number 1 (not number 2), and vice versa. In this case, the antecedent value is higher than the consequent value, so we apply rule number 1, which is GREEN REGENCY TEACUP AND SAUCER → ROSES REGENCY TEACUP AND SAUCER. In more detail, it means that a customer tends to buy Roses Regency Teacup and Saucer after they have bought Green Regency Teacup and Saucer, and not the other way around. This is very valuable information, because we now know which products to put the discounts on: we could give a discount on Roses Regency Teacup and Saucer when a customer buys Green Regency Teacup and Saucer.
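As a reminder of what the lift column measures, lift(A, B) is the support of A and B together divided by the product of their individual supports, so a value of 1 means the two items co-occur about as often as chance would predict. The plain-Python sketch below uses made-up transactions and hypothetical helper names, not the article's data.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def lift(transactions, a, b):
    """Lift of the pair (a, b): joint support over the product of supports."""
    joint = support(transactions, {a, b})
    return joint / (support(transactions, {a}) * support(transactions, {b}))

# Toy data: 10 transactions, the two teacups appear together in 3 of them.
transactions = (
    [{"GREEN TEACUP", "ROSES TEACUP"}] * 3
    + [{"GREEN TEACUP"}]
    + [{"ROSES TEACUP"}]
    + [{"LANTERN"}] * 5
)

pair_lift = lift(transactions, "GREEN TEACUP", "ROSES TEACUP")
print(round(pair_lift, 3))
```

Lift is symmetric: swapping the two items gives the same value, which is why it cannot say anything about the direction of a rule.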
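Unlike lift, confidence depends on the direction of the rule: confidence(A → B) is the share of transactions containing A that also contain B. The sketch below computes both directions from made-up transactions (the helper name and the numbers are hypothetical and are not the article's results).

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions with `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

# Toy data: GREEN appears in 4 transactions, ROSES in 5, both together in 3.
transactions = (
    [{"GREEN TEACUP", "ROSES TEACUP"}] * 3
    + [{"GREEN TEACUP"}]
    + [{"ROSES TEACUP"}] * 2
    + [{"LANTERN"}] * 4
)

green_to_roses = confidence(transactions, "GREEN TEACUP", "ROSES TEACUP")
roses_to_green = confidence(transactions, "ROSES TEACUP", "GREEN TEACUP")
print("GREEN -> ROSES:", green_to_roses)
print("ROSES -> GREEN:", roses_to_green)
```

In this toy data the rule starting from the less common item has the higher confidence, because the same number of joint purchases is divided by a smaller antecedent count.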

Conclusions

In this article, we’ve done a Market Basket Analysis using actual online retail transaction data from the UK. The result of this market basket analysis can be used for data-driven marketing strategy and decision making. From this dataset, we can generate several business insights:

Item Placement. We could put ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER close to each other, for example on the same shelf.
Product Bundling. We could sell ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER as a single bundle at a lower price than the two individual prices combined. This could attract more sales and generate more income.
Customer Recommendation and Discounts. We could put Roses Regency Teacup and Saucer at the cashier, so that every time a customer buys Green Regency Teacup and Saucer, we can recommend Roses Regency Teacup and Saucer to them at a lower price.

If you would like to see the whole code, you can find it in my GitHub account right here.

Sources:

Halim, Octavia, and Alianto. 2019. Designing Facility Layout of an Amusement Arcade using Market Basket Analysis. Procedia Computer Science, Vol 161, Page 623–629. (https://www.sciencedirect.com/science/article/pii/S1877050919318769)
Maitra, Sarit. 2019. Association Rule Mining using Market Basket Analysis. (https://towardsdatascience.com/market-basket-analysis-knowledge-discovery-in-database-simplistic-approach-dc41659e1558)
Subramanian, Dhilip. 2019. Association Discovery — the Apriori Algorithm. (https://medium.com/towards-artificial-intelligence/association-discovery-the-apriori-algorithm-28c1e71e0f04)
Chauhan, Nagesh Singh. 2019. Market Basket Analysis. (https://towardsdatascience.com/market-basket-analysis-978ac064d8c6)
Li, Susan. 2017. A Gentle Introduction on Market Basket Analysis — Association Rules. (https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce)
UCI Machine Learning Repository. Online Retail II Data Set. (https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)