Identify customers' purchasing patterns and formulate marketing strategies based on the resulting association rules

8 min read · May 13, 2024

Assalamualaikum Warahmatullah Wabarakatuh,

Hello friends!!

Let me introduce myself: I am Faisal Hakim Akbar Alhaqq, NIM 22611134. I want to discuss Market Basket Analysis for transaction data in Python, so please follow along carefully!!

Case study

Transaction data is a collection of information that records details about purchases or transactions made by customers or other entities within a certain time period. It typically includes the product or service purchased, the quantity purchased, the price, the time and place of the transaction, and identifying information about the customer or other entity involved. Transaction data is often used to analyze purchasing patterns and consumer trends, and to carry out other analyses such as market basket analysis and customer segmentation.

#Import Data

import pandas as pd
dataMarket = pd.read_csv('/content/data3.csv', delimiter=',', encoding="ISO-8859-1")
dataMarket

This data consists of 1083818 rows and 8 columns, which record purchase transactions from various customers. Each row includes the customer ID, the transaction ID, the transaction time and country, and item details such as the item code, description, quantity, and cost per item. This data is useful for analyzing purchasing habits, patterns of product interest, and sales trends over time. It can also be used to identify new business opportunities and improve marketing strategies by understanding customer preferences.

#Data Processing

##Cleaning data

# Drop every row that contains at least one missing value (NaN)
dataMarket = dataMarket.dropna()
dataMarket.info()

The syntax above aims to clean data by deleting rows containing missing values (NaN) using the dropna() method of Pandas DataFrame. After cleaning the data, info() is called to display information about the cleaned DataFrame. This includes the number of non-null entries in each column and the data type of each column. Thus, the info() method provides a summary of the structure and cleanliness of the data after the cleaning process.
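To make the effect of dropna() concrete, here is a minimal sketch on a small made-up DataFrame. The column names mirror the real dataset, but the values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy data with hypothetical values; None marks a missing entry.
toy = pd.DataFrame({
    "TransactionId": [1, 1, 2, 3],
    "ItemDescription": ["PARTY BUNTING", None, "JUMBO BAG RED RETROSPOT", "PARTY BUNTING"],
    "NumberOfItemsPurchased": [6, 3, 12, 2],
})

# dropna() returns a new DataFrame without the rows that contain any missing value.
toy_clean = toy.dropna()

print(toy.isna().sum())                 # missing values per column before cleaning
print(len(toy), "->", len(toy_clean))   # row count before and after
print(toy_clean.index.tolist())         # original row labels are kept
```

Note that dropna() keeps the original row labels of the surviving rows, which is why the index of a cleaned DataFrame can still run up to the last label of the raw data even though there are fewer rows.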

<class 'pandas.core.frame.DataFrame'>
Index: 1080910 entries, 0 to 1083817
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   UserId                  1080910 non-null  int64  
 1   TransactionId           1080910 non-null  int64  
 2   TransactionTime         1080910 non-null  object 
 3   ItemCode                1080910 non-null  int64  
 4   ItemDescription         1080910 non-null  object 
 5   NumberOfItemsPurchased  1080910 non-null  int64  
 6   CostPerItem             1080910 non-null  float64
 7   Country                 1080910 non-null  object 
dtypes: float64(1), int64(4), object(3)
memory usage: 74.2+ MB

The output shows the results of the data cleaning operation using the dropna() method on the DataFrame. The cleaned data has 1080910 rows and 8 columns. Each column has the same number of non-null values as the total number of rows, indicating there are no missing values in the dataset after the cleaning process. Additional information such as the data type of each column is also presented, allowing further analysis of the processed data structure.

# Keep only rows with a non-negative purchase quantity
dataMarket_plus = dataMarket[dataMarket['NumberOfItemsPurchased'] >= 0]
dataMarket_plus.info()

The expression dataMarket[dataMarket['NumberOfItemsPurchased'] >= 0] creates a subset of the dataMarket DataFrame that contains only the rows where the 'NumberOfItemsPurchased' column is greater than or equal to 0. Then, info() displays information about the dataMarket_plus DataFrame, including the number of rows, the column data types, and the number of non-null values in each column. In short, this step keeps only the transactions with a non-negative number of items purchased and then summarizes that subset.
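The filtering step works through a boolean mask. Below is a small sketch on made-up rows showing the mask and the resulting subset. The values are hypothetical, and treating a negative quantity as something like a return or cancellation is only our assumption, since the dataset itself does not say.

```python
import pandas as pd

# Toy rows with hypothetical quantities; the negative value is assumed
# to stand for something like a return (not confirmed by the dataset).
toy = pd.DataFrame({
    "TransactionId": [1, 2, 3, 4],
    "NumberOfItemsPurchased": [6, -6, 0, 3],
})

# The comparison yields one True/False per row ...
mask = toy["NumberOfItemsPurchased"] >= 0

# ... and indexing with that mask keeps only the True rows.
toy_plus = toy[mask]

print(mask.tolist())
print(toy_plus)
```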

#list of items per transaction
# One row per transaction, one column per item, values = total quantity purchased
dataMarket_plus = (dataMarket_plus.groupby(['TransactionId', 'ItemDescription'])['NumberOfItemsPurchased']
                   .sum().unstack().reset_index().fillna(0).set_index('TransactionId'))
dataMarket_plus

The transaction table shows purchases of various goods from several countries, with the UK accounting for the largest share. The items purchased include home decorations, household equipment, stationery, and accessories. The transactions took place throughout 2018 and 2019.
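The reshaping into one row per transaction and one column per item can be hard to picture, so here is a minimal sketch of the same groupby/sum/unstack idea on toy data. The item names and quantities are made up for illustration.

```python
import pandas as pd

# Toy long-format data: one row per (transaction, item) line, hypothetical values.
toy = pd.DataFrame({
    "TransactionId": [1, 1, 1, 2, 2],
    "ItemDescription": ["TEACUP", "BUNTING", "TEACUP", "BUNTING", "LUNCH BAG"],
    "NumberOfItemsPurchased": [2, 1, 3, 4, 1],
})

basket = (
    toy.groupby(["TransactionId", "ItemDescription"])["NumberOfItemsPurchased"]
    .sum()        # total quantity per (transaction, item) pair
    .unstack()    # items become columns, transactions stay as rows
    .fillna(0)    # an item that is absent from a transaction gets quantity 0
)

print(basket)
```

In this sketch, transaction 1 ends up with a TEACUP quantity of 5 (2 + 3), and every item a transaction did not contain shows up as 0.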

#Mining Association Rules

Association rule mining is a data mining technique used to find patterns of relationships between items or variables in large data sets. Its main goal is to find associations or correlations between items in transactional data or in data stored as itemsets. In the context of a transaction, an itemset is a collection of items that appear together in a single transaction. This technique is often used in market basket analysis, where the goal is to find relationships between items that customers frequently purchase together. Example applications include product recommendations, identifying consumer purchasing patterns, and analyzing customer behavior and market preferences. The algorithms commonly used for mining association rules are Apriori and FP-Growth.
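To illustrate what "items that appear together in a single transaction" means, here is a small pure-Python sketch that counts how often each pair of items co-occurs. The transactions are made up, and this simple counting is only an illustration of the idea, not the Apriori or FP-Growth algorithm itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, each represented as a set of item names.
transactions = [
    {"TEACUP GREEN", "TEACUP ROSES", "BUNTING"},
    {"TEACUP GREEN", "TEACUP ROSES"},
    {"BUNTING", "LUNCH BAG"},
    {"TEACUP GREEN", "LUNCH BAG"},
]

# Count every unordered pair of items that shares a transaction.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```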

def encode_units(x):
    # A quantity of at least 1 counts as purchased (1); anything else as not purchased (0)
    return 1 if x >= 1 else 0

data_encode = dataMarket_plus.applymap(encode_units)
data_encode

From these results, there are 20136 rows and 4077 columns. In market basket analysis, we are interested in whether each item was purchased or not, and in the relationship between purchases of different items. To reveal this information, we need to convert the data into binary format, where 0 indicates that the item was not purchased and 1 indicates that it was. Thus, quantities of zero or less are changed to 0, while quantities of one or more are changed to 1.

In market basket analysis, our focus is on identifying relationships between purchases of two or more items based on historical transaction data. Transactions that contain only one item carry no such relationship, so we filter the data to keep only transactions with at least two different items.
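As a side note, the same 0/1 encoding can also be sketched with a vectorized comparison instead of a cell-by-cell function. This is an alternative we are suggesting for illustration, not the method used above, and it gives the same result for whole-number quantities. The toy values are hypothetical.

```python
import pandas as pd

# Toy basket table: rows are transactions, columns are items, values are quantities.
basket = pd.DataFrame(
    {"BUNTING": [1.0, 4.0, 0.0], "TEACUP": [5.0, 0.0, 0.0]},
    index=pd.Index([1, 2, 3], name="TransactionId"),
)

# True where at least one unit was bought, then cast True/False to 1/0.
encoded = (basket >= 1).astype(int)

print(encoded)
```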

data_filter = data_encode[(data_encode>0).sum(axis=1)>=2]
data_filter

Based on this output, it can be seen that there were 18338 transactions that purchased more than 1 item.
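The row filter counts the number of distinct items in each transaction and keeps the rows where that count is at least two. Here is a minimal sketch on a toy encoded table with hypothetical values.

```python
import pandas as pd

# Toy 0/1 table: 1 means the item appears in the transaction.
encoded = pd.DataFrame(
    {"BUNTING": [1, 1, 0], "LUNCH BAG": [0, 1, 0], "TEACUP": [1, 0, 1]},
    index=pd.Index([1, 2, 3], name="TransactionId"),
)

# Number of distinct items per transaction (sum across the columns of each row).
items_per_transaction = (encoded > 0).sum(axis=1)

# Keep only transactions containing at least two different items.
filtered = encoded[items_per_transaction >= 2]

print(items_per_transaction.tolist())
print(filtered.index.tolist())
```

In this sketch, transaction 3 contains a single item, so it is dropped.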

#apriori
from mlxtend.frequent_patterns import apriori

The syntax above imports the apriori function from the mlxtend.frequent_patterns module. This function applies the Apriori algorithm, a popular method in market basket analysis for finding items that are frequently purchased together in a set of transactions. The itemsets it identifies can provide valuable insights for marketing strategy and decision making.

from mlxtend.frequent_patterns import apriori, association_rules

frequent_itemsets_plus = apriori(data_filter, min_support=0.03, use_colnames=True)
frequent_itemsets_plus['length'] = frequent_itemsets_plus['itemsets'].apply(lambda x: len(x))

# Sort by support and reset the index
frequent_itemsets_plus = frequent_itemsets_plus.sort_values(by='support', ascending=False).reset_index(drop=True)

print(frequent_itemsets_plus)

/usr/local/lib/python3.10/dist-packages/mlxtend/frequent_patterns/fpcommon.py:110: DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance and their support might be discontinued in the future.Please use a DataFrame with bool type
  warnings.warn(
      support                                           itemsets  length
0    0.122205               (WHITE HANGING HEART T-LIGHT HOLDER)       1
1    0.113317                          (JUMBO BAG RED RETROSPOT)       1
2    0.107264                         (REGENCY CAKESTAND 3 TIER)       1
3    0.091122                                    (PARTY BUNTING)       1
4    0.085233                          (LUNCH BAG RED RETROSPOT)       1
..        ...                                                ...     ...
160  0.030320  (LUNCH BAG RED RETROSPOT, LUNCH BAG SUKI DESIGN )       2
161  0.030210                          (HAND WARMER BIRD DESIGN)       1
162  0.030210     (LUNCH BAG CARS BLUE, LUNCH BAG RED RETROSPOT)       2
163  0.030101                  (CHILDRENS APRON SPACEBOY DESIGN)       1
164  0.030101  (PAPER CHAIN KIT 50'S CHRISTMAS , PAPER CHAIN ...       2

[165 rows x 3 columns]

The syntax above uses the apriori function from the mlxtend.frequent_patterns module to produce the itemsets that frequently appear together in the transaction data, with a minimum support of 0.03 (the itemset must appear in at least 3% of the transactions). Then, a 'length' column is added to the resulting dataframe to indicate the number of items in each itemset. The dataframe is then sorted by support in descending order and the index is reset.

The output is a dataframe showing frequently occurring itemsets along with their support and itemset length. For example, the first itemset (‘WHITE HANGING HEART T-LIGHT HOLDER’) has a support of 0.122205 and its itemset length is 1. There are a total of 165 itemsets displayed in the output.
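To give a feeling for what happens behind the scenes, here is a simplified pure-Python sketch of the level-wise Apriori idea: count the support of candidate itemsets, keep those that meet the minimum support, and build larger candidates only from the survivors. It is a teaching sketch on made-up transactions and is not how mlxtend implements it; the function name frequent_itemsets is our own.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:
        # Support = share of transactions that contain the whole candidate.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Join frequent k-itemsets into candidates of size k + 1 ...
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune any candidate that has an infrequent (k - 1)-subset.
        candidates = {
            cand for cand in joined
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1))
        }
    return result

# Hypothetical transactions, each a set of item names.
transactions = [
    {"TEACUP GREEN", "TEACUP ROSES", "BUNTING"},
    {"TEACUP GREEN", "TEACUP ROSES"},
    {"BUNTING", "LUNCH BAG"},
    {"TEACUP GREEN", "LUNCH BAG"},
]

result = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

With a minimum support of 0.5, all four single items qualify in this toy example, but only one pair (the two teacups) appears in at least half of the transactions.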

#Analysis and Interpretation of Rules

from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets_plus, metric='lift',
                  min_threshold=1).sort_values('lift', ascending=False).reset_index(drop=True)

The syntax above uses the association_rules function from the mlxtend.frequent_patterns library to generate association rules from itemsets that frequently appear together in transaction data. The frequent_itemsets_plus parameter is a dataframe containing frequently occurring itemsets along with their supports. The metric parameter specifies the metric used to evaluate the rule, in this case, ‘lift’ is used. The min_threshold parameter is the lower threshold value for the selected metric, in this case, 1. Then, the results are sorted based on the ‘lift’ value in descending order and the index is reset.

The output is a dataframe containing association rules between itemsets and the related metric values, such as 'antecedents' (the "if" itemset), 'consequents' (the "then" itemset), 'support', 'confidence', and 'lift'. These rules describe the relationship between itemsets that often appear together in transactions. For example, a rule might state that if someone buys 'WHITE HANGING HEART T-LIGHT HOLDER', they are also likely to buy 'JUMBO BAG RED RETROSPOT', with the 'lift' value indicating how much more often the two are bought together than would be expected if they were purchased independently.

Association rules are generated from the frequent itemsets using the lift metric, with a minimum lift threshold of 1. The results are then sorted by lift. From these results, it can be seen that ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER have the strongest association, because the rules linking them have the highest lift. This shows that the two items have a strong relationship: if one is purchased, the other is likely to be purchased as well. A lift greater than 1 means the two items are bought together more often than would be expected if they were purchased independently, so they are good candidates for being sold together.
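The metrics above can be computed by hand, which helps when interpreting them. Here is a minimal sketch using the standard definitions: support is the share of transactions containing both sides of the rule, confidence is that support divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent. The transactions and the helper name rule_metrics are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes both the antecedent and the consequent occur in at least one
    transaction (otherwise a division by zero would occur).
    """
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    support_a = sum(1 for t in transactions if a <= t) / n
    support_c = sum(1 for t in transactions if c <= t) / n
    support_ac = sum(1 for t in transactions if (a | c) <= t) / n
    confidence = support_ac / support_a   # P(consequent | antecedent)
    lift = confidence / support_c         # > 1: bought together more than by chance
    return {"support": support_ac, "confidence": confidence, "lift": lift}

# Hypothetical transactions, each a set of item names.
transactions = [
    {"TEACUP GREEN", "TEACUP ROSES", "BUNTING"},
    {"TEACUP GREEN", "TEACUP ROSES"},
    {"BUNTING", "LUNCH BAG"},
    {"TEACUP GREEN", "LUNCH BAG"},
]

metrics = rule_metrics(transactions, {"TEACUP GREEN"}, {"TEACUP ROSES"})
print(metrics)
```

In this toy example the rule has a support of 0.5, a confidence of about 0.67, and a lift of about 1.33, so the two teacups appear together more often than independence would predict.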

#Marketing strategy

In this study, we carried out a Market Basket Analysis on transaction data, most of which comes from the UK. The findings can be used in marketing strategies and data-driven decision making. From this dataset, we can draw some valuable business insights, such as the following:

Product bundles. To increase sales, we can sell GREEN REGENCY TEACUP AND SAUCER and PINK REGENCY TEACUP AND SAUCER as a single bundle at a lower price than buying them separately. Bundling can increase the chance that both items are sold.
Placement of goods. We can place ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER side by side because they are frequently bought together, and place WHITE HANGING HEART T-LIGHT HOLDER near the cashier because it is the most frequently purchased item.
Customer promotions and discounts. GREEN REGENCY TEACUP AND SAUCER can be placed near the cashier, so that every time a customer buys it, we can offer a discount on, or a recommendation for, an associated item such as ROSES REGENCY TEACUP AND SAUCER. This can encourage customers to make additional purchases and increase transaction value.