Understanding Association Mining and Market Basket Analysis with Apriori Algorithm using Python

HARDI RATHOD
Published in Analytics Vidhya
9 min read · Apr 4, 2020

Association Mining

Simply put, association mining is finding relations between objects. For example, most people who buy butter are also likely to buy bread. So some supermarkets keep objects with a high probability of being purchased together in the same aisle, while others place them in two different corners so that a customer, while walking to the second necessity, browses and purchases other products along the way. Hence, association mining can be fundamental in establishing and improving business decisions and rules for catalog design, sales, marketing, etc.

Frequent pattern mining is the discovery of patterns that appear frequently in a data set, i.e. the discovery of associations and correlations. A pattern can be a set of items, such as milk and cookies, or a sequence, e.g. buying a mobile phone, then a memory card, then headphones. Market Basket Analysis is a classic example, where customers' buying habits are analysed and rules are established from them. Patterns are represented in the form of rules.

Measures of interestingness for patterns

These measures reflect the usefulness and certainty of the established rules.

  1. Support: The ratio of transactions involving a particular itemset to the total number of transactions. It measures the popularity of an itemset and ranges between 0 and 1.
  2. Confidence: The ratio of the number of transactions involving both items X and Y to the number of transactions involving X. It tells how often Y is purchased given that X has been purchased, and it also ranges between 0 and 1.
  3. Lift: Lift indicates the certainty of a rule: how much does the sale of Y increase when X is sold? A lift greater than 1 indicates a positive association.

Lift(X => Y) = Confidence(X => Y)/Support(Y)

Example: A => B [support = 5%, confidence = 80%]

5% of all the transactions show that A and B have been purchased together. 80% of the customers that bought A also bought B.
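These three measures can be computed by hand on a toy basket list; the ten transactions below are made up for illustration and are not from the grocery data set used later in the article:

```python
# Ten made-up transactions, used to compute support, confidence
# and lift for the rule A => B.
transactions = [
    {"A", "B"}, {"A", "B"}, {"A"}, {"B", "C"}, {"A", "B", "C"},
    {"C"}, {"A", "B"}, {"B"}, {"A", "B"}, {"C"},
]
n = len(transactions)

support_A  = sum("A" in t for t in transactions) / n          # 6/10
support_B  = sum("B" in t for t in transactions) / n          # 7/10
support_AB = sum({"A", "B"} <= t for t in transactions) / n   # 5/10

confidence = support_AB / support_A   # P(B | A) = 5/6
lift       = confidence / support_B   # > 1 means positive association

print(support_AB, confidence, lift)
```

Here lift is about 1.19, slightly above 1, so A and B appear together a little more often than independence would predict.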

Common algorithms are Apriori, FP-growth and more.

Source: Data Mining Concepts and Techniques, Third Edition.

Giants like Amazon, Flipkart, Capital One and Walmart use this kind of analysis on their big data. Example: "Frequently Bought Together" items.

Note: Association and Recommendation are different because association is not about a particular individual’s preference but about relations between sets of items.

About Data

Association Mining can be used in problems where you need to make better decisions based on habits of your customers.

E.g. Grocery and Essential Stores, Online Markets, Music and Movie Genres, Software Purchases, etc.

The data will generally be big and unstructured; that is, it will not be in a strict tabular format. One row could contain any number of items (columns), so we need to handle rows with a varying number of columns.
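A minimal sketch of handling such ragged rows with only the standard library; the three inline transactions are hypothetical stand-ins for a file like groceries.csv:

```python
import csv
import io

# Hypothetical ragged data: each row is one transaction with a
# different number of items, so there is no fixed schema.
raw = "milk,bread\nbread\nmilk,butter,bread\n"

rows = list(csv.reader(io.StringIO(raw)))
all_items = sorted({item for row in rows for item in row})

# One-hot encode: one dict per transaction, 1 if the item is present.
onehot = [{item: int(item in row) for item in all_items} for row in rows]
print(onehot[0])
```

This is the same idea the article applies to groceries.csv in step 3, where the list of dicts is then turned into a pandas DataFrame.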

Apriori Algorithm

1. Find all frequent itemsets: the itemsets that occur at least as frequently as a minimum support count.

2. Generate strong association rules from the frequent itemsets: the rules that satisfy minimum support and minimum confidence.
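The two steps can be sketched on toy data. The version below is a simplified brute-force sketch that enumerates every candidate itemset with itertools, rather than growing candidates level by level as the implementation later in the article does, but the support and confidence thresholds work the same way:

```python
from itertools import combinations

# Four made-up transactions for illustration.
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread"}, {"milk", "bread"}]
min_support, min_confidence = 0.5, 0.7
n = len(transactions)
items = sorted(set().union(*transactions))

# Step 1: find all frequent itemsets (brute force over every size).
support = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = sum(set(cand) <= t for t in transactions) / n
        if s >= min_support:
            support[cand] = s

# Step 2: generate strong rules from each frequent itemset.
rules = []
for itemset, s in support.items():
    for i in range(1, len(itemset)):
        for left in combinations(itemset, i):
            # every subset of a frequent itemset is itself frequent,
            # so the lookup below cannot fail
            conf = s / support[left]
            if conf >= min_confidence:
                right = tuple(x for x in itemset if x not in left)
                rules.append((left, right, round(conf, 2)))
                print(left, "->", right, round(conf, 2))
```

On this toy data the sketch finds, for example, that every transaction containing milk also contains bread (confidence 1.0).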

Standard data set — groceries.csv

Dimensions: 9835 x 41

Implementation

A. Exploratory Data Analysis over the Data.

  • Get the shape
  • Find the top 20 items that occur in the data set
  • How much do these 20 items account for? (Item Percentage)

B. Prune the data set based on number of items in a transaction and total sales percentage.

C. Apply Apriori Algorithm and obtain Rules.

Let’s start and get the answers to the above questions!

  1. Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
from apyori import apriori
import itertools

2. Get the shape

data = pd.read_csv('groceries.csv')
data.shape

Output:

(9835, 41)

3. Next, find the top 20 items and how much they contribute to total sales.

# Finding all items present in our data - groceries.csv
# We use the csv package so that we can read each line one by one
# and collect any new grocery item.
all_items = set()  # set of all items
with open("groceries.csv") as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader:
        all_items.update(line)

# Now, record whether a particular item appears in a particular row,
# one dict per transaction.
counting = list()
with open("groceries.csv") as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader:
        row = {item: 0 for item in all_items}
        row.update({item: 1 for item in line})
        counting.append(row)

# Convert the list into a pandas DataFrame so we can use pandas operations.
# 0 means the item is not present in that row/order list.
groceries = pd.DataFrame(counting)
groceries.head()

# Finding item counts is now easy - we just sum up.
# 1. Total number of items = sum of all the column sums
tot_item_count = sum(groceries.sum())  # 43368

# 2. Sum the columns and sort in descending order to get the top 20 items
item_sum = groceries.sum().sort_values(ascending=False).reset_index().head(n=20)
item_sum.rename(columns={item_sum.columns[0]: 'Item_name',
                         item_sum.columns[1]: 'Item_count'}, inplace=True)

# 3. Add percentages so we know how much each item contributes.
# Tot_percent of x is the cumulative percentage of x and the items above it.
item_sum['Item_percent'] = item_sum['Item_count'] / tot_item_count
item_sum['Tot_percent'] = item_sum.Item_percent.cumsum()
item_sum.head(20)  # list of top 20 items with percentages

# Plotting the item sales distribution
obj = list(item_sum['Item_name'].head(n=20))
y_pos = np.arange(len(obj))
performance = list(item_sum['Item_count'].head(n=20))

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, obj, rotation='vertical')
plt.ylabel('Item count')
plt.title('Item sales distribution')

Top 5 Items = 21.74% and Top 20 Items = 50.37%

Therefore, we need to prune the data set, because most of the items contribute very little to total sales.

4. We have completed part A of our problem statement. Now we will look into part B.

For this, we will define a function prune_dataset that takes the following parameters from the user/analyst:

  • Input Data Frame
  • Minimum length of transactions (i.e. minimum number of items in a row) to be considered.
  • Minimum Total Sales Percent for the item to be considered.
def prune_dataset(olddf, len_transaction, tot_sales_percent):
    # Delete the helper column tot_items if present
    if 'tot_items' in olddf.columns:
        del olddf['tot_items']

    # Find the item count for each item and the total number of items.
    # This is the same code as in step 3.
    Item_count = olddf.sum().sort_values(ascending=False).reset_index()
    tot_items = sum(olddf.sum().sort_values(ascending=False))
    Item_count.rename(columns={Item_count.columns[0]: 'Item_name',
                               Item_count.columns[1]: 'Item_count'}, inplace=True)

    # Code from step 3 to find item percentage and cumulative percentage.
    Item_count['Item_percent'] = Item_count['Item_count'] / tot_items
    Item_count['Tot_percent'] = Item_count.Item_percent.cumsum()

    # Keep items that meet the minimum threshold for total sales percentage.
    selected_items = list(Item_count[Item_count.Tot_percent < tot_sales_percent].Item_name)
    olddf['tot_items'] = olddf[selected_items].sum(axis=1)

    # Keep rows that meet the minimum number of items per transaction.
    olddf = olddf[olddf.tot_items >= len_transaction]
    del olddf['tot_items']

    # Return the pruned dataframe and the selected item stats.
    return olddf[selected_items], Item_count[Item_count.Tot_percent < tot_sales_percent]

We will now input different values for len_transaction and tot_sales_percent to obtain an appropriate data set for apriori.

Pruned Data Frame #1

pruneddf, Item_count = prune_dataset(groceries,4,0.4)
print(pruneddf.shape)
print(list(pruneddf.columns))

Output (the list of columns comprises the items we are considering for apriori):

(1267, 13)
['whole milk', 'other vegetables', 'rolls/buns', 'soda', 'yogurt', 'bottled water', 'root vegetables', 'tropical fruit', 'shopping bags', 'sausage', 'pastry', 'citrus fruit', 'bottled beer']

It has a decent number of rows and the top 13 items (columns).

Pruned Data Frame #2

pruneddf, Item_count = prune_dataset(groceries,4,0.5)
print(pruneddf.shape)
print(list(pruneddf.columns))

Output:

(1998, 19)
['whole milk', 'other vegetables', 'rolls/buns', 'soda', 'yogurt', 'bottled water', 'root vegetables', 'tropical fruit', 'shopping bags', 'sausage', 'pastry', 'citrus fruit', 'bottled beer', 'newspapers', 'canned beer', 'pip fruit', 'fruit/vegetable juice', 'whipped/sour cream', 'brown bread']

Now the number of rows is good, and we also have the top 19 items in our condensed data set.

Pruned Data Frame #3

pruneddf, Item_count = prune_dataset(groceries,2,0.5)
print(pruneddf.shape)
print(list(pruneddf.columns))

Output:

(5391, 19)
['whole milk', 'other vegetables', 'rolls/buns', 'soda', 'yogurt', 'bottled water', 'root vegetables', 'tropical fruit', 'shopping bags', 'sausage', 'pastry', 'citrus fruit', 'bottled beer', 'newspapers', 'canned beer', 'pip fruit', 'fruit/vegetable juice', 'whipped/sour cream', 'brown bread']

This is a very good data set: it contains the top 19 items and a good number (effectively half) of the rows/transactions from the original data set. So we will proceed with this.

5. Part B is complete. Now we will apply Apriori and generate rules and relations on the reduced data set, which contains only the most relevant items.

First, we need to convert our data frame back into a csv file so that it looks like our original data set, but with reduced dimensions.

# Converting 1's to the appropriate item name (column name)
y = list(pruneddf.columns)
for s in y:
    pruneddf.loc[(pruneddf[s] == 1), s] = s

# Removing zeros
lol = pruneddf.values.tolist()
for a in lol:
    while 0 in a:
        a.remove(0)

# Writing the new pruned data set to a csv file
with open("pruned.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(lol)

We now have a clean and organized csv file on which we can apply the Apriori code.

import csv
import itertools

# Delete the files prunedRules.txt and prunedFItems.txt before running,
# otherwise new data will be appended to them.
Groceries = open('pruned.csv', 'r')

# Minimum support (as a fraction; the threshold below still uses the
# original 9835 transactions as the base)
min_support = 0.04
Rules = "prunedRules.txt"
freqItemsets = "prunedFItems.txt"
# Minimum confidence
min_confidence = 0.30

# Finding all frequent 1-itemsets
def OneItemSets():
    # Get all transactions in data and item counts in the dictionary count
    DataCaptured = csv.reader(Groceries, delimiter=',')
    data = list(DataCaptured)
    data = [sorted(e) for e in data]  # sort items within each transaction
    count = {}
    for items in data:
        for item in items:
            if item not in count:
                count[item] = 1
            else:
                count[item] += 1
    # Keep only items meeting the minimum support count
    count2 = {k: v for k, v in count.items() if v >= min_support * 9835}
    return count2, data

# Ck is a superset of Lk (part of the prune step): its members may or may
# not be frequent, but all frequent k-itemsets are included in Ck.
# It is generated by joining two Lk-1 itemsets.
def generateCk(Lk_1, flag, data):
    Ck = []
    if flag == 1:
        # First pass: join frequent 1-itemsets (strings) into pairs
        flag = 0
        for item1 in Lk_1:
            for item2 in Lk_1:
                if item2 > item1:
                    Ck.append((item1, item2))
        print("C2: ", Ck[1:3])
        print("Length : ", len(Ck))
        print()
    else:
        # All keys in Lk-1 have the same length k
        k = len(next(iter(Lk_1)))
        for item1 in Lk_1:
            for item2 in Lk_1:
                if (item1[:-1] == item2[:-1]) and (item1[-1] != item2[-1]):
                    if item1[-1] > item2[-1]:
                        Ck.append(item2 + (item1[-1],))
                    else:
                        Ck.append(item1 + (item2[-1],))
        print("C" + str(k + 1) + ": ", Ck[1:3])
        print("Length : ", len(Ck))
        print()
    L = generateLk(set(Ck), data)
    return L, flag

# If an itemset in Ck belongs to a transaction, it is counted into Ct;
# Ct is then thresholded by minimum support to form Lk.
def generateLk(Ck, data):
    count = {}
    for itemset in Ck:
        for transaction in data:
            if all(e in transaction for e in itemset):
                if itemset not in count:
                    count[itemset] = 1
                else:
                    count[itemset] += 1
    print("Ct Length : ", len(count))
    print()
    count2 = {k: v for k, v in count.items() if v >= min_support * 9835}
    print("L Length : ", len(count2))
    print()
    return count2

# Generates association rules from the frequent itemsets
def rulegenerator(fitems):
    counter = 0
    for itemset in fitems.keys():
        if isinstance(itemset, str):
            continue  # 1-itemsets cannot form a rule
        length = len(itemset)
        union_support = fitems[tuple(itemset)]
        for i in range(1, length):
            lefts = map(list, itertools.combinations(itemset, i))
            for left in lefts:
                conf = 0
                # 1-itemsets are stored under string keys, larger ones
                # under tuple keys
                if len(left) == 1:
                    if ''.join(left) in fitems:
                        leftcount = fitems[''.join(left)]
                        conf = union_support / leftcount
                else:
                    if tuple(left) in fitems:
                        leftcount = fitems[tuple(left)]
                        conf = union_support / leftcount
                if conf >= min_confidence:
                    right = list(itemset[:])
                    for e in left:
                        right.remove(e)
                    if len(right) == 1:
                        rightcount = fitems[''.join(right)]
                    else:
                        rightcount = fitems[tuple(right)]
                    fo = open(Rules, "a+")
                    fo.write(str(left) + ' (' + str(leftcount) + ')' + ' -> '
                             + str(right) + ' (' + str(rightcount) + ')'
                             + ' [' + str(conf) + ']' + '\n')
                    fo.close()
                    print(str(left) + ' -> ' + str(right) + ' (' + str(conf) + ')')
                    counter += 1
    print(counter, "rules generated")

# Plot the 20 most frequent itemsets of the current level
def plotitemfreq(L):
    aux = [(L[key], key) for key in L]
    aux.sort(reverse=True)
    z = aux[0:20]
    print(z)
    df = pd.DataFrame(z, columns=['Count', 'Word'])
    df['Count'] = pd.to_numeric(df['Count'])
    print(df.info())
    df.plot.bar(x='Word', y='Count')

def apriori():
    L, data = OneItemSets()
    flag = 1
    FreqItems = dict(L)
    while len(L) != 0:
        fo = open(freqItemsets, "a+")
        for k, v in L.items():
            fo.write(str(k) + ' >>> ' + str(v) + '\n\n')
        fo.close()
        plotitemfreq(L)
        L, flag = generateCk(L, flag, data)
        FreqItems.update(L)
    rulegenerator(FreqItems)

apriori()

All Outputs:

  1. Frequent Items and Itemsets: frequency plots of the items and itemsets, plus the contents of prunedFItems.txt.

  2. Generated Rules: the rules printed to the console and written to prunedRules.txt.

With this, we have completed our market basket analysis.

Full Code: GitHub

For any queries or issues, please comment or mail.

Be sure to check out my other works.
