What is Association Rule Learning? An Applied Example in Python: Basket Analysis and Product Offering

Baysan
Published in CodeX
7 min read · Sep 21, 2021

In this story, I will cover what Association Rule Learning is and demonstrate an applied example in Python. I will also share the code on Kaggle. You can access the Kaggle notebook at the following link:

What is Association Rule Learning?

Association Rule Learning is a rule-based machine learning technique used for finding patterns (relations, structures, etc.) in datasets. By learning these patterns, we can recommend items to our customers. To produce these recommendations, we will use the Apriori Algorithm.

Photo by CardMapr on Unsplash

What is the Apriori Algorithm?

In this learning technique, we use the Apriori Algorithm to extract associations involving the targeted items. This technique is also known as Shopping Cart Analysis.

It relies on the following 3 metrics:

  • Support(X,Y) = Freq(X,Y) / N
  • Confidence(X,Y) = Freq(X,Y) / Freq(X)
  • Lift = Support(X,Y) / (Support(X) * Support(Y))

Support is the probability of observing X and Y together.

Confidence is the probability of observing Y when X is sold.

Lift tells us that when X is bought, the probability of buying Y changes by a factor of Lift. A value above 1 means Y becomes more likely, and a value of 1 means buying X has no effect on Y.

In large-scale projects, even defining what counts as a “shopping cart” can be a challenge.

An Example of Apriori Algorithm

Image by Author

We can see a simple example of the Apriori Algorithm above.

In this example, we will assume that Bread is X and Milk is Y.

  • Support(Bread) will give the ratio of the count of transactions that contain Bread to the count of the total transactions.
  • Support(Milk) will give the ratio of the count of transactions that contain Milk to the count of the total transactions.
  • Support(Bread,Milk) will give the ratio of the count of transactions that contain Bread and Milk to the count of the total transactions.
  • Confidence(Bread,Milk) will give the ratio of the count of transactions that contain Bread and Milk to the count of the transactions that contain Bread.
  • Lift(Bread,Milk) will give the ratio of Support(Bread,Milk) to the product of Support(Bread) and Support(Milk).
  • In light of the calculation above, we can say that if Bread is bought, the probability of selling Milk is multiplied by the lift value, which is 1 here.
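The definitions above can be sketched in a few lines of plain Python. The transactions below are made up for illustration and are not the ones shown in the image, and the helper names support, confidence and lift are my own choices rather than a library API.

```python
# Hypothetical transactions, made up for illustration (not the data in the image)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
    {"Bread", "Milk", "Eggs"},
    {"Butter"},
]


def support(items, transactions):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Estimated probability of seeing y in a transaction that contains x."""
    return support({x, y}, transactions) / support({x}, transactions)


def lift(x, y, transactions):
    """Joint support divided by what independence of x and y would predict."""
    return support({x, y}, transactions) / (
        support({x}, transactions) * support({y}, transactions)
    )


print("Support(Bread):", support({"Bread"}, transactions))
print("Support(Milk):", support({"Milk"}, transactions))
print("Support(Bread,Milk):", support({"Bread", "Milk"}, transactions))
print("Confidence(Bread,Milk):", round(confidence("Bread", "Milk", transactions), 3))
print("Lift(Bread,Milk):", round(lift("Bread", "Milk", transactions), 3))
```

With these made-up baskets, Bread and Milk each appear in 3 of 5 transactions and together in 2 of 5, so the lift is 0.4 / 0.36, which is roughly 1.11.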

An Applied Example in Python

We will use this dataset for our example. Also, you can get it from here. As mentioned, I’ll explain what the code does in this story. You can find just the code, without explanation, in the Kaggle notebook.

The meanings of the dataset’s columns are as follows:

  • Invoice: Invoice number. If this number starts with ‘C’, it means this transaction is cancelled.
  • StockCode: Product code
  • Description: Product Name
  • Quantity: Product counts
  • InvoiceDate: Transaction date
  • Price: A single product price
  • CustomerID: Unique customer number
  • Country: Customer’s country name

Installing Libraries

I’ll use openpyxl for reading Excel files and mlxtend for running the Apriori algorithm and extracting rules.

!pip install openpyxl
!pip install mlxtend

Getting Dataset

I’m going to import the libraries I need and read the data from the Excel file.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
raw_data = pd.read_excel('../online_retail_II.xlsx',sheet_name='Year 2010-2011')

Preparing Dataset

I’ll prepare the dataset by dropping rows with missing values, filtering out cancelled entries (those that contain ‘C’ in the Invoice column), and keeping only rows with a positive quantity and price. I have done this many times before, and you can find the reasoning in my previous stories. I wrote the function below for this process, then prepared the dataframe and assigned it to a new variable, df.

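def prepare_retail(dataframe):
    # drop rows that have missing values (returns a copy, the input is left untouched)
    dataframe = dataframe.dropna()
    # remove cancelled transactions (invoice numbers that contain "C")
    dataframe = dataframe[~dataframe["Invoice"].astype(str).str.contains("C")]
    # keep only rows with a positive quantity and a positive price
    dataframe = dataframe[dataframe["Quantity"] > 0]
    dataframe = dataframe[dataframe["Price"] > 0]
    return dataframe

df = prepare_retail(raw_data)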

Creating Apriori Data Structure

I need the structure below (a pivot table) to use the Apriori algorithm.

Rows represent transactions (invoice, shopping cart, etc.) and columns represent products. Each cell records, as a binary value, whether the transaction contains the product. If the product is in the invoice, the intersection cell will be “1”. If it is not, it will be “0”.

To do that, I’ll create a function that has an id argument. If the id argument is True, it will create the pivot table by using the StockCode column. If it is False, it will create the pivot table by using product names.

def create_apriori_datastructure(dataframe, id=False):
    # use product codes as columns when id is True, product names otherwise
    column = 'StockCode' if id else 'Description'
    # total quantity of each product per invoice (uses the dataframe argument, not a global)
    grouped = dataframe.groupby(
        ['Invoice', column], as_index=False).agg({'Quantity': 'sum'})
    pivoted = pd.pivot(data=grouped, index='Invoice', columns=column, values='Quantity')
    # 1 if the product is in the invoice, 0 otherwise
    apriori_datastructure = (pivoted.fillna(0) > 0).astype(int)
    return apriori_datastructure

After using this function, I will get a result dataframe like below.
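To illustrate the shape of that structure, here is a small sketch on made-up invoice lines. The invoice numbers and product names are hypothetical and do not come from the dataset. It uses pd.crosstab as an alternative way to build the same kind of 0/1 matrix, so it is not the function above, only a compact stand-in for it.

```python
import pandas as pd

# Hypothetical invoice lines; the values are made up for illustration
toy = pd.DataFrame({
    "Invoice": [1001, 1001, 1002, 1003, 1003, 1003],
    "Description": ["BREAD", "MILK", "BREAD", "MILK", "EGGS", "MILK"],
})

# Count how often each product appears on each invoice ...
counts = pd.crosstab(toy["Invoice"], toy["Description"])
# ... then reduce the counts to 1 (present) or 0 (absent)
basket = (counts > 0).astype(int)

print(basket)
```

Each row is one invoice and each column is one product. Invoice 1003 lists MILK twice, but the cell still holds 1 because only presence matters.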

Selecting Germany Based Invoices

In this example, I want to work with German data only. Working on this smaller subset saves time and runs faster.

germany_df = df[df['Country'] == 'Germany'] 

germany_df.head()

Getting Apriori Data Structure

I’ll use the function that I created above for getting an apriori algorithm data structure.

germany_apriori_df = create_apriori_datastructure(germany_df, True)
germany_apriori_df.head()  # Invoice-Product matrix (apriori data structure)

Learning Rules (Association Rule Learning)

I am going to create a function for extracting association rules. We don’t strictly need a function for each step, but I think separate functions make the process clearer.

This function takes 2 arguments. apriori_df is the dataframe in the apriori data structure. With min_support, we say that only product combinations sold together with a probability of at least 0.01 should come up. We will apply the Apriori algorithm by using the apriori function. Then we will extract the association rules by using the association_rules function.

def get_rules(apriori_df, min_support=0.01):
    frequent_itemsets = apriori(apriori_df, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="support", min_threshold=min_support)
    return rules

germany_rules = get_rules(germany_apriori_df)
germany_rules.head()
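To make the role of min_support more concrete, here is a minimal pure-Python sketch of the level-wise search that the Apriori idea is built on. It is a simplified illustration under my own assumptions, not mlxtend’s implementation. The function name frequent_itemsets and the baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: keep itemsets whose support is at least min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: single items that are frequent on their own
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Candidates of size k are unions of frequent itemsets of size k - 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k - 1)-subset of a candidate must itself be frequent
        current = {
            c for c in candidates
            if all(frozenset(s) in result for s in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return result


# Hypothetical baskets, made up for illustration
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
found = frequent_itemsets(baskets, min_support=0.6)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

In this toy data every single item and every pair reaches the 0.6 threshold, while the triple A, B, C appears in only 2 of 5 baskets and is dropped. Raising min_support shrinks the result, and lowering it lets rarer combinations through.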

As we see below, the new dataframe built from the extracted rules has some columns that we haven’t met yet.

Don’t worry, I explain them below:

  • antecedents -> the first product(s) (the product that we assumed is sold first)
  • consequents -> the next product(s) (the product that we assumed is sold after the first product)
  • antecedent support -> probability of observing the first product(s)
  • consequent support -> probability of observing the next product(s)
  • support -> probability of observing the next product(s) (consequents) and the first product(s) (antecedents) together
  • confidence -> probability of observing the next product(s) when the first product(s) are sold
  • lift -> When the first product(s) are sold, the probability of selling the next product(s) changes by a factor of lift.
  • leverage -> Similar to lift, but leverage tends to prioritize higher support values. We should avoid using it if we already have the lift value.
  • conviction -> (1 - consequent support) / (1 - confidence). It compares how often the antecedents would appear without the consequents if the two were independent with how often that actually happens.
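As a rough illustration of how these columns relate to each other, the sketch below computes confidence, lift, leverage and conviction from three support values, using the standard definitions of these metrics. The function name rule_metrics and the input numbers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def rule_metrics(support_x, support_y, support_xy):
    """Compute rule metrics for X -> Y from three support values."""
    confidence = support_xy / support_x
    lift = support_xy / (support_x * support_y)
    # Leverage: observed joint support minus what independence would predict
    leverage = support_xy - support_x * support_y
    # Conviction: (1 - support(Y)) / (1 - confidence); infinite when confidence is 1
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_y) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical support values, chosen only for illustration
metrics = rule_metrics(support_x=0.4, support_y=0.5, support_xy=0.3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Lift is a ratio and leverage is a difference of the same two quantities, which is why leverage favours rules with higher support while lift does not.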

Creating Some Utility Functions

I’ll create a function for getting the product names by using product id.

def get_item_name(dataframe, stock_code):
    if not isinstance(stock_code, list):
        # single product id: return its name wrapped in a list
        product_name = dataframe[dataframe["StockCode"] == stock_code][["Description"]].values[0].tolist()
        return product_name
    else:
        # list of product ids: return one name per id
        product_names = [dataframe[dataframe["StockCode"] == product][["Description"]].values[0].tolist()[0] for product in stock_code]
        return product_names

I’m going to test it.

get_item_name(germany_df,10125)

I’ll create another useful function. It shows the product that is in the shopping cart and the products recommended for it.

def get_golden_shot(target_id, dataframe, rules):
    # name of the product that is already in the cart
    target_product = get_item_name(dataframe, target_id)[0]
    # recommend_products is defined in the next section
    recommended_product_ids = recommend_products(rules, target_id)
    recommended_product_names = get_item_name(dataframe, recommended_product_ids)
    print(f'Target Product ID (which is in the cart): {target_id}\nProduct Name: {target_product}')
    print(f'Recommended Products: {recommended_product_ids}\nProduct Names: {recommended_product_names}')

Recommending Products

I’ll create another function that simulates the recommendation process. This is where the actual recommendation happens.

def recommend_products(rules_df, product_id, rec_count=5):
    # sort the rules dataframe by the "lift" metric, strongest rules first
    sorted_rules = rules_df.sort_values('lift', ascending=False)
    recommended_products = []

    # walk the rules in lift order, pairing each antecedent set with its consequents
    for antecedents, consequents in zip(sorted_rules["antecedents"], sorted_rules["consequents"]):
        if product_id in antecedents:
            for item in consequents:
                # keep the lift order and skip products that were already collected
                if item not in recommended_products:
                    recommended_products.append(item)

    return recommended_products[:rec_count]

You can find a detailed explanation of the above function in the Kaggle notebook, so I won’t go deep here. In short, the function sorts the dataframe that holds the association rules by the lift metric. Then it selects the rules whose antecedents contain the product_id argument, collects the related products from the consequents column, and adds them to the recommended_products list. Finally, it returns the first rec_count items of that list.

Let’s Do Some Recommendations

I’ll create some product ids for simulating this.

# simulating some products like they are in cart
TARGET_PRODUCT_ID_1 = 21987
TARGET_PRODUCT_ID_2 = 23235
TARGET_PRODUCT_ID_3 = 22747

I can see the target products’ names by using the function that we created above for seeing the product name.

get_item_name(germany_df, [TARGET_PRODUCT_ID_1,TARGET_PRODUCT_ID_2, TARGET_PRODUCT_ID_3])

I’m executing the recommendation function for testing.

Finally

Hopefully, you enjoyed this. I tried to do my best. As mentioned, you can see all the code in the Kaggle notebook.

Kind Regards
