“Unleashing Data Synergy: A Comprehensive Exploration of Association Rule Mining Implementation for Enhanced Customer Insights and Business Growth”

Emel Kunt
Trendyol Tech · 8 min read · Feb 29, 2024

Hello, my name is Emel Kunt and I work at Trendyol as a Data Product Owner in the Data Management tribe. I’m writing this article to explain how we conducted Market Basket Analysis by implementing association rule mining at Trendyol. In this article, you will find information about Market Basket Analysis, association rule mining techniques and terminology, and a detailed overview of the implementation at Trendyol.

Decoding Customer Patterns: Market Basket Analysis

Market basket analysis is a data analysis technique used to understand customers’ purchasing behaviors, based on the assumption that certain products are more frequently bought together than others. The basic output of market basket analysis is often in the form of ‘if-then’ statements: if a customer buys product X, they are likely to buy product Y. The most common technique to perform market basket analysis is association rule mining.

The Journey and Impact of Association Rule Mining Implementation at Trendyol

Deep Dive into Association Rule Mining Terminologies

Association Rule Mining involves a set of terminology crucial for understanding and interpreting patterns within datasets.

Itemset: An itemset is a collection of one or more products that are frequently purchased together.

Support: Support measures how frequently a specific product or product combination appears across all transactions. Expressed as a percentage, it is the ratio of the number of transactions containing the item to the total number of transactions in the dataset.

  • Example: The support for the combination of a “Laptop” and a “Wireless Mouse” is calculated by dividing the number of transactions containing both items by the total number of transactions. If this combination appears in 50 out of 200 transactions, the support is 50/200 = 0.25 or 25%.
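Written as a formula, with T the set of all transactions and X the itemset:

```latex
\mathrm{support}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|} = \frac{50}{200} = 0.25
```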

Confidence: Confidence measures the reliability of an association rule by assessing the likelihood of the consequent item being purchased given that the antecedent item is present in the basket.

  • Example: If the rule is, “If a customer buys a Laptop, they are likely to buy a Laptop Bag,” the confidence is calculated by dividing the number of transactions containing both “Laptop” and “Laptop Bag” by the number of transactions containing “Laptop.” If 30 out of 40 transactions with a Laptop also include a Laptop Bag, the confidence is 30/40 = 0.75 or 75%.
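In formula form, for a rule X ⇒ Y:

```latex
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} = \frac{30}{40} = 0.75
```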

Antecedent and Consequent: The antecedent is the first component of an association rule and the consequent is the second.

Association rules are based on “if-then” statements: the antecedent item is the IF part of the rule and the consequent item is the THEN part.

Lift: Lift is a statistical metric used in association rule mining to assess the strength and significance of the association between two items (antecedent and consequent) in a dataset. It measures how much more likely one product is to be bought when another product is also purchased, compared to its independent purchase, and thus helps determine whether the presence of the antecedent influences the probability of the consequent or whether the two events are independent.

  • Example: For the rule “If a customer buys a Smartphone, they are likely to buy Earphones,” the lift is calculated by dividing the support for both items together by the product of their individual supports. If the support for “Smartphone” and “Earphones” together is 20%, the support for “Smartphone” is 30%, and the support for “Earphones” is 25%, then the lift is (20%) / (30% * 25%) = 2.67. A lift value greater than 1 indicates that the two items are positively correlated, i.e., they appear together more often than random chance would suggest.
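The same calculation as a formula:

```latex
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X) \cdot \mathrm{support}(Y)} = \frac{0.20}{0.30 \times 0.25} \approx 2.67
```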

Precision in Practice: Implementation of Association Rule Mining in Our Environment

In the Trendyol data management team, we use Google Cloud as our data warehouse platform, and the entire data preparation process is managed through BigQuery, which is part of the Google Cloud platform. An integration between the Trendyol order management system and the data transformation process is already in place.

While there are various association rule mining algorithms available, such as Apriori, we have chosen to concentrate on SQL for our study. This decision is rooted in our existing familiarity with SQL, making it a more efficient choice for our exploration.

Our goal was to develop a market basket analysis that is easily understandable, even for newcomers to our team, and to enrich the outputs with various measures essential for our stakeholders. Additionally, we planned to integrate these outputs with Looker, our main data visualization tool, to present them seamlessly to the relevant teams within the company.

About The Dataset

Our dataset includes e-commerce sales transaction data from the last year. Our current focus involves the creation of a market basket analysis data mart, leveraging t-1 model tables, and implementing a daily incremental load as part of our ETL process. This approach ensures that our analyses stay up-to-date and align with the dynamic nature of the e-commerce sales landscape.

This meant we could skip the data cleaning process, because we already had a cleaned and transformed sales order table in the data warehouse.

Step 1: Data Preparation

We created a prep table named ‘bought_together_products_data_prep’. This table encompasses sales data and product information. During this step, we filtered relevant daily sales orders and calculated the support count (frequency) of all products individually. We also incorporated product information from various dimension tables.

Additionally, in our implementation we adjusted the usage of the support measure based on our understanding and our stakeholders’ requirements. Conventionally, support is defined as a ratio: the frequency of a product relative to all transactions. In this step, however, we calculate support as the raw frequency, without the ratio component traditionally associated with it.

Simple pseudocode for step 1:
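The original query isn’t reproduced here, but a minimal BigQuery-style sketch of what this prep step might look like is below. The source name sales_order_table is an illustrative assumption; product_dimension_table and the output columns come from the description that follows.

```sql
-- A minimal, hypothetical sketch of the step-1 prep table.
-- 'sales_order_table' is an assumed source name; the real fact table
-- at Trendyol may differ. Output columns follow the list below.
CREATE OR REPLACE TABLE bought_together_products_data_prep AS
WITH daily_orders AS (
  -- Filter the relevant daily sales orders (t-1 incremental slice).
  SELECT order_id, product_id, order_date
  FROM sales_order_table
  WHERE order_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
),
product_support AS (
  -- Support as a raw frequency: the number of orders containing
  -- each product individually (no ratio component).
  SELECT product_id, COUNT(DISTINCT order_id) AS support
  FROM daily_orders
  GROUP BY product_id
)
SELECT
  d.order_id,
  d.product_id,
  s.support,
  d.order_date,
  p.product_name,   -- product attributes come from the dimension table
  p.category_id,
  p.brand_id
FROM daily_orders AS d
JOIN product_support AS s USING (product_id)
JOIN product_dimension_table AS p USING (product_id);
```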

In this table:

  • Order_Id is the unique identifier for a specific transaction.
  • Product_Id is the identifier of the product in the transaction.
  • Support represents the support count (frequency) of the product across transactions.
  • Order_Date is the date of the transaction.
  • Product_Name, Category_Id, and Brand_Id are additional details about the product, obtained from the product_dimension_table.

Step 2: Calculating Support

In this step, we generated itemset pairs, representing combinations of products within a single transaction. This process involves a self-join on the ‘bought_together_products_data_prep’ table with the condition ‘same order id but different product id.’

Following this, we created a new table named ‘bought_together_products_calculation’ to store the results.

Simple pseudocode for step 2:
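Again, the actual query isn’t shown, but a hedged sketch of the self-join could look like this (column names follow the list below):

```sql
-- Hypothetical sketch of the step-2 itemset-pair calculation.
CREATE OR REPLACE TABLE bought_together_products_calculation AS
SELECT
  a.product_id AS product_id_1,                  -- antecedent product
  b.product_id AS product_id_2,                  -- consequent product
  a.support    AS support_product_id_1,          -- order count of product 1
  b.support    AS support_product_id_2,          -- order count of product 2
  COUNT(DISTINCT a.order_id) AS support          -- frequency of the pair
FROM bought_together_products_data_prep AS a
JOIN bought_together_products_data_prep AS b
  ON  a.order_id   = b.order_id                  -- same order id...
  AND a.product_id != b.product_id               -- ...different product
-- Both orderings (X, Y) and (Y, X) are kept, so each product of a
-- pair can act as the antecedent.
GROUP BY 1, 2, 3, 4;
```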

In this table:

  • product_id_1 is the antecedent product.
  • product_id_2 is the consequent product.
  • support_product_id_1 indicates the total order count of product 1.
  • support_product_id_2 indicates the total order count of product 2.
  • support indicates the frequency or count of occurrences of this specific combination in the dataset.

Step 3: Pruning

We have chosen a daily incremental table model to calculate measures efficiently, using only the most recent transactions in order to minimize processing costs. This approach involves appending the records from the last day to the final bought_together_products table on a daily basis, ensuring that our analyses stay current without requiring the entire dataset to be processed each time.

We also plan to implement pruning in this step: typically, a threshold value is set for support, and results falling below the threshold are eliminated. While this step is conventionally applied to reduce data size and speed up processing, in our implementation we have initially chosen to make pruning optional. We prefer to assess the unfiltered results for a while before deciding on specific support threshold values.
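A sketch of the daily incremental load, with the optional pruning filter left commented out (the threshold value is purely illustrative):

```sql
-- Hypothetical sketch of the step-3 daily incremental load.
-- Yesterday's pair counts are appended to the final table.
INSERT INTO bought_together_products
SELECT
  product_id_1,
  product_id_2,
  support_product_id_1,
  support_product_id_2,
  support
FROM bought_together_products_calculation
-- Optional pruning: eliminate rare combinations below a support
-- threshold. The value 5 is purely illustrative; we currently keep
-- this filter disabled and review the unfiltered results first.
-- WHERE support >= 5
;
```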

In the upcoming phases of the project, we will evaluate the potential benefits of enabling pruning to reduce data volume and enhance processing efficiency.

Step 4: Calculating confidence and lift

We have chosen to calculate the confidence and lift measures directly within our visualization tool. This decision aligns with our goal of dynamically assessing these metrics across different time periods. By incorporating confidence and lift calculations into the visualization layer, we gain the flexibility to analyze customer purchase patterns dynamically, especially during events like Black Friday or national holidays.
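As a rough illustration of the underlying arithmetic (in our setup these live as measures inside Looker rather than in SQL; total_order_count is an assumed helper value):

```sql
-- Hypothetical sketch of the confidence and lift calculations on top
-- of the final table; a simplified stand-in for Looker measures.
SELECT
  product_id_1,
  product_id_2,
  -- Confidence: pair frequency relative to the antecedent's frequency.
  SAFE_DIVIDE(support, support_product_id_1) AS confidence,
  -- Lift: observed co-occurrence vs. what independence would predict.
  SAFE_DIVIDE(
    support * t.total_order_count,
    support_product_id_1 * support_product_id_2
  ) AS lift
FROM bought_together_products
CROSS JOIN (
  SELECT COUNT(DISTINCT order_id) AS total_order_count
  FROM bought_together_products_data_prep
) AS t;
```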

Challenges

Despite the success of our association rule mining implementation, we faced certain challenges along the way. The decision to make the pruning process optional provided flexibility but also presented challenges, particularly when dealing with very large datasets. Furthermore, calculating lift and confidence during the visualization phase presented another challenge, primarily due to the complexities arising from different time periods.

Conclusion

In summary, our association rule mining strategy combines various techniques, with a focus on efficiency and adaptability. The daily incremental table model ensures that our sales data is consistently updated, enabling seamless integration of the latest transactions into our analyses. Calculating confidence and lift measures directly within our visualization process enhances the dynamic nature of our insights, enabling us to understand customer purchase patterns across different periods and significant events.
Notably, all these processes are executed solely using SQL and data warehouse concepts. This deliberate choice reflects our commitment to simplicity, ensuring ease of comprehension and maintenance.

About Us

We’re building a team of the brightest minds in our industry. Interested in joining us? Visit the pages below to learn more about our open positions.
