DATA STORIES | RECOMMENDATION SYSTEM | KNIME ANALYTICS PLATFORM

Product recommendation for the Huimitu e-commerce site using Association Rules with KNIME

An end-to-end solution for a university project: from e-commerce design to a recommendation engine in production

Do Vuong Phuc
Low Code for Data Science

--

What is Huimitu?

Huimitu is a bakery e-commerce shop developed by my team and me as a university project. The requirement was to build an e-commerce website using popular frameworks (e.g. ReactJS, ExpressJS) instead of an off-the-shelf platform like WordPress.

Huimitu webview.

The cool thing is that I suggested my team apply KNIME, a free, open-source and low-code platform for data science, to implement the related-products recommendation feature.

Let’s dive into how we leveraged KNIME to do that!

Use cases

In this section, we summarize only the part of the specification that relates to our recommendation system. Roughly speaking, we have:

  • Product: has basic information like name and description. Each product may also have variants (e.g. color, size) with different prices.
  • Cart: each user has exactly one cart, which can contain many product variants with their corresponding quantities and prices.
  • Order: after the cart has been checked out, it becomes an order. Because product information (e.g. price, discount) changes over time, the order must store a snapshot of that information at check-out time. After check-out, we also need to track the order status (PENDING, SHIPPING, CANCEL, SUCCESS, REFUND).

Moreover, on the product detail page, we need to display related products by analyzing the users’ buying patterns.

Recommendation algorithms

Thinking about a comprehensive recommendation system for an e-commerce website, we can implement different strategies at different stages:

  • Related products recommendation
  • Cart recommendation
  • Top buying products
  • User-based recommendation

Each of them may be implemented differently. For instance, the “top buying” products can easily be produced by aggregating and sorting order data (see the sketch below), whereas “user-based recommendation” must leverage the users’ information.
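As a quick illustration, a “top buying products” list can be produced with a single aggregate query. This is only a sketch: the order_item table and its columns are assumptions standing in for the real order-lines data, not the actual Huimitu schema.

-- Hypothetical sketch: top 10 best-selling products
SELECT product_id,
       SUM(quantity) AS total_sold
FROM order_item        -- assumed flattened order-lines table
GROUP BY product_id
ORDER BY total_sold DESC
LIMIT 10;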

In general, the recommendation algorithms can be divided into three groups:

  • Non-personalized
  • Semi-personalized
  • Personalized

Personalized level of recommendation algorithms (source: Practical Recommender Systems).

In our case, we only needed to recommend “related products”, so we did not consider the users’ information. This is called non-personalized recommendation.

Regarding the approach, we can either:

  • Approach 1: Use the products’ information (name, description) to learn the relationships between them.
  • Approach 2: Use the users’ transactions to learn their buying patterns.

The first approach is an NLP (Natural Language Processing) problem, which is quite complex for our system. We decided to use the second approach: analyzing the users’ transactions to find buying patterns with the Association Rule algorithm.

Association rule algorithm

Association rule mining (based on frequent itemset mining) is an algorithm for finding rules that relate items to each other. The items can be anything, such as our products, Netflix films, Facebook posts, etc. To use this algorithm, we input a set of transactions and the output is a set of rules:

  • Input: a set of transactions, each containing a set of items
  • Output: a set of rules

A transaction can be a purchased cart, the films watched in the same session, or even the list of posts a user likes. The algorithm gives us association rules like:

  • If the user bought product A, they may buy product B
  • If the user watched film A, they may like film B

For example, when a user buys lemons, they may also want to buy sugar to make lemonade. Or when they buy apples with flour, honey may be the next thing they want to buy (maybe to make a delicious apple cake?).

To implement it, there are many algorithms to choose from, including:

  • Apriori algorithm
  • Frequent pattern growth (FP-growth)
  • Equivalence Class Transformation (ECLAT)

However, we won’t dive deep into them in this discussion. We just need to understand the metrics they provide. Specifically, support and confidence are the two metrics we consider here:

  • Support(A & B) = (the number of transactions containing both A and B) / (the total number of transactions)
  • Confidence(A → B) = (the number of transactions containing both A and B) / (the number of transactions containing A)

In the formulas above, both A and B are itemsets (sets of items). You can think of support as the probability that a transaction (order) contains both A and B. Confidence is the chance of seeing B given that the transaction contains itemset A.
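To make these formulas concrete, here is how the two metrics could be computed for one candidate rule directly in PostgreSQL. This is only a sketch: the order_item table is a hypothetical stand-in for the transaction data, and products 5 and 7 are the pair used in the example later in this post.

-- Sketch: support(5 & 7) and confidence(5 -> 7) over a hypothetical order_item table
SELECT
    COUNT(*) FILTER (WHERE has_5 AND has_7)::float / COUNT(*)      AS support_5_7,
    COUNT(*) FILTER (WHERE has_5 AND has_7)::float
        / NULLIF(COUNT(*) FILTER (WHERE has_5), 0)                 AS confidence_5_to_7
FROM (
    -- one row per transaction (order), flagging whether it contains each product
    SELECT order_id,
           BOOL_OR(product_id = 5) AS has_5,
           BOOL_OR(product_id = 7) AS has_7
    FROM order_item
    GROUP BY order_id
) orders_flagged;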

Database schema

First, you need to understand how we structured our data in the database. In this project, we used PostgreSQL, a relational database.

As discussed above, we have different entities, including:

  • Product
  • Variant
  • Cart
  • Order

To store the “buying” rules, we also declare a table called “frequent_product”, which stores the relation between two products. It is worth noting that we only suggest related “products”, not “variants”. In conclusion, our database schema looks like this:

Huimitu database schema (Illustrated using DrawSQL).
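To give a rough idea of the rules table in SQL terms, here is a minimal sketch of frequent_product. The product_id, consequent_id and confidence columns follow from the inference query shown later in this post; the support column and the exact types are assumptions.

-- Minimal sketch of the rules table (types and the support column are assumptions)
CREATE TABLE frequent_product (
    product_id    INTEGER REFERENCES product(id),  -- the product being viewed (antecedent)
    consequent_id INTEGER REFERENCES product(id),  -- the product to recommend (consequent)
    support       NUMERIC,
    confidence    NUMERIC
);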

Using KNIME to learn patterns

As mentioned before, we used KNIME, which is a very powerful platform, to learn the rules. To start the KNIME workflow, we connect to the PostgreSQL database using the PostgreSQL Connector node:

Because we have several tables, joining the data is a must. After joining the tables, we filter the orders to retain only the successful ones, which we use later in the workflow. Note that the nodes used here are database manipulation nodes, so all the queries are executed in the database before the data is fetched into KNIME memory. Moreover, to reduce the data size, we keep only the relevant columns and drop the unused ones (e.g. product name, description).

Subsequently, the fetched rows that share the same “order_id” are grouped together (the products bought in the same order). To feed the data into the Association Rule Learner node, we convert each group into a bit vector. After getting the results, we post-process the data so that the column names match our database schema. Lastly, the processed data is loaded into the “frequent_product” table.
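Conceptually, the grouping step collapses the order lines into one row per order, listing the products bought together. In PostgreSQL terms it would look roughly like this (order_item is again a hypothetical stand-in for the joined tables):

-- Rough SQL equivalent of the GroupBy step: one row per order with its set of products
SELECT order_id,
       ARRAY_AGG(DISTINCT product_id) AS products
FROM order_item
GROUP BY order_id;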

For example, we obtain the grouped table and the bit vector table as follows:

For the Association Rule Learner node, we need to set several options:

  • Minimum support: The minimum support we want to keep for the itemsets.
  • Itemset type: The type of itemset we want to find. In this case, we only consider MAXIMAL itemsets (frequent itemsets with no frequent superset), which keeps the output compact.
  • Maximal itemset length: The maximum length of our itemset. In our case, we only use itemsets whose length is 2 (product A is related to product B).
  • Minimum confidence: The minimum confidence of our rule.

The settings may differ according to the data size, so you will need to fine-tune these parameters for your application:

As a result, we receive a set of 5 rules, shown in the image below.

Looking at the rules, we can easily see that there is a rule saying that when a user bought product 5, they are likely to also buy product 7. This is because:

  • Support(5 & 7): 2 out of the 5 transactions (orders) contain both product 5 and product 7, so the support is 2/5 = 0.4.
  • Confidence(5 → 7): whenever product 5 appears, it appears together with product 7 (2 out of 2 transactions), so the confidence is 1.

In contrast, the reverse rule (7 → 5) is not kept, because its confidence is only 2/3 ≈ 0.67: product 7 appears in 3 orders, but only 2 of them also contain product 5.

From the result, we can output recommendations by displaying, for instance, the consequents with the top 5 highest confidence. As we can see, for product ID 3 we can recommend the following products (a SQL sketch of this top-N selection follows the list):

  • Product 5
  • Product 7
  • Product 2
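If we wanted to materialize this “top 5 consequents per product” selection in the database rather than in KNIME, a window-function query could do it. This is only a sketch, assuming the frequent_product columns described earlier:

-- Sketch: keep at most the 5 highest-confidence consequents per antecedent product
SELECT product_id, consequent_id, confidence
FROM (
    SELECT product_id,
           consequent_id,
           confidence,
           ROW_NUMBER() OVER (PARTITION BY product_id
                              ORDER BY confidence DESC) AS rn
    FROM frequent_product
) ranked
WHERE rn <= 5;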

Scheduling batch job to build rules

Because the rules do not need to be updated in real time, and the computation is costly (we fetch all the transactions to learn the rules), we schedule the pipeline to run periodically instead of re-training every time a new transaction is made.

There are several ways to do this. In our case, we created a command (CMD) file and used the Windows Task Scheduler to run it every 2 hours. Here is the command file:

@echo off
echo Begin running…
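REM Run KNIME headlessly in batch mode: -reset re-executes all nodes, -nosave leaves the saved workflow untouched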
"P:\KNIME\knime.exe" -nosave -consoleLog -nosplash -reset -application org.knime.product.KNIME_BATCH_APPLICATION -workflowFile=".\huimitu-frequent-item.knwf"
echo Done

Provide API for inference from rules

Once we have the rules, we need to provide an API for inference (i.e. returning predictions). This can be done with a simple SQL query. When a user views a specific product (identified by product_id), we select all the rules that contain that product as the antecedent and sort them by confidence. We also join with other tables to get the necessary information (e.g. category):

SELECT product.id,
       product.product_name,
       category.category_name,
       product.description,
       product.avg_rating,
       product.count_rating,
       product.min_price,
       product.max_price,
       product.stock,
       product.created_time
FROM frequent_product
JOIN product ON frequent_product.consequent_id = product.id
JOIN category ON product.category_id = category.id
WHERE frequent_product.product_id = #{query.product_id}
ORDER BY frequent_product.confidence DESC
LIMIT #{query.limit}

Further study

Versioning rules

Because the project above is just a university project, it has a few weak points, some of which we only recognized after working in industry. For example, reloading the rules table may reduce application performance due to table locking. To resolve this problem, we can add a version column to the table and increment it whenever a new set of rules is created. This not only handles the locking issue above, but also gives us proper versioning of the rules.
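As a sketch of this idea (the version column and query shape are assumptions, not the original implementation), the batch job would write each new rule set with an incremented version, and the API would read only the latest one:

-- Sketch: add a version column and serve only the most recent rule set
ALTER TABLE frequent_product ADD COLUMN version INTEGER;

SELECT product_id, consequent_id, confidence
FROM frequent_product
WHERE version = (SELECT MAX(version) FROM frequent_product);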

Applying NLP to improve the accuracy

In this project, we only used the users’ transactions to learn rules; the product information was not used. A good next step would be to apply NLP to further analyze the relationships between products based on their descriptions. We could use either approach on its own, but combining both is likely to yield higher accuracy for our platform.

Cart recommendation

Moreover, now that we understand the itemset mining algorithm, it is easy to apply it to cart recommendation. Let’s extend the example above: once we have the rules, we can look at the user’s cart, find the rules whose antecedents are subsets of the cart, and recommend their consequents. If the user’s cart contains water, apples and flour, we can recommend honey.
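With the pairwise frequent_product table used in this project, a minimal cart-recommendation sketch boils down to collecting the consequents of rules triggered by products already in the cart (the product id list below is just a placeholder):

-- Sketch: recommend consequents of rules whose antecedent is already in the cart
SELECT consequent_id,
       MAX(confidence) AS best_confidence
FROM frequent_product
WHERE product_id IN (5, 7, 2)          -- placeholder: product ids currently in the cart
  AND consequent_id NOT IN (5, 7, 2)   -- skip items the user already has
GROUP BY consequent_id
ORDER BY best_confidence DESC
LIMIT 5;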

--
