Product recommendation in Bukalapak Online-to-Offline Business

Deni Suswanto
Nov 18, 2020 · 12 min read

Learn how we help our Mitra to maximize their profit

Mitra Bukalapak

Online-to-offline (O2O) commerce is a business strategy that uses online channels to drive offline sales. In essence, O2O brings offline business activities to Internet platforms and uses these platforms to promote traditional offline businesses (Xiao et al., 2019). In e-commerce terms, this means customers can buy the products they see on an online platform at a nearby offline channel, such as a mom-and-pop stall (warung).

In this context, we are proud that Bukalapak is among the first companies to pioneer this business model in Indonesia. We put it into practice through a program called Mitra Bukalapak, launched in 2017. Our partners (Mitra) in this program are warungs, well-known entities that drive most retail transactions throughout Indonesia. As cited by Kr-Asia, a Nielsen report published in 2018 found that FMCG retail sales in Indonesia reached IDR 700 trillion (USD 47.5 billion), with warungs facilitating around 72% of those transactions. We have now partnered with more than six million warungs all over Indonesia, and still counting, which makes Bukalapak the market leader in the industry.

Under the O2O scheme, Bukalapak supplies goods for warungs to sell, at lower prices, which owners can purchase online through the Mitra Bukalapak app. After a warung owner places an order, we deliver the purchased goods right to their doorstep. They no longer need to visit a market or grocery outlet in person just to restock, as they used to, and can spend that time on something else. We also help them grow by broadening their product offering with a family of essential virtual products such as pulsa (mobile balance) and electricity tokens, and they can even buy online products through Bukalapak.

Maximizing warung’s profit through recommendations

Considering the importance of warungs, we at Bukalapak try our best to maximize the revenue of our warung partners. The approach we use is to ensure that they can maximize their sales by providing all the essential products needed by their customers. We achieve this goal by recommending items that are likely to sell well in their area. In a nutshell, we provide some sort of recommendations that allow warung owners to sell complementary or similar products based on our rich dataset.

One common method to build a recommendation system is Market Basket Analysis (MBA). At a high level, MBA works by looking for combinations of items that frequently occur together in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy. In this article, we discuss what Market Basket Analysis is, why we use it, and what impact it has on Bukalapak's business, especially the O2O business.

Market Basket Analysis 101

Market basket analysis (also known as association-rule mining) is a useful method of discovering customer purchasing patterns by extracting associations or co-occurrences from stores’ transactional databases (Chen et al., 2005). In other words, this method identifies the relationship between one item and another. With this information, the stakeholders involved (sellers) can increase their sales by cross-selling to their existing customers.

In our case, we can use the insights gained from an MBA in many ways, including:

Knowing customers' wants and circumstances makes up-selling and cross-selling additional items an easy win for any business. If we can predict customers' shopping lists, we can recommend those products to our Mitra (warung owners) so that they can sell more goods.

“More sales means more profit”

MBA Concepts

In this section, let's get acquainted with the concepts behind MBA. Our goal is to make readers familiar with the general concepts that serve as the building blocks of market basket analysis. Readers who want more detail and are interested in the math behind the concepts can refer to this chapter in Introduction to Data Mining, which covers the subject in depth.

We start with association rules. An association rule between two sets of items is written as an implication, like this:

{Coffee} → {Sugar}

The above rule reads as follows: if coffee is bought, customers also buy sugar (in the same transaction). In terms of terminology, {Coffee} in the above example is called the antecedent or left-hand side (LHS), while {Sugar} is called the consequent or right-hand side (RHS). Both the antecedent and the consequent can contain multiple items. For example:

{Coffee, Milk} → {Tea, Sugar}

That is, when coffee and milk are bought, there is a good chance that tea and sugar are also bought.

From the two examples above, readers might correctly guess that a very large number of rules can be generated if we have hundreds or thousands of products (think of all possible combinations of the products). It is therefore natural to ask which of those association rules are interesting, i.e. which ones are prevalent in the data. To answer this, researchers have proposed several different metrics, which we discuss in the next section.
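To get a feel for how quickly the number of possible rules grows, here is a minimal sketch that counts every rule A → C where A and C are non-empty and share no items. The function names are made up for illustration. The closed form 3^d - 2^(d+1) + 1 for d items is the standard count given in data mining textbooks, and the brute-force enumeration is included only as a sanity check on it.

```python
from itertools import combinations


def count_rules_brute_force(items):
    """Enumerate every rule A -> C with A and C non-empty and disjoint."""
    items = list(items)
    count = 0
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            remaining = [i for i in items if i not in antecedent]
            # Any non-empty subset of the remaining items can be the consequent
            for m in range(1, len(remaining) + 1):
                for _consequent in combinations(remaining, m):
                    count += 1
    return count


def count_rules_closed_form(d):
    """Number of possible rules over d distinct items."""
    return 3 ** d - 2 ** (d + 1) + 1


products = ["apple", "banana", "cheese", "milk"]
print(count_rules_brute_force(products))      # rules over 4 products
print(count_rules_closed_form(len(products)))  # same count from the formula
print(count_rules_closed_form(100))            # already astronomically large
```

Even four products give 50 possible rules, which is why we need metrics to filter out the uninteresting ones.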

The Metrics
I mean the metrics, not the matrix

There are three popular metrics for evaluating the quality or strength of an association rule: support, confidence, and lift. To make them easier to understand, we will use the following simple transactions dataset as our setting. We denote the antecedent and the consequent as A and C, respectively.

In the table below, there are 10 transactions involving 4 different products: apple, banana, cheese, and milk.

Table 1


Support of a rule is the relative frequency of the rule showing up in the database. This metric is used to measure the abundance or frequency (often interpreted as significance or importance) of an itemset in a database.

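Support metric equation:

$$\text{support}(A \rightarrow C) = \frac{\text{number of transactions containing both } A \text{ and } C}{\text{total number of transactions}}$$

From the above formula, we know that the value of any support is between 0 and 1 (inclusive). Support equals zero when the rule never occurs in the data, i.e. no transaction contains the items of A and C together. Conversely, support equals 1 when every transaction contains the items of both A and C.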

Let’s have an example! Support ({Milk} → {Cheese}) is 5/10 since milk and cheese are found in 5 transactions out of 10 transactions in the table.


Confidence is a metric about the reliability of the rule. It measures how much the consequent (item) depends on the antecedent (item). We can compute confidence using the following formula.

Confidence metric equation:

$$\text{confidence}(A \rightarrow C) = \frac{\text{support}(A \rightarrow C)}{\text{support}(A)}$$

We see that the range of confidence is also [0, 1]. Readers familiar with probability theory will recognize this as a conditional probability: the probability that a transaction contains C given that it contains A.

Continuing our example, confidence ({Milk} → {Cheese}) is 5/7 since milk and cheese are found in 5 transactions, whilst milk alone exists in 7 transactions.


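Lift (also called improvement or impact) is the ratio of the observed support to the support expected if A and C were independent. It tells us how much more likely the items in C are to be bought together with the items in A than chance alone would suggest. Lift is mathematically defined as follows.

Lift metric equation:

$$\text{lift}(A \rightarrow C) = \frac{\text{support}(A \rightarrow C)}{\text{support}(A) \times \text{support}(C)}$$

Unlike the previous two metrics, lift ranges over [0, infinity). A value greater than one indicates that the items are bought together more often than expected under independence (an interesting rule). A value of exactly one indicates independence, and a value below one indicates that the items are bought together less often than expected.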

To compute lift ({Milk} → {Cheese}), we first need support ({Milk}) and support ({Cheese}). Both equal 7/10, since milk and cheese each appear in 7 of the 10 transactions in the table. We have already computed support ({Milk} → {Cheese}) = 5/10, so lift ({Milk} → {Cheese}) = (5/10) / ((7/10) x (7/10)) = 50/49 ≈ 1.02, only slightly above one.
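The three worked examples above can be checked with a few lines of plain Python. This is only a sketch: the transaction list is a hypothetical stand-in for Table 1, chosen solely to match the counts quoted in the text (10 transactions, milk in 7, cheese in 7, both together in 5), and the function names are my own. Fractions are used so the results stay exact.

```python
from fractions import Fraction

# Hypothetical stand-in for Table 1: only the milk/cheese counts are taken
# from the article; the exact rows are an assumption.
transactions = [
    {"milk", "cheese", "apple"},
    {"milk", "cheese"},
    {"milk", "cheese", "banana"},
    {"milk", "cheese", "apple", "banana"},
    {"milk", "cheese"},
    {"milk", "apple"},
    {"milk", "banana"},
    {"cheese", "apple"},
    {"cheese", "banana"},
    {"apple", "banana"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """support(A -> C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def lift(antecedent, consequent, transactions):
    """support(A -> C) / (support(A) * support(C))."""
    return (support(antecedent | consequent, transactions)
            / (support(antecedent, transactions) * support(consequent, transactions)))


A, C = {"milk"}, {"cheese"}
rule_support = support(A | C, transactions)
rule_confidence = confidence(A, C, transactions)
rule_lift = lift(A, C, transactions)
print(rule_support, rule_confidence, rule_lift)  # 1/2 5/7 50/49
```

Note that 5/10 is printed in its reduced form, 1/2.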

Market Basket Analysis (MBA) Tutorial

In this section, we will perform MBA on a sample sales dataset. The goal remains to identify groups of products that are frequently bought together. As mentioned above, this kind of information allows us to boost sales through cross-selling and/or up-selling.

The flowchart below summarizes the steps of our analysis.

Flowchart to get association rules

1. The data

We will work on a simple dataset consisting of sales details for some retail products. Below is a snippet of the data so you can get an idea of what it looks like. I use data from Kaggle as an example; you can find and download the full data here.

import pandas as pd

df = pd.read_excel('/dataset/online_retail.xlsx')

Let’s do some simple exploratory data analysis here to make sure our data is ready to process.

The dataset covers the period from 2010-12-01 to 2011-12-09. From the histogram below, we can see that the number of transactions stays relatively flat until the middle of September 2011, when it starts to increase.

Distribution of Transactions

It turns out that most of the transactions were shipped to the United Kingdom. This makes sense, since the e-commerce company is in fact based in the UK.

Distribution of created transactions per country
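The two checks above can be sketched with pandas as follows. The tiny DataFrame is made up so the snippet runs on its own, and the InvoiceDate column name is an assumption (it does not appear in the snippets in this article); with the real data you would run the same three statements on df.

```python
import pandas as pd

# Hypothetical mini-sample; the values are invented for illustration only
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367", "536368"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01", "2010-12-01", "2011-01-15", "2011-09-20", "2011-12-09",
    ]),
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "United Kingdom"],
})

# Date range covered by the data
date_min, date_max = df["InvoiceDate"].min(), df["InvoiceDate"].max()

# Distinct invoices per month (one invoice can span several rows)
monthly = df.groupby(df["InvoiceDate"].dt.to_period("M"))["InvoiceNo"].nunique()

# Share of distinct invoices per country, largest first
invoices_per_country = (df.groupby("Country")["InvoiceNo"]
                          .nunique()
                          .sort_values(ascending=False))
country_share = invoices_per_country / invoices_per_country.sum()

print(date_min.date(), date_max.date())
print(monthly)
print(country_share)
```

Counting distinct invoice numbers rather than rows matters here, because each row is a line item and a single transaction usually has several.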

2. Preprocessing

We can read the description about the dataset here, but basically, we just need these features for the association rule:

But first, we need to clean up our data. There are several things to do:

Duplicate Description grouped by stockCode column
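# Strip leading and trailing whitespace from product descriptions
df['Description'] = df['Description'].str.strip()
# Drop rows that have no invoice number
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
# Invoice numbers containing 'C' are cancellations; remove them
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]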
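The article does not show how duplicate descriptions per stock code were handled, so the following is only one possible approach, not necessarily the one used here: keep the most frequent description for each StockCode. The product names below are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: the same StockCode appears with inconsistent descriptions
df = pd.DataFrame({
    "StockCode": ["A1", "A1", "A1", "B2", "B2"],
    "Description": ["RED MUG", "RED MUG ", "red mug damaged",
                    "BLUE PLATE", "BLUE PLATE"],
})

# Remove stray whitespace first so "RED MUG " and "RED MUG" count as the same
df["Description"] = df["Description"].str.strip()

# Pick the most common description per stock code as the canonical one
canonical = (df.groupby("StockCode")["Description"]
               .agg(lambda s: s.mode().iloc[0]))

# Overwrite every row with the canonical description of its stock code
df["Description"] = df["StockCode"].map(canonical)
print(df)
```

This matters because the basket matrix built in the next step uses Description as its column key, so two spellings of one product would otherwise be treated as two different products.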

3. Generate Association Rules

After cleaning up the data, we need to consolidate the items into one transaction per row, with each product one-hot encoded. In this tutorial, I only use transactions created in the UK, since that country accounts for more than 90% of all transactions. We could also compare association rules across countries, for example Germany vs EIRE, but we don't do that here.

basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

Next, we need to make sure that any positive value is converted to 1 and that anything less than or equal to 0 is set to 0. This completes the one-hot encoding of the data. We also remove the postage column, since that charge is not one we wish to explore.

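def encode_units(x):
    # Any quantity of 1 or more becomes 1; zero or negative becomes 0
    return 1 if x >= 1 else 0

# Apply element-wise (pandas 2.1+ offers DataFrame.map as the newer name for applymap)
basket_sets = basket.applymap(encode_units)
# Drop the postage column, which is a shipping charge rather than a product
basket_sets.drop('POSTAGE', inplace=True, axis=1)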
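To see what the consolidation and encoding steps produce, here is a small self-contained sketch on made-up invoices. The rows are hypothetical, and the boolean comparison `basket >= 1` is a vectorized alternative to applying encode_units cell by cell, not the code used above.

```python
import pandas as pd

# Hypothetical line items: three invoices, one of which includes a postage charge
df = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "2", "3"],
    "Description": ["COFFEE", "SUGAR", "COFFEE", "POSTAGE", "SUGAR"],
    "Quantity": [2, 1, 1, 1, 3],
})

# One row per invoice, one column per product, cells hold summed quantities
basket = (df.groupby(["InvoiceNo", "Description"])["Quantity"]
            .sum()
            .unstack()
            .fillna(0))

# One-hot encode (1 if the product was bought at all) and drop the postage column
basket_sets = (basket >= 1).astype(int).drop(columns="POSTAGE")
print(basket_sets)
```

Each row of basket_sets is now one transaction, and each column says whether a product was in it, which is the input format apriori expects.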

Now we have properly structured data from which we can generate frequent itemsets. The min_support threshold is adjustable; in this tutorial, we set it to 3% to get a decent number of results to show.

from mlxtend.frequent_patterns import association_rules, apriori

frequent_itemsets = apriori(basket_sets, min_support=0.03, use_colnames=True)

The final step is to generate association rules. We can choose which metric to filter the rules on; here we only look at support, confidence, and lift. For the filter itself, I use confidence with a min_threshold of 2%.

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.02)

Voila! Here are the sample results:

That’s pretty much it! We build frequent items using apriori then build the rules with association_rules.
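If you are curious what happens inside those two calls, the sketch below shows the level-wise idea behind Apriori in plain Python. It is a simplified illustration, not mlxtend's implementation, and the function names and toy transactions are my own. The key trick is that an itemset can only be frequent if all of its subsets are frequent, so each level only extends itemsets that survived the previous one.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    frequent = {}
    # Level 1 candidates: every single item
    current = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while current:
        # Keep only the candidates that are frequent enough
        level = {}
        for candidate in current:
            supp = sum(1 for t in transactions if candidate <= t) / n
            if supp >= min_support:
                level[candidate] = supp
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-item subset
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def generate_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                consequent = itemset - antecedent
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    lift = conf / frequent[consequent]
                    rules.append((antecedent, consequent, supp, conf, lift))
    return rules


# Hypothetical toy transactions
transactions = [
    {"coffee", "sugar"},
    {"coffee", "sugar", "milk"},
    {"coffee", "milk"},
    {"tea", "sugar"},
    {"coffee", "sugar", "tea"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.4)
rules = generate_rules(frequent, min_confidence=0.7)
for antecedent, consequent, supp, conf, lift in rules:
    print(set(antecedent), "->", set(consequent), round(supp, 2), round(conf, 2), round(lift, 2))
```

On real data you would still use a library, since this version rescans every transaction for every candidate and makes no attempt at efficiency.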

Now, we will show how to visualize the Market Basket Analysis association rules using a heatmap. First, we count the number of items in each antecedent and store it in a column named 'lhs items' (LHS stands for left-hand side).

rules['lhs items'] = rules['antecedents'].apply(lambda x: len(x))

Then, we show all the rules as follows.

# Import seaborn under its standard alias
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
# Replace frozen sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))
# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules.pivot(index='antecedents_', columns='consequents_', values='lift')
# Generate a heatmap with annotations on
sns.heatmap(pivot, annot=True)
MBA results visualization

The above heatmap shows the lift of the corresponding row-column pair. Here is an example of how to read it. Notice the value '13' in the first row. It reads: Alarm clock bakelike green and Alarm clock bakelike red are bought together about 13 times more often than we would expect if they were purchased independently.

Using the results shown in the heatmap, we can recommend at least two strategies to the commerce platform:


In this post, we have learned the concept of Market Basket Analysis (MBA), a useful method for building a product recommendation system based on so-called association rules. We also demonstrated how to perform MBA on a sample e-commerce dataset in Python.

There are several remarks, though. First, note that association analysis can consume a lot of memory when the dataset contains a large number of products or transactions. In such cases, we can run the analysis on product categories instead of individual product IDs to reduce memory usage.
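As a rough sketch of the category idea, the snippet below replaces each product with its category before any baskets are built. The product names and the product-to-category mapping are hypothetical; in practice the mapping would come from your own product catalogue.

```python
# Hypothetical product -> category mapping
product_category = {
    "Arabica coffee 200g": "coffee",
    "Robusta coffee sachet": "coffee",
    "White sugar 1kg": "sugar",
    "Brown sugar 500g": "sugar",
    "UHT milk 1L": "milk",
}

# Hypothetical product-level transactions
transactions = [
    {"Arabica coffee 200g", "White sugar 1kg"},
    {"Robusta coffee sachet", "Brown sugar 500g", "UHT milk 1L"},
    {"Arabica coffee 200g", "UHT milk 1L"},
]

# Re-express every transaction in terms of categories
category_transactions = [
    {product_category[product] for product in t} for t in transactions
]

# The one-hot matrix needs one column per distinct item, so fewer items
# means a narrower matrix and far fewer candidate itemsets
n_products = len({p for t in transactions for p in t})
n_categories = len({c for t in category_transactions for c in t})
print(n_products, "product columns ->", n_categories, "category columns")
```

The trade-off is granularity: the resulting rules say that coffee goes with sugar, but no longer which coffee goes with which sugar.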

Second, several libraries offer association analysis functionality. I chose Python with the mlxtend library mainly because it is easy and straightforward to use. I encourage you to explore the rest of mlxtend, try another library, or even write your own implementation of association analysis.

And finally, there are other metrics beyond the three discussed in this article, such as conviction and leverage (you can look up the details yourself).
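As a quick taste of those two metrics, here is a minimal sketch that reuses the milk and cheese numbers from the metrics section. The definitions used are the commonly cited ones: leverage is the observed support minus the support expected under independence, and conviction is (1 - support(C)) / (1 - confidence(A → C)).

```python
from fractions import Fraction

# Values taken from the {Milk} -> {Cheese} example earlier in the article
support_rule = Fraction(5, 10)   # support({Milk} -> {Cheese})
support_a = Fraction(7, 10)      # support({Milk})
support_c = Fraction(7, 10)      # support({Cheese})
confidence = support_rule / support_a

# Leverage: 0 means independence, positive means bought together more than expected
leverage = support_rule - support_a * support_c

# Conviction: 1 means independence, larger values mean the rule is wrong less often
conviction = (1 - support_c) / (1 - confidence)

print(leverage, conviction)  # 1/100 21/20
```

Both values sit very close to their independence baselines (0 and 1), which agrees with the lift of 50/49 computed earlier.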

All in all, thanks for reading this article, and happy playing around with market basket analysis in your next data science projects!


Xiao, L., Zhang, Y., Fu, B. (2019). Exploring the moderators and causal process of trust transfer in online-to-offline commerce. Journal of Business Research, 98, 214–226.

Chen, Y. L., Tang, K., Shen, R. J., Hu, Y. H. (2005). Market basket analysis in a multiple store environment. Decision Support Systems, 40, 339–354.


Special thanks to Pararawendy Indarjo, who helped proofread this article.

Bukalapak Data

Turning Insight Into Action