Explained Market basket Analysis using SQL

Samrat Jain
5 min readNov 14, 2018

--

Market What? using SQL!

Yeah Market Basket Analysis is principally done though R or python.But here i am focusing on achieving the same with some simple set of SQL queries.

Before directly jumping into what is Market basket analysis and how its done, lets just talk about why do we do MBA(Market basket Analysis) and where it came from. So MBA is generally a data mining technique that discovers the co-occurrence relationships between items, it is mainly used by business owners to pump up their sales. As Analyst we analyse every transactions and items bought together and measure the probabilities. We then provide these inferences to business user to make critical decisions to boost their sales. How its done, is explained below.

Definition:- Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.Market Basket Analysis can be well explained by answering this simple question.What is the probability of buying product A with product B when bought together?

MBA can only be done when you have bought more than one product in a single transaction. When I say transaction, you can relate it to order or your list of items you want to buy together. For instance, think of a shopping cart or trolley as your basket and we want to analysis the list items you bought.

For Example lets have a look at the below story.

Three people went Telmart to shop and bought some products :

Bag 1 as T1 -> { A,B,H,T}

Bag 2 as T2 -> { A,B,T,U,K}

Bag 3 as T3 -> { B,R,U,K,H,P}

Assumption- Any given product occurs only once in a transaction, here product A in T1 as well as in T2, but product A can not occur twice in T1 or T2. If that’s the case then we can have a separate column like product quantity.Tabular representation of these Item data sets would look like this:-

Fig 1

Here, number of transaction each product was bought can be summarized as

Fig 2

Now, We associate every product which are bought together and then we count the number of transactions they appeared together. There are total 28 combinations that can be formed by 8 products ( n*(n-1)/2). I am listing few of them below.

Fig 3

Lets calculate the probability of buying product A with Product B

Support of A= Number of transactions A was bought in

Support of A and B= Number of transactions A and B was bought together in

Confidence (A -> B) = Support of A and B / Support of B

C(A ->B)=2/3= 0.67=66.67%

It states that when a person buys B there is 66.67% chances that one will buy A also.

C(B ->A)=2/2=1=100%

Here C(A ->B) is not equal to C(B ->A), Since the support of A is different from support of B. Similarly

C(A->T)=Support(A and T together) / Support (T)

C(A->T)=2/2=1=100%

This case says that whenever Product T is bought there is 100% probability of one buying product A.

We are considering only basket of two, where we calculate the confidence of one product over the other i.e C(x->y) where x and y are set of products with only one product example — C(A->T),C(A->B),C(B->A)…..

Now What about Basket of three products, where y has more than one product

C({A}->{B,H})= Support (A,B,H bought together) / Support of (B,H together)

C({A}->{B,H})=1/2=0.5=50%

From Figure one we can clearly see A,B,H are bought together in only one transaction that is T1, and B,H are bought together in two transactions that is T1,T3, So from we say there is 50% chances of buying A when you have already bought B,H or you already have B,H in your Basket. It shows we can calculate the probabilities for basket of n, given that you have already calculated the respective support.

SQL :- Here I GO

For fig 2 considering data is in tabular form as fig 1

Select Product, Count(*) as Number_of_transactions_Count From [dbo].[MASKET] T Group by Product;

For fig 3

SELECT
PRODUCT_1,
PRODUCT_2,
COUNT(TRANSACTION_ID) TRANS_COUNT
FROM
(
SELECT
a.TRANSACTION_ID,
a.PRODUCT_ID PRODUCT_1,
b.PRODUCT_ID PRODUCT_2,
b.TRANSACTION_ID
FROM [dbo].[MASKET] a,
[dbo].[MASKET] b
WHERE a.TRANSACTION_ID=b.TRANSACTION_ID
and a.PRODUCT_ID <> b.PRODUCT_ID
and a.PRODUCT_ID < b.PRODUCT_ID
) Temp
GROUP BY PRODUCT_1,PRODUCT_2;

The above query can be modified to get the basket of 3 as well. By this we have the support count of buying 2,3,4.. products together in other words we can now say the we have the count of transactions where different products were bought together. So we can calculate the probabilities by formula

C(x->y)=Support (x and y together) / Support(y), where x and y are sets of products. This can also be achieved by SQL.

Some Basic Terminologies

Support — The support showcases the probability in favor of the event under analysis.

Confidence — It expresses the operational efficiency of the rule. It calculated as the ratio of the probability of occurrence of the favorable event to the probability of the occurrence of the antecedent.

Lift Ratio — The lift ratio calculates the efficiency of the rule in finding consequences, compared to a random selection of transactions.

Applications of Market Basket Analysis

— Product Placements
— Marketing Messages
— Affinity Promotions
— Inventory Management
— Fraud Detections
— Customer Behavior
— Content Placements

--

--