Similar Products Recommender System for Big Basket

Maqsood Pasha S N
10 min read · Sep 20, 2022


Image credit: https://theprint.in/ani-press-releases/

Table of Contents :

  1. Introduction to Big Basket
  2. Business Problem /Objective
  3. Constraints
  4. Dataset sources and Description
  5. Approaches to the Problem
  6. Mapping to ML problem and Performance Metric
  7. Exploratory Data Analysis
  8. Text Preprocessing
  9. Basic Feature Extraction and its importance
  10. Advanced Feature Extraction and its Importance
  11. TF-IDF weighted Word2Vec using Glove model
  12. Building ML model using Cosine Similarity: Item-Item based similar products
  13. Summary of Results
  14. Github links, Streamlit app Deployment link and LinkedIn link
  15. Video demo
  16. Future work
  17. References

1.Introduction to Big Basket:

BigBasket is the largest online grocery supermarket in India. It was launched around 2011 and has been expanding its business ever since. Although new competitors such as Blinkit have gained a foothold in the country, BigBasket has not lost ground, thanks to an ever-growing customer base and the broader shift to online buying.

2.Business Problem /Objective:

To recommend products that are as similar (relevant) as possible to the product searched for in the BigBasket search query.

3.Constraints:

  • Incorrect recommendations could impact customer experience
  • Low latency requirement

4.Dataset Sources and Description:

The link for the dataset

https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints

Data is available in .csv format

There are 27555 rows in the dataset.
The dataset contains 10 attributes (columns), which are described as follows:

A snap of the dataset overview

5.Existing Approaches to Problem:

Method 1:

Demographic Filtering: recommending items based on a single feature, such as the top 10 rated items or the top 10 items in a particular category.
Example: if we pass the rating column, recommend the top 10 rated items; if we pass another feature such as brand, recommend 10 products of the same brand.

Method 2:

Content-based recommender system using the Description text feature, with TF-IDF as the vectorization technique, to construct a similarity matrix using cosine similarity.
Cosine similarity is calculated on the TF-IDF vectors of the products, and a function is built that recommends similar items given a product title and the item-item cosine similarity matrix.
The function looks up the index of the product title, finds the corresponding row in the similarity matrix, sorts it to get the top ten most similar items, and recommends those products.
The feature used for vectorization and for constructing the similarity matrix is the TF-IDF vectorized Description feature.
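A minimal sketch of this approach, assuming a pandas DataFrame `df` with `product` and `description` columns and a default integer index (names are illustrative, not the exact code from this case study):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF vectorize the product descriptions
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(df["description"].fillna(""))

# Item-item cosine similarity matrix
sim_matrix = cosine_similarity(tfidf_matrix)

def recommend(title, df, sim_matrix, n=10):
    """Return the top-n products most similar to the given product title."""
    idx = df.index[df["product"] == title][0]    # row of the queried product
    scores = list(enumerate(sim_matrix[idx]))    # (index, similarity) pairs
    scores = sorted(scores, key=lambda x: x[1], reverse=True)[1:n + 1]  # skip the product itself
    return df["product"].iloc[[i for i, _ in scores]]
```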

6.Mapping to ML problem and Performance Metric:

ML problem Formulation:
* It is an item-item based collaborative filtering recommender system problem
* Item-based collaborative filtering uses the similarity between items as a measure to recommend items similar to the queried item

Performance Metric:
- Cosine similarity will be used to find items similar to a given query item, using the item (product) details, while modelling
- Cosine similarity is a metric that measures how similar two data objects (two products) are. In cosine similarity, each data object in the dataset is treated as a vector. The formula for the cosine similarity between two vectors is
Cos(x, y) = x . y / (||x|| * ||y||)
where,
x . y = dot product of the vectors ‘x’ and ‘y’,
||x|| and ||y|| = lengths (magnitudes) of the vectors ‘x’ and ‘y’,
||x|| * ||y|| = product of the magnitudes of the two vectors ‘x’ and ‘y’.

Source: https://www.geeksforgeeks.org/cosine-similarity/
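For intuition, here is a small illustrative computation of this formula with NumPy (an example, not code from the original case study):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# cos(x, y) = (x . y) / (||x|| * ||y||)
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)  # ~1.0, since the vectors point in the same direction
```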

  • In real time, the metrics described above do not show us how real customers react to the produced recommendations in terms of the company’s business strategy. The only way to measure that is A/B testing. A/B testing costs more resources and time, but it allows us to measure the model’s performance. One of the techniques used in A/B testing is Click-Through Rate (CTR)

7.Exploratory Data Analysis:

The exploratory data analysis includes the following:-

a) Loading the .csv file into pandas

b) Checking the shape of the data, i.e. the number of rows and columns present in the data

c) Checking the column names and missing values

Column names of data

Of these 10 features, we will be using only 6 features for our Model.

1.Category — Category into which product has been classified-11 unique categories
2. sub_category — Subcategory into which product has been kept-90 unique sub_categories
3. brand — Brand of the product-2314 unique brands
4. type- Type into which product falls-426 unique types
5. sale_price — Price at which product is being sold on the site
6. description- Description of the product (in detail)

In addition to the above 6 features, we will use a newly built feature, `discount`, for each product (i.e., (market_price - sale_price) / market_price), to rank the order of recommended products

We will select the columns ‘product’, ‘category’, ‘sub_category’, ‘brand’, ‘type’, ‘description’, ‘sale_price’, ‘market_price’ and ‘discount_%’ for further processing and feature engineering

Data points with missing brand, product and description values are also removed as erroneous points
The dataset is reduced to 27439 data points after removing the above points

Split the data into train and test with a test size of 0.2 and do the exploratory analysis on the train data features
After the split, Train data: 21951 data points
Test data: 5488 data points
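A rough sketch of these loading, cleaning and splitting steps (the file name and variable names are assumptions, not the exact code used here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("BigBasket Products.csv")   # assumed file name from the Kaggle dataset
print(df.shape)                              # (27555, 10)
print(df.isna().sum())                       # missing values per column

# Drop rows with missing brand, product or description (treated as erroneous points)
df = df.dropna(subset=["brand", "product", "description"])

# Keep the columns used for further processing and feature engineering
cols = ["product", "category", "sub_category", "brand",
        "type", "description", "sale_price", "market_price"]
df = df[cols]

# 80/20 train/test split; EDA is done on the train portion
train, test = train_test_split(df, test_size=0.2, random_state=42)
print(train.shape, test.shape)               # roughly (21951, 8) and (5488, 8)
```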

Product:
- Out of the 21951 train data points, there are 19301 unique values for `product`, and the most frequent product is ‘Turmeric Powder/Arisina Pudi’, occurring 21 times

Category:
There are 11 unique Categories in train data

Code snippet for univariate analysis of the category feature/column
  • 28.72% of products fall in the `Beauty & Hygiene` category
  • Gourmet & World Food occupies 17.01% of total products
  • followed by Kitchen, Garden & Pets, which has a 12.92% share
  • Snacks & Branded Foods has a 10.2% share
  • the remaining categories each have a smaller share of the total product list
Plot of category counts and their cumulative percentage

Sub-Category:
There are 90 unique sub-categories, and Skin Care is the most frequent, occurring 1814 times
- The top four major sub_categories are as follows:

  1. Skin Care - 8.26%
  2. Health & Medicine - 4.22%
  3. Hair Care - 3.78%
  4. Fragrances & Deos - 3.74%
Top 10 sub-categories in train data

Type:
- There are 423 unique types of products in the train data
- Face Care is the top product type, with 1181 counts (5.38%)

Sale_price:

CDF of sale_price
  • 99% of products have a sale price below 2000 in the train data
  • Train data points reduced from 21951 to 21648 (about 98.6% of the train data) after removing products with a sale price greater than 2000

Discount_%:

CDF of discount_% values

Description:
A text describing the product

sample of description of a product

8.Text Preprocessing:

Text preprocessing is done on the categorical features: category, sub_category, brand and type
The code snippet for preprocessing the categorical columns is as follows

The above function preprocesses a column by performing the steps shown in the code

The same is done for the sub_category, type and brand columns
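As a rough illustration, such a preprocessing function might look like the sketch below (lowercasing and stripping special characters are assumptions; the exact steps are in the author's snippet):

```python
import re

def preprocess_categorical(text):
    """Assumed cleaning for categorical columns (category, sub_category, brand, type):
    lowercase, keep only alphanumerics, collapse extra whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

train["category"] = train["category"].apply(preprocess_categorical)
# the same is applied to the sub_category, type and brand columns
```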

Text preprocessing is also done for the description
NLTK stop words are used to remove stop words from the description

Snap of Stopwords loading and creating a variable for stopwords
code for preprocessing description text

The above code cell takes the description text and does the preprocessing as defined in the above function
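A rough sketch of that description preprocessing, assuming NLTK's English stop-word list mentioned above (the other cleaning rules are assumptions):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def preprocess_description(text):
    """Lowercase, keep only letters, and remove NLTK stop words."""
    text = re.sub(r"[^a-zA-Z ]", " ", str(text)).lower()
    return " ".join(w for w in text.split() if w not in stop_words)

train["preprocessed_description"] = train["description"].apply(preprocess_description)
```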

9.Basic Feature Extraction and its importance:

The discount_% feature is built using sale_price and market_price from the given data for model building (i.e., (market_price - sale_price) / market_price)

This gives us the discount range of products to use when finding similar products for the queried one
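A one-line sketch of this feature using the dataset's column names:

```python
# Discount derived from market and sale price; multiply by 100 if the value
# should be expressed as a percentage rather than a fraction
train["discount_%"] = (train["market_price"] - train["sale_price"]) / train["market_price"]
```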

10.Advanced Feature Extraction and its Importance:

As part of advanced feature extraction, we will cluster the train data and add the cluster label as a feature. We will also save the cluster means as a pickle file, so that a query/unseen point can be assigned to the nearest cluster mean and labelled with that cluster

For clustering we will use the scaled sale_price (scaled using a min-max scaler), discount_% and sentiment scores obtained from the preprocessed description text.

Data after adding new feature columns
Stats of the columns used for clustering analysis

For clustering the above columns are stacked using numpy hstack

Code for stacking all advanced features for clustering
code for clustering Analysis
Plot for selecting optimal K using Silhouette score

We will cluster our data into 5 groups, as suggested by the silhouette score and the elbow method, and label the points accordingly
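A condensed sketch of this clustering step; the use of NLTK's VADER for the sentiment scores is an assumption, since the post does not name the sentiment tool:

```python
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Sentiment score of the preprocessed description (assumed: VADER compound score)
train["sentiment"] = train["preprocessed_description"].apply(
    lambda t: sia.polarity_scores(t)["compound"])

# Min-max scale sale_price; the fitted scaler is reused later for query points
scaler = MinMaxScaler()
train["scaled_sale_price"] = scaler.fit_transform(train[["sale_price"]])

# Stack the clustering features with numpy hstack
X = np.hstack([train[["scaled_sale_price"]].values,
               train[["discount_%"]].values,
               train[["sentiment"]].values])

# Pick K by silhouette score, then fit K-Means with the chosen K (here 5)
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
train["cluster"] = kmeans.labels_
```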

T-SNE visualization of cluster analysis for better understanding

T-SNE plot for Kmeans with k=5

We will save the cluster means (centroids) and the train data after all preprocessing and feature engineering for further use, to avoid repeating these steps

11.TF-IDF weighted Word2Vec using Glove model:

We will use a TF-IDF weighted Word2Vec representation of the description text to featurize the text data into numerical data, since ML models take numerical inputs

We will use GloVe: Global Vectors for Word Representation (300 dimensions) for our featurization
The link for downloading pretrained glove word embeddings is:
https://nlp.stanford.edu/projects/glove/

Follow the link for better understanding of TF-IDF weighted Word2Vec
https://towardsdatascience.com/word-embedding-techniques-word2vec-and-tf-idf-explained-c5d02e34d08

TF-IDF Word2Vec featurization of preprocessed train data description text

Code snippet for TF-IDF weighted W2V featurization

The TF-IDF weighted Word2Vec vectors of all the preprocessed description texts in the train data are stored in a list, as shown in the code above, for use in computing item-item cosine similarity
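A sketch of the TF-IDF weighted Word2Vec featurization with pretrained GloVe vectors (the loading code and weighting details are assumptions, not the author's snippet):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the pretrained 300-dimensional GloVe vectors into a {word: vector} dict
glove = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Fit TF-IDF on the preprocessed descriptions to get per-word IDF weights
tfidf = TfidfVectorizer()
tfidf.fit(train["preprocessed_description"])
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def tfidf_w2v(sentence):
    """TF-IDF weighted average of the GloVe vectors of the words in a sentence."""
    words = sentence.split()
    vec, weight_sum = np.zeros(300), 0.0
    for word in words:
        if word in glove and word in idf:
            weight = idf[word] * words.count(word) / len(words)  # tf * idf
            vec += weight * glove[word]
            weight_sum += weight
    return vec / weight_sum if weight_sum else vec

# One 300-d vector per train-data description, stored in a list
train_desc_vectors = [tfidf_w2v(s) for s in train["preprocessed_description"]]
```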

Label encoding categorical features 'category', 'sub_category', 'brand', 'type'

While label encoding the categorical columns, the fit is done on the train data, and any unseen label in the test data will be encoded as the (n+1)th label, for n unique values in the train data
The encoders returned from the function are stored for encoding test data / query data points
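A possible sketch of such an encoder; a simple dictionary-based mapping is assumed here, since sklearn's LabelEncoder does not handle unseen labels on its own:

```python
def fit_label_encoder(values):
    """Map each unique train label to an integer code; unseen labels get the (n+1)th code."""
    mapping = {label: i for i, label in enumerate(sorted(set(values)))}
    n = len(mapping)
    return lambda label: mapping.get(label, n)   # codes 0..n-1 are seen labels, n is "unseen"

encoders = {}
for col in ["category", "sub_category", "brand", "type"]:
    encoders[col] = fit_label_encoder(train[col])
    train[col + "_enc"] = train[col].apply(encoders[col])
```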

Snap of train data after Label Encoding categorical features(category,sub_category,brand,type)

We will build a matrix from the train data to compute cosine similarity, as in the code snippet below
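A sketch of how that matrix might be assembled from the features built above (which columns are stacked, and in which order, is an assumption):

```python
import numpy as np

# Encoded categoricals + engineered numeric features + TF-IDF weighted W2V description vectors
train_matrix = np.hstack([
    train[["category_enc", "sub_category_enc", "brand_enc", "type_enc"]].values,
    train[["scaled_sale_price", "discount_%", "sentiment", "cluster"]].values,
    np.vstack(train_desc_vectors),
])
```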

12.Building ML Model using Cosine Similarity: Item-Item based similarity Recommender system

  • We will give the attributes of the query product
  • The query product should have category, sub_category, brand, type, description, sale_price and market_price
  • Using the above features, we will encode and vectorize them, scale the sale price using the min-max scaler fit on the train data, compute discount_%, calculate sentiment scores, and assign the query to its nearest cluster

Following are the functions used in the data pipeline, with the objectives they perform on the query point

Test case of missing_features function
Function to check sale_price of given product in range of same brand or train data
Function to get sentiment scores of description text of query product

The text preprocessing of the categorical features (category, sub_category, brand, type) and of the query product's description text is done in the same way as for the train data, by defining functions for the same

The query data point is assigned to the nearest (by Euclidean distance) of the five cluster means obtained from the train data for the optimal K in the advanced feature engineering part

Now, after text preprocessing and building the advanced features for the query point, we compute the cosine similarity of the query product with the other products in the train data and recommend the similar products with the highest cosine similarity

The link to code which returns the top similar products for a given query product

The function takes the query point and performs the following steps (a condensed sketch follows the list)

  1. Checks whether the query point has all the needed columns; if not, it gives a warning message and the function exits
  2. After checking for missing columns, it preprocesses the categorical columns (category, sub_category, brand, type) and encodes them with the label encoders fit on the train data
  3. Next, a check is done on sale_price: whether it lies within +-15% of the maximum sale price of the same brand in the train data; if not, the function exits with a message (the salepricecheck function)
  4. Preprocesses the query point's description text using the preprocess_description function
  5. discount_% is computed after the sale_price check
  6. Gets the sentiment scores of the preprocessed query description and scales the sale price with the min and max of the train data
  7. Assigns a cluster label using the cluster means from the train data
  8. Stacks all vectors and features of the query product and computes cosine similarity with the products in the train data
  9. Returns the top n similar products based on cosine similarity
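Below is a heavily condensed sketch of these steps, reusing the helpers sketched in earlier sections; all function and variable names are illustrative assumptions rather than the actual code of this case study:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

REQUIRED = ["product", "category", "sub_category", "brand",
            "type", "description", "sale_price", "market_price"]

def recommend_similar(query, n=10):
    # 1. Warn and exit if any needed attribute is missing
    missing = [c for c in REQUIRED if c not in query or query[c] in (None, "")]
    if missing:
        print(f"Warning: query product is missing attributes {missing}")
        return None

    # 2. Preprocess and encode the categorical columns with encoders fit on the train data
    feats = [encoders[c](preprocess_categorical(query[c]))
             for c in ["category", "sub_category", "brand", "type"]]

    # 3. Sale-price check against the same brand in the train data (within +-15% of its max)
    brand_max = train.loc[train["brand"] == preprocess_categorical(query["brand"]),
                          "sale_price"].max()
    if not np.isnan(brand_max) and not (0.85 * brand_max <= query["sale_price"] <= 1.15 * brand_max):
        print("sale_price is outside the expected range for this brand")
        return None

    # 4-6. Preprocess description, compute discount_%, sentiment and scaled sale price
    desc = preprocess_description(query["description"])
    discount = (query["market_price"] - query["sale_price"]) / query["market_price"]
    sentiment = sia.polarity_scores(desc)["compound"]
    scaled_price = scaler.transform([[query["sale_price"]]])[0][0]

    # 7. Assign the nearest (Euclidean) cluster mean from the train-data K-Means
    point = np.array([scaled_price, discount, sentiment])
    cluster = np.argmin(np.linalg.norm(kmeans.cluster_centers_ - point, axis=1))

    # 8. Stack all query features and compute cosine similarity with the train matrix
    query_vec = np.hstack([feats, [scaled_price, discount, sentiment, cluster],
                           tfidf_w2v(desc)]).reshape(1, -1)
    sims = cosine_similarity(query_vec, train_matrix)[0]

    # 9. Return the top-n most similar products
    top = np.argsort(sims)[::-1][:n]
    return train.iloc[top][["product", "brand", "sale_price", "discount_%"]]
```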

13.Summary Of Results:

The similar products shown have a cosine similarity in the 0.99 range

Snap of result of similar products for the query product
Snap of result for “sauce-chilli,Hot”
Snap of results for “Brahmi Bhringaraj Taila - Anti Graying”
Snap of results for “Instant Pasta — Tomato Salsa”
Snap of warning message when the query product has missing attributes
Snap of results for “Solid Rim Power Active-Lemon”
Snap of results for “Organic-Mustard oil”

14.Github links, Streamlit app Deployment link and LinkedIn link

My Github link to this case study

Link to Streamlit app

You can connect with me on LinkedIn

15.Video demo
Here is a video demo of the deployment in the Streamlit app

16.Future work:

  • Further steps can be taken by imputing missing sale_price values with the mean sale_price of the same brand.
  • Removing duplicates from the train data for products whose product titles differ only in the last 2 or 3 words (grams, quantity, colour, packet size, etc.)

17.References:

Thank you…


Maqsood Pasha S N | Supply Chain Manager | Business Analyst | Aspiring Data scientist to leverage operations with ML and AI