Similar Products Recommender System for Big Basket
Table of Contents:
- Introduction to Big Basket
- Business Problem /Objective
- Constraints
- Dataset sources and Description
- Approaches to the Problem
- Mapping to ML problem and Performance Metric
- Exploratory Data Analysis
- Text Preprocessing
- Basic Feature Extraction and its importance
- Advanced Feature Extraction and its Importance
- TF-IDF weighted Word2Vec using Glove model
- Building ML model using Cosine Similarity :Item-Item based similar products
- Summary of Results
- Github links, Streamlit app Deployment link and LinkedIn link
- Video demo
- Future work
- References
1.Introduction to Big Basket:
BigBasket is the largest online grocery supermarket in India. It was launched around 2011 and has been expanding its business ever since. Although new competitors such as Blinkit have gained a foothold in the country, BigBasket has not lost ground, thanks to its ever-expanding customer base and the broader shift to online buying.
2.Business Problem /Objective:
To recommend products that are as similar (relevant) as possible to the product entered in the BigBasket search query.
3.Constraints:
- Incorrect recommendations could impact customer experience
- Low latency requirement
4.Dataset Sources and Description:
The link for the dataset
https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints
Data is available in .csv format
There are 27555 rows in dataset
The dataset contains 10 attributes (columns), each with a straightforward meaning; the ones used for modelling are described in the EDA section below.
5.Existing Approaches to Problem:
Method 1:
Demographic Filtering: recommending items based on a single feature, such as the top 10 rated items or the top 10 items in a particular category.
Example: if we pass the rating column, recommend the top 10 rated items; if we pass another feature such as brand, recommend 10 products of the same brand.
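Demographic filtering of this kind can be sketched with a few lines of pandas. The column names and the toy catalogue below are assumptions for illustration; the real dataset's columns are listed in the EDA section.

```python
import pandas as pd

# Toy stand-in for the product catalogue (hypothetical rows).
df = pd.DataFrame({
    "product": ["Soap A", "Soap B", "Tea C", "Tea D"],
    "brand":   ["Lux", "Dove", "Tata", "Tata"],
    "rating":  [4.5, 4.8, 4.2, 4.6],
})

def top_rated(df, n=2):
    """Recommend the n highest-rated products overall."""
    return df.nlargest(n, "rating")["product"].tolist()

def same_brand(df, brand, n=2):
    """Recommend up to n products of the given brand."""
    return df[df["brand"] == brand].head(n)["product"].tolist()

print(top_rated(df))           # ['Soap B', 'Tea D']
print(same_brand(df, "Tata"))  # ['Tea C', 'Tea D']
```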
Method 2:
A content-based recommender system that uses the description text feature, vectorized with TF-IDF, to construct a similarity matrix using cosine similarity.
Cosine similarity is calculated over the TF-IDF vectors of the product descriptions, and a function recommends similar items given a product title and the item-item cosine similarity matrix.
The function looks up the index of the product title, finds the corresponding row in the similarity matrix, sorts it, and returns the indices of the top ten most similar items as recommendations.
The feature used for vectorization and for constructing the similarity matrix is the TF-IDF vectorized description.
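Method 2 can be sketched end-to-end with scikit-learn. The product titles and descriptions below are toy examples, not from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy titles and descriptions; the real model uses the 'description' column.
products = ["Turmeric Powder", "Chilli Powder", "Green Tea", "Herbal Green Tea"]
descriptions = [
    "ground turmeric spice powder for cooking",
    "hot red chilli spice powder for cooking",
    "refreshing green tea leaves",
    "herbal green tea leaves with antioxidants",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(descriptions)  # (n_products, n_terms)
sim = cosine_similarity(matrix)             # (n_products, n_products)

def recommend(title, top_n=2):
    """Return the top_n products most similar to `title`."""
    idx = products.index(title)
    ranked = sim[idx].argsort()[::-1]       # most similar first
    # Drop the query product itself (it is always its own best match).
    return [products[i] for i in ranked if i != idx][:top_n]

print(recommend("Green Tea"))
```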
6.Mapping to ML problem and Performance Metric:
ML problem Formulation:
* It is an Item-Item based Collaborative Filtering recommender system problem
* Item-based Collaborative Filtering uses the similarity between items as the measure for recommending items similar to the queried item
Performance Metric:
- Cosine similarity will be used to find similar items for the given query item, based on the item (product) details, while modelling
- Cosine similarity is a metric that measures how similar two data objects (two products) are. Each data object in the dataset is treated as a vector. The formula for the cosine similarity between two vectors is
Cos(x, y) = (x · y) / (||x|| * ||y||)
where,
x · y = dot product of the vectors ‘x’ and ‘y’.
||x|| and ||y|| = lengths (magnitudes) of the two vectors ‘x’ and ‘y’.
||x|| * ||y|| = product of the two lengths.
Source: https://www.geeksforgeeks.org/cosine-similarity/
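The formula above can be checked with a tiny numpy example (toy vectors, chosen for illustration):

```python
import numpy as np

def cosine_sim(x, y):
    # cos(x, y) = x·y / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # same direction as x
z = np.array([-2.0, 1.0, 0.0])  # orthogonal to x (dot product is 0)

print(cosine_sim(x, y))  # 1.0 (identical direction)
print(cosine_sim(x, z))  # 0.0 (orthogonal)
```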
- In real deployments, the metric described above doesn’t show how actual customers react to the produced recommendations in terms of the company’s business strategy. The only way to measure that is A/B testing. A/B testing costs more resources and time, but it lets us measure model performance; one metric commonly tracked in A/B tests is Click-Through Rate (CTR)
7.Exploratory Data Analysis:
The exploratory data analysis includes the following:-
a) Loading the .csv file into pandas
b) Checking the shape of the data, i.e. the number of rows and columns present in the data
c) Inspecting column names and missing values
Of these 10 features, we will be using only 6 features for our Model.
1. category — category into which the product has been classified; 11 unique categories
2. sub_category — subcategory under which the product is listed; 90 unique sub_categories
3. brand — brand of the product; 2314 unique brands
4. type — type into which the product falls; 426 unique types
5. sale_price — price at which the product is being sold on the site
6. description — detailed description of the product
In addition to the above 6 features, we will use a newly built feature, `discount` on the product (i.e. (market_price − sale_price) / market_price), to rank the order of recommended products
We will select the columns ‘product’, ‘category’, ‘sub_category’, ‘brand’, ‘type’, ‘description’, ‘sale_price’, ‘market_price’, ‘discount_%’ for further processing and feature engineering
We will also remove data points with missing brand, product, and description values, treating them as erroneous
The dataset is reduced to 27439 data points after removing those points
We split the data into train and test with a test size of 0.2 and do exploratory analysis on the train data features
After the split, train data: 21951 data points
Test data: 5488 data points
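The loading, cleaning, discount computation and split can be sketched as follows. The toy DataFrame stands in for the Kaggle csv (normally loaded with `pd.read_csv("BigBasket Products.csv")`; the filename is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle csv.
df = pd.DataFrame({
    "product":      ["Soap", "Tea", None, "Rice", "Oil"],
    "category":     ["Beauty & Hygiene", "Beverages", "Beverages", "Foodgrains", "Foodgrains"],
    "sub_category": ["Bath", "Tea", "Tea", "Rice", "Edible Oils"],
    "brand":        ["Lux", "Tata", "Tata", "India Gate", "Fortune"],
    "type":         ["Bar", "Green Tea", "Green Tea", "Basmati", "Sunflower"],
    "description":  ["mild soap", "green tea", "green tea", "aged basmati", "refined oil"],
    "sale_price":   [30.0, 150.0, 150.0, 400.0, 180.0],
    "market_price": [40.0, 200.0, 200.0, 500.0, 200.0],
})

# Drop rows with missing brand, product, or description (treated as erroneous).
df = df.dropna(subset=["brand", "product", "description"])

# New feature: discount_% = (market_price - sale_price) / market_price * 100
df["discount_%"] = (df["market_price"] - df["sale_price"]) / df["market_price"] * 100

# 80/20 split; EDA is then done on the train portion only.
train, test = train_test_split(df, test_size=0.2, random_state=42)
```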
Product:
- Out of 21951 train data points, there are 19301 unique values for `product`; the most frequent product is ‘Turmeric Powder/Arisina Pudi’, appearing 21 times
Category:
There are 11 unique Categories in train data
- `28.72%` of products fall in the `Beauty & Hygiene` category
- Gourmet & World Food occupies 17.01 % of total products
- followed by Kitchen, Garden & Pets which has a 12.92% share
- Snacks & Branded Foods has 10.2% share
- The remaining categories each have a smaller share of the total products list
Sub-Category:
There are 90 unique sub-categories; Skin Care is the most frequent, occurring 1814 times
- The top four sub_categories are as follows
1. Skin Care — 8.26%
2. Health & Medicine — 4.22%
3. Hair Care — 3.78%
4. Fragrances & Deos — 3.74%
Type:
- There are 423 unique product types in the train data
- Face Care is the most frequent type, with 1181 occurrences (5.38%)
Sale_price:
- 99% of products have a sale price below 2000 in the train data
- After removing products with a sale price greater than 2000, the train data is reduced from 21951 to 21648 data points (about 98.6% of the train data)
Discount_%:
Description:
A text field describing the product in detail
8.Text Preprocessing:
Text preprocessing is done on the categorical features: category, sub_category, brand, and type.
A function preprocesses each categorical column (lowercasing, removing special characters, and so on); the same function is applied to the sub_category, type, and brand columns.
Preprocessing is also done on the description text.
nltk stop words are used to remove stop words from the description.
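A minimal sketch of the two preprocessing functions is given below. The exact cleaning steps (which characters are stripped, the function names) are assumptions; the write-up only specifies that nltk stop words are removed from the description, so a small fallback stop-word set is included in case the nltk corpus is not downloaded:

```python
import re

# nltk stop words, with a small fallback set if the corpus is unavailable.
try:
    from nltk.corpus import stopwords
    STOP_WORDS = set(stopwords.words("english"))
except Exception:
    STOP_WORDS = {"a", "an", "the", "is", "are", "this", "that",
                  "of", "and", "in", "to", "for", "with", "it"}

def preprocess_category(text):
    """Lowercase a categorical value and strip non-alphanumeric characters."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())  # drop punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

def preprocess_description(text):
    """Lowercase, remove URLs and punctuation, then drop stop words."""
    text = re.sub(r"http\S+", " ", text.lower())     # strip URLs
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(preprocess_category("Beauty & Hygiene"))  # beauty hygiene
```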
9.Basic Feature Extraction and its importance:
The discount_% feature is built from the given sale_price and market_price (i.e. (market_price − sale_price) / market_price) for model building
This gives us the discount range of products when finding items similar to the queried one
10.Advanced Feature Extraction and its Importance:
As part of advanced feature extraction, we cluster the train data and add the cluster label as a feature. We also save the cluster means as a pickle file, so that a query/unseen point can be assigned to the nearest cluster mean and labelled with that cluster.
For clustering we use the scaled sale_price (scaled using a min-max scaler), discount_%, and sentiment scores obtained from the preprocessed description text.
These columns are stacked using numpy hstack.
We cluster the data into 5 groups, as suggested by the silhouette score and the elbow method, and label the points accordingly.
A t-SNE visualization of the clusters helps in understanding the grouping.
The cluster means (centroids) and the fully preprocessed, feature-engineered train data are saved for further use, to avoid repetition.
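The clustering step can be sketched as below. The feature values are random toy data standing in for the real scaled sale_price, discount_% and sentiment columns; k=5 follows the silhouette/elbow analysis in the write-up, and the pickle filename is an assumption:

```python
import pickle
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy columns standing in for sale_price, discount_% and sentiment score.
rng = np.random.default_rng(0)
sale_price = rng.uniform(10, 2000, size=(200, 1))
discount = rng.uniform(0, 50, size=(200, 1))
sentiment = rng.uniform(-1, 1, size=(200, 1))

scaler = MinMaxScaler().fit(sale_price)
features = np.hstack([scaler.transform(sale_price), discount, sentiment])

# k=5 chosen per the silhouette score / elbow method in the write-up.
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(features)
labels = km.labels_  # cluster label per train point

# Persist centroids so a query point can later be assigned a cluster.
with open("cluster_means.pkl", "wb") as f:
    pickle.dump(km.cluster_centers_, f)

def assign_cluster(point, centroids):
    """Label an unseen point with the nearest (euclidean) centroid."""
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
```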
11.TF-IDF weighted Word2Vec using Glove model:
We use a TF-IDF weighted Word2Vec representation of the description text to convert the text into numerical features, since ML models take numerical inputs
We use GloVe (Global Vectors for Word Representation, 300 dimensions) pretrained embeddings for this featurization
The link for downloading pretrained glove word embeddings is:
https://nlp.stanford.edu/projects/glove/
Follow the link for better understanding of TF-IDF weighted Word2Vec
https://towardsdatascience.com/word-embedding-techniques-word2vec-and-tf-idf-explained-c5d02e34d08
TF-IDF Word2Vec featurization of the preprocessed train data description text
The TF-IDF weighted Word2Vec vectors of all preprocessed train descriptions are stored in a list, for use in computing item-item cosine similarity
Label encoding categorical features 'category', 'sub_category', 'brand', 'type'
While label encoding the categorical columns, the encoder is fit on the train data; any unseen label in the test data is assigned the (n+1)-th label, for n unique values in the train data.
The encoders returned from the function are stored for encoding test data/query data points.
A feature matrix is then built from the train data to compute cosine similarity.
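The unseen-label handling can be sketched as a wrapper around scikit-learn's `LabelEncoder` (the helper name and toy brand values are assumptions):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

def fit_encoder(train_values):
    """Fit a LabelEncoder on train data.

    Unseen labels at transform time map to n, i.e. the (n+1)-th id
    for n unique values seen in training.
    """
    enc = LabelEncoder().fit(train_values)
    known = set(enc.classes_)
    n = len(enc.classes_)

    def transform(values):
        return np.array([enc.transform([v])[0] if v in known else n
                         for v in values])
    return transform

# Toy brand column; classes are sorted, so Dove=0, Lux=1, Tata=2.
brand_enc = fit_encoder(["Lux", "Dove", "Tata"])
print(brand_enc(["Dove", "Nescafe"]))  # unseen 'Nescafe' -> 3
```

The encoded categorical columns, the numeric features, and the description vectors would then be stacked column-wise (e.g. with `np.hstack`) into the matrix used for cosine similarity.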
12.Building ML Model using Cosine Similarity: Item-Item based similarity Recommender system
- We provide the attributes of the query product
- The query product should have category, sub_category, brand, type, description, sale_price, and market_price
- Using these features, we encode, vectorize, scale the sale price using the min-max scaler fit on the train data, compute discount_%, calculate sentiment scores, and assign the point to the nearest cluster
The following functions, with their objectives, are used in the data pipeline to process a query point
The text preprocessing of the categorical features (category, sub_category, brand, type) and of the query product’s description is done in the same way as for the train data, using functions defined for that purpose
The query data point is assigned to the nearest (by euclidean distance) of the five cluster means obtained from the train data for the optimal K in the advanced feature engineering part
After the text preprocessing and the advanced features are built for the query point, we compute its cosine similarity with the products in the train data and recommend the products with the highest cosine similarity
The link to code which returns the top similar products for a given query product
The function takes the query point and performs the following steps:
- Checks whether the query point has all the needed columns; if not, it prints a warning message and exits
- After the missing-column check, it preprocesses the categorical columns (category, sub_category, brand, type) and encodes them with the label encoders fit on the train data
- Next, the sale price is checked: it must be within ±15% of the maximum sale price of the same brand in the train data, otherwise the function exits with a message (the salepricecheck function)
- The description text of the query point is preprocessed using the preprocess_description function
- discount_% is computed after the sale price check
- Sentiment scores of the preprocessed description are obtained, and the sale price is scaled with the min and max of the train data
- A cluster label is assigned using the cluster means from the train data
- All vectors and features of the query product are stacked, and cosine similarity is computed with the products in the train data
- The top n similar products are returned based on cosine similarity
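The steps above can be condensed into one sketch. Here `featurize` is a hypothetical callable assumed to bundle the preprocessing, encoding, scaling, sentiment-scoring and cluster-assignment steps; the demo at the bottom uses toy 2-d vectors in its place:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

REQUIRED = ["category", "sub_category", "brand", "type",
            "description", "sale_price", "market_price"]

def recommend(query, train_matrix, train_products, featurize, top_n=10):
    """Return the top_n train products most similar to the query product."""
    # Step 1: check for missing columns; warn and exit if any.
    missing = [c for c in REQUIRED if c not in query]
    if missing:
        print(f"warning: missing columns {missing}")
        return []
    # Steps 2-7 collapse into a single feature vector for the query.
    qvec = np.asarray(featurize(query), dtype=float).reshape(1, -1)
    # Step 8: cosine similarity against every train product.
    sims = cosine_similarity(qvec, train_matrix).ravel()
    # Step 9: top-n most similar products.
    top = np.argsort(sims)[::-1][:top_n]
    return [train_products[i] for i in top]

# Toy demo: 2-d features instead of the full stacked representation.
demo_matrix = np.array([[10.0, 20.0], [300.0, 5.0], [1.0, 0.0]])
demo_products = ["A", "B", "C"]
demo_query = {c: 0 for c in REQUIRED}
demo_query.update(sale_price=10.0, market_price=20.0, description="x")
demo_feat = lambda q: np.array([q["sale_price"], q["market_price"]])
print(recommend(demo_query, demo_matrix, demo_products, demo_feat, top_n=1))
```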
13.Summary Of Results:
The recommended similar products have cosine similarities in the 0.99 range with the query product
14.Github links, Streamlit app Deployment link and LinkedIn link
My Github link to this case study
Link to Streamlit app
You can connect with me on LinkedIn
15.Video demo
Here is a video demo of the Streamlit app deployment
16.Future work:
- A further step would be to impute missing sale_price values with the mean sale_price of the same brand
- Duplicates could be removed from the train data for products whose titles differ only in the last two or three words (grams, quantity, colour, packet size, etc.)
17.References:
- https://www.geeksforgeeks.org/
- https://seaborn.pydata.org/generated/seaborn
- https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
- https://docs.python.org/3/library/re.html
- https://www.towardsdatascience.com/
- https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python
- https://stackoverflow.com/questions/30129936/python-regex-replace-sentence-with-starting-word
Thank you…