Amazon Product Recommendation System (Content Based)

Kalimuddin
Apr 9, 2022 · 7 min read


Table of Contents

  1. Description
  2. Business-Problem Statement
  3. Data Source & Overview
  4. Basic EDA and Data Cleaning
  5. Text Preprocessing
  6. Text based product similarity
  7. Weighted similarity using brand and color
  8. Deep learning based visual product similarity
  9. Measuring goodness of our solution : A/B testing

1. Description

  • Amazon has revealed that more than 30% of its e-commerce revenue comes from the product recommendation page.

2. Business-Problem Statement

  • The main objective is to recommend similar products in e-commerce using product content (ASIN, title, brand, color, images, product type, price, etc.).

3. Data Source & Overview

The complete notebook and the required data for this project are available at the GitHub link.

  • Number of data points : 183138
  • Number of features/variables: 19

Of these 19 features, we will use only 7 in this project :

  • asin ( Amazon standard identification number) (unique no.)
  • brand ( brand to which the product belongs to )
  • color ( Color information of apparel, it can contain many colors as a value ex: red and black stripes )
  • product_type_name (type of the apparel, ex: SHIRT/TSHIRT )
  • medium_image_url ( url of the image )
  • title (title of the product.)
  • formatted_price (price of the product)
  • Descriptions are lengthy and take longer to process, which is why we did not use them.
  • Initial counts: number of data points: 183138, number of features: 7

4. Basic EDA and Data Cleaning

Basic stats for the feature: product_type_name :

10 most frequent product_type_names :

[('SHIRT', 167794),
('APPAREL', 3549),
('BOOKS_1973_AND_LATER', 3336),
('DRESS', 1584),
('SPORTING_GOODS', 1281),
('SWEATER', 837),
('OUTERWEAR', 796),
('OUTDOOR_RECREATION_PRODUCT', 729),
('ACCESSORY', 636),
('UNDERWEAR', 425)]

Basic stats for the feature: brand :

10 most frequent brands :

[('Zago', 223),
('XQS', 222),
('Yayun', 215),
('YUNY', 198),
('XiaoTianXin-women clothes', 193),
('Generic', 192),
('Boohoo', 190),
('Alion', 188),
('Abetteric', 187),
('TheMogan', 187)]

Basic stats for the feature: color :

10 most frequent colors :

[(None, 118182),
('Black', 13207),
('White', 8616),
('Blue', 3570),
('Red', 2289),
('Pink', 1842),
('Grey', 1499),
('*', 1388),
('Green', 1258),
('Multi', 1203)]

Basic stats for the feature: formatted_price

10 most frequent formatted_price :

[(None, 154743),
('$19.99', 945),
('$9.99', 749),
('$9.50', 601),
('$14.99', 472),
('$7.50', 463),
('$24.99', 414),
('$29.99', 370),
('$8.99', 343),
('$9.01', 336)]
  • Titles are short, very informative, and present for every product, which makes them the most important feature.
  • We use titles extensively in this project.
  • Number of data points After eliminating price=NULL : 28395
  • Number of data points After eliminating color=NULL : 28385

We brought down the number of data points from 183K to 28K :

  • We are processing only 28K points so that we can run this code on laptops in a reasonable amount of time.
  • For those of you who have powerful computers and some time to spare, you are recommended to use all of the 183K images.
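
The NULL-price and NULL-color filtering above can be sketched with pandas. This is a minimal sketch: the toy three-row `data` DataFrame stands in for the real 183K-row dataset, and the column names follow the feature list above.

```python
import pandas as pd

# Toy stand-in for the real 183K-row DataFrame; in the project the
# data comes from the Amazon apparel dump.
data = pd.DataFrame({
    "title": ["shirt a", "shirt b", "shirt c"],
    "formatted_price": ["$19.99", None, "$9.99"],
    "color": ["Black", "White", None],
})

# Keep only rows where the price is known ...
data = data[data["formatted_price"].notnull()]
# ... and then only rows where the color is known.
data = data[data["color"].notnull()]

print(len(data))  # 1
```

On the real data these two filters are what bring the count from 183K down to 28K, since most rows have no price.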

Understand duplicate rows :

  • We have 2325 products that share the same title but differ in color.
  • Users do not want to be recommended the same product in different sizes or colors.
  • After removing products with very few words in the title: 27949 data points remain.

Some examples of duplicate titles that differ only in the last few words :

Example 1:
16. woman's place is in the house and the senate shirts for Womens XXL White
17. woman's place is in the house and the senate shirts for Womens M Grey
Example 2:
25. tokidoki The Queen of Diamonds Women's Shirt X-Large
26. tokidoki The Queen of Diamonds Women's Shirt Small
27. tokidoki The Queen of Diamonds Women's Shirt Large
  • Only 17593 data points remain after removing duplicates that differ only at the end (stage one of dedupe).
  • Some duplicate titles may also differ in words in the middle.
  • Finally, 16042 data points remain after cleaning and removing duplicates.
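
Stage-one dedupe can be sketched as follows. The function name `near_duplicate` and the tail-length cutoff are illustrative choices, not the notebook's actual implementation; the idea is simply that titles matching word-for-word except for trailing size/color words are treated as one product.

```python
def near_duplicate(t1: str, t2: str, max_tail_diff: int = 3) -> bool:
    """True if the titles match word-for-word except for at most
    max_tail_diff trailing words (e.g. size/color suffixes)."""
    a, b = t1.lower().split(), t2.lower().split()
    if abs(len(a) - len(b)) > max_tail_diff:
        return False
    keep = max(min(len(a), len(b)) - max_tail_diff, 0)
    return a[:keep] == b[:keep]

titles = [
    "tokidoki The Queen of Diamonds Women's Shirt X-Large",
    "tokidoki The Queen of Diamonds Women's Shirt Small",
    "tokidoki The Queen of Diamonds Women's Shirt Large",
    "woman's place is in the house and the senate shirts for Womens XXL White",
]

# Keep a title only if it is not a near-duplicate of one already kept.
deduped = []
for t in titles:
    if not any(near_duplicate(t, kept) for kept in deduped):
        deduped.append(t)

print(len(deduped))  # 2
```

The three tokidoki variants collapse to one entry, while the unrelated title survives.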

5. Text Preprocessing

  • NLTK is used heavily for text preprocessing.
  • Stop-word removal is not beneficial for every type of algorithm.
  • We use the list of stop words downloaded from the NLTK library.
  • We take each title and preprocess its text.
  • Stemming : converting each word to its root form. We tried stemming on our titles and it did not work very well.
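
The title preprocessing can be sketched like this. To keep the snippet self-contained it hardcodes a tiny stand-in stop-word set rather than downloading NLTK's list, and the helper name `preprocess_title` is illustrative:

```python
import re

# Small stand-in for NLTK's English stop-word list, hardcoded so the
# snippet runs without nltk.download("stopwords").
STOP_WORDS = {"a", "an", "and", "the", "for", "in", "of", "to", "is"}

def preprocess_title(title: str) -> str:
    """Lowercase, strip non-alphanumeric characters, drop stop words
    and single-character leftovers."""
    words = re.sub(r"[^a-z0-9\s]", " ", title.lower()).split()
    return " ".join(w for w in words if w not in STOP_WORDS and len(w) > 1)

print(preprocess_title("The Queen of Diamonds Women's Shirt for Womens"))
# queen diamonds women shirt womens
```

Every model below (BoW, TF-IDF, Word2Vec) runs on titles cleaned this way.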

6. Text based product similarity

Bag of Words model :

bag_of_words_model(doc_id, num_results) : call the bag-of-words model for a product to get similar products.

  • Text-based product similarity : converting text to an n-dimensional vector (bag of words)
  • Bag of Words (BoW) on product titles
  • title_features.get_shape() = (16042, 12609)
  • Each title vector has 12609 columns, but a title contains only a few words, so most column values are zero (the matrix is sparse).
  • We combine BoW with Euclidean distance :-
  • The more words two titles share, the lower their Euclidean distance.
  • BoW is not the best solution, but it is a reasonable baseline.
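
The core of the BoW model can be sketched with scikit-learn. The three toy titles are illustrative; the notebook builds `title_features` over all 16042 cleaned titles the same way:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

titles = [
    "blue striped cotton shirt",    # query product
    "blue striped cotton tshirt",   # shares 3 of 4 words with the query
    "red floral summer dress",      # shares no words with the query
]

# Sparse (n_titles x vocabulary_size) matrix of raw word counts.
vectorizer = CountVectorizer()
title_features = vectorizer.fit_transform(titles)

# Euclidean distance from the query (row 0) to every title:
# more shared words => smaller distance.
dists = euclidean_distances(title_features[0], title_features)[0]
ranked = dists.argsort()  # indices from most to least similar
print([titles[i] for i in ranked[1:]])
```

Row 0 of `ranked` is the query itself (distance 0); the nearly identical title comes next and the unrelated one last.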

TF-IDF model :

tfidf_model(doc_id, num_results) : featurizing text based on word importance

tfidf_title_features is a SciPy sparse matrix
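
A sketch of the TF-IDF featurization with the same toy titles (the notebook applies `TfidfVectorizer` to all 16042 titles):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

titles = [
    "blue cotton shirt",
    "blue cotton tshirt",
    "blue floral dress",
]

vectorizer = TfidfVectorizer()
tfidf_title_features = vectorizer.fit_transform(titles)

# "blue" appears in every title, so its IDF (and hence its weight) is
# lowest; rarer words like "dress" dominate the distance instead.
print(issparse(tfidf_title_features))  # True
dists = euclidean_distances(tfidf_title_features[0], tfidf_title_features)[0]
print(dists.argsort()[1])  # nearest neighbour of title 0, excluding itself
```

Compared with plain BoW, common words contribute less, so the ranking reflects the informative words.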

IDF model :

IDF-based product similarity works well when titles are not very long : idf_model(doc_id, num_results)
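
The IDF-only variant can be sketched directly from the definition (toy titles; in the notebook the document frequencies come from the vectorizer's vocabulary). Ignoring term frequency is reasonable here because titles are short and words rarely repeat:

```python
import math

titles = [
    "blue cotton shirt",
    "blue cotton tshirt",
    "blue floral dress",
]
docs = [t.split() for t in titles]
vocab = sorted({w for d in docs for w in d})
N = len(docs)

# IDF only: log(N / number of titles containing the word).
idf = {w: math.log(N / sum(w in d for d in docs)) for w in vocab}

# Each title becomes a vector of the IDF weights of its words.
vectors = [[idf[w] if w in d else 0.0 for w in vocab] for d in docs]
print(round(idf["blue"], 3), round(idf["dress"], 3))  # 0.0 1.099
```

A word like "blue" that appears in every title gets weight 0, while a rare word like "dress" gets the highest weight.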

So far we have tried three techniques :-

  • IDF alone is not very good either; it is just okay.
  • So far, TF-IDF is the best.
  • TF-IDF > IDF > BoW (in terms of output quality)

Word2Vec & Average Word2Vec model :

(featurizing text based on semantic similarity)

avg_w2v_model(doc_id, num_results)

every word is represented by a 300-dimensional vector (Google's pretrained model)

  • Word2Vec requires a very large corpus to work well.
  • In this project we use a pretrained model by Google; it is a 3.3 GB file, and once loaded into memory it occupies ~9 GB, so please do this step only if you have more than 12 GB of RAM.
  • word2vec_model : we provide a pickle file which contains a dict with all our corpus words as keys and model[word] as values.
  • We take a small sample :- only those words that occur in our titles.
  • This output was not available with BoW & TF-IDF because they treated 'tiger' and 'tigers' as different words.
  • Word2Vec gives semantic similarity (e.g. many animal-print shirts), which BoW and TF-IDF do not.
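
The average-Word2Vec featurization can be sketched as follows. The tiny 4-dimensional `w2v` dict is a hypothetical stand-in for the 300-dimensional Google News vectors so the example is self-contained:

```python
import numpy as np

# Hypothetical stand-in for the pretrained Word2Vec dict; note that
# 'tiger' and 'tigers' get nearby vectors in such a model.
w2v = {
    "tiger":  np.array([1.0, 0.0, 0.0, 0.0]),
    "tigers": np.array([0.9, 0.1, 0.0, 0.0]),
    "print":  np.array([0.0, 1.0, 0.0, 0.0]),
    "shirt":  np.array([0.0, 0.0, 1.0, 0.0]),
}

def avg_w2v(title: str) -> np.ndarray:
    """Average the vectors of the title's in-vocabulary words."""
    vecs = [w2v[w] for w in title.split() if w in w2v]
    return np.mean(vecs, axis=0)

a = avg_w2v("tiger print shirt")
b = avg_w2v("tigers print shirt")

# Cosine similarity is near 1 because the two titles mean the same
# thing; BoW/TF-IDF would see 'tiger' and 'tigers' as unrelated words.
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos, 3))
```

This is exactly the semantic effect described above: morphological variants and related words stop hurting the similarity score.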

TF-IDF weighted Word2Vec model :

weighted_w2v_model(doc_id, num_results)

For every title we build a TF-IDF-weighted vector representation.
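
A sketch of the weighted representation, with hypothetical word vectors and TF-IDF weights for a single title (the notebook takes both from the models built earlier):

```python
import numpy as np

# Hypothetical word vectors and TF-IDF weights for one title;
# rarer words carry larger TF-IDF weights.
w2v = {"tiger": np.array([1.0, 0.0]),
       "print": np.array([0.0, 1.0]),
       "shirt": np.array([0.5, 0.5])}
tfidf = {"tiger": 2.0, "print": 1.5, "shirt": 0.2}

def weighted_w2v(title: str) -> np.ndarray:
    """TF-IDF-weighted average of the title's word vectors."""
    words = [w for w in title.split() if w in w2v]
    num = sum(tfidf[w] * w2v[w] for w in words)
    den = sum(tfidf[w] for w in words)
    return num / den

v = weighted_w2v("tiger print shirt")
print(v)
```

Unlike the plain average, a generic word like "shirt" barely moves the title vector, while the distinctive words dominate it.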

7. Weighted similarity using brand and color

idf_w2v_brand(doc_id, w1, w2, num_results)

Some of the brand values are empty; we need to replace None with the string "NULL".
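
The idea behind `idf_w2v_brand(doc_id, w1, w2, num_results)` is to blend the text distance with a brand/color distance using the weights w1 and w2. The distance arrays below are illustrative numbers for three hypothetical candidates, not real outputs:

```python
import numpy as np

# Hypothetical per-feature distances of three candidate products from a
# query product (lower = more similar).
title_dist = np.array([0.10, 0.40, 0.25])     # e.g. weighted-w2v distance
brand_color_dist = np.array([0.0, 0.0, 1.0])  # one-hot brand+color distance

def weighted_dist(w1: float, w2: float) -> np.ndarray:
    """Weighted combination of the two distances, as in idf_w2v_brand."""
    return (w1 * title_dist + w2 * brand_color_dist) / (w1 + w2)

# Heavy weight on title text: candidate 2's brand/color mismatch barely
# matters, so it still ranks above candidate 1.
print(weighted_dist(10, 1).argsort())  # [0 2 1]

# Equal weights: the brand/color mismatch now pushes candidate 2 last.
print(weighted_dist(1, 1).argsort())   # [0 1 2]
```

Tuning w1 and w2 is what lets you decide how much a matching brand or color should count against title similarity.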

8. Deep learning based visual product similarity (CNN)

get_similar_products_cnn(doc_id, num_results)

  • How to featurize an image: edges, shapes, parts
  • Using (Keras + Tensorflow) to extract features
  • img_width, img_height = 224, 224
  • each image is converted into 25088 length dense-vector
  • In real-world problems, we use a combination of solutions.
  • Here, we use only images as input to find similar images.
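
The feature extraction behind `get_similar_products_cnn` can be sketched with a VGG16 backbone in Keras. This sketch uses `weights=None` so it builds the architecture without downloading anything; for meaningful features the project would load pretrained ImageNet weights, and the random input image stands in for a real product photo:

```python
import numpy as np
from tensorflow.keras.applications import VGG16

img_width, img_height = 224, 224

# weights=None: architecture only, no download. Use weights="imagenet"
# for real feature extraction. include_top=False drops the classifier
# head, leaving the 7 x 7 x 512 convolutional feature map.
model = VGG16(weights=None, include_top=False,
              input_shape=(img_width, img_height, 3))

image = np.random.rand(1, img_width, img_height, 3).astype("float32")
features = model.predict(image, verbose=0).flatten()
print(features.shape)  # (25088,) == 7 * 7 * 512
```

Each product image becomes a 25088-length dense vector, and similar products are found by nearest-neighbour search over these vectors.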

9. Measuring goodness of our solution : A/B testing

  • A/B testing is a very large area; entire textbooks have been written on just A/B testing.
  • It is also known as bucket testing.
  • If you have two solutions, Solution A and Solution B, A/B testing is used to determine which performs better.
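
A minimal sketch of how such a comparison is decided statistically: a two-proportion z-test on conversion rates. The click-through numbers below are hypothetical, purely to illustrate the calculation:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: recommender A vs recommender B clicks.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000,
                             conv_b=260, n_b=10_000)
print(round(z, 2), p < 0.05)
```

If the p-value is below the chosen significance level (commonly 0.05), the difference between Solution A and Solution B is unlikely to be due to chance.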


Kalimuddin

Currently working as a Java developer; I have also done personal projects in machine learning.