Amazon Product Recommendation System (Content Based)

Kalimuddin
Apr 9, 2022 · 7 min read


Table of Contents

  1. Description
  2. Business-Problem Statement
  3. Data Source & Overview
  4. Basic EDA and Data Cleaning
  5. Text Preprocessing
  6. Text based product similarity
  7. Weighted similarity using brand and color
  8. Deep learning based visual product similarity
  9. Measuring goodness of our solution : A/B testing

1. Description

  • Amazon has revealed that more than 30% of its e-commerce revenue comes from the product recommendation page.

2. Business-Problem Statement

  • The main objective is to recommend similar products in e-commerce using product content (ASIN, title, brand, color, images, product type, price, etc.).

3. Data Source & Overview

The complete notebook and the required data for this project are available at the GitHub link.

  • Number of data points : 183138
  • Number of features/variables: 19

Of these 19 features, we will use only 7 in this project :

  • asin ( Amazon standard identification number) (unique no.)
  • brand ( brand to which the product belongs to )
  • color ( Color information of apparel, it can contain many colors as a value ex: red and black stripes )
  • product_type_name (type of the apparel, ex: SHIRT/TSHIRT )
  • medium_image_url ( url of the image )
  • title (title of the product.)
  • formatted_price (price of the product)
  • Descriptions are lengthy and take longer to process, which is why we did not use them.
  • Initial counts: number of data points: 183138, number of features: 7

4. Basic EDA and Data Cleaning

Basic stats for the feature: product_type_name :

10 most frequent product_type_names :

[('SHIRT', 167794),
('APPAREL', 3549),
('BOOKS_1973_AND_LATER', 3336),
('DRESS', 1584),
('SPORTING_GOODS', 1281),
('SWEATER', 837),
('OUTERWEAR', 796),
('OUTDOOR_RECREATION_PRODUCT', 729),
('ACCESSORY', 636),
('UNDERWEAR', 425)]

Basic stats for the feature: brand :

10 most frequent brands :

[('Zago', 223),
('XQS', 222),
('Yayun', 215),
('YUNY', 198),
('XiaoTianXin-women clothes', 193),
('Generic', 192),
('Boohoo', 190),
('Alion', 188),
('Abetteric', 187),
('TheMogan', 187)]

Basic stats for the feature: color :

10 most frequent colors :

[(None, 118182),
('Black', 13207),
('White', 8616),
('Blue', 3570),
('Red', 2289),
('Pink', 1842),
('Grey', 1499),
('*', 1388),
('Green', 1258),
('Multi', 1203)]

Basic stats for the feature: formatted_price

10 most frequent formatted_price :

[(None, 154743),
('$19.99', 945),
('$9.99', 749),
('$9.50', 601),
('$14.99', 472),
('$7.50', 463),
('$24.99', 414),
('$29.99', 370),
('$8.99', 343),
('$9.01', 336)]
  • Titles are short, very informative, and present for every product, which makes them the most important feature.
  • We use titles extensively in this project.
  • Number of data points After eliminating price=NULL : 28395
  • Number of data points After eliminating color=NULL : 28385

We brought down the number of data points from 183K to 28K :

  • We are processing only 28K points so that we can run this code on laptops in a reasonable amount of time.
  • For those of you who have powerful computers and some time to spare, you are recommended to use all of the 183K images.
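
The NULL-price and NULL-color filtering above can be sketched with pandas. This is a minimal sketch: the toy three-row `data` DataFrame stands in for the real 183K-row dataset, and the column names follow the feature list above.

```python
import pandas as pd

# Toy stand-in for the real 183K-row DataFrame; in the project the
# data comes from the Amazon apparel dump.
data = pd.DataFrame({
    "title": ["shirt a", "shirt b", "shirt c"],
    "formatted_price": ["$19.99", None, "$9.99"],
    "color": ["Black", "White", None],
})

# Keep only rows where the price is known ...
data = data[data["formatted_price"].notnull()]
# ... and then only rows where the color is known.
data = data[data["color"].notnull()]

print(len(data))  # 1
```

On the real data these two filters are what bring the count from 183K down to 28K, since most rows have no price.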

Understand duplicate rows :

  • We have 2325 products that share the same title but differ in color.
  • Users do not want to be recommended the same product in different sizes or colors.
  • After removing products with very few words in the title: 27949 data points remain.

Some examples of duplicate titles that differ only in the last few words :

Example 1:
16. woman's place is in the house and the senate shirts for Womens XXL White
17. woman's place is in the house and the senate shirts for Womens M Grey
Example 2:
25. tokidoki The Queen of Diamonds Women's Shirt X-Large
26. tokidoki The Queen of Diamonds Women's Shirt Small
27. tokidoki The Queen of Diamonds Women's Shirt Large
  • Only 17593 data points remain after removing duplicates that differ only at the end (stage one of dedupe).
  • Some duplicate titles may also differ in words in the middle.
  • Finally, 16042 data points remain after cleaning and removing duplicates.
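
Stage-one dedupe can be sketched as follows. The function name `near_duplicate` and the tail-length cutoff are illustrative choices, not the notebook's actual implementation; the idea is simply that titles matching word-for-word except for trailing size/color words are treated as one product.

```python
def near_duplicate(t1: str, t2: str, max_tail_diff: int = 3) -> bool:
    """True if the titles match word-for-word except for at most
    max_tail_diff trailing words (e.g. size/color suffixes)."""
    a, b = t1.lower().split(), t2.lower().split()
    if abs(len(a) - len(b)) > max_tail_diff:
        return False
    keep = max(min(len(a), len(b)) - max_tail_diff, 0)
    return a[:keep] == b[:keep]

titles = [
    "tokidoki The Queen of Diamonds Women's Shirt X-Large",
    "tokidoki The Queen of Diamonds Women's Shirt Small",
    "tokidoki The Queen of Diamonds Women's Shirt Large",
    "woman's place is in the house and the senate shirts for Womens XXL White",
]

# Keep a title only if it is not a near-duplicate of one already kept.
deduped = []
for t in titles:
    if not any(near_duplicate(t, kept) for kept in deduped):
        deduped.append(t)

print(len(deduped))  # 2
```

The three tokidoki variants collapse to one entry, while the unrelated title survives.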

5. Text Preprocessing

  • NLTK is used heavily for text preprocessing.
  • Stop-word removal is not beneficial for every type of algorithm.
  • We use the list of stop words downloaded from the NLTK library.
  • We take each title and preprocess its text.
  • Stemming : converting each word to its root form. We tried stemming on our titles and it did not work very well.
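
The title preprocessing can be sketched like this. To keep the snippet self-contained it hardcodes a tiny stand-in stop-word set rather than downloading NLTK's list, and the helper name `preprocess_title` is illustrative:

```python
import re

# Small stand-in for NLTK's English stop-word list, hardcoded so the
# snippet runs without nltk.download("stopwords").
STOP_WORDS = {"a", "an", "and", "the", "for", "in", "of", "to", "is"}

def preprocess_title(title: str) -> str:
    """Lowercase, strip non-alphanumeric characters, drop stop words
    and single-character leftovers."""
    words = re.sub(r"[^a-z0-9\s]", " ", title.lower()).split()
    return " ".join(w for w in words if w not in STOP_WORDS and len(w) > 1)

print(preprocess_title("The Queen of Diamonds Women's Shirt for Womens"))
# queen diamonds women shirt womens
```

Every model below (BoW, TF-IDF, Word2Vec) runs on titles cleaned this way.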

6. Text based product similarity

Bag of Words model :

bag_of_words_model(doc_id, num_results) : call the bag-of-words model for a product to get similar products.

  • Text-based product similarity : converting text to an n-dimensional vector (bag of words)
  • Bag of Words (BoW) on product titles
  • title_features.get_shape() = (16042, 12609)
  • Each title vector has 12609 columns, but a title contains only a few words, so most column values are zero (the matrix is sparse).
  • We combine BoW with Euclidean distance :-
  • The more words two titles share, the lower their Euclidean distance.
  • BoW is not the best solution, but it is a reasonable baseline.
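
The core of the BoW model can be sketched with scikit-learn. The three toy titles are illustrative; the notebook builds `title_features` over all 16042 cleaned titles the same way:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

titles = [
    "blue striped cotton shirt",    # query product
    "blue striped cotton tshirt",   # shares 3 of 4 words with the query
    "red floral summer dress",      # shares no words with the query
]

# Sparse (n_titles x vocabulary_size) matrix of raw word counts.
vectorizer = CountVectorizer()
title_features = vectorizer.fit_transform(titles)

# Euclidean distance from the query (row 0) to every title:
# more shared words => smaller distance.
dists = euclidean_distances(title_features[0], title_features)[0]
ranked = dists.argsort()  # indices from most to least similar
print([titles[i] for i in ranked[1:]])
```

Row 0 of `ranked` is the query itself (distance 0); the nearly identical title comes next and the unrelated one last.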

TF-IDF model :

tfidf_model(doc_id, num_results) : featurizing text based on word importance

tfidf_title_features is a SciPy sparse matrix
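
A sketch of the TF-IDF featurization with the same toy titles (the notebook applies `TfidfVectorizer` to all 16042 titles):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

titles = [
    "blue cotton shirt",
    "blue cotton tshirt",
    "blue floral dress",
]

vectorizer = TfidfVectorizer()
tfidf_title_features = vectorizer.fit_transform(titles)

# "blue" appears in every title, so its IDF (and hence its weight) is
# lowest; rarer words like "dress" dominate the distance instead.
print(issparse(tfidf_title_features))  # True
dists = euclidean_distances(tfidf_title_features[0], tfidf_title_features)[0]
print(dists.argsort()[1])  # nearest neighbour of title 0, excluding itself
```

Compared with plain BoW, common words contribute less, so the ranking reflects the informative words.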

IDF model :

IDF-based product similarity works well when titles are not very long : idf_model(doc_id, num_results)
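
The IDF-only variant can be sketched directly from the definition (toy titles; in the notebook the document frequencies come from the vectorizer's vocabulary). Ignoring term frequency is reasonable here because titles are short and words rarely repeat:

```python
import math

titles = [
    "blue cotton shirt",
    "blue cotton tshirt",
    "blue floral dress",
]
docs = [t.split() for t in titles]
vocab = sorted({w for d in docs for w in d})
N = len(docs)

# IDF only: log(N / number of titles containing the word).
idf = {w: math.log(N / sum(w in d for d in docs)) for w in vocab}

# Each title becomes a vector of the IDF weights of its words.
vectors = [[idf[w] if w in d else 0.0 for w in vocab] for d in docs]
print(round(idf["blue"], 3), round(idf["dress"], 3))  # 0.0 1.099
```

A word like "blue" that appears in every title gets weight 0, while a rare word like "dress" gets the highest weight.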

So far we have tried three techniques :-

  • IDF alone is not very good either; it is just okay.
  • So far, TF-IDF is the best.
  • TF-IDF > IDF > BoW (in terms of output quality)

Word2Vec & Average Word2Vec model :

(featurizing text based on semantic similarity)

avg_w2v_model(doc_id, num_results)

every word is represented by a 300-dimensional vector (Google's pretrained model)

  • Word2Vec requires a very large corpus to work well.
  • In this project we use a pretrained model by Google; it is a 3.3 GB file, and once loaded into memory it occupies ~9 GB, so please do this step only if you have more than 12 GB of RAM.
  • word2vec_model : we provide a pickle file which contains a dict with all our corpus words as keys and model[word] as values.
  • We take a small sample :- only those words that occur in our titles.
  • This output was not available with BoW & TF-IDF because they treated 'tiger' and 'tigers' as different words.
  • Word2Vec gives semantic similarity (e.g. many animal-print shirts), which BoW and TF-IDF do not.
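
The average-Word2Vec featurization can be sketched as follows. The tiny 4-dimensional `w2v` dict is a hypothetical stand-in for the 300-dimensional Google News vectors so the example is self-contained:

```python
import numpy as np

# Hypothetical stand-in for the pretrained Word2Vec dict; note that
# 'tiger' and 'tigers' get nearby vectors in such a model.
w2v = {
    "tiger":  np.array([1.0, 0.0, 0.0, 0.0]),
    "tigers": np.array([0.9, 0.1, 0.0, 0.0]),
    "print":  np.array([0.0, 1.0, 0.0, 0.0]),
    "shirt":  np.array([0.0, 0.0, 1.0, 0.0]),
}

def avg_w2v(title: str) -> np.ndarray:
    """Average the vectors of the title's in-vocabulary words."""
    vecs = [w2v[w] for w in title.split() if w in w2v]
    return np.mean(vecs, axis=0)

a = avg_w2v("tiger print shirt")
b = avg_w2v("tigers print shirt")

# Cosine similarity is near 1 because the two titles mean the same
# thing; BoW/TF-IDF would see 'tiger' and 'tigers' as unrelated words.
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos, 3))
```

This is exactly the semantic effect described above: morphological variants and related words stop hurting the similarity score.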

TF-IDF weighted Word2Vec model :

weighted_w2v_model(doc_id, num_results)

For every title we build a TF-IDF-weighted vector representation.
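
A sketch of the weighted representation, with hypothetical word vectors and TF-IDF weights for a single title (the notebook takes both from the models built earlier):

```python
import numpy as np

# Hypothetical word vectors and TF-IDF weights for one title;
# rarer words carry larger TF-IDF weights.
w2v = {"tiger": np.array([1.0, 0.0]),
       "print": np.array([0.0, 1.0]),
       "shirt": np.array([0.5, 0.5])}
tfidf = {"tiger": 2.0, "print": 1.5, "shirt": 0.2}

def weighted_w2v(title: str) -> np.ndarray:
    """TF-IDF-weighted average of the title's word vectors."""
    words = [w for w in title.split() if w in w2v]
    num = sum(tfidf[w] * w2v[w] for w in words)
    den = sum(tfidf[w] for w in words)
    return num / den

v = weighted_w2v("tiger print shirt")
print(v)
```

Unlike the plain average, a generic word like "shirt" barely moves the title vector, while the distinctive words dominate it.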

7. Weighted similarity using brand and color

idf_w2v_brand(doc_id, w1, w2, num_results)

Some of the brand values are empty; we need to replace None with the string "NULL".
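
The idea behind `idf_w2v_brand(doc_id, w1, w2, num_results)` is to blend the text distance with a brand/color distance using the weights w1 and w2. The distance arrays below are illustrative numbers for three hypothetical candidates, not real outputs:

```python
import numpy as np

# Hypothetical per-feature distances of three candidate products from a
# query product (lower = more similar).
title_dist = np.array([0.10, 0.40, 0.25])     # e.g. weighted-w2v distance
brand_color_dist = np.array([0.0, 0.0, 1.0])  # one-hot brand+color distance

def weighted_dist(w1: float, w2: float) -> np.ndarray:
    """Weighted combination of the two distances, as in idf_w2v_brand."""
    return (w1 * title_dist + w2 * brand_color_dist) / (w1 + w2)

# Heavy weight on title text: candidate 2's brand/color mismatch barely
# matters, so it still ranks above candidate 1.
print(weighted_dist(10, 1).argsort())  # [0 2 1]

# Equal weights: the brand/color mismatch now pushes candidate 2 last.
print(weighted_dist(1, 1).argsort())   # [0 1 2]
```

Tuning w1 and w2 is what lets you decide how much a matching brand or color should count against title similarity.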

8. Deep learning based visual product similarity (CNN)

get_similar_products_cnn(doc_id, num_results)

  • How to featurize an image: edges, shapes, parts
  • Using (Keras + Tensorflow) to extract features
  • img_width, img_height = 224, 224
  • each image is converted into 25088 length dense-vector
  • In real-world problems, we use a combination of solutions.
  • Here, we use only images as input to find similar images.
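
The feature extraction behind `get_similar_products_cnn` can be sketched with a VGG16 backbone in Keras. This sketch uses `weights=None` so it builds the architecture without downloading anything; for meaningful features the project would load pretrained ImageNet weights, and the random input image stands in for a real product photo:

```python
import numpy as np
from tensorflow.keras.applications import VGG16

img_width, img_height = 224, 224

# weights=None: architecture only, no download. Use weights="imagenet"
# for real feature extraction. include_top=False drops the classifier
# head, leaving the 7 x 7 x 512 convolutional feature map.
model = VGG16(weights=None, include_top=False,
              input_shape=(img_width, img_height, 3))

image = np.random.rand(1, img_width, img_height, 3).astype("float32")
features = model.predict(image, verbose=0).flatten()
print(features.shape)  # (25088,) == 7 * 7 * 512
```

Each product image becomes a 25088-length dense vector, and similar products are found by nearest-neighbour search over these vectors.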

9. Measuring goodness of our solution : A/B testing

  • A/B testing is a very large area; entire textbooks have been written on just A/B testing.
  • It is also known as bucket testing.
  • If you have two solutions, Solution A and Solution B, A/B testing is used to determine which performs better.
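
A minimal sketch of how such a comparison is decided statistically: a two-proportion z-test on conversion rates. The click-through numbers below are hypothetical, purely to illustrate the calculation:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: recommender A vs recommender B clicks.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000,
                             conv_b=260, n_b=10_000)
print(round(z, 2), p < 0.05)
```

If the p-value is below the chosen significance level (commonly 0.05), the difference between Solution A and Solution B is unlikely to be due to chance.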


Kalimuddin

Currently working as a Java developer; I have also done personal projects in machine learning.