Amazon Product Recommendation System

Saksham Checker
Published in Analytics Vidhya
7 min read · Jul 28, 2021

NLP-based recommendation system

Introduction

Every e-commerce website uses a recommendation system to show customers the products they are most likely to buy. This article covers how Natural Language Processing can be used to recommend products to customers; the models are tested on fashion products. The project was also submitted as coursework for CO102 at Delhi Technological University.

Natural Language Processing(NLP) —

It is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.

Knowing the Dataset

The dataset contains over 180k women's fashion products, extracted using the Amazon Products API by AppliedAI; the data is openly available on their website. It is further processed before the NLP models are applied. The data initially has 19 features, of which we will use only 7:

  1. asin (Amazon Standard Identification Number)
  2. brand (brand to which the product belongs)
  3. color (color information of the apparel; it can contain several colors as a value, e.g. red and black stripes)
  4. product_type_name (type of the apparel, e.g. SHIRT/TSHIRT)
  5. medium_image_url (URL of the product image)
  6. title (title of the product)
  7. formatted_price (price of the product)
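As a minimal sketch of this column selection in pandas (the row values below are hypothetical, and `sku` stands in for one of the twelve dropped columns):

```python
import pandas as pd

# Toy row standing in for the raw 19-column Amazon dump (values are made up).
raw = pd.DataFrame({
    "asin": ["B0EXAMPLE01"],
    "brand": ["FeatherLite"],
    "color": ["Onyx Black"],
    "product_type_name": ["SHIRT"],
    "medium_image_url": ["https://example.com/img.jpg"],
    "title": ["FeatherLite Ladies Long Sleeve Twill Shirt"],
    "formatted_price": ["$26.26"],
    "sku": ["ignored"],  # stand-in for one of the 12 columns we drop
})

# Keep only the seven features used in the rest of the pipeline.
KEEP = ["asin", "brand", "color", "product_type_name",
        "medium_image_url", "title", "formatted_price"]
data = raw[KEEP]
```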

Also, data entries whose titles contain fewer than 4 words are removed, so that each title has enough words for Natural Language Processing.
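A minimal sketch of this filter, assuming the titles live in a pandas column named `title` (the example titles are hypothetical):

```python
import pandas as pd

# Toy titles standing in for the real dataset.
data = pd.DataFrame({"title": [
    "womens black top",                    # 3 words: dropped
    "womens black round neck cotton top",  # 6 words: kept
]})

# Keep only titles with at least 4 words, so NLP has enough signal.
data = data[data["title"].str.split().str.len() >= 4].reset_index(drop=True)
```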

Data de-duplication

Among the 180k entries, many products have near-identical titles, and customers don't want to see essentially the same product recommended while viewing it, so a de-duplication step is performed. In this step, if two titles differ in more than 2 words, the two apparels are considered different; if they differ in 2 words or fewer, they are considered the same, and the duplicate is ignored.

A similar algorithm with a margin of 3 could be applied for better results, but it is not performed here: the algorithm has O(n²) time complexity, and de-duplication at that margin would require a lot of resources.
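The de-duplication rule can be sketched as below; the symmetric difference of word sets is used here as a simple stand-in for the project's word-by-word title comparison, so this is an approximation rather than the exact algorithm:

```python
def near_duplicates(titles, margin=2):
    """Indices of titles treated as duplicates of an earlier title.

    Two titles count as the same product when they differ in `margin`
    or fewer words.  The double loop is the O(n^2) scan noted above.
    """
    dupes = set()
    for i in range(len(titles)):
        if i in dupes:
            continue
        words_i = set(titles[i].split())
        for j in range(i + 1, len(titles)):
            if len(words_i ^ set(titles[j].split())) <= margin:
                dupes.add(j)
    return dupes

# Hypothetical titles: the second differs from the first in one word.
titles = [
    "tokyo talkies women black casual top",
    "tokyo talkies women blue casual top",  # near-duplicate: dropped
    "levis women denim jacket",             # genuinely different: kept
]
```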

After de-duplication with a margin of 2 on the titles, 151k data entries remain. This data is then taken for further pre-processing.

Data Pre-processing

For data pre-processing, the data is cleaned by removing stop words.

The stop words are removed using a list downloaded from NLTK, the Natural Language Toolkit. Much of the data one analyzes is unstructured and contains human-readable text, so it must be preprocessed before it can be analyzed programmatically.
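A minimal sketch of the cleaning step; a tiny stop-word set is hard-coded here so the example runs offline, whereas the project uses the full English list from NLTK:

```python
# Tiny subset of NLTK's English stop-word list, hard-coded so this
# sketch runs offline; the project downloads the full list via
# nltk.download("stopwords") and nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "for", "and", "of", "in", "on", "with"}

def clean_title(title: str) -> str:
    # Lower-case, keep alphanumeric tokens, and drop stop words.
    return " ".join(
        w for w in title.lower().split()
        if w.isalnum() and w not in STOP_WORDS
    )

clean_title("A Floral Top for the Summer")  # -> "floral top summer"
```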

The cleaned data is saved as a pickle file to be reused for models in the further part of this project.

Text Based product similarity

First, a few functions are defined for displaying the recommended products in the form of heat maps and plots.

The utility functions used for this are defined in the accompanying notebook.

Now three different models are defined for the text based recommendation system.

  1. Bag of Words.
  2. Term Frequency — Inverse Document Frequency
  3. Inverse Document Frequency

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

Euclidean distance is calculated between the data entries, which are first converted into count vectors with a count vectorizer.
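As a sketch with scikit-learn (the toy titles are hypothetical; the first one plays the role of the query product):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

# Toy cleaned titles; index 0 is the query product.
titles = [
    "women black striped shirt",
    "women black striped tshirt",
    "women red floral maxi dress",
]

counts = CountVectorizer().fit_transform(titles)  # bag-of-words counts

# Euclidean distance from the query product to every product.
dists = pairwise_distances(counts[0], counts, metric="euclidean")[0]
ranking = dists.argsort()  # index 0 (the query itself) comes first
```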

Initially, a random product is selected, as shown in the figure below. For this article, 3 recommendations are shown so that the results can be explained easily.

Based on the title of the product, the 3 products with the smallest Euclidean distance are recommended below. Looking at the images closely, it can be seen that the products are similar.

For better results, the TF-IDF model is used.

Term Frequency — Inverse Document Frequency

TF-IDF stands for “Term Frequency — Inverse Document Frequency”. It is a technique for quantifying a word in a document: a weight is computed for each word that signifies its importance within the document and the corpus. The method is widely used in Information Retrieval and Text Mining.

To display the recommended products, same product which was used in BoW is used as the main product.

The recommendations of that product is shown in the figure below. It can be seen that these recommendations are more appropriate as compared to the ones before.

Inverse Document Frequency

In simple terms, IDF is a measure of the rareness of a term; conceptually, we start by measuring document frequency. As vectorization with IDF is time-consuming, the entries missing color, brand, or price information have been removed. After that, de-duplication with a margin of 3 is also performed for better results. The resulting data contains 28k entries and is also used for the text-semantics-based recommendations discussed in the next part.
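The IDF part on its own can be sketched from first principles (toy titles again; `log(N / df)` is one common IDF formula, though variants differ in smoothing):

```python
import math
from collections import Counter

titles = [
    "women black striped shirt",
    "women black striped tshirt",
    "women red floral maxi dress",
]
n_docs = len(titles)

# Document frequency: in how many titles does each word occur?
df = Counter(word for t in titles for word in set(t.split()))

# IDF: ubiquitous words score 0, rare words score high.
idf = {word: math.log(n_docs / df[word]) for word in df}
```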

As this data is more heavily processed, a different product is set up as the base product, as shown in the figure below.

The recommendations based on this model are shown below. Although they do not look more appropriate, the vectors formed by this model will be used further in the text-semantics models.

Text Semantics Based product similarity

The utility functions for text semantic based product similarity are given below.

Two models are applied in this section — Average Word2Vec and Word2Vec based on brand and color.

Average Word2Vec

Word embedding is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, and so on.

The word2vec model used in this project is the pre-trained one already provided in the data folder, since training our own model would require a lot of data.
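A sketch of the averaging step, with a tiny hand-made embedding table standing in for the pre-trained word2vec model (the vectors and their dimensionality are hypothetical):

```python
import numpy as np

# Toy 3-d embeddings standing in for the pre-trained word2vec model
# shipped in the project's data folder (values are made up).
w2v = {
    "women": np.array([0.1, 0.3, 0.5]),
    "black": np.array([0.7, 0.1, 0.0]),
    "shirt": np.array([0.2, 0.6, 0.4]),
}

def avg_word2vec(title: str, dim: int = 3) -> np.ndarray:
    # A title's vector is the mean of the vectors of its in-vocabulary words.
    vecs = [w2v[w] for w in title.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```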

The product used as the base product for the IDF model is also used here to create recommendations.

It can be observed that these recommendations are more appropriate than those of all the other models.

Word2Vec based on Brand and Color

As we know, in a business model, products of the same brand are not given as recommendations and other products are shown instead. This model therefore assigns weights to the titles and to the brands/colors, set to 10 and 5 respectively, to obtain the results.
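One plausible reading of this weighting, sketched as a blended distance; the weight values 10 and 5 come from the article, while the combination formula itself is an assumption:

```python
def weighted_distance(title_dist: float, brand_color_dist: float,
                      w_title: float = 10.0, w_attr: float = 5.0) -> float:
    # Blend the title distance with the brand/color distance using the
    # weights from the article (10 for title, 5 for brand/color).
    return (w_title * title_dist + w_attr * brand_color_dist) / (w_title + w_attr)
```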

The same base product as in the IDF and Average Word2Vec models is selected.

The recommendations are more appropriate and thus suitable for a proper business model.

Conclusion and Future work

It is thus observed that, from a business perspective, the Word2Vec model based on brand and colors is the most useful for giving recommendations on an e-commerce website.

In the future, more data with mixed types of products can be used; in addition, product images together with neural networks can be used to give image-based recommendations.

Bibliography

  1. Data — https://drive.google.com/drive/folders/1K_BSjfQjZdHhy9qtQdGi032egY8sPJD5?usp=sharing
  2. Colab Notebook — https://colab.research.google.com/drive/15fGpnn-Ke2Ip5udQBjSQvlbCxWe7fh9l?usp=sharing
  3. Github — https://github.com/sakshamchecker/AmazonProductRecommendation
  4. https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html
  5. https://realpython.com/nltk-nlp-python/
  6. https://en.wikipedia.org/wiki/Bag-of-words_model
  7. https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
  8. https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

