A Content-Based Recommendation System

Divya Chandana
Published in The AI Guide · May 4, 2021

Introduction

Recommender systems have become ubiquitous in consumers’ everyday online lives, extending from e-commerce and social media to news outlets. Our choices are super-charged by machine learning models that learn from our likes and recommend more of what we like to see. Since recommender systems have such a significant effect on our lives, and consequently on society, I have chosen to build a content-based recommender engine for an e-commerce web store, Wayfair.[3]

The recommender I’m going to build is aimed at furniture buyers on the Wayfair e-commerce website: given a product a shopper is interested in, it recommends the most similar items.

Collaborative Filtering vs. Content-Based Filtering

Content-Based Recommender System

A content-based recommender system attempts to infer a user’s preferences from the features of the items he or she responds to positively. Once we know those preferences, we can use the resulting feature vector to place the user in an embedding space and recommend items that match them.[4]

How can you find similarities between items?

During recommendation, similarity metrics are calculated between each item’s feature vector and the feature vectors of the user’s favorite items from previous records. The top few items are then suggested.

Collaborative Filtering Recommender System

Collaborative filtering does not need the features of the products. Each user and item is described by a feature vector, and the model learns these embeddings on its own, placing both users and items in the same embedding space.
When recommending to a user, it takes into account the reactions of other users: it keeps track of the items that user likes, as well as which items users with similar behavior and preferences like, and recommends accordingly.[5]

Libraries

NumPy provides a high-performance multidimensional array object and tools for working with these arrays.

Pandas is an open-source library built on top of NumPy providing high-performance, easy-to-use data structures, and data analysis tools for the Python programming language. It allows for fast analysis, data cleaning, and preparation. It excels in performance and productivity. It can work with data from a wide variety of sources.

NLTK (Natural Language Toolkit) in Python ships with a list of stopwords. We can remove stopwords easily by checking each word against the list of words considered stopwords, for example a, as, the, etc.

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector based on the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts and we wish to convert each word in each text into vectors. [2]
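As a quick illustration of what CountVectorizer does (the toy corpus below is made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus to illustrate the idea
corpus = [
    "modern leather sofa",
    "leather recliner sofa",
    "modern floor lamp",
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

print(sorted(vectorizer.vocabulary_))  # vocabulary learned from the corpus
print(count_matrix.toarray())          # one row of counts per text
```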

Data Collection

Import.io is a tool that enables users to convert the data on websites into a structured format. If we provide a URL and select which details we need extracted from the page, it returns those details. With this tool I collected Wayfair e-commerce website data, which covers furniture and home decor. I used key fields such as product name, manufacturer, price, and product details to build the recommendation engine. Here is the link for import.io: https://www.import.io/product/extract/

Data Cleaning

The extracted data contains a lot of null values, which might negatively affect the recommendation engine, so I cleaned up the data by filling NA values with the empty string.
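A minimal sketch of that step, assuming the export was saved as a CSV (the file name is a placeholder):

```python
import pandas as pd

# Load the import.io export; adjust the path to your own file
df = pd.read_csv("wayfair_products.csv")

# Fill null values with empty strings so later text processing
# does not trip over NaN entries
df = df.fillna("")
```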

Here is an overview of the data after cleaning up the NA values.

There are a lot of columns here, and I’m going to use only the ones required to build the recommendation engine. I reduced the columns using PCA, which can be used to perform dimensionality reduction automatically. Fewer input variables can result in a simpler predictive model that performs well when making predictions on new data.

The price values carry extra symbols such as ‘$’ and ‘;’ (e.g. ‘;$5.69’) for all the products, but I need a purely numerical format. I cleaned the prices by extracting only the numerical values and discarding the unwanted symbols.
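One way to strip those symbols with pandas, assuming the column is named price (a sketch, not the exact original code):

```python
# Strip the '$' and ';' symbols, e.g. ';$5.69' -> '5.69'
df["price"] = df["price"].astype(str).str.replace(r"[$;]", "", regex=True)
```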

Here is the cleaned price data.

Extended cleaning of price

Some price values have a few more extra symbols in between, such as ‘-’ and ‘,’. I replaced these with the empty string so that the price value itself stays intact.

Prices range from under $10 to over $45,000, and the price has to be fed to the model as a string; handled naively, this could produce wrong results. Hence I decided to use binning to assign a price range to each product, so that even when the price is tokenized, products in a similar price range fall into the same category.
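A sketch of the extended cleanup and the binning using pd.cut; the bin edges and labels here are my own illustrative choices, not necessarily the article’s:

```python
import pandas as pd

# Remove the remaining '-' and ',' symbols mentioned above
df["price"] = df["price"].str.replace(r"[-,]", "", regex=True)
df["price_num"] = pd.to_numeric(df["price"], errors="coerce")

# Bin prices into labeled ranges so that similar prices tokenize
# to the same string token
bins = [0, 50, 200, 1000, 5000, 50000]
labels = ["0-50", "50-200", "200-1000", "1000-5000", "5000-50000"]
df["price_range"] = pd.cut(df["price_num"], bins=bins, labels=labels).astype(str)
```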

The columns used for the analysis are the product name, product information, product description, price range, and manufacturer details.

Exploratory Data Analysis

To get a view of how price values are distributed in the Wayfair dataset, I wanted to know which products have the highest and the lowest prices; that informed how I implemented binning in my analysis.

Here are the top 5 most expensive products. The top one is the Original Handel Peacock 63" Floor Lamp, priced above $30,000, followed by the Palace Chandeelier 108" Candelabra Floor Lamp (set of 2), the Tolomeo 103" Reading Floor Lamp, etc.

And here are the lowest-priced products: the Lynn 62.5" Floor Lamp and the Barner 71" Torchiere Floor Lamp, which share the same price of $34.99.

Basically, the prices in our dataset range from very low to very high; after studying these graphs I came up with the price-binning method and implemented it.
This is the final analysis of the above step. After computing the cosine similarity, I plotted the similarity metrics of each product and how the products relate to each other. In the heat map below, the darker portions are more closely related; the x-axis and y-axis are the products with their respective indexes.

When more products were added to the similarity check, the heat map became hard to read and not very efficient either. This is why I wrote a simple piece of code that, given a product name as a parameter, generates the top 10 similar products rank-wise.

TF-IDF

Term Frequency Inverse Document Frequency (TF-IDF)

This is a very common algorithm for transforming text into a meaningful numeric representation, which can then be used to fit machine learning algorithms for prediction. TF-IDF is a statistic that aims to capture how important a word is to a document, while also taking into account its relation to the other documents in the same corpus.
This is done by looking at how many times a word appears in a document, while also paying attention to how many times the same word appears in the other documents of the corpus.

TF-IDF is a score applied to every word in every document in our dataset. For every word, the TF-IDF value increases with every appearance of the word in a document, but is gradually decreased by every appearance in other documents.

  • N is the number of documents we have in our dataset
  • d is a given document from our dataset
  • D is the collection of all documents
  • w is a given word in a document

The first step is to calculate the term frequency, the first component of the score. With f(w, d) the raw count of word w in document d, one common formulation is:
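\mathrm{tf}(w, d) = \frac{f(w, d)}{\sum_{w' \in d} f(w', d)}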

The second step is to calculate the inverse document frequency, the log-scaled inverse of the fraction of documents that contain the word:
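\mathrm{idf}(w, D) = \log\left(\frac{N}{\left|\{d \in D : w \in d\}\right|}\right)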

The final step is to multiply the two:
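\mathrm{tfidf}(w, d, D) = \mathrm{tf}(w, d) \times \mathrm{idf}(w, D)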

Similarity metrics

Cosine Similarity

The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate the cosine similarity we use the following formula:
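\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}

where A and B are the two vectors being compared.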

(Figure: the cosine function, with red vectors pointing at different angles on the left and the resulting cosine curve on the right.)

Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point in the exact same direction, the cosine similarity is +1. If the vectors point in opposite directions, the cosine similarity is -1.

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another irrespective of their size. The TF-IDF text analysis technique helps convert the documents into vectors where each value in the vector corresponds to the TF-IDF score of a word in the document. Each word has its own axis, and the cosine similarity then determines how similar the documents are.

Jaccard Similarity

Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is for comparing two binary vectors (sets).

In set theory it is often helpful to see a visualization of the formula:
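J(A, B) = \frac{|A \cap B|}{|A \cup B|}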

We can see that the Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.

Both cosine similarity and Jaccard similarity are common metrics for calculating text similarity. Computing the Jaccard similarity is more expensive, as it matches all the terms of one document against those of another, but it turns out to be useful for detecting duplicates. [1]

Recommending TOP 10 similar items

Here I merged all the columns necessary for the analysis: the product information, product data, product description, price range, and manufacturer.

Then I removed stop words from the merged text, as sketched below.
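A sketch of the merge and the stop-word removal together; the column names are my guesses at the cleaned fields, not necessarily the originals:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of NLTK's stopword list
stop_words = set(stopwords.words("english"))

# Merge the columns used for the recommendation into one text field
cols = ["product_name", "product_details", "description", "price_range", "manufacturer"]
df["merged"] = df[cols].astype(str).agg(" ".join, axis=1)

# Drop stopwords from the merged text
df["merged"] = df["merged"].apply(
    lambda text: " ".join(w for w in text.split() if w.lower() not in stop_words)
)
```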

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix; the count matrix records the occurrences of each word in each document.

To find the cosine similarity between the rows of the count matrix, I use the cosine_similarity function.
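Putting the count matrix and the similarity computation together (a minimal sketch, continuing from the merged column above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(df["merged"])  # sparse csr_matrix

# Pairwise cosine similarity between every pair of products
cosine_sim = cosine_similarity(count_matrix)
```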

Here is the code that generates the details of the top 10 products for a given product name.
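A sketch of how such a lookup can be written, assuming a default integer index and the column names used in the earlier snippets:

```python
def recommend_top10(product_name, df, cosine_sim):
    """Return the 10 products most similar to the given product name."""
    # Position of the query product (assumes df has a default 0..n-1
    # index matching the rows of cosine_sim)
    idx = df.index[df["product_name"] == product_name][0]

    # Pair each row index with its similarity to the query product
    scores = list(enumerate(cosine_sim[idx]))

    # Sort by similarity (highest first), skip the product itself, keep 10
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    top10 = [i for i, _ in scores if i != idx][:10]

    return df.iloc[top10][["product_name", "manufacturer", "price_range"]]


print(recommend_top10("Saville Leather Sofa", df, cosine_sim))
```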

Here are the top 10 results for the product Saville Leather Sofa.

Here are the top 10 results for the product 51" Floor Lamp.

Here are the top 10 results for the product Bordadora Sofa with Pillow.

Bugs Encountered

I ran into this bug while fitting the data: the problem was that I hadn’t converted the merged string into a list. After converting the merged string into tokens, i.e. a list of words, the issue was resolved.

Limitations

I designed a basic, workable content-based recommendation system. Its main limitation is that it only uses price, product details, manufacturer, and so on. I could improve the recommendation engine if I could extract features such as clicks, impressions, user demographics, preferences, gender, ratings, user reviews, and search terms. Using those in the recommendation system would have produced more stable recommendations.

Conclusion

Regardless of the flaws, the current Wayfair recommendation engine produced results that were closely related to the given product. For a sofa, for example, it gave top 10 recommendations consisting only of sofa products, and for lamps it recommended only lamps. It never deviated, because the data cleaning, the binning concept, and the tokenizing I used are very effective. This recommendation engine can be modified and turned into a more robust engine.

References

[1] https://www.import.io/product/extract/

[2]

[3]

[4]

[5]
