Content-based Recommendations — A Technical Deep Dive
Introduction
YFret provides recommender systems that can be used in industries such as e-commerce, travel, content sites, etc. In our series about recommender systems, the previous post explored the role of different types of recommendation in the e-commerce lifecycle. In this post, we will look at the business need for and the technical implementation of a content-based recommendation system.
Need for Content-based recommendations
Recommender systems can be broadly classified into content-based systems and collaborative-filtering systems. Content-based systems generate recommendations based on product attributes (or any data available in the form of objects, for example, blog posts in the case of content sites). Collaborative filtering generates recommendations based on user behavior.
In most use cases, collaborative filtering works better than content-based recommendations because it surfaces more obscure user behavior patterns which are not apparent from the product data. So why do we need content-based recommendations at all? Because on a brand-new website that lacks user activity, collaborative filtering presents a catch-22.
The catch-22 is that collaborative filtering needs quality user activity on the site to generate recommendations, but quality recommendations are needed to encourage users to be active on the site.
Even if the site has sufficient user activity, it might not be evenly distributed among the products on the site; user activity usually follows an approximate 80:20 rule, where 80% of the traffic is brought in by 20% of the content. Collaborative filtering fails to generate quality recommendations for clusters of products that lack user activity.
Content-based recommendations can be used to substitute collaborative-filtering recommendations whenever enough user data is not available.
Generating Content-based Recommendations
We will be using a cleaned version of this dataset to generate the recommendations. All the relevant code and visualizations can be found in this jupyter notebook.
As discussed earlier, the content-based approach generates recommendations based on product attributes without taking user activity into account. But to compute the similarity between products, we need to convert product attribute documents into a format that the algorithm can process — vectors, basically NumPy arrays of numbers that can be used downstream.
Let’s have a look at a sample product and its attributes.
We have many descriptive attributes which can be used to generate the product vector. A simple approach to generating the vector is to use one-hot encoding on select attributes like `category`, `brand`, `cloth_type`, etc. But by doing this we would be ignoring the other valuable attributes, and it wouldn’t work when those attributes are unavailable. Ideally, we want a method that can make use of as many attributes as possible, while keeping the load on the system reasonable.
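The simple one-hot approach mentioned above can be sketched with pandas. The attribute values here are made up for illustration, since the sample product isn’t reproduced in this post:

```python
import pandas as pd

# Hypothetical product attributes (the real dataset has many more columns).
products = pd.DataFrame({
    "category": ["apparel", "apparel", "footwear"],
    "brand": ["acme", "zenith", "acme"],
})

# One-hot encode the selected attributes: each distinct value
# becomes its own binary column.
one_hot = pd.get_dummies(products, columns=["category", "brand"])
print(one_hot.columns.tolist())
# ['category_apparel', 'category_footwear', 'brand_acme', 'brand_zenith']
```

Note how each new column answers a yes/no question about one attribute value, which is exactly why unseen or missing attributes leave gaps in this representation.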
Steps to convert product attributes to a vector — code is available here
- Remove non-descriptive attributes such as `product_url`, `id`, etc. `description` can be important in a few cases, but since we have many other attributes in this case, we can safely ignore it.
- The remaining attributes can be of type `number`, `string`, or `list` (flatten the data so that it does not contain any nested objects); each of these types should be vectorized.
- Numeric data such as `price`, `retail_price`, etc. is already in the right format, so it is added to the vector after scaling with MinMaxScaler.
- `string` attributes can be encoded using a Tf-Idf vectorizer, where the value of the attribute in each product is treated as a document, and the vocabulary is built from the values of that attribute across all products. Care must be taken to control the vector length using parameters such as `min_df` and `max_features`.
- `list` attributes can be joined into a `string` and vectorized the same way.
- Convert the vector matrix to a sparse type, to make the algorithm memory efficient.
Now that the documents are in vector format, we can verify that the algorithm has learned the structure of the data with a 3D t-SNE plot. t-SNE is a dimensionality reduction algorithm which reduces an n-dimensional vector to a 3-dimensional vector, which is easy to visualize. The scatter plot below shows the spread of the 3D product vectors, with each product color-coded by a combination of its `cloth_type` and `gender` attributes. As can be seen from the plot, the basic structure of the data is captured in the vectors.
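A minimal sketch of the t-SNE reduction step, using a random stand-in for the real product-vector matrix (scikit-learn’s `TSNE` expects a dense array, so a sparse matrix would need `.toarray()` first):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the real (densified) product-vector matrix.
rng = np.random.default_rng(0)
product_vectors = rng.random((50, 10))

# Reduce to 3 dimensions for a 3D scatter plot.
# perplexity must be smaller than the number of samples;
# it is kept small here because the toy set is small.
tsne = TSNE(n_components=3, perplexity=10, random_state=0)
embedding = tsne.fit_transform(product_vectors)
print(embedding.shape)  # (50, 3)
```

Each row of `embedding` is one product’s 3D coordinate, ready for a color-coded scatter plot.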
With the vector representation of the products at hand, we can use similarity measures such as cosine similarity to compute the similarity between them.
Similar Products
Let’s use the algorithm to provide similar products when given a base product. Basically, this can be imagined as getting the nearest points when given a base point from the 3D scatter plot.
The recommendations generated are similar to the base product provided; this can be used to power a recommendation widget on the base product’s detail page.
Personalized Recommendations
Content-based recommendation is not the go-to method for personalized recommendations, but it can still be used as a fallback to more sophisticated recommendation engines. Given a list of products liked by the user, the same similar-products logic can be extended to generate personalized recommendations.
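A sketch of that extension, under the assumption that the liked products are averaged into a single "taste" vector (one reasonable aggregation among several; the names and stand-in data are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-in for the real product-vector matrix.
rng = np.random.default_rng(1)
product_vectors = rng.random((100, 20))

def personalized_recommendations(liked_indices, k=5):
    """Average the liked products into one taste vector, then
    return the k nearest products the user hasn't liked yet."""
    taste = product_vectors[liked_indices].mean(axis=0, keepdims=True)
    sims = cosine_similarity(taste, product_vectors).ravel()
    sims[liked_indices] = -1.0  # don't recommend already-liked products
    return np.argsort(sims)[::-1][:k]

print(personalized_recommendations([3, 17, 42]))
```

Averaging keeps the fallback cheap; a weighted average (by recency of the like, say) would be a natural refinement.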
That was it, hope you had fun! Please share your comments and thoughts below. I’ll be happy to respond.