Content-Based Recommender System in Python
In the previous article, we learned about the most general form of recommender system: the non-personalized recommender system. Although this type of recommendation can be useful in some cases, its biggest problem is that it lacks context.
In this article, we will learn about content-based recommender systems. This type of recommender system depends on the inputs provided by the user; typical examples of content-based filtering are Google, Wikipedia, etc. Let’s start by building an intuition for what these recommenders do. When you search for a group of keywords, the most basic idea would be to show all the items containing those keywords. But the very problem we are trying to solve, the huge pile of information, becomes a roadblock here too. For example, if someone searches for “The Recommender System”, there will be a large number of documents containing the keyword “The”, and the same goes for the keyword “System”.
So how do we select the most relevant items? We will use the concept of TF-IDF for this purpose.
Concept of TF-IDF
TF-IDF stands for Term Frequency times Inverse Document Frequency. Let’s understand these terms separately. TF, or Term Frequency, tells us how often the term we are talking about appears in a document, i.e., how relevant the term is to that document. For example, how many times the keyword “The” appears in a document, or how many times a tag has been applied to a movie.
IDF stands for Inverse Document Frequency. It tells us how rare it is for a document to contain this term, or for a tag to be applied to a movie. We calculate it by inverting the fraction of documents containing the term, i.e., dividing the total number of documents by the number of documents containing the term. If a term appears in a large number of documents, this yields a low IDF value. We take the log of this ratio to bring the value onto a useful scale, since the total number of documents can be very large.
We then multiply TF and IDF to get a weight, which is assigned to the particular search term or tag we are talking about. Very common terms end up with a very low weight, which demotes them and promotes the distinctive, core terms.
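To make this concrete, here is a minimal sketch of the computation with made-up numbers (the document counts below are purely illustrative):

import math

def tf_idf(term_freq, docs_with_term, total_docs):
    # IDF = log10(total documents / documents containing the term)
    return term_freq * math.log10(total_docs / docs_with_term)

# A very common term gets a low weight...
print(tf_idf(term_freq=3, docs_with_term=9500, total_docs=10000))  # ~0.07
# ...while a rare, distinctive term gets a high one
print(tf_idf(term_freq=3, docs_with_term=20, total_docs=10000))    # ~8.1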
Keyword Vector
A keyword vector starts with the notion that we can define a multi-dimensional content space based on the universe of all possible keywords. Suppose we’re dealing with movies and we’ve decided that the things that describe movies are genres, actors, and directors; then every genre, actor, and director is a dimension in that space. Every item has a position in that space formed by combining all these dimensions, and that position defines a vector. It tells us, for example: how much is this particular movie about Tom Cruise? How much is it about fight scenes? How much of it is romance? How much is comedy?
Like every movie, every user also has a taste profile (or, in some cases, several taste profiles) that is a vector in the same space, and the match between user preferences and items is measured by how closely the two vectors align. We normally normalize these vectors to unit length before measuring how well they align.
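As a toy illustration (the three dimensions and their values below are invented for the example), normalizing an item vector to unit length looks like this:

import numpy as np

# Hypothetical content space: [action, romance, comedy]
movie = np.array([4.0, 1.0, 2.0])            # raw tag weights for one movie
unit_movie = movie / np.linalg.norm(movie)   # unit-length vector
print(unit_movie)                            # direction in content space, magnitude removed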
After defining the TF-IDF values for the tags and building keyword vectors for both item and user, we are finally left with the last part: computing predictions.
To compute how much a user’s profile vector aligns with an item’s vector, we will use the concept of cosine similarity. It is equal to the dot product of the two vectors divided by the product of the lengths of both vectors.
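In code, cosine similarity is just a couple of lines; here is a minimal NumPy sketch with hypothetical vectors:

import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the two vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

user_profile = np.array([0.9, 0.1, 0.4])   # hypothetical user taste vector
item_vector = np.array([0.8, 0.0, 0.6])    # hypothetical item vector
print(cosine_similarity(user_profile, item_vector))  # 1.0 = perfectly aligned, 0 = orthogonal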
Please note that sometimes, while building a user profile, we aggregate the profiles of the items the user has rated or consumed using a weighting scheme that incorporates how much the user liked each item. We will work through assignments of both types of system.
So let’s start with the coding part. Click here to download the datasets.
The representation of the code below might not be very easy to read, so please go to my GitHub repository to access all the code for the recommender systems in this series.
Let’s start by firing up the required libraries and importing the datasets.
import pandas as pd
import numpy as np
import math

# Raw strings avoid backslash-escape problems in Windows paths
Ratings = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\Content Based\data\ratings.csv")
Movies = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\Content Based\data\movies.csv")
Tags = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\Content Based\data\tags.csv")
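Assuming the usual MovieLens-style layout (ratings with userId/movieId/rating columns, tags with userId/movieId/tag columns), a quick peek at the data is a useful sanity check:

print(Ratings.head())
print(Movies.head())
print(Tags.head())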
Calculating the TF and IDF values and multiplying them together to get the TF-IDF value:
# TF: how many times each tag was applied to each movie
TF = Tags.groupby(['movieId', 'tag'], as_index=False, sort=False).count().rename(columns={'userId': 'tag_count_TF'})[['movieId', 'tag', 'tag_count_TF']]
# DF: how many distinct movies each tag was applied to
Tag_distinct = Tags[['tag', 'movieId']].drop_duplicates()
DF = Tag_distinct.groupby(['tag'], as_index=False, sort=False).count().rename(columns={'movieId': 'tag_count_DF'})[['tag', 'tag_count_DF']]
# IDF = log10(total number of movies) - log10(movies carrying the tag)
a = math.log10(len(np.unique(Tags['movieId'])))
DF['IDF'] = a - np.log10(DF['tag_count_DF'])
# TF-IDF = TF * IDF
TF = pd.merge(TF, DF, on='tag', how='left', sort=False)
TF['TF-IDF'] = TF['tag_count_TF'] * TF['IDF']
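As a quick sanity check (the movie chosen below is arbitrary), the highest-weighted tags for a movie should be its distinctive ones rather than generic terms:

example_movie = TF['movieId'].iloc[0]  # any movie will do for the check
print(TF[TF['movieId'] == example_movie].sort_values('TF-IDF', ascending=False).head(10)[['tag', 'TF-IDF']])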
Calculating the unit-length vector by dividing each TF-IDF value by the vector length of the corresponding movie:
# Vector length per movie = sqrt of the sum of squared TF-IDF values
Vect_len = TF[['movieId', 'TF-IDF']].copy()
Vect_len['TF-IDF-Sq'] = Vect_len['TF-IDF'] ** 2
Vect_len = Vect_len.groupby(['movieId'], as_index=False, sort=False).sum().rename(columns={'TF-IDF-Sq': 'TF-IDF-Sq-sum'})[['movieId', 'TF-IDF-Sq-sum']]
Vect_len['vect_len'] = np.sqrt(Vect_len['TF-IDF-Sq-sum'])
# Normalize each movie's tag weights to unit length
TF = pd.merge(TF, Vect_len, on='movieId', how='left', sort=False)
TF['TAG_WT'] = TF['TF-IDF'] / TF['vect_len']
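If the normalization worked, the squared TAG_WT values of each movie should sum to (approximately) 1; a quick check:

vector_check = TF.groupby('movieId')['TAG_WT'].apply(lambda s: np.sum(s ** 2))
print(vector_check.head())  # all values should be ~1.0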
In the first part of the assignment, with an unweighted user profile, the user profile should be the sum of the item-tag vectors of all items the user has rated positively (>= 3.5 stars). Mathematically, this is:

profile(u) = Σ v(i), summed over all items i with r(u,i) >= 3.5

where v(i) is the unit-length tag vector of item i.
Let’s implement the same and calculate the user profile for each user.
Ratings_filter = Ratings[Ratings['rating'] >= 3.5]
distinct_users = np.unique(Ratings['userId'])
user_tag_pref = pd.DataFrame()
i = 1
# [1:2] restricts the loop to a single user for demonstration;
# remove the slice to build profiles for all users
for user in distinct_users[1:2]:
    if i % 30 == 0:
        print("user:", i, "out of:", len(distinct_users))
    user_data = Ratings_filter[Ratings_filter['userId'] == user]
    user_data = pd.merge(TF, user_data, on='movieId', how='inner', sort=False)
    # Sum the tag weights of all positively-rated movies
    user_data1 = user_data.groupby(['tag'], as_index=False, sort=False)['TAG_WT'].sum().rename(columns={'TAG_WT': 'tag_pref'})
    user_data1['user'] = user
    user_tag_pref = pd.concat([user_tag_pref, user_data1], ignore_index=True)
    i = i + 1
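Looping user by user is easy to follow but slow. The same unweighted profiles can be built for all users at once with a single merge and groupby; the sketch below is an equivalent vectorized alternative (user_tag_pref_fast is a new name, not part of the original code):

# Vectorized equivalent: merge once, then sum tag weights per (user, tag)
merged = pd.merge(TF, Ratings_filter, on='movieId', how='inner')
user_tag_pref_fast = merged.groupby(['userId', 'tag'], as_index=False)['TAG_WT'].sum().rename(columns={'userId': 'user', 'TAG_WT': 'tag_pref'})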
Now we are ready with both the user profile and the item profile. The final step is to calculate the cosine similarity between the two vectors.
distinct_users = np.unique(Ratings_filter['userId'])
tag_merge_all = pd.DataFrame()
i = 1
for user in distinct_users[1:2]:
    user_tag_pref_all = user_tag_pref[user_tag_pref['user'] == user]
    distinct_movies = np.unique(TF['movieId'])
    j = 1
    for movie in distinct_movies:
        if j % 300 == 0:
            print("movie:", j, "out of:", len(distinct_movies), "with user:", i, "out of:", len(distinct_users))
        TF_Movie = TF[TF['movieId'] == movie]
        # Align the movie's tag vector with the user's tag preferences
        tag_merge = pd.merge(TF_Movie, user_tag_pref_all, on='tag', how='left', sort=False)
        tag_merge['tag_pref'] = tag_merge['tag_pref'].fillna(0)
        tag_merge['tag_value'] = tag_merge['TAG_WT'] * tag_merge['tag_pref']
        # Lengths of both vectors for the cosine denominator
        TAG_WT_val = np.sqrt(np.sum(np.square(tag_merge['TAG_WT']), axis=0))
        tag_pref_val = np.sqrt(np.sum(np.square(user_tag_pref_all['tag_pref']), axis=0))
        # Cosine similarity = dot product / product of the vector lengths
        tag_merge_final = tag_merge.groupby(['user', 'movieId'])[['tag_value']].sum().rename(columns={'tag_value': 'Rating'}).reset_index()
        tag_merge_final['Rating'] = tag_merge_final['Rating'] / (TAG_WT_val * tag_pref_val)
        tag_merge_all = pd.concat([tag_merge_all, tag_merge_final], ignore_index=True)
        j = j + 1
    i = i + 1
tag_merge_all = tag_merge_all.sort_values(by=['user', 'Rating']).reset_index(drop=True)
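With the predicted scores in hand, turning them into recommendations is just a matter of taking each user’s highest-scoring movies (a small follow-on sketch):

# Top-10 movies per user by predicted score
top_n = tag_merge_all.sort_values(['user', 'Rating'], ascending=[True, False]).groupby('user').head(10)
print(top_n[['user', 'movieId', 'Rating']])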
In the second part of the assignment, with a weighted user profile, rather than just summing the vectors for all positively-rated items, we compute a weighted sum of the item vectors for all rated items, with the weights based on the user’s ratings. We would implement the following formula:

profile(u) = Σ (r(u,i) − r̄(u)) · v(i), summed over all items i rated by u

where r(u,i) is user u’s rating of item i, r̄(u) is the user’s mean rating, and v(i) is the unit-length tag vector of item i.
Let’s implement the same and calculate the user profile for each user.
distinct_users = np.unique(Ratings['userId'])
user_tag_pref = pd.DataFrame()
i = 1
for user in distinct_users[1:2]:
    if i % 30 == 0:
        print("user:", i, "out of:", len(distinct_users))
    user_data = Ratings[Ratings['userId'] == user].copy()
    # Weight = how far each rating sits above or below the user's mean rating
    user_data['weight'] = user_data['rating'] - user_data['rating'].mean()
    user_data1 = pd.merge(TF, user_data, on='movieId', how='inner', sort=False)
    user_data1['TAG_WT_WTD'] = user_data1['TAG_WT'] * user_data1['weight']
    # Sum the weighted tag vectors over all rated movies
    user_data2 = user_data1.groupby(['tag'], as_index=False, sort=False)['TAG_WT_WTD'].sum().rename(columns={'TAG_WT_WTD': 'tag_pref'})
    user_data2['user'] = user
    user_tag_pref = pd.concat([user_tag_pref, user_data2], ignore_index=True)
    i = i + 1
Again, we calculate the cosine similarity between the two vectors:
distinct_users = np.unique(Ratings['userId'])
tag_merge_all = pd.DataFrame()
i = 1
for user in distinct_users[1:2]:
    user_tag_pref_all = user_tag_pref[user_tag_pref['user'] == user]
    distinct_movies = np.unique(TF['movieId'])
    j = 1
    for movie in distinct_movies:
        if j % 300 == 0:
            print("movie:", j, "out of:", len(distinct_movies), "with user:", i, "out of:", len(distinct_users))
        TF_Movie = TF[TF['movieId'] == movie]
        tag_merge = pd.merge(TF_Movie, user_tag_pref_all, on='tag', how='left', sort=False)
        tag_merge['tag_pref'] = tag_merge['tag_pref'].fillna(0)
        tag_merge['tag_value'] = tag_merge['TAG_WT'] * tag_merge['tag_pref']
        TAG_WT_val = np.sqrt(np.sum(np.square(tag_merge['TAG_WT']), axis=0))
        tag_pref_val = np.sqrt(np.sum(np.square(user_tag_pref_all['tag_pref']), axis=0))
        tag_merge_final = tag_merge.groupby(['user', 'movieId'])[['tag_value']].sum().rename(columns={'tag_value': 'Rating'}).reset_index()
        tag_merge_final['Rating'] = tag_merge_final['Rating'] / (TAG_WT_val * tag_pref_val)
        tag_merge_all = pd.concat([tag_merge_all, tag_merge_final], ignore_index=True)
        j = j + 1
    i = i + 1
tag_merge_all = tag_merge_all.sort_values(by=['user', 'Rating']).reset_index(drop=True)
Please note that this is a somewhat complex concept, and I encourage you to go through the University of Minnesota’s Recommender Systems specialisation on Coursera for a complete understanding.
In the next article, we will look at one of the most widely used recommender systems, called the User-User Recommender System.
Please go to my GitHub repository to access all the code for the recommender systems in this series.
Thanks!