Day 49 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode
5 min read · Aug 4, 2020

Content Based Recommendation Engines. We have discussed recommendation engines in the past few blogs, so I'm going to focus directly on content-based engines, the ones that power the 'Suggestions' we see when using media and entertainment apps. Anyway, let's quickly get to how they work.

Working of the Content based Recommender system (Algorithm)

These systems make recommendations using item features and user profile features. They hypothesize that if a user was interested in an item in the past, they will be interested in similar items in the future. Similar items are usually grouped based on their features. User profiles are constructed from historical interactions or by explicitly asking users about their interests. There are other systems, not considered purely content-based, which also utilize personal and social data.
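To make this concrete, here is a minimal sketch of the idea with made-up items and one-hot genre features (not the movies dataset used later in this post): the user profile is the average of the feature vectors of previously liked items, and candidate items are ranked by cosine similarity to that profile.

import numpy as np

# Hypothetical items described by one-hot genre features: [action, comedy, drama]
items = {
    'Movie A': np.array([1, 0, 0]),
    'Movie B': np.array([1, 0, 1]),
    'Movie C': np.array([0, 1, 0]),
    'Movie D': np.array([0, 0, 1]),
}

# The user liked Movie A and Movie B, so the profile is the mean of their features
liked = ['Movie A', 'Movie B']
profile = np.mean([items[t] for t in liked], axis=0)

# Score every unseen item by cosine similarity with the user profile
def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = {t: cosine(profile, v) for t, v in items.items() if t not in liked}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# Movie D (drama) ranks above Movie C (comedy) for this action/drama-leaning user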

Issues with Content Based filtering
One issue that arises is making obvious recommendations because of excessive specialization (user A is only interested in categories B, C, and D, so the system cannot recommend items outside those categories, even though they could be interesting to them). Another common problem is that new users lack a defined profile unless they are explicitly asked for information. Nevertheless, it is relatively simple to add new items to the system: we just need to assign them a group according to their features.

Keep in mind that content-based methods suffer far less from the cold-start problem than collaborative approaches: new users or items can be described by their characteristics (content), so relevant suggestions can be made for these new entities. Only new users or items with previously unseen features will suffer from this drawback, and once the system is old enough, this has little to no chance of happening.

Alright, now let's get straight to the implementation in Python.

Let's start with importing our libraries.

# Importing the libraries which are needed
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

The next step would be to import the dataset. The link to the dataset is given below:

Download the CSV file and import it into your environment using the following command.

# Importing the dataset of movie metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)
metadata.head(3)
Movies metadata head()

Calculating the mean of the 'vote_average' column gives 6.069064705533525 as the output.

mean = metadata['vote_average'].mean()
print(mean)

To find the vote-count cutoff for a movie to qualify, we take the 90th percentile of the 'vote_count' column using the quantile method: a qualifying movie must have more votes than at least 90% of the movies in the dataset.

m = metadata['vote_count'].quantile(0.90)
print(m)

Next, filter the qualified movies (those with at least m votes) into a new DataFrame.

filtered_movies = metadata.copy().loc[metadata['vote_count'] >= m]
filtered_movies.shape

We write a function `weighted_rating()` which calculates the overall rating of each movie based on the IMDB formula.

def weighted_rating(x, m=m, mean=mean):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB weighted rating formula
    return (v/(v+m) * R) + (m/(m+v) * mean)
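To get a feel for what this formula does, here is a quick standalone check with hypothetical numbers (m = 160 and mean = 6.07 are illustrative values, roughly in line with what the code above produces; your exact values may differ): a movie with a huge number of votes keeps a score close to its own average, while a movie with few votes is pulled toward the global mean.

# Illustrative check of the IMDB weighted rating with hypothetical values
m_demo, mean_demo = 160.0, 6.07

def weighted_rating_demo(v, R, m=m_demo, mean=mean_demo):
    return (v/(v+m) * R) + (m/(m+v) * mean)

# A blockbuster with 10,000 votes and an average of 8.5 stays close to 8.5
print(weighted_rating_demo(v=10000, R=8.5))  # ~8.46
# A film with only 200 votes and the same average is pulled toward the mean
print(weighted_rating_demo(v=200, R=8.5))    # ~7.42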

Define a new feature called 'score' and calculate its value with `weighted_rating()`.

filtered_movies['score'] = filtered_movies.apply(weighted_rating, axis=1)

Sort the movies in descending order of the calculated score.

filtered_movies = filtered_movies.sort_values('score', ascending=False)

# Print the top 20 movies
filtered_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)
Movie recommendations in descending order of score

Let's get to the most important part of the implementation.

Let us print the plot overviews of the first 5 movies to get a rough idea of what they look like.

metadata['overview'].head(5)
metadata text

The next step is to use the TF-IDF vectorizer from scikit-learn; I would suggest going through the library documentation for any syntax-related queries.

# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
#Output the shape of tfidf_matrix
tfidf_matrix.shape

Let's try to get an array mapping from feature integer indices to feature names.

tfidf.get_feature_names()[5000:5010]

The next important step is to compute the cosine similarity matrix. The snippet below imports linear_kernel; because the TF-IDF vectors are already L2-normalised, the linear kernel produces exactly the cosine similarity values while being faster to compute.

# Importing linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Computing the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
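If you want to convince yourself that the linear kernel really is the cosine similarity here (the vectors produced by TfidfVectorizer are L2-normalised by default), a quick sanity check on a small slice of the matrix, using the cosine_similarity function imported earlier, is sketched below.

# Sanity check: for L2-normalised TF-IDF vectors, the linear kernel and
# cosine similarity give the same values (compared here on a small slice)
import numpy as np
subset = tfidf_matrix[:100]
print(np.allclose(linear_kernel(subset, subset),
                  cosine_similarity(subset, subset)))  # expected: True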

You can test your output to get a rough idea of what the cosine similarity looks like using:

cosine_sim[1]
Cosine Similarity values

Let us now construct a reverse map from movie titles to DataFrame indices.

indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

You can print out the values and check them out.

indices[:10]
Title-to-index mapping (first 10 entries)

If you've made it this far, this is the final and main step, where we write our main function to recommend similar movies.

MAIN PART!

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies (skipping the movie itself)
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

Now, you can test your function by inputting any random movie which exists in the dataset. Check out the result I obtained below:

get_recommendations('The Godfather')
recommendations obtained

As you can see, we queried the movie 'The Godfather' and obtained similar results, such as its sequels, based on the cosine similarity matrix that was calculated. Since we are working with a text corpus, we used the TF-IDF vectorizer. Initially, we built a simple recommender based purely on the weighted rating score computed for each movie; towards the end, we built a content-based recommender that uses the TF-IDF representation of the plot overviews and cosine similarity.

That’s it for today. Thanks for reading. Keep Learning.

Cheers.
