Movie Recommendation System

Published in

Analytics Vidhya

8 min read · Apr 4, 2021

Hey geeks! We all know that watching movies is a lot of fun, and we have all watched many movies on platforms like Netflix. I have been thinking about how movie platforms like Netflix, or even Medium, suggest things based on user interest. How do their recommendations work based on our interests? Let's find out!

In this blog we will talk about recommendation systems and create a movie recommendation system using The Movie Database (TMDB) dataset.

Here is what Google has to say about it:

A Recommender system, or a recommendation system, is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item.

Simply put, recommender systems aim to predict users' interests and recommend items that are likely to interest them.

Value of recommendation

Netflix: 2/3 of movies watched are recommended

Google news: recommendation generates 38% clickthrough

Amazon: 35% sales from recommendation

So, from a business standpoint, the more relevant products a user finds on the platform, the higher their engagement, which results in increased revenue for the platform.

Types of Recommendation system

Typically, the machine learning algorithms behind recommendation systems fall into two categories:

1. Content-Based Recommendation Systems

2. Collaborative Filtering Recommendation Systems

Modern recommenders often combine both approaches, which is called hybrid recommendation.

Content-Based Filtering

These systems suggest similar items based on a particular item. They use item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations. The general idea behind these recommender systems is that if a person likes a particular item, he or she will also like an item that is similar to it, e.g. YouTube.
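To make the idea concrete, here is a minimal plain-Python sketch of content-based matching. The catalogue, its titles and its tags are all made up for illustration, and Jaccard overlap is just one simple similarity measure, not the one used later in this post.

```python
# Toy catalogue: every title and tag below is invented for illustration.
catalogue = {
    "Space Saga": {"sci-fi", "adventure", "director_x"},
    "Star Quest": {"sci-fi", "adventure", "director_y"},
    "Love in Paris": {"romance", "drama", "director_z"},
}

def jaccard(a, b):
    """Share of tags two items have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def similar_items(liked, catalogue):
    """Rank every other item by metadata overlap with the liked item."""
    scores = [(other, jaccard(catalogue[liked], tags))
              for other, tags in catalogue.items() if other != liked]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = similar_items("Space Saga", catalogue)
print(ranking)
```

Someone who liked "Space Saga" gets "Star Quest" first because the two share the most metadata, no matter what other users think of either movie.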

Collaborative Filtering

These systems are widely used. They try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Collaborative filters do not require item metadata like their content-based counterparts.

There are two categories of CF:

User-based: measure the similarity between the target user and other users
Item-based: measure the similarity between the items the target user rates or interacts with and other items

The key idea behind CF is that similar users share the same interests and that a user tends to like similar items. Examples of this are found in the recommendation systems of Netflix and Spotify.
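As a rough illustration of the user-based flavour, the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. The users, movie ids and ratings are invented, and plain cosine similarity over co-rated movies is an assumption chosen for brevity; real systems usually centre the ratings first.

```python
import math

# Toy ratings (user -> {movie: rating}); all values are invented.
ratings = {
    "alice": {"m1": 5.0, "m2": 4.0, "m3": 1.0},
    "bob":   {"m1": 5.0, "m2": 5.0, "m3": 1.0, "m4": 5.0},
    "carol": {"m1": 1.0, "m2": 2.0, "m3": 5.0, "m4": 1.0},
}

def user_similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    pairs = [(user_similarity(user, other), r[movie])
             for other, r in ratings.items()
             if other != user and movie in r]
    total = sum(sim for sim, _ in pairs)
    return sum(sim * rating for sim, rating in pairs) / total if total else None

estimate = predict("alice", "m4")
print(round(estimate, 2))
```

Alice's tastes are closer to Bob's than to Carol's, so Bob's rating of m4 pulls the estimate further than Carol's does.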

For detailed information, click here.

Hybrid Engine

We brought together ideas from content and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.

Let's move on to the implementation.

Here, I have used the tmdb_5000 dataset, which contains nearly 5000 movies with their titles, genres, cast, etc.

We use pandas and numpy for data preprocessing and sklearn for our machine learning tasks.

Import everything we need and load the dataset:

import pandas as pd
import numpy as np
movies = pd.read_csv('datasets/tmdb_5000_movies.csv')
df = pd.read_csv('datasets/tmdb_5000_credits.csv')
# keeping the 'tittle' spelling avoids a clash with the existing 'title' column when merging
df.columns = ['id', 'tittle', 'cast', 'crew']
movies = movies.merge(df, on='id')
movies.info()

From the output above we can see that there are 4803 movies and 23 columns, including the TMDB id.

Let's start with the content-based recommendation engine. For that we need some information about each movie, which is given here in the overview column.

Now we will use sklearn's TF-IDF vectorizer to extract features from the movies and find similarities between movies based on the TF-IDF matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

# remove English stop words like 'a', 'and', 'the'
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

# NaN -> ''
movies['overview'] = movies['overview'].fillna('')
tfidf_matrix = tfidf.fit_transform(movies['overview'])
tfidf_matrix.shape  # outputs: (4803, 20978)

The shape of the TF-IDF matrix is (4803, 20978), which means that 20978 different words are used to describe the 4803 movies.
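If you are curious what such a matrix contains, here is a small plain-Python sketch of the computation on three made-up overviews. The smoothed IDF formula and the L2 normalisation follow what I understand to be scikit-learn's defaults, but treat this as an approximation of the idea rather than a copy of the library's code; the tiny stop-word set is also just a stand-in.

```python
import math
from collections import Counter

# Three made-up overviews standing in for movies['overview'].
docs = [
    "a space crew fights an alien",
    "a detective hunts a killer",
    "an alien lands in a small town",
]
stop_words = {"a", "an", "in"}  # tiny stand-in for the English stop-word list

tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)
# Document frequency: in how many overviews each word appears.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

def tfidf_row(doc):
    counts = Counter(doc)
    # Smoothed IDF: words found in many overviews get a lower weight.
    row = [counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w in vocab]
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]  # L2-normalise each row

tfidf = [tfidf_row(doc) for doc in tokenized]
print(len(tfidf), len(vocab))  # rows = movies, columns = distinct words
```

In the first overview, "alien" ends up with a lower weight than "crew" because "alien" also appears in another overview.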

Now, we will compute similarity scores from this matrix.

Since TfidfVectorizer L2-normalizes each row by default, taking the dot product directly gives us the cosine similarity. We use the cosine similarity score here since it is relatively easy and fast to calculate.

from sklearn.metrics.pairwise import linear_kernel
cosin_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
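Why is a plain dot product enough here? Once two vectors are scaled to unit length, their dot product is exactly their cosine similarity. A quick plain-Python check with two arbitrary vectors (the numbers are illustrative only):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity computed from the raw vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two arbitrary raw weight vectors (illustrative numbers only).
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

# After L2 normalisation the plain dot product equals the cosine similarity.
print(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

Both printed values agree up to floating-point rounding, which is why the cheaper linear kernel can stand in for cosine similarity on already-normalised rows.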

Next, we build a reverse map from movie titles to their indices.

index_of_movies = pd.Series(movies.index, index=movies['title']).drop_duplicates()

Let's write the recommendation function. It will:

get the title
find the similarity scores for that movie in the cosin_sim matrix
sort the similarity scores
return the top movies for the input

def get_recommendations(title, cosin_sim=cosin_sim):
    idx = index_of_movies[title]
    
    sim_scores = list(enumerate(cosin_sim[idx]))
    # sort movie indices by similarity score
    sim_scores = sorted(sim_scores, key = lambda x:x[1], reverse = True)
    # keep the 30 most similar movies, skipping the movie itself at position 0
    sim_scores = sim_scores[1:31]
    
    movies_idx = [i[0] for i in sim_scores]
    
    return movies['title'].iloc[movies_idx]
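To see the same steps on something small enough to check by hand, here is a toy version with a hand-written similarity matrix. The titles and numbers are placeholders, and top_similar is a hypothetical helper name, not part of the project code.

```python
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]  # placeholder titles
index_of = {t: i for i, t in enumerate(titles)}

# Hand-written symmetric similarity matrix; the numbers are invented.
sim = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]

def top_similar(title, k=2):
    idx = index_of[title]
    # Pair each movie index with its similarity to the query, best first.
    scored = sorted(enumerate(sim[idx]), key=lambda p: p[1], reverse=True)
    # Position 0 is the movie itself (similarity 1.0), so skip it.
    return [titles[i] for i, _ in scored[1:k + 1]]

print(top_similar("Movie A"))  # ['Movie B', 'Movie D']
```

Skipping position 0 relies on the query movie being its own best match, which holds here because the diagonal is the only place with similarity 1.0.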

Improving the recommender with more metadata

First, we take the cast, crew, keywords and genres columns. Then we preprocess that data to extract the most useful information; for example, we get the director from the ‘crew’ column.

We will create a soup of this information and apply the CountVectorizer.

One important difference is that we use the CountVectorizer() instead of TF-IDF. This is because we do not want to down-weight the presence of an actor/director if he or she has acted or directed in relatively more movies. It doesn’t make much intuitive sense.
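A small plain-Python comparison shows the effect. The soups below are invented, and the smoothed IDF formula is an assumption modelled on common TF-IDF variants; the point is only that IDF shrinks the weight of a name that appears in many movies, while raw counts leave it alone.

```python
import math
from collections import Counter

# Made-up "soups": director_x appears in three of the four movies.
soups = [
    "director_x actor_a action",
    "director_x actor_b drama",
    "director_x actor_c comedy",
    "director_y actor_d action",
]
n = len(soups)
# In how many soups each token appears.
doc_freq = Counter(tok for s in soups for tok in set(s.split()))

def count_weight(token, soup):
    """Raw count, as a count-based vectorizer would use."""
    return soup.split().count(token)

def tfidf_weight(token, soup):
    """Count scaled by a smoothed IDF (assumed formula, for illustration)."""
    idf = math.log((1 + n) / (1 + doc_freq[token])) + 1
    return count_weight(token, soup) * idf

print(count_weight("director_x", soups[0]), count_weight("director_y", soups[3]))
print(tfidf_weight("director_x", soups[0]), tfidf_weight("director_y", soups[3]))
```

With counts, both directors weigh the same; with TF-IDF, the prolific director_x would count for less than director_y, which is the down-weighting we want to avoid.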

The next step is to compute a cosine similarity matrix based on the count matrix.

Below is the code for all of this:

from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for f in features:
    movies[f] = movies[f].apply(literal_eval)

# to get director from job
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# get top 3 elements of list
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]

        if len(names) > 3:
            names = names[:3]
        return names
    return []

# apply all functions
movies['director'] = movies['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for f in features:
    movies[f] = movies[f].apply(get_list)

# stripping
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(' ', '')) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(' ', ''))
        else:
            return ''

features = ['cast', 'keywords', 'director', 'genres']
for f in features:
    movies[f] = movies[f].apply(clean_data)

# creating a SOUP
def create_soup(x):
    return ' '.join(x['keywords'])+' '+' '.join(x['cast'])+' '+x['director']+' '+' '.join(x['genres'])

movies['soup'] = movies.apply(create_soup, axis=1)

# count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words = 'english')
count_matrix = count.fit_transform(movies['soup'])

# finding similarity matrix
from sklearn.metrics.pairwise import cosine_similarity
cosin_sim2 = cosine_similarity(count_matrix, count_matrix)

Now we can reuse the same get_recommendations() function, passing cosin_sim2 as the second argument, and you can see the improved recommendations for our movies.

This is how we can create a content-based recommendation engine.

However, our content-based engine suffers from some severe limitations. It is only capable of suggesting movies that are close to a certain movie. That is, it cannot capture tastes and provide recommendations across genres.

Also, the engine that we built is not really personal, in that it doesn’t capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who she or he is. Therefore, we will now use collaborative filtering to make movie recommendations.

Now, let's move on to the other type: collaborative filtering.

Since the dataset we used before does not have a userId (which is necessary for collaborative filtering), let’s load another dataset. We’ll be using the Surprise library to implement SVD. You can download the dataset from here.

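from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('datasets/ratings_small.csv')
# assumed setup for data and svd (both used below); expects userId, movieId, rating columns
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()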

Next, we cross-validate the model on our data with cross_validate.

We get a Root Mean Square Error of approximately 0.89, which is more than good enough for our case. Let us now train on the full dataset and arrive at predictions.
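For reference, RMSE is the square root of the average squared gap between the true and the predicted ratings, so lower is better. A tiny plain-Python sketch with invented numbers (these are not the ratings from our dataset):

```python
import math

def rmse(actual, predicted):
    """Root mean square error between true and predicted ratings."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Invented ratings on a 0.5 to 5 scale, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

An RMSE of about 0.89 therefore means the predictions are typically off by a bit less than one star.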

train = data.build_full_trainset()
svd.fit(train)

Let’s predict user 1’s rating for the movie with Id=302:

svd.predict(1, 302)

Here you can see est=2.6280, which means that user 1 might give a rating of 2.63 to the movie with Id 302.

That is how we can predict movie ratings from users' past ratings and recommend the best movies to them without knowing anything about the movies' content. This is called collaborative filtering.
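Under the hood, SVD-style recommenders learn a short vector of latent factors for every user and every movie, and estimate a rating from their dot product. The plain-Python sketch below trains such factors with stochastic gradient descent on invented ratings. It is a simplified illustration, not Surprise's implementation: it leaves out the per-user and per-movie bias terms, and the learning rate, regularisation and number of factors are arbitrary choices.

```python
import random

# Toy (user, movie, rating) triples; every value is invented.
triples = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 5.0), (1, 1, 5.0), (1, 3, 4.0),
    (2, 0, 1.0), (2, 2, 5.0), (2, 3, 2.0),
]
n_users, n_items, n_factors = 3, 4, 2
rng = random.Random(0)  # fixed seed so the run is repeatable
P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
mean = sum(r for _, _, r in triples) / len(triples)

def estimate(u, i):
    """Global mean plus the dot product of user and movie factors."""
    return mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def mse():
    """Mean squared error on the training triples."""
    return sum((r - estimate(u, i)) ** 2 for u, i, r in triples) / len(triples)

before = mse()
lr, reg = 0.05, 0.02  # learning rate and regularisation (arbitrary choices)
for _ in range(200):
    for u, i, r in triples:
        err = r - estimate(u, i)
        for f in range(n_factors):
            pu, qi = P[u][f], Q[i][f]
            # Nudge both factors to shrink the error, with a small penalty.
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)
after = mse()
print(before, after, estimate(0, 3))
```

The last printed value is an estimate for a (user, movie) pair that never appears in the training data, which is exactly what svd.predict gives us above.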

Hybrid recommender

Now, let's put our content-based and CF-based models together and make a strong recommender.

def conv_int(x):  # assumed helper, not shown in the original listing
    try:
        return int(x)
    except (TypeError, ValueError):
        return np.nan
movie_id = pd.read_csv('datasets/links.csv')[['movieId', 'tmdbId']]
movie_id['tmdbId'] = movie_id['tmdbId'].apply(conv_int)
movie_id.columns = ['movieId', 'id']
movie_id = movie_id.merge(movies[['title', 'id']], on='id').set_index('title')
print(movie_id.shape) # o/p: (4599, 2)

Our movie_id dataframe is now indexed by title and has two columns, movieId and id.

Make an index_map to look up a movie by its TMDB id.

index_map = movie_id.set_index('id')

Finally, let's define our recommendation function, which combines the power of both techniques: content-based and CF.

def recommend_for(userId, title):
    index = index_of_movies[title]
    tmdbId = movie_id.loc[title]['id']
    

    #content based
    sim_scores = list(enumerate(cosin_sim2[int(index)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:30]
    movie_indices = [i[0] for i in sim_scores]

    mv = movies.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'id']]
    mv = mv[mv['id'].isin(movie_id['id'])]
    # CF
    mv['est'] = mv['id'].apply(lambda x: svd.predict(userId, index_map.loc[x]['movieId']).est)
    mv = mv.sort_values('est', ascending=False)
    return mv.head(10)
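The two-stage idea in this function can be shown on a handful of invented numbers. In the sketch below, content_scores stands in for one row of the similarity matrix and estimated_rating stands in for the values returned by svd.predict(...).est; the names and the hybrid helper are hypothetical.

```python
# Invented content-similarity scores for one query movie.
content_scores = {"Movie B": 0.9, "Movie C": 0.7, "Movie D": 0.6, "Movie E": 0.2}
# Invented per-user rating estimates, standing in for svd.predict(...).est.
estimated_rating = {"Movie B": 2.5, "Movie C": 4.5, "Movie D": 3.8, "Movie E": 5.0}

def hybrid(content_scores, estimated_rating, n_candidates=3, n_results=2):
    # Stage 1 (content-based): keep the candidates most similar to the query.
    candidates = sorted(content_scores, key=content_scores.get,
                        reverse=True)[:n_candidates]
    # Stage 2 (collaborative): re-rank them by the user's estimated rating.
    return sorted(candidates, key=estimated_rating.get,
                  reverse=True)[:n_results]

print(hybrid(content_scores, estimated_rating))  # ['Movie C', 'Movie D']
```

Note that "Movie E" never makes the list even though this user would probably rate it highest, because it is not similar enough to the query movie to survive the first stage. Two different users asking about the same movie share the same candidates but can get them in a different order.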

Let's enjoy the fruits of our work ;)

Conclusion

We created recommenders using content-based and collaborative filtering. Hybrid systems can take advantage of both, as the two approaches have proved to be almost complementary. This model is a simple baseline and only provides a fundamental framework to start with.

You can find this project here. Go and play with it ;)

Happy Learning!!