“Movies we think you will like…” - Movie Recommendation System Using Cosine Similarity

Jinal Kalpesh Shah
Web Mining [IS688, Spring 2021]
Apr 23, 2021

Movie recommendation plays a significant role in our social lives because movies are such a popular source of entertainment. A recommendation system can suggest a set of movies to users based on their interests or on the popularity of the movies. Although many movie recommendation systems have been proposed, most of them either cannot recommend movies to existing users efficiently or cannot recommend anything to a new user at all.

During my research, I found that some recommenders offer generalized recommendations to every user based on movie popularity. The basic idea behind such a recommender is that movies which are more popular and more critically acclaimed have a higher probability of being liked by the average audience. This model does not give personalized recommendations.

All such a model has to do is sort the movies by rating and popularity and display the top movies in the list.
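As a rough illustration of this popularity-based approach, the sketch below sorts a toy table and returns the top titles. The titles and numbers are invented, and the `popularity` column and the `top_movies` helper are assumptions made for the example; only `vote_average` appears in the column list of the dataset.

```python
import pandas as pd

# Toy data: titles and numbers are invented purely for illustration.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "vote_average": [8.1, 6.4, 8.1, 7.2],
    "popularity": [35.0, 80.0, 50.0, 10.0],  # assumed column name
})


def top_movies(df, n=10):
    """Return the n highest-rated titles, breaking ties by popularity."""
    ranked = df.sort_values(["vote_average", "popularity"], ascending=False)
    return ranked.head(n)["title"].tolist()


top = top_movies(movies, n=3)
print(top)
```

Every user who calls `top_movies` gets the same list, which is exactly the limitation of this kind of recommender.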

That recommender suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user’s personal taste.

In this assignment, I am going to build an engine that computes the similarity between movies based on certain metrics and suggests the movies most similar to a particular movie that a user liked. Since I will be using movie metadata (or content) to build this engine, this approach is known as content-based filtering.

Cosine Similarity

I will use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, for two vectors A and B, it is defined as follows:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

According to the Neo4j documentation, cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors’ lengths (or magnitudes).
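The definition above translates almost word for word into code. The following is a minimal sketch in plain Python; the function name and the example vectors are chosen here for illustration and are not taken from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product of the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Lengths (magnitudes) of the two vectors.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while perpendicular vectors score 0. That is why the measure suits text vectors, where document length should not dominate the comparison.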

Data Source

I used a pre-existing dataset from Kaggle. The dataset comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

It has the following columns:

  1. adult : whether the movie has adult content or not.
  2. belongs_to_collection : whether the movie belongs to a collection or not.
  3. budget : budget of the movie.
  4. genres : genres of the movie, i.e. categories characterized by similarities in form, style, or subject matter.
  5. homepage : landing page of the movie's website.
  6. user id : ID of the user.
  7. movie_id : ID of the movie.
  8. original_language : language in which the movie was made.
  9. original_title : title of the movie.
  10. overview : summary of the movie.
  11. release_date : release date of the movie.
  12. revenue : revenue earned by the movie.
  13. runtime : duration of the film.
  14. spoken_languages : languages spoken in the film.
  15. status : whether the film is released or not.
  16. tagline : tagline of the movie.
  17. title
  18. video
  19. vote_average : average vote for the movie.
  20. vote_count : number of votes for the movie.
  21. year : year in which the movie was released.

First, we will import all the libraries used in this assignment.

Importing necessary libraries and packages used in the assignment

Next, we will read the dataset into our Jupyter notebook.

Loading the dataset
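The import and loading cells are not reproduced in the text. As a hedged sketch of what the loading step might look like, the snippet below reads a ratings table with pandas and checks for the `userId`, `movieId` and `rating` columns used by the later code. The `load_ratings` helper, the file name in the comment and the sample rows are assumptions made for illustration.

```python
import io

import pandas as pd


def load_ratings(source):
    """Read a ratings table and check that the columns used later are present."""
    ratings = pd.read_csv(source)
    required = {"userId", "movieId", "rating"}
    missing = required - set(ratings.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return ratings


# In the notebook, `source` would be a path such as "ratings.csv" (hypothetical name).
# An in-memory CSV with invented rows keeps this sketch runnable without the download.
sample_csv = io.StringIO(
    "userId,movieId,rating\n"
    "1,31,2.5\n"
    "1,1029,3.0\n"
    "2,31,4.0\n"
)
ratings = load_ratings(sample_csv)
print(ratings.shape)
```

Checking the columns up front gives a clear error message if a different Kaggle file is loaded by mistake.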

Data Cleaning :

Removing noise from the data

In the real world, ratings are very sparse, and data points are mostly collected from very popular or widely known movies and from a large population of engaged users. We do not want movies that were rated by only a small number of users, because those ratings are not reliable enough. Similarly, users who have rated only a few movies should not be taken into account.

Taking all of that into account, and after some trial-and-error experimentation, we will reduce the noise by applying the following filters to the final dataset.

  • To qualify, a movie must have been rated by more than 10 users.
  • To qualify, a user must have rated more than 50 movies.

Let’s visualize what these filters look like.

First, we count the number of users who rated each movie and the number of movies each user rated.

# Number of users who rated each movie
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
# Number of movies rated by each user
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')

Let’s visualize the number of users who voted with our threshold of 10.

f, ax = plt.subplots(1, 1, figsize=(16, 4))
# One point per movie; the red line marks the threshold of 10 votes
plt.scatter(no_user_voted.index, no_user_voted, color='mediumseagreen')
plt.axhline(y=10, color='r')
plt.xlabel('MovieId')
plt.ylabel('No. of users voted')
plt.show()

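Next, we apply the threshold. Here final_dataset is the rating matrix with one row per movie and one column per user, so filtering its rows keeps only the qualifying movies.

final_dataset = final_dataset.loc[no_user_voted[no_user_voted > 10].index, :]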

Let’s visualize the number of votes by each user with our threshold of 50.

f, ax = plt.subplots(1, 1, figsize=(16, 4))
# One point per user; the red line marks the threshold of 50 votes
plt.scatter(no_movies_voted.index, no_movies_voted, color='mediumseagreen')
plt.axhline(y=50, color='r')
plt.xlabel('UserId')
plt.ylabel('No. of votes by user')
plt.show()

We apply the user threshold in the same way, this time filtering the columns of final_dataset to keep only the qualifying users.

final_dataset = final_dataset.loc[:, no_movies_voted[no_movies_voted > 50].index]
final_dataset
Final Dataset
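The snippets above filter `final_dataset`, but the text does not show how that table is built. The sketch below assumes it is a pivot of the ratings table with movies as rows, users as columns and unrated pairs filled with 0. It then applies both filters to invented toy data with scaled-down thresholds (the variables `min_users_per_movie` and `min_movies_per_user` are introduced here for illustration).

```python
import pandas as pd

# Toy ratings, invented for illustration: 3 users, 3 movies.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Assumed construction of final_dataset: rows are movies, columns are users,
# and movie/user pairs without a rating are filled with 0.
final_dataset = ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)

no_user_voted = ratings.groupby("movieId")["rating"].agg("count")
no_movies_voted = ratings.groupby("userId")["rating"].agg("count")

# Scaled-down thresholds for the toy data (the article uses 10 and 50).
min_users_per_movie = 1
min_movies_per_user = 1

# Keep movies rated by more than min_users_per_movie users ...
final_dataset = final_dataset.loc[no_user_voted[no_user_voted > min_users_per_movie].index, :]
# ... and users who rated more than min_movies_per_user movies.
final_dataset = final_dataset.loc[:, no_movies_voted[no_movies_voted > min_movies_per_user].index]

print(final_dataset)
```

In this toy run, movie 30 (one rating) and user 3 (one rating) are dropped, leaving a 2 × 2 matrix of movies 10 and 20 against users 1 and 2.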

Making the movie recommendation system

I will build a content-based recommender that uses the movie overviews and taglines.

Now let’s use Cosine Similarity to calculate a numeric quantity that will denote the similarity between two movies.

Using Cosine Similarity
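The code for this step is not reproduced in the text. One common way to do it, and only an assumption about what the notebook does, is to turn each movie's overview and tagline into a TF-IDF vector with scikit-learn and then compare the vectors with cosine similarity. The toy movies below are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy metadata, invented for illustration.
movies = pd.DataFrame({
    "title": ["Space Quest", "Star Voyage", "Kitchen Tales"],
    "overview": [
        "An astronaut crew explores a distant planet.",
        "A starship crew travels to a distant planet.",
        "A chef opens a small restaurant in Paris.",
    ],
    "tagline": ["The planet awaits.", "Beyond the stars.", None],
})

# Combine overview and tagline into one text field; missing values become "".
description = movies["overview"].fillna("") + " " + movies["tagline"].fillna("")

# Each movie becomes a TF-IDF vector over the words in its description.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(description)

# Entry [i, j] is the cosine similarity between movie i and movie j.
cosine_sim = cosine_similarity(tfidf_matrix)
print(cosine_sim.round(2))
```

The two space movies share words such as "crew" and "planet", so they score higher with each other than with the cooking movie, which shares no terms with either.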

We now have a cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 10 most similar movies based on the cosine similarity score.

The function that returns similar movies based on the cosine similarity score is:

Function to recommend similar movies
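The function itself is not reproduced in the text. A minimal sketch of such a function, assuming a list of unique titles and a square similarity matrix whose rows follow the same order, could look like this. The name `get_recommendations`, its parameters and the toy matrix are hypothetical.

```python
import numpy as np


def get_recommendations(title, titles, cosine_sim, n=10):
    """Return the n titles most similar to `title`, best match first."""
    # Row of the similarity matrix that belongs to the query movie.
    idx = titles.index(title)
    # Positions sorted by descending similarity to the query movie.
    order = np.argsort(-cosine_sim[idx])
    # Drop the query movie itself, then keep the top n.
    order = [i for i in order if i != idx][:n]
    return [titles[i] for i in order]


# Toy similarity matrix, invented for illustration.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.0],
    [0.4, 0.3, 0.0, 1.0],
])
recs = get_recommendations("Movie A", titles, cosine_sim, n=2)
print(recs)
```

The query movie is excluded explicitly, because a movie always has similarity 1 with itself and would otherwise top its own list.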

Let us now get the top recommendations for a few movies and see whether these recommendations make sense.

Recommending similar movies

We see that for The Godfather, our system recommends movies such as Family and 8 Women.

Now let us try to get similar recommendations for another movie, for example The Dark Knight.

We see that for The Dark Knight, the system returns Batman and then other Batman films as its top recommendations.

Conclusion / Limitations :

We see that for The Dark Knight, our system is able to identify it as a Batman film and recommend other Batman films. Unfortunately, that is all this system can do at the moment. This is not of much use to most people, as it does not take into consideration very important features such as cast, crew, director, and genre, which determine the rating and the popularity of a movie. Someone who liked The Dark Knight probably liked it because of the actors in it and might well dislike Batman Forever and the other weaker movies in the Batman franchise.

Similarly, for The Godfather, the system recommended movies such as 8 Women and family films. However, The Godfather is a crime film, whereas 8 Women is a dark comedy.

Therefore, a better recommender should use much more suggestive metadata than the overview and tagline. This assignment can be extended by building a more sophisticated recommender that takes genre, keywords, cast, and crew into consideration.

References :

  1. https://www.sciencedirect.com/topics/computer-science/cosine-similarity#:~:text=Cosine%20similarity%20measures%20the%20similarity,document%20similarity%20in%20text%20analysis.
  2. https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/cosine/
