Recommending Movies to Users using Cosine Similarity

Kwalanj
Web Mining [IS688, Spring 2021]
11 min read · Feb 28, 2021
Memories of Murder (2003)

There is nothing more satisfying than watching a GOOD film. A lot of effort goes into conveying a story, building an atmosphere, establishing characters, and so on, and few films, whatever their genre, achieve that level of craft. Memories of Murder (2003), a Korean crime drama by Bong Joon-ho (Parasite) based on the true story of a string of serial killings that rocked a rural community in the 1980s, makes you relive those incidents over 2 hrs 12 mins without wasting time on useless exposition. Quentin Tarantino called it one of the best films of the 21st century, and David Fincher used it as an inspiration for his 2007 film Zodiac.

Ever since I was a child, I have had a strong affinity for movies. Appreciating film is like a taste palate: it evolves as you get older. As a child, I only cared for fast food. Get me a Happy Meal and I was the happiest kid. Now, nothing could be further from the truth. Fast food is the last option on my list when picking where to eat out. Trying different cuisines is an adventure of its own. I would never have thought I would enjoy Japanese, Ethiopian, or Spanish food the way I do now. The same goes for movies.

Enter the Dragon (1973)

Back then, I didn't appreciate movies the way I do now. I only cared for the action. Movies by Bruce Lee, Jackie Chan, and JCVD were my go-to watches. I remember, as a child in India, going to watch Minority Report (2002) not for its story but for its action. The movie is a magnificent thriller driven by its world-building, characters, and story; the action is just a small part of it. I only learned to appreciate that on my second viewing, once I reached adulthood.

Goal

Collaborative vs. Content-Based Filtering

My project uses three query entities of interest, generating a list of the top 10 recommended movies from each.

  1. Title
  2. Genre
  3. User Ratings

These sorts of recommender systems utilize either collaborative or content-based filtering. Recommending movies based on other users' reviews is a form of collaborative filtering: a method of making automatic predictions about the interests of a user by collecting preference or taste information from many users. It considers users' opinions of different movies and recommends the best movies to each user based on that user's previous rankings and the opinions of other, similar users. Recommending movies based on a title or a genre is content-based filtering. These methods rely on a description of the item and a profile of the user's preferences, and are best suited to situations where there is known data on an item (name, location, description, etc.) but not on the user. Movies that might otherwise go under the radar could garner new fans this way.

I used the MovieLens dataset for my project.

Children of Men (2006)

MovieLens, a movie recommendation service, offers a dataset for educational purposes. It contains 100,836 ratings and 3,683 tag applications across 9,742 movies, created by 610 users between March 29, 1996 and September 24, 2018. The dataset was generated on September 26, 2018; I downloaded it on February 22, 2021. All randomly selected users had rated at least 20 movies. The dataset contains 4 CSV files.

  1. links.csv
  2. movies.csv
  3. ratings.csv
  4. tags.csv

Data Cleaning

The data was pretty clean for the most part. There were some instances of duplicate movie id values, and some movie titles contained special characters; both were removed.

Method Used- Cosine Similarity

Pulp Fiction (1994)

SO WHAT NOW?

I used the cosine similarity metric to achieve my results, with the scikit-learn and NumPy libraries in Python 3 within a Jupyter Notebook.

Cosine Similarity Formula

Cosine similarity measures the similarity between two vectors of an inner product space. It is the cosine of the angle between the two vectors, cos(θ) = (A · B) / (‖A‖‖B‖), and captures whether they point in roughly the same direction. For non-negative vectors like ours, this similarity score ranges from 0 to 1, with 0 being the lowest (least similar) and 1 the highest (most similar).

Cosine similarity makes decisions easy: the nearest points are the most similar, and the farthest points are the least relevant. Similarity itself is subjective and highly dependent on the domain and application. For example, two movies can be similar because of genre, length, or cast. The shared properties are represented as vectors, and the similarity is the cosine of the angle between them.
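To make the formula concrete, here is a minimal sketch in NumPy. The "genre weight" vectors are invented purely for illustration; they are not from the dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two movies described by made-up weights for (action, crime, thriller).
heat = np.array([1.0, 1.0, 1.0])
the_rock = np.array([1.0, 0.0, 1.0])

print(round(cosine_similarity(heat, the_rock), 3))  # ≈ 0.816
```

The two vectors share two of three genres, so the angle between them is small and the score lands close to 1.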

Title & Genre

The first entity we analyzed using cosine similarity is the title. This approach uses the movie titles in the dataset to make recommendations. Before applying the metric, we must convert the titles, which are text, into numbers so that similarity can be calculated. This is where we utilize Term Frequency-Inverse Document Frequency (TF-IDF), an information retrieval method, for feature extraction.

TF-IDF

TF is simply the frequency of a word in a document. IDF is the inverse of the word's document frequency across the whole corpus. The main motivation for TF-IDF is this: suppose we search for "the results of football games 2020" on Google. "The" will certainly occur more frequently than "football games," but "football games" matters more from the point of view of the search query. In such cases, TF-IDF weighting negates the effect of high-frequency words when determining the importance of a document.

Vector space model

We now have to determine the closeness of these titles to one another. This can be accomplished using the vector space model which can compute the proximity based on the angle between the two vectors. Each item is stored as a vector of its attributes in an n-dimensional space and the angles between the points are calculated to determine similarities.

The above vector space model diagram makes the model easy to understand. Sentence 2 is more strongly associated with term 2 than with term 1, and sentence 1 more strongly with term 1 than with term 2. The measure is the cosine of the angle between the sentence vectors: as the angle decreases, the cosine increases, implying greater similarity.

I proceeded to build a content-based recommender that computes similarities between movies based on movie titles and genres, drawing on the ratings by the 610 users in the dataset. We started off by importing the cosine similarity function and created an "item based" (title) and a "genre based" (genre) recommendation function.

Title and Genre functions
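The functions themselves appeared as a screenshot; here is a minimal sketch of the idea. The three-row stand-in for movies.csv and the `recommend_by_column` helper are my own constructions, not the article's code, but they follow the same recipe: TF-IDF vectors over a text column, then cosine similarity against the queried movie.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in for movies.csv; column names follow the MovieLens format.
movies = pd.DataFrame({
    "title": ["Heat (1995)", "The Rock (1996)", "Toy Story (1995)"],
    "genres": ["Action|Crime|Thriller", "Action|Adventure|Thriller",
               "Adventure|Animation|Children"],
})

def recommend_by_column(movie_title, column, n=10):
    """Rank movies by cosine similarity of TF-IDF vectors of `column`."""
    # Tokenize on "|" and whitespace so multi-genre strings split cleanly.
    tfidf = TfidfVectorizer(token_pattern=r"[^|\s]+").fit_transform(movies[column])
    sim = cosine_similarity(tfidf)
    idx = movies.index[movies["title"] == movie_title][0]
    order = sim[idx].argsort()[::-1]          # most similar first
    order = [i for i in order if i != idx][:n]  # drop the movie itself
    return movies.iloc[order]["title"].tolist()

print(recommend_by_column("Heat (1995)", "genres", n=2))
```

With these toy rows, The Rock (sharing Action and Thriller) ranks above Toy Story (sharing nothing), which mirrors the behavior described below.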

I created another function (diagram below) that encapsulates the title- and genre-based functions and displays the results in a more appealing format.

Function encapsulating the Title and genre functions

We used Heat (1995), an action/crime/thriller, as the item for the cosine similarity calculations when recommending the top 10 movies.

Title and Genre based results

As we can see, the title with the closest similarity was The Rock (1996), and the least similar of the 10 results was Independence Day (1996). The recommender also surfaced genres similar to our inputted title: all of the top 10 results based on genre similarity had a value of 1. They were all action/crime/thriller movies, starting with Dirty Harry (1971). Similar to the title recommender, I added a new layer that finds movies with similar genres and then selects the best rating similarities; to do that, I added a new column containing the genre cosine similarity.

User Ratings

How do cosine similarity and user ratings work? Let’s use our dataset as an example.

User 1 has rated 4 superhero movies.

Movie 1: Avengers: Endgame 4/5

Movie 2: The Dark Knight 5/5

Movie 3: Deadpool 3.5/5

Movie 4: Logan 4.5/5

Logan (2017)

User 2 comes along having watched the first 3 movies and given them similar scores. He hasn't, however, watched movie 4. Cosine similarity picks that up and determines that user 2 would most probably enjoy Logan just as much as user 1 did, based on their similar scores.
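The worked example can be run through the formula directly. User 1's scores are from the list above; user 2's scores are assumed for illustration.

```python
import numpy as np

# Ratings for the three movies both users have seen:
# Avengers: Endgame, The Dark Knight, Deadpool.
user1 = np.array([4.0, 5.0, 3.5])
user2 = np.array([4.5, 5.0, 3.0])  # user 2's scores are invented here

similarity = np.dot(user1, user2) / (np.linalg.norm(user1) * np.linalg.norm(user2))
print(round(similarity, 3))
```

A score this close to 1 is what lets the recommender infer that user 2 would likely rate Logan about as highly as user 1 did.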

I wanted to start off by merging some of the csv files together for better analysis. I joined movies.csv and ratings.csv so I could analyze my data in one dataframe.

Data Merge
Top 10 movies with their rating counts

The above diagram lists the 10 movies with the highest number of reviews in the dataset. It's not surprising, as all of those movies are fairly popular globally.
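The merge-and-count step might look like the sketch below. In the project the frames come from pd.read_csv("movies.csv") and pd.read_csv("ratings.csv"); tiny stand-in frames are used here so the sketch runs on its own.

```python
import pandas as pd

# Stand-ins for movies.csv and ratings.csv (real code would use pd.read_csv).
movies = pd.DataFrame({"movieId": [1, 2],
                       "title": ["Toy Story (1995)", "Jumanji (1995)"]})
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [1, 2, 1],
                        "rating": [4.0, 3.0, 5.0]})

# One row per (user, movie) rating, with the title attached.
df = ratings.merge(movies, on="movieId")
# Movies ranked by how many reviews they received.
counts = df["title"].value_counts().head(10)
print(counts)
```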

Ratings range from 0 to 5. We counted unique reviews by users per movie id in the dataset. We can see from the images below that users tend to give 4-star reviews more than any other score.

Rating range count
Bar graph of rating ranges

We created a pivot table with pandas, with each movie id as a column and each user as a row. Cells with "NaN" have not been reviewed by that particular user. User 1 gave movie id 1 (Toy Story (1995)) a 4.0 out of 5.0, while movie id 2 (Jumanji (1995)) has a value of "NaN" because user 1 did not review it. We will implement our metrics on this pivot table.
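Building that user-by-movie matrix is one pandas call; again, a small stand-in frame replaces ratings.csv so the sketch is self-contained.

```python
import pandas as pd

# Stand-in for ratings.csv (real code would use pd.read_csv).
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [1, 2, 1],
                        "rating": [4.0, 3.0, 5.0]})

# Rows are users, columns are movie ids; unrated cells become NaN.
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating")
print(matrix)
```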

Cosine Similarity

I created a parent function called recommend_movie_by_user that utilizes the cosine ratings function I created. The parent function has three input parameters.

  1. user - the user id of the user
  2. method - the name of the metric
  3. n_recommend - the number of movies to recommend
Parent function
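The parent function appeared as a screenshot; below is one way it might look. This is my own sketch, not the article's code: it fills missing ratings with 0 (the very caveat discussed later in the Data analysis section), computes user-to-user cosine similarity, and scores a user's unseen movies by a similarity-weighted sum of other users' ratings. Only the cosine method is sketched.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in ratings; in the project this comes from ratings.csv.
ratings = pd.DataFrame({"userId": [1, 1, 1, 2, 2],
                        "movieId": [1, 2, 3, 1, 2],
                        "rating": [4.0, 5.0, 3.5, 4.5, 5.0]})
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating")

def recommend_movie_by_user(user, method="cosine", n_recommend=10):
    """Recommend unseen movies for `user`, weighting other users by similarity."""
    filled = matrix.fillna(0)  # missing ratings treated as 0 (a known caveat)
    sim = pd.DataFrame(cosine_similarity(filled),
                       index=matrix.index, columns=matrix.index)
    weights = sim[user].drop(user)                   # similarity to other users
    scores = filled.drop(index=user).T.dot(weights)  # weighted sum of ratings
    unseen = matrix.loc[user].isna()                 # keep only unrated movies
    return scores[unseen].sort_values(ascending=False).head(n_recommend)

print(recommend_movie_by_user(2, n_recommend=3))
```

With the toy data, user 2 has not seen movie 3, so that is the only candidate, scored by user 1's 3.5 rating weighted by the two users' similarity.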

Below is the output of the top 10 movies recommended based on user ratings.

The results can differ greatly from user to user because of the method's collaborative nature.

Data analysis

Akira (1988)

Our data was, for the most part, quite clean. No fuzzy matching or over-the-top cleaning was needed to make it presentable. Recommender engines can be hit or miss, and it is hard to rely on user scores all the time: a movie one user likes might be hated by another, and vice versa, despite the similar tastes the data might show. Repeated occurrences like this can definitely skew the movies recommended to a given user.

Movies with a high count of user reviews are much more likely to be recommended, and 610 reviewers is quite a small pool to base recommendations on. What is the likelihood of a movie like Akira (1988) being recommended to someone interested in cyberpunk fantasy animation? Only 38 of the 610 reviewers rated that movie, just 6% of this dataset's population. This drastically increases the chances of mainstream popular movies being recommended while smaller, lesser-known ones never see the light of day.

Title, genre, and rating attributes are sometimes not enough for accurate results. One user might be a fan of English-language action movies like The Magnificent Seven (1960), Unforgiven (1992), or Tombstone (1993). Another user could be a fan of the same genre but open to foreign films as well. Recommending a Japanese action film like 13 Assassins (2010) to the first user would not necessarily interest them, due to the language preference. Using the score alone as the determining factor opens up a vast pool of choices that might not actually be useful.

Drive (2011)

I recently watched Bridgerton, a Netflix TV show. Ever since, I have been barraged with sensual romantic dramas, which are not necessarily my cup of tea. I wonder how much our metrics could be skewed by one show disturbing an otherwise uniform trend. Drive (2011), an excellent film with one of the most memorable soundtracks to date, was marketed in its trailers as a fast-paced, high-octane heist film. It is, however, a slow-paced indie neo-noir crime thriller. An average viewer who went in hoping for the next Fast & Furious would definitely be disappointed. The genres of the two films are similar, but the content is quite different. Such a viewer's rating would be lower than another viewer's, and that sentiment would largely skew the quality of the data.

Burning (2018)

From my analysis of the metrics, cosine similarity effectively treats missing values as zero, which can read as a negative signal. It can also give less accurate results because it considers only the angle between rating vectors, not how each user's ratings relate to their average ratings.

The user review entity we analyzed had The Shawshank Redemption (1994) as a recommended movie. Is that because the metric calculated it as a clear choice for user #25, or just because it is a movie with a large count of reviews, since null values are ignored? It's hard to say! Similarly, I was recommended F/X2 (a.k.a. F/X 2 — The Deadly Art of Illusion) (1991), a movie with the same genres as Heat (1995), yet nowhere close to it in quality. Such broad swaths of comparison make for a quantity-over-quality result.

A shortcoming of user-review-based recommendations is that they require user input before they can suggest movies. This is called the cold start problem, because beginning the recommendation process requires previous data from users. A newly launched e-commerce website, for example, suffers from the cold start problem because it doesn't have many users and lacks a variety of opinions. Similarity based on titles and genres doesn't have this problem, because it only requires product information (title) and the user's preference (genre). Netflix, for example, sidesteps the issue by asking new subscribers what they like when they sign up.

In the end, a recommender system like this is a helpful way to get exposure to new movies from like-minded users who share your interests, whether through user reviews, genres, or titles. However, it cannot be expected to be exact in its results or in the metrics used to analyze them.
