Querying for Similar Content in Web-Based Data: An Analysis of Similarity Metrics in Movie Recommendations

Jasmi Kevadia
INST414: Data Science Techniques
Mar 23, 2023

Introduction:

The aim of this post is to explore how similarity metrics can be used to query for similar content in web-based data. Specifically, I examine the case of movie recommendations and how similarity metrics can identify movies that are similar to a particular movie of interest. The non-obvious insight I want to extract from this data is which other movies are most similar to a given movie, based on measurable features rather than intuition.

Data source and features:

The dataset I use for this analysis is the MovieLens dataset, which contains over 100,000 user ratings of movies on a scale of 1 to 5. I use movie genres, actors, and ratings as features to determine similarity between movies. As the similarity metric I use cosine similarity, which measures the cosine of the angle between two vectors: a value of 1 means the vectors point in the same direction, and a value of 0 means they are orthogonal.
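
To make the metric concrete, here is a minimal sketch of cosine similarity written in plain Python. The function name and the choice to return 0.0 for an all-zero vector are my own assumptions for illustration, not part of the original analysis.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score (approximately) 1.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
# Orthogonal vectors score 0.
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Because the metric depends only on direction, scaling a vector by a positive constant does not change its similarity to any other vector.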

To convert these features into vectors, I used one-hot encoding for genres and actors. For ratings, I used the raw numerical values provided in the MovieLens dataset. I then concatenated the one-hot encoded genre vector, the one-hot encoded actor vector, and the rating to create a single feature vector for each movie. This allowed me to represent each movie as a point in a high-dimensional space.
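
The encoding step can be sketched roughly as follows. The movie records, titles, actor names, and the helper `multi_hot` are hypothetical and made up for illustration. I also assume one rating value per movie (for example, a mean rating), which the description above does not specify.

```python
import numpy as np

# Hypothetical toy records; none of these values come from MovieLens.
movies = [
    {"title": "Movie A", "genres": ["Action", "Sci-Fi"],
     "actors": ["Actor X", "Actor Y"], "rating": 4.2},
    {"title": "Movie B", "genres": ["Action", "Sci-Fi"],
     "actors": ["Actor Y", "Actor Z"], "rating": 3.9},
    {"title": "Movie C", "genres": ["Comedy"],
     "actors": ["Actor W"], "rating": 3.1},
]

def multi_hot(values, vocabulary):
    """1.0 for each vocabulary entry present in values, else 0.0."""
    return [1.0 if item in values else 0.0 for item in vocabulary]

# Fixed, sorted vocabularies so every movie uses the same column order.
genre_vocab = sorted({g for m in movies for g in m["genres"]})
actor_vocab = sorted({a for m in movies for a in m["actors"]})

# Concatenate genre columns, actor columns, and the rating into one row per movie.
features = np.array([
    multi_hot(m["genres"], genre_vocab)
    + multi_hot(m["actors"], actor_vocab)
    + [m["rating"]]
    for m in movies
])
```

One thing to keep in mind with this layout: the rating column ranges up to 5 while the encoded columns are 0 or 1, so the rating can carry more weight in the cosine similarity than any single genre or actor.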

Choice of Features:

I chose genres, actors, and ratings as features to determine similarity between movies because these are common factors that many users consider when selecting a movie to watch. Genres and actors are two key factors that contribute to the overall feel and tone of a movie, while ratings provide insight into the popularity and quality of a movie among viewers.

Query items and ranking:

The query items are the movies “The Matrix,” “The Dark Knight,” and “Inception.” Using cosine similarity, I ranked the top 10 most similar movies to each one. The ranking for “The Matrix” is shown below:
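
The ranking step can be sketched like this. The function `top_k_similar`, the placeholder titles, and the feature values are hypothetical; the sketch also assumes that no feature vector is all zeros.

```python
import numpy as np

def top_k_similar(query_title, titles, features, k=10):
    """Return up to k (title, score) pairs most similar to query_title."""
    # Normalize each row to unit length so a dot product equals cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = titles.index(query_title)
    scores = unit @ unit[q]
    # Sort by descending score and drop the query movie itself.
    order = np.argsort(-scores)
    return [(titles[i], float(scores[i])) for i in order if i != q][:k]

# Made-up feature vectors: two genre columns, one actor column, one rating.
titles = ["Movie A", "Movie B", "Movie C"]
features = np.array([
    [1.0, 1.0, 0.0, 4.0],
    [1.0, 1.0, 0.0, 3.5],
    [0.0, 0.0, 1.0, 2.0],
])

ranking = top_k_similar("Movie A", titles, features, k=2)
```

Here "Movie B" ranks above "Movie C" because it shares the encoded genre columns with the query, while "Movie C" overlaps only through the rating column.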

Table displaying cosine similarities for the Matrix

Software used for Data cleaning:

I used Python and the Pandas library to load and manipulate the MovieLens dataset. There were no major issues with data cleaning in this analysis, as the MovieLens dataset was already relatively clean and did not contain any missing values or outliers.

I used the scikit-learn library to perform cosine similarity calculations on the feature vectors. Scikit-learn provides an implementation through the cosine_similarity function in the sklearn.metrics.pairwise module. I used this function to calculate the cosine similarity between each pair of movies in the MovieLens dataset. This gave me a matrix of pairwise cosine similarities, which I then used to rank the most similar movies to each query item.
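
A minimal sketch of the pairwise calculation with scikit-learn is below. The feature matrix is a made-up stand-in for the real one, with one row per movie.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature matrix: one row per movie, one column per feature.
features = np.array([
    [1.0, 1.0, 0.0, 4.0],
    [1.0, 1.0, 0.0, 3.5],
    [0.0, 0.0, 1.0, 2.0],
])

# With a single argument, cosine_similarity compares every row with every other row.
similarity = cosine_similarity(features)

# Row i holds the similarity of movie i to all movies, including itself.
scores_for_first_movie = similarity[0]
```

The result is a square, symmetric matrix with ones on the diagonal, since every movie is perfectly similar to itself.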

Findings:

The analysis shows that the most similar movies to “The Matrix,” “The Dark Knight,” and “Inception” are mostly within the same genre and have similar actors. For example, the most similar movies to “The Matrix” are all sci-fi action movies that feature similar actors to “The Matrix,” such as Arnold Schwarzenegger and Sigourney Weaver.

Limitations and Biases:

One limitation of the analysis is that it only used three query items, so the results may not be applicable to other movies or genres. Additionally, the analysis is specific to the MovieLens dataset and may not carry over to other movie datasets or recommendation systems. Finally, I only used three features (genres, actors, and ratings) to determine similarity between movies, so there may be other important features that I did not consider.

Another aspect to consider when using similarity metrics for movie recommendations is the role of user ratings. In the MovieLens dataset, users rate movies on a scale of 1 to 5, and these ratings feed into the similarity between movies. However, not all users have the same taste in movies, and some users consistently rate higher or lower than others. This can bias the recommendation system, as movies that are highly rated by some users may not be good recommendations for users with a different rating pattern.
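
To illustrate the rating-scale issue, the sketch below uses two made-up users who agree on which movie is weakest but use different parts of the scale. Subtracting each user's mean rating is one common way to compare such users; it is an assumption for illustration and was not part of this analysis.

```python
from statistics import mean

# Hypothetical ratings: user -> {movie: rating}. Values are made up.
ratings = {
    "generous_user": {"Movie A": 5, "Movie B": 4, "Movie C": 5},
    "strict_user": {"Movie A": 3, "Movie B": 2, "Movie C": 3},
}

def mean_center(user_ratings):
    """Subtract the user's own average so ratings reflect relative preference."""
    avg = mean(user_ratings.values())
    return {movie: rating - avg for movie, rating in user_ratings.items()}

centered = {user: mean_center(r) for user, r in ratings.items()}
```

After centering, both users have the same relative preferences, even though their raw ratings differ by two points on every movie.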

Conclusion:

In conclusion, the analysis demonstrates how similarity metrics can be used to query for similar content in web-based data. It showed how cosine similarity can rank movies by their similarity to a particular movie of interest, and it identified the top 10 most similar movies to “The Matrix,” “The Dark Knight,” and “Inception.” However, the analysis is limited by the fact that it only used a small subset of features to determine similarity between movies.

Github: https://github.com/jasmi01/INST414Exercises/blob/main/assignment3
