Analysis of Similarity in Movies

How can movie recommendations be improved based on movie similarity?

Colleen Wang
INST414: Data Science Techniques
5 min read · Apr 29, 2024


Stakeholders/Decisions informed:

As competition among streaming services increases, so does the importance of optimizing movie recommendations to enhance user experience. Better recommendations benefit companies in the entertainment industry by improving the likelihood that users will engage with the content suggested to them. Analyzing movie data and comparing movies across various features can help answer the question of how movie recommendations can be improved based on movie similarity. By identifying similar movies, platforms can suggest more relevant content to users, increasing the likelihood of user satisfaction and retention. This analysis is of particular interest to stakeholders in the streaming industry, such as data scientists in charge of enhancing recommendation algorithms for streaming services.

The answer to this question could inform decisions about improving user experience and optimizing the performance and efficiency of recommendation algorithms. By understanding movie similarity, stakeholders can refine the recommendation engine to suggest movies that align more closely with users’ preferences, thus improving engagement and satisfaction.

Data:

To answer the proposed question, a dataset of movies and their characteristics is essential. The data should include context such as when the movies were released and their genres. Attributes such as title, rank, number of reviews, and rating are also beneficial for the analysis. These fields are important because they allow movies to be compared with one another using metrics such as Euclidean distance or cosine similarity, which in turn determine the recommendations. Analyzing these fields and metrics can inform stakeholders’ decisions about improving algorithms and enhancing user engagement. I collected a subset of this data from Kaggle, a free resource for open datasets. The fields contained in this dataset are:

  • Genre
  • Rank
  • RatingTomatometer
  • Title
  • No. of Reviews

Similarity Measurement:

To measure similarity between movies, I used the features RatingTomatometer and No. of Reviews. These features provide insight into both the reputation and popularity of a movie among audiences. I compared movies using two metrics: Euclidean distance and cosine similarity. Euclidean distance is the straight-line distance between two points, in this case two movies. It is computed as the square root of the sum of the squared differences between the two movies’ RatingTomatometer values and between their No. of Reviews values, so a smaller distance means the movies are more similar. Cosine similarity is the cosine of the angle between two vectors, where each vector holds one movie’s feature values. It measures similarity based on the “direction” the vectors point rather than their magnitude, and values closer to 1 mean the movies are more similar.
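The two metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the repository for this assignment: the function names are hypothetical, and the two feature vectors are invented numbers rather than rows from the dataset.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical (RatingTomatometer, No. of Reviews) pairs, invented for illustration.
movie_a = (94.0, 500.0)
movie_b = (91.0, 504.0)

# Differences are 3 and -4, so the distance is sqrt(3**2 + 4**2) = 5.0.
print(euclidean_distance(movie_a, movie_b))
print(cosine_similarity(movie_a, movie_b))
```

A smaller Euclidean distance and a cosine similarity closer to 1 both indicate more similar movies, so the two scores have to be sorted in opposite directions when ranking.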

Top 10 Most Similar Movies:

Query 1: Avengers: Endgame (2019)

1. Us (2019) — Euclidean Distance: 6.0828

2. Captain Marvel (2019) — Euclidean Distance: 15.5242

3. A Star Is Born (2018) — Euclidean Distance: 19.4165

4. Black Panther (2018) — Euclidean Distance: 22.0907

5. Once Upon a Time In Hollywood (2019) — Euclidean Distance: 25.6320

6. Avengers: Infinity War (2018) — Euclidean Distance: 62.6498

7. Star Wars: The Last Jedi (2017) — Euclidean Distance: 64.0703

8. Wonder Woman (2017) — Euclidean Distance: 75.0067

9. Knives Out (2019) — Euclidean Distance: 79.0569

10. La La Land (2016) — Euclidean Distance: 80.0562

Query 2: Coco (2017)

1. The Farewell (2019) — Cosine Similarity: 0.999910

2. Finding Dory (2016) — Cosine Similarity: 0.999877

3. Battle of the Sexes (2017) — Cosine Similarity: 0.999850

4. Booksmart (2019) — Cosine Similarity: 0.999824

5. Inside Out (2015) — Cosine Similarity: 0.999796

6. Soul (2020) — Cosine Similarity: 0.999794

7. Moonlight (2016) — Cosine Similarity: 0.999704

8. Lady Bird (2017) — Cosine Similarity: 0.999648

9. Spider-Man: Into the Spider-Verse (2018) — Cosine Similarity: 0.999631

10. Get Out (2017) — Cosine Similarity: 0.999519

Query 3: Skyfall (2012)

1. The Martian (2015) — Euclidean Distance: 1.0000

2. I, Tonya (2018) — Euclidean Distance: 2.2361

3. The Lighthouse (2019) — Euclidean Distance: 2.8284

4. Doctor Strange (2016) — Euclidean Distance: 3.1623

5. Incredibles 2 (2018) — Euclidean Distance: 3.1623

6. Hereditary (2018) — Euclidean Distance: 3.6056

7. A Quiet Place (2018) — Euclidean Distance: 5.0000

8. It (2017) — Euclidean Distance: 6.7082

9. Booksmart (2019) — Euclidean Distance: 7.2111

10. Inside Out (2015) — Euclidean Distance: 7.8102
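Rankings like the ones above come from scoring every other movie against the query and keeping the best matches. The sketch below shows one way this could be done. The function `top_k_similar`, its parameters, and the four placeholder movies are assumptions made for illustration; they are not taken from the dataset or from the original code.

```python
import math


def top_k_similar(features, query, k=10, metric="euclidean"):
    """Return the k movies most similar to `query` as (title, score) pairs.

    `features` maps each title to a (RatingTomatometer, No. of Reviews) tuple.
    """
    if metric not in ("euclidean", "cosine"):
        raise ValueError("metric must be 'euclidean' or 'cosine'")
    q = features[query]
    scores = []
    for title, vec in features.items():
        if title == query:
            continue  # never recommend the query movie itself
        if metric == "euclidean":
            score = math.dist(q, vec)
        else:
            dot = sum(x * y for x, y in zip(q, vec))
            score = dot / (math.hypot(*q) * math.hypot(*vec))
        scores.append((title, score))
    # Smaller distance means more similar; larger cosine means more similar.
    scores.sort(key=lambda pair: pair[1], reverse=(metric == "cosine"))
    return scores[:k]


# Placeholder titles and invented feature values, not rows from the dataset.
features = {
    "Movie A": (94.0, 500.0),
    "Movie B": (91.0, 504.0),
    "Movie C": (60.0, 100.0),
    "Movie D": (47.0, 250.0),
}

print(top_k_similar(features, "Movie A", k=2))
print(top_k_similar(features, "Movie A", k=2, metric="cosine"))
```

In this toy data, "Movie D" has exactly half of "Movie A"’s values in both columns. Cosine similarity therefore ranks it first, while Euclidean distance ranks "Movie B" first. This shows that the two metrics can produce different top 10 lists from the same features.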

Figures:

Top 10 most similar movies to “Avengers: Endgame (2019)”
Top 10 most similar movies to “Coco (2017)”
Top 10 most similar movies to “Skyfall (2012)”

Answer to the Question:

Using the Euclidean distance and cosine similarity metrics, I conducted an analysis to identify the movies most similar to the target films I chose: “Avengers: Endgame (2019),” “Coco (2017),” and “Skyfall (2012).” By computing these similarity measures on the features RatingTomatometer and No. of Reviews, the analysis produced, for each target, the top 10 movies with the most similar audience reputation and popularity. These findings help answer the proposed question by showing stakeholders in the streaming industry one way to enhance recommendation algorithms so that they suggest movies that align more closely with users’ preferences. This can inform decisions about improving user satisfaction and engagement, which may in turn improve retention rates and overall platform performance.

Data Cleanup and Bugs:

To clean this dataset, I first removed the “%” symbol from the RatingTomatometer column and converted the values to floats so they could be used in the similarity calculations. I also removed duplicate movie entries by keeping only the first instance of each movie. This ensures that each top 10 list contains 10 different movies rather than repeated copies of the same one. A bug that others might encounter while analyzing this dataset is these duplicate entries. The dataset lists the same movie once for each genre it belongs to, so a movie can appear several times with identical numerical values, which distorts an analysis that relies on those values. While duplicate movies across genres would be useful for an analysis involving the genre column, for the purpose of this analysis they could cause confusion.
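The cleanup steps described above might look like the following pandas sketch. The column names come from the dataset’s field list, but the rows and the helper name `clean_movies` are invented for illustration, and the actual code in the repository may differ.

```python
import pandas as pd

# Invented rows that mimic the dataset's shape; "Movie A" is listed under two genres.
raw = pd.DataFrame(
    {
        "Title": ["Movie A (2019)", "Movie B (2017)", "Movie A (2019)"],
        "Genre": ["action", "animation", "sci-fi"],
        "RatingTomatometer": ["94%", "97%", "94%"],
        "No. of Reviews": [500, 350, 500],
    }
)


def clean_movies(df):
    """Convert the rating to a float and keep one row per movie title."""
    cleaned = df.copy()
    # "94%" -> 94.0 so the column can be used in distance calculations.
    cleaned["RatingTomatometer"] = (
        cleaned["RatingTomatometer"].str.rstrip("%").astype(float)
    )
    # Keep only the first occurrence of each title across genres.
    cleaned = cleaned.drop_duplicates(subset="Title", keep="first")
    return cleaned.reset_index(drop=True)


movies = clean_movies(raw)
print(movies)
```

Keeping the first occurrence means the surviving Genre value depends on row order, which is acceptable here only because genre is not used in the similarity calculation.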

Limitations of the Analysis:

The limitations of this analysis include the scope of the dataset. The dataset is limited to the top 100 movies in each genre, which excludes less popular but potentially relevant movies. The analysis is also limited in the features it uses to calculate similarity. Focusing on the RatingTomatometer and No. of Reviews columns provides a basis for the calculation based on movie popularity, but it overlooks contextual factors such as genre or release date. Reliance on the RatingTomatometer feature also means that ratings from other platforms are not reflected, so the scores might not represent the overall quality and popularity of each movie. The dataset’s overall reliance on Rotten Tomatoes scoring metrics and number of reviews biases the results toward popular movies, which dominate the similarity calculations due to their notability on the site. Another limitation is the lack of demographic data and individual user preferences, which would help stakeholders improve recommendation algorithms.

Here is a link to my GitHub repository that contains the code I have developed for this assignment: https://github.com/cwangg/INST414-Modules/tree/main/module-3
