Ten Similar Films To Classics

Ethan Meyer
INST414: Data Science Techniques
3 min read · Dec 9, 2023

For my module assignment, I investigated a dataset of the most highly rated IMDB movies in history. The full dataset contains around 40,000 movies; I reduced it to a sample of 10,000, keeping key details such as revenue, genre, release_date, cast members, score (rating), budget, and other relevant variables. While running the code I ran into a variety of formatting problems, which I resolved through debugging and normalization. This is the same source dataset I used in the previous module assignment, but for this assignment I drew a different sample of 10,000 movies to analyze.
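A minimal sketch of what the cleaning and sampling step could look like with pandas is below. It is an illustration under assumptions rather than the code from the assignment: the function name `sample_movies`, the `genre` column used for the null check, and the fixed seed are all hypothetical choices.

```python
import pandas as pd


def sample_movies(df, n=10_000, seed=0, required_col="genre"):
    """Drop unusable rows, then draw a reproducible random sample.

    `required_col` is an assumed column name; rows missing it are removed
    because they cannot be compared on that feature later.
    """
    cleaned = df.dropna(subset=[required_col]).drop_duplicates()
    # Never ask for more rows than exist after cleaning.
    size = min(n, len(cleaned))
    return cleaned.sample(n=size, random_state=seed).reset_index(drop=True)
```

With a real file, this could be called as `sample_movies(pd.read_csv(path))`, where `path` points at the downloaded CSV. Fixing `random_state` makes the sample repeatable between runs, which helps when debugging.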

In my analysis, I used four features to capture how similar two movies are: ‘genre’, ‘score’, ‘budget’, and ‘revenue’, chosen from the many columns provided in the dataset. The rationale was to capture both the thematic and the financial side of each film: genre reflects a movie's theme and target demographic, while budget and revenue reflect its financial profile. These were considered alongside the success of each film as measured by its score.

To measure similarity (please refer to the code), I used the cosine similarity metric. Cosine similarity is particularly effective for text and content-based comparisons, which makes it suitable for this analysis when comparing the genre of each film. It calculates the cosine of the angle between two vectors, giving a normalized measure that ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction). For non-negative vectors such as word counts, the value falls between 0 (nothing in common) and 1.
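To make the metric concrete, here is a small standard-library sketch of the calculation. The function name `cosine` and the example vectors are made up for illustration; the assignment itself may rely on a library implementation instead.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as sharing nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical genre-count vectors over (animation, comedy, family):
film_x = [1, 1, 0]
film_y = [1, 0, 1]
similarity = cosine(film_x, film_y)  # one shared genre out of two each
```

Because the result depends only on the angle between the vectors, scaling a vector (for example, doubling every count) does not change its similarity to any other vector.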

Insight:

The results confirm some obvious matches: movies like ‘How to Train Your Dragon 2’ and ‘Zootopia’ are similar to the queried movie ‘Shrek’. These films are all animated and marketed to the same demographic. Looking closer, however, the list also includes non-animated movies and movies made with a quarter of Shrek's budget. Similarities with the movies Planes and The Book of Life suggest that an audience beyond the main demographic was reached. The same pattern appears for the other query movies, Dog and The Batman, where films with no apparent connection turn out to be linked by less visible factors in the data. This information would be valuable to a production studio for analyzing which demographics tend to watch which films, and which plot or style elements could attract existing viewers to new productions.

Network Analysis:

The software I used for this task consisted mainly of Python libraries: matplotlib, pandas, and scikit-learn. The CSV files were easy to load with the read_csv function, which streamlined manipulation and debugging. NumPy was used for the numerical operations, and the CountVectorizer class was imported to handle the categorical columns in the dataframe.
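The sketch below shows one way these libraries can fit together to rank movies by similarity. It is not the assignment's code: the helper `top_similar`, the column names `title` and `genre`, and the four toy rows are invented for illustration and do not come from the IMDB dataset. For simplicity it compares only a text column; numeric columns such as budget or revenue would need their own encoding or scaling.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows invented for illustration; not taken from the real dataset.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genre": [
        "Animation Comedy Family",
        "Animation Adventure Family",
        "Crime Thriller",
        "Crime Drama Thriller",
    ],
})


def top_similar(df, query_title, text_col="genre", n=10):
    """Return up to n titles most similar to query_title on text_col."""
    matches = np.flatnonzero((df["title"] == query_title).to_numpy())
    if len(matches) == 0:
        raise KeyError(f"title not found: {query_title}")
    query_pos = int(matches[0])

    # Turn each row's text into a vector of token counts.
    counts = CountVectorizer().fit_transform(df[text_col])
    # Pairwise cosine similarity between every pair of rows.
    sims = cosine_similarity(counts)

    # Sort positions from most to least similar, skipping the query itself.
    ranked = [int(i) for i in sims[query_pos].argsort()[::-1] if i != query_pos]
    return df["title"].iloc[ranked[:n]].tolist()


closest = top_similar(movies, "Film A", n=1)
```

In this toy data, "Film A" shares two of its three genre tokens with "Film B" and none with the crime films, so "Film B" ranks first.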

Conclusion and Analysis:

The use of Natural Language Processing (NLP) techniques, including the CountVectorizer, allowed for an efficient examination of textual and categorical attributes, capturing the essence of movie similarities beyond surface-level features.

The results and the output of the terminal are as follows, showing the ten most similar movies for Batman, Shrek, and Dog.

Top 10 most similar movies to ‘The Batman’ based on selected columns:
Wrath of Man
Frank and Penelope
Missing
10 Days of a Good Man
Marlowe
Se7en
The Hanging Sun
The Darker the Lake
Nightcrawler
The Man from Nowhere

Top 10 most similar movies to ‘Shrek’ based on selected columns:
The Book of Life
How to Train Your Dragon 2
Onward
Zootopia
Marmaduke
Planes
Planes: Fire & Rescue
The Super Mario Bros. Movie
Mummies
Puss in Boots: The Last Wish

Top 10 most similar movies to ‘Dog’ based on selected columns:
Beetlejuice
One Day
The Devil Wears Prada
Once Upon a Time… in Hollywood
Spoiler Alert
EverAfter
The Passion of the Christ
Creed
I’m Not Ashamed
Malena

Although this information could be valuable to a film company at a surface level, the reasons behind some of these connections cannot be explained by the metrics and variables used in this assignment. The same approach could be applied on a larger scale toward the same goal, but my assignment only scratches the surface.

Link to Github:
https://github.com/EthanMeyer41/414-Module-Assignments/blob/main/Mod%203.py
