Similarities In Movies

Ethan Meyer
INST414: Data Science Techniques
3 min read · Dec 9, 2023
Sample Connection

For this module assignment, I chose to analyze a dataset containing the highest-rated IMDb movies of all time. The dataset held approximately 40,000 movies, which I reduced to a sample of 10,000 movies with information about their revenue, genre, release_date, cast members, score (rating), budget, and other variables. While running the code, I fixed various formatting issues through debugging and normalization. Once the data was cleaned, checked for duplicates, and normalized, I was able to create a script modeled on the weekly assignments in class.
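As a rough illustration of the cleaning step described above, here is a minimal sketch, not the exact script from my repository. It assumes pandas, and the column names, the toy rows, and the use of random sampling to reach the target sample size are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows standing in for the raw movie data.
raw = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie C", None],
    "genre": ["Comedy", "Comedy", "Drama", None, "Action"],
    "score": [7.1, 7.1, 8.3, 6.0, 5.5],
})

SAMPLE_SIZE = 2  # stand-in for the 10,000 movies used in the project

# Remove exact repeats, then drop rows missing the fields the analysis needs.
cleaned = (
    raw.drop_duplicates()
       .dropna(subset=["title", "genre"])
       .reset_index(drop=True)
)

# Draw a reproducible sample (random sampling is an assumption here).
sample = cleaned.sample(n=min(SAMPLE_SIZE, len(cleaned)), random_state=0)
print(sample)
```

Fixing the random seed keeps the sample the same between runs, which makes debugging formatting problems easier.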

I built the movie network along the lines of the networks we worked with in class: each movie is a node, and an edge between two movies represents similarity based on shared features such as genre, score, budget, and revenue. Once I had isolated these variables as the most influential, I used scikit-learn's CountVectorizer, a tool commonly used in Natural Language Processing (NLP), to turn the features into vectors and then calculated the cosine similarity to measure the likeness between movies. The NLP techniques in my code were mainly used to examine the linguistic or categorical attributes like genre, cast, and release date. The cosine similarity function was also imported from scikit-learn, which streamlined the process once I knew what I was plugging in.
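To make the vectorize-then-compare idea concrete, here is a small sketch using scikit-learn's CountVectorizer and cosine_similarity. The movie records, the 0.5 threshold, and the choice to compare genre strings only are assumptions for illustration; the real script works on more features and far more rows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy records; titles and genres are made up for the example.
movies = [
    {"title": "Movie A", "genre": "Animation Comedy Family"},
    {"title": "Movie B", "genre": "Animation Family Adventure"},
    {"title": "Movie C", "genre": "Horror Thriller"},
]

# Treat each movie's genre string as a tiny document and count its tokens.
docs = [m["genre"] for m in movies]
counts = CountVectorizer().fit_transform(docs)

# Pairwise cosine similarity: 1.0 means identical token counts, 0.0 means no overlap.
similarity = cosine_similarity(counts)

THRESHOLD = 0.5  # assumed cutoff for drawing an edge between two movies

# Keep one edge per unordered pair whose similarity clears the threshold.
edges = [
    (movies[i]["title"], movies[j]["title"], float(similarity[i, j]))
    for i in range(len(movies))
    for j in range(i + 1, len(movies))
    if similarity[i, j] >= THRESHOLD
]
print(edges)
```

In this toy case only Movie A and Movie B share enough genre tokens to be connected, while Movie C has no overlap with either and stays isolated.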

An interesting result was that the films ‘The Book of Life’ and ‘The Ant Bully’ served as pivotal nodes because of the thematic similarities the analysis unveiled. These films were the nodes with the most neighbors because they connect across multiple genres and a wide spectrum of audiences, linking them to a variety of films regardless of genre or budget. This raises the question of whether these movies can be considered superior to the others for having the most similarities among the top movies of all time.
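Finding the nodes with the most neighbors comes down to counting how many edges touch each movie. A minimal sketch with a made-up edge list (the titles and connections are hypothetical, and this is not necessarily how my script does it):

```python
from collections import Counter

# Hypothetical edge list: each pair is two movies similar enough to be connected.
edges = [
    ("Movie A", "Movie B"),
    ("Movie A", "Movie C"),
    ("Movie A", "Movie D"),
    ("Movie B", "Movie C"),
]

# Each edge adds one neighbor to both of its endpoints.
degree = Counter()
for left, right in edges:
    degree[left] += 1
    degree[right] += 1

# Rank movies from most to fewest neighbors.
ranked = degree.most_common()
for title, neighbors in ranked:
    print(f"{title}: {neighbors} neighbors")
```

The movie at the top of the ranking plays the role of a pivotal node in the network.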

Another insight I was able to extract is that there are connections between popular movies that you would not think to consider. Beyond genres or directors, we look for movies that share thematic elements or resonate with audiences in similar ways, but there is more to it behind the scenes. A link between budget and success (score) might seem like an obvious insight, when in fact there are a considerable number of outliers: some big-budget films received high ratings, while some films with lower-than-average budgets received even higher ones. However, this does not account for inflation or for the fact that the longer a film has been out, the more it grosses, so older movies skew the correlation between budget and success.

In summary, this project underscores how effective techniques from Natural Language Processing, such as the CountVectorizer combined with cosine similarity, can be at revealing hidden patterns within diverse datasets. Using these methods, I was able to examine relationships between movies that were not visible before, which inspires me to always read between the lines and look for connections that may seem impossible at first.

Please refer to my GitHub, where the code is explained and the visualization can be seen.

Link to GitHub:
https://github.com/EthanMeyer41/414-Module-Assignments/blob/main/Mod%202.py
