Using Jaccard Similarity to Measure Similar Movies Based on Actors

Casey Tabatabai
INST414: Data Science Techniques
May 11, 2022

In this assignment I choose three query movies and use them to find and rank other movies by the actors they share. The three query movies are Superbad, Harry Potter and the Goblet of Fire, and The Dark Knight. Comedy, fantasy, and action are my three favorite movie genres, so I picked one of my favorite movies from each genre that has multiple actors included in this dataset. The non-obvious insight that I want to extract from the data is which movies share the most actors with my query movies. Without significant research, it would be difficult to find out which movies are the most similar based on actors. These insights will give me a list of movies that I will likely enjoy.

The data source that I used for this assignment is the imdb_recent_movies dataset provided within the INST414 GitHub Repository. This dataset provides IMDb data on recent movies, including actor and genre information for thousands of movies. The metric that I use to determine similarity is Jaccard similarity, which is the ratio of intersection over union. Given two sets, Jaccard similarity is the number of values the sets share divided by the number of distinct values across both sets. Here, each movie is treated as the set of its actors, so two movies with identical casts score 1 and two movies with no actors in common score 0.
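The intersection-over-union definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the assignment: the function name `jaccard_similarity`, the placeholder actor names, and the choice to return 0.0 for two empty sets are all assumptions made for the example.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets are treated as having similarity 0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical casts; these names are placeholders, not rows from the dataset.
cast_x = {"actor_1", "actor_2", "actor_3"}
cast_y = {"actor_2", "actor_3", "actor_4"}

# 2 shared actors / 4 distinct actors overall
print(jaccard_similarity(cast_x, cast_y))  # 0.5
```

Note that the denominator counts each actor once, even if the actor appears in both casts.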

Jaccard Similarity for Superbad, Harry Potter and the Goblet of Fire, and The Dark Knight

Using the Jaccard similarity results above, I determined the top 10 most similar movies for each query with regard to the actors the movies share. Unsurprisingly, the most similar movie in each list is the query movie itself, since a movie shares its entire cast with itself. The results become more interesting after that first entry.

The results for Superbad were probably the most surprising to me. I had always assumed that its main actors, such as Bill Hader, Jonah Hill, and Michael Cera, had performed in many movies together because of the chemistry they displayed in Superbad. However, of the three query movies, Superbad's list featured the fewest related movies, as none of the main actors appeared in a movie together afterwards.

The list for Harry Potter and the Goblet of Fire produced the expected result: the top 8 movies listed are all from the Harry Potter series. The main actors from Harry Potter and the Goblet of Fire did not appear in any movies together outside of the series.

I found the results for The Dark Knight quite interesting because, unlike the previous two movies, its actors did appear together in movies outside of the query movie's series. Christian Bale appeared in one movie with Heath Ledger and in another with Michael Caine, both of which were outside the Batman series. Surprisingly, according to Jaccard similarity, both of those movies were more similar to The Dark Knight than Batman Begins, which is part of the same series.
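The top-10 ranking described above could be produced along the following lines. This is a sketch under stated assumptions: the `casts` dictionary, the movie titles, and the actor labels are hypothetical placeholders rather than rows from imdb_recent_movies, and `top_similar` is an illustrative helper name, not the assignment's actual code.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_similar(query, casts, k=10):
    """Rank every title in `casts` by Jaccard similarity to `query`'s cast."""
    query_cast = casts[query]
    scores = [(title, jaccard(query_cast, cast)) for title, cast in casts.items()]
    # Highest similarity first; ties are broken alphabetically by title.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Hypothetical placeholder data, not taken from the dataset.
casts = {
    "Query Movie": {"a1", "a2", "a3", "a4"},
    "Sequel": {"a1", "a2", "a3", "a5"},
    "Side Project": {"a1", "a6"},
    "Unrelated": {"a7", "a8"},
}

for title, score in top_similar("Query Movie", casts, k=3):
    print(f"{title}: {score:.2f}")
# Query Movie: 1.00
# Sequel: 0.60
# Side Project: 0.20
```

The query movie always ranks first with a score of 1.0 because it shares its whole cast with itself. Because the score depends on the size of the union as well as the overlap, a movie with a short listed cast can score relatively high from a single shared actor, which may be part of why a movie outside a series can outrank one inside it.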

One limitation of my data for this assignment is the exclusion of actress data. It was a bit confusing to see only actors in the dataset as many of the movies that I was interested in learning more about featured very famous actresses that played significant roles. For example, Emma Watson plays a massive role in the Harry Potter movies and has appeared in many other movies, so not having her included in the Jaccard similarity measure gives a much more limited perspective.

My main takeaway from this assignment is the knowledge I gained about different similarity measures. Although I only used Jaccard similarity in this assignment, while working through this module I also learned about a few other measures, such as cosine similarity and Euclidean distance.
