Ranking Movies by Similarity with Python

Yang Lu
INST414: Data Science Techniques
Mar 11, 2022

My data was a list of movies and their actors, gathered from the umd.inst414 GitHub repository. The insight I wanted to gain from these data was whether or not the actors influence a movie's financial success. The original data did not include financial information, so I looked up the gross for each resulting movie.

The similarity metric was the Euclidean distance between each movie in the dataset and the query movie, with the actors as the features. The distance is computed by the code below, using the libraries json, pandas, scipy, and sklearn.

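    import pandas as pd
    from sklearn.metrics import pairwise_distances

    # reduced_df and matrix_reduced come from the dimensionality-reduction step in the linked notebook
    # Find the row position of the query movie
    query_idx = [idx for idx, m in enumerate(reduced_df.index) if m == "query_movie"][0]
    query_v = matrix_reduced[query_idx, :]

    # Euclidean distance from every movie to the query movie
    distances = pairwise_distances(matrix_reduced, [query_v], metric="euclidean")
    distances_df = pd.DataFrame(distances, columns=["distance"])

    # Print the 20 closest movies, nearest first
    for idx, row in distances_df.sort_values(by="distance", ascending=True).head(20).iterrows():
        print(idx, reduced_df.iloc[idx].name, row["distance"])

    # code from https://github.com/cbuntain/umd.inst414/blob/main/Module03/04-Dimensionality.PCA.ipynb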

The first query movie I used was “Avengers: Endgame”, which ranks second among the top lifetime-grossing movies. The resulting top 20 most similar movies are pictured below.

The most similar movies, with a distance of 0.0, are “Avengers: Infinity War” and “Avengers: Age of Ultron”. At the time of this post, “Avengers: Age of Ultron” is in 12th place and “Avengers: Infinity War” is in 5th (boxofficemojo). It is possible that these grosses are due to branding and not to the actors themselves. This branding possibility is further supported by the next couple of similar movies: “The Detective Is in the Bar” grossed $15,404,986, and there is barely any gross information for “Bato: The General Ronald dela Rosa Story”.

However, one problem with this result is that “The Detective Is in the Bar” is a Japanese movie, while “Avengers: Infinity War” is American. Few, if any, actors appear in both movies. From the JSON data, “Avengers: Endgame” has Robert Downey Jr., Chris Evans, Mark Ruffalo, and Chris Hemsworth. “The Detective Is in the Bar” has Yô Ôizumi, Ryûhei Matsuda, and Toshiyuki Nishida. This invalidates the result obtained from the Euclidean distance code, as the features of the supposedly similar movies do not match up.
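As a quick sanity check of this point, here is a minimal sketch that computes the Euclidean distance on raw 0/1 actor vectors, using only the actors named above. It is an illustration under stated assumptions, not the notebook's method: the helper `actor_vector` is hypothetical, and the vectors here are not the reduced matrix used in the code above.

```python
import math

# Casts limited to the actors named in this post (the full dataset may list more).
casts = {
    "Avengers: Endgame": {"Robert Downey Jr.", "Chris Evans", "Mark Ruffalo", "Chris Hemsworth"},
    "The Detective Is in the Bar": {"Yô Ôizumi", "Ryûhei Matsuda", "Toshiyuki Nishida"},
}

# One feature per actor, in a fixed order.
all_actors = sorted(set().union(*casts.values()))


def actor_vector(cast):
    """Hypothetical helper: 1 if the actor is in the cast, else 0."""
    return [1 if actor in cast else 0 for actor in all_actors]


vectors = {title: actor_vector(cast) for title, cast in casts.items()}

# Distance of a movie to itself, and to a movie with a completely different cast.
same = math.dist(vectors["Avengers: Endgame"], vectors["Avengers: Endgame"])
different = math.dist(vectors["Avengers: Endgame"], vectors["The Detective Is in the Bar"])

print(same)       # 0.0
print(different)  # sqrt(7): the two vectors differ in all 7 positions
```

On raw indicator vectors, two movies that share no actors cannot be at distance 0.0. Whether the reduced matrix behaves the same way was not checked in this post.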

Since the Euclidean distance results did not match up with the actual casts, I used another similarity measure, Jaccard similarity, with target_movie being the previously queried movie.

    # Assumes each movie's "actors" value is a Python set
    distances = []
    target_actors = target_movie["actors"]

    for movie in movie_actor_list:
        these_actors = movie["actors"]

        # Jaccard similarity: shared actors divided by all actors across both movies
        numer = len(target_actors.intersection(these_actors))
        denom = len(target_actors.union(these_actors))
        jaccard_sim = numer / denom

        distances.append({
            "movie": movie,
            "similarity": jaccard_sim,
        })

    # code from https://github.com/cbuntain/umd.inst414/blob/main/Module03/01-Similarity.ipynb
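To make the score concrete, here is a small worked sketch of Jaccard similarity. The target cast uses the four “Avengers: Endgame” actors named earlier; the other movies and the names “Actor X”, “Actor Y”, and “Actor Z” are hypothetical and not taken from the dataset, and `jaccard_similarity` is an illustrative helper rather than part of the course code.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


target_actors = {"Robert Downey Jr.", "Chris Evans", "Mark Ruffalo", "Chris Hemsworth"}

# Hypothetical casts, for illustration only.
hypothetical_movies = {
    "Movie A": {"Mark Ruffalo"},                        # 1 shared / 4 total
    "Movie B": {"Mark Ruffalo", "Actor X", "Actor Y"},  # 1 shared / 6 total
    "Movie C": {"Actor Z"},                             # 0 shared / 5 total
}

scores = {
    title: jaccard_similarity(target_actors, cast)
    for title, cast in hypothetical_movies.items()
}

# Most similar first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for title, score in ranked:
    print(title, round(score, 3))
```

A single shared actor yields a score of 0.25 only when the other cast adds no new names, so the score depends on cast size as well as on overlap.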

The results were obtained for the top 10 most similar movies.

Ignoring the top 6 as they are of the same brand/production company:

Gross Worldwide data gathered from IMDb

The worldwide grosses of the 3 movies with a Jaccard similarity score of 0.25 span a wide range, and “View From The Top” and “The Kids Are All Right” differ by $15,232,937 despite both featuring Mark Ruffalo. Given this, it is reasonable to conclude that actors do not have a large impact on the gross profit of a movie.

However, just to confirm this conclusion, I did another query with “The Emperor’s New Groove”.

Gross Worldwide data gathered from IMDb

From the data obtained, the main takeaway is that there appears to be little correlation between a movie's actors and its gross profit.

The problems I encountered during this investigation were mostly gaps in my own knowledge. For example, I did not know that the JSON file had to be read line by line into a dataframe before it was usable.
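For readers who hit the same issue, here is a minimal sketch of reading a file that holds one JSON object per line. The field names “title” and “actors” and the helper `read_json_lines` are assumptions for illustration, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json

# Stand-in for an open file with one JSON object per line.
# The "title" and "actors" field names are assumed, not confirmed by the dataset.
sample_file = io.StringIO(
    '{"title": "Movie A", "actors": ["Actor X", "Actor Y"]}\n'
    '{"title": "Movie B", "actors": ["Actor Y"]}\n'
)


def read_json_lines(handle):
    """Parse each non-empty line as its own JSON object."""
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


movie_actor_list = read_json_lines(sample_file)
print(len(movie_actor_list))  # 2
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)`. pandas can also read this layout directly with `pd.read_json(path, lines=True)`.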

While this investigation suggests that there is little correlation between actors and gross profit, the analysis is limited in that it did not examine whether “Avengers: Endgame” was an edge case for the Euclidean distance code. Another limitation is that it shows, but does not really prove, that actors are not a statistically significant factor in determining the gross profit of a movie.
