A rough comparison of the Marvel and DC movies

Abigail Chen
INST414: Data Science Techniques
4 min read · May 11, 2022

Insight

As a huge movie fan, I have gone to the cinema every month to see at least four of the latest movies ever since AMC launched its A-List membership.

The type of movie that has driven fans around the world crazy for the last decade is undoubtedly the superhero movie, and Marvel and DC are the giants of the genre. Fans of both companies, and fans of the original comics, constantly debate which one makes the better superhero movies. Even though each fan has their own favorite, I wanted to extract an insight from IMDB and Rotten Tomatoes ratings to see how most fans evaluate these movies.

It may also inform casual moviegoers' choice of what to watch, and thus help the movie industry and movie theaters, which were hit hard financially by the COVID-19 pandemic, recover gradually.

The process of data analysis

My data comes from a Marvel and DC movie dataset compiled by a Kaggle user named HETUL MEHTA. This dataset contains most of the information that I would look up before walking into a movie theater, such as the ratings given by professional critics on Rotten Tomatoes. It also includes vital data that I would consider when selecting older movies, such as IMDB ratings and North American box office. The dataset includes the following fields:

Movie: Title

Year: Year of Release

Genre: Genre

Runtime: running time

Rating: Certificate or Rating

Director: Director

Actor: Actor and Actresses

Description: plot

IMDB score: IMDB Score

Metascore: Metascore

Votes: No. of Votes in IMDB

USAGross: Gross collection in USA

Category: Marvel or DC

After importing the data into Python, I first read it into a pandas DataFrame. To avoid making a premature judgment based on one specific variable, I kept the default row order and did not sort the data yet. I also used the pairwise_distance() function, which allowed me to build a pairwise distance matrix from the default information in the data.
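The step from a table of movies to a distance matrix can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the rows are placeholder values rather than real ratings, the column names are assumed from the field list above, and the distance is computed directly with NumPy (scikit-learn's `pairwise_distances` could be used for the same step). Scaling each column first is also an assumption, added so that large numbers such as vote counts do not dominate the result.

```python
import numpy as np
import pandas as pd

# Placeholder rows, NOT real ratings. In practice the frame would come from
# pd.read_csv(...) on the downloaded Kaggle file; column names are assumed.
df = pd.DataFrame({
    "Movie": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "IMDB score": [9.0, 8.4, 6.1, 8.8],
    "Metascore": [84, 78, 45, 82],
    "Votes": [2_000_000, 900_000, 300_000, 1_800_000],
    "USAGross": [530.0, 850.0, 120.0, 450.0],
})

numeric_cols = ["IMDB score", "Metascore", "Votes", "USAGross"]


def distance_matrix(frame, cols):
    """Euclidean distance between every pair of rows, on z-scored columns."""
    values = frame[cols].to_numpy(dtype=float)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    scaled = (values - values.mean(axis=0)) / std
    # Broadcasting builds an (n, n, features) array of row-to-row differences.
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


dist = distance_matrix(df, numeric_cols)
print(pd.DataFrame(dist, index=df["Movie"], columns=df["Movie"]).round(2))
```

Each entry of the matrix is the distance between two movies, so the diagonal is zero and smaller values mean more similar movies.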

First of all, many movie fans, including me, have heard of the sensational change that The Dark Knight brought to superhero movies. Back then, numerous fans flooded IMDB's website and gave The Godfather and The Shawshank Redemption a score of 1 to boost The Dark Knight's ranking to first place. In the results, the movies with the highest similarity are the other two Batman movies directed by Christopher Nolan, and DC movies in general are also more similar to Batman.

Then I used Avengers: Endgame as a query movie because it is currently the highest-grossing superhero movie. As expected, the other three Avengers movies have a very high degree of similarity to it. Interestingly, Marvel Cinematic Universe movies completely dominate the top ten most similar movies. This may reflect how consistent Marvel's movies have been over the last decade: beyond the actors' repeated appearances, their box office and word of mouth are highly concentrated. The result also shows Robert Downey Jr.'s powerful box office appeal.
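The query-movie lookups described above amount to sorting one row of a distance matrix and keeping the closest entries. Here is a small sketch of that idea. The function name `top_k_similar`, the titles, and the hand-written distances are all hypothetical and only there to show the mechanics.

```python
import numpy as np


def top_k_similar(titles, dist, query, k=10):
    """Return up to k (title, distance) pairs closest to `query`, excluding the query itself."""
    q = titles.index(query)
    # Indices of all movies, ordered from the smallest distance to the largest.
    order = np.argsort(dist[q], kind="stable")
    return [(titles[i], float(dist[q, i])) for i in order if i != q][:k]


# Hypothetical titles and distances, for illustration only.
titles = ["Query film", "Sequel", "Unrelated film"]
dist = np.array([
    [0.0, 0.5, 3.0],
    [0.5, 0.0, 2.8],
    [3.0, 2.8, 0.0],
])

print(top_k_similar(titles, dist, "Query film", k=2))
```

With a real matrix, the same call with k=10 would give the ten nearest movies to the chosen query.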

Problems and Bugs

The most serious difficulty I encountered was that my data file kept failing to be recognized as UTF-8. In the end, I solved it by re-downloading the data at the school library and uploading it to my Jupyter notebook. Also, the professor's code example uses the JSON format, while my data is in CSV format, so working out how to calculate the matrix from CSV data took repeated attempts and a lot of searching for examples on the internet.
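I never pinned down the exact cause of the UTF-8 error, but one common cause is a CSV file saved in a different encoding. Under that assumption, a workaround is to try a few encodings in turn until one decodes cleanly. The sketch below uses only the standard library; the function name `read_csv_rows`, the list of encodings, and the sample bytes are hypothetical.

```python
import csv
import io


def read_csv_rows(raw_bytes, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Decode CSV bytes with the first encoding that works; return (rows, encoding)."""
    for enc in encodings:
        try:
            text = raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        rows = list(csv.DictReader(io.StringIO(text)))
        return rows, enc
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample: an accented name saved as cp1252 is not valid UTF-8.
raw = "Movie,Director\nTest Movie,José\n".encode("cp1252")
rows, used = read_csv_rows(raw)
print(used, rows)
```

pandas accepts the same idea through the `encoding` argument of `pd.read_csv`, so the fallback loop can be wrapped around that call instead.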

Conclusion

I learned to do a comprehensive similarity analysis using specific data and to apply the matrix concept. Of course, because I was short on time, I think the results I came up with are very different from what I had imagined. From this analysis, I also realized that I need to consider different variables to make an accurate analysis. So far, I am not able to make good judgments, but I believe that if I continue to use Python in my work in the future, this will be a practical skill, even though I will also face greater challenges.
