Exploratory analysis of IMDB movies and actors 2000–2020

Varun Sharma
INST414: Data Science Techniques
4 min read · Apr 11, 2022

In this post I will talk about an analysis I did on an IMDB dataset that covers actors and movies over a 20-year period. The one non-obvious insight I hope to show is that centrality and node size do not always equate to a more successful or better-known actor for future movie roles. This could help movie directors and producers focus more on the quality of an actor's performance in a movie and less on how many movies or connections that actor has. Ultimately, that could lead to better quality movies coming out!

To start the analysis, pictured above is the network I created. I intentionally filtered out any non-USA movies and the actors that came with them. Furthermore, since I wanted to deal with high centrality, I filtered the network so that the lowest degree shown here is 26. The maximum in the dataset was 42. Before moving on: centrality in this context is defined by the number of connections a person has to other people. It reflects which other actors they are connected to through the movies they shared roles in. This means the nodes in the network above are the actors themselves. For a node to be defined as “important” it just means it has a high degree of centrality, including betweenness centrality. These tend to be your more popular A-list actors. For example, Tom Sizemore, who played a large role in Saving Private Ryan, is in that network, and he is considered an A-list actor.
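To make the degree filter concrete, here is a minimal sketch in plain Python. The cast lists and actor names are made-up toy data, not rows from the IMDB dataset, and I am assuming that "degree" means the number of distinct co-stars an actor has. In my network the threshold was 26. Here it is 2 so the toy example has something to show.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: movie title -> cast list (not from the IMDB dataset).
casts = {
    "Movie 1": ["Actor A", "Actor B", "Actor C"],
    "Movie 2": ["Actor A", "Actor D"],
    "Movie 3": ["Actor A", "Actor B"],
}


def co_star_neighbors(casts):
    """Map each actor to the set of distinct actors they shared a movie with."""
    neighbors = defaultdict(set)
    for cast in casts.values():
        for a, b in combinations(cast, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def filter_by_degree(neighbors, min_degree):
    """Keep only actors whose degree (distinct co-stars) is at least min_degree."""
    return {
        actor: len(co_stars)
        for actor, co_stars in neighbors.items()
        if len(co_stars) >= min_degree
    }


neighbors = co_star_neighbors(casts)
degrees = {actor: len(co_stars) for actor, co_stars in neighbors.items()}
high_degree = filter_by_degree(neighbors, min_degree=2)

print(degrees)      # degree of every actor
print(high_degree)  # only the actors that survive the filter
```

Note that appearing together in two movies still counts as one connection here, which is one reason degree does not simply track how many movies someone has made.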

Above we can see a very important node pairing. I found it by using Gephi to filter for only the highest degree and checking which pair of nodes came out on top. This pair has the highest degree of centrality and the most connections to the rest of the network, and therefore to other people. Neither person in this pairing is an A-list actor, and both were foreign born and moved here to pursue Hollywood. The left node represents Shota Sometani and the right one represents Masahiro Higashide. They were both born in Japan and played large roles in movies in the United States during the time period covered. While they are popular actors with a large fan base, they are less widely known than other actors in the network such as Samuel L. Jackson, Nicolas Cage, and Tom Sizemore.

To continue, I will mention some of the software and code I used to get the network and the measures of centrality. Pictured above is the code I used to produce a CSV file containing all the actors and movies I wanted (remember, I wanted to filter out certain things!), which I could then load into another program called Gephi (code taken from this GitHub repository: https://github.com/cbuntain/umd.inst414/blob/main/Module02/03-Graphs.ipynb). The main package I used was NetworkX, a Python package that lets you create, manipulate, and manage your own networks from data you open, read, and then write out. I simply used NetworkX to produce a file I could load into Gephi, the program that allowed me to filter by centrality.
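For readers who want to try the export step themselves, here is a rough sketch of writing an edge list that Gephi's spreadsheet importer can read. This is not the code from the notebook linked above: it swaps in Python's built-in `csv` module instead of NetworkX, the cast lists are made-up toy data, and the `Source`, `Target`, and `Weight` column names follow the convention Gephi expects for edge tables.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical toy data: movie title -> cast list (not from the IMDB dataset).
casts = {
    "Movie 1": ["Actor A", "Actor B", "Actor C"],
    "Movie 2": ["Actor A", "Actor D"],
    "Movie 3": ["Actor A", "Actor B"],
}


def build_edge_weights(casts):
    """Count how many movies each pair of actors shared."""
    weights = Counter()
    for cast in casts.values():
        # Sort so that (a, b) and (b, a) are counted as the same undirected edge.
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return weights


def write_edge_csv(weights, handle):
    """Write one row per edge to an open text file or buffer."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(weights.items()):
        writer.writerow([a, b, weight])


weights = build_edge_weights(casts)

# Preview the CSV in memory. To save it for Gephi, pass a real file instead:
#   with open("actor_edges.csv", "w", newline="", encoding="utf-8") as f:
#       write_edge_csv(weights, f)
buffer = io.StringIO()
write_edge_csv(weights, buffer)
print(buffer.getvalue())
```

The file name `actor_edges.csv` in the comment is only a placeholder. Any filtering, such as dropping non-USA movies, would happen before the cast lists reach `build_edge_weights`.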

The code above is what printed out everyone’s betweenness centrality. It measures how often an actor sits on the shortest paths between other actors, in other words how many connections are made through them (code taken from this GitHub repository: https://github.com/cbuntain/umd.inst414/blob/main/Module02/03-Graphs.ipynb). A few bugs I ran into involved the file path and the creation of the TSV file. The file path was not working in VS Code, my original coding environment, so I switched over to Google Colab and imported the file straight into the environment, which fixed it. I fixed the file creation by using NetworkX's GraphML writer. This took my data and created a file I could load into Gephi.
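As a small illustration of what betweenness centrality captures, here is a sketch using NetworkX's `betweenness_centrality` function on a made-up five-node chain. The node names are hypothetical, and this is a toy example rather than the notebook code referenced above.

```python
import networkx as nx

# Hypothetical toy graph: "Hub" is the only link between two pairs of actors.
G = nx.Graph()
G.add_edges_from([
    ("Actor A", "Actor B"),
    ("Actor B", "Hub"),
    ("Hub", "Actor C"),
    ("Actor C", "Actor D"),
])

# Fraction of shortest paths between other nodes that pass through each node
# (normalized to the 0..1 range by default).
scores = nx.betweenness_centrality(G)
degrees = dict(G.degree())

# Print actors from highest to lowest betweenness, next to their degree.
for actor, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{actor}: betweenness={score:.3f}, degree={degrees[actor]}")
```

In this toy graph "Hub" has the same degree as "Actor B" and "Actor C" but a higher betweenness score, because more shortest paths run through it. That is the kind of gap between different notions of "important" that this post is about.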

To end this post I will talk about some limitations of the data and the takeaways I found. The biggest limitation was that the dataset did not label its nodes, so as you can see, the nodes don’t have the actors’ names above them. This forced me to go back to the IMDB dataset to figure out which node represented whom. It limited the amount of analysis I could do, because the poor overall organization of the data opened up a lot of ways to make mistakes. However, I think the main takeaway is my example of the 2 Japanese actors who have the highest centrality yet are less widely known than other actors in the data. This supports my original insight that having high centrality does not mean you are the most popular or even the best quality. We can see here that the 2 most “important” nodes are not A-list actors, even though the movies they appeared in were big enough for IMDB to include them in the dataset.

Notable Citations

https://github.com/cbuntain/umd.inst414/blob/main/Module02/03-Graphs.ipynb
