Network Graphs of Actors Based on Popular Movies in Common

Kevin Chen
Web Mining [IS688, Spring 2021]
15 min read · Feb 20, 2021
Source: IMDb.com

Introduction

If you’ve watched a lot of movies, chances are you’ve seen the same actors in multiple movies. But how are all these actors and movies connected? To answer this question, I explored network graphs. More specifically, I treated each actor as a node and connected pairs of actors that had a popular movie in common, with the weight of each connection based on how many popular movies they shared. Initially, I created a network graph of the 100 highest-grossing actors in the 2020 U.S. box office to find which actors were the most connected to other actors based on popular movies they had in common. I then extended the analysis to a much larger number of actors, and although those results were limited in insight, my methods are an excellent starting point for more in-depth analysis.

1. Initial Data Collection

To collect the actor and movie data, I coded in Python in a Jupyter Notebook. I explored various movie database options, and I ultimately chose IMDb as my main source of data because it’s the world’s most popular and authoritative source for movie, TV, and celebrity content. I downloaded IMDb’s datasets from https://datasets.imdbws.com/.

First, I imported the libraries I needed: pandas, matplotlib, and networkx.
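The three imports look like this (the aliases are the conventional ones, not dictated by the article):

```python
import pandas as pd              # dataframes for the IMDb datasets
import matplotlib.pyplot as plt  # displaying the network graphs
import networkx as nx            # building and analyzing the graphs
```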

Next, I used the .read_csv() function from the pandas library to load the data from IMDb’s datasets as dataframes.
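IMDb’s downloads are gzipped TSV files that use `\N` as the null marker, so `.read_csv()` needs a tab separator. A minimal sketch using an in-memory sample in the same schema as IMDb’s name.basics file (the IDs and titles are abbreviated examples for illustration; on the real downloads you would pass the file path instead of the `StringIO` object):

```python
import io
import pandas as pd

# Tiny stand-in for IMDb's name.basics.tsv; the real file from
# https://datasets.imdbws.com/ is read the same way.
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tprimaryProfession\tknownForTitles\n"
    "nm0000375\tRobert Downey Jr.\t1965\tactor,producer\ttt0371746,tt4154796\n"
    "nm0424060\tScarlett Johansson\t1984\tactress,producer\ttt0848228,tt4154796\n"
)
names_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
```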

The first dataset I downloaded provides over 10 million celebrities from IMDb’s database.

Beginning of names_df
End of names_df

The other dataset I downloaded provides over 7 million titles from IMDb’s database.

Beginning of titles_df

In order to initially focus on a meaningful subset of data, I also copied and pasted the actor names from https://www.the-numbers.com/box-office-star-records/domestic/yearly-acting/highest-grossing-2020-stars into a CSV file, and loaded it for my data processing.

Beginning of top_actors_df

2. Initial Data Processing

To narrow the data down to the 100 highest-grossing actors in the 2020 U.S. box office, I first kept only actors that have titles they are known for.

I then converted the top_actors_df to a list and filtered actors_df based on whether the primaryName column values were in the top_actors list.
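The two filtering steps above can be sketched as follows (sample data; the column names follow IMDb’s name.basics schema, and the names shown are placeholders):

```python
import pandas as pd

names_df = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["Chris Evans", "Jane Doe", "Chris Hemsworth"],
    "knownForTitles": ["tt1,tt2", None, "tt1,tt3"],
})
top_actors = ["Chris Evans", "Chris Hemsworth"]  # from the top-actors CSV

# Drop actors with no known-for titles, then keep only the top actors.
actors_df = names_df.dropna(subset=["knownForTitles"])
actors_df = actors_df[actors_df["primaryName"].isin(top_actors)]
```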

However, some actors share a name with other people in the dataset, which the top_actors list can’t differentiate.

End of actors_df (has duplicate names)

I noticed that many of the duplicates did not have a valid birth year, likely because they are less well known, so I filtered those rows out.

There were still a few duplicates, so I found out which ones using pandas’ .duplicated() function and put them in a dataframe called name_duplicates. I chained .unique() and .tolist() to remove repeated entries from the list of duplicated names.

Duplicated names
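A sketch of the duplicate detection, on sample data: `.duplicated(keep=False)` marks every row whose name occurs more than once, and `.unique().tolist()` collapses the repeats into one entry per name.

```python
import pandas as pd

actors_df = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["John Smith", "John Smith", "Unique Name"],
})
dup_mask = actors_df["primaryName"].duplicated(keep=False)
name_duplicates = actors_df[dup_mask]
duplicated_names = name_duplicates["primaryName"].unique().tolist()
```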

Because there were only 7 of them, I manually looked the actors up in IMDb and copied and pasted their name IDs into a list called filtered_name_duplicates, and then I filtered name_duplicates based on whether the nconst column values were in filtered_name_duplicates.

filtered_name_duplicates_df

I dropped all duplicates in actors_df, and I combined actors_df and the filtered_name_duplicates dataframe.

actors_df was now missing 4 actors because the size was 96 instead of the expected 100 that top_actors has. So, I printed out the missing names.

Henry doesn’t have a birth year in IMDb, Lakeith and Isabela have different names in IMDb, and Tyler Perry doesn’t have actor as one of his primary professions in IMDb. Like with the duplicates, I manually looked up their name IDs on IMDb, and added them to actors_df. I had to use names_df to retrieve their name IDs because they were no longer in the filtered actors_df.

I also filtered the titles to only include movies.

The knownForTitles values were strings, so I used Python’s .split() function on commas to split each value into a list of titles, in order to compare or identify titles later. I applied this to every row using pandas’ .apply() function.
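The split looks like this (sample frame; the real column comes from name.basics):

```python
import pandas as pd

actors_df = pd.DataFrame({"knownForTitles": ["tt1,tt2", "tt3"]})
# Turn each comma-separated string into a Python list of title IDs.
actors_df["knownForTitles"] = actors_df["knownForTitles"].apply(
    lambda s: s.split(",")
)
```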

3. Initial Network Graph & Analysis

To iterate through and find values faster than with a dataframe, I created dictionaries.

Beginning of name_title_dict
Beginning of ID_name_dict
Beginning of ID_title_dict
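The three dictionaries can be built with `dict(zip(...))` over the relevant columns (sample data; the shapes are my reading of the article’s descriptions):

```python
import pandas as pd

actors_df = pd.DataFrame({
    "nconst": ["nm1", "nm2"],
    "primaryName": ["Chris Evans", "Chris Hemsworth"],
    "knownForTitles": [["tt1", "tt2"], ["tt1", "tt3"]],
})
titles_df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["Movie A", "Movie B", "Movie C"],
})

name_title_dict = dict(zip(actors_df["nconst"], actors_df["knownForTitles"]))
ID_name_dict = dict(zip(actors_df["nconst"], actors_df["primaryName"]))
ID_title_dict = dict(zip(titles_df["tconst"], titles_df["primaryTitle"]))
```

Dictionary lookups are O(1), which is why this is much faster in the later loops than repeatedly indexing into a dataframe.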

Next, I created an initial graph using networkx’s Graph() constructor, and another dictionary called edge_attribute_dict to store the weights of the graph.

I added a node, using .add_node() to the graph for each name ID. I then iterated through the titles to check if name IDs have titles in common, and if so, how many. Names with titles in common were added as edges using .add_edge(), and the weights were updated in edge_attribute_dict to reflect how many titles were in common.

In order for the networkx graph to read the weights, I reformatted the weight values in edge_attribute_dict, and I set the weights to the graph using .set_edge_attributes().
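Putting those steps together, a sketch of the graph construction on toy data (one node per actor ID, an edge for every pair with a known-for title in common, weighted by how many titles they share; the pairwise loop is my reconstruction of the described approach):

```python
import networkx as nx

name_title_dict = {
    "nm1": ["tt1", "tt2"],
    "nm2": ["tt1", "tt2"],
    "nm3": ["tt3"],
}

G = nx.Graph()
edge_attribute_dict = {}
for name_id in name_title_dict:
    G.add_node(name_id)

# Compare every pair of actors once (a < b avoids duplicates).
for a in name_title_dict:
    for b in name_title_dict:
        if a < b:
            shared = set(name_title_dict[a]) & set(name_title_dict[b])
            if shared:
                G.add_edge(a, b)
                edge_attribute_dict[(a, b)] = {"weight": len(shared)}

# With no attribute name given, networkx reads the dict-of-dicts format.
nx.set_edge_attributes(G, edge_attribute_dict)
```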

I used networkx’s .degree_centrality() function to retrieve the network graph’s degree centrality. Then, I printed the 10 actors with the highest degree centrality. Degree centrality is based on the number of edges the node has. Therefore, it shows which actors have the most direct movie connections, meaning the movies they are known for that are in common with the most other actors’ movies they are known for.
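The centrality calculation and top-10 ranking follow this pattern (toy graph for illustration):

```python
import networkx as nx

G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")])
centrality = nx.degree_centrality(G)  # node -> degree / (n - 1)

# Sort descending by centrality and keep the top 10.
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
```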

I also printed the movies they are known for. Because this was based on the knownForTitles column, I filtered out titles that weren’t in ID_title_dict, which only contains movies. As it turned out, these actors’ known-for titles were all movies anyway.

Finally, I created the graph using networkx’s .spring_layout() and .draw(). I adjusted the k and iterations parameters of .spring_layout() until the nodes were sufficiently spaced apart. I displayed it using matplotlib’s .show().
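A minimal version of the layout-and-draw step (the k, iterations, and seed values here are placeholders, not the ones I tuned to):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c")])
# k controls node spacing; raising it (and iterations) spreads nodes apart.
pos = nx.spring_layout(G, k=0.5, iterations=50, seed=42)
nx.draw(G, pos, with_labels=True, node_size=300)
plt.show()
```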

I repeated the centrality calculations and displays for closeness centrality, which is calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph. The highest closeness centralities are the actors that have the closest movie connections to all the other actors.

The last centrality metric I calculated and displayed was betweenness centrality, which, for each node, is based on the number of shortest paths between other pairs of nodes that pass through it. This indicates which actors serve as the most important bridges connecting other actors based on the movies they are most known for in common.

All 3 centrality measures have mostly the same actors in their top 10. I noticed that all of them have Marvel movies as some, or even all, of the movies they are most known for. This makes sense because the filtered data only includes the 100 highest-grossing actors in the 2020 U.S. box office, which includes a lot of actors known for Marvel movies. Also, Marvel is a cinematic universe, so it’s much more likely for those actors to be connected by movies in common. This does, however, emphasize how connected Marvel movie actors are, especially given how many edges are in the network graph.

4. Additional Data Processing, Network Graph, & Analysis

Since the network graphs I initially created were only for the 100 actors with the highest grossing in the 2020 U.S. box office, I was curious to explore all the actors. However, in my attempts to generate network graphs, I found that there was too much data. Therefore, I had to go back and filter out more and more actors until I could generate a small enough network graph.

I did not have to collect new data, only to filter the data I initially collected differently. I used the same code as in my initial data processing, excluding the filtering based on the top actors.

When I tried to split the knownForTitles column using the .apply() function, it did not finish after several minutes, so I tried doing the split during my network graph generation after resetting the network graph and edge_attribute_dict.

I used a print statement to keep track of how quickly it was iterating through the first loop that adds nodes. Based on the rate, it would’ve taken thousands of days to complete, so I decided to filter to actors with exactly 4 known-for titles, which still left over 800,000 actors. I filtered this way because it’s reasonable to expect actors known for more titles to have more connections, and with the same number of titles per actor, I could efficiently map them in a dictionary.

This time, I made a list that contains every knownForTitle, rather than making a list for each record. I did this by joining the elements of known_for_titles with a comma in between (the last title of each record doesn’t have a comma after it), and then splitting by commas.
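The join-then-split flattening looks like this:

```python
# Each element is one actor's comma-separated known-for titles.
known_for_titles = ["tt1,tt2", "tt3,tt1"]

# Join with commas, then split once to get one flat list of all titles.
all_known_for_titles = ",".join(known_for_titles).split(",")
```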

I created another dictionary that simply maps 0 for the 1st name ID, 1 for the 2nd name ID, 2 for the 3rd name ID, and so on.

I found a function, to find all the indices that contain each title, from https://stackoverflow.com/questions/6294179/how-to-find-all-occurrences-of-an-element-in-a-list.
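The helper, along the lines of the linked Stack Overflow answer, repeatedly calls `list.index()` with an advancing start offset until the value is exhausted:

```python
def indices(lst, element):
    """Return every index at which element occurs in lst."""
    result = []
    offset = -1
    while True:
        try:
            offset = lst.index(element, offset + 1)
        except ValueError:
            return result
        result.append(offset)
```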

Using these new variables and the helper function, I adjusted my code that adds nodes, edges, and weights to my network graph. Instead of iterating through name_title_dict a 2nd time, I used my indices helper function to find the indices of all_known_for_titles that equal the title. Using these indices, I determined the name ID using name_IDs_dict. Because all actors have 4 titles, I set the key equal to the index divided by 4 and converted it to an integer. For example, indices 0, 1, 2, and 3 are all known-for titles that belong to the 1st name ID; each of them divided by 4 and converted to an integer equals 0. By the same logic, indices 4, 5, 6, and 7 belong to the 2nd name ID because each of them converts to 1.
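The index-to-actor mapping reduces to integer division (the dictionary contents here are placeholders):

```python
# Ordinal position -> name ID (each actor contributes exactly 4 titles).
name_IDs_dict = {0: "nm1", 1: "nm2"}

# Flattened-list index 6 falls in the 2nd actor's block of 4 titles.
flat_index = 6
owner = name_IDs_dict[flat_index // 4]
```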

I again used a print statement to keep track of how quickly it was iterating through the first loop that adds nodes. Based on the rate, it still would’ve taken hundreds, instead of thousands, of days to complete, so I decided to iterate through all the titles, instead of through all the actors and their respective titles. By doing this, I was able to use the indices to create combinations of 2, using combinations() from the itertools library. These combinations were then used to create edges for all actors that have the title in common. Because I was no longer iterating through the name IDs to add the edges, I first iterated through the name IDs to add the nodes, to include nodes that don’t have edges. I then iterated through the set of all_known_for_titles, so that it would include all titles but wouldn’t repeat any of the same titles.
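A condensed sketch of the title-first approach, with the index lookups collapsed into a title-to-actors mapping for readability (toy data):

```python
from itertools import combinations

import networkx as nx

# Which actors list each title among their known-for titles (illustrative).
title_to_actors = {"tt1": ["nm1", "nm2", "nm3"], "tt2": ["nm1", "nm2"]}

G = nx.Graph()
# Add every actor first so actors without shared titles still get nodes.
G.add_nodes_from({a for actors in title_to_actors.values() for a in actors})

for title, actors in title_to_actors.items():
    # One edge per pair of actors sharing this title; weight counts shares.
    for a, b in combinations(actors, 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)
```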

This brought the expected runtime down to a few days. To lower it further, I excluded nodes that don’t have edges. This nearly halved the number of titles, because a title that appears only once can’t be shared between actors. I used Counter() from the collections library to count the number of times each title appears in all_known_for_titles, and I kept only those with a count higher than 1.
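The Counter filter is short:

```python
from collections import Counter

all_known_for_titles = ["tt1", "tt2", "tt1", "tt3"]
title_counts = Counter(all_known_for_titles)

# A title listed by only one actor can't create an edge, so drop it.
shared_titles = [t for t, c in title_counts.items() if c > 1]
```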

This code was able to complete in several hours. However, when I tried to draw and display the network graph, my Jupyter Notebook kernel died after several hours. After some googling, I realized that I still had too many nodes and edges. I restarted my computer and reran my code to try to at least get the centrality measures, but my Jupyter Notebook kernel again died for betweenness centrality, and then the same for closeness centrality.

I further filtered actors_df to actors born after 1970, to get younger actors more likely to have recent movies in common that I would recognize.

Then I reran the rest of my updated code, which finished within 30 minutes because there were over 80,000 actors rather than 800,000. I first displayed the degree centrality which was quick to calculate, but I didn’t recognize any of the actors with the highest centralities.

I printed the names of the movies they were most known for, which I did recognize.

I went to IMDb and manually found that they were mostly writers and stunt people in those movies. I noticed that those professions were listed first, and continued to see a pattern that famous actors had actor or actress as their first primary profession. Therefore, I further filtered actors_df to exclude those who don’t have actor or actress as their first primary profession.
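The first-profession filter can be expressed with pandas string accessors (sample data; primaryProfession is a comma-separated string in name.basics):

```python
import pandas as pd

actors_df = pd.DataFrame({
    "primaryName": ["A", "B", "C"],
    "primaryProfession": ["actor,producer", "stunts,actor", "actress"],
})

# Keep only rows whose FIRST listed profession is actor or actress.
first_prof = actors_df["primaryProfession"].str.split(",").str[0]
actors_df = actors_df[first_prof.isin(["actor", "actress"])]
```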

There were still over 70,000 actors, for which I again reran my updated code.

This time, I recognized a few of the actors, but there were still some names I had never heard of.

I printed the names of the movies they were most known for, which I again recognized. Looking the actors up in IMDb, I saw that they played minor roles in the films they are most known for. Unfortunately, I had no way to distinguish actors that played main roles using IMDb’s datasets. However, this is something that could be explored in the future using IMDb’s API, which provides the main actors for each movie, although only the top 3. Another option is to limit the dataset to a subset of popular actors, as was done for my initial network graphs. However, getting this data could be difficult because the names have to match what’s in IMDb, and duplicates need to be resolved by a name ID unless distinct birth years or ages are provided. IMDb has this data, but it doesn’t seem to be publicly available for free.

networkx wasn’t able to compute the closeness and betweenness centralities after several hours, but I decided not to make a smaller network graph because the only way I could reasonably filter further with IMDb’s datasets was by titles, which wouldn’t exclude the actors that played minor roles. Therefore, I concluded my analysis with a network graph visualization. I used networkx’s .write_graphml() function to write a .graphml file to import into Gephi, the leading visualization and exploration software for all kinds of graphs and networks.
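The export is one call (the output filename is an assumption):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("nm1", "nm2", weight=2)

# GraphML preserves node IDs and edge attributes, which Gephi reads directly.
nx.write_graphml(G, "actors_network.graphml")
```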

In Gephi, I used the Yifan Hu layout. I colored nodes as white to green based on their degrees, and I colored edges as white to red based on their weights. It also wasn’t able to calculate closeness and betweenness centralities after several hours.

Based on the highest degree centralities, there were only a few Marvel movies, compared to the mostly Marvel movies in my initial network graphs. This makes sense because the data was filtered to actors born after 1970, and there are still a lot of non-Marvel movies connecting those actors. The highest centralities don’t indicate that much about the actors themselves, because movies that a lot of actors are known for result in more connections regardless of an actor’s role in the movie. The movies behind the high centralities do, however, show which popular movies have the most actors that are in other popular movies. Also, it’s difficult to see the nodes and edges of the network graph, but there are several edges per node, which emphasizes how many movies actors with 4 known-for titles have in common.

Conclusion

Network graphs were very helpful in understanding how actors are connected based on popular movies in common. By calculating the centralities of each node, I was able to determine which actors are the most connected to others based on the known-for movies they have in common, and to see which popular movies have the most actors that are in other popular movies. The high number of edges per node also emphasized how many movies popular actors have in common. My main limitations were that, with my datasets, I couldn’t filter a large number of actors based on specific criteria, and each actor had at most 4 known-for movies. With better data and more in-depth analysis, there is a plethora of insights to discover by applying various filters to actors or movies, like actors that win awards or movies released during certain time periods.
