Network analysis of movie reference connections

Did you ever notice that the opening scene of Raiders of the Lost Ark is similar to the opening scene of Akira Kurosawa’s Yojimbo? Both show the silhouette of the main character, shot from behind, overlooking a mountainous expanse. Did you also notice that both Home Alone and Raiders of the Lost Ark have bad guys who burn a hand and place it in the snow?

Screenshot of references between Yojimbo, Raiders of the Lost Ark, and Home Alone, taken from the Wasserman paper.

For what it’s worth, I didn’t, but homages such as these happen quite frequently in cinema, and Max Wasserman has noticed. Wasserman, a Ph.D. in Applied Math, came to speak at the Computational Social Science department of the Krasnow Institute about measuring the significance of movies through their reference networks, in a talk titled ‘The IMDb Film Connections Network and Objective Evaluation of Movie Significance’.

After reflecting on the best-known application of social network analysis, networks of academic citations, Wasserman proposed doing the same thing with movie references. The dataset he used was the set of connections that IMDb provides for each movie on the site. On the site it is pretty hard to find: in the right column of the movie’s page, under “Quick Links”, click “Explore More”, then click “Connections”, which is located under “Did you know?”. Example: the Raiders of the Lost Ark connections page.

An example of part of the reference network studied by Wasserman.

Wasserman harvested this dataset to build a network of the references that movies make to one another. It is a directed network in which the nodes are films and the edges are connections. He included references, features, and spoofs. Sequels are not connected, because that would be overwhelming. TV shows are excluded. Short films and documentaries are included. There are no ‘forward’ chronological citations: for a citation to be included, the referenced film has to have been made earlier. There are also no same-year citations, because they would add noise to the dataset, for example from behind-the-scenes documentaries such as “the making of…”. In the end, films from the United States dominated the network.
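The inclusion rules above can be sketched in a few lines of Python. This is only an illustration of the filtering logic as described in the talk, not Wasserman’s actual code: the `Film` class, the `kind` labels, the connection-type strings, and the two placeholder entries in the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Film:
    title: str
    year: int
    kind: str  # assumed labels: "movie", "short", "documentary", "tv"


# Connection types kept as edges (assumed string labels).
KEPT_TYPES = {"references", "features", "spoofs"}


def build_network(films, connections):
    """Return a set of directed edges (citing_title, cited_title).

    films: dict mapping title -> Film
    connections: iterable of (citing_title, connection_type, cited_title)
    """
    edges = set()
    for citing, conn_type, cited in connections:
        if conn_type not in KEPT_TYPES:
            continue  # e.g. sequel links are not connected
        a, b = films.get(citing), films.get(cited)
        if a is None or b is None:
            continue
        if a.kind == "tv" or b.kind == "tv":
            continue  # TV shows are excluded
        if b.year >= a.year:
            continue  # no forward or same-year citations
        edges.add((citing, cited))
    return edges


films = {
    "Yojimbo": Film("Yojimbo", 1961, "movie"),
    "Raiders of the Lost Ark": Film("Raiders of the Lost Ark", 1981, "movie"),
    "Home Alone": Film("Home Alone", 1990, "movie"),
    # Hypothetical placeholder entries, only here to exercise the filters.
    "Placeholder Making-Of": Film("Placeholder Making-Of", 1981, "documentary"),
    "Placeholder TV Show": Film("Placeholder TV Show", 1995, "tv"),
}
connections = [
    ("Raiders of the Lost Ark", "references", "Yojimbo"),
    ("Home Alone", "references", "Raiders of the Lost Ark"),
    ("Placeholder Making-Of", "features", "Raiders of the Lost Ark"),  # same year
    ("Placeholder TV Show", "spoofs", "Home Alone"),  # TV show
]
network = build_network(films, connections)
```

With this sample data only the two real homages survive as edges; the same-year making-of and the TV show are filtered out.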

An example of how the null model generation worked.

To test whether the structure of the network is meaningful and not just something that happened by chance, Wasserman generated a null model by swapping edges at random while holding each node’s in-degree and out-degree constant. A swap was not made if it would create a ‘forward’ edge (you can’t reference something in the future) or a duplicate edge. There were 800,000 iterations per simulation. After generating two null models, Wasserman found the dataset to be unbiased.
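A degree-preserving edge swap of this kind could look roughly like the sketch below. It is my reading of the procedure described in the talk, not the code Wasserman used: the function names, the toy films `A` to `E`, and the choice to count rejected swaps toward the iteration total are assumptions.

```python
import random
from collections import Counter


def swap_edges(edges, year, iterations, seed=0):
    """Randomize a citation network while keeping every node's degrees fixed.

    edges: iterable of (citing, cited) pairs
    year: dict mapping film -> release year
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(iterations):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Exchange the targets: a->b, c->d becomes a->d, c->b.
        new1, new2 = (a, d), (c, b)
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        if year[d] >= year[a] or year[b] >= year[c]:
            continue  # would create a forward or same-year citation
        edge_set.difference_update({(a, b), (c, d)})
        edge_set.update({new1, new2})
        edge_list[i], edge_list[j] = new1, new2
    return edge_set


def degrees(edges):
    """Return (out-degree, in-degree) counters for a set of edges."""
    out_deg = Counter(citing for citing, _ in edges)
    in_deg = Counter(cited for _, cited in edges)
    return out_deg, in_deg


# Toy network with hypothetical films A..E.
year = {"A": 1950, "B": 1960, "C": 1970, "D": 1980, "E": 1990}
observed = {("E", "A"), ("D", "B"), ("C", "A"), ("E", "C"), ("D", "A")}
randomized = swap_edges(observed, year, iterations=1000, seed=42)
```

Because each accepted swap only exchanges the targets of two edges, every film keeps the same number of outgoing and incoming citations, which is what makes the randomized network a fair baseline for comparison.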

What did Wasserman find?

Illustration of citation lag between movies.

Wasserman looked at several attributes of the dataset. The first was citation lag: the time between the release of a movie and the release of a later movie that cites it. In the case of Home Alone citing Raiders of the Lost Ark, this was nine years. After graphing the citation lags, Wasserman discovered a distinct peak at 25 years. A movie has a much higher probability of being referenced in another film around its 25th anniversary.

Citation lag distribution. There is a peak at 25 years, which means that films tend to be cited around the 25-year mark.
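Computing citation lag is simple arithmetic on release years. The sketch below uses the three films from the opening example; the function name and data layout are my own, chosen just for illustration.

```python
from collections import Counter


def citation_lags(edges, year):
    """Return the lag in years for each (citing, cited) edge."""
    return [year[citing] - year[cited] for citing, cited in edges]


year = {"Yojimbo": 1961, "Raiders of the Lost Ark": 1981, "Home Alone": 1990}
edges = [
    ("Raiders of the Lost Ark", "Yojimbo"),
    ("Home Alone", "Raiders of the Lost Ark"),
]
lags = citation_lags(edges, year)  # 1981 - 1961 = 20, 1990 - 1981 = 9
lag_histogram = Counter(lags)      # how many citations fall at each lag
```

On the full dataset, plotting `lag_histogram` is what would reveal a peak like the one Wasserman reported.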

Another quality Wasserman looked at was genre. He found that serious action movies do not cite comedies, though they have little reason to, while there is a long tradition of comedies citing action movies. Genre therefore plays a role in citation preference, according to Wasserman.

This is a comparison of outgoing vs incoming citations by genre. The ones circled are the most lopsided genre relationships.
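One simple way to look for lopsided genre relationships is to count citations for each ordered pair of genres and compare the two directions. The sketch below assumes each film has a single genre label, which is a simplification (IMDb lists several genres per film), and the function name is hypothetical.

```python
from collections import Counter


def genre_flows(edges, genre):
    """Count citations for each (citing_genre, cited_genre) pair."""
    flows = Counter()
    for citing, cited in edges:
        flows[(genre[citing], genre[cited])] += 1
    return flows


# Simplified single-genre labels, assumed for this example.
genre = {"Home Alone": "comedy", "Raiders of the Lost Ark": "action"}
edges = [("Home Alone", "Raiders of the Lost Ark")]
flows = genre_flows(edges, genre)

comedy_to_action = flows[("comedy", "action")]
action_to_comedy = flows[("action", "comedy")]  # Counter returns 0 if absent
```

A large gap between `comedy_to_action` and `action_to_comedy` on the full network would correspond to the kind of asymmetry circled in the figure.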

Wasserman also considered film significance in multiple ways. One was to compare the dataset against the National Film Registry. The National Film Registry (NFR), housed at the Library of Congress, is a set of about 650 films considered culturally important enough to officially preserve. For this comparison he only counted long-gap citations, meaning citations made when the cited movie was at least 25 years old. It is interesting to see which movies have made it into the NFR, what year they made it, and how many long-gap citations they have.
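Counting long-gap citations per film could be sketched as follows. The 25-year threshold comes from the talk, but treating it as “citation lag of at least 25 years” is my interpretation, and the citing film named `Hypothetical Homage` is invented for the example.

```python
from collections import Counter

LONG_GAP_YEARS = 25  # threshold mentioned in the talk


def long_gap_citation_counts(edges, year, min_gap=LONG_GAP_YEARS):
    """Count, per cited film, the citations made at least min_gap years later."""
    counts = Counter()
    for citing, cited in edges:
        if year[citing] - year[cited] >= min_gap:
            counts[cited] += 1
    return counts


year = {
    "Yojimbo": 1961,
    "Raiders of the Lost Ark": 1981,
    "Hypothetical Homage": 1990,  # invented film, for illustration only
}
edges = [
    ("Raiders of the Lost Ark", "Yojimbo"),  # lag 20: not a long gap
    ("Hypothetical Homage", "Yojimbo"),      # lag 29: long gap
]
long_gap = long_gap_citation_counts(edges, year)
```

Ranking films by this count and checking how many of the top entries appear in the NFR is one way such a comparison could be set up.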

There are a lot more interesting nuances to this research; I have only highlighted a few. I suggest you check out the paper: ‘Correlations between user voting data, budget, and box office for films in the Internet Movie Database’. Also, Wasserman is continuing the research by diving further into movie networks, so keep an eye out.