Who’s important here?

8 min readMay 31, 2016

Graph centrality measures to find the main characters in movies

Graph centrality is a popular metric used to identify important nodes in networks. The centrality of a node in a graph is a measure of how well-connected it is to others. There are quite a few popular centrality measures, each with its own area of application. In this post, I focus on the popular betweenness centrality and show how this measure works better than other metrics in extracting the main characters in movie social networks.

You can try out a demo of the algorithm to find important characters here, or get the code here.

Why movie scripts?

The idea of using graph centrality to study character interaction is pretty general and can be applied anywhere there is some form of interaction between participating members. This makes it popular in modeling different types of social networks. To see the effectiveness of centrality measures, movie scripts are ideal for a few reasons:

  1. Scripts, like plays, are semi-structured. At the expense of a little noise, it is easy to quickly parse a movie script to retrieve the names of characters, there dialogues and the scenes that included them.
  2. Social network graphs modeling social media are ideal targets for centrality metrics. Many companies concentrate on finding the most “central” person on social media and target ads towards them, increasing the rate at which the ad is popularized. This however, is not very good to demonstrate the algorithm, simply because we don’t know the people in the graph! We could just download a part of the Facebook graph and prune it, but can’t check if we really got the important people in the group, as we don’t know them. In a movie, we are well aware of each character and know just how important they are. It’s much easier to check if the extracted characters were indeed central ones.
  3. Another advantage of movie scripts is that we clearly know when a character entered the movie, and how long he/she was actively participating in the plot. This knowledge is useful in interpreting the centrality-time graph of the character (described below).

To analyze the importance of characters in movies, we scan the script and generate a character interaction graph. Every node of this graph is a character. Two nodes are connected with an edge if the characters have interacted by being present in the same scene.

Getting the main characters

Given a graph G, with vertex set V, a node x is an important node if it satisfies the following condition:

Here, C(v) is a function that returns the centrality of a node v and |V| is the total number of nodes in G.

Clearly, the formula uses the average centrality as a threshold to find the most central nodes. This is called the mean cutoff.

C(v) is an important parameter. There are lots of different ways to measure centrality and the method you use will affect the effectiveness of the algorithm. The Wiki page on centrality describes the popular centrality measures.

As an example, I’ll run the algorithm on one of my favorites — All The President’s Men. This figure shows the character interaction graph of the movie:

Character interaction graph for All The President’s Men

This graph is really cluttered, with lots of characters (nodes). Nodes such as “operator’s voice”, “first voice” are basically noise, as these have incorrectly been classified as characters while parsing the script. Besides these, the graph also has characters such as “nixon”, “barker” and “mitchell”, which are either pendant vertices, isolated, or connected to very few nodes. Though these characters are important for the plot, the movie is not really about these people, and they appear in very few scenes. To extract the main characters, we must ignore the noise and eliminate the nodes extraneous to the central theme of the movie.

Here are the main characters as per the mean cutoff when we use the betweenness centrality for C(v):

Main characters in All The President’s Men with the betweenness centrality

This is remarkably accurate. The entire movie basically tells the story of Bernstein and Woodward — two reporters of the Washington Post. These are the most central nodes. Even in the pruned graph, these two nodes are connected to all the others. Ben Bradlee is their boss, and Howard Simons oversees there investigative work at the Post. Indeed, the movie is centered around the work of these characters. Hugh Sloan was an important witness, whose testimony was vital towards the end. It is worth noting the presence of the node “middle-aged man”. This may seem to be noise that has crept in, but a glance through the script (here) explains this anomaly. On four different occasions, “middle-aged man” referred to the person Woodward and Bernstein were interviewing. While we understand that these are four different people, the simple parser I used to scan the script is fooled into treating the “middle-aged man” as one character who has strongly interacted, on multiple occasions, with the most central characters in the film. This inflates its centrality.

I should mention here that though this procedure almost always ensures that the pruned graph will have important characters, it may also leave out some characters. For example, the second graph here excludes the node for Rosenfeld, a colleague of the reporters at the Post. Also, there are characters, who play a very important role in the plot, but do not interact with many. Here, Deep Throat (yup, that really is his name) is a pendant vertex, interacting only with Woodward, but is vital to the story. Capturing such relations between the character and the plot is extremely difficult and involves modeling the exact manner in which characters participate in the progression of the plot.

Alternatives to betweenness centrality fail to be as effective. Here’s what happens with degree and harmonic centrality:

Pruned graphs with (a) degree centrality and (b) harmonic centrality

Clearly, these hardly prune away anything from the graph. Degree centrality only removes the isolated vertices and the harmonic centrality preserves all the 3-cliques.

The centrality-time graph

A more interesting plot we could generate is the trace of a character’s centrality as the movie progresses. This shows how active the character is at any point in time in the movie. Consider the variations in the betweenness centrality of three characters in the movie:

Betweenness centrality vs time plots for three characters

The first plot shows a spike in Woodward’s centrality right at the start of the movie. Indeed, the first scene in which an important character has dialogue is that of Woodward waking up in his apartment. He is the only node till now, and has very high centrality. This stays up until we reach the scene at the office, where there are other characters introduced. Notice how the fall in Woodward’s centrality is exactly where Bernstein’s centrality rises. He is introduced in the movie at the office, and participates in scenes involving colleagues other than Woodward. This sets up alternate shortest paths between nodes that do not include Woodward, and pull’s down Woodward’s centrality. At the same time, the centrality of the new nodes rise.

It is this property of betweenness centrality that makes it so useful for social networks. The main characters in movies (and most groups having social interaction) are the ones in the thick of things, interacting with many of the other characters. If a character does so at a point in time in the movie, his/her betweenness centrality rises. However, if there are many interactions between others, setting up edges not involving the character’s node, it is “punished”, for lack of participation, through a decay in centrality.

The “punishment” of a non-active node by decay of the betweenness centrality is evident in the plot for Ben Bradlee. Bradlee is introduced quite late. His centrality shoots up immediately as his first scene includes three characters that have already been established to be important — Woodward, Bernstein and Rosenfeld. After this, the movie focuses on the two reporters’ investigation, and Bradlee doesn’t come up until his scene with the editors meeting after which the two report to Bradlee. This corresponds to the second peak in the plot. Similarly, the third peak comes around the time Bradlee meets the reporters, discussing the story incriminating Haldeman.

Other centrality measures fail to capture the “activeness” of characters. For example, here are the plots for the same characters using the harmonic centrality (closeness centrality shows similar variations):

Clearly, there is no situation in which the centrality “decays”, and we do not know when the character has gone inactive. Another observation is that the centrality of the inactive nodes is not zero, and rises even when the nodes are pendant.

To conclude…

Betweenness centrality is a really useful metric to study social networks, and as the centrality-time graphs above show, can be used to track how active a participant is over time. A point to note is that in this post, I have not used any formal method to benchmark the metrics. This would involve getting a set of main characters of every movie (maybe from Wikipedia) and comparing this set with the set generated by the algorithm to get a quantitative measure of the accuracy. I’ll probably cover this in a later post.