Social networks from movie scripts — Part 1
One day, I don’t remember exactly why, I was looking through the different text corpora that nltk ships with. Among them are Brown and Reuters, typically used to test NLP systems, but after some searching I found… the movie script of Pirates Of The Caribbean 2. Really?!?
That reminded me of those articles that created social networks from books such as Game Of Thrones, Harry Potter, The Iliad, etc. Creating social networks from books has many challenges. Among the first of them are identifying the characters and identifying/defining the interactions between them. There are several ways to achieve this: one way is manually, by taking notes of the interactions between characters. Another option is to provide a list of character names and define an interaction whenever two characters are named in the same sentence.
The first approach is slow and error-prone, while the second approach misses interactions and makes it tricky to identify characters that are named in different ways as the story goes forward.
Using a movie script, however, saves us some problems. It’s organized by lines, and we know who said each line (and that name is constant throughout the script). It’s also a structured document (or that’s what I thought), divided into scenes with a description inside square brackets and the dialogue after that.
To retrieve the characters, we are going to find who has lines in the script. We will define an interaction between two characters if these characters share a scene. We can illustrate this with an example.
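As a rough sketch of this idea, here is one way the characters could be grouped by scene. It assumes a simplified format in which each scene opens with a description in square brackets and every spoken line looks like `NAME: text`. The sample script, its dialogue, and the function name `characters_by_scene` are invented for illustration and are not the code used for this post.

```python
import re

# Hypothetical miniature script in the assumed format: a bracketed scene
# description followed by "NAME: dialogue" lines. The dialogue is made up.
SAMPLE_SCRIPT = """\
[a ship at sea, night]
JACK SPARROW: Where is the rum?
GIBBS: Gone, Captain.
[a prison cell in Port Royal]
WILL TURNER: I will find Jack.
ELIZABETH SWANN: Be careful.
WILL TURNER: Always.
"""

SCENE_RE = re.compile(r"^\[.*\]$")
DIALOGUE_RE = re.compile(r"^([^:\[\]]+):\s*(.*)$")


def characters_by_scene(script):
    """Return one set of speaker names per scene, in script order."""
    scenes = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if SCENE_RE.match(line):
            scenes.append(set())  # a bracketed description starts a new scene
            continue
        match = DIALOGUE_RE.match(line)
        if match:
            if not scenes:
                scenes.append(set())  # dialogue before any scene marker
            scenes[-1].add(match.group(1).strip())
    return scenes


scenes = characters_by_scene(SAMPLE_SCRIPT)
characters = set().union(*scenes)  # anyone with at least one line
```

Every pair of names that ends up in the same set counts as one interaction under this model.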
This model also has its limitations. Let’s say that there are five characters in a scene: it’s naive to think that all five characters interact equally with each other. It could happen, but it could also happen that two of them have a stronger interaction, and maybe no interaction at all with the other three. And when the characters in a scene are speaking about a character that is not physically there, we could count this as an interaction. Well, our model doesn’t take any of that into account.
The good part of our model is that it’s simple and fast. We can process the text and compare the result with our understanding of the movie (did you know that this movie is from 2006? ten years ago…).
Some curious things first:
- The script was malformed, so I had to fix (normalize) it before processing it.
- Bill “Bootstrap” Turner (Will’s father) and Governor Swann (Elizabeth’s father) have no lines in the script. They don’t talk according to the script, and are only mentioned by other characters. I clearly remember both these characters having lines in the movie, but for this social network they will be excluded since they don’t fit our definition of “character”.
- Yes, the script has a character whose name is “?”, and it’s a member of the cannibal tribe.
Let’s show the final result first and share some metrics and statistics later.
On this graph, the size of each node is relative to its degree (how many people that character interacts with) and the width of each edge is relative to the number of times those two characters interacted. The colors of the nodes are determined by a community detection algorithm running only on the information provided by our model.
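The node sizes and edge widths described above could be derived along these lines. This is a minimal sketch with a hypothetical list of scenes as input. It leaves out the community detection step, since the algorithm used for the coloring is not named here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one set of character names per scene.
scenes = [
    {"JACK SPARROW", "GIBBS"},
    {"JACK SPARROW", "GIBBS", "WILL TURNER"},
    {"WILL TURNER", "ELIZABETH SWANN"},
]

# Edge weight = number of scenes a pair of characters shares.
edge_weights = Counter()
for scene in scenes:
    # Sorting gives each pair a single canonical (a, b) key.
    for a, b in combinations(sorted(scene), 2):
        edge_weights[(a, b)] += 1

# Degree = number of distinct characters someone shares a scene with.
degree = Counter()
for a, b in edge_weights:
    degree[a] += 1
    degree[b] += 1
```

In a drawing, `degree` would drive the node size and `edge_weights` the edge width.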
There are seven communities that we can describe as:
- Violet: The Black Pearl crowd. It has some outliers like “Hadrus”, “Wyvern” and “Carruthers” who should be in the Flying Dutchman team and the cannibal island team (they should be their own community).
- Green: The Flying Dutchman, with “Davy Jones” as the head of it.
- Grey: The East India Trading Company, the real bad guys, with “Lord Cutler Beckett” at its head.
- Orange: These are the new recruits that “Gibbs” gets in Tortuga, and it’s easy to see how they mainly interact with him and “Jack Sparrow”.
- Pink: Edinburgh Trader. This is the first ship that the Kraken destroys, the one where they find Elizabeth’s dress. All the characters in this community belong to this ship.
- Sky-blue: These are the characters that Will encounters while searching for Jack.
- Aquamarine: This community is formed only by two characters, “Turkish fisherman” and “Greek fisherman”. They have one scene in the movie, and the point of that scene is to introduce us to the Kraken and its destructive power.
Let’s see the most common words of the main characters:
Nothing really surprising here. I particularly like how “love”, “dirt” and “bugger” have the same number of mentions.
Somebody seems a little bit obsessed…
But to be fair he spends most of the movie searching for Jack (and it’s to free his love from certain death).
Well… it seems that there is a character that gets named quite a lot.
At least he proves himself a good pirate. “Aye” is his most used word, way to go Gibbs!
Very funny guy for a bad guy…
But it seems that when he is not busy laughing, he also wants our popular captain.
He is obsessed with the compass, Jack has the compass, everything makes sense.
Also for a character like him, words like “world”, “freedom” and “currency” are accurate.
It’s interesting how all the main characters of the film, other than Jack, have “jack” as one of their most used words.
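Word counts like the ones discussed above could be produced roughly as follows. This is only a sketch: the `(speaker, line)` pairs are invented, the stopword list is a tiny placeholder, and the tokenization is a simple assumption rather than the one used for this post.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (speaker, line) pairs; in practice these would come from
# the parsed script.
dialogue = [
    ("GIBBS", "Aye, Captain."),
    ("GIBBS", "Aye, the compass points to Jack."),
    ("WILL TURNER", "I need to find Jack."),
]

# Tiny illustrative stopword list (an assumption; a real run would use a
# much fuller one).
STOPWORDS = {"the", "to", "i", "a", "of", "and"}


def word_counts_by_speaker(lines):
    """Map each speaker to a Counter of their lowercased, non-stopword words."""
    counts = defaultdict(Counter)
    for speaker, text in lines:
        words = re.findall(r"[a-z']+", text.lower())
        counts[speaker].update(w for w in words if w not in STOPWORDS)
    return counts


counts = word_counts_by_speaker(dialogue)
top_gibbs = counts["GIBBS"].most_common(1)
```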
We just stated that these characters are the main characters of the film. That could be our personal take on the film, or we could try to “calculate” them. One way to do this is to count the number of lines a character has and the number of scenes a character is in, which could give us an idea of who the main characters are.
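Both counts are simple to compute once the script is split into scenes. The sketch below uses a hypothetical parsed script in which each scene is just the list of speakers, one entry per spoken line. The numbers are made up and are not the real counts from the movie.

```python
from collections import Counter

# Hypothetical parsed script: each scene is a list of speaker names,
# one entry per line spoken in that scene.
scenes = [
    ["JACK SPARROW", "GIBBS", "JACK SPARROW", "JACK SPARROW"],
    ["WILL TURNER", "ELIZABETH SWANN"],
    ["WILL TURNER", "JACK SPARROW", "JACK SPARROW"],
    ["WILL TURNER", "GIBBS"],
]

lines_per_character = Counter()
scenes_per_character = Counter()
for speakers in scenes:
    lines_per_character.update(speakers)        # every spoken line counts
    scenes_per_character.update(set(speakers))  # each scene counts once per character
```

In this toy data one character has the most lines while another appears in the most scenes, which is why it helps to look at both counts.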
We can see the number of lines per character in the following plot:
The plot of scenes per character:
Will is in more scenes than Jack, but Jack has more lines than Will. This is in part because Jack has several monologues in the movie.
Based on these two plots we could say that Jack, Will, Elizabeth and Gibbs are strong candidates for main characters. After them there is a break, and we have Pintel, Ragetti, Davy Jones, Norrington and Lord Beckett in the mix.
Now, if you are still with me, we can see the result of applying some metrics to the social network, using normalized results.
Degree Centrality: the number of other characters that you interact with.
Eigenvector Centrality: weighted degree centrality with a feedback boost for interacting with other important characters. You get full credit for the importance of your neighbors.
PageRank Centrality: weighted degree centrality with a feedback boost for interacting with other important characters. The importance of each of your neighbors is split among all of that neighbor’s neighbors (this is the algorithm at the base of Google’s search).
Closeness Centrality: the inverse of the average distance to all other characters, so characters that are close to everyone else score higher.
Betweenness Centrality: how often you lie on shortest paths between two other characters.
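To make three of these definitions concrete, here is a small pure-Python sketch of normalized degree, closeness and betweenness centrality on a hypothetical, unweighted toy graph. It ignores edge weights, assumes a connected graph, and skips the eigenvector and PageRank variants. It is an illustration, not the code behind the results in this post.

```python
from collections import deque

# Hypothetical, unweighted interaction graph: character -> set of neighbors.
graph = {
    "JACK": {"GIBBS", "WILL", "ELIZABETH"},
    "GIBBS": {"JACK"},
    "WILL": {"JACK", "ELIZABETH"},
    "ELIZABETH": {"JACK", "WILL"},
}


def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree_centrality(graph):
    """Fraction of the other characters each character is connected to."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Inverse of the average distance to the reachable characters."""
    result = {}
    for v in graph:
        dist = bfs_distances(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Brandes' algorithm for unweighted, undirected graphs (normalized)."""
    scores = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    n = len(graph)
    scale = (n - 1) * (n - 2)  # each undirected pair was counted twice
    return {v: s / scale for v, s in scores.items()} if scale else scores


degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this toy graph the best-connected character scores highest on all three metrics. Graph libraries typically provide these measures (including weighted variants), so a hand-rolled version like this is mainly useful for seeing what the numbers mean.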
Again we see Will, Jack, Elizabeth and Gibbs dominating these metrics. If we were to choose THE main character, it would have to be either Will Turner or Jack Sparrow. Will is the most important character in the social network, but Jack is the most named character and is also important in the network.
Well, thanks for reading! This was a fun little project, and since the code for this is pretty reusable I wanted to try it with a trilogy.
Check out Part Two of this series to see the social network of The Lord Of The Rings.