How to read The Lord of the Rings in 5 minutes using data science: A Trilogy (Part 2*)

6 min read · Dec 20, 2019

*If you haven’t read Part 1 yet, go do that. Reference code here.

Part 2: Character Network Analysis

With the list of characters produced by our fitted NER model from Part 1 in hand, we’re ready to start understanding their relationships and what this can tell us about the story. But first, we need to answer the question of “who knows whom?” The sentential structure of the text is sufficient for this task, since we can make the simplifying assumption that if two characters’ names occur in the same sentence, they’ve had an interaction. To capture the shades of gray (“how well do two characters know each other?”), we simply count the number of sentences in which each pair of characters co-occurs. Aggregating these counts in a single table gives us a “co-occurrence matrix”.

But with a list of 200 characters and 11504 sentences, we need an efficient way of producing this matrix. Nested for-loops aren’t going to cut it for this task (or probably most projects at Palantir). By using compressed-sparse-row (CSR) matrices to reduce memory allocation, a series of list comprehensions, and matrix algebra, we’re able to accomplish this goal.

Read the comments in the Colab cell below for a line-by-line walk-through.
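As a rough illustration of the approach described above (a sketch under assumptions, not the author's Colab cell): the function name `cooccurrence_matrix`, the toy sentences, and the plain substring matching are all hypothetical stand-ins. The idea is to build a sparse sentence-by-character indicator matrix and multiply its transpose by itself, so that each entry counts the sentences shared by a pair of characters.

```python
import numpy as np
from scipy.sparse import csr_matrix

def cooccurrence_matrix(sentences, characters):
    """Count, for each pair of characters, the sentences that name both."""
    # One comprehension collects a (sentence index, character index) pair
    # for every mention. Plain substring matching is a simplification.
    hits = [(i, j)
            for i, sentence in enumerate(sentences)
            for j, name in enumerate(characters)
            if name in sentence]
    rows = [i for i, _ in hits]
    cols = [j for _, j in hits]
    data = np.ones(len(hits), dtype=np.int64)

    # Sparse indicator matrix: mentions[i, j] == 1 if sentence i names character j.
    # Most entries are zero, so CSR storage keeps memory use small.
    mentions = csr_matrix((data, (rows, cols)),
                          shape=(len(sentences), len(characters)))

    # (mentions.T @ mentions)[a, b] is the number of sentences naming both a and b.
    counts = (mentions.T @ mentions).toarray()
    np.fill_diagonal(counts, 0)  # ignore a character "co-occurring" with itself
    return counts

# Made-up sentences for illustration only (not taken from the book).
sentences = [
    "Frodo looked at Gandalf.",
    "Sam followed Frodo.",
    "Gandalf laughed.",
    "Frodo and Gandalf spoke of the Ring.",
]
characters = ["Frodo", "Gandalf", "Sam"]
counts = cooccurrence_matrix(sentences, characters)
print(counts)
```

In this toy input, Frodo and Gandalf share two sentences, Frodo and Sam share one, and Gandalf and Sam share none. The final matrix is only characters-by-characters, so converting it to a dense array is cheap even when the sentence-level matrix is large.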

While a bit complicated to compute efficiently, the co-occurrence matrix is a critical tool for calculating measures of centrality for each character (co-occurrence matrices are also used quite successfully in computer vision). At a high level, these measures indicate character importance, and upon closer examination, they can provide information about the character relationships and even the plot of the story!

But first, what is a measure of centrality? Each measure is calculated from a network graph in which characters are nodes and co-occurrences are edges. Given the large number of characters in The Lord of the Rings, a diagram of this graph is of little use on its own beyond providing some intuition for network density (see diagram below).

Network Graph Diagram: Too complex to be useful at a glance.

There are many types of centrality, but the four we explore in this post are betweenness centrality, degree centrality, closeness centrality, and eigenvector centrality.

Betweenness Centrality

Betweenness centrality measures how often a character sits on the shortest paths between other characters. For every pair of other characters, we take the fraction of shortest paths between them that pass through the character of interest, and then sum those fractions. An example is the pair of characters Gollum and Elrond. Even though they never met in the story, they are connected via the path Gollum → Frodo → Elrond, which is one of the shortest possible paths between them. If every path of that length between them required Frodo as a mediator, then the fraction for this pair would be equal to 1. To calculate the remaining terms, we simply iterate over all possible pairs of characters that appear in The Fellowship of the Ring (200 choose 2 = 19900 distinct combinations).
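To make the definition concrete, here is a minimal pure-Python sketch that follows it literally: count shortest paths with a breadth-first search, then sum the fraction passing through each character. The helper names (`bfs_counts`, `betweenness`) and both toy graphs are hypothetical and hand-built, not output from the article's model. The second graph adds an assumed alternative route via Gandalf to show the fraction dropping below 1.

```python
from collections import deque
from itertools import combinations
from math import comb

def bfs_counts(graph, source):
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:              # first time we reach nbr
                dist[nbr] = dist[node] + 1
                sigma[nbr] = 0
                queue.append(nbr)
            if dist[nbr] == dist[node] + 1:  # node precedes nbr on a shortest path
                sigma[nbr] += sigma[node]
    return dist, sigma

def betweenness(graph):
    """Unnormalized betweenness for an undirected graph given as {node: set_of_neighbours}."""
    info = {s: bfs_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):      # every unordered pair of characters
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue                         # no path between s and t
        dist_t, sigma_t = info[t]
        for v in graph:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:
                # Fraction of shortest s-t paths that pass through v.
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score

# Frodo is the only mediator between Gollum and Elrond.
path = {"Gollum": {"Frodo"}, "Frodo": {"Gollum", "Elrond"}, "Elrond": {"Frodo"}}
# Hypothetical second route of the same length, via Gandalf.
cycle = {"Gollum": {"Frodo", "Gandalf"}, "Frodo": {"Gollum", "Elrond"},
         "Elrond": {"Frodo", "Gandalf"}, "Gandalf": {"Gollum", "Elrond"}}

print(betweenness(path)["Frodo"])   # 1.0
print(betweenness(cycle)["Frodo"])  # 0.5
print(comb(200, 2))                 # 19900
```

This brute-force version is fine for a couple of hundred characters. For larger graphs, a library routine such as `networkx.betweenness_centrality` would be the more practical choice.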

However, even just selecting the top 20 betweenness centrality results, we can see that a small number of characters are responsible for the vast majority of shortest-path connections. What do these results imply about the story?

If your interviewer asked which nine characters you thought formed the Fellowship of the Ring, who would you say? If you read off the top nine characters ranked by betweenness centrality, you’d only miss three. Why does this make sense? In the book, the Fellowship is the group tasked with an epic quest that takes them across Middle-earth, exposing them to more characters than those who stay closer to home. For the “local” characters to be connected to each other, the path must often run through members of the Fellowship.

Two of the misses have straightforward explanations. Gimli, the dwarf, is the only one of his race to appear in the book, and even then only about halfway through. Peregrin, one of the four hobbits in the Fellowship, would be ranked higher if his mentions were not split between his given name and his nickname (“Pippin”). Depending on the formality of the relationship, Peregrin/Pippin goes by one name or the other, often exclusively.

Degree Centrality

Degree centrality is the most intuitive measure of centrality: simply put, it indicates how popular a character is. It is equal to the fraction of other characters the character of interest is connected to.
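The definition is short enough to write out directly. This is a small sketch assuming the graph is stored as a dictionary of neighbour sets. The helper names and the toy edge list are hypothetical, chosen only to illustrate the calculation.

```python
def build_graph(edges):
    """Build an undirected graph {node: set_of_neighbours} from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other characters each character is directly connected to."""
    others = len(graph) - 1
    if others <= 0:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / others for node, nbrs in graph.items()}

# Hand-made toy edges, not the article's co-occurrence data.
toy_graph = build_graph([
    ("Frodo", "Sam"),
    ("Frodo", "Gandalf"),
    ("Frodo", "Bilbo"),
    ("Bilbo", "Gandalf"),
])
degree = degree_centrality(toy_graph)
for name, value in sorted(degree.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}")
```

Here Frodo is connected to all three other characters and scores 1.0, while Sam, connected only to Frodo, scores about 0.33. `networkx.degree_centrality` applies the same normalization.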

Building on the understanding gained from betweenness centrality, we can see that most of Peregrin/Pippin’s relationships are informal rather than formal (since Pippin is ranked higher). Additionally, while Frodo (the assumed protagonist given his high rank in both measures) remains at the top of the list, individuals farther down have shuffled around. Bilbo (Frodo’s father figure), for example, is connected with a large proportion of characters but rarely serves as a bridge between them. This is sensible given that his primary role in the story (finding the Ring) occurs prior to the Lord of the Rings timeline.

Closeness Centrality

Closeness centrality indicates how close a character of interest is to all the other characters: the shorter the paths to everyone else, the higher the score. While Bilbo has relatively low betweenness centrality, his connection to the most well-connected character (Frodo) suggests that he’d rank fairly well in closeness centrality.
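One common way to turn this into a number is to divide the number of other characters by the sum of shortest-path distances to them. The sketch below uses that definition and assumes a connected graph. The helper names and the toy edges are hypothetical.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected graph {node: set_of_neighbours} from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def distances_from(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def closeness_centrality(graph):
    """(n - 1) / (sum of distances to all others); assumes the graph is connected."""
    result = {}
    for node in graph:
        total = sum(distances_from(graph, node).values())
        result[node] = (len(graph) - 1) / total if total > 0 else 0.0
    return result

# Hand-made toy edges: Bilbo's only link is to the well-connected Frodo.
toy_graph = build_graph([
    ("Frodo", "Bilbo"),
    ("Frodo", "Sam"),
    ("Frodo", "Gandalf"),
    ("Gandalf", "Elrond"),
])
closeness = closeness_centrality(toy_graph)
print(closeness["Frodo"])  # 0.8
print(closeness["Bilbo"])  # 0.5
```

Even with a single connection, Bilbo ends up closer to everyone than Elrond does in this toy graph, because that one connection is to the hub. `networkx.closeness_centrality` computes the same quantity for connected graphs.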

Indeed, Bilbo is ranked in the top 10, and the ranks themselves are harder to distinguish. We’d expect something like this from a story where a single character or group of characters is connected to most of the nodes, since such a structure makes even the most tangential characters second-degree connections of most of the others.

Eigenvector Centrality

Eigenvector centrality is a measure of the quality of each character’s connections. Take a character like Thorin (Oakenshield). He makes no actual appearance in The Lord of the Rings except via allusions to The Hobbit (the prequel to The Lord of the Rings), in which he was a primary character. This explains why he does not appear in either the betweenness or the degree centrality ranked list. However, the few characters who do know him (Gandalf, Bilbo, Gimli) are each highly connected, so Thorin’s eigenvector centrality should be relatively high, since the average quality or prestige of his connections is high.
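The intuition can be checked with a small power-iteration sketch: each character's score is repeatedly replaced by its own score plus the scores of its neighbours, then rescaled, until the values settle. The function names and the toy graph are hypothetical. The graph is hand-built so that Thorin and a purely "local" character (Butterbur) have the same number of connections but neighbours of very different standing.

```python
from math import sqrt

def build_graph(edges):
    """Build an undirected graph {node: set_of_neighbours} from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def eigenvector_centrality(graph, max_iter=1000, tol=1e-10):
    """Power iteration on (A + I); assumes a non-empty, connected, undirected graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        # Keeping the character's own score (the "+ I" part) does not change the
        # leading eigenvector, but it prevents oscillation on bipartite graphs.
        new = {node: scores[node] + sum(scores[nbr] for nbr in graph[node])
               for node in graph}
        norm = sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        change = sum(abs(new[node] - scores[node]) for node in graph)
        scores = new
        if change < tol:
            break
    return scores

# Hand-made toy edges, not the article's co-occurrence data.
toy_graph = build_graph([
    ("Frodo", "Gandalf"), ("Frodo", "Bilbo"), ("Frodo", "Gimli"),
    ("Gandalf", "Bilbo"), ("Gandalf", "Gimli"), ("Bilbo", "Gimli"),
    ("Thorin", "Gandalf"), ("Thorin", "Bilbo"), ("Thorin", "Gimli"),
    ("Butterbur", "Frodo"), ("Butterbur", "Nob"), ("Butterbur", "Bob"),
])
scores = eigenvector_centrality(toy_graph)
print(len(toy_graph["Thorin"]), len(toy_graph["Butterbur"]))  # 3 3
print(scores["Thorin"] > scores["Butterbur"])                 # True
```

Both characters have three connections, but Thorin's are all to well-connected characters, so his score comes out higher. `networkx.eigenvector_centrality` uses a similar iteration and would be the usual choice on real data.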

Conclusion

Each of these measures of centrality gave us a deeper understanding of the characters and even the plot of the story. Here are some key points we can infer from the prior analysis for you to take away into Part 3 of this Medium series:

Frodo is the primary protagonist, based on his rank across each of these centrality measures.
Gandalf is second only to Frodo across all measures, suggesting he is either closer to the protagonist than other characters (perhaps as a mentor or friend?) or an antagonist, although the latter is less likely given this is the first book of the trilogy.
This story is not unfolding in a vacuum: some characters appear only in the telling of prior events, in association with some of the primary characters (see eigenvector centrality).
A small group of characters (the fellowship) serves as a connecting hub for the larger population of characters (see betweenness centrality), suggesting that the story focuses on them and their activities.

Armed with these insights, you’re prepared to wage war across the world of Middle-earth in Part 3 of this story, where you will read and understand an extractive summary of the text!

Written by Connor Mitchell