How to read The Lord of the Rings in 5 minutes using data science: A Trilogy (Introduction + Part 1)

6 min read · Dec 20, 2019

Follow along with the code used in this series here: GitHub

Introduction

An email from Palantir just arrived in your inbox; they want to schedule an initial interview with you tomorrow. You’ve been diligently honing your practical data science skills, including data structure manipulation with NumPy and Pandas as well as some machine learning with SKLearn and Keras. Assuming you make it through to the technical part of the interview, you’re confident in your abilities. But what if the personal section is harder than you anticipated?

Perhaps you know that Palantir was named after a crystal ball from Tolkien’s Middle Earth, but what if your interviewer starts asking you about the characters? There are literally hundreds to keep track of. Will you be able to prove yourself nerdy enough for the job?

Using a combination of Named Entity Recognition, Network Theory, and Text Summarization, this post will teach you how.

Overview

Part 1: Named Entity Recognition (NER)

Named Entity Recognition is an analytical method that uses language heuristics (e.g. capitalization) as features for training a model to recognize entities (e.g. people, geographical locations) in a body or “corpus” of text. The focus of this post will be understanding character relationships, but it’s useful to know NER can be used for a lot more. We’ll be comparing the results of a generic pre-trained SpaCy NER model with one fitted to our corpus.

Part 2: Character Network Analysis

With the help of the networkx library, we’ll cross reference our final character list from Part 1 with the corpus of sentences compiled from The Fellowship of the Ring in order to measure the centrality of each character. This will help us answer questions like “who are the main characters?” and “who is related to whom?” The answers to these questions will provide context to the 5-minute text summary we’ll generate in Part 3.
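As a preview of that idea, here is a minimal sketch, not the code used in Part 2. It assumes two characters are "connected" when they appear in the same sentence, and it computes degree centrality by hand (neighbours divided by the number of other characters) rather than calling networkx, so the block has no dependencies. The function names and the toy sentences are made up for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(sentences, characters):
    """Count how often each pair of characters shares a sentence."""
    weights = defaultdict(int)
    for sentence in sentences:
        # Whole-word match so that "Sam" does not fire inside "Samwise".
        present = sorted(
            name for name in characters
            if re.search(rf"\b{re.escape(name)}\b", sentence)
        )
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree_centrality(edges, characters):
    """Fraction of the other characters each character is linked to."""
    neighbours = {name: set() for name in characters}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    denom = max(len(characters) - 1, 1)
    return {name: len(linked) / denom for name, linked in neighbours.items()}


# Toy input, invented for this example (not real sentences from the book).
toy_sentences = [
    "Frodo and Sam left the Shire.",
    "Gandalf warned Frodo.",
    "Sam cooked.",
]
toy_characters = ["Frodo", "Sam", "Gandalf"]
toy_edges = cooccurrence_edges(toy_sentences, toy_characters)
toy_centrality = degree_centrality(toy_edges, toy_characters)
```

In this toy case Frodo shares a sentence with both other characters, so he gets the highest score, which is the kind of signal the real analysis looks for.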

Part 3: Extractive Text Summarization

There are two primary methods of text summarization using data science. The first and simplest to implement is extractive text summarization whereby a component or sentence from a body of text is selected as representative of the main point. This is often done through some sort of sentence similarity comparison, for which we will use the PageRank algorithm from networkx.
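The extractive approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation from Part 3: sentence similarity is assumed to be cosine similarity over simple word counts, and PageRank is written out as a plain power iteration instead of using networkx, so the example runs on the standard library alone. All function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(sentence):
    """Lower-cased word counts for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted adjacency matrix."""
    n = len(weights)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        updated = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_sums[i] > 0:
                    incoming += scores[i] * weights[i][j] / out_sums[i]
                else:
                    # A sentence similar to nothing spreads its score evenly.
                    incoming += scores[i] / n
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def summarize(sentences, k=1):
    """Return the k highest-ranked sentences, in their original order."""
    bags = [bag_of_words(s) for s in sentences]
    n = len(sentences)
    weights = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many others collect the most score, so they are the ones selected as "representative".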

The other method is abstractive text summarization whereby new text is generated to summarize the original. The abstractive method most resembles how humans summarize text, but in pursuit of simplicity and conciseness (as well as to avoid bastardizing Tolkien’s text), that will be reserved for a future post.

Armed with an informed perspective colored by our trained NER model and network analysis, we will use our extractive summary to build a foundational understanding of The Lord of the Rings: The Fellowship of the Ring.

Part 1: Named Entity Recognition

Let’s start off by reading Tolkien’s text, which you can download here, into our local environment. We also want to split the text (“filedata”) into sentences, then split those sentences into lists of words to provide as input to our text summarization model’s co-occurrence matrix function (see Part 2).
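A minimal sketch of that preprocessing might look like the block below. The helper names are hypothetical, the file path is whatever you saved the download as, and the regex-based sentence splitter is a deliberate simplification (it will stumble on abbreviations like "Mr."), so treat it as a starting point rather than the code from the repository.

```python
import re
from pathlib import Path


def read_text(path):
    """Read the whole book into one string (the post calls this "filedata")."""
    return Path(path).read_text(encoding="utf-8")


def split_sentences(filedata):
    """Split on whitespace that follows '.', '!' or '?'."""
    flat = " ".join(filedata.split())  # collapse line breaks and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def split_words(sentence):
    """Break one sentence into a list of word tokens."""
    return re.findall(r"[A-Za-z']+", sentence)


# Tiny made-up sample standing in for the real text.
sample = "Frodo ran.\nSam followed! Did Gandalf know?"
sentences = split_sentences(sample)
words = [split_words(s) for s in sentences]
```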

But we’re getting ahead of ourselves. Let’s begin with running the default pre-trained SpaCy NER model:

Yes, it’s that easy. You do have to override SpaCy’s default maximum text length (the max_length setting) for long texts like The Lord of the Rings, but if you run the code on Colab you shouldn’t have any memory problems.
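For readers without the notebook open, here is a rough sketch of what such a call could look like. The model name `en_core_web_sm`, the `max_length` value, and both function names are assumptions made for this example; the author's own `run_spacy` may differ. This version returns plain (text, label) tuples, which can be wrapped in a pandas DataFrame to get the two-column table described in the post.

```python
def load_default_model(name="en_core_web_sm", max_length=3_000_000):
    """Load a pre-trained pipeline and raise its text-length limit."""
    # Imported lazily so extract_entities() can be used with any pipeline object.
    import spacy

    nlp = spacy.load(name)
    # spaCy refuses texts longer than nlp.max_length characters (1,000,000 by default).
    nlp.max_length = max_length
    return nlp


def extract_entities(nlp, text):
    """Run the pipeline and return one (text, label) tuple per recognized entity."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Example usage (needs spaCy and the model installed):
# nlp = load_default_model()
# rows = extract_entities(nlp, filedata)
```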

The run_spacy function above produces a 2-dimensional dataframe containing the text recognized (“Gandalf”) and the label or entity type the model assigned to it (“PERSON”). If we filter this data frame to show only PERSON entities, we’re able to approximate the model’s precision at recognizing characters:
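The filtering step itself is small. The sketch below works on (text, label) tuples and uses invented rows purely to show the mechanics; they are not the model's actual output. With a DataFrame whose columns are named, say, `text` and `label`, a boolean mask on the `label` column would do the same job.

```python
from collections import Counter


def person_counts(rows):
    """Count how many times each string was labelled PERSON."""
    return Counter(text for text, label in rows if label == "PERSON")


# Illustrative rows only, made up for this example.
example_rows = [
    ("Gandalf", "PERSON"),
    ("Whoa", "PERSON"),
    ("Shire", "GPE"),
    ("Gandalf", "PERSON"),
]
counts = person_counts(example_rows)
```

Sorting by count with `counts.most_common()` puts the most frequently tagged strings first, which makes it easier to eyeball both real characters and false positives.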

It looks like the default model thought exclamations like “Whoa” and “Coming” were characters, perhaps due to their capitalization, and “III” looks like a part of a name. We have some work to do to make this list functional.

In order to fit SpaCy’s model to this corpus of fantastical names, we need to supply it with some examples. From another self-proclaimed LOTR nerd’s GitHub account we can grab a full Middle Earth character list that contains characters from across all three books! This list doesn’t tell us anything about the importance of the characters (or which ones occur in the Fellowship of the Ring), but does supply us with a way of selecting sentences for training.

According to the SpaCy documentation, training data should be supplied in the form of:

(Sentence, [(Index Start, Index End, Label)]), which we can build with a couple of nice (and time-efficient!) list comprehensions.
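One way to build examples in that shape is sketched below. It is an assumption-laden illustration rather than the author's code: the helper names are hypothetical, `names` stands in for the “first” column of the character list, and the span list is wrapped in a dict under an `"entities"` key, which is how spaCy's own training examples present it.

```python
import re


def make_training_example(sentence, names, label="PERSON"):
    """Return (sentence, {"entities": [(start, end, label), ...]})."""
    spans = []
    for name in names:
        # Whole-word match; character offsets come straight from the regex match.
        for match in re.finditer(rf"\b{re.escape(name)}\b", sentence):
            spans.append((match.start(), match.end(), label))
    # Caveat: names that overlap (one contained in another) would need de-duplicating.
    return sentence, {"entities": sorted(spans)}


def build_training_data(sentences, names):
    """Keep only the sentences that mention at least one known name."""
    examples = [make_training_example(s, names) for s in sentences]
    return [example for example in examples if example[1]["entities"]]


demo = make_training_example("Frodo looked at Sam.", ["Sam", "Frodo"])
```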

You may notice that the column selected from the “full_characters” dataframe is named “first.” This is because in a book, the characters are rarely referred to by their full (first and last) names, so searching for the first only will provide a selection of sentences to use as training data. Here’s an example of one:

Now we’re ready to define a function to train the model:

A new, empty NER neural network model is initialized with an English vocabulary, then a “pipe” (a set of NER layers) is added to the model before iterating through each training sample in order to minimize a multi-label log loss function. Stochastic gradient descent is used as the optimization function: after a wrong label/entity prediction, it updates the neuron weights in the direction that reduces the loss (towards a local minimum of the cost function). To learn more about stochastic gradient descent, here’s a good resource.
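The steps just described can be sketched as follows. Two assumptions are worth flagging: this version is written against the spaCy 3 training API (`Example`, `nlp.initialize`), which is newer than the release available when this post was written, and the function names, batch size, dropout, and iteration count are placeholders rather than the author's settings. Note too that the `sgd` argument is simply spaCy's name for whatever optimizer object `initialize` returns.

```python
import random


def collect_labels(train_data):
    """Gather every entity label that appears in the training examples."""
    return sorted({label for _, ann in train_data for _, _, label in ann["entities"]})


def train_ner(train_data, n_iter=20, drop=0.2, seed=0):
    """Train a blank English pipeline with a fresh NER component."""
    # Imported lazily so collect_labels() works without spaCy installed.
    import spacy
    from spacy.training import Example
    from spacy.util import minibatch

    random.seed(seed)
    nlp = spacy.blank("en")    # empty model with English tokenization rules
    ner = nlp.add_pipe("ner")  # the NER "pipe"
    for label in collect_labels(train_data):
        ner.add_label(label)

    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    optimizer = nlp.initialize(lambda: examples)

    for _ in range(n_iter):
        random.shuffle(examples)
        losses = {}
        for batch in minibatch(examples, size=8):
            # One gradient step per batch; the running loss is accumulated in `losses`.
            nlp.update(batch, sgd=optimizer, drop=drop, losses=losses)
    return nlp
```

Calling `train_ner(training_data)` on the examples built earlier returns a pipeline that can be applied to the full text in the same way as the pre-trained model.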

Finally, let’s take a look at the trained model’s predicted character list:

While there are still quite a few false positives (e.g. Mortal, Lands) and one primary character’s name needed to be replaced with his nickname (Samwise → Sam), the trained character NER model produced a more accurate list.

But just because our model produces a more accurate list of characters doesn’t mean we know which ones we should pay attention to. Surely all 200 characters can’t be equally important? To be continued in Part 2 of this series: Character Network Analysis.