A textual analysis of Harry Potter by an amateur data analyst

Jul 18, 2016

When I heard that Waterloo was offering an English literature course based on Harry Potter last fall, I immediately became interested in taking it. I mean, how often do you really get the chance to say you took a course based on Harry Potter in university? It is also an elective quite different from my usual technical computer science/engineering courses, I enjoy reading and writing as a way to distract myself, and I do consider myself a Harry Potter fanboy. So I enrolled, and for this spring school term, alongside Adaptive Algorithms, Distributed Computing and Field Ecology, I found myself taking Popular Potter.

The third and final assignment in this course was an open-ended project that could be literally anything: an assignment in the format of your choice, as creative or as technical as you wish. With this flexibility in mind, I decided to try something interesting I hadn’t really done before: a textual analysis of the novels.

Textual analysis is described as follows according to this paper from the CS department at Columbia:

When we perform textual analysis on a text, we make an educated guess at some of the most likely interpretations that might be made of that text

So I went ahead, did a quick search, found the book series in .txt format, and did my analysis. I primarily used Python with the NLTK and Pattern libraries as my main tools for analysis. I don’t intend for this post to go too deep into the technical details of how I gathered the data, but I do have my code here to view.

Before we go into some of the results, I want to preface this by saying I am no data expert. To experienced analyst and data scientist readers, feel free to critique my analysis (or provide some feedback). I used this assignment as an opportunity to learn and definitely picked up some interesting natural language processing and textual analysis techniques while using Harry Potter as a basis. So with that in mind, here are some results I came up with.

Word Counting

So this one is a good easy way to start. Let’s just count the total number of words in each novel:
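A count like this can be sketched in a few lines of Python. This is a minimal sketch rather than the exact code used for the post: the `count_words` and `word_counts_by_book` helpers are hypothetical names, and the rule for what counts as a "word" (a run of letters, optionally joined by apostrophes or hyphens) is an assumption, so totals will differ slightly from other tokenizers.

```python
import re

# Assumption: a "word" is a run of letters, optionally joined by
# apostrophes or hyphens (so "didn't" and "He-Who-Must-Not-Be-Named"
# each count once). Other tokenizers will give slightly different totals.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")


def count_words(text):
    """Return the number of word tokens in a string."""
    return len(WORD_RE.findall(text))


def word_counts_by_book(books):
    """Map each title to its word count; `books` is a dict of title -> full text."""
    return {title: count_words(text) for title, text in books.items()}


sample = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud."
print(count_words(sample))
```

In practice, each value in `books` would be the contents of one novel's .txt file read into a string.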

The Order of the Phoenix is twice as long as the Prisoner of Azkaban? Yeah okay, we didn’t need to do textual analysis to see that, it’s clearly obvious by just looking at the physical book size. But why do the books get significantly longer starting with the fourth book?

Well apparently Rowling wrote a massive plot hole in the Goblet of Fire and spent half the book trying to fix it. Apart from that, you would believe Rowling had more creative freedom to approach the novels at this point. By the release of the Goblet of Fire, Rowling clearly had a captivated audience. A solid foundation was set for the plot, characters and setting such that Rowling could continue world building and explore the darker themes that the series continued to move towards.

Okay, that was simple enough. Why don’t we look at something a little more interesting: locations? Below are the word counts of various settings throughout the series:

Hogwarts is clearly the most mentioned location in the series, but we can see interesting information about some of the other locations. Azkaban is most often mentioned in the Prisoner of Azkaban, and from then on the location is mentioned in passing as people are sent to and break out of the prison. Places such as the Ministry of Magic and Grimmauld Place are featured in the later novels, most noticeably in the Order of the Phoenix, where Ministry of Magic representative Umbridge takes over Hogwarts and the Order uses Grimmauld Place as its headquarters.
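Counting a multi-word name such as "Grimmauld Place" needs slightly more care than counting single words. The sketch below is one possible approach, not necessarily how the counts above were produced: `count_phrase` is a hypothetical helper, and matching case-insensitively and across line breaks are assumptions about what should count as a mention.

```python
import re


def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of a word or phrase.

    Words in the phrase may be separated by any whitespace, so a name
    split across a line break in the .txt file is still counted.
    """
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = (
    'Sirius escaped from Azkaban. "AZKABAN!" he yelled. '
    "They met at Grimmauld\nPlace."
)
for place in ["Hogwarts", "Azkaban", "Grimmauld Place"]:
    print(place, count_phrase(sample, place))
```

Ignoring case matters for text where characters shout in capitals, which would otherwise be missed by an exact match.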

Apart from locations, why don’t we look at another interesting count: the number of times a spell or charm is mentioned. There is a wide variety of spells and charms used in the series, but some of the more popular ones include:

Expecto Patronum (spirit guardian summoning)
Accio (object summoning at hand)
Expelliarmus (disarming)
Stupefy (stun)
Lumos (flashlight)

With the introduction of Dementors in the Prisoner of Azkaban, it is not surprising to find that Expecto Patronum is the most used spell in that book. But the large number of nearly 25 explicit mentions in that novel alone is quite the anomaly. Examining the text further, we find that when Harry, Hermione and Sirius are surrounded by the Dementors in the Dementor’s Kiss chapter, Expecto Patronum is explicitly used a total of 16 times by the characters.

The dementors were closing in, barely ten feet from them. They formed a solid wall around Harry and Hermione, and were getting closer. “EXPECTO PATRONUM!” Harry yelled, trying to blot the screaming from his ears. “EXPECTO PATRONUM!”

Finally, we also have the following Unforgivable Curses: Imperio (mind control), Crucio (torture) and Avada Kedavra (killing). Upon their introduction in the Goblet of Fire, usage of the Unforgivable Curses peaks initially, but trends downwards until the Deathly Hallows where they are used quite a bit once again in the final battles.

Word Dispersion

Obtaining total counts of certain words provided some intriguing insights, but another interesting analysis is to observe a word’s relative position in the overall text. Word dispersion, or lexical dispersion, describes where a specific word occurs relative to the overall text. Below is a word dispersion plot of some character names in the Philosopher’s Stone, with chapters outlined where mentions are highly concentrated:
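For readers who want to reproduce a plot like this, the underlying data is just the position of each mention divided by the length of the text. The sketch below computes those positions with the standard library only; `relative_positions` is a hypothetical helper and the letters-only tokenization is an assumption. NLTK's `Text.dispersion_plot` draws this kind of chart directly from a token list, though whether it was the exact call used here is an assumption.

```python
import re


def relative_positions(text, word):
    """Return the position of each occurrence of `word` as a fraction of the text.

    0.0 is the first token and values approach 1.0 near the end.
    Tokenization is a simple letters-only split (an assumption).
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    n = len(tokens)
    return [i / n for i, tok in enumerate(tokens) if tok == word]


sample = "Harry met Hagrid. Hagrid took Harry shopping"
for name in ["Harry", "Hagrid", "Voldemort"]:
    # Each value would be one tick on that character's row of the plot.
    print(name, relative_positions(sample, name))
```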

With Harry being the main character in the series, it’s clear that his name is mentioned extensively throughout the book. There is a cluster of Hagrid mentions after the initial chapters as he finds Harry and assists him in his trip to Diagon Alley. Ron, Hermione and Malfoy are first introduced when Harry arrives at Platform Nine and Three-Quarters and on the Hogwarts Express. Dumbledore is mentioned initially as he leaves Harry at Privet Drive, but along with Snape, he is more frequently mentioned when they arrive at Hogwarts. Voldemort is mentioned very little in this book as he is still referred to as “He-Who-Must-Not-Be-Named” by most characters, but his return through Quirrell is evident by the cluster of his mentions at the end.

So looking at dispersion plots of the main characters doesn’t exactly provide much insight, since they will clearly be mentioned throughout the series in general. So why don’t we look at a dispersion plot of some interesting supporting characters throughout the entire series:

One of the more interesting characters I decided to look into was Dudley Dursley. Since each novel often has Harry starting off at Privet Drive, we can see clusters of Dudley mentions at the beginning of each book. Sirius Black on the other hand doesn’t make an appearance until the third book, but from then on is mentioned quite frequently, even after his death. Triwizard Tournament participants Viktor Krum and Cedric Diggory along with reporter Rita Skeeter are significantly featured in the Goblet of Fire, while sisters Bellatrix Lestrange and Andromeda Tonks are introduced in the Order of the Phoenix and, with their involvement in the war, are mentioned several times in the final two books.

Sentiment Analysis

As I mentioned earlier, Rowling definitely began to delve into darker themes as the series progressed, with subject matter such as murder being explored more thoroughly. To investigate this in more detail, we can do a sentiment analysis on the text. Sentiment analysis is the process of identifying and categorizing pieces of text to determine their particular attitude (such as positive, negative or neutral).

For this analysis, I’ll be doing a simplified approach to sentiment using the Python library, Pattern. Pattern provides a sentiment library which uses a set of predefined adjectives (good, bad, amazing, irritating, etc) that occur frequently in product reviews, annotated with scores for sentiment.

Using this, we can categorize sentences based on polarity (negative ↔ positive) and subjectivity (objective ↔ subjective). Polarity scores range from -1 (negative) to 1 (positive) while subjectivity scores range from 0 (objective) to 1 (subjective). For example:

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.

This sentence received a sentiment score of (0.403, 0.637), meaning that the sentence is more on the positive and subjective side. On the other hand:

“He always sp-spoils everything!” He shot Harry a nasty grin through the gap in his mothers arms.

This sentence received a sentiment score of (-1.0, 1.0), indicating that it is a very negative statement and also very subjective.

Harry was frying eggs by the time Dudley arrived in the kitchen with his mother.

The sentence above is neutral, neither positive nor negative, with scores of (0, 0).

For the overall analysis, let’s consider sentences that receive a polarity score of 0 to be neutral, less than 0 to be negative and greater than 0 to be positive. Let’s go through each book, calculate the percentage of neutral, negative and positive sentences and see what kind of trends there are as the series progresses:
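The bucketing rule above can be written down directly. This is a minimal sketch under stated assumptions: the polarity scores are taken as already computed for each sentence (for example with the `sentiment` function in `pattern.en`, which returns a polarity and subjectivity pair), and `label` and `sentiment_shares` are hypothetical helper names.

```python
from collections import Counter


def label(polarity):
    """Bucket a polarity score using the rule above: <0, 0, >0."""
    if polarity < 0:
        return "negative"
    if polarity > 0:
        return "positive"
    return "neutral"


def sentiment_shares(polarities):
    """Return the percentage of negative, neutral and positive sentences."""
    counts = Counter(label(p) for p in polarities)
    total = len(polarities)
    names = ("negative", "neutral", "positive")
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: 100.0 * counts[name] / total for name in names}


# Polarities of the three example sentences quoted above.
examples = [0.403, -1.0, 0.0]
print(sentiment_shares(examples))
```

Running `sentiment_shares` once per book, on the polarity of every sentence in that book, gives one set of three percentages per novel.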

So as the series progresses, the percentage of sentences with negative sentiment increases, with a large increase starting at the Order of the Phoenix, indicating that the series does get darker in the final books.

An interesting observation to note is that there is still a high percentage of neutral sentences in each book. This isn’t too surprising. In writing, most declarative sentences don’t really have much sentiment attached to them. But another reason, specific to this series, is that we aren’t using a sentiment model trained on text like this.

I mentioned earlier that we are using Pattern’s predefined set of adjectives, and thus it can’t really identify Harry Potter specific pieces of sentiment. For example, the following sentence was considered neutral with a score of (0, 0):

“Then kill him, fool, and be done!” screeched Voldemort

Any sentence that includes the words kill and Voldemort should definitely have negative connotations associated with it, but these are likely not included in Pattern’s library and thus it failed to correctly classify this sentence.

So sentiment analysis using Pattern isn’t exactly perfect, but why don’t we take a look at some of the interesting positive sentences in the series anyways:

It had been Harry’s best Christmas day ever. (1.0, 0.3)
It was a delicious feast; the hall echoed with talk, laughter, and the clatter of knives and forks. (1.0, 1.0)
Haven’t people like Hagrid and Sirius told Harry how wonderful his father had been? (1.0, 1.0)
“Why shouldn’t I” said Hermione, “Mudblood, and proud of it!” (1.0, 1.0)
From all of these things, Harry deduced that Ginny, and probably Neville and Luna along with her, had been doing their best to continue Dumbledores Army. (1.0, 0.4)
“You were named for two headmasters of Hogwarts. One of them was a Slytherin and he was probably the bravest man I ever knew.” (1.0, 1.0)

Along with some of the more negative sentences:

“I’m disgusted”, said Professor McGonagall. (-1.0, 0.8)
Dementors caused a person to relive the worst moments of their life. (-1.0, 1.0)
Moody raised his wand again, pointed it at the spider, and muttered, “Crucio!” At once, the spiders legs bent in upon its body; it rolled over and began to twitch horribly, rocking from side to side. (-1.0, 1.0)
“I hate that Skeeter woman!” she burst out savagely. (-1.0, 0.9)
“He dared — he dares — ” shrieked Bellatrix incoherently. “ — He stands there — filthy half-blood — ” (-1.0, 1.0)
The death of the headmaster at the hands of one of our colleagues is a terrible stain upon Hogwarts’s history. (-1.0, 1.0)

We were able to determine basic sentiment with Pattern, though using a trained set would definitely improve the way we identify sentiment in these books. A project for the future, perhaps.

Grammar Counting

Now that we’ve looked at sentences in general, why don’t we break them down and look at their parts of speech? In text analysis, a process known as tokenization is used to break up text into words, phrases, symbols or other meaningful elements known as tokens.

With the text broken down into tokens, we can further classify the tokens into their respective parts of speech, such as whether it is a noun or verb. For example, tokenizing the following sentence:

Harry Potter was a highly unusual boy in many ways.

results in the following tokens:

['Harry', 'Potter', 'was', 'a', 'highly', 'unusual', 'boy', 'in', 'many', 'ways', '.']

and if we tag each token as a part of speech, we get the following:

[('Harry', 'NNP'), ('Potter', 'NNP'), ('was', 'VBD'), ('a', 'DT'), ('highly', 'RB'), ('unusual', 'JJ'), ('boy', 'NN'), ('in', 'IN'), ('many', 'JJ'), ('ways', 'NNS'), ('.', '.')]

It identified “Harry” as NNP (a proper noun), “was” as VBD (a past tense verb) and “highly” as RB (an adverb). A full list of what each tag means is available here.

Now that we can identify parts of speech, let’s start counting some adjectives, verbs, adverbs and nouns:
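Once every token carries a tag, counting by part of speech is a matter of filtering on the tag and tallying the words. The sketch below works on a list of (word, tag) pairs like the one shown above, which in NLTK would typically come from `nltk.word_tokenize` followed by `nltk.pos_tag` (both need their models downloaded first). The `top_words` helper and the tag groupings are assumptions for illustration.

```python
from collections import Counter

# Penn Treebank tag groups (proper nouns NNP/NNPS are deliberately left out).
NOUNS = {"NN", "NNS"}
ADJECTIVES = {"JJ", "JJR", "JJS"}
ADVERBS = {"RB", "RBR", "RBS"}
VERBS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def top_words(tagged_tokens, tags, n=15):
    """Return the n most common lowercased words whose tag is in `tags`."""
    counts = Counter(word.lower() for word, tag in tagged_tokens if tag in tags)
    return counts.most_common(n)


# The tagged example sentence from above.
tagged = [
    ("Harry", "NNP"), ("Potter", "NNP"), ("was", "VBD"), ("a", "DT"),
    ("highly", "RB"), ("unusual", "JJ"), ("boy", "NN"), ("in", "IN"),
    ("many", "JJ"), ("ways", "NNS"), (".", "."),
]
print("nouns:", top_words(tagged, NOUNS))
print("adjectives:", top_words(tagged, ADJECTIVES))
```

Lowercasing before counting merges sentence-initial capitals with the same word elsewhere, which is usually what you want for a top-15 list.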

With the main series taking place in Hogwarts, sure enough the most popular non-proper noun is “room”. Interestingly “ministry” and “school” also make it into the top nouns.

For adjectives, we can see Rowling’s preferred use of colours, with “white”, “golden” (for the snitch perhaps?) and “black” all being part of the top 15 used in the series. And of course with the book being about wizards and witches, we do find “magical” there as well.

Relationships

What about looking at specific relationships between characters? For this, let’s look at all occurrences of one specific character in the text, for example Harry. For every instance of Harry in the text, let’s observe a slice of text characters, say 50 text characters, before and after the occurrence. Within this slice, let’s see how many times a second character appears, for example Ron. Let’s calculate this as a percentage of the number of times Ron appears in the slices over the total number of slices.
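The slicing procedure can be sketched as follows. This is one reading of the description, not necessarily the exact code behind the numbers in this post: `cooccurrence_rate` is a hypothetical helper, and it assumes a slice counts once if the second name appears in it at least once.

```python
import re


def cooccurrence_rate(text, main, other, window=50):
    """Fraction of `main` mentions that have `other` within `window` characters.

    For every whole-word occurrence of `main`, take the slice from `window`
    characters before it to `window` characters after it, and check whether
    `other` appears anywhere in that slice.
    """
    main_re = re.compile(r"\b" + re.escape(main) + r"\b")
    other_re = re.compile(r"\b" + re.escape(other) + r"\b")
    total = 0
    hits = 0
    for match in main_re.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        total += 1
        if other_re.search(text[start:end]):
            hits += 1
    return hits / total if total else 0.0


# Two mentions of Harry; only the first has Ron nearby.
sample = "Harry and Ron ran." + "x " * 50 + "Harry sat alone."
print(cooccurrence_rate(sample, "Harry", "Ron"))
```

Note that the measure depends on which name is used to build the slices, so swapping `main` and `other` generally gives a different value.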

For this specific example of observing occurrences of Ron within the text slices of Harry, the following percentages were obtained for the 7 books:

Harry with Ron: [0.153, 0.192, 0.173, 0.149, 0.133, 0.127, 0.144]

This seems pretty consistent, where Ron occurs in the text slices of Harry about 15% of the time on average. Why don’t we observe other characters in the Harry text slices and compare them on a plot:

With Ron and Hermione being the main supporting characters, we find that they have the highest percentages of being found in the Harry text slices and thus we can say Harry has the strongest relationship with these characters.

Note that we find a peak in Hermione and Dumbledore percentages in the Prisoner of Azkaban and Half-Blood Prince respectively. This is a result of Hermione spending a lot of time with Harry in the final parts of the Prisoner of Azkaban, and of Harry taking private lessons with Dumbledore in the Half-Blood Prince in addition to their mission to destroy a Horcrux.

This method of checking characters within text slices isn’t exactly bidirectional, so let’s check another character. Let’s use Ron as the main character text slice, and check for instances of Harry and Hermione in the Ron text slices:

So from Ron’s perspective, we find a strong increasing trend in Hermione occurrences in his text slices. Perhaps we could have predicted their relationship in the earlier novels by observing this trend?

Conclusions

The Harry Potter series clearly has a wealth of text and a wide variety of words that can easily be analyzed in different ways. I’ve provided a few ways in which the text can be analyzed above, but the list is definitely not exhaustive and there are many other methods that could be applied.