Harry Potter Text Analysis
The seventh and final book in the Harry Potter series, Harry Potter and the Deathly Hallows, was published in July 2007 and sold 11 million copies worldwide within 24 hours. This makes it the fastest-selling book in history. I was one of those dedicated fans who stood in line at midnight to get my hands on the book. I read the entire book in one sitting immediately after getting home. It was that good.
My first Python project is text analysis of the Harry Potter series. I’m familiar with the data, which in the scientific world is called domain knowledge. Hopefully, this improves the accuracy of the analysis.
Here is my code.
Goals for this project:
- Reinforce the Python fundamentals I’ve learned in the past few weeks by writing simple, easy to read code
- Use as few libraries as possible (to learn how to do things the hard way)
- Incorporate Amazon storage, one external API to pull data from and a visualization library
- Discover some interesting insights that maybe no one else in the world has realized about the Harry Potter series
I purposely did not use the Python Natural Language Toolkit (NLTK) because I wanted to learn how to write Python code from scratch and not completely lean on what has already been written by others. I’ll probably revisit this project and run more sophisticated analysis using third-party libraries in the future.
Here are the titles of each Harry Potter book, the year it was released in the US, its size, and what I refer to it as:
- Harry Potter and the Sorcerer’s Stone — 1998/0.4 MB (HP 1)
- Harry Potter and the Chamber of Secrets — 1999/0.5 MB (HP 2)
- Harry Potter and the Prisoner of Azkaban — 1999/0.7 MB (HP 3)
- Harry Potter and the Goblet of Fire — 2000/1.2 MB (HP 4)
- Harry Potter and the Order of the Phoenix — 2003/1.6 MB (HP 5)
- Harry Potter and the Half-Blood Prince — 2005/1 MB (HP 6)
- Harry Potter and the Deathly Hallows — 2007/1.2 MB (HP 7)
Get Data Into Notebook
First, I make the 7 HP files accessible from a Databricks Notebook, which is my coding environment. To do that, I copy the 7 files to Amazon S3 storage and use a Spark cluster to pull the files down from S3 into my cluster’s local file system. From there I can write normal Python I/O code to read the files from the local disk.
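Once a file is on the local disk, reading it is plain Python. Here is a minimal sketch of that last step only. The function name read_book and the UTF-8 encoding are assumptions for illustration, and the S3 and Spark steps are not shown.

```python
from pathlib import Path


def read_book(path):
    """Return the full text of one book file as a single string."""
    # Assumption: the files are UTF-8 encoded. Pass a different
    # encoding here if the real files use one.
    return Path(path).read_text(encoding="utf-8")
```

Calling read_book once per file gives seven strings, one per book, which is all the later steps need.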
Bag of Words
For this analysis I use a ‘bag of words’ approach where each book’s text is represented as a bag with words from that particular book. For example, if there are 75,000 words total in HP 1, then its bag of words contains 75,000 comma separated words. The implication here is that each word is analyzed individually, not taking into account word order or grammar.
Bag of words analysis is an effective way to filter out spam emails. For example, emails that contain the words “Viagra”, “cash”, and “free” no matter the order they appear in, are more likely to be spam than emails which do not contain these words.
To transform the data into a bag of words, my code:
- removes end of line (\n) characters from the text
- splits the text by whitespace to break into words
- lowercases all text
- removes all punctuation
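The steps above can be sketched roughly as follows. This is a simplified illustration, not the exact code from the project: the function name to_bag_of_words and the particular set of extra Unicode characters are assumptions.

```python
import string


def to_bag_of_words(text):
    """Turn raw text into a list of lowercase words without punctuation."""
    # Curly quotes and the ellipsis character are common in book text but
    # are not part of string.punctuation, so they are listed explicitly.
    extra = "\u2018\u2019\u201c\u201d\u2026"
    strip_table = str.maketrans("", "", string.punctuation + extra)

    # Treat newlines and long dashes as word separators.
    for separator in ("\n", "\u2014", "\u2013"):
        text = text.replace(separator, " ")

    words = []
    for token in text.split():
        word = token.lower().translate(strip_table)
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words


sample = "Mr. and Mrs. Dursley, of number four,\nPrivet Drive, were proud to say..."
print(to_bag_of_words(sample))
```

One side effect of this simple approach is that contractions lose their apostrophe, so "don’t" becomes "dont".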
This part took an absurdly long time as I ran into numerous formatting issues and spent many hours learning about unicode characters. This is normal for data analysts — sometimes cleaning data and getting it in a useable format (called data wrangling or munging) takes just as much time as the analysis itself.
It is possible to derive many new data sets from the bag of words. Here is a screenshot of top word frequencies for Harry Potter and the Sorcerer’s Stone:
As you can see, most of the top words such as the, of and to are fluff words that don’t add much context. These are called stop words. It is common in Natural Language Processing (NLP) to remove stop words. Here’s the list of English stop words I remove for part of the analysis.
Here are the revised top word frequencies without stop words, which give a slightly better understanding of which words are actually important:
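Counting word frequencies with and without stop words can be sketched like this. The stop-word set below is a tiny made-up sample (the real list is much longer), and collections.Counter from the standard library is used here for brevity. A plain dictionary would work just as well.

```python
from collections import Counter

# Tiny illustrative stop-word list, not the full list used in the analysis.
STOP_WORDS = {"the", "of", "to", "a", "and", "was", "in"}


def top_words(words, n=10, stop_words=frozenset()):
    """Return the n most frequent words, skipping any stop words."""
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


words = "the boy who lived the boy in the cupboard".split()
print(top_words(words, n=1))                         # [('the', 3)]
print(top_words(words, n=1, stop_words=STOP_WORDS))  # [('boy', 2)]
```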
Some NLP libraries use stemming which transforms each word to its ‘stem’. For example, running, runs, and runner would be changed to run as they are the same base word. This reduces the number of words in the bag of words, and may be more accurate for analysis. I’ll probably implement stemming when I revisit this code.
Below are total and unique word counts for each book.
The number of total words increases from HP 1 to HP 5 (HP 5 being the largest book), drops at HP 6 and increases slightly for HP 7.
The number of unique words also increases as the series progresses. In HP 1 there are 5,687 unique words compared to 12,624 unique words in HP 5. Part of this is expected simply because HP 5 is a much longer book, but it also suggests that either the writing becomes more sophisticated or J.K. Rowling is introducing new words (character names, new locations, spells, etc.) to readers.
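Given a bag of words, both numbers fall out of two built-in calls. A small sketch, with word_counts as a hypothetical helper name:

```python
def word_counts(words):
    """Return the total and unique word counts for one bag of words."""
    return {"total": len(words), "unique": len(set(words))}


print(word_counts("the boy who lived the boy in the cupboard".split()))
# {'total': 9, 'unique': 6}
```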
Overall, the later books in the series have significantly more words than the first couple books. Here are some predictions why:
- The first book may be the shortest because J.K. Rowling was not a proven author when it was published. Perhaps it would have been harder to get a publishing company to read her manuscript if it was longer.
- The Harry Potter series is intended for a young adult audience and the average audience age grew with each book. For reference, I read the first book in fourth grade, so I was eight years old. I read the seventh book at age sixteen. At age eight, I wasn’t as likely to read a large book with over a quarter million words as I was at age twelve, when Harry Potter and the Order of the Phoenix was published.
- As the series unfolds, the plot thickens. Maybe all those extra words are necessary for the most enchanting story.
- After the first couple books, there was a loyal fan base and demand for the next book in the series. The number of fans only increased as the series continued. Once there was a real demand for more books, maybe J.K. Rowling had more control and flexibility in her work. By the way, J.K. Rowling is the first author to become a billionaire by writing books!
Readers often overlook punctuation in text, but I thought it might be interesting to look at the punctuation in my data. How many period marks are in each book? How about exclamation marks? Question marks? Does sentence length increase as the series progresses?
This chart’s trend looks similar to the word count chart (HP 5 has the most punctuation and the most words).
There are significantly more period marks than question marks or exclamation marks. I assume this is similar to a lot of literature. For my analysis, I ignore period marks after Mr. and Mrs. and in ellipses to avoid inflating the number of sentences ending in period marks.
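Here is one way the counting could look, including the Mr./Mrs. and ellipsis exceptions. It is a sketch under the assumption that those are the only non-sentence-ending periods worth handling. Other abbreviations would still be counted.

```python
import re


def count_sentence_endings(text):
    """Count periods, question marks and exclamation marks in raw text."""
    # Drop the period after the titles Mr. and Mrs.
    cleaned = re.sub(r"\b(Mr|Mrs)\.", r"\1", text)
    # Drop ellipses, written either as several dots or as one character.
    cleaned = re.sub(r"\.{2,}|\u2026", " ", cleaned)
    return {
        "period": cleaned.count("."),
        "question": cleaned.count("?"),
        "exclamation": cleaned.count("!"),
    }


sample = "Mr. Dursley hummed... Was he imagining things? No! He was not."
print(count_sentence_endings(sample))
# {'period': 1, 'question': 1, 'exclamation': 1}
```

This has to run on the raw text, before punctuation is stripped for the bag of words.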
In English, there are four types of sentences:
- Declarative sentences state facts and end with a period mark.
- Imperative sentences give commands and typically end with a period mark, but can also end with exclamation marks. For my code, I’ll assume imperative sentences only end with period marks.
- Interrogative sentences ask questions and end in question marks.
- Exclamatory sentences express excitement and end in exclamation marks.
J.K. Rowling’s sentence length stays fairly constant throughout the series. However, as you can see in the word counts and punctuation analysis charts above, there are more sentences overall in HP 5.
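Average sentence length can be estimated by splitting the raw text on sentence-ending punctuation and counting the words in each piece. This is a rough sketch with a hypothetical function name. It treats every period, question mark or exclamation mark (other than those after Mr./Mrs. or in ellipses) as the end of a sentence.

```python
import re


def average_sentence_length(text):
    """Return the mean number of words per sentence in raw text."""
    cleaned = re.sub(r"\b(Mr|Mrs)\.", r"\1", text)    # titles
    cleaned = re.sub(r"\.{2,}|\u2026", " ", cleaned)  # ellipses
    # Split on runs of sentence-ending punctuation, dropping empty pieces.
    sentences = [s for s in re.split(r"[.?!]+", cleaned) if s.strip()]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences)


sample = "Mr. Dursley hummed. Was he imagining things? No!"
print(round(average_sentence_length(sample), 2))  # 2.67
```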
Sentiment Analysis (or opinion mining) uses NLP to determine if text is positive, negative, or neutral. This is used for binary decisions such as good or bad and like or dislike. Example use cases are Yelp analyzing whether a restaurant has good reviews or bad reviews or a marketing department mining tweets to understand their consumers’ view on a new product launch.
The sentiment analysis in my code is oversimplified, which creates some error. Consider this sentence.
The movie was neither funny, nor super witty.
A human reads this and can understand it is a negative review. In my ‘bag-of-words’ approach, this sentence is rated positive because the words funny, super, and witty are positive. There are more sophisticated models which take these instances into account and can understand this should be marked negative, but because I didn’t want to use the NLTK library, I decided this is a fair tradeoff for my own learning.
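The misclassification is easy to reproduce with a toy version of the word-list approach. The two word sets below are small made-up samples, not the actual lists used in the analysis, and the function names are hypothetical.

```python
# Small illustrative word lists (assumptions, not the real lists).
POSITIVE = {"funny", "super", "witty", "good"}
NEGATIVE = {"terrible", "odd", "bad"}


def sentiment_counts(words):
    """Count how many words appear in each sentiment list."""
    positive = sum(1 for w in words if w in POSITIVE)
    negative = sum(1 for w in words if w in NEGATIVE)
    return positive, negative


def label(words):
    """Label text by whichever sentiment list matched more words."""
    positive, negative = sentiment_counts(words)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


review = "the movie was neither funny nor super witty".split()
print(sentiment_counts(review), label(review))  # (3, 0) positive
```

The words neither and nor are not in either list, so the negation is invisible to the counter.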
One example of this oversimplification in my code is the word just, which is the most frequently occurring positive word in each HP book. There are multiple definitions of just, and some are positive: (1) based on or behaving according to what is morally right and fair, and (2) exactly. However, there are other definitions of just which are not positive, such as only or merely, but my code has no way to understand this and simply counts every instance of just as positive.
Another possible issue is the lack of weighted words as more positive or more negative. So the words terrible and odd are both equally negative, even though terrible has a much stronger negative connotation.
Here’s the list of positive words and negative words I used. Since these are general positive and negative words for the English language, the code has no way to know the word mudblood is negative and the word patronus is positive.
Notice the Y-axis above for number of words. HP 5 and HP 7 seem to be darker books than HP 1 or HP 2.
Looking at overall words, there are very few positive or negative words present. On average, only about 2.9% of words are positive, 3.3% are negative and the rest are neutral. This is probably so low because there aren’t any Harry Potter specific positive and negative words in my positive and negative word lists. If only there was a way to programmatically determine if a word is positive or negative based on the surrounding text…. hmm, sounds like a fun project for the future.
An N-gram is a contiguous sequence of N words in a given text. N-grams of size N=1 are called unigrams, N=2 are bigrams, N=3 are trigrams, and anything larger is referred to by the value of N, so four-grams, five-grams, etc.
Here is the first sentence in Harry Potter and the Prisoner of Azkaban.
Harry Potter was a highly unusual boy in many ways.
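If N = 2, here are the bigrams:
Harry Potter, Potter was, was a, a highly, highly unusual, unusual boy, boy in, in many, many ways
If N = 3, here are the trigrams:
Harry Potter was, Potter was a, was a highly, a highly unusual, highly unusual boy, unusual boy in, boy in many, in many ways
The number of N-grams for any given sentence of K words is K - (N - 1). The sentence above has 10 words, so it has 9 bigrams and 8 trigrams.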
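Generating N-grams from a list of words takes only a sliding slice. A minimal sketch (the function name ngrams is my own), which also checks the count for a K-word sentence against K - N + 1:

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "Harry Potter was a highly unusual boy in many ways".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(bigrams[:3])   # ['Harry Potter', 'Potter was', 'was a']
print(trigrams[:2])  # ['Harry Potter was', 'Potter was a']
print(len(words), len(bigrams), len(trigrams))  # 10 9 8
```

If N is larger than the number of words, the range is empty and the function returns an empty list.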
Here are some N-grams charts for the series.
I don’t know about you, but I was always curious to see if Hermione would date Harry or Ron. Interestingly, by applying text analysis on the books, I could have predicted that Ron and Hermione would date since their names appear together so often.
This chart shows that Hogwarts is mentioned more than other magical locations. It’s interesting to see the Ministry of Magic become important in the middle of the series. I was surprised to see its frequency drop for HP 6 and HP 7, but I think that’s because it starts being referred to as just “the Ministry” and not by its full name.
This chart compares courses taught at Hogwarts School of Witchcraft and Wizardry. Defense Against the Dark Arts and Potions appear to be the most important. The five-gram Defense Against the Dark Arts peaks in frequency during HP 5. This is probably due to the Harry versus Umbridge battle in this book. In HP 6, the Potions course is the most important as Harry excels after finding the Half-Blood Prince’s textbook. All the courses drop close to zero for HP 7 since Harry does not attend Hogwarts this year.
Character Relationship Analysis
One of my goals for this project was to use an external API since it’s probably something I should be comfortable with for a data analyst job. I decided to use the Wikidata API to programmatically populate a list of all the characters within the Harry Potter Universe using a SPARQL query.
One important thing to note is that this may not be completely accurate. I’m assuming Wikidata has every character in the series and all the names are spelled correctly, which is not guaranteed.
Character relationship analysis was pretty complicated to wrap my head around. I went through a couple different approaches and got stuck before finally discovering the solution I implemented. Here’s a couple of the initial ideas I played around with:
- Approach 1: Put each instance of every character name into a list in the order it appears in the text and run some average distance calculation to determine the number of people between two characters. However, this doesn’t take into account how many other non-character-name words are in between the character names, so if that number is large then the characters aren’t really close and the code assumes they are.
- Approach 2: Break my text into a bag-of-paragraphs and determine which characters frequently appear in the same paragraphs together. When I wrote and tested this code, it didn’t appear very accurate. I think this is because the paragraph lengths vary drastically and sometimes a paragraph can be just a single line response in a conversation.
I had initially assumed a relationship between two characters is symmetric. After thinking about this a lot and hacking on the code I wrote for the approaches above, I realized relationship strength is directional: it can differ depending on which character you start from. So Harry Potter’s relationship to Sirius Black may be different than Sirius Black’s relationship to Harry Potter.
Eventually, I decided the best approach is to first pick a viewpoint character and target character (character whose relationship to the viewpoint character we are analyzing). Let’s pick Harry Potter as the viewpoint character and Sirius Black as the target character. Then, I capture every instance where “Sirius” appears within 40 words before and 40 words after every instance of “Harry” and compare this number to the total number of times “Harry” appears in the text. This assumes the characters have a relationship (whether it’s positive or negative is another question), because they are either physically close at that time or one is mentioning the other in conversation.
The output shows there was no relationship at all between Harry and Sirius in the first two books. Since Sirius was only introduced as a character in the third book, this makes sense. The 8.88% from Harry to Sirius in HP 3 can be read as “when the word Harry appears, if you look 40 words to the left and 40 words to the right, 8.88% of the time Sirius also appears”. Similarly, Sirius’ relationship to Harry in HP 3 can be described as “69.17% of the time Sirius appears, Harry appears within 40 words”. I’m making the assumption that if names appear within 40 words of each other, they are together or talking about each other (hence, strengthening their relationship score).
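The scoring described above can be sketched as a short function. This is my simplified reconstruction of the idea rather than the exact project code. The name relationship_score is hypothetical, and it assumes the text has already been turned into a list of lowercase words.

```python
def relationship_score(words, viewpoint, target, window=40):
    """Percent of viewpoint mentions with target within `window` words."""
    positions = [i for i, w in enumerate(words) if w == viewpoint]
    if not positions:
        return 0.0
    hits = 0
    for i in positions:
        start = max(0, i - window)
        # Words before and after this mention, excluding the mention itself.
        nearby = words[start:i] + words[i + 1:i + 1 + window]
        if target in nearby:
            hits += 1
    return 100.0 * hits / len(positions)


# Toy text with a window of 2 words so the effect is easy to follow.
words = "harry saw sirius then harry left and ron spoke to harry".split()
print(round(relationship_score(words, "harry", "sirius", window=2), 2))  # 66.67
print(relationship_score(words, "sirius", "harry", window=2))            # 100.0
```

Even in this toy text the two directions give different scores, because the score is divided by the number of times the viewpoint name appears.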
Harry’s relationship to Sirius drops to 7.37% in HP 4. This is probably because Sirius spends most of HP 4 in hiding and does not see Harry (or anyone) often. Their communication is mainly via an occasional owl post. In HP 5 their relationship grows stronger, likely due to the time spent together at the Order of the Phoenix headquarters at the beginning of the book. Sadly, after Sirius’ tragic death at the end of HP 5, Harry’s relationship score to Sirius dwindles to around 2% for the rest of the series. It doesn’t drop to 0% because Harry still speaks about Sirius to others, spends time in Sirius’ home and mourns Sirius’ death.
In each relationship I analyze, the other character’s relationship score to Harry is much stronger than Harry’s relationship score to them. Since Harry is the main character and the majority of the story is told from Harry’s point of view, this makes sense. Harry’s name appears significantly more than any other name in the series. So even after Sirius’ death, when Sirius’ name appears in the text, it is around Harry’s at least 80% of the time.
Below is a chart which shows Harry’s overall relationship score to and from other major characters. The size of each circle corresponds to the number of times that name appears in the 7 books and the thickness of the line represents the relationship score.
It appears that Harry is closest to Hermione and Ron, which is accurate.
There are a few edge cases my code doesn’t account for. For instance, say Harry’s name appears at the very end of a chapter and the next chapter is a completely different place or time, but Sirius’ name appears within the first 40 words of that next chapter. My analysis still assumes they are either together or talking about each other, even though that may not be the case. However, I think the error rate here is low enough to be acceptable.
Also, my code searches for only one name per character, so Sirius Black is only referenced by Sirius in the code, even though in the text he may sometimes be mentioned by only his last name. In the future, I’ll update my code to allow for nicknames and multiple names for characters. For now, I pick whichever name I think the character is referenced by most in the text (e.g., Dumbledore for Albus Dumbledore).
Here are some of the relationship scores:
Haven’t gotten enough Harry Potter knowledge yet? Then, follow my blog as I explore the Harry Potter Universe through different projects!