If you’ve ever worked on a text and wished you could get a list of characters or see how many times each character was mentioned, this is the tutorial for you.
The code explained in this post will yield a dictionary of proper nouns and the number of times they are used in the text, like the excerpt shown below for the text of Harry Potter and the Philosopher’s Stone. It also links instances of two proper nouns that appear one after the other, like ‘Albus Dumbledore’ or ‘Aunt Petunia’.
The full code for this project is available on Github. You can copy this and run it with the instructions in the README, or you can read the post below for a narrative on the thinking behind the script.
This exercise will introduce some basic Python 3.6 data structures, reading text files, and some basics of the Natural Language Processing Toolkit library (NLTK): tokenization and text tagging.
Step 0 — Fill your toolbox
Everything in this tutorial requires that you are able to run Python 3+ scripts on your computer. Probably the easiest way to do this is to download and install Python if you haven’t got it already. We’ll also be running this script using the Terminal (aka the Command Line if you’re on a Windows machine).
And, of course, you’ll need a test corpus to make sure your program is working, ideally in .txt format to make it easy to read into your program right off the bat. My example will use the text of Harry Potter and the Philosopher’s Stone, but use whatever text you feel comfortable with. As you’re creating the script, it’s best to use a text that you’re familiar with, so you can evaluate how correct the output is, and then test it with something new.
If you don’t already have a file ready, you can get one directly from any eBook source that can generate a .txt file. Project Gutenberg is a great resource to get started with, and includes texts of thousands of books in multiple formats. To get a plain text version of your book, select the “Plain Text UTF-8” link. This will open the full text in your browser; select it all, copy it, then paste it into Notepad or a code editor and save it as a .txt file, ideally in the same folder where you’ll be saving your program to keep it simple.
Step 1 — Read in your text
Let’s create a function to read your .txt file into Python so we can analyze it. I’ll put this into a function called read_text that we’ll use to read the file and then print the content in our terminal.
Open is a built-in Python function that accesses our text file (hp1.txt) and turns it into a file object. The second argument indicates the mode that we will be using the file in: in this case "r", which means it will be for reading only.
Once we have the file object created, we want to read it, which means converting it from a file object into a giant string that we can do things with. We see this happening in the second line: our giant string is a variable with the name book.
Next, it’s good practice to call the built-in close function to free the file handle, although the program still runs without that line.
Then we want to print book to make sure it worked, and return it so we can use our big string in other functions.
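Putting those steps together, read_text might look like this (hp1.txt is my filename; substitute your own):

```python
def read_text(filename):
    """Read a .txt file and return its contents as one giant string."""
    file = open(filename, "r")  # "r" opens the file in read-only mode
    book = file.read()          # turn the file object into one big string
    file.close()                # good practice: free the file handle
    print(book)                 # sanity check that the read worked
    return book

# book = read_text("hp1.txt")
```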
Run this in the Terminal (python script.py) and you should see a printout of the entire book as one giant string. The image below shows what the last few lines of Harry Potter and the Philosopher’s Stone look like. Once you’ve verified this works, you can delete the print(book) line in your code.
We now have the text in a format that Python can read, and we can begin our analysis with NLTK.
Step 2 — Importing NLTK & tokenizing the text
Even if you’ve installed NLTK (pip install nltk if you haven’t), you need to import it into your script to be able to use it. The whole NLTK library is massive, so we’ll be specific about what we need.
from nltk import word_tokenize
The word_tokenize function will split our text up into a list of individual words and punctuation.
If you were to add the line print(tokenize) on the line before the return statement and run the code, you’d see the same passage from the end of Philosopher’s Stone print in your terminal, but it would look like this:
Tokenizing is breaking up a big block of text into a list of words and punctuation that we can analyze. Doing that enables us to use one of the most interesting features of the NLTK library: tagging.
Step 3 — Tagging to identify parts of speech
When you use the NLTK tagging function, you’ll get back a list of tuples, one for each tokenized item. ‘Harry’ from our list in the last step will now look like this: ('Harry', 'NNP').
The first thing in the tuple is the tokenized word you’re analyzing, and the second is a tag from NLTK denoting a part of speech. You can see the full list of tags here, but in this tutorial we’re going to focus on one, ‘NNP’ (singular proper noun), to extract character names.
To use the tagging function, you need to import it. To do this, you can simply add pos_tag to the first line in your code, like this:
from nltk import pos_tag, word_tokenize
To use it, you call it on tokenized text. For example:
When we call this function and print the output, we get the following:
Now, let’s modify the function to list and print out only the ‘NNP’-tagged words; a set is used to remove duplicates:
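A sketch of that modification; the filtering itself needs no NLTK at all, since it just walks the list of (word, tag) tuples (the function and variable names here are my own):

```python
def find_proper_nouns(tagged_text):
    """Collect every word tagged 'NNP', deduplicated with a set."""
    proper_nouns = []
    for word, tag in tagged_text:
        if tag == "NNP":
            proper_nouns.append(word.lower())
    return set(proper_nouns)

# A hand-tagged sample, standing in for real pos_tag output:
sample = [("Hurry", "NNP"), ("up", "RB"), ("!", "."),
          ("Harry", "NNP"), ("nodded", "VBD"), ("Harry", "NNP")]
print(find_proper_nouns(sample))  # {'hurry', 'harry'} (set order varies)
```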
This will give you an output that looks like this for the highlighted passage above:
Limitations of algorithms and how to work with them
If you noticed that there are a couple of problems here, you’re absolutely right, and this is where knowing your text is really helpful. Let’s look at the issues we can see just from this tiny bit of text:
- Non-names are in our list: ‘Hope’, ‘Hurry’, and ‘See’ are not actually proper nouns. They are really verbs, but, used in this colloquial way, they have been tagged as proper nouns by the NLTK algorithm.
- ‘Uncle’ itself is generally not used as a proper name; it is almost always connected to the given name, most often ‘Vernon’ in the Harry Potter series. If we took a bigger sample, we’d have a similar problem with the Weasley parents (often referred to as ‘Mr. Weasley’ and ‘Mrs. Weasley’) and any of the professors (‘Professor Snape’, ‘Professor McGonagall’, ‘Madam Pomfrey’).
There’s another potential issue: first and last names. For example, “Harry” and “Potter” will both be counted separately, just like “Uncle” and “Vernon”. But we can’t always attach “Harry” to “Potter”; if we do, we leave out Lily and James, Harry’s parents.
There are two straightforward ways to handle this:
- Make a list of prefixes that appear frequently in the text and link them to the next “NNP” word in your list, like ‘Mr.’, ‘Mrs.’, ‘Lord’, ‘Professor’, etc. This could be done working from your knowledge of the text and/or by examining all of the “NNP”-categorized words. I tried this and it worked reasonably well, but it depends on you being able to anticipate every prefix, and without a lot of extra work it doesn’t handle cases where first and last names are used together.
- Look for all instances where two “NNP” words appear together and count them as a single term. This will give you some false positives, but is a more programmatic way of looking at the text.
Step 4 — Identifying single and double proper nouns
I’m going to use option 2, since we also want to group instances of first and last names. I also split the tagging and proper noun identification into two functions: one to tag the text (tagging) and one to add the proper nouns into their own list.
Notice that I made a couple of changes to the construction of the existing code: rather than iterate through tagged_text directly (for word_tuple in tagged_text), I switched to a numeric iterator (i) so I could easily check the next word in the list to see if it’s a noun.
If we find a proper noun, we want to check the next word’s tag to see whether it’s a proper noun too. One of two things can happen here:
- If the next word isn’t a proper noun, we just add the noun we did find to the list and move on; or
- If the next word is a proper noun, we stick them together and add them to our proper_nouns list. Note that we don’t want to double count the second noun (e.g. add both ‘uncle vernon’ and ‘vernon’ to the list), so we skip our iterator forward by adding an extra 1 to i.
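Put together, that loop might look like the sketch below. It works on the already-tagged list, so you can test it with a hand-made sample (the names here are my own):

```python
def find_proper_nouns(tagged_text):
    """List single proper nouns, joining adjacent 'NNP' pairs into one term."""
    proper_nouns = []
    i = 0
    while i < len(tagged_text):
        if tagged_text[i][1] == "NNP":
            if i + 1 < len(tagged_text) and tagged_text[i + 1][1] == "NNP":
                # Two proper nouns in a row: join them into a single term...
                proper_nouns.append(tagged_text[i][0].lower() + " "
                                    + tagged_text[i + 1][0].lower())
                i += 1  # ...and skip the second so it isn't counted twice
            else:
                proper_nouns.append(tagged_text[i][0].lower())
        i += 1
    return proper_nouns

sample = [("Uncle", "NNP"), ("Vernon", "NNP"), ("glared", "VBD"),
          ("at", "IN"), ("Harry", "NNP")]
print(find_proper_nouns(sample))  # ['uncle vernon', 'harry']
```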
An important note here: people sometimes remove punctuation when doing a textual analysis, but here’s why I didn’t. Consider this commonly occurring phrase in all of the Potter books: “Harry, Ron and Hermione”. The comma here is pretty important when used in conjunction with NLP. If I leave the sentence as-is, NLP reads it as NNP / , / NNP / CC / NNP, where CC is a conjunction. But if I take out the comma, I get NNP / NNP / CC / NNP. With this input, my algorithm above would read ‘Harry Ron’ as a single noun, like ‘Uncle Vernon’.
Running this code on the entire text of The Philosopher’s Stone, we get a list (which I’ve turned into a set to eliminate duplicates) of single and double nouns:
Remember this list includes every proper noun or proper noun-following proper noun in the whole first book, duplicates removed, and in no particular order.
How do we narrow this down to a list of characters? Remember that character names are often mentioned many times, combinations like “owl emporium” a good deal less often. In using a set in an effort to cut out repeated terms, I’ve made “owl emporium” of equal weight to “Harry Potter”. So, we’re going to use the Counter class from Python’s built-in collections module to count up the instances, which should surface more of our characters.
Step 5 — Counting instances of proper nouns
Head back to the top of your document. We need to explicitly import the Counter class from the collections module into our script, so type:
from collections import Counter
Now, we’ll create a function to count up every unique instance of the single and double proper nouns we encounter. We’ll pass in our list of proper nouns, and the ‘top x’ number of results we want. The function will return a dictionary of the top nouns used in the text.
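A sketch of that counting function (the name and the top parameter are mine):

```python
from collections import Counter

def count_proper_nouns(proper_nouns, top):
    """Return a dict of the `top` most frequent proper nouns and their counts."""
    counts = Counter(proper_nouns)
    return dict(counts.most_common(top))

nouns = ["harry", "ron", "harry", "hermione", "harry", "ron"]
print(count_proper_nouns(nouns, 2))  # {'harry': 3, 'ron': 2}
```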
When we run our new code, we get a dictionary of the top 100 single and double proper nouns, where the value is the number of times they occur in the text. What you see in the console should look like this excerpt.
Here’s how to read this: in the text, Harry is mentioned by only his first name at least 1,280 times and by his first and last name at least 27 times.
Now comes the fun part: making the computer’s output better with your knowledge.
Step 6 — Turning data into insight or a research question
There are lots of practical applications for this, but let’s look at just one for the top 100 nouns: comparing mentions of the three protagonists — Harry, Ron & Hermione — in The Philosopher’s Stone.
As you can see from this really quick chart I made, aggregating mentions of these three characters just from the Top 100 noun list we made, Harry is mentioned (as ‘Harry’, ‘Mr. Potter’, and ‘Harry Potter’) much more than Ron, and Ron is mentioned nearly twice as much as Hermione (as ‘Hermione’ and ‘Miss Granger’).
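The aggregation behind a chart like that is only a few lines of Python. The variant list and all counts other than Harry’s are illustrative stand-ins here; run the script on your own text to get real numbers:

```python
def total_mentions(counts, variants):
    """Sum the counts of every name variant that refers to one character."""
    return sum(counts.get(name, 0) for name in variants)

# Illustrative counts (only 'harry' and 'harry potter' come from the text above).
counts = {"harry": 1280, "harry potter": 27, "mr. potter": 3}
print(total_mentions(counts, ["harry", "harry potter", "mr. potter"]))  # 1310
```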
You could use this script to answer questions like:
- Do mentions of female protagonists increase over the course of the series?
- Where are new characters or names for characters introduced?
- How are characters referred to in the books and what does this say about their social position or portrayal by the author?
- What place names are prominent in the Harry Potter series?
- What objects are introduced in the Harry Potter series, and when?
I hope you’ve enjoyed this post and the code is a useful tool in your research. Stay tuned for more blog posts on textual analysis in the near future, and leave any questions or feedback in the comments.