A Bossy Sort of Voice

Quantifying gender bias in “Harry Potter” with Python and NLTK

As a member of the Star Wars generation (or Xennial), I more or less missed the Harry Potter craze when it happened. Sure, I saw all of the movies in the theatre and witnessed the minor miracle of long lines of children outside bookstores every time a new installment came out. How could fast-paced action scenes on flying vehicles, an orphan with hidden powers and a Dark Lord reviving an evil empire as a nemesis compare to Star W... OK, they were similar, but I liked my Space Opera and didn’t internalize any of the excitement about a magic British boarding school.

One earnest “chosen one”, one grouchy protector, one smart, resourceful woman who always gets the job done.

That is, I didn’t until my son decided he loved Harry Potter.

Having a rough idea of what was in the books, I was a little nervous about them being too scary for a 5-year-old, but a distinct plus in my eyes for reading them as a bedtime story was the character of Hermione. What would be better for a little boy than a story with a tough, smart female character, written by a woman who takes on misogynists on Twitter like a boss? Seemed like a good way to help him see women as full people before the world taught him differently.

But the reality didn’t match my hopes. In the books, Hermione frequently cried or whined or timidly took action — the way the narrator told it, she was not the smart, capable hero I pictured her as.

She is introduced as someone with “a bossy sort of voice”. Bossy.

This blog post outlines my process for a project where I set out to programmatically identify gender bias in the Harry Potter series. I did it because I was disappointed that the female characters are frequently described with biased language. And I wanted to prove it.

Channeling my inner Hermione

My first step was to make sure I wasn’t getting it wrong. Was I imagining this?

Nope.

There are some good blog posts and magazine articles on this topic, and academics have written articles about the gender stereotypes in the Harry Potter novels too. They point out how badly Ron treats Hermione (ostensibly her friend then love interest), portrayals of Aunt Petunia and Mrs. Weasley as nagging mothers, the lack of female representation in positions of power at Hogwarts and the Ministry of Magic, everything about the Veelas, and so on.

While some of these articles mentioned the language used to describe female behavior (with “giggling” and “squeaking” most frequently cited), I couldn’t find a quantitative analysis of the language anywhere. Probably because there are more than 1.1 million words in the seven books. Not trivial.

So, I decided to do that analysis myself. With code.

Approaching the problem

This could be a massive project, but I needed to start with a reasonable scope, so I settled on this as a research question: compared to Harry and Ron, are Hermione’s actions in the narration described with negative words that are used exclusively for her and are usually directed at women?

To get the data I needed for this, my main tools were Python and NLTK, the Natural Language Toolkit.

There were three steps I needed to take to answer my question:

  1. Get the text into a format that Python can read.
  2. Isolate the parts of the text to analyze.
  3. Find and summarize occurrences of the relevant words.

This post will take us quickly through high-level explanations of the steps as I explained them in a talk at Codeland 2018. For more details on how I did it, you can check out the GitHub repo, or stay tuned for future posts where I’ll dive into each step in a lot of detail.

Finding the answer

Step 1: Get the text into a format that Python can read.

We need to go from a text file to a format that Python can understand and we can use our tools with. That means reading a file into our program and splitting the text into a list of individual words, as shown below.

Evolution of a snippet of text from “The Philosopher’s Stone”.

You might wonder why I didn’t take out the punctuation. Good catch. You’ll see why in a moment.
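A minimal sketch of this step, using only the standard library rather than whatever tokenizer the original project used. The regular expression, the function names `tokenize` and `read_tokens`, and the sample sentence are all assumptions made up for illustration; the point is only that punctuation and quotation marks survive as their own tokens.

```python
import re

# Assumed pattern (not from the original project): a word, optionally with
# internal apostrophes, OR any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into word and punctuation tokens, keeping quotation marks."""
    return TOKEN_PATTERN.findall(text)


def read_tokens(path):
    """Read a UTF-8 text file (hypothetical path) and return its tokens."""
    with open(path, encoding="utf-8") as f:
        return tokenize(f.read())


# Invented sample sentence, not a quote from the books.
sample = '"Nothing," said Harry. He looked at Ron.'
tokens = tokenize(sample)
print(tokens)
```

Keeping the quotation marks as separate tokens is what makes it possible to tell dialog from narration later on.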

Step 2: Isolate the parts of the text to analyze.

For this project, I’m looking just at narration, only for the three principal characters, and only at how actions are described. That means that I need to separate the narration from the dialog and find references to the three protagonists.

Separating the narration from the dialog is what I need the quotation marks for: I wrote an algorithm to find text between quotation marks and categorize it as dialog, then categorize the rest as narrative. This will be the topic of a future post.
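One way the quotation-mark idea could be sketched is a simple toggle: everything between an opening and a closing quote goes to dialog, everything else to narrative. This is an illustration under assumptions, not the algorithm from the project; the function name, the quote sets, and the sample tokens are invented, and it does not handle nested or unbalanced quotes.

```python
# Assumed quote characters; a real edition of the text may use others.
OPENING_QUOTES = {'"', '“'}
CLOSING_QUOTES = {'"', '”'}


def split_dialog_and_narrative(tokens):
    """Return (dialog_tokens, narrative_tokens), dropping the quote marks."""
    dialog, narrative = [], []
    in_dialog = False
    for token in tokens:
        if not in_dialog and token in OPENING_QUOTES:
            in_dialog = True       # a quote opens: start collecting dialog
        elif in_dialog and token in CLOSING_QUOTES:
            in_dialog = False      # the quote closes: back to narrative
        elif in_dialog:
            dialog.append(token)
        else:
            narrative.append(token)
    return dialog, narrative


# Invented sample tokens, not a quote from the books.
example_tokens = ['“', 'Nothing', ',', '”', 'said', 'Harry', '.',
                  'He', 'looked', 'at', 'Ron', '.']
example_dialog, example_narrative = split_dialog_and_narrative(example_tokens)
print(example_narrative)
```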

The next task was to chop up the narrative into sentences by splitting on the periods, and group them based on mentions of the protagonists by name. While this had the downside that I dropped some sentences where the characters were referred to in other ways (e.g. ‘she’, ‘he’, ‘her’, etc.) this approach made the exercise much more precise. I used a dictionary — with ‘Harry’, ‘Ron’ and ‘Hermione’ as keys — to organize the snippets of text into groups.
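The sentence splitting and grouping described above might look roughly like this. The helper names and the sample tokens are hypothetical. Note one consequence of grouping by name: a sentence that mentions two characters lands in both groups.

```python
CHARACTERS = ("Harry", "Ron", "Hermione")


def split_sentences(narrative_tokens):
    """Chop a flat token list into sentences, ending each one at a period."""
    sentences, current = [], []
    for token in narrative_tokens:
        current.append(token)
        if token == ".":
            sentences.append(current)
            current = []
    if current:                    # keep any trailing text with no final period
        sentences.append(current)
    return sentences


def group_by_character(sentences, characters=CHARACTERS):
    """Map each character name to the sentences that mention it by name."""
    groups = {name: [] for name in characters}
    for sentence in sentences:
        for name in characters:
            if name in sentence:
                groups[name].append(sentence)
    return groups


# Invented sample tokens, not a quote from the books.
narrative_tokens = ['said', 'Harry', '.', 'He', 'looked', 'at', 'Ron', '.',
                    'Hermione', 'sighed', '.']
grouped = group_by_character(split_sentences(narrative_tokens))
print(grouped["Ron"])
```

The second sentence, which only says "He", is filed under Ron and not Harry. That is the trade-off mentioned above: pronoun references are lost in exchange for precision.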

The final part of this step was to identify the parts of speech I wanted to analyze: verbs and adverbs. To do this, I used NLTK’s pos_tag function, which identifies what part of speech each word represents and assigns it the corresponding code (e.g. NNP is a proper noun, JJ is an adjective, RB is an adverb, etc.). How this works will be the subject of a future post too!
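NLTK's `pos_tag` takes a list of tokens and returns a list of `(word, tag)` tuples. Once the text is in that shape, picking out verbs and adverbs comes down to checking the tag prefix, since Penn Treebank verb tags start with `VB` and adverb tags with `RB`. The sketch below works on an already tagged list so it runs without NLTK installed; the helper name `verbs_and_adverbs` is an assumption, not something from the project.

```python
def verbs_and_adverbs(tagged_tokens):
    """Keep only (word, tag) pairs whose tag marks a verb (VB*) or adverb (RB*)."""
    return [(word, tag) for word, tag in tagged_tokens
            if tag.startswith(("VB", "RB"))]


# The shape of output pos_tag produces, using tags from this post's sample text.
tagged_sample = [("Harry", "NNP"), ("looked", "VBD"), ("at", "IN"),
                 ("Ron", "NNP"), (",", ","), ("and", "CC"),
                 ("was", "VBD"), ("relieved", "VBN")]

selected = verbs_and_adverbs(tagged_sample)
print(selected)
```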

Here’s how this looks using the same text snippet from The Philosopher’s Stone.

Breaking down step 2.

Here comes the really fun part — analysis!

Step 3: Find and summarize occurrences of the relevant words.

Now that we have each word labeled as a part of speech, we need to grab all of the verbs and adverbs for each character and find the ones that are unique to each.

To do this, I needed to write some code to find the verbs and adverbs associated with a specific noun. Even though I had split the narrative by character name, I couldn’t simply take all verbs and adverbs from those sentences and associate them with the character. The text in our example is a good demonstration of why:

[(‘Harry’, ‘NNP’), (‘looked’, ‘VBD’), (‘at’, ‘IN’), (‘Ron’, ‘NNP’), (‘,’, ‘,’), (‘and’, ‘CC’), (‘was’, ‘VBD’), (‘relieved’, ‘VBN’), (‘to’, ‘TO’), (‘see’, ‘VB’), (‘by’, ‘IN’), (‘his’, ‘PRP$’), (‘stunned’, ‘JJ’), (‘face’, ‘NN’), (‘that’, ‘IN’), (‘he’, ‘PRP’), (‘had’, ‘VBD’), (“n’t”, ‘RB’), (‘learned’, ‘VBN’), (‘all’, ‘DT’), (‘of’, ‘IN’), (‘the’, ‘DT’), (‘course’, ‘NN’), (‘books’, ‘NNS’), (‘by’, ‘IN’), (‘heart’, ‘NN’), (‘either’, ‘NN’), (‘.’, ‘.’)]

Notice that:

  • It mentions both Harry and Ron, so it would be unclear which person to associate the verbs or adverbs with.
  • Some of the verbs refer to Harry (looked, was, relieved, see) and others to Ron (had, learned), with a pronoun in between (he) that could muddy the waters.

English is a complex language, and while parsing this sentence in this way isn’t necessarily challenging for a person who speaks English, teaching a computer this isn’t trivial.

In my algorithm, I look for the following language patterns around each character’s name:

The verb patterns I searched for in the tagged narrative passages, where the noun was Harry, Ron or Hermione.

How I did this will be the subject of a future blog post, too!
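To give a feel for what a pattern search could look like, here is a sketch of one hypothetical pattern: a character's name followed directly by a run of verbs and adverbs. This single pattern is an assumption made up for illustration and is much simpler than what a real analysis needs; the function name is invented too.

```python
def descriptors_after_name(tagged_sentence,
                           characters=("Harry", "Ron", "Hermione")):
    """Collect verbs/adverbs that directly follow each character's name."""
    found = {name: [] for name in characters}
    for i, (word, _tag) in enumerate(tagged_sentence):
        if word not in found:
            continue
        # Walk forward from the name, stopping at the first non-verb, non-adverb.
        for next_word, next_tag in tagged_sentence[i + 1:]:
            if not next_tag.startswith(("VB", "RB")):
                break
            found[word].append(next_word)
    return found


# The start of the tagged sample sentence from this post.
tagged_sentence = [("Harry", "NNP"), ("looked", "VBD"), ("at", "IN"),
                   ("Ron", "NNP"), (",", ","), ("and", "CC"),
                   ("was", "VBD"), ("relieved", "VBN")]

matches = descriptors_after_name(tagged_sentence)
print(matches)
```

With this pattern, "looked" is attached to Harry and nothing is attached to Ron, because the token after his name is a comma. It also misses "was relieved", which shows why more than one pattern is needed.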

This enabled me to make a new dictionary. Again, the character names are the keys, but the values are only the verbs and adverbs used to describe that character. For example, in our sample text:

{ 'Harry': [‘looked’, ‘said’],
'Ron': [‘muttered’],
'Hermione': [‘said’]}

The final step is to find which of these words are only used to describe each character in the book. To do this for each of the characters, I made a set of all of the words used to describe the other two, then looped through that character’s words to check if they appeared in the set. If they didn’t, they went into a new dictionary of unique descriptors. With our text sample, that dictionary of unique words looks like this:

{ 'Harry': [‘looked’],
'Ron': [‘muttered’],
'Hermione': []}

Effectively, because both Harry and Hermione ‘said’ something, that word was not unique, and didn’t make the cut.
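The set-based filtering described above can be sketched in a few lines. The function name `unique_descriptors` is an assumption; the input is the sample dictionary from this post.

```python
def unique_descriptors(descriptors):
    """For each character, keep only the words not used for any other character."""
    unique = {}
    for name, words in descriptors.items():
        # Build the set of every word used to describe the *other* characters.
        others = set()
        for other_name, other_words in descriptors.items():
            if other_name != name:
                others.update(other_words)
        unique[name] = [word for word in words if word not in others]
    return unique


sample_descriptors = {
    "Harry": ["looked", "said"],
    "Ron": ["muttered"],
    "Hermione": ["said"],
}

unique_words = unique_descriptors(sample_descriptors)
print(unique_words)
```

Using a set for the other characters' words keeps each membership check fast, which matters once the word lists cover whole books.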

Here’s a summary of the actions in this step and the changes in the data with each iteration.

Breaking down step 3.

That’s it! But what did I find?

The analysis

It really isn’t just me — Hermione is described at times by Rowling with words that are applied almost exclusively to women and girls, words that are not used when she writes about Harry and Ron.

The image below shows word clouds for each of the 7 books, representing frequency of action words that are used in that book to describe Hermione, but not Harry and Ron.

Sigh.

Notice that Hermione is frequently described as squealing, crying, shrieking, squeaking, and doing things breathlessly, coolly and timidly. Sure, some of the words are innocuous, but many are not, especially when they are only being used to describe the female character, and when that female character is smart and resourceful.

Also, teenaged boys are squeaky too. Yet they are not described that way.

Sure, you may say, but what about the words used exclusively to describe Harry and Ron?

I looked at those too, and it doesn’t make me feel much better, to be honest.

In the table below, rows represent Harry, Ron and Hermione, and columns are for each of the 7 books. In each cell, I’ve listed the top 10 “exclusive” verbs and adverbs for that character as “word, frequency”. The numbers may look low, but keep in mind I gave up some volume for accuracy.
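Producing "word, frequency" pairs like these is a natural fit for `collections.Counter`. The sketch below is an assumption about how such a ranking could be computed, and the word list is a toy input invented for illustration, not data from the analysis.

```python
from collections import Counter


def top_descriptors(words, n=10):
    """Return the n most frequent words as (word, frequency) pairs."""
    return Counter(words).most_common(n)


# Toy input, invented for illustration only.
toy_words = ["squeaked", "whispered", "squeaked", "shrieked",
             "squeaked", "whispered"]

top_two = top_descriptors(toy_words, n=2)
print(top_two)
```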

So, what about Harry and Ron’s unique descriptors? What can those tell us? I think they tell us what’s special about each of these characters relative to one another.

Double sigh.

Here’s what I take from this data:

  • Harry’s unique descriptors are often about his observations and thoughts. This makes complete sense: he is the main character in all of the books, and Rowling tells the story largely through him.
  • Ron’s unique descriptors are often about his behavior. Irritation, bellowing, blurting, grumpy. Because that’s Ron’s personality.
  • Hermione’s unique descriptors often don’t establish her as the “greatest witch of her age”, more knowledgeable and clever than her exceptional friends. Instead, her unique words slot her squarely as a traditional female character. Especially in Book 7 when she’s owning some really bad witches and wizards.

Conclusions

I don’t hate Harry Potter, and I didn’t stop reading the books to my son. I’m still partial to the galaxy far, far away, but I’ve become a Potter fan. On Pottermore I found out my house is Ravenclaw (obvs), and I went out and purchased a t-shirt to let the world know this. My patronus is a mole, which is unfortunate when I was hoping for wolf or lynx or lioness or something more clearly badass, but I have learned to accept it. When my son has screen time, I suggest a Harry Potter movie. Totally for him, of course.

When I talked to someone about this project, they told me that they had always found Hermione “low key annoying” and hadn’t really stopped to think why.

I think this is why.

Even the best writer can’t undo all of their biases when creating female characters. Hermione might be the greatest witch (or wizard, let’s be real) of her age, but our inherent biases let this coexist with narrative that paints her in a way that reinforces stereotypes about women. These biases are so ingrained, we barely notice they are being tripped.

So, what would I like you to take from this? Always watch for bias, even in the literature and media that you love, and when you see it, name it, acknowledge it, and point it out. Especially to the kids in your life.

Like what you read? Give Eleanor Stribling a round of applause.
