Dialog, delineated!

5 steps to separating conversation from narration in any English text

About two weeks ago, I wrote a post about how I used Python and the Natural Language Toolkit (NLTK) to uncover gender bias in the Harry Potter series. This post goes into detail about how I separated the dialog from the narrative using Python.

Why do this? I wanted to find out whether bias existed specifically in the narrative, that is, in how J.K. Rowling described the characters’ actions, so I needed to parse out the dialog and look only at the narration. My goal was to get two lists: one of all of the dialog in order, and one of all the narration in order, as shown below.

Separating dialog from narrative

I thought this would be something a billion people had already done, but I had trouble finding an explanation of how to do it.

We’re going to do this in five steps — the ones with <code> after them involve some programming:

  1. Read in some text data from a .txt file. <code>
  2. Tokenize the text. <code>
  3. Confirm what quotation marks look like in your text.
  4. Write an algorithm to put everything between quotes into one list and everything else in another. <code>
  5. Verify our results.

You can see all the code together on GitHub.

For this test, I’ll use a passage from A Christmas Carol by Charles Dickens, because it mixes plenty of dialog with narrative and is short enough that we can check the results by eye. I got the text from Project Gutenberg by copying and pasting it from the website (shown below) into a file and saving it with the .txt extension. All of the code and this text snippet are available in the repo.

Text from “A Christmas Carol” on gutenberg.org

If you want to see similar code in action with the Harry Potter project mentioned above, check out the walkthrough here.

These examples are shown in Python 3.6 and NLTK 3.2.2.

Let’s get started!

Step 1: Read in our text with Python

Fortunately, if you can get your text into a .txt file, Python makes the process of getting the data into your program pretty straightforward.

# textfile holds the path to your .txt file
text = ''
with open(textfile, 'rt') as file_in:
    for line in file_in:
        text = text + line

This gives us one giant string of text that the rest of our Python program can work with. Printed out in the terminal, it looks a lot like it did on the webpage.

Sample of our text read in to Python.
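If you'd rather have this step as a reusable function, here's a minimal sketch. It is not the code from the repo: the name read_text and the explicit utf-8 encoding are assumptions for illustration. Being explicit about the encoding can help because text pasted from a web page often contains curly quotes, which some default encodings handle badly.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the whole file at `path` as one string.

    The function name and the utf-8 default are assumptions,
    not part of the original project.
    """
    with open(path, "rt", encoding=encoding) as file_in:
        return file_in.read()


# Demo: write a tiny sample file to a temp directory, then read it back.
sample = "“Bah!” said Scrooge. “Humbug!”\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wt", encoding="utf-8") as file_out:
        file_out.write(sample)
    text = read_text(sample_path)

print(repr(text))
```

Calling file_in.read() pulls in the whole file at once, which produces the same string as adding the lines together one by one.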

Step 2: Tokenize the text

The Natural Language Toolkit (NLTK) makes a guest appearance here: we can use its word_tokenize function to split our gigantic string into a list of individual words and punctuation marks.

from nltk.tokenize import word_tokenize
tokenized = word_tokenize(text)

Here’s a sample of what’s in the tokenized variable this function returns.

Tokenized text!
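To get a feel for what a token list looks like without installing anything, here's a rough stand-in built on Python's re module. To be clear, this is not NLTK's algorithm and its output will differ from word_tokenize in places; simple_tokenize and TOKEN_PATTERN are hypothetical names made up for this sketch.

```python
import re

# A token is either a word (allowing apostrophes or hyphens inside it)
# or any single non-space character such as punctuation or a quote mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[’'-]\w+)*|\S")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (rough approximation)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("“Bah!” said Scrooge, “Humbug!”")
print(tokens)
# ['“', 'Bah', '!', '”', 'said', 'Scrooge', ',', '“', 'Humbug', '!', '”']
```

Notice that this stand-in always makes each quotation mark its own token. A real tokenizer may instead leave the mark attached to a neighboring word, so it's worth looking at your actual tokens before relying on either shape.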

Now remember, this text is all in the same order as it was on the webpage where we found it, but we’ve read it into Python and split it into individual words and punctuation. We want to split it into a dialog list and a narrative list.

The key here is finding a consistent rule that differentiates one of those types of text from the other. Fortunately, in English, dialog is almost always enclosed in quotation marks!

Step 3: Confirm what quotation marks look like in your text

Dialog is almost always wrapped in quotation marks, but the characters themselves vary. They might be perfectly straight or slightly slanted, sometimes with a serif, sometimes without. In some texts, the opening and closing quotes are the same character; in others, they slant in different directions.

It’s important to figure out the standard in your particular text so that you can tell your program what to look for.

One thing that differs in this example from the original code I wrote is that in this text, the tokenization didn’t split the quotation mark from the first word in the dialog: I got the single token “Christmas instead of two separate tokens, “ and Christmas. I think the code I’ve written here is more flexible and applicable to different texts because it accounts for this possibility. We’ll see how in a moment.
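One quick way to confirm which quote characters a text uses is to count them. Here's a small sketch of that idea; count_quote_chars and the QUOTE_CANDIDATES string are hypothetical names for illustration, and the candidate list is only a starting point that you may need to extend for your own text.

```python
from collections import Counter

# Characters commonly used as quotation marks (an assumption; extend as needed).
QUOTE_CANDIDATES = "\"'“”‘’«»"


def count_quote_chars(text, candidates=QUOTE_CANDIDATES):
    """Count how often each candidate quote character appears in text."""
    return Counter(ch for ch in text if ch in candidates)


sample = "“Bah!” said Scrooge. “You don’t mean that, I am sure?”"
counts = count_quote_chars(sample)
for char, n in counts.most_common():
    # Print the character, its code point, and how often it occurs.
    print(repr(char), hex(ord(char)), n)
```

If the opening and closing marks show up in equal numbers, that's a good sign they are the delimiters you want. The right single quote (’) also serves as an apostrophe in words like don’t, which is one reason single quotes are riskier to use as dialog markers.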

Step 4: Write an algorithm to put everything between quotes into one list and everything else in another

Here we go! The comments (the lines that start with #) explain each step.

# these lists will hold the dialog and the narrative we find
parsed_dialog = []
parsed_narrative = []
# and this list will be a bucket for the text we're currently exploring
current = []
# now let's set up values that will help us loop through the text
length = len(tokenized)
found_q = False
counter = 0
# here's where the quote characters are important!
quote_open, quote_close = '“', '”'
# now we'll start our loop saying that as long as our sentence is...
while counter < length:
    word = tokenized[counter]
    # until we find a quotation mark, we're working with narrative
    if quote_open not in word and quote_close not in word:
        current.append(word)
    # here's what we do when we find a token containing a quotation mark
    else:
        # we append the narrative we've collected & clear out our
        # current variable
        parsed_narrative.append(current)
        current = []
        # now current is ready to hold dialog and we're working on
        # a piece of dialog
        current.append(word)
        found_q = True

        # while we're in the quote, we're going to increment the counter
        # and append to current in this while loop
        while found_q and counter < length-1:
            counter += 1
            if quote_close not in tokenized[counter]:
                current.append(tokenized[counter])
            else:
                # if we find a closing quote, we add our dialog to the
                # appropriate list, clear current and flip our found_q
                # variable to False
                current.append(tokenized[counter])
                parsed_dialog.append(current)
                current = []
                found_q = False
    # increment the counter to move us through the text
    counter += 1
# anything still in current after the last quote is trailing narrative
if current and not found_q:
    parsed_narrative.append(current)

Take a look at the code on GitHub to see how this all comes together!
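The same idea can also be packaged as a function. What follows is a sketch rather than the code in the repo, and split_dialog is a hypothetical name. It makes a few choices of its own that you may or may not want: a token that contains both marks (like “Humbug.”) counts as a complete quote, a stray closing mark outside a quote stays in the narrative, and empty narrative lists are skipped.

```python
def split_dialog(tokens, quote_open="“", quote_close="”"):
    """Split tokens into (dialog, narrative), each a list of token lists."""
    dialog, narrative, current = [], [], []
    in_quote = False
    for token in tokens:
        if in_quote:
            # Inside a quote: keep collecting until a closing mark shows up.
            current.append(token)
            closed = quote_close in token
        elif quote_open in token:
            # A quote starts here, so file away the narration gathered so far.
            if current:
                narrative.append(current)
            current = [token]
            in_quote = True
            # The quote also ends here if a closing mark follows the opening one.
            rest = token.split(quote_open, 1)[1]
            closed = quote_close in rest
        else:
            # Plain narration.
            current.append(token)
            closed = False
        if closed:
            dialog.append(current)
            current = []
            in_quote = False
    # Whatever is left is trailing narration, or an unclosed quote.
    if current:
        (dialog if in_quote else narrative).append(current)
    return dialog, narrative


tokens = ['said', ',', '“Bah', '!', '”', 'again', ';', '“Humbug.”',
          '“Don’t', 'be', 'cross', '!', '”', 'said', 'the', 'nephew', '.']
dialog, narrative = split_dialog(tokens)
print(dialog)
print(narrative)
```

Because the function takes the quote characters as parameters, you can reuse it on a text with different marks by passing in your own quote_open and quote_close.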

Step 5: Verify our results

Make sure to print your parsed_dialog and parsed_narrative lists and compare them to your original text. This is why it’s so important to run a short test passage before feeding in a longer one. If your quote characters are off, for example, that could really mess up your results.

Here’s what my parsed_dialog list looks like after running the original text through my program.

[['“A', 'merry', 'Christmas', ',', 'uncle', '!', 'God', 'save', 'you', '!', '”'], ['“Bah', '!', '”'], ['“Humbug', '!', '”'], ['“Christmas', 'a', 'humbug', ',', 'uncle', '!', '”'], ['“You', 'don’t', 'mean', 'that', ',', 'I', 'am', 'sure', '?', '”'], ['“I', 'do', ',', '”'], ['“Merry', 'Christmas', '!', 'What', 'right', 'have', 'you', 'to', 'be', 'merry', '?', 'What', 'reason', 'have', 'you', 'to', 'be', 'merry', '?', 'You’re', 'poor', 'enough.”'], ['“Come', ',', 'then', ',', '”'], ['“What', 'right', 'have', 'you', 'to', 'be', 'dismal', '?', 'What', 'reason', 'have', 'you', 'to', 'be', 'morose', '?', 'You’re', 'rich', 'enough.”'], ['“Bah', '!', '”'], ['“Humbug.”', '“Don’t', 'be', 'cross', ',', 'uncle', '!', '”'], ['“What', 'else', 'can', 'I', 'be', ',', '”'], ['“when', 'I', 'live', 'in', 'such', 'a', 'world', 'of', 'fools', 'as', 'this', '?', 'Merry', 'Christmas', '!', 'Out', 'upon', 'merry', 'Christmas', '!', 'What’s', 'Christmas', 'time', 'to', 'you', 'but', 'a', 'time', 'for', 'paying', 'bills', 'without', 'money', ';', 'a', 'time', 'for', 'finding', 'yourself', 'a', 'year', 'older', ',', 'but', 'not', 'an', 'hour', 'richer', ';', 'a', 'time', 'for', 'balancing', 'your', 'books', 'and', 'having', 'every', 'item', 'in', '’em', 'through', 'a', 'round', 'dozen', 'of', 'months', 'presented', 'dead', 'against', 'you', '?', 'If', 'I', 'could', 'work', 'my', 'will', ',', '”'], ['“every', 'idiot', 'who', 'goes', 'about', 'with', '‘Merry', 'Christmas’', 'on', 'his', 'lips', ',', 'should', 'be', 'boiled', 'with', 'his', 'own', 'pudding', ',', 'and', 'buried', 'with', 'a', 'stake', 'of', 'holly', 'through', 'his', 'heart', '.', 'He', 'should', '!', '”'], ['“Uncle', '!', '”'], ['“Nephew', '!', '”'], ['“keep', 'Christmas', 'in', 'your', 'own', 'way', ',', 'and', 'let', 'me', 'keep', 'it', 'in', 'mine.”'], ['“Keep', 'it', '!', '”'], ['“But', 'you', 'don’t', 'keep', 'it.”'], ['“Let', 'me', 'leave', 'it', 'alone', ',', 'then', ',', '”'], ['“Much', 'good', 'may', 'it', 'do', 'you', 
'!', 'Much', 'good', 'it', 'has', 'ever', 'done', 'you', '!', '”'], ['“There', 'are', 'many', 'things', 'from', 'which', 'I', 'might', 'have', 'derived', 'good', ',', 'by', 'which', 'I', 'have', 'not', 'profited', ',', 'I', 'dare', 'say', ',', '”'], ['“Christmas', 'among', 'the', 'rest', '.', 'But', 'I', 'am', 'sure', 'I', 'have', 'always', 'thought', 'of', 'Christmas', 'time', ',', 'when', 'it', 'has', 'come', 'round—apart', 'from', 'the', 'veneration', 'due', 'to', 'its', 'sacred', 'name', 'and', 'origin', ',', 'if', 'anything', 'belonging', 'to', 'it', 'can', 'be', 'apart', 'from', 'that—as', 'a', 'good', 'time', ';', 'a', 'kind', ',', 'forgiving', ',', 'charitable', ',', 'pleasant', 'time', ';', 'the', 'only', 'time', 'I', 'know', 'of', ',', 'in', 'the', 'long', 'calendar', 'of', 'the', 'year', ',', 'when', 'men', 'and', 'women', 'seem', 'by', 'one', 'consent', 'to', 'open', 'their', 'shut-up', 'hearts', 'freely', ',', 'and', 'to', 'think', 'of', 'people', 'below', 'them', 'as', 'if', 'they', 'really', 'were', 'fellow-passengers', 'to', 'the', 'grave', ',', 'and', 'not', 'another', 'race', 'of', 'creatures', 'bound', 'on', 'other', 'journeys', '.', 'And', 'therefore', ',', 'uncle', ',', 'though', 'it', 'has', 'never', 'put', 'a', 'scrap', 'of', 'gold', 'or', 'silver', 'in', 'my', 'pocket', ',', 'I', 'believe', 'that', 'it', 'has', 'done', 'me', 'good', ',', 'and', 'will', 'do', 'me', 'good', ';', 'and', 'I', 'say', ',', 'God', 'bless', 'it', '!', '”'], ['“Let', 'me', 'hear', 'another', 'sound', 'from', 'you', ',', '”'], ['“and', 'you’ll', 'keep', 'your', 'Christmas', 'by', 'losing', 'your', 'situation', '!', 'You’re', 'quite', 'a', 'powerful', 'speaker', ',', 'sir', ',', '”'], ['“Don’t', 'be', 'angry', ',', 'uncle', '.', 'Come', '!', 'Dine', 'with', 'us', 'to-morrow.”']]

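Looks good overall! 🎉 One entry deserves a second look, though: '“Humbug.”' sits in the same inner list as the quote that follows it ('“Don’t', 'be', 'cross', ...). The tokenizer kept both quotation marks attached to that one token, so the loop treated it as the start of a quote and kept collecting tokens until the next closing mark.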
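Reading the printed list by eye works for a short passage, but some of the checking can be automated. Here's a minimal sketch: check_dialog is a hypothetical helper, not part of the repo, and it assumes the opening and closing quotes are two different characters. It flags any dialog entry that doesn't start with an opening mark, doesn't end with a closing mark, or contains more than one pair.

```python
def check_dialog(parsed_dialog, quote_open="“", quote_close="”"):
    """Return (index, reason) pairs for dialog entries that look wrong."""
    problems = []
    for i, entry in enumerate(parsed_dialog):
        joined = " ".join(entry)
        if not joined.startswith(quote_open):
            problems.append((i, "does not start with an opening quote"))
        if not joined.endswith(quote_close):
            problems.append((i, "does not end with a closing quote"))
        if joined.count(quote_open) != 1 or joined.count(quote_close) != 1:
            problems.append((i, "expected exactly one opening and one closing quote"))
    return problems


# Two entries taken from the parsed_dialog output shown above.
sample_dialog = [
    ['“Bah', '!', '”'],
    ['“Humbug.”', '“Don’t', 'be', 'cross', ',', 'uncle', '!', '”'],
]
for index, reason in check_dialog(sample_dialog):
    print(index, reason)
```

An empty result doesn't prove the split is right, so it's still worth comparing a sample against the original text, but a non-empty one tells you exactly which entries to look at first.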

****

Thanks for reading this post — I hope it’s useful to you, and I welcome suggestions and questions in the comments.

If you haven’t already, check out my post about the Harry Potter gender bias analysis, the project that required the original code to parse narrative and dialog.

Follow this blog, Agatha, for more programming tips.