How often do characters chat?

May 7, 2016

Who talks more: Huck Finn or Robinson Crusoe? Turns out, it’s Huck. Conversation between the characters in The Adventures of Huckleberry Finn accounts for nearly half (48%) of the narrative; in contrast, conversation in Robinson Crusoe is only 3% of the book.

For my final project in the Data Visualization class I took this semester, I thought it would be interesting to compare how much my favorite characters talk in their respective novels. I teamed up with my husband, Derek Miller, and together we wrote Python code that counts the number of words in between quotation marks (“ ”), then divides that number by the total number of words in the book. The result? The conversation proportion for that book.

The full project of Dialogue in Classic Literature

With that data, I created a visualization comparing the conversation proportions across a variety of works in classic literature. To see the full project, check out my Behance page.

Insights

I recently read War and Peace by Leo Tolstoy, and didn’t like it. It was long, dry, and felt very different from my favorite book, Anna Karenina. I found that less than 25% of War and Peace is dialogue between characters, while over half of Anna Karenina is characters chattin’ it up! That might be why they felt so different to me.

A few other interesting data points emerged as I worked on this visualization. Robinson Crusoe by Daniel Defoe (one of my dad’s favorite books) is only 3% dialogue. This makes sense. Crusoe is stranded on a desert island for most of the book, and his native friend Friday doesn’t speak much English. I don’t imagine there would be lots to talk about.

Two other strange outliers are Les Miserables by Victor Hugo and War and Peace by Leo Tolstoy. Written within a couple of years of each other, these novels are the two longest books that I used in my analysis. They both have nearly the same proportion of conversation (21% and 22% respectively), and are almost identical in length (559,469 and 562,483 total words respectively). I find this odd and interesting.

The Penguin Classics Series from 1940–1960

Process

I started by getting titles of classic literature from this beautiful series of 1940–60 Penguin Classics. Many of these books were not familiar classics, so I turned to Barnes & Noble’s classic books series for titles that are better known to readers today. I focused my analysis on literary fiction, excluding histories, poetry, plays and religious works since they don’t have the same richness in dialogue found in works of fiction.

Collecting data was tedious. Thanks to gutenberg.org, the texts were free and easy to access. The downside was that each text had a different format. Sometimes there was a table of contents, or a note from the translator, or a section of commentary. I had to go through each novel and remove extraneous bits of text that weren’t relevant to this study. It took about ten hours to go through all 100 texts, and was the most monotonous part of this project.

For full details on what was and wasn’t included, check out Appendix A at the end of this article.

Early ideas (left); testing understanding of 50% mark and author average symbols (middle); comp of how to arrange the charts together (right)

From the beginning I wanted to incorporate text and letters into my graphs for their rich texture and obvious relevance to the project. I sketched a few ideas of how to display the data, and brainstormed questions to answer.

Data problems

Once I had collected the texts, figuring out the proportion of dialogue proved to be trickier than I thought. The easiest way to count conversation words is to use quotation marks as a guide. Of course, not everything inside quotation marks is conversation (for example when talking about a “specific” thing), but at the time I didn’t have a way to exclude those moments of irregular quotation use. I’m still a newbie to Python, so I enlisted the help of my husband, Derek, to collaborate in writing a script to parse the text. We wrote code that began counting words when it saw a quotation mark, and then stopped when it encountered another mark, like this (with the colored words counted as conversation words):
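For readers who want to try this at home, here is a minimal Python sketch of that first pass. It is not our actual script, just one way the idea could be written. The names `count_conversation_words`, `QUOTE_MARKS` and `QUOTE_PATTERN` are made up for this example, and it assumes that straight and curly double quotation marks should be treated the same and that a "word" is anything separated by whitespace.

```python
import re

# Assumed set of marks: straight and curly double quotes are treated alike.
QUOTE_MARKS = ('"', '\u201c', '\u201d')
QUOTE_PATTERN = re.compile('(["\u201c\u201d])')


def count_conversation_words(text):
    """Return (conversation_words, total_words) for a text.

    Each quotation mark flips the counter between "inside" and "outside";
    words are whatever str.split() finds between marks.
    """
    inside_quotes = False
    conversation = 0
    total = 0
    # The capturing group keeps each quotation mark as its own piece.
    for piece in QUOTE_PATTERN.split(text):
        if piece in QUOTE_MARKS:
            inside_quotes = not inside_quotes
            continue
        words = len(piece.split())
        total += words
        if inside_quotes:
            conversation += words
    return conversation, total


# A made-up sentence, not taken from any of the novels.
sample = 'He looked up. "Are you coming along?" she asked. "Not yet."'
conversation, total = count_conversation_words(sample)
print(f"{conversation} of {total} words are conversation ({conversation / total:.0%})")
```

Dividing the first number by the second gives the conversation proportion for the text. Because the counter simply flips at every mark, a single missing closing mark will throw off everything that follows it.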

This worked fairly well, until I realized that the code wouldn’t work if there was a character who spoke for more than a paragraph, or if a letter or poem with multiple paragraphs or stanzas was read aloud, like this:

In this instance, the lack of a quotation mark at the end of the body of the letter threw off our counting system. We came up with a solution. When faced with paragraphs of text that don’t end with quotation marks (but are still considered dialogue), the code counts how many quotation marks are in a paragraph and, if an odd number, will add a quotation mark to the end of the paragraph before counting the number of conversation words. Like this:
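The paragraph-by-paragraph fix described above could be sketched in Python roughly as follows. Again, this is an illustration rather than our exact code: the function names (`balance_quotes`, `conversation_words`, `count_book`) are hypothetical, paragraphs are assumed to be separated by blank lines, and straight and curly double quotes are assumed to be interchangeable.

```python
import re

# Assumed set of marks: straight and curly double quotes are treated alike.
QUOTES = ('"', '\u201c', '\u201d')
QUOTE_CLASS = '["\u201c\u201d]'


def balance_quotes(paragraph):
    """Append a closing quote if the paragraph has an odd number of marks."""
    marks = sum(paragraph.count(q) for q in QUOTES)
    return paragraph + '"' if marks % 2 else paragraph


def conversation_words(paragraph):
    """Count the words that sit between quotation marks in one paragraph."""
    segments = re.split(QUOTE_CLASS, balance_quotes(paragraph))
    # After splitting on the marks, segments 1, 3, 5, ... are inside quotes.
    return sum(len(segment.split()) for segment in segments[1::2])


def count_book(text):
    """Return (conversation_words, total_words), handling each paragraph alone."""
    paragraphs = re.split(r"\n\s*\n", text)  # assumes blank-line paragraph breaks
    conversation = sum(conversation_words(p) for p in paragraphs)
    # Quotation marks are replaced by spaces so they never count as words.
    total = sum(len(re.sub(QUOTE_CLASS, " ", p).split()) for p in paragraphs)
    return conversation, total


# A made-up letter read aloud: only the last paragraph closes its quotation.
letter = (
    '"Dear friend,\n\n'
    '"I am writing to you.\n\n'
    '"Yours always." He folded it.'
)
print(count_book(letter))
```

Handling each paragraph on its own means one unclosed quotation can no longer flip the counting for the rest of the book. The trade-off is that a paragraph containing a stray, non-dialogue quotation mark will be counted as dialogue from that mark to the end of the paragraph.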

After running the script, I put the conversation word count and total word count of each book into a spreadsheet where I added data about when the book was published and the author’s name, nationality and gender.
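I built the spreadsheet by hand, but the same step could be scripted. The sketch below is only an assumption about how that might look: the column names, the helper names `make_row` and `write_rows`, and the example book with its numbers are all invented for illustration and are not values from my dataset.

```python
import csv
import io

# Hypothetical column names; the real spreadsheet may have used others.
FIELDS = [
    "title", "author", "nationality", "gender", "year_published",
    "conversation_words", "total_words", "proportion",
]


def make_row(meta, conversation_words, total_words):
    """Combine book metadata with the two counts and their ratio."""
    row = dict(meta)
    row["conversation_words"] = conversation_words
    row["total_words"] = total_words
    # Guard against an empty text so the division cannot fail.
    row["proportion"] = (
        round(conversation_words / total_words, 3) if total_words else 0.0
    )
    return row


def write_rows(rows, out):
    """Write the rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Invented example values, for illustration only.
meta = {
    "title": "Example Novel",
    "author": "A. Writer",
    "nationality": "English",
    "gender": "F",
    "year_published": 1850,
}
row = make_row(meta, 480, 1000)
buffer = io.StringIO()
write_rows([row], buffer)
print(buffer.getvalue())
```

Storing the raw counts alongside the proportion keeps the door open for other questions later, such as comparing book lengths.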

This turned out to be a rich dataset. I kept finding new questions, such as: do women write more dialogue in novels than men? Was there a time period when it was in vogue to write a certain proportion? Are books written for children filled with more or less dialogue? I ended up focusing on comparing the proportion of dialogue and lengths of books, as well as answering: do authors tend to write the same proportion of dialogue in all of their books? What does the spread of publication dates for the top classic books look like? And, do authors from certain countries tend to write a certain proportion of conversation in their books?

With the data (finally!) ready to go, I began to explore and experiment in Processing with as many ideas as I could think of.

As I got closer to finishing, I printed out a range of type sizes and leading options to see what would be legible while still maintaining an even texture.

When I presented my project at the class final, I asked questions to make sure that people understood what my graphs were trying to communicate. There were things I hadn’t caught, and I made a few alterations to make the charts clearer.

Going forward

I think this would be a really fun show to exhibit. This poster is meant to be viewed on paper. Each of the small multiples of novels is set in type just big enough to read if you get real close. I love that aspect of discovery, and honestly just reading little bits of books!

The Harold B. Lee Library at BYU has offered to print and hang this poster, so watch for it on the fifth floor of the library! I might make more posters like these, but for different languages. For example, a whole poster might focus on comparing works written by Brazilian authors. Next time, I’m going to try to find texts that are already cleaned up (no epilogues, tables of contents, illustration captions, etc.), since cleaning is the most time-consuming part of the process.

Q&A

Where’s Shakespeare?!?
I didn’t include his work because he wrote plays, and it’s difficult to compare a play’s proportion of dialogue (nearly 100%) with that of a novel.

You totally missed some of the greatest classics!
I know, but some books are not available on gutenberg.org, which was the source I decided to use for this project. Some books are under copyright so I would need to pay to access the text, and others don’t have English translations yet. Also, some classics use single quotation marks (‘ ’) instead of double (“ ”), which broke our script.

Appendix A, or the Nitty Gritty

Items that were removed, when applicable, from the texts because they were irrelevant to the novel’s actual content:

list of characters in book
commentary
footnotes
introduction
preface (unless characters in the novel are writing it)
letter to/from the author
table of contents
dedication
publisher’s notes
translator’s notes
illustration captions

Items that are included in the word count, when applicable:

some epilogues (when considered part of the story)
chapter headings and summaries
footnote numbers

Notes on country of origin:

Greek and Latin texts are grouped together
Icelandic and Danish novels are grouped together as Scandinavian
Austrian, Swiss, and Hungarian novels are grouped with German texts

Further notes

In the case of books published in series or parts, the later publication date is used.

In the case of oral stories, I used the date the earliest manuscript was found (Beowulf and Thousand and One Nights).

English translations were used regardless of the original language the text was written in. This is to create a standard for word count between languages, and also helps with consistency of punctuation for our scraping algorithm.

The publication date is used, not necessarily the date the text was finished (for example: The Trial by Kafka, finished in 1915, published 1925).