Exploring Topic Modelling with The Bard

Jamie Rivera
Introduction to Cultural Analytics
12 min read · May 21, 2021

Analysing Genres and Monologues in Shakespeare’s Work, Using Tomotopy

The other day, I wanted to look for lines in Shakespeare specifically about “light versus darkness.” So I went to Google and started poking around, but I kept finding the same one or two lines, when I knew for a fact there were more speeches that used this poetic device. That got me thinking: what if there were a way to parse the content of Shakespeare’s speeches more accurately, and maybe do it on a larger scale? Using Tomotopy, a Python library for topic modelling that groups documents by topic, I’ll be doing just that to almost 100 of my favorite speeches, with at least one from each of Shakespeare’s major plays.

The data I’m working with comes from the Folger Shakespeare Library’s digital downloads, which offer free versions of all the plays in ePub, rich text, and several other formats. The Folger is a leading institution for the study of Shakespeare, and although they publish print editions, it’s clear that they want to make the text as accessible as possible. These plays were edited by Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. I edited them a little further, too: I took out the sonnets, the long poems, and a handful of plays whose authorship is uncertain, leaving me with 36 canonical works.

Since this work has been in the public domain for several centuries, the ethical concerns about using it are low, but I did have several notable misadventures while preparing the texts to be analysed. I realised pretty quickly that because the name of whichever character was speaking would be included in the analysis, I’d have to keep those names from shaping the categories by adding them to a bank of Stopwords, alongside ordinary Stopwords and Elizabethan Stopwords (like thou, forthwith and shalt). There was no easy solution. It seemed no one had ever made a Stopwords bank like this before, or if they had, they hadn’t made it publicly available. I found a list of common Elizabethan words, then a list of named characters in Shakespeare’s plays, and set about putting them by hand into a Stopwords list that came out to be 81 lines long, with around ten terms on every line. Here is a plaintext link to that file, which anybody can use, so they don’t have to toil like me.
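To make the idea concrete, here is a minimal sketch of how a combined Stopwords bank like this could be applied. It assumes the file is whitespace-separated with several terms per line, as described above. The helper names and the tiny inline sample are hypothetical stand-ins, not my actual file or code.

```python
import re

def parse_stopwords(text):
    """Turn a multi-line, whitespace-separated stopword file into a lowercase set."""
    return {word.lower() for word in text.split()}

def remove_stopwords(text, stopwords):
    """Lowercase the text, pull out alphabetic words, and drop any stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]

# Hypothetical sample standing in for the three sources: ordinary English
# stopwords, Elizabethan terms, and character names.
sample_bank = """the and of to
thou thy shalt forthwith
tamora edmund edgar"""

stopwords = parse_stopwords(sample_bank)
line = "Know, thou sad man, I am not Tamora."
print(remove_stopwords(line, stopwords))
```

With a real file, the same `parse_stopwords` function could be fed the result of reading that file from disk instead of the inline sample.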

Plays By Genre

As mentioned above, I chose Tomotopy to analyse these files. It’s a lightweight Python library that looks at which words tend to appear together across a series of documents, and creates categories (topics) from them, estimating how strongly each topic is present in a given document. I’m also using Python’s glob module to gather the text files so Tomotopy can read them, and “Little Mallet Wrapper” to present the results in a clearer, more concise way.
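For readers who want to try something similar, here is a rough sketch of how these pieces can fit together. It is not necessarily the exact pipeline I ran: the folder layout, function names and parameter values are assumptions for illustration, and the sketch uses a plain regular expression for tokenising rather than Little Mallet Wrapper. The Tomotopy calls (`LDAModel`, `add_doc`, `train`, `get_topic_words`) are part of the library, which has to be installed separately.

```python
import glob
import os
import re

def load_documents(folder, stopwords=frozenset()):
    """Read every .txt file in `folder`; return (titles, token_lists)."""
    titles, token_lists = [], []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, encoding="utf-8") as handle:
            words = re.findall(r"[a-z]+", handle.read().lower())
        titles.append(os.path.splitext(os.path.basename(path))[0])
        token_lists.append([w for w in words if w not in stopwords])
    return titles, token_lists

def train_topic_model(token_lists, num_topics=10, iterations=100, seed=42):
    """Fit an LDA topic model with tomotopy and return it."""
    import tomotopy as tp  # third-party: pip install tomotopy

    model = tp.LDAModel(k=num_topics, seed=seed)
    for tokens in token_lists:
        # Empty documents are skipped here; if any are skipped, the model's
        # document order no longer lines up one-to-one with the titles list.
        if tokens:
            model.add_doc(tokens)
    model.train(iterations)
    return model

def print_topics(model, top_n=15):
    """Print the most probable words for each topic."""
    for topic_id in range(model.k):
        words = [word for word, _ in model.get_topic_words(topic_id, top_n=top_n)]
        print(f"Topic {topic_id}: {' '.join(words)}")
```

A run might look like `titles, docs = load_documents("comedies", stopwords)` followed by `print_topics(train_topic_model(docs))`, where `"comedies"` is a hypothetical folder of plain-text plays.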

The first thing I was interested in was simply testing out these tools on the plays in general, so to that end I separated the plays into Histories, Tragedies and Comedies. What, if anything, were the differences in categories between the genres, and could Tomotopy pick up on them?

Selected Topics for Comedies:
✨Topic 4✨

love boy well comes three enter keep eye sweet heart hold wit fear show away

✨Topic 5✨

let life brother made though yet must may exits leave one death die upon see

✨Topic 6✨

would like one much upon well know aside fool see nothing pray enter never cannot

✨Topic 7✨

first good mine great say make give honor would poor must know speak whose might

Selected Topics for Tragedies:
✨Topic 2✨

know much think come let may must little time soul good sweet even indeed words

✨Topic 3✨

love night heaven fair good nurse true god head without son eye father bed faith

✨Topic 4✨

come upon dead till look stand see dear name brother every hands blood house find

✨Topic 5✨

hand heart death day aside mine young back word full stay many man tears nay

Selected Topics for Histories:
✨Topic 0✨

make mine say think fear many much good life way one done right farewell find

✨Topic 1✨

upon come good peace give well true look world fair would day made old know

✨Topic 2✨

enter let blood tell see scene name dead love majesty england unto death give stay

✨Topic 3✨

gentleman good norfolk one man highness grace first pray cause noble please must heaven great

So there are definitely some distinguishing features we can see here! Some words appear across all three genres, but in mixed company. Let’s take love, for example.

In the Histories, love appears alongside dead, blood, death and majesty. It strikes a pretty dark tone, but when I used a function to examine the top documents, it made sense: the top documents included most of the Henriad (the Henry IV plays and Henry V) along with Henry VIII, all plays which are concerned with dynastic change, war and monarchy, but which are also innately linked with family. But what about Comedies and Tragedies, which we might assume to take an equally morbid tone about love?

Well, in the Tragedies, we can see the topic with love in it also has night and heaven and fair. The Comedies pair it with sweet and heart. Ultimately, there isn’t anything drastically different between the two, although maybe we’d expect the Tragic category with love to be more depressing! It’s really the Histories that stand out here. Despite violent delights having violent ends and all that, it appears that love in Tragedy and Comedy is still treated as a net positive.
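The “top documents” check mentioned above boils down to sorting the documents by how much weight they give a topic. Here is a small sketch of that idea. The function name mirrors the kind of helper used in the course text, but this version and the numbers in it are made up for illustration and are not the model’s real output.

```python
def get_top_docs(titles, topic_distributions, topic_index, n=5):
    """Return the n (probability, title) pairs that weight a topic most heavily."""
    ranked = sorted(
        ((dist[topic_index], title) for title, dist in zip(titles, topic_distributions)),
        reverse=True,
    )
    return ranked[:n]

# Invented distributions over three topics for three plays (illustration only).
titles = ["Henry V", "Twelfth Night", "Henry VIII"]
topic_distributions = [
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.25, 0.25, 0.50],
]

for probability, title in get_top_docs(titles, topic_distributions, topic_index=2, n=2):
    print(f"{probability:.2f}  {title}")
```

With a trained Tomotopy model, the per-document distributions could be collected with `[doc.get_topic_dist() for doc in model.docs]`.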

My Favorite Monologues

Now, on to the main event. I found 96 of my favorite monologues and lines in the plays and assigned them to random documents, naming them with my abbreviation for the play. Using Tomotopy, I wanted to see what themes could be drawn from such a broad corpus of lines and scenarios and character dynamics. Maybe I could do it myself, but it was curiosity that drew me — how effective would the computer be at doing this, and what categories would it make?

For my monologues, it gave me a list of 20 topics, which I’ll paste here. I’ve given them names after examining the documents in each one, to try and get a better picture of what each topic represents semantically.

TOPICS OF SELECTED MONOLOGUES:
✨Topic 0✨

man men great honor country home lives sleep good last meat athens care romans pride

✨Topic 1✨

earth fear come away look dust bed believe heat lovers sister foul measure throw married

✨Topic 2✨

much nothing never better ring loves tale none lover air motion seems thought daughter brain

✨Topic 3✨

god comes lie little hold gentle ere wait new blood upon lips trust talk spirits

✨Topic 4✨

would come speak death life die noble heart back might part mad heads live heard

✨Topic 5✨

crown within first thoughts told thousand arms proud straight strong kings oaths horses since voice

✨Topic 6✨

every wind rain place act mercy justice thrive done came play like letter appetite falls

✨Topic 7✨

upon time even ever dead forth things put turn use done called witness blood mere

✨Topic 8✨

love may night heaven make cannot fie find hell friend leave first forsworn moor two

✨Topic 9✨

make stand fair lose true gold base speaks brother wrong take tongue scorn stars legitimate

✨Topic 10✨

fool strange hour says ten world terms motley truth cuckold laugh met forms break seem

✨Topic 11✨

like one man day nature poor many full soul woman reason friends old wise false

✨Topic 12✨

let thus whose know tell gods take give makes fortune war fly answer suit precious

✨Topic 13✨

let name rome till yet unto tears sword peace honorable mother father remember cry lack

✨Topic 14✨

well yet end honest wife though shows grow wear pitch sense serve native office hope

✨Topic 15✨

say bear hand point could rest drink lost one wed ambition think worth wives enough

✨Topic 16✨

revenge dream mind sometime jew summer dreams winter venice dread merchant madness food sport christian

✨Topic 17✨

still yet men sweet would eye show upon shame rather call name touch vain need

✨Topic 18✨

made eyes mine half wound green pale light sure kill cheek since kind murder stands

✨Topic 19✨

see world good must women hear sun power give fair way set body fire keep

Now here’s the way I categorized them!

Zero: Addressing a Crowd
One: Loss
Two: A Trick; Subverted Expectations
Three: Phantasmagoria
Four: A Mortal Heart
Five: Monarchical Power
Six: Justice and Balance
Seven: Bearing Witness
Eight: Damnation; Insincerity
Nine: Stand Up For Bastards
Ten: Humor
Eleven: The Nature of Mankind
Twelve: Fortune, Good or Ill
Thirteen: Mourning Something
Fourteen: Plotting and Pondering
Fifteen: Personal Anecdotes
Sixteen: Grudges and Madness
Seventeen: Intense Passions
Eighteen: Murder
Nineteen: This World of Ours

I needed to think a lot more about some categories, but others were pretty easy to name. Humor, for example, and Justice and Balance were plainly drawn. Fifteen, Personal Anecdotes, was somewhat complicated, as was twelve, Fortune. Perhaps there was a way I could have named them better. I also wasn’t wholly satisfied by the way Tomotopy split up the speeches: I saw some documents categorised in ways I hadn’t really been expecting. I decided to make a heatmap of the topics and try to visualise for you, as well as for myself, what was going on here.

Heat map generated with the Python library Matplotlib, via Maria Antoniak’s “Little Mallet Wrapper.” It shows my randomised monologues and the probability they’ll fall under a given topic. The darker the color of a cell, the more closely linked to a topic the monologue is.

Phew! That’s a hefty graph. I’ll post a little guide to the monologues at the end of this article, so you can look at more famous ones yourself! I’ll zoom in on a couple here, to try and understand better how these topics are working.

Obviously, these categories are not perfect. On this graph, the darker the cell, the more likely it is that a monologue belongs to a topic, but it’s clear that this approach can’t narrow each monologue down to just one topic. For example, look at HV1 (Henry V; “O, for a muse of fire”), which has four darkened cells of different shades, and Mcb1 (Macbeth; “The raven himself is hoarse…”), which has nearly eight cells of medium shades!
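Counting “darkened cells” by eye can also be done in code: pick a cutoff and list the topics whose probability clears it. The sketch below is only an illustration of that idea. The threshold and the two example rows are invented, not values read off my heatmap.

```python
def prominent_topics(distribution, threshold=0.1):
    """Return the topic numbers whose probability is at or above the threshold."""
    return [topic for topic, p in enumerate(distribution) if p >= threshold]

# Invented rows over five topics (illustration only, not the model's real output).
example = {
    "Doc A": [0.05, 0.40, 0.30, 0.05, 0.20],
    "Doc B": [0.90, 0.025, 0.025, 0.025, 0.025],
}

for label, distribution in example.items():
    topics = prominent_topics(distribution)
    print(label, topics, f"({len(topics)} prominent topics)")
```

A monologue like “Doc A” here is the spread-out kind, while “Doc B” is the rare case that sits almost entirely in one topic.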

Let’s focus on an example, to see if we can figure out what’s going on — “TAnd1” which is a monologue by Tamora, from Titus Andronicus:

TAMORA — TITUS ANDRONICUS — V ii 31
Know, thou sad man, I am not Tamora.
She is thy enemy, and I thy friend.
I am Revenge, sent from th’ infernal kingdom
To ease the gnawing vulture of thy mind
By working wreakful vengeance on thy foes.
Come down and welcome me to this world’s light.
Confer with me of murder and of death.
There’s not a hollow cave or lurking-place,
No vast obscurity or misty vale
Where bloody murder or detested rape
Can couch for fear but I will find them out,
And in their ears tell them my dreadful name,
Revenge, which makes the foul offender quake.

We can see on the heatmap that it falls most strongly under topics 12, 16 and 18: Fortune, Grudges/Madness, and Murder, which I feel actually fit this speech pretty well. While other speeches in the Fortune category have to do with divine luck, this speech frames Tamora herself as an avenging divinity; Grudges and Murder clearly play into the language here too. Perhaps it’s not so much that this system will neatly cordon off monologues into topics, but there is a certain indexing that can be performed by “combining” categories under which a particular monologue falls. Of course these speeches are too complex to be cut apart so cleanly, but with this graph, you can more clearly see the “threads” of thematic material woven into them and try to quantify that fact.

Basically, maybe it shouldn’t be a “simple” distinction! The language itself sure isn’t — why try and force each text into one category to begin with, when the meaning can best be gleaned by examining shades of grey… or in this case, green.
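The “combining categories” idea from the Tamora example can be sketched as a tiny index: take each monologue’s few strongest topics, attach my names for them, and then look monologues up by theme. The topic names are the ones I listed earlier. The probabilities for TAnd1 are invented so that topics 12, 16 and 18 come out on top, and the function names are hypothetical.

```python
from collections import defaultdict

TOPIC_NAMES = [
    "Addressing a Crowd", "Loss", "A Trick; Subverted Expectations",
    "Phantasmagoria", "A Mortal Heart", "Monarchical Power",
    "Justice and Balance", "Bearing Witness", "Damnation; Insincerity",
    "Stand Up For Bastards", "Humor", "The Nature of Mankind",
    "Fortune, Good or Ill", "Mourning Something", "Plotting and Pondering",
    "Personal Anecdotes", "Grudges and Madness", "Intense Passions",
    "Murder", "This World of Ours",
]

def top_labels(distribution, names=TOPIC_NAMES, n=3):
    """Return the names of the n most probable topics, strongest first."""
    ranked = sorted(range(len(distribution)), key=lambda t: distribution[t], reverse=True)
    return [names[t] for t in ranked[:n]]

def build_theme_index(doc_distributions, n=3):
    """Map each theme name to the monologues that have it among their top n topics."""
    index = defaultdict(list)
    for label, distribution in doc_distributions.items():
        for name in top_labels(distribution, n=n):
            index[name].append(label)
    return dict(index)

# Invented probabilities for TAnd1 (illustration only).
tand1 = [0.01] * 20
tand1[12], tand1[16], tand1[18] = 0.25, 0.35, 0.23

print(top_labels(tand1))
print(build_theme_index({"TAnd1": tand1})["Murder"])
```

An index like this keeps the shades of grey: a speech can be filed under several themes at once instead of being forced into one.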

A Weird Category

Something that stuck out to me as odd was category nine, which I have called Stand Up For Bastards. This is because it basically only has one “most likely” document! When I ran “Get Top Docs” on this topic, the most likely document was “KL1/King Lear 1,” Edmund’s speech at the beginning of Act 1, Scene Two, which was rated at over 5% likelihood, compared to the others, which were in the 1–2% range.

EDMUND — KING LEAR — I ii 1

Thou, Nature, art my goddess. To thy law
My services are bound. Wherefore should I
Stand in the plague of custom, and permit
The curiosity of nations to deprive me
For that I am some twelve or fourteen moonshines
Lag of a brother? why “bastard”? Wherefore “base,”
When my dimensions are as well compact,
My mind as generous and my shape as true
As honest madam’s issue? Why brand they us
With “base,” with “baseness,” “bastardy,” “base,”
“base,”
Who, in the lusty stealth of nature, take
More composition and fierce quality
Than doth within a dull, stale, tired bed
Go to th’ creating a whole tribe of fops
Got ‘tween asleep and wake? Well then,
Legitimate Edgar, I must have your land.
Our father’s love is to the bastard Edmund
As to th’ legitimate. Fine word, “legitimate.”
Well, my legitimate, if this letter speed
And my invention thrive, Edmund the base
Shall top th’ legitimate. I grow, I prosper.
Now, gods, stand up for bastards!

This speech basically gets its own category. In fact, it feels like the whole topic was generated pretty much for just this document. There is something so different about it from the other speeches that it justifies a separate topic. It’s an odd category, but I stand by it. There’s something about Edmund that differentiates him from other Shakespeare characters, even other “villains” like Iago, Aaron, Macbeth and Cassius. In addition, he uses the words “bastard,” “base” and “legitimate” so often as to make this speech stand apart from the others, even ones that do have those words in them.
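One plausible reason, offered as a guess rather than a verified explanation, is that this kind of topic model treats each document as a bag of words, so a speech that hammers on a few unusual words carries a lot of weight for them. A quick count over a few lines of the speech shows the repetition. The excerpt is abridged from the passage quoted above, and the script is only an illustration.

```python
import re
from collections import Counter

# A few lines abridged from Edmund's speech as quoted above.
excerpt = """
Lag of a brother? why "bastard"? Wherefore "base,"
With "base," with "baseness," "bastardy," "base," "base,"
Legitimate Edgar, I must have your land.
As to th' legitimate. Fine word, "legitimate."
Shall top th' legitimate. I grow, I prosper.
"""

# Exact-word counts: "baseness" and "bastardy" are counted as separate words.
counts = Counter(re.findall(r"[a-z]+", excerpt.lower()))
for word in ("base", "bastard", "legitimate"):
    print(word, counts[word])
```

Running the same count over the other monologues would be one way to check how unusual this concentration really is.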

Concluding Thoughts

Overall, I had a lot of fun with this analysis. I was really interested in the way Tomotopy indexed the speeches and grouped some speeches with ones I hadn’t even considered to be thematically similar! This feels like a good start for a much larger-scale project, one that would be able to parse a massive amount of Shakespeare’s lines by topic and generate a list with perhaps dozens of quotes on any given theme. It would be a different challenge to generate so many topics, and considerations would need to be made for proceeding line-by-line through any given play, but it could surely be done… given enough time and skills, that is. I hope that my work on creating a list of Stopwords specifically for Shakespearean works can be expanded upon as well, and that it helps other people who might search for a similar thing.

Here are some acknowledgements and citations, plus a list of notable/my Very favorite monologues:

Ham3 — “To be or not to be…” Hamlet
Ot1 — “I will wear my heart on my sleeve for daws to peck at…” Iago
KL1 — “Thou, Nature, art my goddess…” Edmund
AYLI5 — “I would not be thy executioner…” Phebe
2GMV1 — “To leave my Julia, I shall be forsworn…” Proteus
LLL1 — “Thus pour the stars down plagues for perjury…” Berowne
AYLI4 — “It is my only suit, provided that you weed your better judgements…” Jaques
JC2 — “I can as well be hanged as tell the manner of it…” Casca
JC3 — “It must be by his death. And for my part…” Brutus
Ot5 — “And what’s he, then, that says I play the villain?” Iago
TwN3 — “When that I was and a little tiny boy…” Feste
MSND1 — “Thou speakest aright, I am that merry wanderer of the night…” Puck
RJ1 — “Gallop apace, you fiery-footed steeds…” Juliet
RJ4 — “Tybalt, here slain, whom Romeo’s hand did slay…” Benvolio
MoV1 — “The villainy you teach me I will execute…” Shylock
MoV3 — “The quality of mercy is not strained…” Portia
TAnd5 — “Now climbeth Tamora Olympus’ top…” Aaron
TrCr1 — “This she? No, this is Diomed’s Cressida…” Troilus
CnA1 — “Sir, I will eat no meat…” Cleopatra
Te2 — “O brave new world…” Miranda

Shakespeare, William. Shakespeare’s Plays, Sonnets and Poems from The Folger Shakespeare. Ed. Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. Folger Shakespeare Library, May 19, 2021. https://shakespeare.folger.edu

— Elizabethan stopwords partially gathered from https://github.com/Kaguilar1222/gutenburg_nlp/blob/master/stopwords_elizabethan
— Typical English stopwords and lots of other help with code (such as Maria Antoniak’s Little Mallet Wrapper) gathered from the INFO 1350 Course Text, by Professor Melanie Walsh, which can be found at: https://info1350.github.io/Intro-CA-SP21/welcome.html
— All names initially sourced from https://www.behindthename.com/namesakes/list/shakespeare/alpha but edited extensively by myself, added to and discarded from, then put into appropriate list format.
