Genre in a Bottle: How to Perform a Genre Analysis Online

An easy half-day project for beginning Digital Humanists

Joel Sprinkle
14 min read · Dec 14, 2015

So you’re a budding digital humanist: you dig this intersection of humanity and technology, but you’re not too savvy about how the technology works. You really want to dig deeper into your texts, and in particular to unlock some nuances of genre in one of your favorite authors. You know a little HTML, you’ve watched a few episodes of The IT Crowd, and you’re ready to go.

Except you don’t know how to use the complicated analytical tools that are available. Perhaps Stylo caught your eye, but you don’t know how to use R and you have no idea what statistics means. Let me warn you: if you don’t know the first thing about statistics and you’ve never encountered R, the experience will only cement whatever theoretical and moral objections you might already hold about the eugenicist founder of modern statistics.

So what are your options?

Several free online tools offer basic analytical capabilities. Voyant is a good one, as is Lexos. There are others with varying levels of difficulty, but for our purposes we’ll focus on these two, using them in tandem to construct a full idea of our genre.

As with any kind of research, what you need first is a question. I once argued that Edgar Allan Poe, who was critical of almost everything, also did not care for the Gothic Romantic genre: he openly criticized it, yet used it subversively, both to make some money off a popular genre and to terrify his audiences through more sophisticated means than earlier writers had. I used close reading to identify several instances of satire in his short stories, but I wondered if there was a way to see it as a general trend in his work. My question is this: does Edgar Allan Poe use the Gothic genre differently than other practitioners?

Before we get going . . .

Before you begin your analysis, you need a text or group of texts to dissect. The best way to do this is to work with .txt files, which (for many works) you can get from Project Gutenberg. For my project, I’m going to focus on the works of Edgar Allan Poe, since I already have a question in mind. Once you find the texts you want, download them all into a well-marked folder.

They’re not ready to be analyzed right out of the gate because there is some text that Project Gutenberg places before and after their eBooks. These are well-marked with “START OF THIS PROJECT GUTENBERG EBOOK [TITLE]” and “END OF THIS PROJECT GUTENBERG EBOOK [TITLE]”. You’ll want to delete these, because the words they contain could skew your data. In addition, you may choose to delete any forewords not written by the author, end-notes, or other superfluous text included in the Gutenberg file.
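If you’d rather not trim every file by hand, a few lines of Python can handle it. The sketch below is one way to do it, not the only way: the folder names are placeholders for your own setup, and the regular expressions match the Gutenberg marker lines loosely because their exact wording varies from file to file.

```python
import re
from pathlib import Path

# A minimal sketch for stripping Project Gutenberg front and back matter.
# Folder names are placeholders; point them at your own download folder.
RAW_DIR = Path("poe_raw")
CLEAN_DIR = Path("poe_clean")
CLEAN_DIR.mkdir(exist_ok=True)

# The marker lines vary slightly between files, so match them loosely.
START = re.compile(r"START OF TH(IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE)
END = re.compile(r"END OF TH(IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE)

for path in RAW_DIR.glob("*.txt"):
    lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    start_idx, end_idx = 0, len(lines)
    for i, line in enumerate(lines):
        if START.search(line):
            start_idx = i + 1      # the book begins after the START marker
        elif END.search(line):
            end_idx = i            # and ends just before the END marker
            break
    body = "\n".join(lines[start_idx:end_idx]).strip()
    (CLEAN_DIR / path.name).write_text(body, encoding="utf-8")
    print(f"{path.name}: kept {len(body.split())} words")
```

You’ll still want to eyeball each cleaned file for forewords, end-notes, and other extras the script can’t know about.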

Once you have your texts cleaned up, they’re ready to be analyzed.

Voyant

Establishing your genre

For my own purposes, I needed to compile another corpus besides Poe. You may find this necessary as well, depending on what your project is. In a paper I wrote as an undergraduate, I argued that Edgar Allan Poe uses the Gothic genre subversively, operating in a genre he despised and poking fun at his audience the entire time through his employment of its conventions. To make my original argument, I mostly used close readings of Poe’s own works, along with letters and essays that expressed his views of literature; now I’m interested in seeing whether I can make that distinction visible in a Jockers-esque macroanalysis.

In order to establish what my analysis findings might mean, I first needed to establish what the Gothic genre looks like. To do this, I returned to Project Gutenberg and started downloading some of the prominent texts of the Gothic Romantic genre: Mary Shelley’s Frankenstein, Horace Walpole’s The Castle of Otranto, Ann Radcliffe’s The Mysteries of Udolpho, Matthew Lewis’s The Monk: A Romance, and John William Polidori’s The Vampyre. I chose these texts because they were all published before Poe’s tales and because I felt they were representative of the Gothic Romantic tradition he inherited. I chose five texts because Poe’s works are divided into five volumes on Gutenberg and the word counts end up being comparable, although my genre corpus is about 100,000 words larger. A larger control corpus isn’t a bad thing in this case, since the more data I have, the more apparent any patterns will be; but ultimately it’s up to you how you want to build your corpus. With this in mind, I began with the control corpus.

Starting off in Voyant

Here’s an overview of what you’ll want to do — I’ll go into detail on each point:

  1. Upload your corpus
  2. Establish stop-words
  3. Analyze your data

The first thing you’ll want to do, of course, is upload your corpus. Once you’ve chosen all the files you want to analyze by clicking the “Add” button (green plus-sign), hit “Reveal” to see what Voyant makes of your text.

What you’ll notice first is that Voyant has told you what you already know — “the” is the most common word, with “and” in close second, and rounding out the group are some more of the usual suspects. Click the settings button above the word cloud and a popup window will appear. Here’s where you can set the stop words.

Stop words are common words — articles, prepositions, pronouns, and so forth — that you may not want polluting your data. Voyant has a list already set up for you, but if you want you can add more, take some off the list, or create your own list if you’re feeling frisky. I chose the “English (Taporware)” list and did not modify it. You’ll also want to check the box “Apply Stop Words Globally,” since this will streamline things later on.

Once you apply the stop words, you’ll see your results clear up quite a bit. Voyant will automatically generate a word cloud, with the most frequent words appearing largest. You can hover over each word to see how many times it appears.
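If you’re curious what Voyant is doing behind the scenes, the heart of it is just a frequency count with stop words filtered out. Here’s a rough sketch of that idea in Python; the tiny stop list is purely illustrative (Voyant’s Taporware list is far longer), and the folder names are stand-ins for your own cleaned corpora.

```python
import re
from collections import Counter
from pathlib import Path

# An illustrative stop list only; Voyant's Taporware list is much longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it", "was", "i",
              "he", "she", "his", "her", "with", "as", "had", "for", "not", "but"}

def top_words(folder, n=20, stop_words=STOP_WORDS):
    """Count word frequencies across every .txt file in a folder."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):
        words = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Folder names are placeholders for your cleaned control and Poe corpora.
print(top_words("gothic_clean"))
print(top_words("poe_clean"))
```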

In terms of results, we can see that names are very common, in particular throughout Radcliffe, where “Emily” appears 2,031 times, “Montoni” 816, and “Valancourt” 593 (link to the data here). Among other notable data: heart (586), castle (508), count (502), father (500), lady (459), madame (453), chamber (413). At this stage, we can only postulate what will be useful later on, but so far a heavy focus on character names and titles (e.g. count, madame, lady, father) as well as words like heart and castle could definitely be significant in a genre that deals heavily in emotion, aristocracy, religion, and ancient edifices. If we begin another Voyant project and upload our Poe corpus, we can see where the two groups differ.

If you know what to look for, this data can be helpful in establishing or affirming trends. For instance, Poe’s most frequent words corroborate some of what we know about his style and philosophy. “Said” is the most frequent word at 645 occurrences, and some other notable words include “man” (446), “eyes” (338), “head” (301), “feet” (269), “hand” (257), “mind” (245), “body” (241), “matter” (237), “heart” (214), “death” (210), and “nature” (210). Poe had a very materialistic idea of the world, except that he believed the spiritual also had a material aspect, and his word choices reflect this in recurrent references to the senses, the body, matter, and nature. There is much more focus on the immediate, the sensory, and the cerebral than on the aristocratic and social themes of the control genre.

What if we add these names to the stop-words list in the control analysis? Click on the “Options: Cirrus” button and then click “edit stop words.” Type in whatever words you want to cut out and save as a different list.

After I apply these new stop words, “said” takes Emily’s place as the most frequent word, appearing 2,062 times. Although Poe’s most common word was also “said,” it appears over three times as frequently in the control corpus, which is only 16% larger; this may or may not be significant, given the stylistic differences between the media (short stories vs. novels). Click on “said” in the word cloud or the summary, and a line graph will appear in the word trends box in the upper right-hand corner of the display. We can look at how the word is distributed across each corpus, in appearances per 10,000 words.

Notice the Y-axes are different: Poe’s graph (left) peaks at about 22.62 occurrences per 10,000 words in Volume 4, while in the control corpus “said” appears 109.24 times per 10,000 words in Walpole’s The Castle of Otranto. You can divide each individual document in the corpus into smaller fragments to make these graphs more striking, but Voyant can’t divide your entire corpus at once if it contains multiple documents. To work around this, you could combine your whole control corpus into a single .txt file, if you feel the need.
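For what it’s worth, those per-10,000-word figures are just raw counts normalized by document length, so you can check them yourself. A quick sketch, with file names standing in for your own cleaned texts, plus the one-file workaround mentioned above:

```python
import re
from pathlib import Path

def per_ten_thousand(path, word):
    """Occurrences of `word` per 10,000 words, as in Voyant's trend graphs."""
    tokens = re.findall(r"[a-z']+", Path(path).read_text(encoding="utf-8").lower())
    return tokens.count(word) / len(tokens) * 10_000

# File names are placeholders for your own cleaned texts.
print(per_ten_thousand("gothic_clean/walpole_otranto.txt", "said"))
print(per_ten_thousand("poe_clean/poe_vol4.txt", "said"))

# Combining the whole control corpus into one file, as suggested above:
combined = "\n".join(p.read_text(encoding="utf-8")
                     for p in Path("gothic_clean").glob("*.txt"))
Path("gothic_combined.txt").write_text(combined, encoding="utf-8")
```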

Conclusions: Voyant Overview

Voyant is a great tool for an easy visualization: it’s easy to use, streamlined, and it doesn’t ask too much from you, the user. The word clouds in this particular tool are dynamic and visually pleasing. You get out what you put in, however, and while word clouds are neat, they’re perhaps not in-depth enough to support an informed judgment about differences in how a genre is used. Let’s move on to Lexos, which is a little more complicated but can also give us some different results.

Lexos

Analyzing through Lexos

Lexos requires a little more effort, and if you use it right you can apply your knowledge of statistics to get some more honed results. If you’re like me, you might find this portion challenging, but you can get by if you smile a lot and avoid direct eye contact with the math.

Here’s an overview of what you’ll be doing here:

  1. Upload files
  2. Scrub files (stop words)
  3. Cut files into sections
  4. Analyze

If you go to the Lexos site, you’ll notice there’s a box to paste text into and a button that says “Browse” with a file-folder design. You’ll want to click that Browse button and track down your control corpus files.

After you’ve uploaded the text files, your next step is to start telling Lexos what you want. You may not know what you want yet, but we’ll talk about that as we go. Go ahead and click on the “Scrub” tab in the “Prepare” drop-down menu.

Scrub

First, you need to establish stop words. Unlike Voyant, Lexos does not supply its own stop words (Voyant’s are borrowed from TAPoR, anyway). Again, you can compile your own list — or, you can do what I did and go to TAPoR’s website. Since Voyant already uses the modified Glasgow stop list, I chose to do the same in Lexos in order to keep the variables down to a minimum; but here is another stopwords list that is a little less glitchy with Lexos’s mechanics. Click “Apply scrub” and get ready to move on to “Cut.”

Cut

Here’s where you can divide your corpus into more manageable chunks. It’s up to you how you want to do this: my genre corpus ended up being 552,631 words, so I divided it into segments of 15,000 words. That’s still a significant chunk, but it divides the works up fairly evenly, and you can go smaller if you’d like; smaller segments create more dynamic visuals when you start generating graphs. Decide how many segments you want, or how large they need to be, based on your corpus size.
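If you want to see (or reproduce) roughly what the Cut step is doing, a sketch along these lines works; the segment size matches what I used, and the file and folder names are placeholders for your own corpus.

```python
from pathlib import Path

def cut_into_segments(path, segment_size=15_000):
    """Split one cleaned text into chunks of roughly `segment_size` words."""
    words = Path(path).read_text(encoding="utf-8").split()
    return [" ".join(words[i:i + segment_size])
            for i in range(0, len(words), segment_size)]

# File and folder names are placeholders for your own corpus.
out_dir = Path("segments")
out_dir.mkdir(exist_ok=True)
segments = cut_into_segments("gothic_clean/radcliffe_udolpho.txt")
for i, segment in enumerate(segments, start=1):
    (out_dir / f"radcliffe_udolpho_{i:02d}.txt").write_text(segment, encoding="utf-8")
```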

You can cut each individual segment into smaller pieces if you want, but again it’s all about how much you know and what you’re looking to get out of it. After you’ve played with this, you’re ready to look at some data.

Visualize

Lexos gives you more options in terms of visuals, but it also requires more from you in terms of input. I’ll narrate what it feels like to look at the visualizations tab: You click “Visualize.” Lexos starts throwing around words like “Archimedes” and putting Sigmas with frightening super- and subscripts full of numbers. You feel your face going numb, and you clutch whatever degrees you’ve earned, asking the universe for validation as flashbacks from the last math class you took (ten years ago) shake the very core of you. You caress your BA diploma like a rosary, counting the digits in your major GPA and closing your eyes against the torrent of statistics. You Ctrl+W the tab and frantically return to Voyant’s pleasant color palette and baked-in numbers.

The previous is a dramatization, and you might be more comfortable with these controls and get more out of them than I did. Regardless, I think Voyant’s visualizations are better on the whole; but Lexos does have a few features that might come in handy, which brings me to the Analyze tab and the Similarity Query.

Similarity Query

Here’s where the clear difference between Voyant and Lexos lies. While Voyant rolled out a nice, pretty word cloud immediately, Lexos’s initial appearance is not quite as impressive, and it’s more than a little intimidating if you haven’t seen a sigma in a while. You probably won’t be interested in its word cloud, but there are some other visual components you may want to play with. What I’m going to focus on, because it’s the most distinctive and, in my opinion, the most useful, is the similarity query.

In the drop-down menu of the “Analyze” tab, click on “Similarity Query.” This tool allows you to analyze how similar the bulk of your corpus is to one section. The tool is pretty quick and will load the results in a few seconds, so you can see how your texts stack up to each other in a lot of different ways. Feel free to play around with it for a few minutes.

Basically, this tool analyzes one section (whichever document you select) and determines how similar the other documents are to it. It gives you a number between 0.00001 and 1.0, with 1.0 signifying “most similar.” These numbers are analogous to percentages, with 1.0 being a 100% match. Compared to Shelley, our control corpus seems to have a lot in common: all the control texts are 99% similar to Frankenstein stylistically. Poe’s collection is most similar to Shelley in Volume 2, at 90%, but the other volumes deviate strongly.
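Scores like these are typically a cosine-style comparison of word-frequency vectors: each document becomes a vector of word counts, and the score is the cosine of the angle between two of those vectors, with 1.0 meaning the documents use words in identical proportions. The sketch below shows the general idea; Lexos’s exact implementation may differ, and the file names are placeholders.

```python
import math
import re
from collections import Counter
from pathlib import Path

def word_counts(path):
    """Turn one document into a vector of raw word counts."""
    return Counter(re.findall(r"[a-z']+", Path(path).read_text(encoding="utf-8").lower()))

def cosine_similarity(a, b):
    """1.0 means the two documents use words in identical proportions."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# File names are placeholders for your own cleaned texts.
poe_vol2 = word_counts("poe_clean/poe_vol2.txt")
frankenstein = word_counts("gothic_clean/shelley_frankenstein.txt")
print(round(cosine_similarity(poe_vol2, frankenstein), 2))
```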

Across the board, Volume 2 seems to be a bit of an outlier: 0.90 to Radcliffe segment 2 and 0.98 to Radcliffe 1 (compared to Poe 5’s 0.05 to Radcliffe 2 and 0.66 to Radcliffe 1), and 0.91 to Lewis (Poe 5 was 0.07). If we compare all the texts to Poe 2, it’s 0.99 similar to Polidori, Lewis, Shelley, Radcliffe 1 and 2, and Walpole, while it’s only 0.10 similar to Poe 5, 0.07 similar to Poe 4, and 0.06 to Poe 1 and Poe 3. If we look at the text files, Poe 2 contains “The Black Cat,” “The Fall of the House of Usher,” “The Cask of Amontillado,” “The Premature Burial,” “The Pit and the Pendulum,” “The Tell-Tale Heart,” and several other tales that are not only his most famous but also the ones that lean most heavily on Gothic tropes and conventions, unlike stories such as “The Unparalleled Adventure of One Hans Pfaall,” “The Balloon-Hoax,” and “Never Bet the Devil Your Head.”

After Radcliffe and the other control texts, Poe 4 and Poe 5 are the volumes most similar to Poe 2, although the gap is huge. Volume 5 is the shortest, which might account for its slightly higher correlation, and Volume 4 contains “Metzengerstein,” an obvious parody of the German Gothic tradition, as well as “How to Write a Blackwood Article,” which openly lampoons the Gothic short-fiction writers of the UK. This might account for the closer correlation of these two volumes to the outlier volume.

If we are to make any kind of meaningful analysis of this data, we will of course need to go back to the text, as I have done in a small way, but with a new question in mind: What about the works in Poe 2 is so similar to these earlier European Gothic texts? How does Poe 2 stack up to the control corpus in other features, such as most frequent words? Between Lexos and Voyant we can build more meaningful lines of inquiry and extrapolate the significance of our statistical findings.

Conclusion: Go Forth and Analyze

If there’s anything Punk Rock taught the world, it’s that you can definitely do something without knowing how to actually do it. If you really want to do a genre analysis, you don’t need to know how to navigate R’s dark and sinuous corridors. You just need to consolidate pre-existing tools to piece together the analyses you want to perform. Using the simple, visually pleasing Voyant and the more thorough Lexos can help you build a clear idea of the stylistic features of your corpus. I chose these tools because, to me, they are the simplest to work with, but there are many others out there that perform similar functions. Of course, as with any kind of macroanalysis, it will be up to you to determine what the data you find means, but whatever the analysis turns up, you will have strong visuals and statistical evidence to make your argument.

Good luck with your own genre analysis, and remember: be excellent to each other.


Joel Sprinkle

Graduate student, musician, soccer enthusiast. Bassist for Scratch River Telegraph Company. Co-author of The Dying of the Light.