Voyant Tools: A Tutorial for Text Analysis
I’m about to embark on my first real research journey, with a data set that is large and cumbersome. I’ll be using a few different tools to analyze it, one of them being Voyant Tools. Voyant Tools is a web-based tool that reads and analyzes texts in a variety of formats, including plain text, HTML, XML, MS Word, RTF, and PDF. It can also pull text from webpages if you give it URLs. If you just want to try the tool out before you input any data, there are pre-loaded example texts you can choose from.
As you can see from the image above, the first impression of Voyant is relatively un-intimidating. As a newcomer to the digital humanities realm, I found this refreshing. What I did not find refreshing was the typo in the textbox on the homepage. I can overlook this, given the utility of this software, but the English major in me had to make note of it.
Let’s dive in, shall we?
Above, I have highlighted, in order, the functions to explore. Those three buttons boxed in red in the upper right-hand corner are in every tool box in Voyant. The one to the left that looks like a wheel is the settings button. Clicking on that will let you change the input format and XML options. I’m using .txt files, so I didn’t change anything. The middle button that looks like a floppy disk (90s throwback!) gives you the option to export your data. The button on the right that looks like a question mark is the help button. This has limited usefulness because, while it briefly describes the function of the tool it is ascribed to, the link between it and further help is broken.
Next, choose your option for inputting text. I first tried copying and pasting URLs, because it seemed like the easiest method. However, when I did this, Voyant pulled every bit of text from the webpage, including the comments and the text in side advertisements, and for some reason it copied the entire text twice.
I’m going to backtrack here for a second and briefly discuss the kind of data I’m analyzing. My project involves looking at food blogs, so I’ve selected ten posts to constitute a sample set for the purpose of this tutorial. The posts come from Liren Baker’s Kitchen Confidante. The recipes that I’m analyzing are for: Avocado Milk, Braised Chicken with Chickpeas, Butterbeer, Chicken Pho, Cilantro Sriracha Turkey Burgers, Colombian Chicken Corn Potato Soup, Jambalaya, Mango Ginger Sorbet, Mushroom Quinoa Risotto, and Smoked Salmon Bites.
I uploaded them into Voyant by first clicking on “Upload,” then on “Add.” You have to select files one at a time. Once you’re done, you click on “Reveal,” and it uploads and reads them. This process could take a while, depending on your file size.
When it’s done processing, you’ll go to this page. Here’s where the magic happens. There’s a lot that can happen on this page, but I want to direct your attention to three main places for now. I boxed in orange an example of the kind of data you’d pull by inputting URLs. That big section holds all of the text I uploaded. You can’t delete material once you've uploaded it without starting over, so unless you actually want this kind of extraneous material, I would recommend not using the URL option. I wanted to look at just the text of the posts themselves, and I didn't want the extra material to skew my analysis. The red and black arrows both point to something pretty important: function words. The red arrow points to the word cloud, which, as you can see, has pulled these function words (and, my, I, the, it, etc.) as the most common words. The black arrow points to the “most frequent words” list, where you can see this confirmed. To get rid of those pesky words and get at the real material, you’ll need to click on the options wheel in either of those two windows.
Those function words are also called stop words, and there are a couple of different ways you can go about removing them. I've chosen to exclude the ones common in English, since I’m analyzing text in English, but there are many other languages to choose from. If you want to control for the words you remove, you can click on the “Edit Stop Words” button, and work with that. Be sure to check the “Apply Stop Words Globally” button to ensure that your change goes into effect in all of the tools, rather than just in the box you've selected.
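If you’re curious what removing stop words actually does to a frequency list, here is a minimal Python sketch. It is not Voyant’s own code: the tiny stop list and the `tokenize` and `top_words` helpers are hypothetical names for illustration, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only;
# a real English stop list is much longer.
STOP_WORDS = {"and", "my", "i", "the", "it", "a", "to", "of", "in"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(text, stop_words=frozenset(), n=5):
    """Return the n most frequent words, skipping anything in stop_words."""
    counts = Counter(t for t in tokenize(text) if t not in stop_words)
    return counts.most_common(n)

sample = "I love the soup and my husband loves the soup and the bread"

# Without a stop list, function words crowd the top of the list.
print(top_words(sample))
# With the stop list applied, content words surface.
print(top_words(sample, STOP_WORDS))
```

In the first list “the” comes out on top; in the second, “soup” does. That is the same shift you see in the word cloud once the stop list is applied.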
That looks better, doesn't it? So much more relevant content.
Here’s the cloud along with the other two segments in the left-hand panel.
Cirrus, which is the cloud, is pretty intuitive. If you’re not familiar with word clouds, a word cloud is just a visual aggregation of the most common words. The words that are larger occur more frequently across the corpus. The segment below it is called the summary section. It provides a textual overview of the corpus, including: number of documents, number of words, number of unique words, longest documents by words, highest vocabulary density, most frequent words, words with noticeable peaks in frequency across the corpus, and distinctive words. I’ve collapsed the top two sections by clicking on the arrows next to the help buttons to give you a better look at what’s in the next section.
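A few of the summary numbers are easy to reproduce by hand. The sketch below is only an approximation of what the summary reports: it assumes “vocabulary density” means unique words divided by total words, and the `summarize` function and the two toy documents are made up for illustration.

```python
import re

def summarize(docs):
    """docs maps a document title to its text; returns per-document statistics."""
    stats = {}
    for title, text in docs.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        total = len(tokens)
        unique = len(set(tokens))
        stats[title] = {
            "words": total,
            "unique": unique,
            # Assumption: density = unique words / total words.
            "density": unique / total if total else 0.0,
        }
    return stats

# Invented toy documents, not the actual blog posts.
docs = {
    "butterbeer": "butter and sugar and cream",
    "jambalaya": "rice sausage shrimp rice",
}

stats = summarize(docs)
for title, row in stats.items():
    print(title, row)

# Longest document by word count.
print("longest:", max(stats, key=lambda t: stats[t]["words"]))
```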
Here’s where my favorite part comes in: graphing word frequencies. Figuring out the different options for this part was my biggest learning curve with this tool. On the right side of the screen is a panel for the graph.
As an example, I clicked the box next to “love” in the word list to the left. It highlighted all the instances of the word in the entire corpus and created a graph of that to the right. In the graph itself you can choose between relative frequency, which indicates how often the word occurs in relation to the other words in the individual document, and raw frequency, which indicates how many instances of the word occur in each document. There’s an interesting spike in one of the documents, so I clicked on the corresponding color boxed in red to bring me to that spot in the corpus. In this post, the author is writing a letter to her children on Mother’s Day, which would explain why the frequency is so high.
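The difference between the two measures is easiest to see with numbers. This is a rough sketch, assuming relative frequency is simply the word’s count divided by the document’s total word count (Voyant may scale the value differently). The function names and the two toy documents are hypothetical.

```python
def raw_frequency(word, tokens):
    """How many times the word occurs in the document."""
    return tokens.count(word)

def relative_frequency(word, tokens):
    """The word's count as a share of all words in the document."""
    return tokens.count(word) / len(tokens) if tokens else 0.0

# Invented toy documents: one short, one long.
short_doc = "love love soup".split()
long_doc = "love the broth love the noodles love the herbs and the lime".split()

for name, tokens in [("short", short_doc), ("long", long_doc)]:
    raw = raw_frequency("love", tokens)
    rel = relative_frequency("love", tokens)
    print(f"{name}: raw={raw} relative={rel:.3f}")
```

The long document has more raw instances of “love” (3 against 2), but the short one has the higher relative frequency, which is why a spike can look different depending on which option you pick.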
I then decided I wanted to look at all words that indicate relationships in this corpus. So I selected things like “husband,” “children,” “daughter,” and “son.” As you can see at the bottom of that section, there are 28 pages of words. The boxes that you check don’t save between pages, so when you get to the bottom of the page, you have to click on that pink heart with the green plus sign at the bottom right. It will add those words to your favorites list. You then have to toggle it off to return to the word list by clicking on the other heart next to the one with the green plus sign.
This took me quite a bit of time to figure out. First I tried to highlight words in the middle section, which worked, but took much longer than just scrolling through the list to the left. Then when I was reading the text in that main section, I realized that there were relational phrases I wanted to include, but couldn’t because of the constraints of the tool. I thought I might be able to search for phrases by just putting quotation marks around them, but when I did that and hit enter, nothing happened.
So after scrolling through 28 pages of word tags and selecting the ones I wanted to look at, I ended up with this:
Isn’t that a beautiful graph? I thought so too. In the left-hand section, you can see the words that ended up in my favorites list. When I selected them all, Voyant generated a graph with all of them to the right. So nifty. By looking at this graph, I can already start talking about general trends, and the kinds of relationships that are displayed across the set of data. There’s a lot going on in that graph. You can select an area of the data to look at more closely by clicking and dragging over a section of the graph. It will zoom in, and you can zoom out again by clicking on the “reset zoom” button.
Additionally, you can see the overall trend of these words by checking the box next to “collapse terms” at the bottom of the segment.
This allowed me to look at the trend of all relational words together across the corpus.
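Conceptually, collapsing terms just adds the selected words together for each document. Here is a small sketch of that idea, assuming the collapsed line is a plain sum of raw counts; the `collapsed_counts` function and the sample documents are hypothetical, and only the four relational words come from my list above.

```python
from collections import Counter

# Relational words selected earlier in the tutorial.
RELATIONAL = {"husband", "children", "daughter", "son"}

def collapsed_counts(docs, terms):
    """Return one combined raw count per document for all terms together."""
    result = []
    for text in docs:
        counts = Counter(text.lower().split())
        # A Counter returns 0 for words that never appear.
        result.append(sum(counts[t] for t in terms))
    return result

# Invented sample documents, in corpus order.
docs = [
    "my husband and my son cooked",
    "soup for the children",
    "mango sorbet",
]

print(collapsed_counts(docs, RELATIONAL))  # [2, 1, 0]
```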
Another neat trick will allow you to view each word more closely. If you hit the “reset zoom” button and return to the original graph, you’ll see that “husband” occurs quite frequently in one document in particular. If you click on it, you’ll get this:
Clicking on that individual pink dot where the frequency peaks brought up a list of the keywords in context in the section below the graph. This allows you to simultaneously look across the breadth of the data, and deeply into a particular slice of it.
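A keywords-in-context list is simple to picture in code: find each occurrence of the word and keep a few words on either side. This is a minimal sketch of the general technique, not Voyant’s implementation; the `kwic` function, its `window` parameter, and the sample sentence are made up for illustration.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Invented sample text.
text = "my husband made the broth and later my husband tasted the soup"
tokens = text.lower().split()

for left, word, right in kwic(tokens, "husband"):
    print(f"{left:>10} | {word} | {right}")
```

Each row shows one hit with its neighbors, which is the same “breadth plus depth” view the panel below the graph gives you.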
Moving on to the last box, “Words in Documents.” Here, I found out I could look at the distribution of words in a specific document more closely. “Type,” “Count,” “Relative,” and “Trend” are the standard columns, but if you hover over the arrow next to any of these columns, and then hover over “Columns,” you can see there are many more options.
If you’re looking to analyze statistical data, these options could prove to be very helpful.
There’s also the option of exporting this data into a format of your choice. By clicking on that floppy disk icon, you’ll get a range of options, or you can create a URL that will link you back to the same set of data in Voyant Tools in the future.
So, given all of this information and all of the different ways to look at it, what’s the benefit? Even from just trying out Voyant Tools on this small data set, I’ve learned that I need to ask different kinds of questions. Or rather, that using this tool allows me to ask different questions. Beyond just looking at word frequencies, you can look for words, or categories of words, that are absent from the corpus. As the graphs showed, you can examine various trends and look at the specific places in the corpus where they occur.
I hope you’ve found this tutorial useful. It’s certainly not an exhaustive exploration, and there are many hidden features that I didn’t get to cover. You can find a list of them here. Please feel free to comment below this post with questions or input on your own experience with Voyant Tools.