Graphing Word Trends with Bookworm

A Tutorial for the “Average” Humanities Student

Mar 24, 2014

By: Katie Garahan

The digital humanities tool Bookworm, a product of the Harvard Cultural Observatory, or Culturomics, allows users to create graphs that chart the use of words or phrases within various publications over time. The graphs are very similar to Google Ngrams; in fact, Culturomics has collaborated with Google Books in the past. A self-created graph (called a bookworm) works best with a large corpus of data; however, users not interested in creating their own bookworms can use the pre-made ones, which draw on freely accessible texts from the Open Library and the Internet Archive. The “freely accessible” aspect sets Bookworm apart from Google Ngrams, which relies on Google Books texts that are sometimes inaccessible. According to the Bookworm help site, Bookworm “uses the information to let you search for trends in any corpus you can create out of library metadata, and to link to the underlying books so you can read them” (http://bookworm.culturomics.org/OL/OL.html). Therefore, Bookworm is an excellent tool for macro-level discourse analysis.

In his article for journalofdigitalhumanities.org, Boone B. Gorges, a web developer who specializes in WordPress plugins, suggests that “Bookworm provides a link (both figurative and literal) between the very distant reading of macro-level quantitative analysis, and the close reading of specific texts that is crucial to contextualizing the qualitative results” (http://journalofdigitalhumanities.org/1-1/bookworm/). This notion hits close to the heart of the goals of digital humanities (DH). Bookworm responds to a commonly asked DH question: How can the humanities use “the digital” to enhance existing humanities practices (such as close reading)?

But can the average humanities student use this tool? The pre-made bookworms are intuitive and user-friendly, and I (the average humanities student) had no trouble exploring them. Creating my own bookworm, on the other hand, was difficult. The following tutorial is broken into three sections:

Easy: exploring pre-made bookworms
Somewhat difficult: constructing a bookworm from sample data
Hard: constructing a bookworm from scratch

Please note that I created this tutorial on a PC.

Part 1: Exploring pre-made bookworms (i.e. the easy part)

The first thing you should do when learning this tool is examine the pre-made bookworms. On the homepage (http://bookworm.culturomics.org), under “Our Bookworms,” you can view the examples created with data from the Open Library, arXiv, Chronicling America, the US Congress, and the Social Science Research Network (Figure 2).

For this tutorial, I chose to explore the Open Library. Click the category you want to explore, and a pre-made bookworm like the one in Figure 3 opens. The bookworm in Figure 3 is an excellent example of how this tool could be an asset for a comparative/contrastive study: in this graph you can see the frequency of terms related to transportation from 1860 to 1920.

You can see in Figure 4 that I searched for the word “evolution.” I deleted the search terms in the other text boxes and then clicked the tab labeled “All books” in order to further narrow my search. You’ll see in Figure 4 that you can narrow by subject, location, author, and LOC code. These subcategories change depending on the overall category you choose (Open Library, arXiv, Chronicling America, US Congress, or Social Science Research Network). The available subcategories provide a depth of analysis that the Google Ngram Viewer does not.

In Figure 5, notice that I have narrowed my search to the term “evolution” in books published in England. As is to be expected, the graph shows that the use of the term “evolution” grew steadily throughout the 19th century. Notice the X and Y axes on the graph. Right now, the X axis is set to “Year of Publication” and is marked every ten years. The Y axis shows the number of instances of “evolution” per million words in publications from England.
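
The “per million words” measure on the Y axis is a simple rate. Here is a minimal sketch of the arithmetic in Python; the function name and the counts are invented for illustration and are not taken from Bookworm.

```python
def per_million(word_count: int, total_words: int) -> float:
    """Return occurrences of a word per million words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Multiply first so the division happens last.
    return word_count * 1_000_000 / total_words


# Hypothetical numbers: 150 uses of "evolution" in 3,000,000 words of text.
rate = per_million(150, 3_000_000)
print(rate)  # 50.0
```

Expressing the count as a rate is what makes years with very different amounts of published text comparable on one graph.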

You can change the X and Y axes using the tool in the right-hand corner of the screen that looks like a wheel (Figure 6). While X must display the year of publication and Y must display the number of instances of the word, you can change the time span and the type of quantification (Figure 6). You can also choose to “smooth” the graph differently. Smoothing refers to how often the bookworm places a mark on the graph. For example, in Figure 6 the graph is set to every two years; therefore, a dot is plotted for the number of instances of “evolution” every two years, causing the graph to look choppy.

The graph in Figure 7, on the other hand, is plotted every 10 years and thus looks “smoother.”
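
One common way to smooth a trend graph is a moving average, in which each year's value is averaged with the values of nearby years. The sketch below assumes that approach. It is not taken from Bookworm's own code, and the yearly values are invented.

```python
def moving_average(values, window):
    """Average each point with up to `window` neighbors on either side."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented yearly frequencies (per million words), for illustration only.
yearly = [10, 30, 20, 60, 40]
unsmoothed = moving_average(yearly, 0)  # window 0 leaves the data unchanged
smoothed = moving_average(yearly, 1)    # each point averaged with its neighbors
print(unsmoothed)
print(smoothed)
```

A wider window flattens sharp year-to-year swings, which is why a larger smoothing setting tends to produce a calmer line.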

You can also add other searches to the same bookworm, which is excellent if you want to compare and/or contrast word usage based on a subcategory. For example, in Figure 8, I have also searched for the term “evolution” in books published in Germany.

While the bookworms are fun to simply explore, they can clearly serve a more significant purpose as well. For example, look at the bookworm in Figure 1, which is displayed on the Bookworm homepage. This graph shows usage of the words “war” and “deficit” in Acts of Congress and, essentially, explores possible correlations between the two. I highly suggest taking time to explore the existing bookworms. As I mentioned, this aspect of the tool is intuitive and lends itself nicely to self-tutorial.

Part 2: Creating a bookworm from a data sample (i.e. the somewhat difficult part)

Bookworm kindly eases users into creating their own bookworms from scratch by providing sample sets of data with which to practice. Unless you are fairly familiar with coding, I highly suggest you download a sample set of data to explore. In Part 2, I will walk you through this exploration.

Before you “Create a Bookworm,” make sure you read through the “Documentation” description seen in Figure 9, which explains how to organize your data. In order to create a bookworm, you must organize your data into three components: raw texts, a metadata file, and a JSON description (Figure 9). If this seems like a foreign language right now, don’t worry; I was in the same boat at first too.

The website suggests that you begin by looking at their completed examples, which, again, I highly recommend. I chose to explore the Baby Names example, as its raw data is closely related to the data corpus I used in my made-from-scratch bookworm example (which you will read about in Part 3). So, choose one of the three data sets to explore, and save it to your computer.

Once the file is downloaded, extract the files from the compressed zip folder, and investigate the contents of the file to familiarize yourself with the data format (Figure 11).

In the folder, you will find two other folders: “metadata” and “texts.” Open the “texts” folder first (Figure 12).

Inside the “texts” folder is another folder containing the raw data for the bookworm (Figure 13).

The raw folder contains the plain text documents (.txt), which make up the actual data for the bookworm. You’ll notice in the baby names example in Figure 14 that the files are all named in very similar formats: babynames_gender_year_sample number. The naming of these files is imperative for a working bookworm.
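
Because the file names matter so much, it can help to check them mechanically. The following Python sketch tests whether a name fits the babynames_gender_year_sample number pattern. The exact pattern (including the lowercase “male” and “female” values) is an assumption based on the sample file names, and parse_name is a hypothetical helper, not part of Bookworm.

```python
import re

# Assumed pattern: babynames_<gender>_<year>_<sample number>
FILENAME_PATTERN = re.compile(r"^babynames_(male|female)_(\d{4})_(\d+)$")


def parse_name(filename: str):
    """Return (gender, year, sample) or None if the name does not fit the pattern."""
    stem = filename[:-4] if filename.endswith(".txt") else filename
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        return None
    gender, year, sample = match.groups()
    return gender, int(year), int(sample)


print(parse_name("babynames_female_1920_0.txt"))  # ('female', 1920, 0)
print(parse_name("babynames female 1920.txt"))    # None
```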

As you can see in Figure 15, the only words in each file are the gender and the baby names for that year.

Once you feel comfortable with the raw texts, open the folder titled “metadata,” which is in the original Baby Names folder. Notice that the other two components differ in format (Figure 16). The jsoncatalog is a plain text document just like the raw data. Open this one first.

Figure 17 may look like an overwhelming hodgepodge of letters, numbers, and symbols. It did to me too, at first. It describes the raw data: each line of code corresponds to one of the text files. For example, the first line reads: {"date": "1920-01-01", "gender": "FEMALE", "filename": "babynames_female_1920_0"}. Basically, this code communicates how to organize the raw data on the X and Y axes of the bookworm. Remember that the X axis represents the date, and the Y axis represents the word (in this case, the baby name).
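
If you have Python installed, you can check that a line from the jsoncatalog is well-formed JSON before you upload anything. This is a minimal sketch using Python's standard json module. The first line is the one from the baby names sample; the second is a deliberately broken copy (a colon is missing) made up to show what a failure looks like.

```python
import json

# A well-formed catalog line, as in the baby names sample.
line = '{"date": "1920-01-01", "gender": "FEMALE", "filename": "babynames_female_1920_0"}'
entry = json.loads(line)
print(entry["date"], entry["gender"], entry["filename"])

# A made-up broken line: the colon after "gender" is missing.
broken = '{"date": "1920-01-01", "gender" "FEMALE"}'
try:
    json.loads(broken)
    broken_ok = True
except json.JSONDecodeError as err:
    broken_ok = False
    print("Invalid line:", err)
```

A single missing colon or quotation mark is enough to make a line unreadable, so this kind of check is worth running on every line.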

Next, open field_descriptions.json, which is formatted as a JSON file. This can be tricky, as you will need to download a code-reading tool. I use Komodo IDE, which I have found to be novice-friendly. Using the tool you've chosen, open the JSON file. The code you see in Figure 18 communicates how the data will be organized on the X and Y axes as well as the “categories” that will be available to search. You can see in Figure 18 that the main category is gender.

Next, practice creating a bookworm using the sample data you've chosen. Click on the tab “Create a Bookworm.”

Before creating this bookworm, you need to save the baby names zip file to Dropbox or another online folder-sharing site. After this, copy the URL for your Dropbox file, and paste it in the “Zip File URL” box on the “Create a Bookworm” page (Figure 19). Name your bookworm, and hit “Build Bookworm.”

This is the point where I ran into some trouble. My first few tries failed, but the bookworm finally worked on my fourth try. In between tries, I thoroughly checked my work but didn't change anything. I did realize, however, that the bookworm takes time to build, so I suggest that you attempt to build it when you have time to leave your computer on.

In Figure 22, you can see my properly created bookworm titled “Practice.”

Once the bookworm is up and running, take some time to explore it in order to understand how the codes communicate to create the graph. In Figure 23, I have searched for the name “Tracey” in two search bars, one looking at men named Tracey and the other looking at women. We can learn from this graph that the name “Tracey” was so popular between 1960 and 1970 that even male babies were named “Tracey.”

Part 3: Creating a bookworm from scratch (i.e. the hard part)

In what follows, I explain how to create your own bookworm, which is not an easy task. The steps must be followed very closely, as a small mistake may prevent your bookworm from working properly, something I experienced many times throughout this process. As I mentioned at the beginning of my post, bookworms work much better if you are working with a large corpus of data. The data set for my sample bookworm is relatively small, but it suffices to demonstrate the steps to create one.

I compiled my data from the digital humanities site, The American Presidency Project (http://www.presidency.ucsb.edu/index.php). Looking at the political party platforms for 2004, 2008, and 2012, I counted how many times specific words (rights, immigration, military, energy, abortion, same-sex, firearms, healthcare, and crisis) were used in both the Democratic and Republican National Platforms. This became my raw data.
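
Counting like this can also be done with a short script. The Python sketch below counts whole-word, case-insensitive matches for the nine terms in a piece of text. The matching rule is an assumption (a hand count might treat plurals or hyphenated forms differently), and the sample sentence is invented rather than quoted from a platform.

```python
import re
from collections import Counter

TERMS = ["rights", "immigration", "military", "energy", "abortion",
         "same-sex", "firearms", "healthcare", "crisis"]


def count_terms(text: str, terms=TERMS) -> Counter:
    """Count whole-word, case-insensitive occurrences of each term."""
    counts = Counter()
    lowered = text.lower()
    for term in terms:
        # Require that the term is not glued to other letters, digits, or hyphens.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


# Invented sample text, for illustration only.
sample = "Immigration reform matters. Energy and immigration policy both face a crisis."
counts = count_terms(sample)
print(counts["immigration"], counts["energy"], counts["crisis"])  # 2 1 1
```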

Then, I created my file folders (metadata and texts) to emulate the “Baby Name” sample (Figure 24). However, I waited to transfer these to a compressed zip folder until after I compiled my data.

Next, I collected my raw data into plain text documents (.txt). I carefully named my files in a similar format (Figure 25).

Notice in Figure 26 that I used Notepad to create my raw data .txt documents. I typed each word as many times as it appeared in the national platform from the given year for the given party. For example, in the 2004 Democratic National Platform, the word “immigration” appears five times.
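
Typing each word the right number of times by hand invites typos, so here is a minimal Python sketch that writes such a file from a table of counts. The count of five for “immigration” comes from the 2004 Democratic platform as described above; the file name, the count for “energy,” and the write_raw_text helper are hypothetical.

```python
from pathlib import Path
import tempfile


def write_raw_text(folder: Path, filename: str, counts: dict) -> Path:
    """Write each word repeated `count` times, separated by spaces."""
    words = []
    for word, count in counts.items():
        words.extend([word] * count)
    path = folder / filename
    path.write_text(" ".join(words), encoding="utf-8")
    return path


# A temporary folder keeps this demonstration from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    out = write_raw_text(Path(tmp), "platform_democratic_2004_0.txt",
                         {"immigration": 5, "energy": 2})  # "energy" count is made up
    content = out.read_text(encoding="utf-8")

print(content)
```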

Next, I created the corresponding “metadata” documents, field_descriptions.json and jsoncatalog (Figure 27).

Notice in Figure 28 that I created my jsoncatalog in Notepad, and I modeled it after the baby name example.
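
A catalog like this can also be generated rather than typed. The sketch below builds one line per text file with Python's json module, which takes care of the quotation marks, colons, and commas. The field names “date,” “party,” and “filename,” the uppercase party values, and the file names are assumptions modeled on the baby names sample, not a record of my actual file.

```python
import json

records = []
for year in (2004, 2008, 2012):
    for party in ("democratic", "republican"):
        records.append({
            "date": f"{year}-01-01",                   # same date style as the sample
            "party": party.upper(),                    # hypothetical category field
            "filename": f"platform_{party}_{year}_0",  # hypothetical file name
        })

# One JSON object per line, as in the baby names jsoncatalog.
catalog_lines = [json.dumps(record) for record in records]
catalog_text = "\n".join(catalog_lines) + "\n"
print(catalog_text)
```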

Using Komodo, I created my field_descriptions.json document, which, again, I modeled after the Baby Name example (Figure 29).

Then, I saved both folders (“metadata” and “texts”) into a compressed zip folder titled “Party Platforms” on Dropbox.
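
For anyone who prefers a script to right-clicking, the zip folder can be built with Python's standard zipfile module. This sketch assumes the same layout as the baby names sample (a “metadata” folder and a “texts” folder containing “raw”); the file names inside, including jsoncatalog.txt, and the zip_bookworm helper are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_bookworm(root: Path, archive: Path) -> list:
    """Zip the metadata and texts folders under root; return the archived names."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in ("metadata", "texts"):
            for path in sorted((root / folder).rglob("*")):
                if path.is_file():
                    zf.write(path, path.relative_to(root).as_posix())
        return zf.namelist()


# Build a tiny stand-in folder structure in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "metadata").mkdir()
    (root / "texts" / "raw").mkdir(parents=True)
    (root / "metadata" / "jsoncatalog.txt").write_text("{}\n", encoding="utf-8")
    (root / "metadata" / "field_descriptions.json").write_text("[]\n", encoding="utf-8")
    (root / "texts" / "raw" / "platform_democratic_2004_0.txt").write_text(
        "immigration", encoding="utf-8")
    names = zip_bookworm(root, root / "Party Platforms.zip")

print(names)
```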

Next, I copied the URL for my zip folder and pasted it into the textbox on the “Create a Bookworm” page (Figure 19).

And again, I ran into some trouble. At first, of course, I had some flaws—mainly typos—in my data, which I had to fix.

As you can see in Figure 32, I encountered several failures before I actually constructed a working bookworm. I recommend you check your data carefully several times; even a misplaced space can prevent your bookworm from constructing properly. Again, I also suggest you have patience and time to allow your bookworm to construct completely. At this point, I contacted the site managers for help. Unfortunately, I received no response, so I continued to plug along on my own.
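
Some of this checking can be automated. The Python sketch below reads a catalog line by line, reports any line that is not valid JSON, and reports any “filename” that has no matching .txt file. The folder layout (metadata/jsoncatalog.txt and texts/raw/) is an assumption modeled on the baby names sample, check_catalog is a hypothetical helper, and passing this check does not guarantee that Bookworm will accept the data.

```python
import json
import tempfile
from pathlib import Path


def check_catalog(root: Path) -> list:
    """Return a list of human-readable problems found in the catalog."""
    problems = []
    catalog = root / "metadata" / "jsoncatalog.txt"
    lines = catalog.read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        name = entry.get("filename") if isinstance(entry, dict) else None
        if name is None:
            problems.append(f"line {number}: no filename field")
        elif not (root / "texts" / "raw" / f"{name}.txt").is_file():
            problems.append(f"line {number}: no text file for {name!r}")
    return problems


# Demonstration on made-up data: one good line, one typo, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "metadata").mkdir()
    (root / "texts" / "raw").mkdir(parents=True)
    (root / "texts" / "raw" / "platform_democratic_2004_0.txt").write_text(
        "immigration", encoding="utf-8")
    (root / "metadata" / "jsoncatalog.txt").write_text(
        '{"date": "2004-01-01", "filename": "platform_democratic_2004_0"}\n'
        '{"date": "2008-01-01", "filename" "platform_democratic_2008_0"}\n'
        '{"date": "2012-01-01", "filename": "platform_democratic_2012_0"}\n',
        encoding="utf-8")
    problems = check_catalog(root)

for problem in problems:
    print(problem)
```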

I did, finally, create a “successful” bookworm. This “working” bookworm, seen in Figure 33, however, still did not work properly. As you can see, the X axis spans from ‘04 to ‘08 instead of ‘04 to ‘12, which I could not seem to fix.

Also, my bookworm is supposed to narrow by political party, but it will not. You can see in Figure 34 that when I type “democrat,” no results are found. This is highly disappointing because a working bookworm would allow me to compare and contrast the Republican and Democratic party platforms, which is what I had in mind when I created my data sample.

After spending a good deal of time with Bookworm, I’ve realized that it could potentially be an incredible research asset. Like the Google Ngram Viewer, bookworms provide a medium for comparative/contrastive and term-correlation analysis for students and scholars in just about any stage of research and discipline. For me, an English graduate student in the beginning stages of thesis research, creating a bookworm with my own data sample does not make sense. As you have seen and perhaps experienced, the data preparation necessary for bookworm creation is intense, lengthy, and frustrating. Plus, a bookworm only really becomes provocative and interesting when it is created with a large corpus of data, which is something that at this point in my research I do not have. So, for now I plan to use the easy components of this tool with the hope that maybe someday I’ll have a need to create a (properly) working bookworm.
