Gold in Them Thar Texts?

The Complete, Absolute Beginner’s Guide to Getting Started with Textmining Using CATMA

Andrew Kulak
Mar 24, 2014

Hello, world! In the spirit of every computer science textbook ever made for some reason, allow me to introduce myself to you all out in the Medium community.

I’m Andrew Kulak, and I am a second-year M.A. student in English at Virginia Tech, focusing on rhetoric and writing. My evil Introduction to Digital Humanities professor, Quinn Warnick, asked each of us in his class this semester to explore a different digital humanities tool and to share our experience learning it on Medium for (hopefully) the benefit of other digitally inclined humanists. I’ve got to say, I’m digging this whole Medium thing so far.

Elements of Textmining for Liberal Arts Majors

Today I’ll be your tour guide in the wonderful world of Computer Aided Textual Markup and Analysis or, because everyone into technology loves acronyms that kind of spell things, “CATMA” for short. Prof. Warnick said we could use our own voices, so your CATMA correspondent at times might (will) employ sarcasm and thinly-veiled attempts at nerdy wit. Very thinly-veiled. You have been warned.

CATMA is a tool for what’s called textmining, or the digitally-assisted analysis of large bodies of text. Matthew Jockers, whose book Macroanalysis served as my first exposure to a large-scale textmining project, has used computer algorithms capable of processing huge corpora of novels to, among other really cool things, identify the gender of authors based on how they use pronouns and punctuation, reliably ascertain authorship based on comparison to known bodies of texts, and determine once and for all that Moby Dick is a book about whaling. Who knew?

If you’ve never textmined before, fear not! My introduction will get you down in the digital mineshaft and developing a technologically-mediated case of black lung in no time. But before we jump into CATMA, which I discovered can be a bit intimidating for the first-time user, I’d suggest you direct your attention to a little website called Voyant Tools first.

Voyant Tools has a very simple interface. I’d suggest it for first-time textminers looking to get their feet wet

I’ll start by walking you through finding text and using Voyant Tools. If you feel comfortable with your digital chops right now and already have some text you’re just itching to mine, feel free to skip this little excursion. You can meet us in the next section.

Staking Your Claim

The first thing you need for any textmining project is, well, some text to mine. A pickax and headlamp are optional, although they do make you look very cool. The text you choose should be plaintext, or text without formatting, page numbers, or anything like that.

If you have some text you’re ready to work with, great! You could try mining your work, your students’ work, favorite websites, song lyrics, or anything else you can think of. Just keep in mind that textmining software was designed to dig into large-scale corpora, so you’ll get more interesting results from a bunch of text than from just a small sample. To consolidate your text and get it ready to mine, copy and paste it into a simple text editor like TextEdit, which comes with Mac OS computers, and save your selected text as a .txt file (in TextEdit, choose Format > Make Plain Text first). This will eliminate digital markup that you can’t see but that could nonetheless throw a monkey wrench into a computer program.

If you don’t already have some text in mind to work with, no problem! You can head on over to Project Gutenberg, as we did in our Digital Humanities class. I grabbed The Man Who Was Thursday by G.K. Chesterton. I’ll be using this fun little novel in my examples, but if you’re into boring writing, feel free to snag some other book. There’s no accounting for taste.

Use the text field in the top left-hand corner of the page to search Project Gutenberg’s free digital library

If you decide to get a Project Gutenberg book, make sure to follow the same steps as above: save the book as a .txt file using a simple text editor and eliminate everything in the eBook that is not part of the work itself, like Project Gutenberg’s boilerplate and any introductions, section headings, etc. This will help keep your results relevant. Make sure to give your text a meaningful title, so it is easy for you to find on your computer. Saving your text to your desktop for the moment might also make your life a bit easier.
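If deleting boilerplate by hand sounds tedious, that cleanup can be scripted. Here is a minimal Python sketch, not anything Project Gutenberg or CATMA provides. It assumes your download contains the usual marker lines that begin with "*** START OF" and "*** END OF" (the exact wording varies between eBooks), and the function names are made up for illustration. It will not remove introductions or section headings; you still have to trim those yourself.

```python
import re

# Assumption: the file uses marker lines beginning "*** START OF" and
# "*** END OF". The rest of each marker line varies between eBooks.
START_MARKER = re.compile(r"^\*\*\*\s*START OF.*$", re.IGNORECASE | re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\*\s*END OF.*$", re.IGNORECASE | re.MULTILINE)


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(raw_text)
    end = END_MARKER.search(raw_text)
    begin_at = start.end() if start else 0
    stop_at = end.start() if end and end.start() >= begin_at else len(raw_text)
    return raw_text[begin_at:stop_at].strip()


def save_as_plaintext(raw_text, out_path):
    """Write the cleaned text to a UTF-8 encoded .txt file (hypothetical helper)."""
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(strip_gutenberg_boilerplate(raw_text))
```

You would read your downloaded file into a string, pass it to `save_as_plaintext` along with a meaningful file name, and then skim the result to make sure nothing important got cut.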

Alright, so we’ve got some text. For now, let’s throw it into Voyant Tools before we get into more complicated stuff. This should give you a quick idea of what textmining can do. Simply hit “Select All” in your text editor, then copy and paste your text into the text field on the Voyant Tools homepage. The whole process is very user friendly thanks to the Google-esque minimalistic design. You can also upload the .txt file you’ve created, if you’d prefer, using the button below the text field on the homepage. Now hit “Reveal.” Boom, you’ve just textmined:

Voyant Tools “reveals” your text as a word cloud and a list of words used in the text and their frequency

Take a few minutes to explore your results and get a feel for what textmining can offer. It’s not a substitute for reading, but it can visualize interesting trends about a body of text that a standard reading might miss. Try using the “Options” buttons (the ones with small gears on them) to apply a stopwords list, which will remove common words like forms of the verb “to be” and articles, allowing the analysis to focus on meatier words. I will not be focusing on Voyant Tools here, but one of my colleagues is (edit: link added), should you wish to dig into that site in more depth. When you’re ready to try your hand at CATMA, read on.
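For the curious, the idea behind a frequency list with stopwords fits in a few lines of Python. This is only a sketch of the general technique, not how Voyant Tools works under the hood. The stopword list is a tiny one I made up for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one Voyant Tools uses.
STOPWORDS = {
    "a", "an", "the", "and", "of", "to", "in",
    "is", "are", "was", "were", "be", "been", "am",
}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase words in the text, skipping any in the stopword set."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(word for word in words if word not in stopwords)
```

Calling `word_frequencies(text).most_common(20)` gives you the twenty most frequent words that survive the stopword filter, which is roughly the raw material a word cloud is drawn from.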

Taking the CATMA Out of the Bag

I found CATMA on the Bamboo DiRT website, which I highly recommend for anyone in the market for humanities hacks. I’m very interested in textmining, especially in developing projects to analyze large-scale social networks like Reddit, so that’s why I decided to take CATMA for a spin. It looked like a user-friendly program similar to but a bit more feature-rich than Voyant Tools. Let’s take it out of the garage and see what it can do!

CATMA’s homepage provides some basic information and directs you to the actual program, which lives online

Immediately upon accessing the homepage, you can probably tell that this isn’t going to be as straightforward as Voyant Tools — it lacks the simple interface and drag-and-drop functionality. To get started, let’s click the “Work with CATMA” button. You should end up here:

To use CATMA, you first need to log in. Don’t worry: it’s free and works through your existing Google account

I know what you’re thinking: “Repository? Tag manager? Plus I have to log in? Screw this, I’m going back to my happy little Voyant Tools cocoon.” I felt the same way, but don’t flee just yet! Login is quick and painless via the Gmail account you already have (c’mon, it’s 2014) and using CATMA is totally free. So just click the button in the top right-hand corner and follow the login instructions. You’ll get to a place like this:

The CATMA Repository Manager is where your adventure will begin

If you were scared before, you’re probably terrified now. All these buttons and windows make Voyant Tools look like a day at the beach. But take a deep breath, and don’t panic. First, let’s focus on the “Corpora” section. Mine looks a little different than yours will because I already have some documents uploaded, but you will, too, soon enough. Click the button that says “Create Corpus.” You will be prompted to name your new corpus. Corpora can contain a number of different documents, so consider naming it for an author, a genre, a time period, or something similar depending on your text and interests.

I made my first corpus before I realized that it would hold more than one document, so I just named it for the document. I’ll demonstrate how to make a new corpus by creating a corpus called “G. K. Chesterton” to hold The Man Who Was Thursday, to replace the corpus I created named for the novel. That way, I can add other Chesterton works to it later if I want to, and the organization will make sense:

Your corpus can contain many documents, so name it according to author, theme, genre, etc.

You can name your new corpus whatever you’d like. Note that CATMA lives online and will save corpora up in the cloud for you, a convenient feature that Voyant Tools lacks (see, I told you trying out new tools would be worth it). Now, let’s add a document to your new corpus. Highlight the corpus you just created, which should appear in the “Corpora” section, by clicking on it. Next, click “Add Document.” A popup window will appear like this one:

You can either upload a document from your local hard drive or pull a text file directly from the web with a URL

Note that you can use a URL, if your text exists as a plaintext file on the Internet, but we’ll stick with the file you’ve already saved on your computer. Select “Upload local file” and find where you saved your selected text, then select “Open.” A progress bar should indicate the status of your upload. You’ll get a message once it’s complete. Once you get that message, click “Next” in the lower right-hand corner.

The next step will prompt you to indicate the file type. Autodetection should handle this for you. If for some reason it does not, select TEXT as the file type, and UTF-8 from the UNICODE menu manually. Assuming you’ve properly saved your document as a .txt file, this should work just fine. If you have no idea what UNICODE means, it’s not really a big deal. In the industry, I think that’s called “black boxing.” Let’s move on. Click “Next.” You know you want to.
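If you do want a peek inside the black box: UTF-8 is one way of storing Unicode characters as bytes, and a file saved in some other encoding can turn accented letters and curly quotes into garbage. Should autodetection ever misbehave, a quick check like the sketch below can tell you whether your file really is valid UTF-8. The function name is my own invention, and this has nothing to do with how CATMA does its detection.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

To test a saved file, read it in binary mode and pass the bytes along, for example `is_valid_utf8(open("my_text.txt", "rb").read())`, where the file name is whatever you called yours.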

Now you will be prompted to identify the language of your selected text to help with analysis functions. As you saw with Voyant Tools, textmining will look for instances of words in a text, so this helps to generate an appropriate word list. You can also add inseparable character sequences, but here CATMA provides some helpful, albeit slightly condescending, guidance for the new digital humanist: “If you are unsure what to do, just select the language and leave everything else unmodified.” That’s what I’ll recommend, unless you know there are some groups of characters and punctuation marks that you want to be considered as a unit. Don’t know what that means? I don’t either. Click “Next.”

Finally, you’ll be prompted to input some metadata. Relax, this isn’t Edward Snowden metadata, although an NSA algorithm is probably reading this. (Hi, NSA-bot!) This is a place for you to input some information about your work, like title and author, to help keep everything organized. Now hit “Finish.” Woo! We did it! You should have something that looks like this:

Admit it: adding a document wasn’t as hard as you thought it was going to be

Well, kind of like that. Only with the name of your work where the picture has the name of my work, and the name of your corpus where the picture has the name of my corpus, and…OK, you know, it’s not going to look exactly like the picture. But the important part is that you now have a happy little document in its own happy little corpus. Now let’s just make a happy little tree over here…or, I mean, do some analysis!

Click the document you just successfully added, and then click “Open Document.” You should get a window like this:

You can add tags in the document screen, but that’s beyond the scope of this tutorial. If you know how to tag or work with markup, go wild

Right, so let’s ignore the entire right half of the screen. That section is a whole other long blog post, let me tell you. Tagging is something that I’m just figuring out myself, so for now, let’s just focus on simple analysis tools. Click the “Analyze Document” button. In the window that appears, click the “Wordlist” button. This will get you a familiar list of word frequencies, not unlike Voyant Tools. While it’s loading your word list, hover your cursor over the “?” next to the Wordlist button for an idea of what functions are available to you here. Depending on the size of your file, it could take some time to analyze. Have some coffee. Follow me on Twitter. You should eventually get something that looks like this:

Clicking “Analyze Document” will bring up a list of all words that occur in your text

Here, you can select any word in the text, then select the button that looks like a line chart for a simple chart of the term’s frequency throughout the text. You can select the button next to that for a doubletree visualization that represents the word in context throughout your document. Of course, navigating through every word in your document is cumbersome, so you can also use the “Query Builder” button to find particular phrases of interest. Just like when we uploaded a document, CATMA will walk you through building a query for the words you are looking for. I’m going to look for words relating to “anarchist.”

Click “Query Builder.” I’ll use “anarch” in the “First word starts with” field, which will return all instances of anarchy-related words including anarchy, anarchist, and anarchism.

You can look for exact words, words that begin or end a certain way, or words that contain a set of characters anywhere
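A "starts with" search is the same idea as a regular expression anchored at a word boundary. The sketch below shows that general technique in Python. It is not CATMA's query language, and the function name is hypothetical.

```python
import re
from collections import Counter


def words_starting_with(text, prefix):
    """Count every word in the text that begins with the given prefix."""
    # \b anchors the prefix at the start of a word, so "monarchy" is skipped.
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    return Counter(match.group(0).lower() for match in pattern.finditer(text))
```

The word boundary matters: without it, a search for "anarch" would also pick up the "anarchy" hiding inside "monarchy."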

After clicking “Finish,” here are all the phrases my query returned, as seen in my analysis window:

My query found five different anarchy words

Selecting the “Visible in Kwic” checkbox will show each instance of the word in context and with its absolute position in the text on the right side of the screen. That’s boring, though, so let’s make some visualizations! Build your own query, searching for words or phrases you are interested in. Click on the words in your “Results by Phrase” tab to highlight them. Using command+click, I have selected all instances of “anarch” words:

Use command+click to select multiple terms at the same time

Now, click the line chart button in the bottom left-hand corner, and you’ll get a nice little graph:

Anarchy talk is heavy at the beginning of The Man Who Was Thursday, when protagonist Syme encounters an anarchist poet

It looks like mention of anarchy is concentrated in the beginning of the book, so if I’m interested in G. K. Chesterton’s views on anarchy, that’s where I should focus. You can close out of the chart by clicking the “x” in the top right corner, and return to your word list.
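The numbers behind a chart like that are simple to compute yourself. Here is a minimal sketch, assuming you slice the text into ten equal chunks by word position and count matching words in each one. CATMA may well divide the text differently, and the function name is made up.

```python
import re


def prefix_counts_by_segment(text, prefix, segments=10):
    """Count words starting with the prefix in each equal slice of the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = [0] * segments
    for position, word in enumerate(words):
        if word.startswith(prefix.lower()):
            # Map the word's position to a slice index from 0 to segments - 1.
            counts[position * segments // len(words)] += 1
    return counts
```

A result that is front-loaded, with big numbers in the first slots and zeros after, is the list version of a line chart that peaks early and flattens out.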

Now, select one word that’s especially interesting to you, and we’ll create a doubletree visualization. This type of visualization shows your word in question, along with an interactive graphic representation of its context. I’m going to highlight “anarchist,” since it’s used most frequently, then select the doubletree chart button, the one to the right of the line chart button. CATMA creates this neat little visualization:

Clicking on a term on either side will reveal additional layers of context

Clicking on any word to the left or right of “anarchist” will show you more context regarding that word. Play around a bit with it — it can help you get a feel for how your chosen word appears throughout a text or corpus.
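The first layer of a doubletree boils down to two tallies: which words come right before your keyword, and which come right after. A rough Python sketch of that idea follows. It is my own simplification with a hypothetical function name, not CATMA's implementation, and it only looks one word in each direction.

```python
import re
from collections import Counter


def neighbor_counts(text, keyword):
    """Tally the words immediately before and after each keyword occurrence."""
    words = re.findall(r"[a-z]+", text.lower())
    before, after = Counter(), Counter()
    for index, word in enumerate(words):
        if word == keyword.lower():
            if index > 0:
                before[words[index - 1]] += 1
            if index + 1 < len(words):
                after[words[index + 1]] += 1
    return before, after
```

Note that the simple word pattern here splits contractions like "don't" into two pieces, which is fine for a sketch but worth fixing before you trust the counts.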

LOL CAT(MA)

Well, there you have it. The basics of CATMA. Now you can upload your own documents, build queries, and generate some basic visual representations. Now, I know what you’re thinking: that basically did the same thing as Voyant Tools, only it took me an entire afternoon. And I’d agree with you.

I came into CATMA thinking I’d be making visualizations like Jockers and the rest of the digital humanities pantheon in no time, lounging in my leather recliner over my Manhattan, booking speaking engagements and laughing at the simpleton first-wavers who can’t even prove Moby Dick is about seafaring. I dream big, OK?

Computer science, too

The real power of CATMA is beyond the basics I’ve just walked you through, and thus beyond what I can cover in one Medium post. To get full functionality from CATMA, you’re going to need to start working with tags and markup. If you have experience with this, I’d encourage you to dig in and see what you can do. If not, you can check out the fifty-page manual, but I found that to be written at a level just over my n00b digital humanist head. If you don’t know what a tagset that contains subtags in a tag hierarchy is, you’re going to have a bad time.

CATMA can do more interesting things if you have the digital humanities chops to use tags and markup for more thorough analysis, but if you just want to quickly generate word clouds and frequency lists, it’s probably overpowered for your needs. Even though storing documents online is nice, I’d recommend Voyant Tools for basic analytical applications.

Andrew Kulak

Virginia Tech PhD student. Adrift in a sea of rhetoric, video games, social media, and design. I enjoy reading, writing, baseball, anime, and a good cocktail