I got hit by Mallet

I doubt any of my classmates feel really good about their understanding of Mallet. If any do, I should have talked to them about it. It’s a topic modeling program that processes text to find commonalities among files. You might find that the same keywords were used by disparate authors, or in texts written in the same period. It could be used to study a changing lexicon across texts over time, among other things. It was a difficult program to use because of the command-line interface; I think I could have overcome that with more experience, but it still wasn’t easy, so I procrastinated on this one.

This is the code to run a topic model. Not long ago this would have looked like Greek to me!
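Roughly, it goes like this, based on the standard two-step MALLET workflow (the directory and file names here are placeholders I made up):

```
# Step 1: import a folder of plain-text files into MALLET's binary format,
# keeping word order and dropping common stopwords ("the", "and", etc.).
bin/mallet import-dir --input my_texts/ --output tutorial.mallet \
    --keep-sequence --remove-stopwords

# Step 2: train a topic model on the imported data and write out the two
# result files discussed below: the keyword list and the numerical list.
bin/mallet train-topics --input tutorial.mallet --num-topics 20 \
    --optimize-interval 20 \
    --output-topic-keys tutorial_keys.txt \
    --output-doc-topics tutorial_composition.txt
```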

The real challenge with Mallet is understanding the data output once the topic model has run. The output comes in two forms:

1. A plain-text document (the keys file) that shows lists of keywords divided into sets for analysis. Each set represents one topic: a cluster of words that tend to appear together across the documents. The cool thing about this is that you can read through the keywords and they almost read like a narrative, without the stopwords included.

2. The other product of Mallet is a numerical list (the composition file). It needs a bit of decoding to make sense. If I understand this correctly, the topic column refers to one of the topics on the keywords page, and the proportion is a decimal that weighs how much of a given document belongs to that topic. Higher is more, I get that, but beyond that I’m really not sure how to reconcile those numbers in my head (see the sketch after this list).
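To make that concrete, here is roughly what the composition file looks like. The file path and numbers below are invented for illustration, and the exact layout varies by MALLET version (older versions sort the topic/proportion pairs strongest-first; newer ones list one proportion per topic in topic order):

```
# Peek at the first lines of the composition file.
head -2 tutorial_composition.txt

# Hypothetical output (older-style format):
# #doc  name                      topic  proportion ...
# 0     file:/my_texts/doc1.txt   5      0.41   2   0.19   0   0.11 ...
```

Read this way, the first document is about 41% topic 5 and 19% topic 2, which at least gives me a handle on what “higher is more” means.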

There are two clear solutions to my confusion. First, more experience with the software, running known texts through it, would probably help me draw conclusions about what the output data shows. Second, I should have talked about it with my classmates. That may be a lost opportunity now, but there is still the opportunity to explore it more online by reworking the tutorials and reexamining the output.

It seems like the Voyant application does similar work on texts, and I’m sure it has more functionality than I’m proficient with. Perhaps the ease of the graphical interface and the advantage of a user-friendly data output make it a more valuable program for the novice DHer. I may sound especially green for saying it, but the GUI really makes a difference to the user experience. Can’t I apply the same logic here that I use when trying to convince my classmates to write for broad audiences? I understand the idea that scholars should raise the bar of complexity to search out new ways of interpreting data, but I’m split: I generally think we should simplify our discourse so that complex ideas are more digestible.

Or, perhaps I just need to keep practicing with the Mallet.

These are mallets
