A Review of Topic Modeling Projects

Topic modeling is a technique that emerged from the statistics and information engineering communities. Researchers in the field of digital humanities have eagerly adopted this tool and applied it in some innovative ways. This post is a jumping-off point for a project I am working on to identify and catalog some of these projects (there are far too many for an exhaustive list) and to provide a critical review of this work.

My goal is to create a gallery of prior work that offers insight into the range of work currently being done and provides inspiration for how you might employ topic modeling in your own area of research. I try to adopt a critical perspective, highlighting areas where I think the work could be improved or where the research design has flaws. I do this somewhat hesitantly.

The creators of these projects have put a lot of thought into them. Several were blazing new trails for humanities research — trails that introduced many, myself included, to the opportunities and challenges of topic modeling. This work was done under the constraints of time and resources so often imposed by reality. Few things are easier than to sit back and critique creative work from afar. I don’t mean to do that, nor do I mean to detract from the excellent work that has been done, but rather to point the way toward new possibilities.

Below is a list of projects and software tools that I am in the process of reading through and writing detailed reviews of. Links to those reviews will follow in the months to come. For now, even this list is an evolving experiment as I work to track down various interesting projects. At the moment, it is missing recent work in the field. I’ll be adding projects over the next few weeks, but wanted to get a partial list online as a work in progress sooner rather than waiting for perfection (we’d all be waiting a long time).

Topic Modeling Projects

Topic Modeling the PMLA

Andrew Goldstone and Ted Underwood have presented an LDA-based analysis of PMLA (Publications of the Modern Language Association). This project stands out as an example of using correlations to come to terms with how topics relate to each other, and of using a variety of techniques to look at how topic usage changes over time. Underwood has written extensively on topic modeling from a humanities perspective and has been one of the primary figures in the popularization of this technique within DH.
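To make the idea of correlating topics concrete, here is a minimal sketch of one way to do it: compute the Pearson correlation between each pair of topics across documents. This is not the authors' code or necessarily their method. The document-topic proportions below are made-up illustrative values, and the function names are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; report 0
    return cov / sqrt(var_x * var_y)


def topic_correlations(doc_topics):
    """Return a K x K matrix of correlations between topic columns.

    doc_topics is a list of rows, one per document, each holding that
    document's proportion for every topic.
    """
    k = len(doc_topics[0])
    columns = [[row[t] for row in doc_topics] for t in range(k)]
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


# Hypothetical proportions for four documents and three topics.
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
corr = topic_correlations(doc_topics)
```

Topics that tend to appear in the same documents get a positive score, and topics that crowd each other out get a negative one. Because the proportions in each document sum to one, some negative correlation is built in, so the scores are best read comparatively.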

Figurative Language

Lisa M. Rhody has applied topic modeling to the task of analyzing figurative language. This work is particularly significant because “poems exercise language in ways purposefully inverse to other forms of writing.” This stresses many of the core assumptions of the information engineering research agenda that motivated the development of LDA. This also serves as one of the key experiments that inform the conversation between Ted Underwood and Rhody about what the so-called “topics” really are (hint: not topics).

Mining the Dispatch

In Mining the Dispatch, Robert K. Nelson explores the social and political life of Richmond, Virginia using a full run of the Richmond Daily Dispatch from the eve of Abraham Lincoln’s election in 1860 to the evacuation of the city in 1865. This project is significant in part for its use of topic modeling to analyze change over time as well as for Nelson’s careful and detailed analysis of his findings.
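One simple way to look at change over time with a topic model is to average each topic's share across the documents published in a given period. The sketch below assumes you already have per-document topic proportions tagged with a year. The data and the function name are hypothetical, and this is not a description of how Mining the Dispatch was built.

```python
def mean_topic_share_by_year(docs):
    """Average topic proportions per year.

    docs is an iterable of (year, proportions) pairs, where proportions is a
    list with one value per topic. Returns {year: [mean share per topic]},
    with years in ascending order.
    """
    sums = {}
    counts = {}
    for year, props in docs:
        if year not in sums:
            sums[year] = [0.0] * len(props)
            counts[year] = 0
        for t, p in enumerate(props):
            sums[year][t] += p
        counts[year] += 1
    return {
        year: [total / counts[year] for total in sums[year]]
        for year in sorted(sums)
    }


# Hypothetical articles: (year, [share of topic 0, share of topic 1]).
docs = [
    (1860, [0.8, 0.2]),
    (1860, [0.6, 0.4]),
    (1861, [0.3, 0.7]),
]
trend = mean_topic_share_by_year(docs)
```

Plotting each topic's column of `trend` against the years gives the familiar rising-and-falling topic curves. Finer periods such as months work the same way if you swap the year for another grouping key.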

Martha Ballard’s Diary

In Topic Modeling Martha Ballard’s Diary, Cameron Blevins uses topic modeling to identify themes within Martha Ballard’s diary of 27 years. This project demonstrates the usefulness of topic modeling to help make sense of the thematic diversity of entries within a diary and to track thematic changes over time.

Comprehending Digital Humanities

Elijah Meeks integrates topic modeling and network analysis in Comprehending Digital Humanities. This work suggests a number of fascinating directions for visualizing and interpreting the results of topic modeling using network science tools. Jeff Drouin builds on Meeks’ technique in his Foray into Topic Modeling.

The Pennsylvania Gazette

David J. Newman and Sharon Block analyze 80,000 articles from the Pennsylvania Gazette in order to identify the topics covered by the newspaper and to describe how the prevalence of those topics changed over time. Beyond the scope of this specific project, they reflect on the value of topic modeling as a tool for historical research.

Texas Newspapers

As in the Pennsylvania Gazette project, Tze-I Yang, Andrew J. Torget and Rada Mihalcea use topic modeling to analyze historical newspapers, in this case from Texas, as a way to identify topics of potential interest to historians. This work may be of particular interest due to its use of noisy OCR data rather than clean, hand-edited texts.

Software and Tools

Woodchipper: Visualizing Austin and Byron

Travis Brown describes Woodchipper, a tool for analyzing groups of texts based on a combination of principal component analysis (PCA) and topic modeling. You can find the source code on GitHub.

Paper Machines

Paper Machines is a plugin for Zotero that supports a variety of information extraction tasks and data visualizations, including a stream graph for topic model data. It provides a relatively easy-to-use front end for MALLET that imports document collections from Zotero.