Rolling Your Own Code for Text Mining a Corpus of Academic Journals

Eric Kowalik
Digital Scholarship Lab @MarquetteRaynor
2 min read · Sep 8, 2021
Photo by Clément Hélardot on Unsplash

The growth of text digitization has enabled researchers to engage in “text mining”, a term with nearly as many definitions as it has applications. For example, it can mean searching a large corpus of text for particular keywords or phrases, or running sentiment analysis to systematically identify, extract, quantify, and study affective states and subjective information in the text.

While plug-and-play tools such as Voyant and LIWC (Linguistic Inquiry and Word Count) have made it easier for those without a technical background to engage in text mining, these tools are not a panacea, and sometimes researchers have to roll their own code to accomplish their goals.

This was the case in 2016, when a faculty member from the Department of Counselor Education and Counseling Psychology contacted the Lab to discuss a research project that involved a content analysis of a large body of journal articles, with the aim of making inferences about publication patterns related to social class and socioeconomic status in American Counseling Association (ACA) journals.

This content analysis project required a way to automate the text mining process so that the researchers could search a corpus of over 7,500 journal articles for more than 500 keywords identifying concepts related to social class and socioeconomic status. Experiments with Voyant showed that the keyword list was too long to run effectively as a single search.
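To give a sense of what searching for hundreds of keywords in one pass can look like, here is a minimal Python sketch. It is not the project's actual code: the function names, the keyword list, and the sample text are all hypothetical, and it assumes the article text has already been extracted into a plain string.

```python
import re
from collections import Counter


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word or phrase."""
    # Sort longest first so multi-word phrases are tried before shorter alternatives.
    ordered = sorted(set(keywords), key=len, reverse=True)
    alternation = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def count_keywords(text, pattern):
    """Return a Counter mapping each matched keyword (lowercased) to its number of hits."""
    return Counter(match.group(0).lower() for match in pattern.finditer(text))


# Hypothetical keywords and text, for illustration only.
keywords = ["social class", "socioeconomic status", "poverty", "working class"]
pattern = build_keyword_pattern(keywords)
sample = (
    "Poverty and social class shape access to care. "
    "Social class also interacts with socioeconomic status."
)
counts = count_keywords(sample, pattern)
print(counts)
```

Compiling the whole list into a single pattern means each article is scanned once rather than once per keyword. A real corpus would likely also need whitespace normalization first, since phrases in text extracted from PDFs are often split across line breaks.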

The Lab team eventually developed a process (detailed in this article) that the researchers could use to search a large corpus of counselling journals that included the following steps:

1. Work with publishers to access the content.

2. Use the CrossRef API to download PDF versions of the articles.

3. Convert PDF articles to XML.

4. Write Python scripts to search content and create reports.
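As an illustration of step 2, the sketch below shows one way to ask the CrossRef REST API for an article's metadata and pull out any full-text PDF links it lists. It is an assumption about how such a step could be written, not the Lab's actual script: the helper names `fetch_work` and `pdf_links`, the example DOI, and the example URLs are hypothetical. The `works` endpoint, the `mailto` parameter, and the `link` field with its `URL` and `content-type` keys are part of the public CrossRef API.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_work(doi, mailto):
    """Fetch CrossRef metadata for one DOI and return its 'message' object."""
    # The mailto parameter identifies the caller to CrossRef.
    url = CROSSREF_WORKS + urllib.parse.quote(doi) + "?mailto=" + urllib.parse.quote(mailto)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["message"]


def pdf_links(work):
    """Return the full-text URLs in a work record whose content type is PDF."""
    return [
        link["URL"]
        for link in work.get("link", [])
        if link.get("content-type") == "application/pdf"
    ]


# Hypothetical record, shaped like a CrossRef 'message' object (no network call needed).
example_work = {
    "DOI": "10.0000/example",
    "link": [
        {"URL": "https://publisher.example/article.pdf", "content-type": "application/pdf"},
        {"URL": "https://publisher.example/article.xml", "content-type": "application/xml"},
    ],
}
print(pdf_links(example_work))
```

In practice, `fetch_work` would be called once per DOI, and downloading the returned URLs generally depends on the access arranged with the publisher in step 1. Records that list no PDF-typed link simply yield an empty list.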

While technical skills were essential to the completion of this project, a slew of intangible skills also contributed to a successful outcome. A strong and cordial working relationship with the research team was just as important. Throughout the project, Lab and research team members were in continual communication to keep each other apprised of issues as they arose.

A key intangible skill for the library team was perseverance. Locating the appropriate contact at the journal publisher to discuss licensing issues was not straightforward. It often required starting with an e-mail or cold call to the publisher's or vendor’s general “contact us” link and working through the hierarchy until one reached the appropriate individual.

In the spirit of academic sharing and research transparency, the code used in this project is available in the project's GitHub repository, and we invite you to join us in the continuing evolution of this project.
