I am co-leading a research group on text mining for literary analysis at Vanderbilt University. Like others around the world, we pivoted in March 2020 from meeting in person to holding instruction online. This story is about how we adopted better methods of teaching text mining following that abrupt transition.
During the fall and spring semesters this year, a group of colleagues and I have run a noncredit research seminar exploring how best to teach the fundamentals of large-scale text mining to undergraduates, particularly students from disciplines other than computer science.
We are members of a research group, the Computational Thinking and Learning Initiative, that is investigating methodologies and best practices for teaching computational thinking across the curriculum. Our goal in the seminar is to enable students to get started with text mining by lowering barriers to entry while still making it possible for them to undertake genuine research.
The seminar meets once a week for an hour during the academic year. Our field of inquiry is literary studies and our primary data set contains roughly four million periodical articles published from the seventeenth to the twentieth centuries. The leaders of the seminar include Mark Schoenfield, Professor of English in the College of Arts & Science; Corey Brady, Assistant Professor of Mathematics Education at Peabody College; Brian Broll, Research Scientist at Vanderbilt University; and Jim Duran, Director of the Vanderbilt Television News Archive and Curator of Born-Digital Collections. Roughly a dozen students and staff members take part, though attendance varies from week to week.
During the weekly meetings, we alternate between plenary lectures and breakout groups. Students join one of three subgroups. The first subgroup is learning to query an XML-encoded corpus using the XQuery programming language and BaseX, a native XML database. The second is exploring large-scale text mining with Apache Spark, a framework for distributed data processing, and Spark NLP, John Snow Lab’s natural language processing library for Spark. The third subgroup is seeking ways to broaden access to these technologies through a block-based computing platform called NetsBlox, which makes it easy for students to integrate data from external sources into their programs.
The research aim of teaching these heterogeneous technologies is twofold. First, we want to compare searching textual corpora for suspected patterns with using unsupervised machine learning to discover patterns. To what extent do these approaches provide similar results and to what extent do they differ? If they differ, do they provide complementary or dissonant sets of information? Second, we also want to make both modalities of exploration — searching and machine learning — accessible to students who have no prior experience with text mining. How can we lower barriers to entry while still maintaining high ceilings for research and discovery?
Before the Pivot
During the first half of the spring semester, we met in person at the Center for Digital Humanities at Vanderbilt University. The Center provides a welcoming and congenial space at the heart of the university campus. As in any high-tech classroom, students gathered around a long table and faculty presented from three large screens at one end. When we broke into small groups, a productive cacophony arose, with technical phrases floating from one group to the next. It could be a little distracting at times, but it mostly worked.
From my perspective, the seminars flowed nicely, though we regularly ran short on time for conversation and discussion. Given the range of theoretical concepts and software tools, our instructional aspirations were ambitious. We expected the students to spend three to four hours during the week walking through our class exercises and preparing variations to share at our Friday meetings.
Here is how we were teaching these technologies before moving online. The Spark subgroup worked primarily from the command line. A major hurdle at the beginning of the semester was to install Spark on students’ computers. For students with Macs, the installation process was easy. For students with Windows, not so much. Configuring Java and setting paths gave everybody headaches.
Teaching from the command line was also less than ideal. I typically projected my terminal on screen and students dutifully followed along on their systems. After the class sessions, I shared the code from my terminal sessions on a private gist for the students to adapt and reuse.
The XQuery subgroup had a better environment. While students had to install BaseX on their systems, the process was easier. BaseX comes prepackaged with an attractive graphic user interface for writing queries and displaying the results.
The chief challenge for this subgroup was sharing data. The students had to clone the data to their systems, creating parallel versions of the database. For small data sets, the task was easy to accomplish. But for larger data sets, it became difficult as systems differed in processing power and memory. Given that the data was proprietary and came with restrictions about sharing, the group opted to have students write queries using a small subset of the data with the intention of porting successful queries over to a larger database.
The NetsBlox subgroup had it easiest, at least from the student point of view. NetsBlox is a browser-based application, meaning there are no setup requirements. You just need Google Chrome (the preferred browser) and you are good to go. The purpose of NetsBlox is to teach programming in a visually appealing way that minimizes the amount of syntax students must learn to become effective. The drag-and-drop interface allows students to retrieve data, process them, and create visual displays without having to learn how to make API calls, parse JSON, and configure visualization libraries. (For more about the philosophy and design of NetsBlox, see A Visual Programming Environment for Learning Distributed Programming.)
The focus of this group was to experiment, on a small scale, with natural language processing (that is, applying machine learning algorithms to provide “interpretations” of text) in order to learn its applicability as well as its limitations. Students wrote short programs that integrated textual analysis from a commercial natural language processing provider, ParallelDots. For instance, one exercise had students add a profanity filter to a client/server program to screen out messages labeled as vulgar. The students also encountered the limitations of applying natural language processing to historical collections of text, where the meanings of words may have shifted over time.
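To give a sense of how such a filter works, here is a minimal Python sketch. The endpoint name and the JSON response shape are illustrative assumptions, not ParallelDots’ documented interface; the filtering decision itself runs entirely offline against a mocked response.

```python
import json
from urllib import parse, request

# Endpoint name is an illustrative assumption, not the documented API.
API_URL = "https://apis.paralleldots.com/v4/abuse"

def classify_text(api_key, text):
    """POST a message to the abuse-detection endpoint and return parsed JSON."""
    data = parse.urlencode({"api_key": api_key, "text": text}).encode()
    with request.urlopen(request.Request(API_URL, data=data)) as resp:
        return json.load(resp)

def is_profane(response, threshold=0.5):
    """Decide whether to screen out a message. Assumes a response shape
    like {"abuse": 0.91}, where the value is the probability of abuse."""
    return response.get("abuse", 0.0) >= threshold

# Offline example with a mocked response (no network call is made):
mock = {"abuse": 0.91}
print(is_profane(mock))  # prints True: the message would be screened out
```

Keeping the filtering decision in a separate pure function (`is_profane`) means students can experiment with thresholds without repeatedly calling the paid API.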
Making the Pivot
On March 9th, the Chancellor of Vanderbilt University announced that the university was suspending in-person instruction and would be moving to distance learning the following week. Like other instructors around the country, we scrambled to move our seminar online. We created a sandbox course in Brightspace, our learning management system (LMS), with the help of our Center for Teaching and then connected it with the university’s newly-licensed Zoom account. We resumed teaching on Friday, March 20th.
Creating this basic framework for teaching online was straightforward, though it did not go off without hitches. I forgot to add one of our students to the roster for the course, meaning that she did not get the messages I started to send through the LMS. Some also experienced problems logging into Zoom when I tightened the permissions to avoid the growing incidence of so-called Zoombombing. But we managed to recreate a digital simulacrum of our physical meetings without great difficulty.
However, our teaching styles needed to evolve as well. We had relied primarily on physical cues for gauging how students were managing the technological assignments, such as asking students to raise their hands when they completed tasks or walking behind them to observe the status of their programs. In this new environment, we lost that sense of connection.
Our students were experiencing major disruptions in their personal lives. We owed it to them to minimize the setup costs of the technologies they were learning. We also wanted to provide a level playing field for students in other time zones. We did not want students who had returned to other continents to have to wake up in the middle of the night to attend synchronous meetings. But just providing access to recordings of those sessions seemed unfair. How could we change the way we were teaching to make our approach more accessible and equitable?
Our Emerging Strategy
We introduced three changes during the next weeks that carried us toward those goals. First, we switched from relying on the command line to using notebooks for teaching Spark. Second, we adopted a new tool for writing and sharing our queries to BaseX. Third, we built integrations in NetsBlox to tie our three subgroups together. These changes did not solve every problem we faced when pivoting online, but they definitely helped and, in certain respects, they improved our teaching methods.
Switching from the command line to notebooks seems in retrospect like a no-brainer. For anyone who has not experimented with them, code notebooks like Apache Zeppelin and Project Jupyter provide frameworks for iteratively and interactively walking through blocks, or “cells,” of code, allowing you to see how each cell contributes to the final computation. Notebooks typically offer two kinds of cells: code cells and text cells. If written effectively, a notebook resembles a scientific article that intersperses prose and computation.
We went with Google Colab to host our notebooks. While open source projects like Anaconda make setting up the environment to serve notebooks locally relatively painless, we wanted to avoid further issues with desktop support for students. Google Colab lets you get started with notebooks without setup costs. Google dynamically allocates the computing resources necessary to run code in the notebooks. The free tier has limitations, but our students never ran into them. Given that students almost universally have Google accounts, they could sign into Colab and get started immediately. If you want to share data privately, you can create a shared folder on Google Drive and connect Colab to your Drive.
Setting up Spark to run on Google Colab was also straightforward thanks to John Snow Labs, which provides a set of notebooks for Colab that demonstrate the use of its natural language processing libraries. I adapted these notebooks for our data and shared them with the participants in our working group. While running Spark on Colab limits its effectiveness for large-scale data processing, it provided an ideal teaching environment for the sample queries we wanted our students to write.
The problem of providing shared access to BaseX had led to students huddling around a laptop computer, trying to follow along as one person executed queries on the server. Not ideal. Students could experiment with queries using BaseX on their own systems but, without access to the large data set on the server, they gained little insight into the scalability of their queries. As any database administrator knows, queries that run efficiently on small data sets may break down with larger data sets. Before our pivot, students worked on query expressions during the week and then tested them on our larger data sets during our in-person meetings. Discovering that a query which runs fine on your machine fails to make efficient use of indexes is frustrating.
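The gap between small and large data sets comes down largely to index use. A toy Python sketch (an analogy only, not BaseX itself) makes the difference concrete: a lookup that scans every record behaves tolerably on a sample but degrades linearly with corpus size, while an indexed lookup stays fast.

```python
# Toy illustration of a full scan vs. an index lookup (not BaseX itself).
records = [{"id": i, "title": f"Article {i}"} for i in range(100_000)]

def scan_lookup(key):
    # Linear scan: touches records one by one until a match, like a query
    # that cannot use an index. Fine for 100 records, painful for millions.
    return next(r for r in records if r["id"] == key)

# Building an index once makes every later lookup effectively constant time,
# which is what a database index does for a well-formed query.
index = {r["id"]: r for r in records}

def indexed_lookup(key):
    return index[key]

assert scan_lookup(99_999) == indexed_lookup(99_999)
```

On a laptop both calls return quickly; the point is that only the scan’s cost grows with the size of the corpus, which is exactly what students could not see while testing queries against a small local subset.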
How to improve that anti-pattern as we transitioned to working remotely? We decided to leave behind the graphic user interface of BaseX and switch to the REST API. The REST API defines an interface for passing queries over HTTP to the database. The number of acronyms in the previous sentences makes the process sound complicated, but given the ubiquity of REST APIs, there are solid tools that make them easier to use.
We adopted Postman as our tool of choice. Postman provides a full suite of services for developing, testing, and interacting with web-based APIs. Postman offers a graphic user interface for writing REST requests, so students did not need to figure out yet another command line utility (like curl). Postman also provides a way to form teams, allowing students and faculty to collaborate on queries. Most importantly, everyone can send queries to the database whenever they want to test their expressions against a big data set. This ability led to the discovery that we needed to fine-tune several queries to make them performant on big data sets.
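Postman was our shared tool, but the same requests can be scripted. Here is a minimal Python sketch of posting an XQuery expression to a BaseX REST endpoint; the server URL, database name, and sample query are placeholders, while the XML envelope follows the wrapper BaseX’s REST interface expects for POSTed queries.

```python
from xml.sax.saxutils import escape

def basex_query_body(xquery: str) -> str:
    # BaseX's REST API accepts POSTed queries wrapped in this XML envelope;
    # escape() keeps XML-significant characters in the query from breaking it.
    return ('<query xmlns="http://basex.org/rest">'
            f'<text>{escape(xquery)}</text>'
            '</query>')

# A sample query; the element name is a placeholder for our corpus schema.
body = basex_query_body('count(//article[contains(., "steam engine")])')
print(body)

# Sending it (server URL, database name, and credentials are placeholders):
# from urllib import request
# req = request.Request("http://localhost:8984/rest/periodicals",
#                       data=body.encode(), method="POST",
#                       headers={"Content-Type": "application/xml"})
# with request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Because the request is just HTTP, the same body that works in a script also works in Postman, which is what let the whole group test expressions against the big database at any time.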
If the shift for the NetsBlox subgroup was less dramatic, credit goes to the robust interface that NetsBlox provides for distributed projects. NetsBlox features built-in functionality for collaboration, using the metaphors of rooms and roles. Instructors can create projects in NetsBlox consisting of one or more roles and then invite students into those roles, allowing instructors and students to edit the code side by side. For instance, students might collaborate on a distributed Tic Tac Toe game, one playing in the role of “X” and the other in the role of “O.” During a previous semester, Brian Broll co-taught a couple sessions of my offering of the Beauty and Joy of Computing course. From his home in Minnesota, Brian interacted seamlessly via NetsBlox with students in my classroom in Nashville.
The primary task of the NetsBlox subgroup during the pivot was to design new activities for students to explore the fundamentals of text mining while also providing them with access to actual data sets. Toward that end, Brian Broll created a service to integrate NetsBlox with our BaseX server, allowing students to use custom blocks (with XQuery under the hood) to request data from BaseX. With this service in place, students could retrieve a few sample documents from our corpus and send them in turn to ParallelDots for analysis of their sentiment.
The results of this experiment were mixed and, in some cases, outright nonsensical, as might be anticipated when using a commercial natural language processing service, trained on contemporary texts, to analyze Georgian literature. But, as we agreed in retrospect, learning that lesson itself provided insight into the limitations of off-the-shelf models for natural language processing.
Our project to make text mining more accessible to students across a variety of disciplines will continue this summer and again during the fall semester. In addition to the projects described above, we have bigger ambitions in mind. We hope, for instance, to connect Spark with NetsBlox, allowing students to kick off big data queries from within a block-based environment. We are also looking at ways to move our notebooks from Google Colab to Spark clusters by using services like Zeppelin on Amazon Elastic MapReduce or Databricks. And we want to integrate tools like Rumble to bring our XQuery and Spark subgroups into closer alignment.
Beyond these technical goals, we aim to consolidate what we learned from moving our working group online. While we had to move in a rush due to the pace of events, we managed to put together solid building blocks for our next semesters. Indeed, the set of tools that we turned to when moving online will remain in our teaching arsenal whether we meet in person, online, or in some hybrid during the fall. In that sense, we are ready for the so-called HyFlex model of instruction that has attracted widespread attention recently. By making it possible for students to contribute effectively to our project without having to gather in person or in a Zoom classroom, we hope to foster independence and experimentation, and to provide better training for the projects they may undertake in the classrooms and workplaces of the future.
Thanks to the Computational Thinking and Learning Initiative, the Vanderbilt University Library, the Center for Digital Humanities, and the Mellon Partners for Humanities Education for supporting the Text Mining Working Group at Vanderbilt University. Many thanks also to Brian Broll and Sarah Burriss for helpful editorial suggestions.