Finding the Missing Departments

Topic Modeling in the Harvard Course Catalog

After his 2017 graduation from Brown University, Berkman-Klein Center intern Dash Elhauge spent the summer at metaLAB, working on Curricle, our curriculum-mapping project. In our inaugural metaLAB Medium post, Dash discusses experimenting with topic modeling, a text-mining method, to discover “missing departments” in the Harvard curriculum. Dash’s post is the first of a series, written by lab members, affiliates, and friends, that will explore Curricle and other metaLAB projects. —Matthew Battles

Since Harvard’s inception, students have discovered courses roughly the same way. They open up the index of a catalog, flip to a department that sounds cool, and start combing through the contents.

This picture features the hand of a real Harvard student.

While this isn’t a poor method of finding courses, and has certainly worked for centuries, it doesn’t lead to a whole lot of radical course discovery. Students tend to flip to departments they’re already familiar with. But what if there’s an interesting class in a department they’d never expect? Would students still be able to find it?

This, in part, is the question posed by metaLAB’s Curricle project. Curricle hopes to construct an entirely new online course catalog system for Harvard students that encourages educational exploration, empowering students to find their own path through Harvard.

Course catalogs are an ideal format for fostering academic discovery because they are, in many ways, the atrium of a student’s academic career. It is there that they flip through and choose the ideas and questions that will occupy their lives during their time at Harvard, there that they find the communities which they will look to for guidance and support.

So how do we stop students from narrowing their focus too early? How do we encourage them to branch out to other departments, rather than staying cloistered in the ones they find most comfortable?

A few months ago, Curricle got a large data dump from Student Information Services. This dump contained information about courses going all the way back to the 20th century, including when they were held, the professors who taught them, the departments under which they were classified, and most importantly to Curricle, their descriptions.

Course descriptions are of particular interest when trying to encourage students to branch out because they can be so broad in scope. For instance, check out this description of a course called “Computational Models of Discourse” offered by Barbara Grosz of the Computer Science Department in 1990:

Computational theories of discourse (text and dialogue) structure and processing. Topics include: anaphora, focusing, plans and speech acts, plan recognition algorithms, representation and use of world knowledge and reasoning for interpreting and generating discourse. Discussion of dialogue and story understanding systems. Involves exercises and experiments with computer programs. Computer Science 180 or permission of instructor.

Some discussion of algorithms and processing seems expected from a computer science course — but anaphora? Speech acts? This class is touching on topics typically associated with rhetoric and literature, way outside the traditional bounds of computer science.

We can use the unusual topic spread of these descriptions to our advantage. Since certain descriptions hint at disciplinary boundaries beyond the traditional ones, they might be able to help us group the courses in novel ways that enable students to find courses in places they never would have looked. All we have to do is put all the course descriptions in a big digital vat, stir them together, and regroup them from scratch.

So how do we do this? Topic modeling.

Topic modeling is a type of text mining that attempts to extract topics from a large set of text data. Topics take the form of weighted groupings of words. For instance, here’s an example from the results of this project:

‘Live’ (14.4%), ‘Foundational’ (14%), ‘Ancient’ (13.6%), ‘Musical’ (10.8%)

We know that ‘live’ is the most characteristic word in this topic because it has the most weight, but how it connects to the other words is totally left to interpretation. Of course, that’s the fun part.
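To make the idea of a weighted grouping concrete, a topic can be held as a simple mapping from word to weight. Here is a minimal Python sketch using the four weights quoted above. The names `topic` and `top_words` are illustrative and not taken from the project's code.

```python
# One topic as a word -> weight mapping (weights in percent, from the example above).
topic = {"live": 14.4, "foundational": 14.0, "ancient": 13.6, "musical": 10.8}

def top_words(topic_weights, n=3):
    """Return the n heaviest words in a topic, heaviest first."""
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# The word with the largest weight is the most characteristic one.
most_characteristic = top_words(topic, n=1)[0]
```

Nothing in this structure says why the words belong together. It only records how heavily each one counts.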

To perform topic modeling we used an algorithm called Latent Dirichlet Allocation, implemented by the good folks at gensim. LDA is a form of topic modeling that looks for the co-occurrence of words within blocks of text and organizes them into topics in a pseudo-random manner, meaning that different runs of LDA can extract different topics from the same text data.

This may not sound immediately appealing — don’t we want the “true” topics to be extracted, instead of random gobbledy-gook? LDA proceeds under the assumption that there’s simply no such thing as true topics — since there are a myriad of ways to define a large body of text, we can only do our best to find topics that are as representative of the data as possible. No set of topics is perfect. While there are some topic modeling algorithms available that always extract the same topics, like Latent Semantic Analysis, LDA consistently outperforms them.

To use LDA, we performed 3 steps:

  1. Parsing the data into the required format
  2. Running the LDA algorithm to generate topics
  3. Graphing the results

Parsing the Data

Parsing the data is the simplest step in the process. First we want to toss out any special characters ($%@) so that we’re left with only words. Then we want to toss out what are called “stopwords,” which are useless for topic extraction (words like “it”, “a”, “an”, “and”, “while”, “then”, etc.). Finally, since a lot of the descriptions in the data are placeholders like “NO DESCRIPTION AVAILABLE” or “Course created for student import,” we want to toss out any descriptions from the collection that are 10 words or fewer.

After all this, we are left with about 100,000 course descriptions, 1/3 of the original data dump.
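The three cleaning steps can be sketched in a few lines of Python. This is not the project's actual parser. The stopword list is a tiny stand-in, the regular expression keeps only the letters a to z (so digits and accented characters are dropped too), and checking the length before removing stopwords is an assumption.

```python
import re

# A tiny illustrative stopword list; a real run would use a much longer one.
STOPWORDS = {"it", "a", "an", "and", "while", "then", "the", "of", "in", "to", "for"}
MIN_WORDS = 10  # descriptions with this many words or fewer are dropped

def tokenize(description):
    """Lowercase, replace anything that is not a letter with a space, and split."""
    return re.sub(r"[^a-z\s]", " ", description.lower()).split()

def preprocess(descriptions):
    """Return one token list per description that survives the length filter."""
    cleaned = []
    for description in descriptions:
        words = tokenize(description)
        # Assumption: length is checked before stopwords are removed.
        if len(words) <= MIN_WORDS:
            continue
        cleaned.append([w for w in words if w not in STOPWORDS])
    return cleaned

raw = [
    "NO DESCRIPTION AVAILABLE",
    "Course created for student import",
    "Computational theories of discourse (text and dialogue) structure and processing. "
    "Topics include: anaphora, focusing, plans and speech acts.",
]
docs = preprocess(raw)
```

Only the third description survives here. The two placeholders are far too short to pass the filter.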

The LDA Algorithm

LDA frames topic extraction as an optimization problem. Basically, its goal is to create topics that maximize the likelihood that someone could reproduce a dataset using only the topics. The hope is that if this is possible, then the topics must be truly representative of the data.

This is easiest to visualize as a series of spinners:

An overview of the Latent Dirichlet Allocation process.

So, for instance: let’s say we’re trying to reproduce the description “An interdisciplinary study of epidemiology and literacy in the United States.” We begin by spinning a “Meta-Spinner” that has a listing of all the topics. Then we look at what topic the spinner lands on, and pick up that topic’s spinner. Then we spin that topic’s spinner and it lands on a word. We write the word down. Finally, we repeat this process over and over until we’ve (hopefully) generated the whole course description.
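The spinner game above translates almost directly into Python. The spinners below are made up for illustration, since real ones are learned from the data. One simplification is also worth flagging: full LDA gives every description its own meta-spinner, while this sketch follows the description above and uses a single one.

```python
import random

# Hypothetical spinners for illustration; real ones are learned from the data.
meta_spinner = {"health": 0.5, "reading": 0.5}  # topic -> slice size
topic_spinners = {
    "health": {"epidemiology": 0.6, "study": 0.4},  # word -> slice size
    "reading": {"literacy": 0.7, "interdisciplinary": 0.3},
}

def spin(spinner, rng):
    """Land on one label with probability proportional to its slice."""
    labels = list(spinner)
    weights = [spinner[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

def generate_description(n_words, rng):
    """Spin the meta-spinner, then that topic's spinner, once per word."""
    words = []
    for _ in range(n_words):
        topic = spin(meta_spinner, rng)
        words.append(spin(topic_spinners[topic], rng))
    return words

rng = random.Random(0)  # seeded only so the sketch is repeatable
generated = generate_description(6, rng)
```

Each call produces a jumble of words from the spinners' vocabularies. How often that jumble matches a real description is what LDA tries to improve.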

To make things a little easier, LDA makes what is called the “bag of words” assumption, meaning it doesn’t care about reproducing the words of the description in order. So if the spinners generate “An interdisciplinary United States: literacy and the study of epidemiology” it will consider this a roaring success. In the case of topic modeling, this seems fair. We’re only interested in how the nouns and verbs in the descriptions relate to one another — we don’t care if the syntax is off.
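One way to picture the bag-of-words assumption is as a word count that throws away position. This short Python sketch uses the standard library's `Counter`. The helper name `bag_of_words` and the reversed sentence are illustrative.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring order, case, and surrounding punctuation."""
    words = (w.strip(".,:;").lower() for w in text.split())
    return Counter(w for w in words if w)

target = "An interdisciplinary study of epidemiology and literacy in the United States."

# Reverse the word order: nonsense as a sentence, identical as a bag.
reordered = " ".join(reversed(target.split()))

same_bag = bag_of_words(target) == bag_of_words(reordered)
```

Because the two bags compare equal, a model that works on bags cannot tell the original sentence from its scrambled version.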

LDA aims to adjust the boundaries of the spinners to make them as likely to succeed as possible at description generation. If we’re doing this for “An interdisciplinary study of epidemiology and literacy in the United States” this isn’t very difficult. We want the spinners to favor the words “interdisciplinary”, “study”, “epidemiology”, “literacy”, “United” and “States.” We just let those words have bigger slices. Where things get challenging is when we’re trying to make these spinners work for 100,000 descriptions. Anytime we adjust one of the spinners to better produce one description, we run the risk of being less successful at producing another one. This is the balance that LDA has to strike.
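The trade-off can be put into numbers by scoring how likely a set of spinners is to produce a given description. The Python sketch below is a simplification: it uses one fixed meta-spinner rather than the per-description topic mixtures of full LDA, and the spinners and word lists are invented for the example.

```python
import math

def log_likelihood(words, meta_spinner, topic_spinners):
    """Sum of per-word log-probabilities; word order does not affect it."""
    total = 0.0
    for word in words:
        # Chance of this word: pick a topic, then pick the word from that topic.
        p = sum(
            meta_spinner[topic] * topic_spinners[topic].get(word, 0.0)
            for topic in meta_spinner
        )
        total += math.log(p)
    return total

meta = {"t1": 0.5, "t2": 0.5}
even = {"epidemiology": 0.25, "literacy": 0.25, "algebra": 0.25, "topology": 0.25}
balanced = {"t1": even, "t2": even}
# Give the first description's words bigger slices on one spinner.
tilted = {
    "t1": {"epidemiology": 0.4, "literacy": 0.4, "algebra": 0.1, "topology": 0.1},
    "t2": even,
}

doc_a = ["epidemiology", "literacy"]
doc_b = ["algebra", "topology"]

gain_a = log_likelihood(doc_a, meta, tilted) - log_likelihood(doc_a, meta, balanced)
gain_b = log_likelihood(doc_b, meta, tilted) - log_likelihood(doc_b, meta, balanced)
```

Tilting the spinner toward the first description makes `gain_a` positive and `gain_b` negative. That is the balancing act in miniature.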

So, it iterates through the course descriptions over and over again (30 times in this project) in an effort to strike a good balance. At the end, we’re left with a series of spinners that represent our topics.
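For readers who want to see the repeated passes in code, here is a toy, standard-library-only LDA fitted by collapsed Gibbs sampling. This is a stand-in and not what the project ran: the project used gensim, whose `LdaModel` relies on a different inference method (online variational Bayes). The `passes=30` default mirrors the 30 iterations mentioned above. The `alpha` and `beta` values and the four tiny documents are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, passes=30, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one word -> probability dict per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]  # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove this occurrence from the counts...
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...then resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Turn the final counts into normalized spinners.
    return [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta) for w in vocab}
        for t in range(n_topics)
    ]

docs = [
    ["algebra", "topology", "proof", "algebra"],
    ["topology", "proof", "geometry"],
    ["music", "composition", "opera"],
    ["opera", "music", "seminar", "composition"],
]
topics = lda_gibbs(docs, n_topics=2, passes=30)
```

Each returned dictionary is one spinner whose slices sum to 1. Changing `seed` can change which topics come out, which is the pseudo-random behavior described earlier.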

Graphing the Results

To visualize our spinners we generate a series of topic pie charts. To do this, we used a service called plotly.

The results are pretty fascinating. Some topics are exactly what you’d expect, like this one:

This is pretty indisputably the Department of Mathematics. It’s nice to see this topic in the results because it confirms that the topic modeling is working! But of course, this doesn’t bring us any closer to our goal of finding the missing department.

But topics also can offer surprising conjunctions. Take a look at this one:

What might this department be? The Department of Museum Innovation? The Department of Multimedia Archiving?

We can try and figure it out by looking at the courses that fall under this department, but the plot only thickens:

  • Intro to Algebraic Topology
  • Composition in the Electronic Medium
  • Plato
  • 18th-Century Music: Seminar
  • Literature and Film

This is just what we were looking for! A missing department. Some totally novel way of grouping courses that lets Harvard students think differently about their education.

But it’s just the beginning — metaLAB hopes to implement many of these new departments in the Curricle course catalog so that students can use them to carve out totally new paths during their time at Harvard.

How exactly that happens is still an open question. Should we display some massive visualization that shows how related all the departments are? Should we just slip the departments in the list with the traditional ones and see what happens? Can we use these new departments for some kind of recommendation engine? That we don’t know, but for now we’re having a blast combing through the results.

A full list of topics extracted from the course descriptions can be found here.