Using Guided Topic-Noise Models

Learn how to use GTM with social media data (2/3)

Rob Churchill
Georgetown Massive Data Institute
12 min read · Jan 26, 2023


Topic-noise models are a new type of topic model designed for short texts like social media data. They have been used to discover topics in large domain-specific data sets, like surveys about the impact of Covid-19 and tweets about presidential elections. In this article, we consider how to generate even better topics using topic-noise models when we have prior knowledge of the domain we are working in. We start by providing some intuition for how and why the Guided Topic-Noise Model (GTM) works, and then we explain how to use the package and review results.

Guided Topic-Noise Model (GTM)

In the first article in this series, we established that topic models take a set of documents and return a summary of those documents in the form of topics, which are lists of related words. As humans, we can interpret these topics to get a better understanding of the data set that we are working with. But what if we already have a reasonable understanding of the domain and want to improve topics by using that knowledge? We are going to walk through a real-life example to explain the problems that people with domain knowledge run into with unsupervised models, and show how we can use the Guided Topic-Noise Model (GTM) to address these issues.

Figure 1: Meet Pam, the social science researcher.

Meet Pam. Pam is an expert in her field, and is interested in studying conversation related to her field on social media. For this example, let’s assume that Pam studies elections, and is interested in the 2020 United States Presidential Election. Pam already knows that people talk about Mail-in Voting, Political Parties, and the Economy, among other things, when they converse about the election on social media. So, when she runs an unsupervised topic model on her data set, she expects to see those topics along with some others that she hadn’t thought of. Suppose the topics below are what the unsupervised topic model returns.

Figure 2: Pam sees the topics returned by an unsupervised topic model.

Unfortunately, when Pam looks at the topics, the Economy topic is nowhere to be found. On top of that, the Voting and Parties topics are not exactly what she expected. It seems that they were co-opted by related topics that might have had a stronger signal in the data. Additionally, the Candidate topic contains some information that belongs in the Parties topic. That doesn’t mean that the topics she cares about are not there. It just means that they were pushed out by higher-volume topics. The unsupervised model does, however, provide a couple of new topics that Pam wasn’t thinking of at first. One is about Taxes, and one is a long list of insults aimed at the candidates.

Note: A topic consisting of insults aimed at the candidates is one of the most common topics that we have found in election data.

Thanks to the Guided Topic-Noise Model, Pam can do better. Pam makes a list of seed topics, feeds them into GTM along with her data, and gets out a list of topics closer to what she was expecting.

Figure 3: Pam uses GTM to get topics more in line with what she, as a domain expert, was expecting.

Pam now sees expanded and more informative versions of her seed topics, as well as new topics that she might have otherwise missed. If she is happy with her topics, she is done. However, she can repeat this process, curating her list of seed topics in each iteration until she gets a full set of topics.

Figure 4: Pam curates a new, better set of seed topics.

After a few iterations of guiding the topic model in the right direction using her domain knowledge, Pam can finally be happy with her full set of topics for the domain.

Figure 5: Pam gets a better set of topics thanks to her own expertise and the flexibility of GTM.

This might sound a little complicated, but the whole process can be visualized as a loop in which the user and the computer interact in rounds. The user creates the initial seed topics and gives them to the model. The model gives back a topic set that contains topics trained with the seed topics, as well as topics found without the help of seeds. This allows the user to see her seed topics grow and to identify topics that she may have missed. It also informs her about the prevalence of her seed topics: if one of her topics is not returned, she knows its discussion is less coherent and less frequent than the topics that are.

Figure 6: The Guided Topic-Noise Model (GTM) Overview.

How the Guided Topic-Noise Model Works

GTM [Churchill and Singh, 2022 (5)] is similar to NLDA [Churchill and Singh, 2021 (2)]. Both combine a topic model component with the topic-noise model (TND) in an ensemble that filters noise out of topics. The difference between the two lies in the sampling algorithm of the underlying topic model.

Figure 7: Gibbs Sampling vs GPU Seed Word Sampling.

NLDA uses Latent Dirichlet Allocation (LDA) [Blei et al., 2003], which uses traditional Gibbs sampling to generate topics. In Gibbs sampling, an observed word is probabilistically placed in a topic according to the topics of its container (the document). GTM uses a modification of Gibbs sampling, called Generalized Polya Urn (GPU) seed word sampling, that makes adjustments based on the seed topics. With GPU seed word sampling, when a seed word is observed in a document, it is not placed in a topic probabilistically. Instead, it is placed in the topic corresponding to its seed topic with 100% probability, and with a weight corresponding to its rarity in the data set (rare seed words are oversampled). Figure 7 shows the difference between Gibbs sampling and GPU seed word sampling. As seed words are observed, this mechanism guides documents containing seed words toward the correct seed topics. The result is that GTM builds topics around the words closest to the original seed words in the context of the data.
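To make the mechanism concrete, here is a minimal sketch of the seed word step in Python. It is only an illustration of the idea described above, not the actual Mallet implementation; all of the variable names, and the representation of the oversampling weight as a simple multiplier, are assumptions made for readability.

```python
import random

def sample_topic(word, doc_topic_counts, word_topic_counts,
                 seed_word_to_topic, seed_word_weight):
    """Illustrative sketch of GPU seed word sampling (not the Mallet code)."""
    if word in seed_word_to_topic:
        # A seed word is not sampled probabilistically: it is assigned to its
        # seed topic, and rarer seed words carry a larger weight (oversampling).
        return seed_word_to_topic[word], seed_word_weight[word]
    # Non-seed words fall back to ordinary Gibbs sampling: draw a topic in
    # proportion to how well it fits both the document and the word.
    num_topics = len(doc_topic_counts)
    weights = [doc_topic_counts[t] * word_topic_counts[word][t]
               for t in range(num_topics)]
    return random.choices(range(num_topics), weights=weights)[0], 1.0
```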

Now that we have a high-level understanding of how GTM works, let’s put it to work!

Guided Topic-Noise Model in Action

Data Sets

Throughout this article, we have been looking at a data set of tweets about the 2020 United States Presidential Election. The results shown below are based on a data set collected between January 2020 and November 2020, using keywords related to the Election domain. The data set consists of 1.2 million tweets and was preprocessed using the following methods: tokenization, URL removal, punctuation removal, lowercasing, and stopword removal. For more details about preprocessing for social media, check out textPrep [Churchill and Singh, 2021].

Due to Twitter’s privacy policies, this data set is not publicly available. For your own experiments, we provide a small sample data set of 196 preprocessed tweets taken from the Election data set (with metadata removed to abide by Twitter’s policies), as well as a link to a larger public-domain data set on Kaggle that covers the 2020 election over a slightly different time period from our Election data set.

In the final section, Visualizing Seed Topics, we briefly highlight a data set of open-ended survey responses taken from a survey of parents on the challenges of their children’s schooling during the Covid-19 pandemic [Davis-Kean et al., 2022]. This data set consists of 2,700 responses from U.S. adults and serves to show GTM’s ability to find topics in different types of data.

Setting up

GTM, like the other topic-noise models, is implemented in Java (based on the Mallet [McCallum, 2002] implementation of LDA [Blei et al., 2003]). For simplicity, we built Python wrappers based on the old Mallet LDA wrapper from Gensim [Řehůřek and Sojka, 2010].

Note: gdtm works in Python 3.6 and up, but we have not tested it on older versions of Python. This tutorial assumes that you are using macOS or Linux.

Navigate to your working directory in your terminal, enable whichever virtual environment you plan on using, and pip install the gdtm package. You can find everything you need to know about gdtm in its documentation.

Once you have the Python package installed, you need the Mallet (Java) implementation of whichever topic-noise model you are going to use. You can find an implementation of TND and of the seeded topic model used in GTM in the Topic-Noise Models Source repository. Download the mallet-tnd and mallet-gtm folders from the repository and note their paths, wherever they end up on your computer (path/to/tnd, path/to/gtm).

Loading Data Sets

Data sets can be loaded in whatever way you find convenient, but the final data structure to be passed into the model should consist of a list of documents, where each document is itself a list of words.

You can load the sample data set using a built-in function from gdtm:

Figure 8: Loading data using the built-in function
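Figure 8 is a screenshot, so here is roughly what that call looks like in code. This sketch assumes the sample file lives at data/sample_tweets.csv and uses gdtm's load_flat_dataset helper; check the gdtm documentation if the function or its arguments have changed.

```python
from gdtm.helpers.common import load_flat_dataset

# Each row of the sample file is one tweet, with words separated by spaces,
# so the loaded data set is a list of documents, each a list of words.
dataset = load_flat_dataset('data/sample_tweets.csv', delimiter=' ')
```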

The sample data set is a space-delimited CSV file. If you make your own files of this type, it is important not to accidentally include extra spaces. As you can see in Figure 8, there is an argument for passing your preferred delimiter, so you can use whichever delimiter you like in your own data sets.

Note: Topic-noise models are best used on data sets of tens or hundreds of thousands of tweets or other social media posts. The training of the noise distribution is accomplished using a randomized algorithm, and with smaller data sets, GTM is not always able to get an accurate noise distribution, so don’t expect to see great results with the sample data set! If you want to play with a larger data set to see GTM’s true effect, we suggest using the Kaggle data set that we described above. The US Election 2020 Kaggle dataset contains 1.72 million tweets about the election between October 15, 2020 and November 8, 2020, collected using the Twitter API. It was released to the public domain, meaning you are free to use it for whatever purpose you wish. You will need to do some preprocessing and data wrangling before putting the larger data set into a model. We suggest starting with a subset of a couple hundred thousand tweets from the larger data set.
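If you do start from the raw Kaggle data, the wrangling might look roughly like the sketch below. The file name, the `tweet` column, and the tiny stopword list are all assumptions made for illustration; for real preprocessing, use a full stopword list or the textPrep pipeline mentioned above.

```python
import csv
import random
import re

# Minimal placeholder stopword list; use a full list (or textPrep) in practice.
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'for', 'on'}

def preprocess(text):
    text = text.lower()                           # lowercasing
    text = re.sub(r'https?://\S+', ' ', text)     # URL removal
    text = re.sub(r'[^a-z0-9\s]', ' ', text)      # punctuation removal
    return [w for w in text.split() if w not in STOPWORDS]  # tokenization + stopword removal

# Assumed file and column names for the Kaggle data set; adjust to match the actual CSV.
with open('hashtag_donaldtrump.csv', encoding='utf-8') as f:
    tweets = [row['tweet'] for row in csv.DictReader(f)]

random.seed(42)
subset = random.sample(tweets, k=min(200_000, len(tweets)))        # start with a subset
dataset = [doc for doc in (preprocess(t) for t in subset) if doc]  # drop empty documents
```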

Running Guided Topic-Noise Model (GTM)

Setting up seed topics

Before we can actually run GTM, we need some seed topics! Let’s use some of Pam’s seed topics (she’s an expert, after all). We need to store them in a CSV file so that the Java code can load them; let’s put them in a file called data/seed_topics.csv.

Figure 9: CSV containing seed topics. Each row is a seed topic.
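Figure 9 is an image of that file, so the snippet below writes an equivalent file from Python. The two seed topics shown are hypothetical stand-ins for Pam's Voting and Parties topics, and the comma delimiter is an assumption; match whatever format your setup expects.

```python
import csv

# Two hypothetical seed topics; each row of the CSV is one seed topic.
seed_topics = [
    ['vote', 'voting', 'mail', 'ballot', 'election'],
    ['party', 'republican', 'democrat', 'gop', 'dnc'],
]

with open('data/seed_topics.csv', 'w', newline='') as f:
    csv.writer(f).writerows(seed_topics)
```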

Running the model

Alright, now that we have our seed topics sorted out, we can finally get this show on the road. GTM is simple to run once we are set up. We pass in our data set, the paths to the Java model code, the seed topics file path, and any other parameters that we want.

Figure 10: Running GTM on our sample data set.
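Since Figure 10 is a screenshot, here is a sketch of the equivalent call, continuing from the loading snippet above. The keyword names mallet_tnd_path, mallet_gtm_path, and seed_topics_file are my reading of the wrapper's arguments and may differ from the released API (consult the gdtm documentation if the call fails); tnd_k, gtm_k, phi, and top_words are the parameters discussed next.

```python
from gdtm.models import GTM

# Paths to the downloaded Mallet implementations (see Setting up above).
tnd_path = 'path/to/tnd/'
gtm_path = 'path/to/gtm/'

model = GTM(dataset=dataset,
            mallet_tnd_path=tnd_path,
            mallet_gtm_path=gtm_path,
            seed_topics_file='data/seed_topics.csv',
            tnd_k=10, gtm_k=10,       # a few times the number of seed topics
            phi=10, top_words=20)     # noise filtering strength, words returned per topic
```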

Notice that we set tnd_k and gtm_k to be a few times larger than the number of seed topics that we have (in the case above, we set k to ten for two seed topics). This is to allow space for other topics to be found that might exist in our data. The phi parameter dictates how much noise we should remove in the noise filtering phase of GTM, and the top_words parameter tells GTM how many words to return per topic.

Interpreting the Results

In the previous article, we went over what to expect from the get_topics() and get_noise_distribution() functions. These functions work the same way here, with one caveat for get_topics(). Because we have seed topics in the mix, we need to know which topic corresponds to which seed topic. Thankfully, that is easy. If there are X seed topics, then the first X topics in the topic set correspond to the seed topics, in their original order. This makes it super easy to see how GTM has transformed and augmented the seed topics.
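In code, that ordering guarantee means you can split the returned topic set directly. A small sketch, continuing from the snippets above (the variable names are mine):

```python
topics = model.get_topics()
noise = model.get_noise_distribution()          # noise words, as covered in the previous article

num_seed_topics = len(seed_topics)              # two in our running example
seeded_topics = topics[:num_seed_topics]        # expanded versions of the seed topics, in order
unsupervised_topics = topics[num_seed_topics:]  # topics found without the help of seeds
```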

We can save the topics to a CSV easily using gdtm.

Figure 11: Saving topics is simple with gdtm.
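Figure 11 shows the gdtm helper for this. If you would rather not depend on it, a few lines of standard-library code do the same job (a stand-in, not the gdtm function itself):

```python
import csv

# One topic per row, one word per column.
with open('topics.csv', 'w', newline='') as f:
    csv.writer(f).writerows(topics)
```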

In the rest of this section, we will be looking at the results of using GTM on the 1.2 million tweet Election data set.

First, let’s look at the seed topics. As we can see, we have ten seed topics (each row is a topic) and five seed words per topic. You can have as many seed words and seed topics as you want, so long as you define them in your seed topics file. You also don’t have to have the same number of seed words in every topic; that just happens to be the case here.

Figure 12: Seed topics for 2020 Election data set.

Now, let’s see what the topics look like. As we can see, the first ten topics follow the ten seed topics very closely. There is a lot to take in here, but look around and decide what you think about these seeded topics for yourself.

Figure 13: Guided topics for 2020 Election data set.

We display fifteen topics in Figure 13, but there are only ten seed topics. The extra five topics are unsupervised topics generated alongside the seed topics. They are not always as coherent as the seed topics, but some might be valuable to consider in the final topic set. If we wanted to run another iteration of GTM, we might add a seed topic with some words from the fifteenth topic, to try to find a topic about Trump Supporters. We might also take words from the eleventh and twelfth topics to seed a topic about Political Party Conventions, or maybe even two topics, about the Republican National Convention and the Democratic National Convention.

Visualizing seed topics

Finally, we have created a simple and effective way of visualizing seed topics. You can access and duplicate the Guided Topic Model Visualization Template Google Sheet here. Copy and paste the transposed seed topics into the Seed Topics tab, and do the same with the final topic set in the Topics tab. The seed words present in the final topics should be highlighted with the color corresponding to their seed topic. This allows you to quickly and easily see how coherent the topics are, and whether the seed topics had a strong enough signal to stick together.
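Because the sheet expects topics transposed (words running down a column), a quick way to produce paste-ready text is something like the sketch below; the tab-separated output pastes cleanly into Google Sheets. The function name and layout are my own convenience choices.

```python
from itertools import zip_longest

def to_pasteable(topics):
    """Transpose topics (one list of words per topic) into tab-separated columns."""
    columns = zip_longest(*topics, fillvalue='')
    return '\n'.join('\t'.join(row) for row in columns)

print(to_pasteable(seed_topics))   # paste into the Seed Topics tab
print(to_pasteable(topics))        # paste into the Topics tab
```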

Figure 14: Example of seed topic visualization on Covid-19 surveys.

Figure 14 shows an example of this visualization. These topics are derived from open-ended survey responses taken from a survey of parents on the challenges of their children’s schooling during the Covid-19 pandemic [Davis-Kean et al., 2022]. You can see that the seed words group together well in most topics, meaning those seed topics have a strong signal in the data set. We can also see that the topics about After School Activities and Shutdowns did not have a strong signal. We would look more closely at the seed words we provided for those two topics, while drawing on the most probable words from the other topics to add to our successful seed topics.

In this article, we explained the value of incorporating a user’s domain knowledge into topic models, showed how the Guided Topic-Noise Model works, and demonstrated how anyone can run GTM in just a few lines of code. We also showed sample results and shared a great way to visualize them. In the final article of this series, we will explore dynamic topic-noise models and show how we can use them to track the evolution of topics over time.

This article was co-authored by Lisa Singh, Professor of Computer Science and Director of the Massive Data Institute at Georgetown University.

All images unless otherwise noted are by the author.

References

[1] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet Allocation (2003), Journal of Machine Learning Research 3, 993–1022.

[2] R. Churchill and L. Singh, Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections (2021), International Conference on Data Mining (ICDM), 71–80.

[3] A.K. McCallum, MALLET: A Machine Learning for Language Toolkit (2002), http://mallet.cs.umass.edu.

[4] R. Řehůřek and P. Sojka, Software Framework for Topic Modelling with Large Corpora (2010), LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.

[5] R. Churchill, L. Singh, R. Ryan, and P. Davis-Kean, A Guided Topic-Noise Model for Short Texts (2022), The Web Conference (WWW).

[6] P. Davis-Kean, R. Ryan, L. Singh, and Y. Wang, The “New” Normal for Schooling. MOSAIC Data Brief: August 2022, (2022), Measuring Online Social Attitudes and Information Collaborative.
