Topic modeling, or being spoilt for choice

Remi Ouazan Reboul
Stellia.ai
Aug 17, 2021
Each label is represented by a color and a large dot. Each small dot represents a question. The closer a small dot is to a large dot, the more related they are. Image by Author.

Author’s note: this article was inspired by a figure found in a paper by Ike Vayansky and Sathish A.P. Kumar (link in references).

Imagine you’ve just finished scraping a big database of documents for an NLP-related project, and now you want to group this gigantic sea of data by topic. If you know about topic modeling, a class of algorithms made just for this, it’s smooth sailing from here: just pick an algorithm and let it churn through your database. But it turns out that picking isn’t that easy.

Topic modeling is a popular subject, so there are many algorithms out there. And since every topic model is the child of a sharp mind, the bad news is that none of them is systematically better than the rest. Choosing a topic model is a complicated subject on which long articles have been written; this article was inspired by one such publication, which you can find in the references.

This article will explore how we can choose the topic modeling algorithm that will fit our dataset the best by relying on a few easy rules of thumb.

To illustrate the various rules, we’ll use a list of questions from various datasets, namely SQuAD, ELI5, IMDb, and HotpotQA. To get all of these into pandas DataFrames, a short loading snippet will do the trick.
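Something along these lines works, assuming the Hugging Face `datasets` library (the dataset identifiers, configs, and column names below are assumptions and may need adjusting for your setup):

```python
# Sketch: load several datasets with Hugging Face `datasets` and gather the
# text columns into one pandas DataFrame. Identifiers, configs, and column
# names are assumptions and may differ between dataset versions.
import pandas as pd
from datasets import load_dataset

squad = load_dataset("squad", split="train").to_pandas()
hotpot = load_dataset("hotpot_qa", "distractor", split="train").to_pandas()
imdb = load_dataset("imdb", split="train").to_pandas()
# ELI5 can be loaded the same way, with its own config and column names.

questions = pd.concat(
    [
        squad[["question"]].rename(columns={"question": "text"}).assign(source="SQuAD"),
        hotpot[["question"]].rename(columns={"question": "text"}).assign(source="HotpotQA"),
        imdb[["text"]].assign(source="IMDb"),
    ],
    ignore_index=True,
)
print(questions.sample(5))
```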

Summary

I. Rules of thumb for picking a topic model
II. Picking our algorithm
III. A quick tour of topic modeling algorithms
IV. Conclusion and references

I. Rules of thumb for picking a topic model

Rule 1: average number of words per document

Our datasets and their average word count. Image by Author.

This might be the most important factor: the average number of words per document (a document is the same thing as an entry in the database) is a recurrent concern in topic modeling. Usually, the more the better, which makes sense: if I asked you to pick a label for a Wikipedia article after showing you only a random sentence taken from it, you would probably fail.

The criterion is simple:

  • Is the average word count per entry greater than 50?

In our example, only the texts from the IMDb dataset have an average word count above 50. To get that number out of a list of entries and visualize it, just use a snippet that accounts for contractions.
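Here is a minimal sketch of such a snippet (an assumption, since the original code isn’t reproduced here); the regular expression counts contractions such as “don’t” as a single word:

```python
# Sketch: average word count per entry, counting contractions such as "don't"
# as a single word, plus a small bar chart. `corpora` is a toy stand-in for
# the real datasets loaded earlier.
import re
import matplotlib.pyplot as plt

def average_word_count(entries):
    # \w+(?:'\w+)? matches a word optionally followed by its contracted part,
    # so "don't" is counted once rather than twice.
    counts = [len(re.findall(r"\w+(?:'\w+)?", text)) for text in entries]
    return sum(counts) / len(counts)

corpora = {
    "SQuAD": ["When did Beyonce start becoming popular?"],
    "IMDb": ["I don't think I've ever seen a film quite like this one ..."],
}
averages = {name: average_word_count(texts) for name, texts in corpora.items()}

plt.bar(list(averages.keys()), list(averages.values()))
plt.axhline(50, linestyle="--", color="gray")  # the 50-word rule of thumb
plt.ylabel("Average words per entry")
plt.show()
```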

Rule 2: topic relationships

Sometimes, the topics you are looking for are completely independent: in SQuAD, for instance, all questions are grouped by topic, but those topics don’t relate to each other in a meaningful way. In other cases, such as the climate_fever dataset (on climate change) or the ag_news dataset (containing news spanning one year), topics are bound to be related.

And to be even more fine-grained, if the topics’ relationships do matter to you, what kind of relationship do you want to explore: correlation or change over time? All in all, that leaves us with three questions:

  • Are the topics’ relationships of interest?
  • Is this relationship temporal in nature?
  • Do we want to show a correlation?

For topic modeling tasks that heavily depend on topics’ relationships, some of you may want to look into Graph Topic Modeling (GTM), as recent work in this field has led to promising results.

Rule 3: metadata availability

In some cases, it’s easy enough to find metadata, such as information about the entries’ authors or keywords related to the various entries. For example, in the fake_news_english database, entries are articles with links, so one could easily add the author’s name or keywords related to each article to the data. In a nutshell:

  • Do we have access to the entries’ author or keywords related to the entry?

Even in other datasets with no sources, such as a list of SQuAD questions, one could use a Named Entity Recognition model to extract keywords out of the various questions: you might be able to make your own metadata.
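For instance, here is a minimal sketch with spaCy (an assumption; any NER model would do, and the model name is illustrative):

```python
# Sketch: using spaCy's named-entity recognizer to turn questions into
# keywords. Assumes the en_core_web_sm model has been downloaded beforehand
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_keywords(question):
    # Named entities (people, places, organizations, dates, ...) make
    # serviceable keywords when no real metadata is available.
    return [ent.text for ent in nlp(question).ents]

print(extract_keywords("When did Beyonce start becoming popular?"))
print(extract_keywords("In what country is Normandy located?"))
```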

Rule 4: personal concerns

Stats on our datasets. Image by Author.

In this section, we are not concerned with the dataset, but only with what you want and need out of your algorithm. To be more precise, we now need to answer two more questions (the last ones, I promise):

  • Is the computational cost a concern?
  • Do we want more than 60 topics?

The computational cost for some of these algorithms is high, and chances are you might need to run them on very large databases.

Also, the number of topics you expect in the end involves an important trade-off between precision and complexity: the more topics you have, the more likely you are to tailor them to the documents, but you also risk ending up with lots of topics that each cover only a small number of documents.

II. Picking our algorithm

Now that we have answered all those questions, we’ll let our answers guide us through this flowchart, adapted from an article by Ike Vayansky and Sathish A.P. Kumar [1]; a pre-print version is available on ResearchGate.

Topic model flowchart. Image by Author.

All of these topic models share something paramount: you need to pick the number of topics you want to get, which is a double-edged sword. If you want a topic model that doesn’t require this, they do exist, mostly derived from clustering algorithms such as k-means or DBSCAN.

Before we finish, we’ll take a stroll through the different algorithms present in this chart, because it’s very likely the one you picked is a stranger. And even if it isn’t, it’s still useful to know your way around the neighborhood.

III. A quick tour of topic modeling algorithms

Latent Dirichlet Allocation (LDA)

Paper: Latent Dirichlet Allocation [2]

The Dirichlet distribution, used in the LDA algorithm. Image by Author.

The LDA algorithm considers that each document is generated in the following manner:

  • We choose a list of topics shared by all documents, where a topic is a distribution over words
  • For each document, we choose a set of topics and their importance in that document
  • We fill each document with words drawn at random according to its topics and their importance
  • We throw away the topic list and keep only the documents

Of course, documents aren’t actually written this way, but that assumption is the main idea behind LDA, and when you think about it, it’s not that far from the truth.

Now, since LDA can’t observe the list of topics or their importance in each document, it does the next best thing: it tries to replicate the process.
The algorithm creates topics (distributions over words) and sees what would happen if it generated documents with various topic importances. Through optimization, it then tries to replicate each document’s word distribution as best as it can.
It does that for every document, and once LDA is done, we’re left with a list of N topics and, for each document, their importance in that document. N is chosen before the algorithm’s execution.
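As an illustration, here is a minimal sketch (not the author’s code) of LDA fitted with scikit-learn on a toy corpus, printing each topic’s top words; the documents and parameter values are purely illustrative:

```python
# Sketch: fitting LDA with scikit-learn on a toy corpus and printing each
# topic's top words. All data and parameter values here are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the actor gave a brilliant performance in this film",
    "a boring movie with terrible acting and a weak plot",
    "the team scored in the final minutes of the match",
    "an injury forced the striker to leave the game early",
    "the director's latest film is a beautiful love story",
    "fans celebrated the championship win all night long",
]

# Bag-of-words counts, dropping English stop words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# N, the number of topics, must be chosen before running the algorithm.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # per-document topic importance

words = vectorizer.get_feature_names_out()
for topic_id, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {topic_id}: {', '.join(top_words)}")
```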

LDA is perfect for modeling databases like IMDb, where entries are long and topics are unrelated.

A good thing to keep in mind is that LDA is probably the most commonly used topic modeling algorithm, but not for the reasons one might think. LDA performs well as soon as the text is long enough, which is why someone who doesn’t know much about their database, or about topic modeling, will go straight for LDA, missing algorithms that would better fit their dataset. But you now know better.

Johann Peter Gustav Lejeune Dirichlet (Wikimedia Commons)

Topics over time (TOT)

Paper: Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends [3]

TOT is an LDA-related model: it is based on the same principle as LDA but also takes into account the date at which each document was written. It makes no assumptions about how the documents’ dates relate to one another, meaning that TOT can handle complicated time patterns such as gaps and jumps in time. If you have long documents with non-linear time relations, this should do the trick.

Dynamic topic models

Paper: Dynamic Topic Models [4]

Dynamic topic models take this assumption a step further: documents still come with their time of writing, but they are then grouped into time slices, and the topics in a slice are assumed to be influenced by the topics in the previous slice. It should be the go-to algorithm for long documents spread linearly over time, like newspapers.
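As a rough illustration (an assumption, not from the original article), gensim ships an implementation of dynamic topic models, LdaSeqModel, which can be sketched like this on toy data:

```python
# Sketch: a dynamic topic model with gensim's LdaSeqModel. Documents are
# assumed to be ordered chronologically; `time_slice` gives the number of
# documents in each slice. Toy data, illustrative parameters.
from gensim.corpora import Dictionary
from gensim.models.ldaseqmodel import LdaSeqModel

docs = [
    ["election", "vote", "campaign", "debate"],
    ["vote", "ballot", "turnout", "result"],
    ["market", "stocks", "rally", "earnings"],
    ["stocks", "crash", "market", "panic"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Two time slices of two documents each.
model = LdaSeqModel(corpus=corpus, id2word=dictionary,
                    time_slice=[2, 2], num_topics=2)
print(model.print_topics(time=0))  # topics as estimated in the first slice
```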

Author or keyword aggregation LDA

Paper: Aggregated topic models for increasing social media topic coherence [5]

This algorithm allows LDA to perform well on small documents by aggregating them by author or keyword. It was introduced in a paper whose database came from social media, and looking at Twitter’s heavily hashtagged short tweets, one understands why. Keep in mind that this algorithm is the way to go for small documents with related topics, when you can afford metadata like keywords.
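The aggregation step itself is simple; here is a minimal sketch on hypothetical data (an assumption, not the paper’s code), after which a standard LDA can be run on the aggregated pseudo-documents:

```python
# Sketch: aggregating short documents that share a keyword into longer
# pseudo-documents, which can then be fed to a standard LDA. Toy data.
import pandas as pd

df = pd.DataFrame({
    "text": [
        "great goal in the last minute",
        "the market dipped again today",
        "transfer window rumours everywhere",
        "central bank holds interest rates",
    ],
    "keyword": ["sports", "finance", "sports", "finance"],
})

# One aggregated document per keyword; run LDA on `aggregated` instead of `df`.
aggregated = df.groupby("keyword")["text"].apply(" ".join).reset_index()
print(aggregated)
```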

Correlated topic model (CTM)

Paper: Correlated Topic Models [6]

CTM is based on the same principles as LDA but links the different topic distributions together. How it does this is quite complex: in LDA, the D stands for Dirichlet, because the algorithm involves drawing from the Dirichlet distribution pictured above. In CTM, this distribution is replaced by a logistic normal distribution whose parameters capture the correlations between the topics in the corpus.

It’s used for long documents with correlated topics, like pachinko allocation models, but it typically works with fewer topics than PAM.

Pachinko allocation model (PAM)

Paper: Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations [7]

Photo by Emile Guillemot on Unsplash

PAM is a generalization of LDA, which we won’t try to explain here. Just remember that when you would reach for CTM but want to find a lot of topics (more than 60), you should call on PAM instead.

Mixture of unigrams

Paper: A Simple Topic Model (Mixture of Unigrams) [8]

This algorithm is based on the same principle as LDA but is simpler, and it largely outperforms LDA on short texts. However, it is not meant to capture relationships between topics, and it will be left trailing behind other algorithms if you can leverage inter-topic relations. A perfect example of a dataset where a mixture of unigrams shines is the list of questions we talked about at the beginning, or any QA dataset.

Self aggregated topic model (SATM)

Paper: Short and Sparse Text Topic Modeling via Self-Aggregation [9]

What sets this model apart from LDA is the assumption that not only are documents generated randomly from a hidden list of topics, but also that the documents in your dataset are fragments of longer, hidden documents. Since this model is used for short documents related to one another, this assumption makes sense. Be aware, though, that this algorithm requires copious amounts of optimization before it gets the job done, so if that’s a deal-breaker, skip to the next one.

Sparse Pseudo-document topic model (PTM)

Paper: Topic Modeling of Short Texts: A Pseudo-Document View [10]

PTM is SATM’s cousin: it shares the same founding principle and will be of aid when you want to call on SATM but can’t afford it. I won’t expand on it too much, but I’ll leave you with the paper above.

IV. Conclusion

Each label is represented by a color and a large dot. Each small dot represents a question. The closer a small dot is to a large dot, the more related they are. Image by Author

You now have the know-how to choose your topic-modeling algorithm and a bit of knowledge on what’s out there. It is now time to pick (or guess) the number of topics you want, run the model and wait for the job to be over.

Credits

Thanks to Ha Quang Le, Fanbo Meng, Samy Lahbabi, Nicolas Rennert and the rest of the staff at ProfessorBob.ai who helped me during the writing of this article.

References

[1] Ike Vayansky and Sathish A.P. Kumar, A review of topic modeling methods (2020), Information Systems Volume 94

[2] David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation (2003), Journal of Machine Learning Research

[3] Xuerui Wang and Andrew McCallum, Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends (2006), Department of Computer Science - University of Massachusetts

[4] David M. Blei and John D. Lafferty, Dynamic Topic Models (2006), Proceedings of the 23rd International Conference on Machine Learning

[5] Stuart J. Blair, Yaxin Bi and Maurice D. Mulvenna, Aggregated topic models for increasing social media topic coherence (2019), Springer

[6] David M. Blei and John D. Lafferty, Correlated Topic Models (2007), The Annals of Applied Statistics

[7] Wei Li and Andrew McCallum, Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations (2006), Proceedings of the 23rd International Conference on Machine Learning

[8] Allan B. Riddell, A Simple Topic Model (Mixture of Unigrams) (2012), Distant Readings: Topologies of German Culture in the Long Nineteenth Century

[9] Xiaojun Quan, Chunyu Kit, Yong Ge and Sinno Jialin Pan, Short and Sparse Text Topic Modeling via Self-Aggregation (2015), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence

[10] Yuan Zuo, Junjie Wu, Hui Zhang, Hao Lin, Fei Wang, Ke Xu and Hui Xiong, Topic Modeling of Short Texts: A Pseudo-Document View (2016), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
