An in-depth introduction to Topic Modeling using LDA and BERTopic

Konstantina Andronikou
Cmotions
12 min read · Nov 1, 2022

If you are interested in how text mining tools and Natural Language Processing have evolved, then you are in the right place! This article presents one of the most well-known NLP tasks: topic modeling. We provide a broad overview of topic modeling techniques as well as an in-depth discussion of two specific topic models: a traditional approach, Latent Dirichlet Allocation (LDA), and a deep learning method, BERTopic. If you would like to follow the implementation with further explanation, we also created blogs for both topic modeling approaches here and here!

Let us set the scene: imagine entering a bookstore to buy a cookbook and being unable to locate the part of the store where the book sits, because the bookstore has simply placed all types of books together. In this case, the importance of dividing the bookstore into distinct sections based on the type of book becomes apparent[1]. Topic modeling is the process of detecting themes in a text corpus, much like splitting a bookstore into sections depending on the content of the books. The main idea behind this task is to produce a concise summary highlighting the most common topics in a corpus of thousands of documents. A topic model takes a set of documents as input and generates a set of topics that accurately and coherently describe the content of those documents. It is one of the most widely used approaches for processing unstructured textual data, that is, data that is not organized in a pre-determined way. Over the past two decades, topic modeling has been applied successfully across the fields of Natural Language Processing (NLP) and Machine Learning, and in a wide range of domains. You might be wondering: what is the value of this task to a business? Well, let me tell you!

The rapid development of technology has produced new tools that improve many domains, such as Customer Experience (CX). As customer experience can be crucial to maintaining a successful business, many organizations have adopted these newly developed tools. Among the tools used for analyzing and improving customer experience, a significant share comes from the field of Natural Language Processing (NLP), topic modeling being one of them. Many types of data contain information that is meaningful for a company’s objectives, such as reviews, customer and agent conversations, tweets, and emails. To fully understand clients’ needs, companies must use a wide range of tools to analyze which characteristics impact customer experience and satisfaction. In business settings, topic modeling insights can improve a company’s strategy and assist in developing marketing platforms.

History of Topic Modeling

Topic modeling emerged in the 1980s from the field of generative probabilistic modeling. Generative probabilistic models use probabilities to solve tasks such as likelihood estimation, data modeling, and class distinction. Topic models continue to evolve in sync with technological developments, and new topic modeling techniques have been introduced regularly since the 1980s.

It is worth noting that not every topic modeling technique is suitable for every type of data (Churchill and Singh, 2021)[2]. For example, an algorithm that retrieves hidden topics well on social media data might not perform well on scientific articles, due to the different patterns of words. Characteristics of the data and the domain, such as document length and sparsity, must be considered before implementing a topic modeling algorithm (Churchill and Singh, 2021). Sparsity here comes in two kinds: model sparsity and data sparsity. Model sparsity means that there is a concise explanation for the effect we are aiming to model. Data sparsity, on the other hand, is harmful: information is missing and the model does not observe enough data in the corpus to model the language accurately. Data sparsity is a recurring issue in NLP because the field deals with very large vocabularies. Because of this variation across types of data, new topic modeling approaches keep being developed.

In this article we present two different methods for implementing a topic modeling task: a traditional model and a newer deep learning approach. We will compare them and discuss the limitations that need to be taken into consideration when implementing each model.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a three-level hierarchical generative model. It is a powerful textual analysis technique, based on computational linguistics research, that uses statistical correlations between words across many documents to find and quantify the underlying subjects (Jelodar et al., 2019)[3]. The model is considered state of the art in topic modeling.

‘Latent’ refers to the fact that the model discovers hidden topics within the documents. ‘Dirichlet’ indicates that both the distribution of topics in a document and the distribution of words within topics are assumed to follow Dirichlet distributions. Finally, ‘Allocation’ refers to the distribution of topics over the documents (Ganegedara, 2019). For a deeper understanding of the components of the LDA topic model, a blog written by Thushan Ganegedara[4] offers a great explanation.

The model assumes that textual documents are made up of topics, which in turn are made up of words from a lexicon. The hidden topics are ‘a recurring pattern of co-occurring words’ (Blei, 2012)[5]. Every corpus containing a collection of documents can be converted into a document-term matrix (DTM), a statistical representation describing the frequency of the terms occurring in the collection. LDA decomposes the DTM into two sub-matrices: the document-topic matrix, which contains the possible topics per document, and the topic-term matrix, which contains the words that those potential topics consist of (Seth, 2021)[6].
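To make this decomposition concrete, here is a minimal sketch using scikit-learn; the toy documents and parameter values are illustrative assumptions, not the code from our notebook. The corpus is turned into a DTM, and the fitted LDA model exposes exactly the document-topic and topic-term matrices described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus: two cooking documents and two football documents
docs = [
    "add the basil and olive oil to the pasta sauce",
    "the striker scored twice in the second half",
    "simmer the sauce and season the pasta with salt",
    "the goalkeeper saved a penalty in the final minutes",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)            # document-term matrix (n_docs x n_terms)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)              # document-topic matrix (n_docs x n_topics)
topic_term = lda.components_                    # topic-term matrix   (n_topics x n_terms)

# Inspect the most important words per topic
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(topic_term):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {idx}: {top_terms}")
```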

Parameters

This topic modeling approach can be implemented in various ways, but the model’s performance comes down to specifying one or more parameters. The most crucial parameters in this case are the following: the number of topics, i.e. the optimal number of topics to extract from the corpus; alpha, which controls the prior distribution over topic weights within each document; and eta, which controls the prior distribution over word weights within each topic. Depending on the data as well as the goal of the task, a grid search needs to be executed to find the optimal parameters. A grid search is a tool for exhaustively searching the hyperparameter space of an algorithm by trying different values and then picking the combination with the best score. If you think this topic modeling approach is interesting and would like to see the technique implemented, then we have got you covered: we created a notebook that executes a topic modeling task using LDA.
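As an illustration of such a search, the sketch below loops over a few candidate values for the number of topics and alpha with gensim and scores each model on topic coherence. The value grids, the number of passes, and the c_v coherence measure are assumptions made for this example, not the settings used in our notebook.

```python
from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# A subset of the 20 newsgroups corpus, tokenized with a simple pre-processing step
raw_docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]
tokenized_docs = [simple_preprocess(doc) for doc in raw_docs]

dictionary = Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # trim very rare and very frequent words
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

best_score, best_params = -1.0, None
for num_topics in (5, 10, 15):                          # candidate numbers of topics
    for alpha in ("symmetric", "asymmetric"):           # candidate priors over topic weights
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                       alpha=alpha, eta="auto", passes=5, random_state=42)
        coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                                   dictionary=dictionary, coherence="c_v").get_coherence()
        if coherence > best_score:
            best_score, best_params = coherence, (num_topics, alpha)

print(f"Best coherence {best_score:.3f} for (num_topics, alpha) = {best_params}")
```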

BERTopic

Devlin et al. (2018)[7] presented Bidirectional Encoder Representations from Transformers (BERT) as a fine-tuning approach in late 2018. If the first thing that comes to mind when reading the word Transformers is the movie, then you might want to look at a blog written by Jay Alammar[8] before continuing with this article. Building on BERT, a topic modeling technique called BERTopic was developed by Grootendorst (2020)[9]. It combines transformer embeddings with class-based TF-IDF (term frequency-inverse document frequency) to produce dense clusters that are easy to interpret while keeping significant words in the topic descriptions. This deep learning approach supports sentence-transformers models for over 50 languages for document embedding extraction (Egger and Yu, 2022)[10]. The technique follows three steps: document embeddings, document clustering, and class-based TF-IDF for topic representation.

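In its simplest form, training a BERTopic model takes only a few lines. The sketch below uses the defaults on the 20 newsgroups corpus as assumed example data; it is a minimal illustration of the three steps rather than the code from our notebook.

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Example corpus of raw (untokenized) documents
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

topic_model = BERTopic(language="english")       # embeds, clusters and applies class-based TF-IDF
topics, probs = topic_model.fit_transform(docs)  # one topic id (and probability) per document

print(topic_model.get_topic_info().head(10))     # overview of the discovered topics
print(topic_model.get_topic(0))                  # top c-TF-IDF terms for the largest topic
```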
Parameters

As with any other language model, BERTopic has parameters that need to be taken into consideration. The values assigned to these parameters are crucial, as they have a major influence on the performance of the model. Some of them are: the number of topics, i.e. the optimal number of topics to extract from the corpus; the language, i.e. the primary language used in the training data; and the embedding model, which should be chosen depending on the domain of the data. There are several other parameters, but the most important ones are presented in this article. This deep learning method gives the user the opportunity to replace every component behind these parameters: for example, the user can swap the embedding model or the dimensionality reduction method for a preferred one. Depending on the task and the desired goal, changing the defaults might improve the predictive performance and topic quality of the model. If you would like to see how a BERTopic model is implemented, or if you would like to experiment with different parameters, you can find our notebook.
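For instance, the sketch below swaps in an explicit sentence-transformers embedding model, UMAP for dimensionality reduction, and HDBSCAN for clustering. The specific model name and parameter values are illustrative assumptions, not recommendations from our notebook.

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

# Explicitly chosen components; each default can be replaced one by one
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # example pretrained embedding model
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics=20,            # ask BERTopic to reduce the result to roughly 20 topics
)
topics, probs = topic_model.fit_transform(docs)
```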

LDA vs. BERTopic

This article presented two different ways to execute a topic modeling task. Before you decide which one to use, there are some limitations as well as advantages that need to be taken into consideration. Let us look at the positive side of these models first!

LDA is considered a state-of-the-art topic detection technique and it is time efficient. For context, the data used for this project consisted of 11,314 documents, and training the topic model took less than five minutes. Moreover, the generative assumption confers one of LDA’s main advantages: the model it uses to separate documents into topics generalizes to documents outside the corpus. Even though the time efficiency and the generative assumption are important strengths, there are also some disadvantages that affect the performance of the model. The first drawback of this generative model is that it fails to cope with large vocabularies. In previous research, practitioners had to limit the vocabulary in order to fit a good topic model, which can hurt performance. To restrict the vocabulary, the most and least frequent words are usually eliminated; this trimming may remove essential terms from the scope (Dieng et al., 2020)[11].

Another significant limitation stems from the core premise of LDA: documents are treated as a probabilistic mixture of latent topics, with each topic having a probability distribution over words, and each document is represented using a bag-of-words (BOW) model. Under this approach, topic models are adequate for learning hidden themes but do not account for a document’s deeper semantic meaning. The semantic representation of a word can be an essential element in this procedure. For example, for the sentence ‘The man became the king of England’, a bag-of-words representation cannot capture that the words ‘man’ and ‘king’ are related. Finally, when the order of the training data is altered, LDA suffers from ‘order effects’, meaning that different topics are generated. This happens because of the different shuffling of the training data during the clustering process. Any study with such order effects will contain systematic inaccuracy, which can lead to misleading results such as erroneous topic descriptions.

Because of these limitations, many new generative and deep learning models have been built on top of this traditional approach to improve topic quality and predictive performance. One of those improved models is BERTopic. As mentioned before, this approach can take the semantic meaning of the text into account through an embedding model and generate meaningful topics whose content is semantically correlated. In a topic modeling task, it is vital to produce topics that are coherent and understandable to a human. This deep learning approach does not have the same limitations concerning the size of the data used for the task, which removes the concern of dropping essential terms from the scope. Moreover, it is possible to use multilingual data, since there is a ‘language’ parameter available when training the model. Finally, as the model relies on embedding models, the user can choose from a wide variety of them or even create a custom one. BERTopic improves on several aspects of traditional methods, but that does not mean it has no limitations of its own.

When it comes to topic representation, this model does not consider the cluster’s centroid. A cluster centroid is ‘a vector that contains one number for each variable, where each number is the mean of a variable for the observations in that cluster. The centroid can be thought of as the multi-dimensional average of the cluster’ (Zhong, 2005)[12]. BERTopic takes a different approach: it concentrates on the cluster as a whole and attempts to model the cluster’s topic representation. This allows for a broader range of topic representations while ignoring the concept of centroids. Depending on the data type, ignoring the cluster’s centroid can be a disadvantage. Moreover, even though BERTopic’s transformer-based language models allow for a contextual representation of documents, the topic representation does not directly account for this, because it is derived from bags-of-words. The words in a topic representation illustrate the significance of terms in a topic while also implying that those words are likely to be related. As a result, terms in a topic may be near-identical to one another, making them redundant for the topic’s interpretation (Grootendorst, 2022). Finally, an essential disadvantage of BERTopic is the time needed for fine-tuning: for this project, the model took an hour to train.

Concluding Remarks

This article presented two different approaches to topic modeling. LDA is a generative model which categorizes the words within a document based on two assumptions: documents are a mixture of topics and topics are a mixture of words. In other words, the documents are modeled as a probability distribution over topics, and the topics as a probability distribution over words. BERTopic, on the other hand, is a deep learning method that uses a class-based TF-IDF: the frequency of each word per class (cluster of documents) is divided by the total number of words in that class, which acts as a form of regularization of frequent words in the class; this term is then combined with an inverse-frequency component, in which the total number of documents is divided by the total frequency of the word across all classes. As a result, rather than modeling the importance of words in individual documents, this class-based TF-IDF approach models the significance of words in clusters, which enables us to create topic-word distributions for each cluster of documents.
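To make the class-based TF-IDF idea tangible, here is a rough numpy sketch following the description above. It is not BERTopic’s actual implementation, and the logarithm in the inverse-frequency term is an assumption borrowed from standard TF-IDF.

```python
import numpy as np

def class_tfidf(class_term_counts, n_documents):
    """Rough sketch of a class-based TF-IDF, following the description above.

    class_term_counts: array of shape (n_classes, n_terms) with word frequencies
    aggregated per class (i.e. per cluster of documents).
    """
    # frequency of each word per class, divided by the total number of words in that class
    tf = class_term_counts / class_term_counts.sum(axis=1, keepdims=True)
    # total number of documents divided by the total frequency of the word across all classes
    idf = np.log(n_documents / class_term_counts.sum(axis=0))
    return tf * idf          # (n_classes, n_terms): importance of each word within each class

# Tiny illustrative example: 2 classes, 3 terms, 100 documents in total
counts = np.array([[30.0, 5.0, 1.0],
                   [2.0, 40.0, 10.0]])
print(class_tfidf(counts, n_documents=100))
```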

You might be wondering: when should I use LDA and when BERTopic? Well, this decision depends on many factors. For example, the size of the data and the computational resources available can be crucial for deciding which model fits better. If you are working with a large amount of data and your resources are not powerful enough, the topic model may take a tremendous amount of training time; in that case, LDA is the way to go. If, on the other hand, the semantic representation of the data is important and you would like to take it into account, then BERTopic is the solution. At the end of the day, it really depends on the goal and target of the project. When it comes to tuning the topic models for the best result, LDA takes a great amount of time in terms of tuning and preparing the input, for example inspecting the data, pre-processing, and filtering. With BERTopic, many different variants can be tested instead, such as which pretrained model to use for the embeddings and which dimensionality reduction and clustering techniques to apply.

The main aim of this article was to introduce topic modeling and highlight the importance of retrieving hidden topics from large amounts of text data. It also gave a summary of the history of topic modeling as well as of the models that have been created alongside technological developments. Finally, this article aims to show the value of an NLP task such as topic modeling; therefore, two notebooks were also created to give a better overview of the practical side of the implementation (here and here). This automatic topic retrieval can give a company insight into the matters its customers talk about most often, improve the company’s strategy, and assist in developing marketing platforms.

If you enjoyed reading this article and you would like to know more about NLP and the variety of tasks that it includes, check out our series Natural Language Processing.

[1] This example was inspired by Topic Modelling With LDA -A Hands-on Introduction — Analytics Vidhya

[2] The Evolution of Topic Modeling (acm.org)

[3] [1711.04305] Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey (arxiv.org)

[4] Intuitive Guide to Latent Dirichlet Allocation | by Thushan Ganegedara | Towards Data Science

[5] Probabilistic topic models | Communications of the ACM

[6] Topic Modeling and Latent Dirichlet Allocation (LDA) using Gensim (analyticsvidhya.com)

[7] [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)

[8] The Illustrated Transformer — Jay Alammar — Visualizing machine learning one concept at a time. (jalammar.github.io)

[9] BERTopic (maartengr.github.io)

[10] (PDF) A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts (researchgate.net)

[11] [1907.04907] Topic Modeling in Embedding Spaces (arxiv.org)

[12] [PDF] Efficient online spherical k-means clustering | Semantic Scholar

Want to read more about the cool stuff we do at Cmotions and The Analytics Lab? Check out our blogs, projects and videos! Also check out our Medium page for more interesting blogs!
