Topic Modeling with Latent Dirichlet Allocation (LDA)

Konstantina Andronikou
Cmotions
Nov 1, 2022

Topic Modeling with Python

This notebook is an adaptation of the tutorial created by Piek Vossen.

What is Topic Modeling?

The following image was sourced from Blei, 2021

Imagine entering a bookstore to buy a cookbook and being unable to locate the section where it is shelved, because the bookstore has simply placed all types of books together. In this case, the importance of dividing the bookstore into distinct sections based on the type of book becomes apparent. Topic Modeling is a process of detecting themes in a text corpus, similar to splitting a bookshop into sections depending on the content of the books. The main idea behind this task is to produce a concise summary highlighting the most common topics from a corpus of thousands of documents.

This example was inspired by the following blog

This model takes a set of documents as input and generates a set of topics that accurately and coherently describe the content of the documents. It is one of the most commonly used approaches for processing unstructured textual data, i.e. data whose information is not organized in a pre-determined way.

What does LDA stand for?

  • Latent: refers to the model's ability to discover the hidden topics within the documents
  • Dirichlet: indicates the distribution of topics in a document and the distribution of words within topics
  • Allocation: represents the assignment of topics to the documents

How does LDA work?

This method is a three-level hierarchical generative model. It is a powerful textual analysis technique, grounded in computational linguistics research, that uses statistical correlations between words in a large number of documents to find and quantify the underlying subjects (Jelodar et al., 2019). This topic modeling algorithm categorizes the words within a document based on two assumptions: documents are a mixture of topics and topics are a mixture of words. In other words, ‘the documents are known as the probability density (or distribution) of topics, and the topics are the probability density (or distribution) of words’ (Seth, 2021). The hidden topics are a ‘recurring pattern of co-occurring words’ and therefore this method relies on the bag-of-words (BoW) approach. This approach combines all words into a bag without taking into consideration the deeper semantic meaning or the order of the tokens within the document. For example, for the sentence ‘The man became the king of England’, a bag-of-words representation will not be able to identify that the words ‘man’ and ‘king’ are related.

LDA converts documents into a document-term matrix (DTM). This is a statistical representation describing the frequency of the terms that occur within a collection of documents. The DTM then gets separated into two sub-matrices: the document-topic matrix, which contains the possible topics per document, and the topic-word matrix, which includes the words that the potential topics contain (Seth, 2021).

For a deeper understanding of the components of LDA, a blog written by Thushan Ganegedara offers a great explanation.

Implementation

Enough on the theoretical component, let’s get our hands dirty! This notebook implements a traditional approach to topic modeling, Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan, 2003).

Step 0: Loading the data and relevant packages

The first step of the topic modeling task is to load the desired data as well as the relevant packages for the pre-processing steps. The data used for this notebook is a collection of approximately 20,000 newsgroup documents, partitioned evenly across 20 different newsgroups.
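The original cell is not reproduced here, but a minimal sketch of what it could look like follows; the use of scikit-learn’s fetch_20newsgroups loader is an assumption based on the description of the data.

```python
# Sketch of the data loading step (the loader and package list are assumptions)
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import gensim

# Fetch the ~20,000 newsgroup posts, spread across 20 newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data

print(f"Number of documents: {len(documents)}")
```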

Step 1: Reviewing and preparing the data

Text data is unstructured, and if we want to extract information from it, reading every document is not an option. We need to process those texts to obtain structured representations. The common idea behind all NLP tools is that they try to transform text in some meaningful way. Before diving into the topic modeling task itself, we need to review and prepare the textual data.

1.1 Data Statistics

Reviewing the format, the length, and the type of the data provides a better understanding of it, as well as useful information about which pre-processing steps need to be implemented. The following cell provides some information concerning the data we are using.
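A possible version of such a cell, assuming the documents are held in a plain Python list called documents:

```python
# Basic statistics about the data (hypothetical inspection cell)
print(type(documents))                      # container type: a list of strings
print(len(documents))                       # number of documents
lengths = [len(doc.split()) for doc in documents]
print(sum(lengths) / len(lengths))          # average document length in tokens
print(documents[0][:500])                   # peek at the first document
```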

1.2 Dataframe

A DataFrame is a two-dimensional data structure: a table with rows and columns. Creating a DataFrame from the textual data makes inspecting and understanding the data easier. Moreover, this function can take many different structures as input. The package used to generate a DataFrame is pandas.
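A minimal sketch of this step; the column name text is an illustrative choice:

```python
# Wrap the raw documents in a pandas DataFrame for easier inspection
import pandas as pd

df = pd.DataFrame({'text': documents})
df.head()
```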

1.3 Pre-processing

As can be seen from the output of the previous cells, the data is not ready for the topic modeling task. It contains many elements that create ‘noise’ in the data, such as punctuation, function words, etc. Preparing textual data can be time-consuming, but the input used for the model is crucial for the quality of a language model. Based on the data used in this notebook, the following pre-processing steps are implemented:

1.3.1 Tokenization: A token is a word or a punctuation mark as it appears in the sentence. Tokenization is the process of splitting sentences into individual words and punctuation marks. This process is beneficial as it divides the text into pieces and thus makes it easier for a language model to process.

1.3.2 Lemmatization: A lemma is the root form of a token; for instance, for the token ‘divided’ within a sentence, ‘divide’ would be the corresponding lemma. In this case we are lemmatizing to prevent redundant topics such as ‘books’ and ‘book’.

1.3.3 Filtering: Removing words that do not carry much meaning on their own, such as pronouns, determiners, and conjunctions. This reduces the ‘noise’ within the data and helps the language model when training.
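The exact pre-processing cell is not shown here; the sketch below covers the three steps with spaCy, which is an assumption about the tooling (the original notebook may rely on different libraries):

```python
# Sketch of a pre-processing pipeline: tokenization, lemmatization, and filtering
import spacy

# Disable components we do not need to speed things up
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

processed_docs = [
    # keep lowercased lemmas of alphabetic, non-stop-word tokens
    [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
    for doc in nlp.pipe(documents, batch_size=100)
]

print(processed_docs[0][:20])
```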

The previous cell shows the pre-processed data; as can be seen, the overall picture of the data is much easier to understand. The processed data has been separated into individual tokens, and function words as well as punctuation have been removed. This data will be used as the input for the topic model.

Step 2: Input preparation for topic model

2.1 Bag of Words (BoW)

The first step of the input preparation is to create a dictionary containing the frequency count of the words that appear within the data. This approach counts how often each word appears in each document, recording the frequency of each word.
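A minimal sketch of this step using gensim’s Dictionary class, assuming the pre-processed documents live in processed_docs:

```python
# Build the word <-> id mapping with document frequencies
from gensim.corpora import Dictionary

dictionary = Dictionary(processed_docs)
print(len(dictionary))   # vocabulary size before filtering
```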

2.2 Filtering extremes

With the help of the dictionary function, we are able to filter out tokens that are present in fewer than 10 documents or in more than 50% of the documents. After these tokens are removed, the function keeps only the 100,000 most frequent of the remaining tokens. Depending on the task and preference, the values of these parameters can be changed.
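In gensim this corresponds to the filter_extremes method on the dictionary; the sketch below uses the values mentioned above:

```python
# Remove very rare and very common tokens, then keep at most 100,000 tokens
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)
print(len(dictionary))   # vocabulary size after filtering
```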

2.3 Vector representation

With the doc2bow function provided by gensim we can create a BoW vector representation for the dataset. As can be seen from the output of this function, we can inspect the frequency of each individual token within a document.
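A short sketch of this step, reusing the dictionary and processed_docs objects from the previous cells:

```python
# BoW representation: each document becomes a list of (token_id, token_count) tuples
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print(bow_corpus[0][:10])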

Step 3: Parameters and training the model

This topic modeling approach can be implemented in various ways, but the model’s performance comes down to estimating one or more parameters. In this case the most crucial parameter is the number of topics. Depending on the data as well as the goal of the task, a grid search can be executed in order to find the optimal parameters. A grid search is a tool for exhaustively searching the hyperparameter space of an algorithm by trying different values and then picking the value with the best score.
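A rough sketch of such a grid search over the number of topics, scoring each candidate with a coherence measure (both the model and the coherence metric are introduced later in this notebook; the candidate values are arbitrary examples):

```python
# Hypothetical grid search over num_topics, scored by c_v coherence
from gensim.models import LdaMulticore, CoherenceModel

scores = {}
for k in [5, 10, 15, 20]:
    model = LdaMulticore(corpus=bow_corpus, id2word=dictionary, num_topics=k, passes=5)
    cm = CoherenceModel(model=model, texts=processed_docs,
                        dictionary=dictionary, coherence='c_v')
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)
print(scores, "-> best number of topics:", best_k)
```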

Parameters for this implementation

1. bow_corpus = corpus data as BoW
2. id2word = mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
3. passes = number of full passes over the training corpus.
4. num_topics = number of topics to extract.

Additional parameters that can be used!

3.1 Training

The LDA model used for this notebook is provided by gensim.models.ldamulticore — parallelized Latent Dirichlet Allocation. For more information about the model and the algorithm specifications please have a look at the gensim page.
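A hedged sketch of the training call; the parameter values (10 topics, 10 passes, 4 workers) are illustrative choices rather than the settings used in the original notebook:

```python
# Train the parallelized LDA model on the BoW corpus
from gensim.models import LdaMulticore

num_topics = 10

lda_model = LdaMulticore(
    corpus=bow_corpus,        # corpus data as BoW
    id2word=dictionary,       # mapping from word IDs to words
    num_topics=num_topics,    # number of topics to extract
    passes=10,                # full passes over the training corpus
    workers=4,                # number of worker processes
    random_state=42,
)

# Inspect the top words per topic
for topic_id, topic in lda_model.print_topics(num_topics=num_topics, num_words=5):
    print(topic_id, topic)
```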

Step 4: Results

After training the model, visualization helps us easily inspect and understand the output of the topic model. For this notebook, two different types of visualizations were chosen. There are multiple ways to visualize the output of a topic model; a great overview and inspiration for the following visualizations is a blog written by Selva Prabhakaran.

4.1 Interactive graph

A popular visualization package used for LDA is pyLDAvis which gives a great overview of the individual topics and the tokens within them as well as the relationship between the topics.

If this is the first time using pyLDAvis, the following line is necessary.
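The original cell is not reproduced here; presumably it installs the package. The sketch below shows the install plus the visualization call, where the name of the gensim helper module is an assumption that depends on the installed pyLDAvis version:

```python
# First-time setup (run once):
# !pip install pyLDAvis

# Note: in newer pyLDAvis versions the gensim helper lives in pyLDAvis.gensim_models,
# in older versions in pyLDAvis.gensim
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dictionary)
vis
```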

As can be seen, with the interactive representation of the topics we can manually select and view a specific topic and its most relevant terms. This visualization includes a ‘λ’ parameter that gives us the opportunity to adjust the relevance ranking of the terms. As mentioned before, with this visualization we can not only see the individual topics but also the relationship/correlation between them. This can be done by exploring the Intertopic Distance Plot and the Marginal topic distribution.

4.2 Barchart

With the help of matplotlib we can visualize the generated topics using a bar chart. The following function generates a figure that contains 10 different bar charts; each bar chart represents a topic and its corresponding terms.
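The function itself is not reproduced here; the sketch below is one possible implementation, assuming the model was trained with 10 topics as in the training sketch above:

```python
# One bar chart per topic: top-10 terms and their weights (layout choices are assumptions)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(20, 8))

for topic_id, ax in enumerate(axes.flatten()):
    top_terms = lda_model.show_topic(topic_id, topn=10)   # list of (word, weight)
    words, weights = zip(*top_terms)
    ax.barh(words, weights)
    ax.invert_yaxis()                                      # highest weight on top
    ax.set_title(f"Topic {topic_id}")

plt.tight_layout()
plt.show()
```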

Discussion

Let's take a moment to discuss some of the generated topics! If we have a look at topic 7, we can see that the model has combined terms such as ‘game’, ‘player’, ‘hockey’, and ‘season’. Based on human judgment we can hypothesize that this news topic concerns the sports domain, in particular hockey. Another example is topic 4: it contains terms such as ‘christian’, ‘believe’, and ‘jesus’, so it can be assumed that the news topic in this case concerns religion.

Step 5: Evaluation

The unsupervised nature of topic models makes model selection problematic; therefore, evaluation is an important issue. Topic coherence is part of a larger subject: what makes a good topic, what properties of a document collection make it more suitable for topic modeling, and how topic modeling’s potential can be utilized for human benefit (Newman et al., 2010). This evaluation method can be defined as the degree of significance between the words inside a topic, in terms of how interpretable the topic is. The goal of the topic coherence metrics employed is to assess the quality of topics from a human-like standpoint.

  • C_v: this measure ‘is based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity’ (Mifrah and Benlahmar, 2020).
  • C_umass: this measure takes into consideration the document co-occurrence counts, one-preceding segmentation, and a logarithmic conditional probability as a confirmation measure (Mifrah and Benlahmar, 2020).

Source of the evaluation method.
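Both measures are available through gensim’s CoherenceModel; the sketch below computes them for the trained model, reusing the processed_docs, bow_corpus, and dictionary objects from the earlier cells:

```python
# Compute both coherence scores
# (c_v needs the tokenized texts, u_mass only needs the BoW corpus)
from gensim.models import CoherenceModel

coherence_cv = CoherenceModel(model=lda_model, texts=processed_docs,
                              dictionary=dictionary, coherence='c_v')
coherence_umass = CoherenceModel(model=lda_model, corpus=bow_corpus,
                                 dictionary=dictionary, coherence='u_mass')

print("C_v:    ", coherence_cv.get_coherence())
print("C_umass:", coherence_umass.get_coherence())
```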

Discussion

Even though LDA is considered a state-of-the-art topic detection technique, there are some limitations that need to be taken into consideration before implementation.

Limitations

The first drawback of this generative model is that it fails to cope with large vocabularies. In previous research, practitioners had to limit the vocabulary in order to fit a good topic model, which can have consequences for the performance of the model. To restrict the vocabulary, usually the most and least frequent words are eliminated; this trimming may remove essential terms from the scope (Dieng et al., 2020). Another significant limitation follows from the core premise of LDA: documents are considered a probabilistic mixture of latent topics, with each topic having a probability distribution over words, and each document is represented using a bag-of-words (BoW) model. Based on this approach, topic models are adequate for learning hidden themes but do not account for a document’s deeper semantic meaning, even though the semantic representation of a word can be an essential element in this procedure. Finally, when the order of the training data is altered, LDA suffers from ‘order effects’, meaning that different topics are generated. This happens because the training data is shuffled in a different order during the clustering process. Any study with such order effects will have a systematic inaccuracy, which can lead to misleading results such as erroneous topic descriptions.

Closing Notes

This notebook aimed to introduce the topic modeling task and highlight the importance of retrieving hidden topics within a great amount of text data. Moreover, it shows the value of an NLP task such as topic modeling: this automatic topic retrieval can provide a company with information about the most frequent matters that customers talk about, improve a company’s strategy, and assist in developing marketing platforms.

If you would like to read more about Topic Modeling, please have a look at our article An in-depth Introduction to Topic Modeling using LDA and BERTopic, and make sure to check out the notebook generated based on a deep learning approach, BERTopic.

If you are in general interested in NLP tasks, then you are in the right place! Take a look at our series Natural Language Processing.

Want to read more about the cool stuff we do at Cmotions and The Analytics Lab? Check out our blogs, projects and videos! Also check out our Medium page for more interesting blogs!
