Learn to Find Topics in a Text Corpus

Sooraj Subrahmannian
10 min read · Apr 16, 2018


Whether it is customer reviews, news articles, or conversations between people, when we are tasked with figuring out what a large corpus is about, it is impossible to read and summarize every document manually. Topic modeling is a natural language processing technique that extracts latent topics from a corpus of documents. Unlike a classification problem, there are no labels directing this process, so it is unsupervised. Many algorithms perform topic modeling. The most important ones are:

  1. Latent Dirichlet Allocation (LDA)
  2. Non-Negative Matrix Factorization (NMF)

In this blog, we will restrict our discussion to topic modeling using LDA.

Understanding LDA

Let us explore how LDA works.


The chart above shows how LDA models documents. Each document is represented as a distribution over topics, and each topic, in turn, is represented as a distribution over the tokens in the vocabulary. But we do not know in advance how many topics are present in the corpus or which documents belong to each topic. In other words, we want to treat the assignment of documents to topics as a random variable that is itself estimated from the data. This sounds complicated, but the process described above is similar to the Dirichlet process. Curious readers can read the following article to understand the Dirichlet process in detail.

Hence, LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages each document to consist mostly of a handful of topics, and each topic to consist mostly of a modest set of tokens.

Since the topics are learned by the model, they are formed from underlying concepts buried deep in the data.
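To build intuition for why the Dirichlet prior encourages such sparsity, here is a small, hypothetical sketch (not from the original post): drawing a mixture from a Dirichlet with a small concentration parameter puts most of the probability mass on only a few topics.

```python
import numpy as np

np.random.seed(0)

# A small concentration parameter (alpha < 1) yields a sparse mixture:
# most of the probability mass lands on just a few of the 12 topics.
sparse_mixture = np.random.dirichlet([0.1] * 12)

# A large concentration parameter (alpha > 1) spreads the mass almost evenly.
dense_mixture = np.random.dirichlet([10.0] * 12)

print(np.round(sparse_mixture, 2))  # a few large values, many near zero
print(np.round(dense_mixture, 2))   # values all close to 1/12
```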

Now that we have an understanding of how LDA works, let's see how it is implemented. The corpus is converted into a bag of words, as shown in Fig-1 below. When this is passed through a topic modeling algorithm such as LDA, we identify the following two things:

  1. The distribution of words in each topic
  2. The distribution of topics in each document
Fig-1: General flowchart for topic modeling

From the figure above, it is easy to see that topic models address the problem of a large, sparse document-term matrix. Imagine we have n documents and m terms in our vocabulary. Now let us compare a conventional representation against a topic model:

  1. Conventional models: n x m variables
  2. Topic models: k x (n + m) variables, where k is the number of topics and k << n, m

For example, with n = 10,000 documents, m = 20,000 terms and k = 12 topics, the conventional representation requires 200 million variables, while the topic model requires only about 360,000.

Each document in a corpus can be as short as a single sentence (a tweet) or as long as a chapter in a book.

Typically, a corpus of very short documents is harder to build coherent (interpretable) topic models over than one of slightly longer documents.

Let’s get our hands dirty….

LDA Implementation

Step-1: Data preprocessing with SpaCy

Fig-2: General flowchart for data processing

For the purpose of this blog post, I have scraped about 12,000 resumes from the web. I have masked all personal identifiers from the dataset considering its sensitivity. In this dataset, we will be looking at the skills of candidates who are interested in data science and try to understand what clusters of skill sets exist. Please grab the data from this link.

Let’s look at the snapshot of the resume data below.

As we can see, the data is very messy. I preprocessed the data using SpaCy. If you are not familiar with SpaCy, I have created an introductory tutorial in this link. The preprocessing must be customized to each dataset. The processed data must represent each document in the corpus as a list of words that capture the ideas you want to extract from that document.
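As a rough illustration (the exact preprocessing used for this dataset differs and is covered in the tutorial linked above), a minimal SpaCy pipeline that lemmatizes, lowercases, and drops stop words and non-alphabetic tokens might look like this:

```python
import spacy

# Load a small English model; the parser and NER are disabled for speed.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text):
    """Turn a raw document into a list of lowercase lemmas,
    keeping only alphabetic tokens that are not stop words."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and not tok.is_stop]

# Each resume becomes a list of informative tokens.
raw_docs = ["Experienced data scientist skilled in Python, SQL and Spark."]
tokenized_docs = [preprocess(d) for d in raw_docs]
print(tokenized_docs)
```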

Step-2: Topic modeling

After preprocessing, the corpus was converted into a 'bag of words' as shown in Figure 2. I have previously worked with the LDA implementations in both the Sklearn and Gensim packages in Python. The latter can run on multiple cores and is therefore much faster. A full listing of the code, from preprocessing the texts to generating the final topic model, is given below.
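The full listing is embedded as a gist in the original post; the following is a minimal sketch of the Gensim workflow it describes, going from the tokenized documents to a trained LDA model (filter thresholds and training parameters here are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# tokenized_docs: one list of preprocessed tokens per resume.
dictionary = Dictionary(tokenized_docs)
# Drop very rare and very common tokens (thresholds are illustrative).
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Bag-of-words representation: each document becomes (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train LDA with 12 topics; LdaMulticore spreads the work across CPU cores.
lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=12,
                         passes=10, workers=3, random_state=42)

# Inspect the top words of each topic.
for topic_id, words in lda_model.print_topics(num_topics=12, num_words=10):
    print(topic_id, words)
```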

Selecting a topic model & topics within it

In the example code given above, the number of topics was set to 12. But how did I choose this magic number 12? The hardest and most elusive part of topic modeling is selecting the number of topics. The most popular way of tackling this issue is to eyeball the topics derived from models trained with different numbers of topics and pick the most sensible one. But what counts as 'sensible' differs from person to person, and such approaches often lead us to wrong conclusions about the corpus.

In this section, I will summarize two important techniques used to evaluate topic models:

  1. Perplexity (might not be such a great measure)
  2. Topic coherence

Perplexity

Given a trained model, perplexity measures how surprised the model is when it is given a new dataset. It is computed from the normalized log-likelihood of a held-out test set. The lower the perplexity, the better the model.
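In standard notation, with $M$ held-out documents and $N_d$ the number of tokens in document $d$:

$$\log p(D_{\text{test}}) = \sum_{d=1}^{M} \log p(\mathbf{w}_d)$$

$$\text{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\log p(D_{\text{test}})}{\sum_{d=1}^{M} N_d}\right)$$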

The first equation calculates the log-likelihood: the probability of observing the unseen data given the model learned earlier. This checks whether the model captures the distribution of the held-out set; if it doesn't, the perplexity is very high, suggesting that the model is poor. However, studies have shown the following:

Perplexity is not strongly correlated to human judgment and, even sometimes slightly anti-correlated.

This is a good example of when a popular metric doesn’t fit the business requirement. This is why there is a focus on topic coherence.

Topic coherence

Here we quantify the coherence of a topic by measuring the degree of semantic similarity between its high scoring words. These measurements help distinguish between topics that are human interpretable and those that are artifacts of statistical inference. To compute topic coherence of a topic model, we perform the following steps.

  1. Select the top n frequently occurring words in each topic.
  2. Compute pairwise scores (UCI or UMass) for each of the words selected above and aggregate all the pairwise scores to calculate the coherence score for that topic.
  3. Take the mean of the coherence scores of all topics in the model to arrive at a score for the topic model.
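A sketch of how the model-level coherence can be computed with Gensim's CoherenceModel, reusing the model, tokenized documents, and dictionary from the earlier steps (the choice of the 'c_v' measure here is illustrative):

```python
from gensim.models import CoherenceModel

coherence_model = CoherenceModel(
    model=lda_model,
    texts=tokenized_docs,   # tokenized documents are required by 'c_v' / 'c_uci'
    dictionary=dictionary,
    coherence="c_v",        # alternatives include 'u_mass', 'c_uci', 'c_npmi'
)

# Mean of the per-topic coherence scores, i.e. the score for the whole model.
print("Model coherence:", coherence_model.get_coherence())
```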

To understand more about the score function used in evaluating topic coherence, please read the article given below.

Using the concept described above, we can address the two main issues:

  1. Topic model evaluation: to estimate the number of topics in a corpus

Following the steps above, we can compute the average coherence score per topic for a range of models trained with different numbers of topics. The number of topics at which the average score plateaus, as shown in the figure below, is the sweet spot we are looking for.

As you can see, the score plateaus around 12–14 topics, so our best guess for the number of topics lies in this range. The code for this step can be found on my GitHub.
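A condensed sketch of that sweep (training parameters are illustrative and differ from the full code on GitHub):

```python
import matplotlib.pyplot as plt
from gensim.models import CoherenceModel, LdaMulticore

topic_range = range(2, 21)
coherence_scores = []

for k in topic_range:
    model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=k,
                         passes=10, random_state=42)
    cm = CoherenceModel(model=model, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    coherence_scores.append(cm.get_coherence())

# Plot coherence against the number of topics and look for the plateau.
plt.plot(list(topic_range), coherence_scores, marker="o")
plt.xlabel("Number of topics")
plt.ylabel("Average coherence score")
plt.show()
```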

2. Topic evaluation: automated selection of important topics

Once we select a topic model, say with 12 topics, the next step is to distinguish human-interpretable topics from artifacts of statistical inference. We use the same concept as above and calculate the coherence score of each topic with the same coherence measure.

The higher the topic coherence, the more human-interpretable the topic. We can order the topics by decreasing coherence score to surface the most interpretable ones.
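A sketch of this ranking, reusing the CoherenceModel from the previous step:

```python
# Per-topic coherence scores are returned in topic-id order.
per_topic_scores = coherence_model.get_coherence_per_topic()

# Sort topic ids by decreasing coherence to surface the most interpretable topics.
ranked = sorted(enumerate(per_topic_scores), key=lambda x: x[1], reverse=True)
for topic_id, score in ranked:
    top_words = [word for word, _ in lda_model.show_topic(topic_id, topn=10)]
    print(f"Topic {topic_id} (coherence {score:.3f}): {', '.join(top_words)}")
```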

How do we decide on a threshold for our coherence score?
This is more of an art than a science. Normally, we see a sharp drop in the coherence score after a particular topic; in the table above, the drop happens after the first 7 topics. In addition to a good coherence score, a topic must also account for a significant share of the tokens/words. In our case, the token percentage falls below 10% after the fourth topic from the top (described in the next section).

Here, we can see that topics 4, 11 and 8 have very low coherence scores, so they can be discarded, whereas topics 9, 10, 12 and 7 have good coherence scores and are more interpretable than the rest. The code for this section can be found on my GitHub.

So far we have learned how to build a topic model, how to evaluate and get the best topic model and how to evaluate the topics within the selected topic model. Now let’s proceed to understand these topics through visualization.

Interactive visualization of topics using pyLDAvis

In the previous section, we defined each topic by the top 10 most frequently occurring words in that topic. As you may have noticed, the caveat with this approach is that a term that is important to one topic is often important to other topics as well. Such words do not help us differentiate the topics from one another. For example, since this dataset is about data science skills, words such as 'python' and 'sql' find their way into almost every topic.

How do we draw out relevant words with respect to a specific topic?

We want terms that are not only important within a topic but also able to distinguish that topic from the rest. In this context, I introduce the concept of relevance.

The relevance of a term with respect to a topic is defined as a weighted sum of the probability of word w given topic k and the lift. The lift of a word with respect to a topic is the ratio of the probability of word w given topic k to the probability of the word in the entire corpus.
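In the notation used by pyLDAvis (Sievert & Shirley, 2014), with $\phi_{kw}$ the probability of word $w$ in topic $k$, $p_w$ the marginal probability of the word in the corpus, and $\lambda$ the weight, relevance is defined on the log scale:

$$\text{relevance}(w, k \mid \lambda) = \lambda \log \phi_{kw} + (1 - \lambda) \log \frac{\phi_{kw}}{p_w}$$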

The first term on the RHS is proportional to how frequently the word occurs in that topic. If a word occurs frequently in multiple topics, this term alone does not help to describe or distinguish a topic well. The second term on the RHS captures the exclusivity of a word to a specific topic; it is proportional to the ratio of the word's frequency in that topic to its frequency in the whole corpus. The exclusivity term lowers the score of globally frequent terms and raises the score of rare terms that are concentrated in a topic. Empirically, an ideal value of the weight is generally around 0.3.

The relevant words for the top 4 coherent topics in this dataset are given below. I chose to use 4 topics out of my entire topic list because the next topic had a token percentage of less than 10%. Since my documents are short and the number of documents is relatively small, topics covering less than 10% of the total tokens in the corpus will most likely be incoherent.

The final number of topics must be chosen after taking into account both the token percentage and the coherence score of each topic.

Now it is easy for us to name these highly coherent (interpretable) topics with the help of relevant words. We have ignored the less coherent topics from the analysis.

The main use of the pyLDAvis package is to provide interactive visualizations that augment our understanding. Let us see how our topics look:
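A minimal sketch of generating this visualization from the trained Gensim model (in newer releases of pyLDAvis the module is pyLDAvis.gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer versions

# Prepare the visualization from the trained model, corpus, and dictionary.
vis_data = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

# Save a standalone interactive HTML page (or use pyLDAvis.display in a notebook).
pyLDAvis.save_html(vis_data, "lda_topics.html")
```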

We can read the visualization from the following perspectives.

  1. Each topic is represented as a bubble. The size of a bubble is proportional to the topic's prevalence in the corpus.
  2. Similar topics appear close together; topics that are further apart are less similar.
  3. Upon selecting a topic, the most representative words for that topic are shown. This measure can be a combination of how frequent and how discriminative the word is, and you can adjust the weight of each property using the slider.
  4. When a topic is selected, the percentage of tokens belonging to it is also shown. This can be used as an additional measure to weed out irrelevant topics; for example, the final list of topics in the example above was cut down to four for this reason.
  5. Hovering over a word resizes the topic bubbles according to how representative the word is for each topic, as shown in the figure below.

To play more with this visualization, please download the interactive visualization from this link.

Applications of topic modeling

  1. The LDA model gives a topic distribution for each document, which can be used to predict the most probable topic for a given document (see the sketch after this list)
  2. Documents similar to a given document can be found using LDA
  3. Comparing the topics from two different corpora helps us understand the similarities and differences between them
  4. Constructing a network of topics and documents. Check out this link.
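As a sketch of the first two applications, reusing the model, dictionary, and preprocess function defined earlier (names are illustrative):

```python
from gensim import similarities

# 1. Most probable topic for a new document.
new_doc = preprocess("Machine learning engineer with Python and Spark experience.")
bow = dictionary.doc2bow(new_doc)
doc_topics = lda_model.get_document_topics(bow)
print("Most probable topic:", max(doc_topics, key=lambda x: x[1]))

# 2. Documents most similar to the new document, measured in topic space.
index = similarities.MatrixSimilarity(lda_model[corpus],
                                      num_features=lda_model.num_topics)
sims = sorted(enumerate(index[lda_model[bow]]), key=lambda x: -x[1])[:5]
print("Most similar documents:", sims)
```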

Applications 1–3 have been demonstrated in this notebook in my GitHub repository.

References

  1. http://aclweb.org/anthology/D/D12/D12-1087.pdf
  2. http://qpleple.com/topic-coherence-to-evaluate-topic-models/
  3. http://qpleple.com/perplexity-to-evaluate-topic-models/
  4. https://github.com/bmabey/pyLDAvis
  5. https://rare-technologies.com/gsoc17-training-and-topic-visualizations/

Editors: Kunal Kotian, Sri Santhosh Hari, and Alvira Swalin


Sooraj Subrahmannian

Senior Data Scientist, Walmart, NLP Data scientist@GumGum, Masters in Data Science USF, IIT Madras Alumnus, LinkedIn: linkedin.com/in/soorajms/