What Is Topic Modeling and How Can It Improve Natural Language Processing?

Topic Modeling and Visualizations with Gensim LDA equipped with Mallet library.

Farhad Sadeghlo
Published in Omdena
9 min read · Feb 14, 2021


The problem

The idea behind topic modeling here comes from inspecting websites and then analyzing whether a website is performing as expected.

In this matter, it is useful to derive the following:

  • the dominant topic in each sentence,
  • the most representative document for each topic,
  • the topic distribution across documents,
  • the frequency distribution of the word counts,
  • the word cloud of the top N keywords in each topic,
  • the word counts of topic keywords,
  • the sentence chart colored by topic,
  • and a t-SNE clustering chart.

How Topic Modeling can help: An example

World Resources Institute (WRI) has been seeking to understand how regional and global Nature-Based Solutions (NbS) (e.g. forest and landscape restoration) can be leveraged to address and minimize climate change impacts. More than 50 Omdena AI changemakers developed an interactive dashboard, which visualizes the impact of climate change in various areas as well as showcases mitigating Nature-Based-Solutions.

Where it is needed

The use cases for such a project are everywhere, and the possible results are nearly endless. As explained, the real goal is to assess the clarity of a website. Candidates for topic-modeling analysis include websites covering the stock market, political views, publications, news, articles, and blogs: in general, any website with so much text that an individual could not analyze it thoroughly just by reading the raw material. The parties who might request such an analysis are equally varied: website owners, governments, jurisdictions, and institutions such as investors deciding where to put their money in an industry.

How we used topic modeling — The datasets

The three datasets that this project was based on were: 1. afr100, 2. initiative20x20, 3. cities4forests.

Specifically, for this article, dozens of PDF files were provided by the initiative20x20 website. We analyzed the extracted datasets; the following is the report of our analysis.

The approach

Let’s go through the steps that are recommended for a topic modeling analysis.

Major libraries used in this project

pyLDAvis is one of the major libraries that helps us produce strong analysis accompanied by informative visualizations.

Another useful package to install is the Mallet library. According to its website, it is a Java-based package for topic modeling, document classification, clustering, information extraction, and other statistical natural language processing and machine learning applications.

The Mallet library provides a better implementation of LDA: it runs faster and gives better topic separation.

Pros of Gensim LDA

It provides a general view of the context: unigrams, bigrams, and trigrams can all be combined.

Gensim LDA works perfectly on unigrams.

Two parameters, min_count and threshold, help move the results from plain unigrams to a more general format. min_count is the minimum number of times a word pair must occur in the corpus before it is considered at all, and threshold is the minimum phrase score a pair must reach to be joined into a bigram: the higher the threshold, the fewer phrases are formed.
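As a rough sketch of how these two parameters interact, the default scorer in Gensim’s Phrases model gives a candidate pair a score of (pair_count − min_count) · vocab_size / (count_a · count_b), and joins the pair only when that score exceeds threshold. A minimal pure-Python illustration (the scoring is simplified and the corpus below is invented for the example):

```python
from collections import Counter

def bigram_candidates(sentences, min_count=5, threshold=100.0):
    """Score adjacent word pairs roughly the way Gensim's default Phrases
    scorer does: score = (pair_count - min_count) * vocab_size
    / (count_a * count_b). Pairs scoring above `threshold` are kept."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(word_counts)
    phrases = {}
    for (a, b), n_ab in pair_counts.items():
        score = (n_ab - min_count) * vocab_size / (word_counts[a] * word_counts[b])
        if score > threshold:
            phrases[(a, b)] = score  # candidate for joining as "a_b"
    return phrases
```

Lowering min_count or threshold lets rarer pairs through, which matches the behavior seen in the outputs below: more underscore-joined bigrams appear as the parameters are relaxed.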

Gensim works properly with pyLDAvis.

Cons of Gensim LDA

The performance of Gensim LDA on bigrams and trigrams is weak: if one is interested in generating only bigrams and trigrams, it does not isolate them as unique results; as mentioned above, it returns a combination of all the N-grams.

For example, these are some results that Gensim LDA has provided.

min_count=1 and threshold=100

[['city', 'take', 'many', 'shape', 'size', 'abandon', 'overgrow', 'lot', 'avenue', 'tower', 'vast', 'inner', 'city', 'park', 'repurpose', 'parking', 'space', 'serve', 'green', 'pocket', 'park', 'inner_forest', 'wild', 'natural', 'manicure', 'somewhere', 'find', 'public', 'private', 'land', 'provide', 'leisure', 'recreation', 'opportunity', 'riverbank', 'reduce', 'damaging', 'effect', 'stormwater', 'form', 'add', 'mosaic']]

min_count=1 and threshold=50

[['city', 'take', 'many', 'shape', 'size', 'abandon', 'overgrow', 'lot', 'avenue', 'tower', 'vast', 'inner', 'city', 'park', 'repurpose', 'parking', 'space', 'serve', 'green', 'inner_forest', 'wild', 'natural', 'manicure', 'somewhere', 'find', 'public', 'private', 'land', 'provide', 'leisure', 'riverbank', 'reduce', 'damaging', 'effect', 'stormwater', 'form', 'add', 'mosaic']]

min_count=1 and threshold=10

[['city', 'take', 'many', 'shape', 'size', 'abandon', 'overgrow', 'lot', 'avenue', 'tower', 'vast', 'inner_city', 'park', 'repurpose', 'parking', 'space', 'serve', 'green', 'inner_forest', 'wild', 'natural', 'manicure', 'somewhere', 'find', 'land', 'provide', 'leisure', 'riverbank', 'reduce', 'damaging', 'effect', 'stormwater', 'form', 'add', 'mosaic', 'city']]

Meanwhile, for min_count=1 and threshold=1, the number of bigrams may not increase as significantly as expected. The most unsatisfactory result is that, although a function to remove stop words was in place, the pipeline still failed by producing “inner_for” as a bigram, which is meaningless.

[['inner_for', 'city', 'shape', 'size', 'abandon', 'overgrow', 'lot', 'avenue', 'tower', 'vast', 'inner_city', 'park', 'repurpose', 'parking', 'space', 'serve', 'green', 'inner_forest', 'wild', 'natural', 'manicure', 'somewhere', 'find', 'land', 'provide', 'leisure', 'riverbank', 'reduce', 'damaging', 'effect', 'stormwater', 'form', 'add', 'mosaic', 'forest']]

Therefore, since the output may contain stop words, it cannot be trusted, so we stick to the most conservative parameters: min_count=5 and threshold=100.

[['city', 'take', 'many', 'shape', 'size', 'abandon', 'overgrow', 'lot', 'avenue', 'tower', 'vast', 'inner', 'city', 'park', 'repurpose', 'parking', 'space', 'serve', 'green', 'pocket', 'park', 'inner_forest', 'wild', 'natural', 'manicure', 'somewhere', 'find', 'public', 'private', 'land', 'provide', 'leisure', 'recreation', 'opportunity', 'stabilize', 'slope', 'riverbank', 'intercept', 'rainfall', 'reduce', 'damaging', 'effect', 'stormwater', 'form', 'add', 'mosaic']]

What would pyLDAvis provide?

pyLDAvis is well suited to visualizing topic models and adds real quality to one's interaction with the results of topic classification. It is equipped with a slider to adjust the relevance metric. As can be seen in the following figure, all of the segregated topics are shown and one can choose a topic of interest; according to the relevance metric, the top-30 most relevant terms for that topic are listed, each with the number of times it is mentioned in the topic. Note that non-overlapping topics are better choices, as they contain more unique words.

Visualizing topic-keywords with pyLDAvis
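The relevance metric behind that slider, introduced in Sievert and Shirley’s LDAvis paper, is λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). A toy pure-Python illustration (the probabilities below are made up for the example):

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of word w to topic t, the metric behind pyLDAvis's
    lambda slider: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Toy numbers: both words are equally probable inside the topic,
# but one is rare in the corpus overall and one is common everywhere.
r_rare = relevance(p_w_given_t=0.05, p_w=0.001, lam=0.3)
r_common = relevance(p_w_given_t=0.05, p_w=0.04, lam=0.3)
```

At low λ the corpus-rare word ranks higher, which is exactly what makes the slider useful: it surfaces terms that are distinctive to a topic rather than merely frequent.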

How can the Mallet library help?

To emphasize how essential the Mallet library is, let’s compare word clouds of the topics modeled before and after using it.

As mentioned in the title, this project's main topic is Nature-Based Solutions (NbS) for climate change. The following pictures of 8 topics, the first without and the second with the Mallet library, clearly show this big change. In each topic below, the most frequent words are rendered bolder than the rest.

The Mallet version produces topics that are more meaningful for this project than the version without it. For example, “NbS”, one of the key terms in this project, appears only in the word cloud produced with the Mallet library.

  • Without the Mallet library.
LDA model word cloud without considering the Mallet library
  • With the Mallet library.
Optimized LDA model word cloud using the Mallet library

What is perplexity?

Perplexity measures how surprised the model is when it sees a new dataset. The lower the perplexity, the better the model; equivalently, the higher the log-likelihood in the numerator, the better.

The perplexity in this format is a decreasing function of the log-likelihood L(w); the numerator is the log-likelihood of a set of unseen documents Wd given the topics Phi and the hyperparameter alpha for the topic distribution theta_d of each document.
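Written out explicitly, following the standard formulation (as in the qpleple article listed in the references), this is:

```latex
\mathrm{perplexity}(W_d) \;=\; \exp\!\left(-\,\frac{\mathcal{L}(w)}{\text{count of tokens}}\right),
\qquad
\mathcal{L}(w) \;=\; \log p\!\left(W_d \mid \Phi, \alpha\right)
```

Since the token count is positive, a higher log-likelihood on the held-out documents always means a lower perplexity, which is why the two criteria in the previous paragraph are equivalent.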

If we need to compare models, we can use the likelihood of unseen documents to do so.

We should bear in mind that the likelihood P(Wd|Phi, alpha) of even one document is intractable, and therefore so is perplexity; in practice it is approximated.

What is the coherence score?

The coherence score is a real number quantifying the degree of semantic similarity among the highest-scoring words in a topic. Such measurements help us identify semantically interpretable topics versus those that are merely artifacts of statistical inference. The steps are as follows.

  1. Take the top n most used words in each topic.
  2. Sum all pairwise scores (UCI or UMass) over those words; the result is the coherence score for that topic.
  3. Average the per-topic coherence scores over all topics in the model; that average is the coherence score for the topic model.
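The pairwise step can be sketched with the UMass score, which for an ordered pair of top words (w_i, w_j), with w_j ranked earlier, uses log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents containing the given word(s). A toy pure-Python version (the document set below is invented for illustration):

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum over ordered pairs (w_i, w_j)
    with j < i of log((D(w_i, w_j) + 1) / D(w_j)), where D counts the
    documents that contain the given word(s)."""
    docs = [set(d) for d in documents]

    def d_count(*words):
        # number of documents containing every word in `words`
        return sum(all(w in doc for w in words) for doc in docs)

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # pairs with j < i
        wi, wj = top_words[i], top_words[j]
        score += math.log((d_count(wi, wj) + 1) / d_count(wj))
    return score
```

Averaging this value over all topics then gives the model-level coherence from step 3.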

Also, by using Gensim's CoherenceModel class and specifying the type of coherence of interest, one can compute the current coherence score and perplexity. After supplying the pipeline with the LDA Mallet model, we derive the coherence score for a range of models, and by plotting the number of topics against the coherence score we can optimize the number of topics to improve the quality of the final results. For example, according to the following graph, the best number of topics should be 8. One should bear in mind, though, that if topics repeat heavily in the final results, it is a wise decision to increase the number of topics even though the coherence score may decrease.

Coherence score and number of topics
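Once coherence has been computed for a range of topic counts, the selection logic just described can be sketched in pure Python (the scores below are invented for illustration; in the real pipeline they would come from CoherenceModel):

```python
def pick_num_topics(coherence_by_k, tolerance=0.0):
    """Choose the number of topics with the highest coherence score.
    If several values of k score within `tolerance` of the best, prefer
    the largest of them, mirroring the note above that more topics can
    be worth a small coherence drop when topics keep repeating."""
    best = max(coherence_by_k.values())
    candidates = [k for k, c in coherence_by_k.items() if best - c <= tolerance]
    return max(candidates)
```

With tolerance=0 this is a plain argmax; a small positive tolerance encodes the trade-off between coherence and topic repetition.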

Frequency Distribution of Document Word Counts by Dominant Topic

This important part helps us understand, for each dominant topic, how many related words exist across how many documents. From this we can estimate how those words are scattered across our topics.
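A sketch of the underlying computation, assuming each document already carries a topic distribution from the trained model (the toy documents and distributions below are made up): the dominant topic is the argmax of the distribution, and document word counts are grouped under it.

```python
from collections import defaultdict

def word_counts_by_dominant_topic(documents, topic_distributions):
    """Group document word counts under each document's dominant topic,
    i.e. the topic with the highest probability in its distribution."""
    grouped = defaultdict(list)
    for doc, dist in zip(documents, topic_distributions):
        dominant = max(range(len(dist)), key=dist.__getitem__)
        grouped[dominant].append(len(doc))
    return dict(grouped)
```

Plotting a histogram of each topic's list of counts gives exactly the frequency distribution chart this section describes.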

The word clouds of top N keywords in each topic

Continuing with the word clouds of the top N keywords in each topic, and building on the previous paragraph, we can discuss the distribution of document word counts in “topic 1”, “topic 0”, and “topic 7” as an example.

On average, topics 1, 0, and 7 showed more involvement in the word counts, addressing words like “key, woman, local, community, group, work, project”, “landscape, forest, restoration”, and “water, resource, area, natural, ecosystem”.

Meanwhile, others like “topic 2” and “topic 5”, which had less involvement, were more focused on “practice, farmer, policy, support” and “measure, estimate, model, report, time” respectively.

Therefore, we can summarize that most of our PDF files address the issues and cases that we need to talk about in NbS.

The word counts of topic keywords

We discussed earlier that, according to our analysis, topics 1, 0, and 7 on average play the dominant role in our model. Below we look at the 10 most used words among each topic's keywords, with their word counts and weights. On average, the word counts for topics 1, 0, and 7 are higher than the rest, closer to roughly 900 or 1000. Topic 6 has considerable counts for the words “climate” and “change”, but they decrease quickly for the rest.
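The numbers behind such a chart can be sketched like this, assuming we have a topic-to-word-weight table from the model and a tokenized corpus (both invented here): for each topic's top keywords, we pair the keyword's weight in the topic with its raw count across all documents.

```python
from collections import Counter

def keyword_counts_and_weights(topic_word_weights, documents, top_n=10):
    """For each topic, list its `top_n` highest-weight keywords as
    (word, corpus word count, topic weight) tuples."""
    counts = Counter(w for doc in documents for w in doc)
    table = {}
    for topic, weights in topic_word_weights.items():
        top = sorted(weights, key=weights.get, reverse=True)[:top_n]
        table[topic] = [(w, counts[w], weights[w]) for w in top]
    return table
```

Bar-plotting count and weight side by side for each keyword reproduces the word-counts-of-topic-keywords figure described above.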

A t-SNE clustering chart

As discussed, topics 1, 0, and 7 dominated word usage over the rest, so we need more clarification. One approach that can help us find the most dominant topic is plotting a clustering chart.

In the following figure, the usage of all 8 topics is plotted. Clearly, topic 1 dominates the rest by a considerable amount, followed by topic 0, while topic 7 is not as dominant as we expected.

Conclusion and Summary

In this article, we covered the benefits of topic modeling, the pros and cons of topic modeling with Gensim LDA, and some use-case examples; we introduced pyLDAvis and the Mallet library; we described perplexity and the coherence score; and we walked through the frequency distribution of document word counts by dominant topic, the word clouds of the top n keywords in each topic, the word counts of topic keywords, and finally a t-SNE chart for clarification.

References:

https://github.com/bmabey/pyLDAvis

http://qpleple.com/perplexity-to-evaluate-topic-models/

About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.
