Topic Modeling: Art of Storytelling in NLP

Mageshwaran R · Published in Technovators · Dec 26, 2019

Topic Modeling is an unsupervised approach to discovering the latent (hidden) semantic structure of text data (often referred to as documents).

source: https://www.thinkinfi.com/2019/02/lda-theory.html

Why Topic Modeling?

Each document is built as a hierarchy, from words to sentences to paragraphs to the full document. Extracting topics from documents therefore helps us analyze our data and brings more value to our business. Isn’t it great to have an algorithm that does all the work for you? Yes!! Topic modeling is an automated technique that requires no labeling/annotations. Given a bunch of documents, it gives you an intuition about the topics (the story) your documents deal with.

Applications of Topic Modeling:

  • Document Clustering
  • Information Retrieval
  • Feature extractor for text classification

Beyond these applications, topic modeling is a good starting point for understanding your data. Later in this post, we will discuss understanding documents by visualizing their topics and word distributions.

Before we start, here are the basic assumptions:

  • Each document is viewed as a mixture of topics
  • Each topic is viewed as a mixture of words

With these basics in place, let us first explore various topic modeling techniques, and at the end we’ll look into the implementation of Latent Dirichlet Allocation (LDA), the most popular topic modeling technique. If you’re already familiar with LSA and pLSA and are looking for a detailed explanation of LDA or its implementation, feel free to skip the next two sections and start with LDA.

Latent Semantic Analysis

LSA creates a vector-based representation of text by capturing the co-occurrences of words and documents.

Breaking down documents into topics and words

Here, a document-term matrix of M documents over a vocabulary V is approximated by the product of two smaller matrices: a topic-assignment (document-topic) matrix and a word-topic matrix.

Steps:

  • Build a Document-Term Matrix (X), where each entry Xᵢⱼ is the raw count of the j-th word in the i-th document. In practice, however, we use a TF-IDF vectorizer to assign a weight to each word in each document.
  • Apply truncated Singular Value Decomposition (SVD) to find a few latent topics that capture the relationships between words and documents.

We then keep only the top-k topics, i.e. X ≈ Uₖ * Sₖ * Vₖᵀ, where the rows of Uₖ * Sₖ represent documents in topic space and the rows of Vₖᵀ represent topics in word space.
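A minimal sketch of these two steps with scikit-learn, assuming docs is a list of raw document strings (the variable names are illustrative, not the original code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Step 1: build the weighted document-term matrix X (TF-IDF weights instead of raw counts)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Step 2: truncated SVD keeps only the top-k latent topics
svd = TruncatedSVD(n_components=5, random_state=100)
doc_topic = svd.fit_transform(X)   # rows of Uk * Sk: documents in topic space
topic_word = svd.components_       # Vk^T: topics in word space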

However, although LSA was the first topic model and is efficient to compute, it lacks interpretability.

Probabilistic Latent Semantic Analysis

pLSA is an improvement over LSA: it is a generative model that aims to find latent topics in documents by replacing the SVD step of LSA with a probabilistic model.

source: https://thesai.org/Publications/ViewPaper?Volume=6&Issue=1&Code=ijacsa&SerialNo=21

Steps:

  • Select a document dᵢ with probability P(dᵢ)
  • Pick a latent class Zₖ with probability P(Zₖ|dᵢ)
  • Generate a word with probability P(wⱼ|Zₖ)
source: LDA paper

In simple terms: we first sample a document, then based on the document we sample a topic, and based on the topic we sample a word. This means d and w are conditionally independent given the hidden topic z.
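Written as an equation (reconstructed from the three sampling steps above), the joint probability of observing a document-word pair is:

P(dᵢ, wⱼ) = P(dᵢ) * Σₖ P(Zₖ|dᵢ) * P(wⱼ|Zₖ)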

Limitations:

  • Since “d” is a multinomial random variable defined over the training documents, the model learns P(z|d) only for the documents it was trained on. It is therefore not fully generative and cannot assign a probability to an unseen document.
  • The number of parameters is on the order of k|V| + k|D|, so it grows linearly with the number of documents, which makes the model prone to overfitting. In practice, a “tempering” heuristic is used to smooth the model parameters and reduce overfitting.

Latent Dirichlet Allocation

LDA is a Bayesian version of pLSA.

The Dirichlet distribution is a multivariate generalization of the beta distribution. Basically, a Dirichlet is a “distribution over distributions”.

LDA uses Dirichlet priors for the document-topic and topic-word distributions.

Graphical representation of LDA

Assumption

  • A document contains a mixture of topics, but typically one topic in the document carries more weight than the others
  • So we are more likely to choose a mixture of topics in which one topic dominates

Steps

  • Randomly sample a topic distribution (θ) from a Dirichlet distribution with parameter α
  • Randomly sample a word distribution (φ) for each topic from another Dirichlet distribution with parameter β
  • For each word position, sample a topic (z) from the distribution θ
  • Sample a word (w) from the word distribution φ of the chosen topic z

This is how it assumes each word is generated in the document.

Our goal is to estimate the parameters φ and θ so as to maximize p(w; α, β). The main advantage of LDA over pLSA is that it generalizes well to unseen documents.
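Putting the generative steps together (following the original LDA paper), the joint distribution for a single document with N words is:

p(θ, z, w | α, β) = p(θ|α) * Πₙ p(zₙ|θ) * p(wₙ|zₙ, β)

The marginal p(w | α, β) that we maximize is obtained by integrating this over θ and summing over all topic assignments z.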

Implementation of LDA in python

I will be using the 20 Newsgroups dataset for this implementation. I have reviewed and used this dataset in previous work, so I knew the main topics beforehand and could verify whether LDA correctly identifies them.

This dataset is available in sklearn and can be downloaded as follows:

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Below are the categories of news data:

from pprint import pprint
pprint(list(newsgroups_train.target_names))
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']

Basically, they can be grouped into the topics below:

  • Politics
  • Science
  • Computers and Technology
  • Religion
  • Sports etc

Let’s start with our implementation of LDA.

Data pre-processing

LDA requires some basic pre-processing of the text data, and the pre-processing steps below are common to most NLP tasks (feature extraction for machine learning models):

text-preprocessing
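A minimal sketch of such a pipeline using gensim and NLTK: tokenization with lowercasing, stop-word removal, and lemmatization (the preprocess helper and the texts variable are illustrative names):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(doc):
    # Tokenize, lowercase, and strip punctuation/accents,
    # then drop stop words and lemmatize the remaining tokens
    tokens = simple_preprocess(doc, deacc=True)
    return [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

# Apply to the raw newsgroup posts loaded earlier
texts = [preprocess(doc) for doc in newsgroups_train.data]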

The next step is to convert the pre-processed tokens into a dictionary that maps each word to an index and tracks its count in the corpus. We can use the gensim package to create this dictionary and then to build the bag-of-words corpus.

text2bow
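A minimal sketch with gensim, assuming texts is the list of pre-processed token lists from the previous step (the filter_extremes thresholds are illustrative):

from gensim import corpora

# Map each unique token to an integer id and track its corpus-wide counts
dictionary = corpora.Dictionary(texts)

# Optionally drop very rare and very common tokens
dictionary.filter_extremes(no_below=10, no_above=0.5)

# Convert each document into a bag-of-words: a list of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(text) for text in texts]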

Running LDA

Now it’s time for us to run LDA, and it’s quite simple thanks to the gensim package. We need to specify the number of topics to allocate. We can set the Dirichlet parameters alpha and eta (beta) to “auto” and gensim will take care of tuning them. Let’s start with 5 topics; later we’ll see how to evaluate the LDA model and tune its hyper-parameters.

import gensim

# Create an LDA model with the gensim library.
# Manually pick the number of topics first,
# then tune it based on perplexity/coherence scores.
lda_model = gensim.models.LdaModel(bow_corpus,
                                   id2word=dictionary,
                                   num_topics=5,
                                   offset=2,
                                   random_state=100,
                                   update_every=1,
                                   passes=10,
                                   alpha='auto',
                                   eta='auto',
                                   per_word_topics=True)

Visualizing the topics:

First, let’s print the topics learned by the model.

from pprint import pprint
pprint(lda_model.print_topics())
[(0, # Seems to be Computer and Technology
'0.014*"key" + 0.007*"chip" + 0.006*"encryption" + 0.006*"system" + '
'0.005*"clipper" + 0.005*"article" + 0.004*"university" + '
'0.004*"information" + 0.004*"government" + 0.004*"time"'),
(1, # Seems to be Science and Technology
'0.008*"drive" + 0.007*"university" + 0.007*"window" + 0.007*"system" + '
'0.006*"doe" + 0.005*"card" + 0.005*"thanks" + 0.005*"space" + '
'0.004*"article" + 0.004*"computer"'),
(2, # seems to be politics
'0.010*"people" + 0.006*"gun" + 0.006*"armenian" + 0.005*"time" + '
'0.005*"article" + 0.005*"then" + 0.005*"israel" + 0.004*"war" + '
'0.004*"government" + 0.004*"israeli"'),
(3, # seems to be sports
'0.013*"game" + 0.011*"team" + 0.008*"article" + 0.007*"university" + '
'0.006*"player" + 0.006*"time" + 0.005*"play" + 0.005*"season" + '
'0.004*"hockey" + 0.004*"win"'),
(4, # seems to be religion
'0.018*"god" + 0.011*"people" + 0.008*"doe" + 0.008*"christian" + '
'0.007*"jesus" + 0.006*"believe" + 0.006*"then" + 0.006*"article" + '
'0.005*"life" + 0.005*"time"')]

I have manually mapped the topics (see the comments) to the five categories mentioned earlier, and we can see that LDA does a pretty good job here.

Let’s visualize the LDA model with the pyLDAvis tool:

import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
vis

pyLDAvis is an interactive visualization tool: you can see the distance between topics (left part of the visualization) and, by selecting a particular topic, the distribution of its words in the horizontal bar chart (right part).

Evaluating LDA

There are two measures that best describe the performance of an LDA model:

  • perplexity
  • coherence

Perplexity is a measure of uncertainty: the lower the perplexity, the better the model. We can calculate it as follows (gensim’s log_perplexity actually returns a per-word likelihood bound, and perplexity is 2 raised to the negative of that bound):

print('Perplexity: ', lda_model.log_perplexity(bow_corpus))

Even though perplexity is used in most language modeling tasks, optimizing a topic model for perplexity alone does not necessarily yield human-interpretable topics. Coherence can be used instead to make the evaluation interpretable.

Coherence measures the semantic similarity between the top words of a topic: the higher the coherence, the better the model. It can be measured as follows:

from gensim import models

# X is the list of pre-processed, tokenized documents (the same texts used to build the dictionary)
coherence_model_lda = models.CoherenceModel(model=lda_model, texts=X,
                                            dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Tuning LDA model

Given the ways to measure perplexity and coherence, we can use grid-search-based optimization to find the best values for the following (a minimal sketch follows the list):

  • Number of topics (K)
  • Dirichlet parameter alpha
  • Dirichlet parameter beta (called eta in gensim)
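A minimal sketch of such a search over the number of topics, reusing bow_corpus, dictionary, and the pre-processed texts X from the earlier steps (the candidate values of K are illustrative, and the same loop can be extended to alpha and eta):

# Try a few values of K and keep the model with the highest coherence
best_model, best_k, best_score = None, None, float('-inf')
for k in [5, 10, 15, 20]:
    model = gensim.models.LdaModel(bow_corpus, id2word=dictionary,
                                   num_topics=k, passes=10,
                                   alpha='auto', eta='auto', random_state=100)
    cm = models.CoherenceModel(model=model, texts=X,
                               dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    print('K =', k, 'coherence =', score)
    if score > best_score:
        best_model, best_k, best_score = model, k, score

print('Best number of topics:', best_k)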

I hope you have enjoyed this post. For more, please find the complete code on my GitHub; I encourage you to pull it and try it out.

