What is Topic Modeling?

Ankur Dhuriya
Published in Analytics Vidhya
5 min read · Feb 1, 2021

Topic modeling is a statistical modeling technique used to discover the abstract topics discussed in a set of documents. By construction, it creates topics in an unsupervised manner.

The statistical approach generally assumes that each document talks about a few different topics, while each topic is represented by a distribution over words. That is, the documents are assumed to have a two-level structure: a document = [topic 1, topic 2, topic 3, …, topic N], and in turn topic 1 = [w 1, w 2, …, w N].
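To make the two-level structure concrete, here is a minimal sketch in Python. The topic names, words, and probabilities are hypothetical toy values chosen only for illustration; the point is that the probability of a word in a document is obtained by mixing the topics' word distributions with the document's topic weights.

```python
# Hypothetical toy numbers, chosen only to illustrate the two-level structure.

# Each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"game": 0.1, "team": 0.2, "vote": 0.7},
}

# Each document is a probability distribution over topics.
document = {"sports": 0.75, "politics": 0.25}


def word_probability(word, document, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0)
               for topic, weight in document.items())


word_probs = {w: word_probability(w, document, topics)
              for w in ("game", "team", "vote")}
print(word_probs)
```

Because both levels are proper probability distributions, the resulting word probabilities for the document also sum to one.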

Topic modeling is therefore generally performed by counting words, their proportions, and related indicators. I will describe two models based on this idea: LSA (latent semantic analysis) and LDA (latent Dirichlet allocation).

Topic modeling is therefore an unsupervised machine learning method: it models topics from unlabelled data, so it needs no labelled training examples. For the same reason, topic modeling is a rough first approach rather than a sophisticated end solution.

Topic Classification Modeling
While topic modeling is unsupervised, for more accurate and systematic use we need to train models on our own custom topics. For example, if you are building a model to categorize support tickets, you may want to assign specific, predefined topics to the tickets rather than derive topics from the tickets as we do in topic modeling. Topic classification is therefore an ordinary text classification task, in which we classify each text into one of several known topics.
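As a rough illustration of the supervised side, the sketch below trains a tiny multinomial naive Bayes classifier on labelled tickets. Naive Bayes is only one possible choice of classifier, and the tickets, the topic labels, and the `classify` helper are hypothetical, introduced here for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled support tickets; the topic labels are chosen by a human.
training = [
    ("cannot log in to my account", "login"),
    ("password reset link not working", "login"),
    ("charged twice on my invoice", "billing"),
    ("refund for wrong invoice amount", "billing"),
]

word_counts = defaultdict(Counter)   # topic -> word frequencies
topic_counts = Counter()             # topic -> number of training tickets
for text, topic in training:
    topic_counts[topic] += 1
    word_counts[topic].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}


def classify(text):
    """Return the topic with the highest naive Bayes log-probability."""
    scores = {}
    for topic in topic_counts:
        total = sum(word_counts[topic].values())
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(topic_counts[topic] / len(training))
        for word in text.split():
            score += math.log((word_counts[topic][word] + 1)
                              / (total + len(vocabulary)))
        scores[topic] = score
    return max(scores, key=scores.get)


print(classify("invoice refund please"))
print(classify("reset my password"))
```

Unlike topic modeling, the set of topics here is fixed in advance by the labels in the training data.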

Difference between Topic Modeling and Topic Classification Modeling
As described above, topic classification is a supervised process, whereas topic modeling is an unsupervised one. In a typical modeling effort, we start with topic modeling to find the topics that come up naturally. We then create human-curated topics, which we use for topic classification.

The two different topic modeling techniques:

In this article, we will discuss two common ways to do topic modeling. These two methods are:

  1. LSA: latent semantic analysis
  2. LDA: latent Dirichlet allocation

LSA: latent semantic analysis

LSA uses a bag-of-words representation and a term-document matrix to detect topics. The assumption behind LSA is that different topics have different distributions of words, and different documents have different distributions of topics.

To turn this idea into mathematics, we first create a bag of words from the corpus. Then we create a word-document frequency matrix with words in the rows and documents in the columns. A cell (i, j) of the matrix therefore denotes how many times the word w_i occurs in the document D_j.
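A minimal sketch of building such a word-document frequency matrix is shown below. The three-document corpus and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
corpus = [
    "the team won the game",
    "the vote was close",
    "the team lost the vote",
]

# Tokenize naively on whitespace; real pipelines usually also remove stop words.
tokenized = [doc.split() for doc in corpus]

# The vocabulary (bag of words) is the sorted set of all distinct tokens.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Rows are words, columns are documents: matrix[i][j] counts word i in document j.
counts = [Counter(doc) for doc in tokenized]
matrix = [[c[word] for c in counts] for word in vocabulary]

for word, row in zip(vocabulary, matrix):
    print(f"{word:>6} {row}")
```

Each row of `matrix` corresponds to one word w_i and each column to one document D_j, matching the layout described above.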

Now we use the above assumption in LSA to factor this matrix into three matrices, which are:

  1. word-topic matrix: in this matrix, words are in the rows and topics in the columns. Each cell denotes the weight of a word in the respective topic. [m x k matrix, for m words and k topics]
  2. topic importance matrix: this is a diagonal matrix, where the i-th diagonal element denotes the importance of topic i. [k x k matrix]
  3. topic-document matrix: in this matrix, topics are in the rows and documents in the columns. Each cell denotes the weight of a topic in the respective document. [k x n matrix, for n documents]
Source: https://goo.gl/images/Fsw2ak

For this factorization we use singular value decomposition (SVD). Using SVD, we factorize a matrix M as follows:

M = UDV

where U is a unitary matrix, D is a diagonal matrix, and V is again a unitary matrix. The middle factor is thus diagonal by construction. You may ask why we use this specific matrix factorization.

The answer is that, from a linear algebra point of view, SVD expresses a linear map in a different pair of bases, in which it becomes the diagonal matrix D. Because U and V are unitary (for real matrices, their transposes are their inverses), we can write:

M = UDV  =>  U^T M V^T = D
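The identity can be checked numerically. The sketch below assumes NumPy is available and uses a hypothetical 4-word by 3-document count matrix. Note that `np.linalg.svd` returns the right factor already in the form that multiplies D from the right, so it plays the role of V in M = UDV.

```python
import numpy as np

# Hypothetical 4-word x 3-document count matrix (rows: words, columns: documents).
M = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 0.0],
])

# full_matrices=False returns the compact factors: U is 4x3, s has 3 entries, V is 3x3.
# In this compact form U has orthonormal columns, so U.T @ U is the identity.
U, s, V = np.linalg.svd(M, full_matrices=False)
D = np.diag(s)

reconstructed = U @ D @ V        # should equal M up to rounding
diagonalized = U.T @ M @ V.T     # should equal D up to rounding

print(np.allclose(reconstructed, M))
print(np.allclose(diagonalized, D))
```

The entries of `s` are the singular values, returned in descending order.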

This can be interpreted as follows: the term frequency matrix is re-expressed in a basis made of topics (which are linear combinations of words), so that the direct map between documents and words is replaced by a map from documents to topics followed by a map from topics to words.

Since SVD places the singular values (a kind of importance value for each direction of the new basis) on the diagonal of D, the diagonal matrix in this case holds the weights of the topics, which are the new basis vectors.
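Keeping only the largest singular values gives a small number of topics. The following sketch, again assuming NumPy and a hypothetical toy matrix and word list, keeps k = 2 topics and lists the words with the largest loadings in each. The choice of k is an assumption left to the analyst.

```python
import numpy as np

words = ["game", "team", "vote", "ballot"]   # hypothetical vocabulary
M = np.array([
    [2.0, 0.0, 1.0],   # game
    [1.0, 0.0, 1.0],   # team
    [0.0, 2.0, 1.0],   # vote
    [0.0, 1.0, 0.0],   # ballot
])

U, s, V = np.linalg.svd(M, full_matrices=False)

k = 2                           # number of topics to keep (chosen by the analyst)
word_topic = U[:, :k]           # words x topics
topic_weights = s[:k]           # importance of each kept topic
topic_document = V[:k, :]       # topics x documents

# Rank-k approximation of the original matrix.
M_k = word_topic @ np.diag(topic_weights) @ topic_document

for t in range(k):
    # Signs of singular vectors are arbitrary, so rank words by absolute loading.
    order = np.argsort(-np.abs(word_topic[:, t]))
    top = [words[i] for i in order[:2]]
    print(f"topic {t}: weight={topic_weights[t]:.2f}, top words={top}")
```

The approximation error of `M_k` in the Frobenius norm is determined entirely by the singular values that were dropped, which is why they can be read as topic importance.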

LDA: latent Dirichlet allocation

LDA is a significant improvement over LSA in that LSA has no probabilistic model of how documents are structured. Adding such a model is the major change in LDA.

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document w in a corpus D:

  1. Choose N, the number of words in the document, from a Poisson distribution with lambda = e
  2. Choose theta, the topic mixture of the document, from a Dirichlet distribution with alpha = a
  3. For each of the N words w_n (n = 1, …, N): choose a topic z_n from a Multinomial distribution with parameter theta, then choose the word w_n from p(w_n | z_n, beta), a multinomial probability conditioned on the topic z_n
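The three steps above can be sketched with the Python standard library alone. The helper names (`sample_poisson`, `sample_dirichlet`, `generate_document`), the vocabulary, and all parameter values are hypothetical and chosen only for illustration. Here `beta[k]` is assumed to hold the word distribution of topic k.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's method: count uniform draws until their product falls below exp(-lam)."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(lam, alpha, beta, vocabulary, rng):
    n_words = sample_poisson(lam, rng)        # step 1: document length N
    theta = sample_dirichlet(alpha, rng)      # step 2: topic mixture theta
    words = []
    for _ in range(n_words):                  # step 3: one topic, then one word
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(vocabulary, weights=beta[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)
vocabulary = ["game", "team", "vote", "ballot"]
beta = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0 only emits sports words
    [0.0, 0.0, 0.5, 0.5],   # topic 1 only emits voting words
]
theta, words = generate_document(lam=8, alpha=[1.0, 1.0], beta=beta,
                                 vocabulary=vocabulary, rng=rng)
print(theta)
print(words)
```

Running the generator many times with the same alpha and beta yields documents whose topic mixtures differ but whose topics share the same word distributions.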

From this, we can derive the theoretical probability distribution of the unigrams as a function of alpha and beta. We can also compute the observed distribution of the unigrams from the actual dataset.

Finally, to solve the problem, we optimize alpha and beta to reduce the Kullback-Leibler divergence between the theoretical and the observed probability distributions.
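Following the description above, here is a rough sketch of comparing a theoretical unigram distribution with an observed one. It assumes the theoretical probability of a single word is the average of the topics' word distributions weighted by the expected topic mixture, alpha_k / sum(alpha). The function names, tokens, and parameter values are hypothetical. Practical LDA implementations typically fit their parameters with variational inference or Gibbs sampling rather than this direct comparison.

```python
import math
from collections import Counter


def expected_unigram(alpha, beta):
    """Marginal word distribution: sum_k E[theta_k] * beta[k][w],
    where E[theta_k] = alpha_k / sum(alpha) for a Dirichlet prior."""
    total = sum(alpha)
    n_words = len(beta[0])
    return [sum(a / total * topic[w] for a, topic in zip(alpha, beta))
            for w in range(n_words)]


def kl_divergence(p, q):
    """KL(p || q) = sum_w p_w * log(p_w / q_w); terms with p_w == 0 contribute 0.
    Assumes q_w > 0 wherever p_w > 0."""
    return sum(pw * math.log(pw / qw) for pw, qw in zip(p, q) if pw > 0)


vocabulary = ["game", "team", "vote", "ballot"]
observed_tokens = "game team game vote ballot vote game team".split()
counts = Counter(observed_tokens)
observed = [counts[w] / len(observed_tokens) for w in vocabulary]

# Hypothetical topic-word distributions (rows: topics, columns: vocabulary words).
beta = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

divergences = {}
for alpha in ([1.0, 1.0], [5.0, 3.0]):
    theoretical = expected_unigram(alpha, beta)
    divergences[tuple(alpha)] = kl_divergence(observed, theoretical)
    print(alpha, divergences[tuple(alpha)])
```

In this toy setup the second choice of alpha gives a lower divergence than the first, so an optimizer would prefer it.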
