Topic Modeling with Latent Dirichlet Allocation (LDA)

Sandeep Panchal
Published in Analytics Vidhya
7 min read · Jun 14, 2020

My warm welcome to all the readers!

Contents:

1. Introduction to Latent Dirichlet Allocation (LDA)
2. Assumptions of LDA
3. LDA on NPR data set
4. Visualization of the Topics

1. Introduction to Latent Dirichlet Allocation (LDA):

LDA stands for Latent Dirichlet Allocation. As time passes, data is growing exponentially. Most of this data is unstructured, and only a few of them are labeled; labeling each and every document manually is a tedious task. How can we label such a huge amount of data if not manually? Here LDA comes to our rescue. LDA is a topic modeling technique used to analyze large amounts of text, cluster documents into similar groups, and label each group. It should be noted that LDA is an unsupervised learning technique that labels data by grouping documents into similar topics. Unlike K-Means and other clustering techniques, which rely on the distance from a cluster center, LDA works on the probability distribution of topics belonging to each document.

2. Assumptions of LDA:

2.1 Topics are probability distributions over words:

A topic is a probability distribution over words. Suppose we have 2 topics: Healthcare and Politics. Words like medicine, injection, and oxygen will have a high probability under the Healthcare topic, while words like election and voting will have a low probability under it. On the other hand, words like election, voting, and party have a high probability under the Politics topic, while words like medicine and injection will have a low probability. In this way, each topic concentrates high probability on a group of related words. Refer to the below image for a clear understanding.

Source: https://www.udemy.com/share/101YmOBEYTcVxUQX4=/

2.2 Documents are probability distributions over topics:

Each document, in turn, is a probability distribution over topics. As in the above example, we have 2 topics: Healthcare and Politics. Since a document is a mixture of words, and each word has a probability of belonging to each topic, every document ends up with a probability distribution over the topics. Confusing, right? Even I was confused in the beginning, but after pondering over it for some time, I was able to understand it. Let me try to explain in other terms. Consider a document that states: “We have observed a lot of patients recovered last month. Government’s fund has increased the supply of medicine.” If we read the statement, it sounds more related to Healthcare than to Politics. Though the document contains words like ‘Government’s fund’, which relate to Politics, their probability is lower than that of words like patients, recover, and medicine, so the document can be labeled ‘Healthcare’. Refer to the below image for a clear understanding.

Source: https://www.udemy.com/share/101YmOBEYTcVxUQX4=/

3. LDA on NPR data set:

I am taking the NPR data set to understand LDA: how documents can be grouped into similar topics and labeled accordingly. Refer to the below image to see the size and the head of the data set.

3.1 Import the Data Set:

data set image
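As a minimal sketch, assuming the data set is saved as ‘npr.csv’ with the article text in an ‘Article’ column (both names are assumptions; adjust them to your copy of the data), the loading step looks like this:

import pandas as pd

# Load the NPR articles (file and column names are assumptions)
npr = pd.read_csv('npr.csv')

print(npr.shape)  # size of the data set: (rows, columns)
npr.head()        # head of the data set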

3.2 Vectorization of Text Data:

I am using Bag of Words (CountVectorizer) to vectorize the text data. We can even use other vectorization techniques like TF-IDF, Word2Vec, etc. I have set the parameter ‘max_df’ to 0.90 so as to discard words that appear in more than 90% of the documents, and ‘min_df’ to 2 so as to include only those words that appear in at least 2 documents. Also, I am removing stopwords by setting stop_words to ‘english’.

FYI: As our objective is only to understand the basic application of LDA for topic modeling, I am not doing any Exploratory Data Analysis. In addition, I am not doing data pre-processing such as lemmatization, stemming, punctuation removal, etc. One can do Exploratory Data Analysis, data pre-processing, etc., to get a better understanding of the data and a better result.

Count Vectorizer
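A sketch of this vectorization step, with the parameters described above:

from sklearn.feature_extraction.text import CountVectorizer

# max_df=0.90  -> discard words appearing in more than 90% of documents
# min_df=2     -> keep only words appearing in at least 2 documents
# stop_words   -> remove common English stopwords
cv = CountVectorizer(max_df=0.90, min_df=2, stop_words='english')

# Build the sparse document-term matrix from the article text
dtm = cv.fit_transform(npr['Article'])
print(dtm)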

From the above image, we can see the sparse document-term matrix with a vocabulary of 54777 words.

3.3 LDA on Text Data:

Time to start applying LDA to allocate documents to similar topics. Here, it should be noted that the choice of the number of topics (n_components) largely depends on the individual’s domain knowledge and understanding of the data set. Here, I am choosing 5 topics. Refer to the below image for the code.

LDA code
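The fitting step boils down to something like this; random_state is an arbitrary seed I am adding so the topics are reproducible:

from sklearn.decomposition import LatentDirichletAllocation

# n_components=5 -> the number of topics to extract
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_model.fit(dtm)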

Refer to the below image for the number of topics produced and the feature names (the vocabulary of words).

Topics and feature names

From the above image, we can see that LDA has created 5 topics, as defined, and 54777 feature names, which correspond to the columns of the document-term matrix.

3.4 Topic Analysis with Word Distribution:

Refer to the below image to analyze the words in each topic. I will explain what I have done below the image.

Topics analysis
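In code, the analysis looks roughly like this (newer scikit-learn versions replace get_feature_names() with get_feature_names_out()):

# lda_model.components_ has one row per topic and one weight per word;
# argsort() sorts ascending, so the last 50 indices are the top words
feature_names = cv.get_feature_names()

for index, topic in enumerate(lda_model.components_):
    print(f'Top 50 words for topic #{index}:')
    print([feature_names[i] for i in topic.argsort()[-50:]])
    print('\n')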

I have printed the top 50 words that have the highest probability of belonging to each topic. Every topic is a distribution over all 54777 words. I have first sorted the word weights and obtained their indices with ‘argsort()’. Using the obtained indices, I have retrieved the feature names from the end of the sorted array, which represent the words with the highest probability for that topic.

If we read a few words, we can see that topic-0 talks about Healthcare, topic-1 talks about Entertainment, topic-2 talks about Education, topic-3 talks about Politics, and topic-4 talks about the Military.

Let us print the 1st row to check how it is allocated to the topics with the probability distribution. Refer to the below image.

Testing one row
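Roughly, the check uses transform(), which returns the topic probabilities for every document:

# Row 0 holds the topic distribution of the first article
topic_results = lda_model.transform(dtm)

print(npr['Article'][0][:90])     # beginning of the first article
print(topic_results[0].round(2))  # e.g. ~0.92 probability on topic 3
print(topic_results[0].argmax())  # index of the most probable topic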

If we read the text ‘In the Washington of 2016, even when the policy can be bipartisan, the politics cannot…’, it clearly represents a political statement. In the second cell of the above image, we can see the probability distribution over the topics: the text belongs to topic-3 with a 92% probability. As per our analysis from the ‘Topics analysis’ image, topic-3 indeed corresponds to Politics. Perfect!

3.5 Assigning the topics:

Refer to the below code to assign topics to each row.

Assigning topics
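One way to sketch the assignment; the ‘Topic’ column name is my own choice:

# argmax(axis=1) picks the most probable topic for each article
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()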

3.6 Mapping Topic Names:

Refer to the below code for mapping the topic names.

Mapping Topic Names
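A sketch of the mapping, using the labels we inferred during the topic analysis:

# Map each topic index to the label inferred from its top words
topic_names = {0: 'Healthcare', 1: 'Entertainment', 2: 'Education',
               3: 'Politics', 4: 'Military'}
npr['Topic Name'] = npr['Topic'].map(topic_names)
npr.head()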

4. Visualization of the Topics:

4.1 Visualization with word cloud:

topic-0

Word cloud for topic-0. We can see words like brain science, researchers, patients, disease, insurance, etc. This relates to the Healthcare topic.

topic-1

Word cloud for topic-1. We can see words like song, story, family, film, music, love, etc. This relates to the Entertainment topic.

topic-2

Word cloud for topic-2. We can see words like student, university, school, college, work, etc. This relates to the Education topic.

topic-3

Word cloud for topic-3. We can see words like congress, republican, new state, democrats, government, committee, etc. This relates to the Politics topic.

topic-4

Word cloud for topic-4. We can see words like forces, ISIS, police, security, Korea, refugees, etc. This relates to the Military topic.
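The word clouds above can be reproduced with a sketch along these lines; plot_topic_wordcloud is a hypothetical helper built on the wordcloud and matplotlib packages, reusing lda_model and feature_names from the earlier steps:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_topic_wordcloud(topic_index, top_n=50):
    # Build a {word: weight} dictionary from the topic's top words
    topic = lda_model.components_[topic_index]
    top_indices = topic.argsort()[-top_n:]
    freqs = {feature_names[i]: topic[i] for i in top_indices}
    wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'topic-{topic_index}')
    plt.show()

plot_topic_wordcloud(0)  # Healthcare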

4.2 Visualization with pyLDAvis:

To install pyLDAvis, run the below command in your notebook cell, command prompt, or Anaconda prompt.

!pip install pyLDAvis


Refer to the below image for the code.

pyLDAvis code
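A sketch of the visualization step (note that pyLDAvis 3.4+ renamed the pyLDAvis.sklearn module to pyLDAvis.lda_model):

import pyLDAvis
import pyLDAvis.sklearn

# Prepare the interactive panel from the fitted model,
# the document-term matrix, and the vectorizer
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, dtm, cv)
panel  # renders the interactive topic map in the notebook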

Output video: Video of the visualization with pyLDAvis

For the full code, refer to the GitHub repository- Topic-Modeling-with-LDA.

Reference:

  1. https://www.udemy.com/share/101YmOBEYTcVxUQX4=/

Connect me:

  1. LinkedIn: https://www.linkedin.com/in/sandeep-panchal-682734111/
  2. GitHub: https://github.com/Sandeep-Panchal

Thank you all for reading this blog. Your suggestions are very much appreciated!
