Topic Modeling with Latent Dirichlet Allocation


In the modern internet and social media age, people's opinions, reviews, and recommendations have become a valuable resource for political science and businesses. Thanks to modern technologies, we are now able to collect and analyze such data very efficiently.

We will delve into sentiment analysis and learn how to use topic modeling to assign movie reviews to different categories. We are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb).

This article is an excerpt from the book Python Machine Learning, Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. The new third edition is updated for TensorFlow 2.0 and covers GANs, reinforcement learning, and other popular Python libraries. In this article, we will discuss a popular technique for topic modeling called Latent Dirichlet Allocation (LDA).

Topic modeling describes the broad task of assigning topics to unlabeled text documents. For example, a typical application would be the categorization of documents in a large text corpus of newspaper articles. In applications of topic modeling, we then aim to assign category labels to those articles, for example, sports, finance, world news, politics, local news, and so forth. Thus, in the context of the broad categories of machine learning, we can consider topic modeling as a clustering task, a subcategory of unsupervised learning.

Decomposing text documents with LDA

Since the mathematics behind LDA is quite involved and requires knowledge about Bayesian inference, we will approach this topic from a practitioner’s perspective and interpret LDA using layman’s terms. However, the interested reader can read more about LDA in the following research paper: Latent Dirichlet Allocation, David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Journal of Machine Learning Research 3, pages: 993–1022, Jan 2003.

LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. These frequently appearing words represent our topics, under the assumption that each document is a mixture of different topics. The input to LDA is the bag-of-words model that we discussed earlier in this chapter. Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:

· A document-to-topic matrix

· A word-to-topic matrix

LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices together, we can reproduce the input, the bag-of-words matrix, with the lowest possible error. In practice, we are interested in the topics that LDA finds in the bag-of-words matrix. The only downside is that we must define the number of topics beforehand: the number of topics is a hyperparameter of LDA that has to be specified manually.
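
To make the decomposition concrete, here is a minimal sketch on a tiny made-up corpus (the corpus and variable names are ours, purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['the cat sat on the mat',
        'dogs and cats make great pets',
        'stocks fell as the markets slid',
        'investors bought stocks and bonds']

count = CountVectorizer()
X = count.fit_transform(docs)       # bag-of-words matrix: (n_docs, n_words)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)    # document-to-topic matrix: (n_docs, 2)
topic_word = lda.components_        # word-to-topic matrix: (2, n_words)

print(X.shape, doc_topic.shape, topic_word.shape)

The two factor matrices share the number of topics as their inner dimension, so multiplying them back together yields a matrix of the same shape as the bag-of-words input.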

LDA with scikit-learn

In this subsection, we will use the LatentDirichletAllocation class implemented in scikit-learn to decompose the movie review dataset and categorize it into different topics. In the following example, we will restrict the analysis to 10 different topics, but readers are encouraged to experiment with the hyperparameters of the algorithm to further explore the topics that can be found in this dataset.

First, we are going to load the dataset into a pandas DataFrame using the local movie_data.csv file of the movie reviews that we created at the beginning of this chapter:

import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8')
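
As a quick sanity check (our addition; the column names stem from the CSV file created earlier in the chapter), we can inspect the loaded DataFrame:

print(df.shape)             # expected: (50000, 2)
print(df.columns.tolist())  # expected: ['review', 'sentiment']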

Next, we are going to use the already familiar CountVectorizer to create the bag-of-words matrix as input to the LDA. For convenience, we will use scikit-learn's built-in English stop-word library via stop_words='english':

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english', max_df=.1, max_features=5000)
X = count.fit_transform(df['review'].values)

Notice that we set the maximum document frequency of words to be considered to 10 percent (max_df=.1) to exclude words that occur too frequently across documents. The rationale behind the removal of frequently occurring words is that these might be common words appearing across all documents that are, therefore, less likely to be associated with a specific topic category of a given document. Also, we limited the number of words to be considered to the most frequently occurring 5,000 words (max_features=5000) to limit the dimensionality of this dataset and improve the inference performed by LDA. However, both max_df=.1 and max_features=5000 are hyperparameter values chosen arbitrarily, and readers are encouraged to tune them while comparing the results.
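
For instance, one quick way (our illustration, not from the book) to see how the max_df cutoff alone shapes the vocabulary, before the max_features cap is applied, is to compare the vocabulary sizes for a few settings:

# Vocabulary size surviving different document-frequency cutoffs
for max_df in (0.05, 0.1, 0.5):
    cv = CountVectorizer(stop_words='english', max_df=max_df)
    cv.fit(df['review'].values)
    print('max_df=%.2f -> vocabulary size: %d' % (max_df, len(cv.vocabulary_)))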

The following code example demonstrates how to fit a LatentDirichletAllocation estimator to the bag-of-words matrix and infer the 10 different topics from the documents (note that the model fitting can take up to 5 minutes or more on a laptop or standard desktop computer):

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

By setting learning_method='batch', we let the lda estimator do its estimation based on all available training data (the bag-of-words matrix) in one iteration, which is slower than the alternative 'online' learning method but can lead to more accurate results (setting learning_method='online' is analogous to online or mini-batch learning).
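
For comparison, a minimal sketch of the mini-batch variant might look as follows (batch_size=128 is scikit-learn's default value and appears here only for illustration):

lda_online = LatentDirichletAllocation(n_components=10,
                                       random_state=123,
                                       learning_method='online',
                                       batch_size=128)
X_topics_online = lda_online.fit_transform(X)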

After fitting the LDA, we now have access to the components_ attribute of the lda instance, which stores a matrix containing the word importances (here, for the 5,000 words in our vocabulary) for each of the 10 topics:

lda.components_.shape
(10, 5000)

To analyze the results, let's print the five most important words for each of the 10 topics. Note that argsort returns the indices of the word importance values sorted in increasing order. Thus, to print the top five words, we need to traverse the index array in reverse order:

n_top_words = 5
feature_names = count.get_feature_names()
# note: scikit-learn 1.0+ renames this to count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
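
If the negative-step slice in the loop above looks cryptic, this equivalent form (our rewording, shown for a single topic row) may be easier to read:

# Equivalent to topic.argsort()[:-n_top_words - 1:-1]
top_indices = topic.argsort()[::-1][:n_top_words]  # indices of the largest values first
print(" ".join(feature_names[i] for i in top_indices))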

Based on reading the five most important words for each topic, you may guess that the LDA identified the following topics:

  1. Generally bad movies (not really a topic category)
  2. Movies about families
  3. War movies
  4. Art movies
  5. Crime movies
  6. Horror movies
  7. Comedy movies
  8. Movies somehow related to TV shows
  9. Movies based on books
  10. Action movies

To confirm that the categories make sense based on the reviews, let's print excerpts from three reviews in the horror movie category (horror movies belong to category 6 at index position 5):

horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')

Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal’s three most famous monsters; Dracula, Frankenstein’s Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that …

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? “The Witches’ Mountain” has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it’s also strangely compelling. There’s absolutely nothing that makes sense here and I even doubt there …

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac …

Using the preceding code example, we printed the first 300 characters from the top three horror movie reviews. Even though we don't know exactly which movies the reviews belong to, they sound like reviews of horror movies (however, one might argue that Horror movie #2 could also be a good fit for topic category 1: Generally bad movies).
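
As a further check (our addition, not part of the book's example), each review can be assigned its single most probable topic, which gives a rough picture of how the 50,000 reviews are distributed over the 10 topics:

import numpy as np

most_probable_topic = X_topics.argmax(axis=1)          # per-review topic index, 0-9
print(np.bincount(most_probable_topic, minlength=10))  # number of reviews per topic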

Summary

In this article, we looked at a particular application of machine learning, sentiment analysis, which has become an interesting topic in the internet and social media era. We were introduced to the concept of topic modeling and used LDA to assign the movie reviews to different categories in an unsupervised fashion. Python Machine Learning, Third Edition is a comprehensive guide to machine learning and deep learning with Python.

About the Authors

Sebastian Raschka has many years of experience with coding in Python, and he has given several seminars on the practical applications of data science, machine learning, and deep learning, including a machine learning tutorial at SciPy — the leading conference for scientific computing in Python. He is currently an Assistant Professor of Statistics at UW-Madison focusing on machine learning and deep learning research.

His work and contributions have recently been recognized by the departmental outstanding graduate student award 2016–2017, as well as the ACM Computing Reviews’ Best of 2016 award. In his free time, Sebastian loves to contribute to open source projects, and the methods that he has implemented are now successfully used in machine learning competitions, such as Kaggle.

Vahid Mirjalili obtained his PhD in mechanical engineering working on novel methods for large-scale, computational simulations of molecular structures. Currently, he is focusing his research efforts on applications of machine learning in various computer vision projects at the Department of Computer Science and Engineering at Michigan State University.

While Vahid’s broad research interests focus on deep learning and computer vision applications, he is especially interested in leveraging deep learning techniques to extend privacy in biometric data such as face images so that information is not revealed beyond what users intend to reveal. Furthermore, he also collaborates with a team of engineers working on self-driving cars, where he designs neural network models for the fusion of multispectral images for pedestrian detection.
