Topic modelling in natural language processing is a technique that assigns topics to a given corpus based on the words present. Topic modelling is important because, in a world full of data, it has become increasingly important to categorise documents. For example, if a company receives hundreds of reviews, it is important for the company to know which categories of reviews are more important and which are less so.
In this article, we will see the following:
- LDA
- Hyperparameters in LDA
- LDA in Python
- Shortcomings of LDA
- Alternative
Topics can be thought of as keywords that describe a document. For example, for the topic sports, the words that come to mind are volleyball, basketball, tennis, cricket, etc. A topic model is a model that can automatically detect topics based on the words appearing in a document.
It is important to note that topic modelling is different from topic classification. Topic classification is a supervised learning technique, while topic modelling is an unsupervised learning algorithm.
Some of the well-known topic modelling techniques are:
- Latent Semantic Analysis (LSA)
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)
- Correlated Topic Model (CTM)
In this article, we will focus on LDA.
Latent Dirichlet Allocation
LDA, short for Latent Dirichlet Allocation, is a technique used for topic modelling. First, let us break down the name and understand what LDA means. Latent means hidden, something that is yet to be found. Dirichlet indicates that the model assumes the topics in the documents and the words in those topics follow a Dirichlet distribution. Allocation means giving something, which in this case is topics.
LDA assumes that documents are generated by a statistical generative process, such that each document is a mixture of topics, and each topic is a mixture of words.
In the following figure, the document is made up of 10 words, which can be grouped into 3 different topics, and the three topics have their own describing words.
The general steps in LDA are as follows.
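As a rough, toy-scale sketch of this generative story (the vocabulary, number of topics, and Dirichlet priors below are arbitrary illustrative choices, not part of the original example):

```python
# Toy illustration of LDA's generative process; vocabulary, K, alpha and beta
# are arbitrary choices for demonstration only.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["ball", "team", "score", "election", "vote", "law", "movie", "actor", "plot"]
K, alpha, beta, doc_len = 3, 0.5, 0.1, 8

topic_word = rng.dirichlet([beta] * len(vocab), size=K)  # each topic is a distribution over words
doc_topic = rng.dirichlet([alpha] * K)                   # each document is a mixture of topics

words = []
for _ in range(doc_len):
    z = rng.choice(K, p=doc_topic)                   # pick a topic for this word position
    w = rng.choice(len(vocab), p=topic_word[z])      # pick a word from that topic
    words.append(vocab[w])
print(words)
```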
Hyperparameters in LDA
There are three hyperparameters in LDA
- α → document density factor
- β → topic word density factor
- K → number of topics selected
The α hyperparameter controls the expected number of topics per document, the β hyperparameter controls the distribution of words per topic, and K defines how many topics we want to extract.
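As a minimal sketch of how these hyperparameters map onto scikit-learn's LatentDirichletAllocation (the numeric values below are illustrative placeholders, not tuned choices):

```python
# Minimal sketch: mapping α, β and K onto scikit-learn's LDA parameters.
# The numeric values are illustrative placeholders, not tuned choices.
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=10,        # K: number of topics to extract
    doc_topic_prior=0.1,    # α: document-topic density
    topic_word_prior=0.01,  # β: topic-word density
    random_state=42,
)
```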
LDA in Python
Let us look at an implementation of LDA. We will try to extract topics from a set of reviews.
The dataset that we will be working on is a set of reviews, which looks as follows:
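A minimal sketch of loading and inspecting such a dataset; the file name reviews.csv and the column name review are hypothetical stand-ins for the actual data:

```python
# Hypothetical loading step: "reviews.csv" and the "review" column are assumptions.
import pandas as pd

df = pd.read_csv("reviews.csv")
print(df.shape)
print(df["review"].head())
```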
Feature Extraction:
This step is not related to LDA; feel free to skip ahead to Vectorization.
First, we will do feature extraction to get some meaningful insights from the data.
We have extracted the following features (a sketch of how these can be computed follows the list):
- Number of words in a document
- Number of characters in a document
- Average word length of the document
- Number of stop-words present
- Number of numeric characters
- Number of upper-case characters
- The sentiment polarity
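A minimal sketch of how these features could be computed, assuming the reviews sit in a pandas DataFrame with a review column (the column name and sample texts are assumptions):

```python
# Minimal sketch of the feature extraction step; the DataFrame and its "review"
# column are assumed stand-ins for the real dataset.
import pandas as pd
from textblob import TextBlob        # for sentiment polarity
from nltk.corpus import stopwords    # may require nltk.download("stopwords")

df = pd.DataFrame({"review": [
    "The room was clean and the staff were friendly.",
    "Terrible service, waited 2 HOURS for check-in!",
]})
stop_words = set(stopwords.words("english"))

df["word_count"] = df["review"].apply(lambda t: len(t.split()))
df["char_count"] = df["review"].apply(len)
df["avg_word_len"] = df["char_count"] / df["word_count"]
df["stopword_count"] = df["review"].apply(lambda t: sum(w.lower() in stop_words for w in t.split()))
df["numeric_count"] = df["review"].apply(lambda t: sum(c.isdigit() for c in t))
df["upper_count"] = df["review"].apply(lambda t: sum(c.isupper() for c in t))
df["polarity"] = df["review"].apply(lambda t: TextBlob(t).sentiment.polarity)
print(df.head())
```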
Data cleaning and Preprocessing:
In data cleaning and preprocessing, we have done the following (a sketch of these steps follows the list):
- Converted all characters to lower case
- Expanded the short forms, like I’ll → I will
- Removed special characters
- Removed extra and trailing spaces
- Removed accented characters and replaced them with their unaccented equivalents
- Lemmatized the words
- Removed stop words
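A minimal sketch of these cleaning steps; the contraction map is a small illustrative subset, and NLTK's lemmatizer and stop-word list stand in for whichever tools were actually used:

```python
# Minimal sketch of the cleaning pipeline; contraction map is an illustrative subset.
import re
import unicodedata
from nltk.corpus import stopwords          # may require nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # may require nltk.download("wordnet")

CONTRACTIONS = {"i'll": "i will", "can't": "cannot", "won't": "will not"}  # illustrative subset
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                                            # lower case
    for short, full in CONTRACTIONS.items():                       # expand short forms
        text = text.replace(short, full)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()  # strip accents
    text = re.sub(r"[^a-z\s]", " ", text)                          # remove special characters
    text = re.sub(r"\s+", " ", text).strip()                       # collapse extra/trailing spaces
    tokens = [lemmatizer.lemmatize(w) for w in text.split()]       # lemmatize
    tokens = [w for w in tokens if w not in stop_words]            # remove stop words
    return " ".join(tokens)

print(clean_text("I'll definitely come back!!  The café was great :)"))
```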
Vectorization:
Since LDA is a probabilistic model over raw word counts (rather than TF-IDF weights), we will use the Count vectorizer.
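A minimal sketch of the vectorization step; the two strings stand in for the cleaned reviews:

```python
# Minimal sketch of vectorization with raw counts; docs are placeholder cleaned reviews.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "room clean staff friendly",
    "terrible service long wait",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # document-term matrix of raw counts
print(vectorizer.get_feature_names_out())        # get_feature_names() on older scikit-learn
print(X.toarray())
```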
Latent Dirichlet Allocation:
In this example, we were given the number of topics, so we did not have to tune the hyperparameter K. For the cases where we do not know the number of topics, we can use grid search.
This can be done as sketched below.
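A rough sketch of such a grid search, using LDA's built-in score (an approximate log-likelihood) to rank candidates; the documents and the candidate values for n_components are placeholders:

```python
# Rough sketch of grid search over the number of topics; the documents and the
# candidate n_components values are placeholders, not the actual data.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

docs = [
    "room clean staff friendly location great",
    "terrible service long wait check in",
    "food breakfast buffet tasty coffee",
    "noisy street could not sleep thin walls",
    "pool gym spa excellent facilities",
    "price expensive small room poor value",
]
X = CountVectorizer().fit_transform(docs)

param_grid = {"n_components": [5, 10, 15, 20]}
search = GridSearchCV(LatentDirichletAllocation(random_state=42), param_grid, cv=3)
search.fit(X)                     # ranked by LDA's approximate log-likelihood score
print(search.best_params_)
```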
The motivation for our model is as follows (a sketch of these steps follows the list):
- Since we know the number of topics, we will be using Latent Dirichlet Allocation with the number of topics set to 12.
- We will also not need to compare different models to get the best number of topics
- We will use random_state, so that the results can be reproduced
- We will fit the model on the vectorized data and transform the same data with it
- After fitting the model, we will print the top 10 words of each topic
- After getting the topics, we will create a new column and assign each document its topic
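A minimal sketch of these steps, assuming the reviews have already been cleaned; the sample documents and column names are placeholders, and 12 topics mirrors the description above:

```python
# Minimal sketch of fitting LDA with 12 topics, printing top words, and assigning
# each review its dominant topic; data and column names are placeholders.
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"clean_review": [
    "room clean staff friendly location great",
    "terrible service long wait check in",
    "food breakfast buffet tasty coffee",
    "noisy street could not sleep thin walls",
]})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean_review"])

lda = LatentDirichletAllocation(n_components=12, random_state=42)  # 12 known topics
doc_topic = lda.fit_transform(X)        # fit on the vectorized data, transform the same data

# Print the top 10 words of each topic
words = vectorizer.get_feature_names_out()   # get_feature_names() on older scikit-learn
for idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")

# Create a new column with each document's highest-probability topic
df["topic"] = doc_topic.argmax(axis=1)
print(df[["clean_review", "topic"]])
```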
Topic Assignments:
To assign the topics, we can do the following:
- See the word-clouds of each topic
- See the top 10 words
- Look for KERA → Keyword Extraction for Reports and Articles
To make word clouds, we can simply use the WordCloud library.
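A small sketch of drawing one topic as a word cloud, assuming the lda and vectorizer objects fitted in the previous sketch; word weights are taken from lda.components_:

```python
# Small sketch of visualising one topic as a word cloud, assuming the `lda` and
# `vectorizer` objects fitted in the previous sketch.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

topic_id = 0
words = vectorizer.get_feature_names_out()
weights = dict(zip(words, lda.components_[topic_id]))   # word -> pseudo-count weight

wc = WordCloud(background_color="white").generate_from_frequencies(weights)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title(f"Topic {topic_id}")
plt.show()
```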
To know more about KERA, refer to the paper “Exploratory Analysis of Highly Heterogeneous Document Collections” by Maiya et al., which is available on arXiv.
The abstract is as follows:
We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and is revealed to be significantly more effective than state-of-the-art methods. Finally, we evaluate our system in its ability to help users locate documents pertaining to military critical technologies buried deep in a large heterogeneous sea of information.
Problems in the model:
- We had to map the extracted topics to the provided topic labels manually, which can cause errors
- We could not check whether the assigned topics are correct or not
- Only one topic is assigned to each document, while ideally the assignment should depend on what matches best
- In some documents, all the topics have nearly the same probability, which causes problems since we select only the maximum
- Some of the words had no relation to the topic, such as discount and change in date
Shortcomings of LDA:
- LDA performs poorly on small texts; most of our data was short.
- Since the reviews are not coherent, LDA finds it all the more difficult to identify the topics
- Since the reviews are mainly context-based, models based on word co-occurrence fail
Alternative:
We can use BERT to do better topic modelling, which will be covered in a future article :)
Resources:
- Choosing the right number of topics for scikit-learn topic modeling | Data Science for Journalism (investigate.ai)
- Contextual Topic Identification. Identifying meaningful topics for… | by Steve Shao | Insight (insightdatascience.com)
- sklearn.decomposition.LatentDirichletAllocation — scikit-learn 0.24.2 documentation
- https://www.youtube.com/watch?v=T05t-SqKArY
- Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences — YouTube
- NLP Tutorial 13 — Complete Text Processing | End to End NLP Tutorial | NLP for Everyone | KGP Talkie — YouTube
- Organizing machine learning projects: project management guidelines | by Gideon Mendels | Comet.ml | Medium
- and numerous Stack Overflow questions.
Thank you for reading :)