Topic Modeling using Gensim-LDA in Python

Aravind CR · Published in Analytics Vidhya · 7 min read · Jul 26, 2020

This blog post is part-2 of NLP using spaCy, and it mainly focuses on topic modeling.

Do check out part-1 of the blog, which covers various preprocessing and feature extraction techniques using spaCy.

What is topic modeling?

Topic modeling is a technique for extracting hidden topics from large volumes of text. A topic model is a probabilistic model that captures this latent topic structure of the text.

Ex: a newspaper corpus may have topics like economics, sports, politics, and weather.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. Finding good topics depends on the quality of the text preprocessing, the choice of topic modeling algorithm, and the number of topics specified in the algorithm.

There are several existing algorithms you can use to perform topic modeling. The most common ones are Latent Semantic Analysis or Indexing (LSA/LSI), Hierarchical Dirichlet Process (HDP), and Latent Dirichlet Allocation (LDA), the one we will be discussing in this post.

LDA’s approach to topic modeling is to treat each document as a collection of topics and each topic as a collection of keywords. Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within documents and the keyword distribution within topics to obtain a good composition of the topic-keyword distribution.

Topics are nothing but collections of prominent keywords, the words with the highest probability in a topic, which help identify what the topics are about.

Install dependencies

pip3 install spacy

python3 -m spacy download en #Language model

pip3 install gensim # For topic modeling

pip3 install pyLDAvis # For visualizing topic models

  • For this implementation we will be using stopwords from NLTK.
import nltk
nltk.download('stopwords')

Implementation

  • Import libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spaCy for preprocessing
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

Prepare stopwords

  • You can extend the list of stopwords depending on the dataset you are using, or if you still see stopwords left over after preprocessing. You can also visualize your cleaned corpus with a wordcloud and check whether any words are adding noise or whether any stopwords are still left in the cleaned corpus (a minimal sketch follows the code block below).
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
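
Here is a minimal wordcloud sketch for that sanity check. It assumes the wordcloud package is installed (pip3 install wordcloud) and is meant to be run after the cleaned, lemmatized corpus (data_lemmatized) is built later in this post; neither the package nor the run order is part of the original walkthrough.

# Sketch: visualize the cleaned corpus to spot leftover stopwords or noisy tokens
# Assumes: pip3 install wordcloud, and data_lemmatized built further below
from wordcloud import WordCloud
import matplotlib.pyplot as plt
all_tokens = " ".join(" ".join(doc) for doc in data_lemmatized)
wc = WordCloud(width=800, height=400, background_color='white',
               stopwords=set(stop_words)).generate(all_tokens)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()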

Load Dataset

We will be using the 20-Newsgroups dataset. It contains about 11K newsgroup posts on 20 different topics. The dataset is available at newsgroups.json.

# Load Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

Remove emails and newline characters

  • As you can see, there are a lot of email addresses and newline characters present in the dataset. Remove them using regular expressions with sub() from the re module. In re.sub(), specify a regular expression pattern as the first argument, the replacement string as the second argument, and the string to be processed as the third argument.
# Convert to list 
data = df.content.values.tolist()
# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]
pprint(data[:1])
  • The text still looks messy; carry on with further preprocessing.

Tokenize words and clean up the text

Use gensim's simple_preprocess(), setting deacc=True to remove punctuation.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuation
data_words = list(sent_to_words(data))
print(data_words[:1])

Creating Bigram and Trigram models

Bigrams are 2 words that frequently occur together in a document; trigrams are 3 words that frequently occur together. Many other techniques important in an NLP pipeline are explained in part-1 of the blog, and it would be worth your while going through it. The 2 arguments for Phrases are min_count and threshold; the higher these values, the harder it is for words to be combined into bigrams.

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

Remove Stopwords, make bigrams and lemmatize

  • Using lemmatization instead of stemming is a practice that especially pays off in topic modeling, because lemmatized words tend to be more human-readable than stemmed ones.
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

Call the functions in order

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

Create Dictionary and Corpus needed for Topic Modeling

  • Make sure the dictionary (id2word) and the corpus are clean; otherwise you may not get good quality topics.
# Create Dictionary 
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 5), (6, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1)]]

  • Gensim creates a unique id for each word in the document. The corpus above is a mapping of (word_id, word_frequency). Example: (8, 2) above indicates that word_id 8 occurs twice in the first document, and so on.
  • This is used as the input to the LDA model.

If you want to see which word corresponds to a given id, pass the id as a key to the dictionary. Example: id2word[4].

  • A readable format of the corpus can be obtained by executing the code block below.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

Building topic model

> Parameters of LDA

  • Alpha and beta are hyperparameters: alpha represents document-topic density and beta represents topic-word density. chunksize is the number of documents to be used in each training chunk, update_every determines how often the model parameters should be updated, and passes is the total number of training passes.
  • The best number of topics really depends on the kind of corpus you are using, the size of the corpus, and the number of topics you expect to see.
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

View topics in LDA model

  • Each topic is a combination of keywords, and each keyword contributes a certain weight to the topic.
  • You can see the keywords for each topic and the weight of each keyword using lda_model.print_topics().
# Print the keyword of topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

Output: First 5 topics -

Fig. First 5 topics
  • You can see the top keywords and the weights associated with the keywords contributing to each topic.
  • The words listed for a topic are those with the highest probability in that topic, and the numbers are the probabilities of the words appearing in the topic's distribution.
  • But just by looking at the keywords, can you guess what the topic is about?
  • You may summarize topic-4 as space (in the figure above). Note that topic numbers are not fixed across runs: on a different run, the 'space' topic might not sit at index 4; it could be topic 10 or any other number. A small sketch for inspecting a single topic follows this list.
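
To inspect one topic more closely than the full print_topics() dump allows, you can pull just its keywords and weights. A minimal sketch; the index 4 is only an example from the figure above and may differ in your run.

# Sketch: show the top keywords and weights of a single topic
# (topic index 4 is just an example; it varies between runs)
pprint(lda_model.show_topic(4, topn=10))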

Evaluate topic models

Compute model Perplexity and Coherence score

  • Coherence score and perplexity provide a convenient way to measure how good a given topic model is.
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))
# a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Output:

  • The lower the perplexity, the better the model.
  • The higher the topic coherence, the more human-interpretable the topic.
Perplexity:  -8.348722848762439  
Coherence Score: 0.4392813747423439

Visualize the topic model

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
Fig. Visualizing topic model

Each bubble on the left-hand side represents a topic. The larger the bubble, the more prevalent or dominant the topic is. A good topic model will have fairly big topics scattered across different quadrants rather than clustered in one quadrant.

  • A model with too many topics will have many overlapping, small-sized bubbles clustered in one region of the chart.
  • If you move the cursor over the different bubbles, you can see the different keywords associated with each topic.

How to find the optimum number of topics?

  • One approach to finding the optimum number of topics is to build many LDA models with different numbers of topics and pick the one that gives the highest coherence value (see the sketch after this list).
  • If you see the same keywords being repeated in multiple topics, it's probably a sign that 'k' is too large.
  • Sometimes the topic keywords may not be enough to make sense of what a topic is about. For a better understanding of the topics, you can find the documents a given topic has contributed to the most and infer the topic by reading those documents.
  • Finally, one needs to understand the volume and distribution of topics in order to judge how widely a topic was discussed.
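
Here is a minimal sketch of that first point: train several LDA models over a range of topic counts and compare their coherence scores. The range, step, and helper-function name below are arbitrary choices for illustration, not values from the original post.

# Sketch: train LDA models for different numbers of topics and compare coherence.
# The range/step and the function name are arbitrary example choices.
def compute_coherence_values(dictionary, corpus, texts, start=5, limit=40, step=5):
    model_list, coherence_values = [], []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=dictionary,
                                                num_topics=num_topics,
                                                random_state=100,
                                                chunksize=100,
                                                passes=10,
                                                alpha='auto')
        model_list.append(model)
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=dictionary, coherence='c_v')
        coherence_values.append(cm.get_coherence())
    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(id2word, corpus, data_lemmatized)

# Plot coherence against the number of topics and look for the peak / plateau
x = list(range(5, 40, 5))
plt.plot(x, coherence_values)
plt.xlabel("Num topics")
plt.ylabel("Coherence score (c_v)")
plt.show()

Pick the value of k where the coherence peaks or starts to level off; pushing past that point is usually where the keyword repetition mentioned above begins.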

Hope this blog was informative

Keep Learning………..

— — — — — — — Thank you — — — — — — —
