Topic modeling with LSI, LDA and automatic labeling of clusters with LINGO — Part 1

Zeina Thabet
8 min read · Aug 21, 2020


In this two-part series, we are going to explore topic modeling with several techniques such as LSI and LDA. We are also going to explore automatic labeling of clusters using the LINGO algorithm.

Important terms

Before starting, let’s clarify what is meant by some words that are used frequently in this article, to avoid confusion in the context of topic modeling.

Document: a piece of text, which could be a single paragraph or several paragraphs.
Example:

  • Reviews on TripAdvisor:
    Each review would be counted as a document as each review would have a certain topic it focuses on.
  • A Website:
    Each webpage would be a document as each webpage would be about a certain topic.

Now let’s get into topic modeling!

Objectives:

  • Know what Topic Modeling is and how it works.
  • Learn about Latent Semantic Indexing (LSI)
  • Be able to apply LSI on texts to extract topics
  • Query the LSI model to retrieve the documents most related to a given query

What is Topic Modeling?

Topic modeling is a method used by many websites to extract “topics” from documents and attach relevant tags to them. The model is fed a corpus of documents and outputs the topics it extracted, where each topic consists of a distribution of words.

In this article we are going to use text from Property Finder, a real estate listings website in the UAE. In this case, each page or “listing” would be a document, and the collection of documents would form a corpus.

There are many methods topic modeling can use, such as Hierarchical Dirichlet Process (HDP), Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). In this article we are going to use LSI, also known as Latent Semantic Analysis (LSA), implemented with Gensim.

Text Preprocessing

To perform topic modeling, we first import the set of documents and perform some text preprocessing on them to ensure that we get good results.

Text preprocessing and cleaning includes:

  • Text cleaning
  • Stop word removal
  • Word Lemmatization
  • Removing words shorter than 3 characters

Imports
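The original import cell is not reproduced here, but a minimal set of imports covering the steps below would look roughly like this (pandas is only assumed for loading the data):

import re

import pandas as pd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models, similarities

# Download the nltk resources used later (only needed once)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')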

Exploring the text

For easier readability, the output of the text exploration is not shown here. To see the output, check the notebook on GitHub.
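As a rough sketch of this step, assuming the listings are stored in a CSV file with a description column (both the file name and the column name here are hypothetical):

# Hypothetical file and column names; the notebook's actual data loading may differ
df = pd.read_csv('listings.csv')
documents = df['description'].astype(str).tolist()

# Peek at the first document to see what needs cleaning
print(documents[0])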

After exploring the text, it was noted that the text had a lot of:

  • HTML tags
  • newline characters (\n)
  • phone numbers in many formats
  • email addresses

Text cleaning

To clean the text, we create a function that takes in one document and cleans it using regular expressions.
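The exact cleaning function is in the notebook; a minimal sketch that targets the artifacts listed above (HTML tags, newlines, phone numbers and emails) could look like this:

def clean_document(doc):
    """Clean a single document with regular expressions."""
    doc = re.sub(r'<[^>]+>', ' ', doc)                # strip HTML tags
    doc = doc.replace('\\n', ' ').replace('\n', ' ')  # remove literal and real newlines
    doc = re.sub(r'\S+@\S+', ' ', doc)                # remove email addresses
    doc = re.sub(r'\+?\d[\d\s()-]{6,}\d', ' ', doc)   # remove phone numbers in various formats
    doc = re.sub(r'[^a-zA-Z\s]', ' ', doc)            # keep letters only
    return re.sub(r'\s+', ' ', doc).strip().lower()

documents = [clean_document(doc) for doc in documents]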

Stop word removal

Next, we need to remove stop words and words that are shorter than 3 characters. For stop word removal, nltk provides a predefined corpus of stopwords, which we are going to use.
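A possible implementation of this step, reusing nltk’s English stopword list:

stop_words = set(stopwords.words('english'))

def remove_stopwords(doc):
    """Tokenize a cleaned document, dropping stopwords and words shorter than 3 characters."""
    return [word for word in doc.split()
            if word not in stop_words and len(word) >= 3]

tokenized_docs = [remove_stopwords(doc) for doc in documents]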

Word Lemmatization

Word lemmatization reduces a word to its base or dictionary form. Lemmatization is an important step in text preprocessing as it helps group the different inflected forms of a word into a single term.
So for example:

['loving', 'surrounding', 'being']

After Lemmatization:

['love', 'surround', 'be']

In this article, we’ll be using the WordNet Lemmatizer provided by nltk which can be done using the following code:

word = WordNetLemmatizer().lemmatize('any-word')

However, there is a catch!

To do correct lemmatization using WordNet, we also need to provide a part-of-speech (POS) tag. Since we cannot provide POS tags manually for every word, we define a function that does this automatically using nltk.pos_tag([word]).

However, nltk.pos_tag() returns tags that are not accepted by WordNet; according to the documentation, it uses the Penn Treebank tagset.

The WordNet Lemmatizer can accept pos tags such as:

  • wordnet.NOUN
  • wordnet.VERB
  • wordnet.ADJ
  • wordnet.ADV

These are different from the POS tags that nltk uses, so we define a function to convert from an nltk POS tag to a WordNet POS tag.

Since the Treebank tagset has many more POS tags than just nouns, verbs, adjectives and adverbs, we default the output to a noun in case the tag is not one that WordNet accepts.

Taken from: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
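A sketch of this conversion and of the lemmatization step, following the approach described in the linked article (the function names here are mine):

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map the nltk (Penn Treebank) POS tag of a word to a WordNet POS tag, defaulting to noun."""
    tag = nltk.pos_tag([word])[0][1][0].upper()  # first letter of the Treebank tag
    tag_dict = {'J': wordnet.ADJ,
                'N': wordnet.NOUN,
                'V': wordnet.VERB,
                'R': wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_doc(tokens):
    """Lemmatize each token using its inferred part of speech."""
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

lemmatized_docs = [lemmatize_doc(doc) for doc in tokenized_docs]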

Now we can use the WordNet Lemmatizer without having to worry about the pos tags.

LSI model building

To build an LSI model using gensim, we first need two things:

  • dictionary: a mapping that has ids as keys and the unique words in the documents as values.
  • corpus: for each document, the word ids and the number of times each word appears in that document. It is very similar to a document-term matrix.

Term-document matrix (source: DataScience Authority via Quora)

The document-term matrix basically holds the word counts of all words in the corpus for each document.

So, before building the LSI model, we first need a dictionary and a corpus (document-term matrix).
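With gensim, both can be built in a couple of lines from the preprocessed documents (variable names follow the earlier sketches):

# Map each unique word to an integer id
dictionary = corpora.Dictionary(lemmatized_docs)

# Convert each document into a bag of words: a list of (word_id, count) tuples
corpus = [dictionary.doc2bow(doc) for doc in lemmatized_docs]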

Inspecting corpus..

[[(0, 1), (1, 1), (2, 1),(3, 1),(4, 1), (5, 1),(6, 2),(7, 1),(8, 1),
(9, 1),(10, 1), (11, 1),(12, 1),(13, 1),(14, 3),(15, 1),(16, 1),
(17, 2),(18, 1),(19, 1),(20, 1),(21, 1),(22, 1),(23, 1),(24, 1),
(25, 1),(26, 2),(27, 1),(28, 2),(29, 1),(30, 1),(31, 1),(32, 1),
(33, 1),(34, 1),(35, 1),(36, 1),(37, 1),(38, 1),(39, 1),(40, 2),
(41, 1),(42, 1),(43, 1),(44, 1),(45, 1),(46, 1),(47, 2),(48, 1),
(49, 1),(50, 1),(51, 1),(52, 1),(53, 1),(54, 1),(55, 1),(56, 1),
(57, 1)],
[(4, 2),(6, 1),(8, 1),(14, 5),(16, 1),(17, 5),(28, 4),(32, 1),
(37, 1),(39, 1),(40, 5),(54, 2),(56, 1),(58, 1),(59, 1),(60, 1),
(61, 1),(62, 1),(63, 3),(64, 1),(65, 1),(66, 1),(67, 2),(68, 1),
(69, 1),(70, 1),(71, 1),(72, 1),(73, 1),(74, 1),(75, 1),(76, 1),
(77, 1),(78, 1),(79, 1),(80, 1),(81, 1),(82, 1),(83, 1),(84, 1),
(85, 1),(86, 1),(87, 1),(88, 3),(89, 1),(90, 1),(91, 1),
(92, 1),(93, 2),(94, 1),(95, 1),(96, 1),(97, 1),(98, 1),(99, 1),
(100, 1),(101, 1),(102, 1),(103, 1),(104, 1),(105, 1),(106, 1),
(107, 1),(108, 1),(109, 1),(110, 1),(111, 1),(112, 1),(113, 1),
(114, 2)]]

In the corpus, each list is a document holding tuples of a word id and its frequency. For example:

In the first list:

  • (0,1) means word id 0 occurs once in the first document
  • (1,1) word id 1 occurs once in the first document

In the second list:

  • (4,2) means word id 4 occurs 2 times in the second document
  • (6,1) means word id 6 occurs 1 time in the second document

And so on..

The way Latent Semantic Indexing works is that it takes in a term-document matrix (the corpus), where the rows are the unique words that appear in the corpus and the columns are the different documents in the corpus.

It then performs a dimensionality reduction method called truncated Singular Value Decomposition (SVD), which takes a matrix M and decomposes it into 3 matrices called U, V and S.

  • U relates terms to topics
  • V relates documents to topics
  • S is a diagonal matrix of singular values which are sorted from most important (largest value) topic to least important (smallest) topic

It then truncates the matrices to k dimensions, where k is a number specified by the user. In this context, k is the number of topics to keep; everything else is discarded.

Here we set the number of topics to 10, so k would be set to 10.

What happens is that the first 10 values in the S matrix are kept unchanged and the remaining values are set to 0, so we will now be considering 10 topics only.

Truncated SVD (source: ResearchGate)
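To make the decomposition concrete, here is a tiny standalone numpy illustration of truncated SVD on a toy matrix (this is only an illustration; gensim performs the decomposition internally):

import numpy as np

# A toy term-document matrix: rows are terms, columns are documents
M = np.random.rand(12, 8)

# Full SVD: M = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Truncate to k topics by keeping only the k largest singular values
k = 3
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation of M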

Initializing the LSI model
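With the dictionary and corpus from above, initializing the model in gensim is a one-liner (num_topics=10, as discussed):

lsi_model = models.LsiModel(corpus, id2word=dictionary, num_topics=10)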

Inspecting the U matrix
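gensim exposes the left singular vectors on the model’s projection; a small sketch of how they can be inspected:

# Term-topic matrix U: one row per term, one column per topic
U = lsi_model.projection.u
print(U.shape)
print(U[:, 0])   # how strongly each term loads on topic 0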

Since the U matrix relates terms to topics, we can see here that the words with ids 14,29,55 relate most to topic 0.

Inspecting the S matrix
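Similarly, the singular values are available on the projection:

# Singular values, one per topic, sorted from largest to smallest
S = lsi_model.projection.s
print(S)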

Since we have specified the number of topics as 10, we got 10 singular values.

This matrix ranks the topics by importance: topic 0 is the most important and topic 9 is the least important.

Most important topics in the corpus
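The topic-word distributions listed below can be printed with show_topics (a sketch):

# Each topic is shown as a weighted combination of its top 10 words
lsi_model.show_topics(num_topics=10, num_words=10)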

[(0,
'0.396*"dubai" + 0.384*"marina" + 0.232*"view" + 0.217*"apartment" + 0.213*"tower" + 0.192*"bedroom" + 0.178*"room" + 0.168*"area" + 0.162*"property" + 0.158*"floor"'),
(1,
'0.362*"marina" + 0.307*"dubai" + -0.256*"bedroom" + -0.253*"property" + 0.200*"tower" + -0.198*"room" + -0.188*"view" + -0.156*"study" + -0.138*"bathroom" + -0.138*"garden"'),
(2,
'0.508*"tower" + 0.334*"princess" + -0.304*"marina" + 0.294*"floor" + -0.179*"dubai" + 0.154*"residential" + -0.147*"walk" + 0.146*"tallest" + -0.136*"property" + 0.129*"world"'),
(3,
'-0.429*"elite" + -0.421*"residence" + 0.235*"estate" + 0.234*"real" + -0.213*"room" + 0.200*"property" + 0.165*"dubai" + 0.156*"tower" + 0.152*"marina" + -0.146*"pool"'),
(4,
'-0.392*"real" + -0.390*"estate" + -0.320*"elite" + -0.303*"residence" + 0.170*"walk" + 0.167*"marina" + -0.162*"service" + -0.158*"property" + 0.156*"room" + 0.143*"apartment"'),
(5,
'0.372*"room" + 0.356*"dubai" + -0.228*"view" + -0.205*"walk" + -0.170*"marina" + -0.157*"call" + -0.147*"tower" + 0.136*"jumeirah" + -0.130*"please" + -0.129*"available"'),
(6,
'0.309*"walk" + 0.253*"real" + 0.247*"estate" + -0.212*"view" + 0.205*"station" + -0.202*"dubai" + 0.182*"room" + 0.179*"beach" + -0.164*"apartment" + 0.162*"pool"'),
(7,
'-0.481*"apartment" + 0.237*"dubai" + -0.221*"view" + -0.167*"real" + -0.160*"estate" + 0.159*"ranch" + 0.157*"villa" + 0.153*"world" + 0.153*"community" + 0.152*"study"'),
(8,
'0.287*"property" + -0.205*"kitchen" + -0.185*"apartment" + 0.184*"jumeirah" + -0.164*"large" + 0.152*"floor" + -0.148*"estate" + -0.148*"real" + -0.145*"tallest" + 0.143*"dubai"'),
(9,
'0.329*"bedroom" + 0.284*"floor" + -0.262*"property" + -0.236*"room" + -0.191*"princess" + -0.155*"world" + 0.147*"exclusive" + 0.142*"unit" + -0.134*"view" + 0.124*"marina"')]

The higher a word’s weight, the more strongly that word is associated with the topic.

Querying the LSI model to get relevant documents

We can also use the LSI model we just created to run queries against it and retrieve the documents most related to a given query.

If we wanted to query ‘spacious bedroom with sea view’, we would first convert it to a bag-of-words vector and then map it into the LSI vector space.
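A sketch of these two steps, reusing the preprocessing functions and objects defined earlier:

query = 'spacious bedroom with sea view'

# Preprocess the query the same way as the documents, then convert it to a bag of words
query_bow = dictionary.doc2bow(lemmatize_doc(remove_stopwords(clean_document(query))))

# Map the query into the LSI topic space
query_lsi = lsi_model[query_bow]
print(query_lsi)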

This would give us the relation between the query and the 10 topics.

[(0, 0.4790280995355319), (1, 0.5115331178619901), (2, -0.012159662397348562), (3, -0.04083240054066326), (4, -0.23147406310326937), (5, 0.3204532149199051), (6, 0.36010296942403663), (7, -0.27341654274531974), (8, -0.09082739687970177), (9, 0.16215539471121967)]

Using MatrixSimilarity from Gensim, which uses cosine similarity under the hood, we can get the documents that relate most to the query.
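A sketch of the similarity lookup:

# Build a similarity index over the whole corpus in LSI space
index = similarities.MatrixSimilarity(lsi_model[corpus])

# Cosine similarity between the query and every document, sorted from most to least similar
sims = sorted(enumerate(index[query_lsi]), key=lambda item: -item[1])
print(sims[:3])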

We can see here that the documents that are highly related to our query are documents 141, 134, 148.

Coherence Score

Finally, we can evaluate our model by computing the coherence score, which measures how semantically coherent the extracted topics are.
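Using gensim’s CoherenceModel (the c_v measure is a common choice; whether it matches the notebook’s exact settings is an assumption):

from gensim.models import CoherenceModel

coherence_model = CoherenceModel(model=lsi_model, texts=lemmatized_docs,
                                 dictionary=dictionary, coherence='c_v')
print('Coherence Score: ', coherence_model.get_coherence())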

Coherence Score:  0.43105337640086844

The higher the coherence score, the better, so a score of 0.43 can definitely be improved further.

Code available on GitHub

All the code is available in a Jupyter notebook on GitHub.

Note:

This was done as part of my internship at Ureka Education Group, which is a startup company. Thanks to them, I have learned a lot during my Data Science internship.
LinkedIn:
https://www.linkedin.com/company/ureka-limited/
