Topic Modelling.

Aditya
4 min read · Jul 30, 2020

Topic modelling is the process of inferring topics from a large body of text drawn from one or more sources. Here we explore topic modelling on Airbnb reviews. Although the original dataset contains reviews in many languages, we focus only on the English comments for simplicity's sake.

There are many algorithms for this task. The following three are explored here:

· LDA

· LSA

· LINGO

LDA:

It stands for Latent Dirichlet Allocation. Here every document (in our case, every review) is modelled as a mixture of topics, and each topic is a distribution over words. The result is a collection of words predicted to fall under each topic, which we can then name according to our own interpretation.
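To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story that LDA assumes, written with NumPy. The vocabulary, the number of topics and the Dirichlet parameters are made-up toy values, not anything taken from the Airbnb data, and this is not the code used for the results in this post. Fitting LDA means running this story in reverse: given only the words, estimate the topic mixtures and the topic-word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and topic count (not from the Airbnb dataset).
vocab = ["clean", "host", "location", "price", "noisy", "recommend"]
n_topics = 2

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)


def generate_review(n_words, alpha=0.8):
    """Generate one synthetic review following LDA's generative assumptions."""
    # Per-review topic mixture, drawn from a Dirichlet prior.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)            # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words


theta, words = generate_review(8)
print("topic mixture:", np.round(theta, 2))
print("review words:", words)
```

In practice one would not write the inference by hand; libraries such as scikit-learn and gensim ship LDA implementations that take a document-term matrix and a user-chosen number of topics.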

Cluster 1

Here we have the terms in cluster one, which is the largest cluster in our dataset. One interpretation is that the people who wrote these reviews had a very pleasant stay but are somewhat reluctant to recommend the place.

Cluster 2

This is the second-largest cluster found in the dataset. Going by the words and their counts, we can interpret these as customers who liked the stay and are very likely to recommend the place, even though they had issues with high prices, the tidiness of the rooms, etc.

LSA:

It stands for Latent Semantic Analysis. This is a more mathematical approach to topic modelling. Here (in our case) the reviews, along with the unique words in them, are converted to a document-term matrix, which is then subjected to Singular Value Decomposition (SVD). SVD reduces the document-term matrix to a small number of latent topics and gives the strength of association of each review and each unique term with those topics, on the basis of which the topics can be clustered.
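The pipeline described above (document-term matrix, then SVD, then keep the first few components) can be sketched in a few lines of NumPy. The four reviews and the choice of k = 2 are hypothetical toy values, raw word counts are used where a real run might prefer TF-IDF weights, and this is an illustration of the technique, not the code behind the clusters shown in this post.

```python
import numpy as np

# Hypothetical toy reviews (not from the Airbnb dataset).
reviews = [
    "great place great host",
    "clean place nice host",
    "price high room small",
    "room small price high noisy",
]

# Build the document-term matrix of raw word counts.
vocab = sorted({w for r in reviews for w in r.split()})
index = {w: j for j, w in enumerate(vocab)}
X = np.zeros((len(reviews), len(vocab)))
for i, r in enumerate(reviews):
    for w in r.split():
        X[i, index[w]] += 1

# Singular Value Decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep (chosen by the user)
doc_topic = U[:, :k] * s[:k]  # each review expressed in topic space
topic_term = Vt[:k]           # weight of each term in each topic

# Show the three terms with the largest absolute weight per topic.
for t in range(k):
    top = np.argsort(-np.abs(topic_term[t]))[:3]
    print(f"topic {t}:", [vocab[j] for j in top])
```

Note that the term weights can be negative, which is one reason LSA topics are harder to read than LDA topics.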

Cluster 1

The above is the first topic cluster obtained via LSA. Here we can say that the customers liked the place, valued it, and wanted to stay longer.

Cluster 2

The above is the second cluster. Because of the relative positioning of the words, we can say that this set of customers was even more delighted with the place than the people in the first cluster, as this cluster has more reviews saying they wanted to stay longer.

LINGO:

Here Carrot2 Workbench was used to implement topic modelling of the dataset with the LINGO algorithm. To do this, the English reviews were extracted from the dataset and converted to XML, which was then loaded and clustered with LINGO in the Workbench. Due to hardware and software constraints, only 1000 reviews were clustered to draw inferences.
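The conversion to XML can be sketched with Python's standard library. The element names below (searchresult, query, document, title, snippet) are an assumption based on the XML input format of Carrot2 3.x, so check them against the documentation of the Workbench version you use. The function name, the query string and the sample reviews are all hypothetical.

```python
import xml.etree.ElementTree as ET


def reviews_to_carrot2_xml(reviews, query="airbnb reviews", limit=1000):
    """Build a Carrot2-style XML tree from a list of review strings.

    Element names are assumed from the Carrot2 3.x input format.
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    # Keep only the first `limit` reviews, mirroring the 1000-review cap.
    for i, text in enumerate(reviews[:limit]):
        doc = ET.SubElement(root, "document", id=str(i))
        ET.SubElement(doc, "title").text = f"Review {i}"
        ET.SubElement(doc, "snippet").text = text
    return ET.ElementTree(root)


# Hypothetical sample reviews (not from the Airbnb dataset).
sample = ["Lovely flat, great host.", "Clean room but a bit pricey."]
tree = reviews_to_carrot2_xml(sample)
print(ET.tostring(tree.getroot(), encoding="unicode"))

# To save the file for loading into the Workbench:
# tree.write("reviews.xml", encoding="utf-8", xml_declaration=True)
```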

Clustering in LINGO
Filtering based on wanted phrases.

We can see here that the reviews can be filtered by query terms of interest, a very useful feature of this framework that is not available otherwise.

Closing Remarks:

It's worth mentioning that in both LDA and LSA the number of topic clusters must be specified by the user; the choice is arbitrary and open to debate. LDA provides a pictorial view of the clusters which is easy to interpret. LSA is prone to losing language features because it reduces the text to a purely mathematical model. Carrot2 with LINGO is more accomplished than the other two techniques, but because the reviews are processed in batches and there is no means of filtering according to specific constraints, its results are also open to speculation.

Github Repo links:

Thanks to UREKA Educational Group for helping me get a linear learning curve in my quest to learn advanced Data Science concepts.

On LinkedIn: https://www.linkedin.com/company/ureka-limited/

Edit: changed generating inference of sorts → inferring topics.
