Published in Analytics Vidhya

Analyze Trip Advisor Hotel Reviews: LDA Topic Modeling

What makes a hotel good/bad? What matters to the travelers? Let’s find out.

Hotels play a crucial role in travel, and with increased access to information, new ways of choosing the best ones have emerged.
With this dataset, consisting of 20k reviews crawled from Trip Advisor, I will apply LDA to:

  • Discover top topics and keywords mentioned in the reviews
  • Explore the key aspects that make a hotel good or bad

Latent Dirichlet Allocation (LDA)

A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model: it treats each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions. It can therefore be used to assign the text in a document to its most likely topics.
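
For readers who prefer notation, the standard LDA generative story can be written as below. The symbols are the conventional ones from the LDA literature, not names used elsewhere in this article.

```latex
\begin{aligned}
\theta_d &\sim \operatorname{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\varphi_k &\sim \operatorname{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in document } d \\
w_{d,n} &\sim \operatorname{Categorical}(\varphi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```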

Let’s get started!

Import dataset
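
The loading code did not survive in this copy of the article, so here is a minimal sketch of the step, assuming the reviews sit in a CSV with a text column and a score column. The column names `Review` and `Rating` and the file name in the comment are assumptions, and the two inline rows are made-up stand-ins so the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file; the column names are assumptions.
sample_csv = io.StringIO(
    "Review,Rating\n"
    '"nice hotel, friendly staff",4\n'
    '"room was small and noisy",2\n'
)

# With the real data this would be something like:
# df = pd.read_csv("tripadvisor_hotel_reviews.csv")  # hypothetical file name
df = pd.read_csv(sample_csv)

print(df.shape)
```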

Take a peek at the data

df.head(10)  # Return the first 10 rows
df.info()

Check Null Values

df.isna().sum()

Visualize Rating Score Distribution

Given the rating score distribution, we can see that hotels are generally doing well. Positive reviews (scores of 4 and 5) account for 74% of the total reviews.
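
A share like the one quoted above can be computed directly from the score column. This is a sketch on made-up scores, assuming the column is called `Rating` (an assumption, as is the toy data), so the number it prints says nothing about the real dataset.

```python
import pandas as pd

# Toy scores, not the real data.
df = pd.DataFrame({"Rating": [5, 4, 4, 3, 2, 5, 1, 4]})

# Number of reviews per score, ordered from lowest to highest score.
rating_counts = df["Rating"].value_counts().sort_index()

# Fraction of reviews with a score of 4 or 5.
positive_share = (df["Rating"] >= 4).mean()

print(rating_counts)
print(f"Positive share: {positive_share:.0%}")
```

With matplotlib installed, `rating_counts.plot(kind="bar")` would draw the distribution as a bar chart.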

But what are the factors that determine and contribute to a positive/negative experience? What is it that can be leveraged? What is it that can be improved?

By answering these questions, one can acquire a better understanding of the customers, and hopefully identify business opportunities and weaknesses.

Prepare the Model

Pre-process the Data

Here, we will perform the following:

  • Tokenization: split the text into words, lowercase them, and remove punctuation.
  • Stopword removal: drop common words that carry little meaning, so the focus stays on the words that define what the text is about.
  • Lemmatization: strip inflectional endings and return the base (dictionary) form of each word.
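
The three steps above can be sketched in plain Python. This is not the code used for the article: the stopword set and the lemma table are tiny hypothetical stand-ins, and a real run would swap in a full stopword list and a proper lemmatizer (for example from NLTK or spaCy).

```python
import re

# Tiny illustrative stopword list (assumption); real lists hold a few hundred words.
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "at", "of", "to", "we", "our"}

# Toy lemma lookup standing in for a real lemmatizer (assumption).
LEMMAS = {"rooms": "room", "nights": "night", "stayed": "stay", "staying": "stay"}


def preprocess(text, stopwords=STOPWORDS, lemmas=LEMMAS):
    """Tokenize, remove stopwords, and lemmatize one review."""
    # Tokenization: lowercase, then keep runs of letters (drops punctuation and digits).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal.
    tokens = [t for t in tokens if t not in stopwords]
    # Lemmatization by dictionary lookup; unknown words pass through unchanged.
    return [lemmas.get(t, t) for t in tokens]


print(preprocess("We stayed two nights; the rooms were clean!"))
# ['stay', 'two', 'night', 'room', 'clean']
```

Note that the simple letters-only tokenizer splits contractions such as "didn't" into two pieces, which is one reason to prefer a library tokenizer on real reviews.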

Latent Dirichlet Allocation (LDA)

Topic Modeling

Here, I set the number of topics to 3 because I suspect the reviews will fall into three categories: positive, neutral and negative. Feel free to experiment with this number. For a more systematic choice, you can refer to this article: Method to find optimal number of topics.

Now, for each topic, we will explore the words occurring in that topic and their relative weights.

Keywords for the 3 topics:

  • Topic 0 : room, hotel, night, service, time, desk, floor, problem, hour, small, staff, water, issue…
  • Topic 1 : hotel, room, great, good, staff, location, breakfast, nice, night, clean, friendly, area, helpful, comfortable…
  • Topic 2 : food, good, pool, time, great, resort, beach, people, restaurant, beautiful, water, nice, staff…

From the keywords, we can categorize them as:

  • Negative hotel experience (Topic 0): from the keywords in this category, it seems the problems are often related to wait time, reservation problems, dirtiness and loudness.
  • Positive hotel experience (Topic 1): most positive comments mention room, location, staff and breakfast. This tells us the criteria the reviews are based on, in other words what matters to customers. Note also the word ‘helpful’: it may suggest that how staff handle a situation when problems arise makes a real impact and leaves a lasting memory.
  • Positive resort hotel experience (Topic 2): resort hotels appear to be popular, and the factors that make a great resort include the location and environment, such as the beach, the people and the restaurants.

Level of Relevance

Perplexity:  -7.163128068315959
Coherence Score: 0.3659933989946868

Perplexity measures how well a probability model predicts held-out data. Topic coherence measures score a single topic by the degree of semantic similarity between its highest-scoring words. Simply put, the two scores together give a sense of how relevant the words grouped under each topic are. The lower the perplexity, the better; the higher the coherence score, the better. In this case, the relevance is rather low.
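
One caveat on reading the first number: perplexity itself cannot be negative. If the figure was produced by gensim's `log_perplexity` (an assumption, since the article does not name the library), it is a per-word log-likelihood bound, and gensim derives the perplexity estimate as 2 raised to the negative of that bound. A quick conversion under that assumption:

```python
# Value printed above, assumed to be gensim's per-word likelihood bound.
per_word_bound = -7.163128068315959

# gensim's own log output reports the perplexity estimate as 2 ** (-bound).
perplexity = 2 ** (-per_word_bound)

print(f"Perplexity estimate: {perplexity:.1f}")
```

Under this reading, a bound closer to zero corresponds to a lower perplexity.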

Business Application

Limitations

Insight

From the term list, we can see that the term ‘room’ is at the top of the list for all categories, which suggests it is the primary product aspect that all travelers consider.

Reference: https://www.youtube.com/watch?v=nNvPvvuPnGs

Hope this was helpful! I look forward to hearing any feedback or questions.
