Sentiment Analysis & Topic Modeling for Hotel Reviews

Jiamei Wang
Published in The Startup · 11 min read · Sep 13, 2020

Web scraping, Sentiment analysis, LDA topic modeling

Project Overview

In this project, we are going to scrape hotel reviews of “Hotel Beresford” located in San Francisco, CA from the website bookings.com. Then, we are going to do some data exploration, generate WordClouds, perform sentiment analysis and create an LDA topic model.

Problem Statement

The project goal is to use text analytics and Natural Language Processing (NLP) to extract actionable insights from the reviews and help the hotel improve guest satisfaction.

Methodologies

(1) Web Scraping

The hotel reviews will be scraped from bookings.com by using requests with BeautifulSoup. The detailed steps are covered in the next section.

(2) Exploratory Data Analysis (EDA)

We will use a pie chart, a histogram, and a seaborn violin plot to get a better understanding of the reviews and ratings data.

(3) WordClouds

In order to generate more meaningful WordClouds, we will add some customized stop words and use lemmatization to collapse closely related word forms.

(4) Sentiment Analysis

The sentiment analysis helps to classify the polarity and subjectivity of the overall reviews and determine whether the expressed opinion in the reviews is mostly positive, negative, or neutral.

(5) LDA Topic Model

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. We will use grid search to find the best topic model. The two tuning parameters are (1) n_components, the number of topics, and (2) learning_decay, which controls the learning rate.

Metrics

To diagnose the model performance, we will take a look at the perplexity and log-likelihood scores of the LDA model.

Perplexity captures how surprised a model is by new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set. Log-likelihood measures how plausible the model parameters are given the data.

A model with a higher log-likelihood and lower perplexity is considered a good model. However, perplexity might not be the best measure for evaluating topic models, because it does not consider the context and semantic associations between words.

How to Scrape the Reviews?

In this project, we are going to scrape the reviews of “Hotel Beresford” located in San Francisco, CA. To scrape any website, we need to first find the pattern of the URL and then inspect the web page. However, the link we land on is extremely long:

https://www.booking.com/reviews/us/hotel/beresford.html?label=gen173nr-1FCA0o7AFCCWJlcmVzZm9yZEgzWARokQKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4ArGo6_oFwAIB0gIkODFmNjUzODEtMWU5Ny00ZjIzLWI2MWEtYjBjZGU2NzI0ZWYz2AIF4AIB;sid=a1571564bf0a35365a839937489b2ef6;customer_type=total;hp_nav=0;old_page=0;order=featuredreviews;page=1;r_lang=en;rows=75&

After several tries, you will realize that the shorter link below generates the same page. If you want to scrape other hotels, simply replace “beresford” with the name booking.com uses for that hotel. Entering a page number at the end brings you to the review page you want to see.

https://www.booking.com/reviews/us/hotel/beresford.html?page=
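For scripting, that pattern can be wrapped in a small helper. This is a sketch; the function name is mine, not from the original code:

```python
# Build the paged review URL from the hotel slug booking.com uses
# ("beresford" here) and a page number.
BASE_URL = "https://www.booking.com/reviews/us/hotel/{hotel}.html?page={page}"

def review_page_url(hotel, page):
    return BASE_URL.format(hotel=hotel, page=page)
```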

Right-click anywhere on the web page and select “Inspect” to view the HTML & CSS behind the page elements. Here we find that the HTML tag of the review section we want to scrape is “ul.review_list”.

Review Section that includes reviewers’ information and both positive and negative reviews

Under this tag, we want to scrape the following information:

1. Basic information of the reviewer and reviews:

  • Rating Score
  • Reviewer Name
  • Reviewer’s Nationality
  • Overall Review (contains both positive & negative reviews)
  • Number of Reviews the Reviewer Has Written
  • Review Date
  • Review Tags (Trip type, such as business trip, leisure trip, etc.)

2. Positive reviews

3. Negative Reviews

Reviewer’s info | Ratings | Review date
Review Tags | Positive Reviews | Negative Reviews

Now that we have found all of the html tags, let’s start coding! First, we need to import all the python packages we need.

Part I. Scrape Reviews from Bookings.com & Clean Unstructured Text

Scraping the reviews involves the following steps:

  1. Specify the URL of the reviews page.
  2. Send an HTTP request to the URL and save the server's response in a response object.
  3. Create a BeautifulSoup object by passing in the raw HTML content from step 2 and specifying the HTML parser we want to use, like html.parser or html5lib.
  4. Navigate and search the parse tree using BeautifulSoup's tree-searching method .find_all().
  5. Scrape one page first: print out each scraped text to identify patterns, and clean the text using the .strip() and .replace() methods.
  6. Create for loops to store the results in the three lists.
  7. Use a while loop to scrape all the pages.
  8. Convert the lists into dataframes.
  9. Put everything into a function called “scrape_reviews”
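Steps 1-5 above might look like the sketch below. The CSS class names inside the review list are illustrative assumptions; inspect the live page for the exact tags booking.com uses under ul.review_list:

```python
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Parse one reviews page into a list of dicts.
    Class names are illustrative; check the live page's markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("ul.review_list li.review_item"):
        name = item.select_one(".reviewer_name")
        score = item.select_one(".review-score-badge")
        text = item.select_one(".review_item_review_content")
        rows.append({
            "name": name.get_text(strip=True) if name else None,
            "score": score.get_text(strip=True) if score else None,
            "review": text.get_text(" ", strip=True) if text else None,
        })
    return rows

# For the live site (then loop over pages and build the dataframes):
# import requests
# url = "https://www.booking.com/reviews/us/hotel/beresford.html?page=1"
# rows = parse_reviews(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text)
```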

The second function, “show_data”, will print out the length of a dataframe, the total number of NAs, and the first five rows.

There are 42 review pages for Hotel Beresford; adjust the hotel name and the total number of review pages when you scrape other hotels.

Now, let’s check our scraped data by applying the “show_data” function we defined before.

“reviewer_info” dataframe
“pos_reviews” dataframe
“neg_reviews” dataframe

We have 1030 records for the dataframe that contains the basic information of the reviewers as well as rating scores, review dates, and review tags.

For positive reviews, we have scraped 651 records and 614 for the negatives.

Part II. Exploratory Data Analysis (EDA)

Before doing further analyses, let’s perform the exploratory data analysis (EDA) first to get a “feel” of the data we have!

1. Ratio of positive and negative reviews

Positive reviews slightly outnumber the negatives

2. The Distribution of Ratings

The histogram of ratings is left skewed.

People from different countries may have different standards when it comes to rating hotels and their services. The 1030 reviewers came from 69 different countries. Here, we only visualize the rating distributions for the top 10 countries ranked by number of reviews.

A violin plot is used to visualize the distribution of the data and its probability density. The plot shown below is ordered by the review counts of each country and shows how ratings relate to the reviewers’ countries of origin. From the box-plot elements, we see that the median ratings given by reviewers from the U.S. and Ireland are a bit higher than those given by reviewers from other countries, while the median rating given by reviewers from Italy is the lowest. The shapes of most of the distributions (skinny on each end and wide in the middle) indicate that the ratings are highly concentrated around the median, which is around 7 to 8. However, we would probably need more data to get a better idea of the distributions.
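The top-10-countries filtering and the violin plot can be sketched as below; the column names are assumptions, since the scraped dataframe's exact schema is not shown:

```python
import pandas as pd

def top_countries(df, col="Reviewer_nationality", n=10):
    """Keep only the reviews from the n countries with the most reviews."""
    top = df[col].value_counts().head(n).index
    return df[df[col].isin(top)]

# Then, assuming a numeric Rating column:
# import seaborn as sns
# sub = top_countries(reviewer_info)
# order = sub["Reviewer_nationality"].value_counts().index
# sns.violinplot(data=sub, x="Reviewer_nationality", y="Rating", order=order)
```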

3. Review Tags Counts for each Trip Type

Often, one review has multiple tags. The bar chart shows that most people came to San Francisco for leisure trips, either as couples or by themselves. Fewer people came with their family or with a group, and even fewer came with friends. Out of 1030 reviews, only 164 were tagged “Business”, which means only 16% of the reviewers came for business trips. However, we should take into account the fact that people who come for leisure trips usually have more time and are more willing to write reviews, while those who come for business trips may be too busy or simply unwilling to write any.
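Because one review can carry several tags, the counting step has to split them first. A small sketch, assuming the tags were scraped into a comma-separated column (the column name is hypothetical):

```python
import pandas as pd

def tag_counts(tags):
    """Count trip-type tags when one review carries several comma-separated tags."""
    return tags.str.split(",").explode().str.strip().value_counts()

# counts = tag_counts(reviewer_info["Review_tags"])
# counts.plot.bar()
```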

Part III. Text Analytics

  1. Lemmatize Tokens

Lemmatization links words with similar meanings to one base word. WordNet and Penn Treebank have different tagging systems, so we first define a mapping between Treebank POS tags and WordNet tags. Then, we lemmatize the words using NLTK. After generating the WordClouds, I added extra customized stop words to the lemmatization process below.
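A minimal version of that tag mapping is sketched below; the returned values 'a', 'v', 'n', 'r' match NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV constants. The NLTK calls themselves (which require the punkt and wordnet corpora to be downloaded) are shown commented:

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag (from nltk.pos_tag) to a WordNet POS.
    'JJ' -> 'a' (adjective), 'VBD' -> 'v' (verb), etc.; default to noun."""
    tag_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return tag_map.get(treebank_tag[:1], "n")

# With NLTK (after nltk.download("punkt"), nltk.download("wordnet"), etc.):
# from nltk import pos_tag, word_tokenize
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# tagged = pos_tag(word_tokenize("the rooms were smaller than expected"))
# lemmas = [lemmatizer.lemmatize(w.lower(), get_wordnet_pos(t)) for w, t in tagged]
```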

2. Generate WordClouds

For positive reviews, most people were probably satisfied with the location (very convenient, close to Union Square and Chinatown, and with restaurants and pubs easy to find nearby), the friendly and helpful staff, the clean rooms, the comfortable beds, the good price, etc.

The negative reviews also mention words like “breakfast”, “room” and “staff” quite often, but there people were likely complaining about staff being rude, small rooms, and the coffee/cereal/muffins provided at breakfast. The air conditioning or the shower system may need improvement, as we see words like “hot”, “cold”, “air”, “condition”, “bathroom” and “shower” in the WordCloud. The hotel may also need to solve issues related to soundproofing and parking.

3. Sentiment Analysis

Here, we are using the Overall_review column in the reviewer_info dataframe to perform the sentiment analysis.
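The write-up does not name the sentiment library; TextBlob is a common choice that returns exactly these polarity and subjectivity scores. Below is a small labelling helper (the threshold logic is mine), with the hypothetical TextBlob calls commented:

```python
def label_sentiment(polarity):
    """Label a review by its polarity score in [-1, 1]."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# With TextBlob (pip install textblob), using the Overall_review column:
# from textblob import TextBlob
# reviewer_info["polarity"] = reviewer_info["Overall_review"].apply(
#     lambda t: TextBlob(t).sentiment.polarity)
# reviewer_info["subjectivity"] = reviewer_info["Overall_review"].apply(
#     lambda t: TextBlob(t).sentiment.subjectivity)
# reviewer_info["sentiment"] = reviewer_info["polarity"].apply(label_sentiment)
```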

The x-axis shows polarity, and the y-axis shows subjectivity. Polarity tells how positive or negative the text is; subjectivity tells how subjective or opinionated it is. The green dots that lie on the vertical line are the “neutral” reviews, the red dots on the left are the “negative” reviews, and the blue dots on the right are the “positive” reviews. Bigger dots indicate more subjectivity. We see that there are more positive reviews than negative ones.

4. LDA Topic Modelling

Now, let’s apply the LDA model to find each document’s topic distribution and the high-probability words in each topic. Here, we specifically look at the negative reviews to find out which aspects the hotel should focus on improving.

Below are the steps to find the optimal LDA model:

1. Convert the reviews to document-term matrix

TF (term frequency) counts the number of times a word appears in a text, and IDF (inverse document frequency) weighs the relative importance of that word by how many texts it can be found in; it adjusts for the fact that some words appear more frequently in general, like "we", "the", etc. We discard words that appear in more than 90% of the reviews and words that appear in fewer than 10 reviews, since very common words are too generic to be meaningful in topics and very rare words won't carry a strong enough signal and might even introduce noise into our model.

2. GridSearch and parameter tuning to find the optimal LDA model

The grid-search process can consume a lot of time because it constructs an LDA model for every combination of parameter values in the param_grid dict. So, here we are only tuning two parameters: (1) n_components (the number of topics) and (2) learning_decay (which controls the learning rate).
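The grid search over those two parameters can be sketched as below; the candidate values are illustrative, and GridSearchCV scores each fit with the model's own approximate log-likelihood bound:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Candidate values are illustrative; widen or narrow them as runtime allows.
param_grid = {"n_components": [3, 5, 7, 10],      # number of topics
              "learning_decay": [0.5, 0.7, 0.9]}  # controls the learning rate
lda = LatentDirichletAllocation(max_iter=10, random_state=0)
search = GridSearchCV(lda, param_grid, cv=3)

# search.fit(dtm)  # dtm = document-term matrix from the previous step
# best_lda = search.best_estimator_
# print("Best params:", search.best_params_)
# print("Log-likelihood:", best_lda.score(dtm))
# print("Perplexity:", best_lda.perplexity(dtm))
```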

3. Output the optimal lda model and its parameters

A good model should have higher log-likelihood and lower perplexity (exp(-1. * log-likelihood per word)). However, perplexity might not be the best measure to evaluate topic models because it doesn’t consider the context and semantic associations between words.

4. Compare LDA Model Performance Scores

The line plot shows the LDA model performance scores with different params

From the graph, we see that the choice of learning decay has little impact; however, 5 topics produce the best model.

Now, let’s output the words in the topics we just created.
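Extracting the top words per topic can be sketched as follows. The function name is mine, and feature_names would come from the vectorizer's get_feature_names_out():

```python
def top_words(lda_model, feature_names, n_top=20):
    """Return the n_top highest-weight (word, weight) pairs per topic,
    read off the rows of the model's components_ matrix."""
    topics = []
    for weights in lda_model.components_:
        idx = weights.argsort()[::-1][:n_top]
        topics.append([(feature_names[i], weights[i]) for i in idx])
    return topics

# for i, topic in enumerate(top_words(best_lda, vectorizer.get_feature_names_out())):
#     print(f"Topic {i}:", [w for w, _ in topic])
```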

Top 20 words in each topic and their corresponding weights

Now, let’s visualize the topics with pyLDAVis Visualization!

pyLDAVis is a great tool to interpret individual topics and the relationships between the topics. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

On the left-hand side of the visualization, each topic is represented by a bubble. The larger the bubble, the more prevalent that topic is. The number inside each bubble indicates its rank by area, with number 1 being the most popular topic and number 5 the least popular. The distance between two bubbles represents topic similarity. However, this is just an approximation of the original topic-similarity matrix, because we are using a two-dimensional scatter plot to represent the spatial distribution of all 5 topics.

The right-hand side shows the top 30 most relevant terms for the topic you select on the left. The blue bar represents the overall term frequency, and the red bar indicates the estimated term frequency within the selected topic. So, if you see a bar that is both red and blue, the term also appears in other topics. You can hover over a term to see which other topic(s) include it.

You can adjust the relevance metric (λ) by sliding the bar in the top right corner. It helps strike a balance between terms that are exclusively popular within the selected topic (λ = 0) and terms that are also frequent in other topics (as you slide λ to the right).

Conclusion

From the sentiment-analysis scatter plot, we see that positive reviews only slightly outnumber the negatives, so Hotel Beresford definitely needs to improve its guest satisfaction.

The WordCloud reveals some problems for the hotel manager to look into, like the breakfast. However, it is probably necessary to read the detailed reviews about the breakfast to figure out what exactly needs to be improved, maybe the coffee or pastries that appeared in the WordCloud. Also, the hotel manager should train the staff to provide friendlier and better service. The hotel may also need to work on issues related to soundproofing, air conditioning, the shower system, and parking.

The EDA section gives the hotel manager a general idea of the reviews and the rating distribution. The pyLDAvis interactive visualization helps the hotel manager further understand what the most popular topics within the negative reviews are and make improvements accordingly.

Future Work

A lot of the analyses are limited by the size of the scraped data. Here, we only scraped reviews written in English. According to San Francisco Travel reports, 2.9 million international visitors came to San Francisco in 2019. Visitors from non-English-speaking countries are most likely to leave reviews in their native languages. Scraping reviews in other languages and translating them (or scraping after translation) would help increase the data volume.

To provide more useful suggestions to Hotel Beresford, we may also conduct analysis of its competitors to gain insights of guest preferences as well as valuable information that Hotel Beresford may not get from its own reviews.

Hope that this analysis could also benefit people who are interested in text analytics. Please check out my GitHub link for the full code and analysis. Thank you!
