Restaurant Reviews Analysis

Dennis T
Published in Analytics Vidhya
5 min read · Oct 11, 2019

Web Scraping

One of the more frustrating aspects of data science is gathering a large set of good data. For my most recent project, I implemented a web scraper to gather restaurant review text from yelp.com. I had planned to use the yelp API but soon discovered that it is limited to returning 3 reviews per business. This would not be acceptable for my particular project, so I decided to try scraping for the first time… and I must say, there is something very gratifying about web scraping. Every time my loop would extract business and review data from another webpage, I felt like I was living dangerously. I should be clear here, this data was not gathered in an abusive fashion — the scraping occurred over the course of one day and only for the set of data required for this project. Also, the scraped data is not being used for any commercial purposes. End disclaimer.

I used requests and BeautifulSoup to scrape the webpages. I used requests.get(url) simply to get the webpage content. From there, I created a BeautifulSoup object from the content returned by requests. Once the BeautifulSoup object is created, I can traverse the page content immediately or save the object for later use. In order to locate and access items in my soup object, I needed to know which html tags and class names to look for. These can be found by visiting the url of a yelp reviews page and using Chrome's Inspect tool on the desired items to view the page contents. One challenge I encountered was that some review pages on yelp use different html structures, so I was forced to create two separate flavors of scraper function for the review pages. When completed, I ended up with 59,274 reviews for 1,081 restaurants in the Albany, NY metro area.
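
A minimal sketch of the pattern is below. Note that the URL, tag, and class names are placeholders of my own and not yelp's actual markup, which varies from page to page:

# Minimal sketch of the scraping pattern (URL, tag, and class names are placeholders)
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.com/biz/some-restaurant-albany"  # hypothetical page
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Locate review text by the tag and class name found via Chrome's Inspect tool
reviews = [p.get_text() for p in soup.find_all("p", class_="review-text")]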

To see my web scraper code, visit the Github repository for this project. I’ve provided a direct link to the Web Scraping notebook here.

Text Analysis

There is a wonderful python library called textblob which has a built-in function for measuring the sentiment of a block of text. See the snippet below.

from textblob import TextBlob

my_text_blob = TextBlob("My random block of text")
my_text_blob.sentiment.polarity  # polarity ranges from -1 (negative) to 1 (positive)

I compared the polarity measurements for each review to the star rating given by the restaurant reviewer. Thankfully, they jibed nicely.

Each group of reviews (grouped by star rating) follows a normal distribution. Additionally, the sentiment matches the star rating, with 1-star reviews peaking at the most negative sentiment among the groups and 5-star reviews peaking in the more positive direction. This may not seem terribly interesting, but it does mean that the sentiment of the review text itself is consistent with star ratings. If they were inconsistent, I would start questioning my data, the rating system, or textblob's capabilities.
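
A minimal sketch of that comparison, assuming the scraped reviews sit in a pandas DataFrame (the column names and sample rows are my own placeholders, not the project code):

import pandas as pd
from textblob import TextBlob

# Hypothetical frame of scraped reviews; in the project this came from the scraper
reviews = pd.DataFrame({
    "text": ["Amazing food and friendly service", "Tasted like cardboard"],
    "stars": [5, 1],
})

# Polarity per review, then summarized by star rating
reviews["polarity"] = reviews["text"].apply(lambda t: TextBlob(t).sentiment.polarity)
print(reviews.groupby("stars")["polarity"].describe())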

As the sentiment value from textblob is continuous (-1 to 1), I was able to use it to get a more nuanced view of the review data, such as review sentiment over time on a more granular scale.
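
Continuing the hypothetical frame from the sketch above, and assuming each review row also carries its date (again, a placeholder column name), that more granular view might look like this:

# Monthly average sentiment across all reviews (illustrative dates)
reviews["date"] = pd.to_datetime(["2019-06-02", "2019-07-15"])
monthly_sentiment = reviews.set_index("date")["polarity"].resample("M").mean()
print(monthly_sentiment)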

Further analysis of the review text data included phrasing using the gensim.models.phrases library to find phrases that are used in my dataset. This was done twice. First, to get bigrams. Second, to get bigrams from text that already has bigrams — which would make trigrams and 4-grams… quadgrams…or whatever those things should be called.

Anyway, the gensim analysis pulls out gems such as:

complimentary_chips_salsa
tasted_like_cardboard
cold_sesame_noodles
crispy_outside_soft_inside
love_hate_relationship
waitress_seemed_annoyed
stay_far_far_away

Honestly, the phrasing analysis was probably the most amusing part of this project. Tweaking the gensim phraser threshold and scoring function could greatly impact the quality of my phrases. I found that ‘npmi’ scoring would return more phrases, but they were not as sensical as those returned when using the default scorer. My phraser fitting function is included below. It returns a fitted phraser that can then be applied to any text block.

from gensim.models.phrases import Phrases, Phraser

# Function definition
def fit_phraser(sentences,
                min_count=5,
                threshold=8,
                scoring='default'):
    """
    Returns a gensim bigram phraser. The phraser is fit to the sentences
    passed in, and the min_count, threshold, and scoring parameters are
    passed through to Phrases in gensim.models.
    """
    bigram = Phrases(sentences=sentences,
                     min_count=min_count,
                     threshold=threshold,
                     scoring=scoring)
    return Phraser(bigram)

# Fit and apply the bigram Phraser to text
bigram_phraser = fit_phraser(all_sentences)
phrased_sentence = bigram_phraser[text_block]
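
As mentioned above, the phrasing was run twice; a sketch of that second pass, with variable names of my own, might look like:

# Second pass: fit a phraser on sentences that already contain bigrams,
# which surfaces trigrams and 4-grams (variable names are illustrative)
bigrammed_sentences = [bigram_phraser[s] for s in all_sentences]
trigram_phraser = fit_phraser(bigrammed_sentences)
phrased_again = trigram_phraser[bigrammed_sentences[0]]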

After phrasing, I used TF/IDF analysis to extract keywords for each business. This was done by creating a document of reviews for each restaurant at each star rating level and then performing TF/IDF analysis on the document set. I used sklearn.feature_extraction.text.TfidfVectorizer to get word scores. As I had already cleaned, tokenized, and phrased my review text, I needed to use a dummy function for the preprocessor and tokenizer in TfidfVectorizer. To my surprise, the TF/IDF analysis ran pretty quickly. See the vectorizer object instantiation below:

from sklearn.feature_extraction.text import TfidfVectorizer

# The dummy function simply returns the document unchanged, since the
# reviews were already cleaned, tokenized, and phrased
def dummy_function(doc):
    return doc

# Instantiate TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                   tokenizer=dummy_function,
                                   preprocessor=dummy_function,
                                   token_pattern=None)
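
A sketch of how the fitted vectorizer might then be used to pull top-scoring keywords per document. The documents list and the top-3 cutoff are my own illustration, not the project's actual pipeline:

# documents: one pre-tokenized review document per restaurant/star-rating pair
documents = [["complimentary_chips_salsa", "great", "service"],
             ["tasted_like_cardboard", "cold", "service"]]

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
terms = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in older sklearn

# Top 3 keywords for the first document, ranked by TF/IDF score
row = tfidf_matrix[0].toarray().ravel()
top_keywords = [terms[i] for i in row.argsort()[::-1][:3]]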

After extracting keywords, I decided to generate word embedding models in order to measure similarity between words. This was done using a gensim.models.Word2Vec model.

from gensim.models import Word2Vec

# Words to measure similarity against
positive = ['beer_selection', 'draft', 'service']
# Number of words to put into word_list
n_words = 10

# Train word embeddings on the tokenized review documents
# (note: 'size' is named 'vector_size' in gensim 4.x)
model = Word2Vec(doc_list,
                 size=20,
                 window=5,
                 min_count=1,
                 workers=4)
model.train(doc_list,
            total_examples=model.corpus_count,
            epochs=10)

# The n_words words most similar to the positive words
word_list = [w[0] for w in model.wv.most_similar(
    positive=positive, topn=n_words)]

This will return a list of words most similar to the positive words passed into the model. Since the word vectors have 20 dimensions, dimensionality reduction is needed in order to view the word distances graphically. I used t-Distributed Stochastic Neighbor Embedding (t-SNE) from sklearn.manifold.TSNE. The resulting plot for word similarity, using 5-star reviews to train the model, is shown below:

The code for the TSNE plot is provided in my repository linked to below.
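
For orientation, here is a rough sketch of that reduction step (plotting details omitted; the perplexity value and random_state are my own choices, not necessarily those used in the project):

from sklearn.manifold import TSNE
import numpy as np

# Collect the 20-dimensional vectors for the most-similar words
vectors = np.array([model.wv[w] for w in word_list])

# Reduce to 2 dimensions; perplexity must be smaller than the number of points
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
coords = tsne.fit_transform(vectors)
# coords[:, 0] and coords[:, 1] can then be scattered and labeled with word_list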

I really enjoyed the web scraping and word analysis portions of this project. Beyond the parts discussed in this blog post, I also fit a series of models to the review texts to predict star ratings. I may create a separate blog post that discusses those models; in the meantime, all of the project Jupyter notebooks and Python files are available to view in the GitHub repo here.
