Predicting the ambiance of restaurants using only the wording of Yelp reviews

Steven Tang
Fun with data and stats
16 min read · Mar 22, 2017

Businesses on Yelp are often tagged with certain attributes to help users better understand what to expect when walking into a restaurant or other establishment. Reviewers help Yelp tag the business in their reviews. That led me to wonder whether these manually tagged attributes of a restaurant could be inferred by the text of the reviews themselves, since both reflect the same reviewers’ opinions. If this is possible, then Yelp could conceivably predict things about restaurants that users haven’t tagged yet, or perhaps provide suggestions to reviewers.

To answer this question, I used data from the Yelp Dataset Challenge.

Framing the Problem

There are tons of attributes that Yelp stores about businesses. I’m going to consider only the “Ambience” and the “Good For” attributes of businesses that are categorized as Restaurant, Food, or Nightlife. “Ambience” can take on values of [Divey, Hipster, Casual, Touristy, Trendy, Intimate, Romantic, Classy, Upscale], and “Good For” can take on values of [Breakfast, Brunch, Lunch, Dinner, Late Night, Dessert].

Yelp’s “Ambience” and “Good For” attributes share an important trait: users can check off a business to have multiple values of each attribute. A restaurant can be both “Hipster” and “Casual”, and it might be good for both “Lunch” and “Dinner”.

A sample of the options presented to Yelp reviewers

So now we can state our problem: using the text of the reviews for a business, predict which labels from the “Ambience” and “Good For” attributes this business should have.

Using words to perform classification

Our crucial idea here is that people probably say similar things about restaurants that share an “Ambience” or a “Good For” value. Perhaps a business is more likely to have a set of labels if its reviews are similar to the reviews of other businesses that share the same labels.

How do we know when reviews are similar? I decided to use an approach called k-nearest neighbors. The idea is that an unlabeled instance can be classified by observing the labeled instances closest to it, that is, its nearest neighbors. The diagram below demonstrates a simple implementation of the algorithm to classify the unknown red star. If we look at its 3 closest neighbors, the unknown instance is most similar to a blue hexagon since a majority of those neighbors are hexagons; comparing it to 5 neighbors may lead us to believe that it’s a green square.
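
As a quick sketch of this idea in R (with made-up 2-D points standing in for the diagram’s shapes, not real review data):

```r
# Toy k-nearest-neighbors majority vote on made-up 2-D points
train_x <- matrix(c(1, 1,
                    1.2, 0.8,
                    3, 3,
                    3.2, 2.9,
                    2.8, 3.1), ncol = 2, byrow = TRUE)
train_y <- c("hexagon", "hexagon", "square", "square", "square")
unknown <- c(1.5, 1.2)  # the "red star" we want to classify

# Euclidean distance from the unknown point to every labeled point
dists <- sqrt(rowSums((train_x - matrix(unknown, nrow(train_x), 2, byrow = TRUE))^2))

k <- 3
nearest <- order(dists)[1:k]                  # indices of the k closest neighbors
names(which.max(table(train_y[nearest])))     # majority vote among those neighbors
```

With k = 3 the vote here goes to “hexagon”; with k = 5 it flips to “square”, mirroring the behavior described above.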

To formulate a definition of nearness between Yelp reviews, we need to answer two questions:

  1. How do we assign a quantitative value to text data?
  2. What metric do we use as a distance between our quantitative representations of texts?

Definitions

Document: a body of text that we’d like to compare to other bodies of text. For our model, I combined all of a single business’ Yelp reviews into one large review. This way, comparing one document to another is equivalent to comparing businesses to one another.

Corpus: the collection of all documents in our analysis. For our model, the corpus is our set of Yelp businesses.

Term: a unique word in a document.

Vectorizing reviews

Before we can measure the distance between reviews, we need to turn them into quantitative values upon which we can apply a distance function. There are many ways to do this with text data. We’re going to use an approach that transforms reviews into a collection of terms and gives each term a score.

Instead of a simple term count, we’ll assign a statistic called tf-idf to each term in our review. Tf-idf measures the importance of a term to a document by multiplying term frequency (tf) with inverse document frequency (idf), which measures the rarity of the term across our entire corpus.

As a result, we assign a higher score to rare words that may be more important to determining the meaning of the document. In the context of Yelp reviews, it makes sense that we’d want to assign the term “burrito” a higher score than the term “eat”, even though the frequency of “eat” is much higher.

We can calculate the tf-idf score for each term in our document, giving us a collection of scores that we can turn into a T-dimensional vector, where T is the number of unique terms in our corpus. We’ve transformed our document into a quantitative representation that allows us to measure distance between documents.
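
Here’s a minimal sketch of that scoring in R, using a toy term-count matrix (one common tf-idf variant; the real analysis relies on quanteda’s implementation, which may weight terms slightly differently):

```r
# Toy document-term counts: rows are documents (businesses), columns are terms
counts <- matrix(c(4, 1, 0,
                   2, 0, 3,
                   5, 2, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("eat", "burrito", "bagel")))

tf    <- counts / rowSums(counts)                  # term frequency, normalized by document length
idf   <- log(nrow(counts) / colSums(counts > 0))   # inverse document frequency: rarer terms score higher
tfidf <- sweep(tf, 2, idf, `*`)                    # scale each term's frequency by its idf
round(tfidf, 3)
```

Note that “eat” appears in every toy document, so its idf (and therefore its tf-idf score) is zero, while “burrito” and “bagel” keep meaningful weight.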

Distance between document vectors

We’ll measure distance by using a metric called cosine similarity, which is the cosine of the angle between our review vectors. For our purposes, the angle between vectors is a good measure of document similarity since the direction the vectors point in matters more than their magnitude. For instance, document A should be perfectly similar to document B if document B is simply two concatenated copies of document A. (We’re going to normalize our term frequencies anyway, so the magnitude issue won’t apply here, but we might want to calculate term frequency another way for other analyses.) Perfectly similar documents have a similarity of 1, and documents that share no terms have a similarity of 0; since tf-idf scores are non-negative, the similarity never drops below 0. This will be our distance metric.
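
Written out, the similarity between two T-dimensional review vectors A and B is the standard cosine similarity:

```latex
\text{similarity}(A, B) = \cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{T} A_i B_i}{\sqrt{\sum_{i=1}^{T} A_i^2} \, \sqrt{\sum_{i=1}^{T} B_i^2}}
```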

Similarity between document A and document B can be calculated using the dot product (Wikipedia)
cos(𝛳) is the similarity between review vectors

Predicting labels with nearest neighbors

With our quantified reviews and distance function in hand, we’re ready to find a document’s nearest neighbors. We combine our document vectors into an m × n document-term matrix of tf-idf scores, where each row of the matrix represents a single labeled document and each column represents a term. The unlabeled document can similarly be represented as an n × 1 vector of its tf-idf scores. Taking the cosine similarity of the matrix with this vector gives us an m × 1 vector of similarities between our unlabeled document and all m labeled documents. The labeled documents with the highest similarity scores are the unlabeled document’s nearest neighbors.
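
A sketch of that computation in R, with a small random tf-idf matrix standing in for the real one (the variable names here are illustrative, not from the actual analysis code):

```r
set.seed(1)
# Toy m x n tf-idf matrix: m labeled documents (rows) by n terms (columns)
labeled   <- matrix(runif(5 * 4), nrow = 5, dimnames = list(paste0("biz", 1:5), NULL))
unlabeled <- runif(4)   # tf-idf vector for the new, unlabeled document

# L2-normalize so that a plain dot product equals cosine similarity
labeled_norm   <- labeled / sqrt(rowSums(labeled^2))
unlabeled_norm <- unlabeled / sqrt(sum(unlabeled^2))

similarity <- labeled_norm %*% unlabeled_norm    # m x 1 vector of cosine similarities
k <- 3
neighbors <- order(similarity, decreasing = TRUE)[1:k]
rownames(labeled)[neighbors]                     # the k most similar labeled businesses
```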

Now that we can find our nearest neighbors, we can use that information to figure out which labels to apply to our unlabeled business. If we’re observing our 5 nearest neighbors and find that 3 of them are “Casual”, 2 of them are “Lunch”, and 1 of them is “Upscale”, how do we turn that into a prediction for the label set of our unlabeled instance?

Essentially, we will give our unlabeled document a label if the probability of having the label is greater than the probability of not having the label, and vice versa. To calculate these probabilities, we need to know (1) the prevalence of the label in our data (over 2/3 of businesses are labeled “Casual” while only about 1/65 are labeled “Upscale”) and (2) how likely it is that the unlabeled instance has a certain number of neighbors with a certain label, given it actually has that label (it’s much more meaningful when we see that some of our neighbors have the “Upscale” label than when we see that our neighbors have the “Casual” label).

Once we have these probabilities, we’re ready to make predictions for the labels of an unlabeled Yelp business.

For more details on this model, see the appendix.

Model selection and evaluation

We have one more question to answer: how many neighbors K should we take into account in our model? We’d like to select the model that minimizes some error function(s). To do this, we’ll split our data into training/test sets and use 10-fold cross-validation to test different values of K.

There are a few error metrics we can consider for multi-label predictions. We’ll evaluate hamming loss (the rate at which we misclassify labels), as well as coverage, one-error, and average precision (measurements of how well we can rank our label predictions based on our calculated probabilities). For more details on the error metrics, see the appendix.

Depending on the goal of our model, we may want to optimize for a different error type. With model selection and cross-validation, we have all the pieces of our model. It’s time to apply it to actual data.

Methodology

I used R to clean up the data and write the functions to build the model.

Preprocessing leaves us with 15,474 businesses and 2,834 terms as model features.

For more on the methodology, see the appendix.

Results

To put our results into context, it’s helpful to visualize our label distributions.

Across our 15,474 businesses, we have a total of 37,970 labels, giving us 2.4538 labels per business on average.

How do we know whether our results are any good? We should compare our model to classifiers we could create without any model. The “Casual” label is most common with 10,777 instances. We could build a naive classifier that classifies all instances as having only the “Casual” label. Since over half of our observations have “Lunch” and “Dinner” labels, we can also tell our naive classifier to label every instance with these two labels. This naive classifier produces a hamming loss of 0.1195. We can also create a naive function that ranks our labels for each review only by their prior probability (i.e., as they’re ordered in the chart above). This naive ranking gives us a coverage of 4.8143, a one-error of 0.3035, and an average precision of 0.7414. The errors produced by our no-model naive classifiers serve as benchmarks to test the performance of our model.
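
To make the baseline concrete, here’s roughly how that naive hamming loss would be computed in R (with a tiny made-up true-label matrix; the real one has 15,474 rows and 15 label columns):

```r
labels <- c("Casual", "Lunch", "Dinner", "Upscale")
# Toy true-label matrix: rows are businesses, columns are labels (TRUE = has the label)
Y <- matrix(c( TRUE,  TRUE,  TRUE, FALSE,
               TRUE, FALSE,  TRUE, FALSE,
              FALSE, FALSE,  TRUE,  TRUE),
            nrow = 3, byrow = TRUE, dimnames = list(NULL, labels))

# Naive classifier: label every business "Casual", "Lunch", and "Dinner"
pred <- matrix(rep(c(TRUE, TRUE, TRUE, FALSE), each = nrow(Y)),
               nrow = nrow(Y), dimnames = list(NULL, labels))

mean(pred != Y)   # hamming loss: the fraction of (business, label) cells we get wrong
```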

Let’s take a look at the top features in our model, that is, the terms in our document-term matrix that have the largest tf-idf scores summed over all of our documents. These features are ostensibly the most meaningful in our corpus of reviews. The raw counts of terms in our corpus are shown alongside in the table below for comparison:

We can throw together some word clouds to get a bigger picture:

Words with high tf-idf scores are largely specific types of food (e.g. taco, bagel, lobster), while the terms with the largest raw counts are more generic words that could apply to any business. If our model has predictive power, then this suggests that the types of food written about in the reviews give us insight into the ambience of a business and what meal services it’s good for.

The top tf-idf scored features of businesses with the “Breakfast” and “Dinner” labels look quite different:

We train models on all values of K from 1 to 30 and compute errors to pick the best K. The smallest cross-validated errors are shown below:

Good news — our model has more predictive power than our naive classifiers in every error metric. K = 13 gives us our minimal hamming loss of 0.08184, a 31.51% reduction in misclassifications over our naive classifier. These results show that the words used in Yelp reviews do encode information about the attributes of businesses. Our Yelp Elite friends are conveying more information about a restaurant than perhaps they even realize.

We can see that our error rates level off as we increase the number of neighbors considered.

Our naive classifier benchmark hamming error is 0.1195
Our naive classifier benchmark average precision is 0.7414

We can now pick K for our model based on which error type we’d like to minimize.

Predicting San Francisco

Let’s try applying our model to predict the labels for some of San Francisco’s famed establishments: Gary Danko (“Classy”, “Upscale”, “Dinner”), Straw (“Hipster”, “Casual”, “Dinner”), Taqueria Cancun (“Casual”, “Lunch”, “Dinner”, “Late Night”), Tommy’s Joynt (“Casual”, “Lunch”, “Dinner”), and Lipo Cocktail Lounge (“Divey”). Note our model wasn’t trained on any businesses from SF.

I scraped all the reviews for these businesses from Yelp’s website (fun fact: the cumulative word count of Gary Danko’s reviews is over 800,000, making it longer than the Bible) and applied our 13-neighbor model to each restaurant’s reviews. Here’s what we predict:

Gary Danko: “Dinner”. We missed out on the rare “Classy” and “Upscale” labels, but we assigned them pretty high probabilities relative to other labels at 41.43% (compared to 3.5% prior probability) and 22.98% (1.5% prior) respectively. “Romantic” had a 29.93% chance, and everything else had less than 6% chance. It’s worth noting that we did well in assigning “Casual” only a 5.05% chance (69.64% prior).

Straw: “Casual”, “Lunch”, “Breakfast”. “Hipster” is a rare label that’s tough to predict — we weren’t even close, assigning it a 1% chance. I actually think we assigned “Breakfast” and “Lunch” correctly here, since Straw is best known for its brunch service. In this case, we seem to have predicted more appropriate labels than the “Good For” shown on Yelp. The model felt strongly about “Breakfast” at 77.96% chance (13.02% prior) and weakly about “Dinner” with 25.33% chance (53.82% prior). A probability-based model like this could be useful to catch errors.

Taqueria Cancun: “Casual”, “Lunch”. We missed the most interesting label, “Late Night”, though we went in the right direction, assigning it 27.15% probability (8.9% prior).

Lipo Cocktail Lounge: “Casual”, “Lunch”, “Dinner”. This was just completely wrong, and I think I know why. The top term of Chinatown icon Lipo Cocktail Lounge is “Chinese”, which is overwhelmingly associated with casual Chinese food places. Our model had no chance.

Tommy’s: “Casual”, “Dinner”, “Lunch”. We got all the labels correct here, and that’s partly because we predicted the most common labels.

Shortcomings

Let’s take a closer look at how our model can go wrong:

The Wine Mill in Peninsula, OH has been labeled “Trendy” and “Romantic” by Yelp users, but our model thought it was only “Casual”. A sample of the reviews:

Great little find in Peninsula! Amazing atmosphere inside and out - they really spent time with the decor! The staff was very nice and the (I think) own stopped by our table to say hi. We ordered 3 different dips (chicken, artichoke, and crab) and the cheese board. We also had the house cab and Pinot...The Wine Mill offers 2 house made wines within each category. They have a great size wine list for a small neighborhood establishment. They offer a great mix of sweet and dry wines. You can get most wines as a Taste, Glass or Bottle...

The top 5 features for this business are “wine”, “slider”, “mill”, “jar”, and “artichok”. Strangely enough, “wine” isn’t in the top 100 features of businesses labeled “Romantic” or “Trendy”, while it is in the top 100 features of “Casual”. We don’t have many data points for “Romantic” and “Trendy” businesses, so we may simply not have encountered enough reviews that talked about wine. For rarer labels like “Romantic”, we really need more data to predict them accurately. Furthermore, our model can’t understand that this is a trendy wine bar rather than just another restaurant that serves wine. If we had taken into account sequences of words instead of just looking at single words, terms like “wine list” or “house made wines” might have helped us make more accurate predictions.

We can improve our model in several areas:

Feature engineering — we did not perform much feature engineering beyond tf-idf scoring our reviews, word stemming, and removing sparse terms. We could try other techniques to extract more predictive features out of our bodies of text. Word2Vec might be a document vectorization technique worth exploring. We could try using n-grams, sequences of words, to add more meaning-rich features. Furthermore, we didn’t test whether the feature engineering we did perform actually improves or degrades model performance.

Imbalanced data — the label distribution of the data is imbalanced, heavily favoring the top 3 labels. Some of our labels are pretty rare, which makes training the model more difficult for those labels. If we had more balanced labeled examples, the model would learn better.

Validation set sampling — we sampled our cross-validation sets uniformly, which may cause our CV splits to contain imbalanced proportions of labels relative to our total dataset. Normally we’d use stratified sampling to generate representative training/test folds, but standard stratification doesn’t map cleanly onto multi-label data. More sophisticated techniques have been proposed for sampling multi-label data for cross-validation, but for simplicity, I just use standard cross-validation here. Our error estimates may be somewhat off as a result.

You can take a look at the code used for this analysis here.

Appendix — Multi-label KNN algorithm

K-nearest neighbors is commonly used for classification problems in which we assign an unknown instance to a single class. To extend the technique to multi-label classification, in which the unknown instance can simultaneously belong to any number of classes, we estimate probabilities for each label from the prior probability of the label and the posterior probability of observing a certain number of neighbors with that label. We give our unlabeled document a label if the probability of having the label is greater than the probability of not having it.

We use a Bayesian technique that makes use of a maximum a posteriori estimator put forth by this paper to calculate the probabilities for each label.

Suppose we have an unlabeled test document t and a set of labels 𝒴. Let C(l) be a function that counts how many neighbors of t have a label l, l∈𝒴. Let H(1, l) denote the event that t has label l and H(0, l) be the event that t does not have label l. Let E(j, l) (j ∈ {0, 1, . . . , K}) denote the event that, among the K nearest neighbors of t, there are exactly j instances which have label l. Let y(l) be an indicator function that takes on a value of 1 if t has label l and 0 otherwise.

Then, for each l∈𝒴, the entry of the label vector for our test document t is the value of b ∈ {0, 1} that maximizes the conditional probability of t having label l (b = 1) or not having it (b = 0), given that exactly C(l) of its neighbors have label l:
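
In symbols (following the ML-kNN formulation):

```latex
\vec{y}_t(l) = \arg\max_{b \in \{0, 1\}} P\big(H(b, l) \mid E(C(l), l)\big), \qquad l \in \mathcal{Y}
```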

We can compute this conditional probability by using Bayes’ rule to transform the equation:
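
Applying Bayes’ rule and dropping the denominator, which is the same for both values of b:

```latex
\vec{y}_t(l) = \arg\max_{b \in \{0, 1\}} \frac{P\big(H(b, l)\big) \, P\big(E(C(l), l) \mid H(b, l)\big)}{P\big(E(C(l), l)\big)}
             = \arg\max_{b \in \{0, 1\}} P\big(H(b, l)\big) \, P\big(E(C(l), l) \mid H(b, l)\big)
```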

Now our problem is reduced to finding the priors P(H(b, l)), (l ∈ 𝒴, b ∈ {0,1}) and the posteriors P(E(j,l)|H(b,l)), (j ∈ {0, 1, . . . , K}). The priors can be calculated simply by dividing the number of documents with label l by the total number of documents. I implemented the algorithm described in the paper to calculate the posterior probabilities. The information gain in our model is embedded in the posteriors of observing our neighbors having labels. Note that these probabilities are calculated using our training set, the set of documents that are already labeled.
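
As a sketch in R of how these quantities can be estimated for a single label (following the smoothing used in the ML-kNN paper; the function and variable names here are mine, not from the actual analysis code):

```r
# has_label : logical vector, does training document i have label l?
# delta     : integer vector, how many of document i's K nearest training
#             neighbors have label l? (computed within the training set)
# s         : smoothing constant, as in the ML-kNN paper
estimate_label_probs <- function(has_label, delta, K, s = 1) {
  m <- length(has_label)

  # Prior: how prevalent is the label overall?
  prior_1 <- (s + sum(has_label)) / (2 * s + m)
  prior_0 <- 1 - prior_1

  # Posteriors: P(exactly j of K neighbors have the label | document has / lacks it)
  c1 <- sapply(0:K, function(j) sum(delta == j &  has_label))
  c0 <- sapply(0:K, function(j) sum(delta == j & !has_label))
  post_1 <- (s + c1) / (s * (K + 1) + sum(c1))
  post_0 <- (s + c0) / (s * (K + 1) + sum(c0))

  list(prior_1 = prior_1, prior_0 = prior_0,
       post_1 = post_1, post_0 = post_0)   # posteriors indexed by j = 0..K
}

# For a test document whose C neighbors have label l, predict the label if
#   prior_1 * post_1[C + 1] > prior_0 * post_0[C + 1]   (the +1 is because j starts at 0)
```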

So for each unlabeled test instance t, we can now compute a vector of length ‖𝒴‖ whose entry for label l ∈ 𝒴 is 1 if t is predicted to have that label and 0 otherwise.

Appendix — Multi-label error metrics

Hamming loss measures the percentage of labels that are misclassified. It’s a good measure of overall model performance. Hamming loss ranges from 0 to 1 and we want it to be small, optimally 0.
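
Formally (the standard definition, using the notation explained just below):

```latex
\mathrm{hloss}(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{Q} \, \big| h(x_i) \,\Delta\, Y_i \big|
```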

where p is the size of our test set, Q is the size of our label set, h(·) is our multi-label classifier predictor, Y is the true set of labels, and 𝚫 denotes symmetric difference of sets.

Our model also allows us to construct a ranked set of labels in which labels with a larger probability of occurring P(H(1,l) | E(j,l)) outrank those with a lower one. Let f(·) be the function that outputs these probabilities and rank(·) be the function that outputs the actual ranked position of a label.

Coverage tells us how far we have to go down the ranked list of labels before we find all the true labels for our instance. It’s useful when we want to get all the labels correct, even at the expense of some false positives. We want coverage to be small. The optimal coverage is the average number of labels per instance.
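
One way to write this, consistent with the description above (some formulations subtract 1 so that the optimum is zero-based):

```latex
\mathrm{coverage}(f) = \frac{1}{p} \sum_{i=1}^{p} \max_{l \in Y_i} \mathrm{rank}(x_i, l)
```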

One-error is the rate at which the top-ranked predicted label is not in the set of true labels for our instance. It may be especially important if we want to have high confidence that one label prediction is correct. It ranges from 0 to 1 and we want one-error to be small, optimally 0.
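
Formally:

```latex
\text{one-error}(f) = \frac{1}{p} \sum_{i=1}^{p} ⟦ \arg\max_{l \in \mathcal{Y}} f(x_i, l) \notin Y_i ⟧
```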

where ⟦E⟧ is an indicator function that takes on 1 if E is true and 0 otherwise.

Average precision is a metric that evaluates ranking performance. For each true label, it measures the fraction of labels ranked at or above it that are also true labels, averaged over all true labels and test instances. It ranges from 0 to 1, and we want it to be closer to 1.
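
Formally:

```latex
\mathrm{avgprec}(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{l \in Y_i}
  \frac{\big| \{\, l' \in Y_i : \mathrm{rank}(x_i, l') \le \mathrm{rank}(x_i, l) \,\} \big|}{\mathrm{rank}(x_i, l)}
```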

Appendix — Methodology

I used R to clean up the data and write the functions to build the model. Since we may have to hold tens of GBs of data objects in memory, I ran R on AWS EC2 instances. It’s pretty straightforward to install RStudio on EC2 and copy the dataset onto our instances over ssh. My R code runs single-threaded, and I didn’t want to rewrite it to parallelize, so I opted to train the model simultaneously on multiple EC2 instances. Since I couldn’t make use of more processing power, I chose the r4.xlarge instance type to optimize for memory.

The Yelp dataset contains 4.1M reviews and 86K businesses. To read in Yelp’s json data and turn it into an R data frame, I used R’s jsonlite package. I combined all the reviews for a single business into a single document. This way, each business can be treated as a single document, and the collection of businesses is our corpus. To ensure our documents have reasonable length, I removed all businesses that have fewer than 20 reviews. This preprocessing leaves us with a total of 15,474 documents in our corpus.

Using the quanteda package, I created a document-term matrix, removed stop words, performed stemming so that only root words are treated as unique, and removed terms above a certain sparsity threshold, leaving 2,834 terms as model features.
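
A sketch of that preprocessing with quanteda (assuming a recent version of the package; `reviews_by_business` is a hypothetical character vector with one combined review document per business, and the trimming threshold is illustrative):

```r
library(quanteda)

# reviews_by_business: one long character string per business (hypothetical input)
toks <- tokens(reviews_by_business, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))   # drop stop words
toks <- tokens_wordstem(toks)                  # keep only root words (stems)

dtm <- dfm(toks)                                                  # document-term matrix of counts
dtm <- dfm_trim(dtm, min_docfreq = 0.01, docfreq_type = "prop")   # drop very sparse terms
dtm <- dfm_tfidf(dtm)                                             # re-weight counts as tf-idf scores

dim(dtm)   # documents x terms remaining after trimming
```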

For each cross-validation split, I trained our model on the training set, then computed a cosine similarity matrix for the test set. This matrix gives us each test document’s nearest neighbors from the training set. Then for each label, we calculate the probability of the document having or not having the label using our priors and posteriors for that particular label and value of K. We generate predictions like this for each value of K that we’d like to test, then average the errors across CV folds to arrive at the error values for each K.
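
In outline, the cross-validation loop looks something like this (a sketch only: `fit_mlknn` and `predict_labels` are hypothetical stand-ins for the prior/posterior estimation and prediction steps described above, and `label_matrix` is the businesses-by-labels indicator matrix):

```r
set.seed(42)
n_folds <- 10
folds <- sample(rep(1:n_folds, length.out = nrow(dtm)))   # assign each business to a fold
ks <- 1:30

cv_error <- matrix(NA, nrow = n_folds, ncol = length(ks))  # hamming loss per fold and per K

for (f in 1:n_folds) {
  train_idx <- which(folds != f)
  test_idx  <- which(folds == f)
  for (k in ks) {
    # Hypothetical helpers standing in for the steps described above
    model <- fit_mlknn(dtm[train_idx, ], label_matrix[train_idx, ], K = k)
    pred  <- predict_labels(model, dtm[test_idx, ])
    cv_error[f, k] <- mean(pred != label_matrix[test_idx, ])
  }
}

colMeans(cv_error)   # average error for each K; pick the K that minimizes it
```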
