Unveiling the Secrets of Boston Restaurants: A Feature Importance Analysis

Nate Fogg
14 min read · Sep 7, 2023
Source: https://commons.wikimedia.org/wiki/File:Bostonstraight.jpg

Introduction

Boston, Massachusetts — a city with a rich history, outstanding universities, successful sports teams, and mouth-watering food! When it comes to dining in this city, there’s no shortage of amazing options. However, not all restaurants are created equal. If you’re like most people, you’ll turn to Yelp for recommendations. Yelp is a popular review website containing valuable data that influences dining decisions. When it comes to making choices, we are curious about what others have to say. The more consensus there is, the more likely we are to pay attention. It’s no surprise that the food industry closely monitors customer feedback. So then, the question is: What factors make some Boston restaurants stand out in terms of their ratings and reviews on Yelp?

Yelp Fusion

Source: https://www.yelp.com/

For this project, I used Yelp’s Fusion API, which “allows you to get the best local content and user reviews from millions of businesses around the world. You have the ability to search for businesses by keyword, category, location, price level, and phone number. Get rich business data, such as name, address, phone number, photos, Yelp rating, price levels, and hours of operation.” [Source: Yelp Fusion API Documentation]

However, there are some important restrictions:

  • Daily Limit: I was allowed to make up to 500 API calls per day.
  • Data Lifespan: I couldn’t store the data for more than 24 hours.
  • Data Sharing: I could only share aggregated data, not individual business details.
  • Noncommercial Use: I couldn’t use this data for commercial purposes.
  • Search Limitations: When searching for businesses, the API returns up to 50 at a time and doesn’t guarantee exact matches. However, by adjusting the offset, I accessed up to 1,000 businesses per zip code.

See Yelp’s API Terms of Use for the full details.

These limitations added complexity to my project, but they didn’t stop me from making the most of this resource!

Data Collection

To assemble a comprehensive dataset of restaurants across Boston’s neighborhoods, I employed Yelp’s Business Search endpoint. This endpoint provides suggested businesses based on various inputs. Given the wide array of search criteria available, my primary goal was to ensure a representative sample spanning all of Boston’s neighborhoods.

To achieve this, I used the zip codes associated with Boston’s official neighborhoods. I wrote a script that made API calls for each zip code, incrementing the offset on each iteration of a loop. As a fail-safe, the script skipped any returned business that did not meet the search criteria and moved on to the next zip code once the offset exceeded 950.

It’s worth noting that this approach, while effective, isn’t flawless. The API search results sometimes contained duplicates. Nonetheless, through these collection efforts, I successfully retrieved a total of 1,714 unique restaurants from the API.
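A minimal sketch of how such a collection loop might look. It is not the script used for this project: the function names are hypothetical, the "criteria" check is assumed to be a simple zip code match, and the `categories` filter is an assumption. The endpoint URL, the `location`, `limit`, and `offset` parameters, and the bearer-token header follow Yelp's Business Search documentation.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.yelp.com/v3/businesses/search"
PAGE_SIZE = 50      # the API returns at most 50 businesses per call
MAX_OFFSET = 950    # last offset requested before moving to the next zip code


def fetch_page(api_key, zip_code, offset):
    """Request one page of restaurants for a zip code (needs a real API key)."""
    params = urllib.parse.urlencode({
        "location": zip_code,
        "categories": "restaurants",  # assumed filter, not stated in the article
        "limit": PAGE_SIZE,
        "offset": offset,
    })
    request = urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("businesses", [])


def collect_restaurants(zip_codes, get_page):
    """Page through every zip code and keep each business id only once."""
    unique = {}
    for zip_code in zip_codes:
        for offset in range(0, MAX_OFFSET + 1, PAGE_SIZE):
            page = get_page(zip_code, offset)
            if not page:  # no more results: move on to the next zip code
                break
            for business in page:
                # Assumed fail-safe: skip results outside the requested zip code.
                if business.get("location", {}).get("zip_code") != zip_code:
                    continue
                # Duplicates across pages or zip codes collapse onto one id.
                unique.setdefault(business["id"], business)
    return list(unique.values())
```

With a valid key, the two pieces could be combined as `collect_restaurants(zip_codes, lambda z, o: fetch_page(API_KEY, z, o))`. Passing the page fetcher as an argument keeps the paging and de-duplication logic testable without spending any of the daily API calls.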

This approach offered a substantial and reasonably representative overview of Boston’s restaurant landscape, allowing for an analysis of the city’s dining scene.

Figure 1: A kernel density estimate map depicting the distribution of restaurant locations in Boston, sourced from Yelp Fusion.

Feature Engineering

Below is an example response from the Business Search endpoint for a “delis” search:

{
  "total": 1316,
  "businesses": [
    {
      "rating": 4.5,
      "price": "$$",
      "phone": "+14154212337",
      "id": "molinari-delicatessen-san-francisco",
      "categories": [
        {
          "alias": "delis",
          "title": "Delis"
        }
      ],
      "review_count": 910,
      "name": "Molinari Delicatessen",
      "url": "https://www.yelp.com/biz/molinari-delicatessen-san-francisco",
      "coordinates": {
        "latitude": 37.7983818054199,
        "longitude": -122.407821655273
      },
      "image_url": "http://s3-media4.fl.yelpcdn.com/bphoto/6He-NlZrAv2mDV-yg6jW3g/o.jpg",
      "location": {
        "city": "San Francisco",
        "country": "US",
        "address2": "",
        "address3": "",
        "state": "CA",
        "address1": "373 Columbus Ave",
        "zip_code": "94133"
      }
    },
    // ...
  ]
}

With a limited amount of data for each location at our disposal, the key to extracting valuable insights lies in crafting meaningful features. Fortunately, we can harness the provided data and engineer additional features to enhance the analysis.

1. Neighborhood Location: I assigned each restaurant a neighborhood based on its zip code. This allows us to explore the geographical distribution of restaurants across Boston.

2. Price Encoding: The price column, representing the price level of each restaurant, was encoded. This transformation facilitates comparisons and insights related to pricing.

3. Presence Features: I introduced several binary features, including:

  • has_image: Indicates if a restaurant has at least one image on its Yelp page.
  • has_phone: Reflects whether a restaurant has a phone number listed on its Yelp page.
  • has_st_add: Conveys whether a restaurant has a street address listed on its Yelp page.

These features help us understand the completeness of business profiles.

4. Cuisine Encoding: To capture the diversity of cuisines in the dataset, I created encoded columns for each cuisine (alias) tag present. This enables us to analyze the popularity of different cuisines in Boston.

5. Balanced Rating Score (BRS): Using ratings alone as a measure of success can be misleading. To address this issue, I introduced the BRS. It accounts for the problem of favoring restaurants with a small number of high ratings over those with many slightly lower ratings. The BRS is calculated by normalizing and weighting the rating and review count columns, providing a more accurate measure of success. The possible range is between 0 and 1, with 1 being the best. This score serves as our target variable for feature importance analysis.
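The feature steps above can be sketched for a single Business Search record as follows. This is an illustration under stated assumptions, not the project's code: the function names, the `cat_` column prefix, and the use of 0 for a missing price are hypothetical. The article does not give the exact BRS formula, so `balanced_rating_score` is one plausible reading of "normalizing and weighting": ratings are scaled from Yelp's 1 to 5 range, review counts are log-scaled against the dataset maximum, and the two are mixed with an assumed weight of 0.5.

```python
import math

PRICE_LEVELS = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}


def engineer_features(business, all_aliases):
    """Turn one Business Search record into a flat dict of model features."""
    features = {
        # 0 marks a missing price ("No Price"); an assumed convention.
        "price_level": PRICE_LEVELS.get(business.get("price"), 0),
        "has_image": int(bool(business.get("image_url"))),
        "has_phone": int(bool(business.get("phone"))),
        "has_st_add": int(bool(business.get("location", {}).get("address1"))),
    }
    # One binary column per cuisine alias seen anywhere in the dataset.
    aliases = {category["alias"] for category in business.get("categories", [])}
    for alias in all_aliases:
        features[f"cat_{alias}"] = int(alias in aliases)
    return features


def balanced_rating_score(rating, review_count, max_reviews, rating_weight=0.5):
    """Hypothetical BRS in [0, 1]; the real weighting is not given in the text."""
    rating_norm = (rating - 1.0) / 4.0  # Yelp ratings run from 1 to 5
    # Log scaling (an assumption) keeps a few huge review counts from dominating.
    reviews_norm = math.log1p(review_count) / math.log1p(max_reviews)
    return rating_weight * rating_norm + (1 - rating_weight) * reviews_norm
```

Under these assumptions, a restaurant with a 4.5 rating and 900 reviews scores higher than one with a perfect 5.0 from only two reviews, which is the behavior the BRS is meant to capture. `max_reviews` must be at least 1 for the score to be defined.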

Exploratory Data Analysis

With the dataset in hand, let’s explore.

Figure 2: The distribution of restaurants across different neighborhoods in Boston within the Yelp dataset.

Our data collection approach yielded an uneven distribution of restaurants. For example, Dorchester presents a wealth of businesses, while Downtown appears to have almost none. Boston’s neighborhoods also vary widely in geographical size, and larger neighborhoods may naturally host more businesses. Beyond size, certain neighborhoods may attract more restaurants due to factors like affordability, tourism, and demographics. These localized dynamics can influence the concentration of dining establishments.

Figure 3: Restaurant counts based on their price categories, aligned with Yelp’s pricing scale, which estimates the average menu price, with $ indicating the lowest and $$$$ representing the highest.

The majority of restaurants within the dataset fall within the moderate price range or below. A significant portion lack a listed price category. This absence doesn’t necessarily imply affordability; rather, it suggests that their Yelp page may not include pricing information. Many fast-food and smaller restaurants naturally align with the lower price ranges, offering quick and budget-friendly dining options. Conversely, upscale and gourmet establishments often require a more substantial investment, justified by an elevated dining experience and culinary craftsmanship.

Figure 4: The count of Boston restaurants within the data frame that have specific tags.

A substantial majority boast several key attributes on their Yelp pages. These attributes include a complete street address, at least one image, a phone number, and options for both delivery and pickup services. These elements collectively contribute to a comprehensive and user-friendly profile. Surprisingly, one exception emerges: a relatively small number of restaurants within the dataset have reservation options listed on their Yelp pages. It’s plausible that some restaurants may not be aware of the option to include reservation details on their business pages, highlighting the importance of proactive management of online profiles. Alternatively, the absence of reservation listings could reflect the specific composition within our dataset. Some types of dining establishments may naturally place less emphasis on reservations, aligning with different culinary styles or dining experiences.

Figure 5: The top 20 Boston cuisine categories as they appear in the data retrieved from Yelp.

Boston’s dining scene appears to be flavored with an abundance of pizza, sandwich, Italian, and seafood restaurants. However, it’s essential to acknowledge the potential overlap among these categories, where a business may embrace multiple identities. In particular, the Italian cuisine category often intersects with the pizza tag, illustrating how owners may choose to showcase their versatility and appeal to a broader range of tastes.

Figure 6: The distribution of the Balanced Rating Score (BRS) feature.

Now, let’s turn our attention to the BRS that was calculated and examine its distribution across the restaurants in our dataset.

The distribution of BRS is slightly left-skewed. The low-scoring tail pulls the mean BRS downward, making it somewhat less representative of a typical restaurant. The median is far less sensitive to that tail, which makes it a more robust measure of the dataset’s central value.
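To make the mean-versus-median point concrete, here is a small sketch on made-up scores (not values from the dataset) where most restaurants cluster high and a few trail far below.

```python
import statistics

# Hypothetical left-skewed scores: most values cluster high, a few trail low.
scores = [0.15, 0.30, 0.55, 0.60, 0.65, 0.68, 0.70, 0.72, 0.75, 0.78]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# In a left-skewed sample the low tail drags the mean below the median.
print(f"mean={mean_score:.3f}, median={median_score:.3f}")
```

The two low scores lower the mean noticeably, while the median stays close to where most of the sample sits.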

Notably, the majority of restaurants within our dataset fall within the BRS range of 0.5 to 0.8, which is indicative of generally favorable ratings and review counts. This concentration of scores in the higher range reflects positively on the overall dining selection in Boston, suggesting that many establishments enjoy positive feedback and support from customers.

Figure 7: Comparing the median balanced rating scores by neighborhood to the overall median.

Now that we have an understanding of the distribution of the BRS across all restaurants, let’s zoom in and examine how each neighborhood performs in terms of this crucial measure.

It becomes evident that certain neighborhoods in Boston exhibit distinctive patterns when it comes to BRS. Notably, the North End and Allston stand out with the highest median Balanced Rating Scores. In contrast, Dorchester and Mattapan show comparatively lower median BRS values.
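A comparison like this boils down to a grouped median set against the overall median. The sketch below uses only the standard library and invented rows; the neighborhood names come from the text, but the scores and the `median_by_group` helper are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Made-up rows for illustration only; these are not values from the dataset.
rows = [
    {"neighborhood": "North End", "brs": 0.80},
    {"neighborhood": "North End", "brs": 0.70},
    {"neighborhood": "North End", "brs": 0.75},
    {"neighborhood": "Mattapan", "brs": 0.40},
    {"neighborhood": "Mattapan", "brs": 0.50},
]


def median_by_group(rows, group_key, value_key="brs"):
    """Return the median of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {name: median(values) for name, values in groups.items()}


overall_median = median(row["brs"] for row in rows)
medians = median_by_group(rows, "neighborhood")

# Positive gap: the neighborhood's median sits above the overall median.
gaps = {name: value - overall_median for name, value in medians.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:+.2f}")
```

The same helper works for any categorical or binary column, for example grouping by a price level or by a presence flag such as `has_image`.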

For aspiring restaurateurs eyeing the Boston dining scene, it’s essential to recognize the significant influence that a neighborhood can have on a restaurant’s Yelp rating and success. These aggregated values emphasize the importance of location, even though each neighborhood hosts numerous successful businesses. The variations in median BRS highlight the underlying factors that contribute to the unique characteristics of each neighborhood.

In the feature importance analysis, I chose to exclude location to focus on actionable traits. This decision allows us to delve deeper into the specific attributes and cuisines that restaurateurs can control to enhance their restaurant’s success, irrespective of their neighborhood.

Feature Importance

Source: https://www.flickr.com/photos/calliope/4555675629

My objective in conducting this feature importance analysis was to identify the most influential variables in the dataset when it comes to determining the BRS. Initially, I considered linear regression due to its simplicity, but the categorical nature of our dataset, coupled with the potential presence of multicollinearity and outliers, led me to explore alternative approaches.

I opted for the Random Forest Regressor and the Extreme Gradient Boosting (XGBoost) Regressor. Both are tree-based ensembles that are comparatively robust to multicollinearity and outliers.

Random Forest Regressor

For the random forest regressor, I employed a two-step approach. First, I conducted a random search cross-validation using RandomizedSearchCV in sklearn. Once I identified the best combination of hyperparameters, I further refined the model’s performance with a grid search. Subsequently, I ranked the features by their calculated importance and kept the top 20. After reducing the feature set, I optimized the model once again.
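The workflow described here (random search, then a narrower grid search, then top-20 feature selection and a final retune) could be sketched as below. The data is synthetic and the hyperparameter ranges are assumptions chosen to run quickly; the article does not state the actual search spaces, so treat this as an outline of the technique rather than the project's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data: 200 "restaurants" with 30 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)
y = 0.5 + 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 0.05, size=200)

# Step 1: random search samples a few combinations from a broad space.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [25, 50, 100],
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)
best = random_search.best_params_

# Step 2: grid search exhaustively checks values near the random-search winner.
leaf = best["min_samples_leaf"]
narrow_grid = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"]],
    "min_samples_leaf": sorted({max(1, leaf - 1), leaf, leaf + 1}),
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0), narrow_grid,
    cv=3, scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# Step 3: rank features by impurity-based importance and keep the 20 highest.
importances = grid_search.best_estimator_.feature_importances_
top_20 = np.argsort(importances)[::-1][:20]

# Step 4: tune again on the reduced feature set.
final_search = GridSearchCV(
    RandomForestRegressor(random_state=0), narrow_grid,
    cv=3, scoring="neg_mean_squared_error",
)
final_search.fit(X[:, top_20], y)
print(final_search.best_params_, -final_search.best_score_)
```

Random search is cheap for locating a promising region, and the grid search then spends its budget only on that region. Note that impurity-based importances are computed on the training data, so the ranking is a description of the fitted model rather than a causal statement.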

The random forest regressor produced promising results, with a mean squared error (MSE) of 0.0143 and an R-squared value of 0.3479. While the model exhibits low error in predicting the BRS, it explains only 34.79% of the variance in the BRS. Given the complexity of factors and the limited dataset, this level of explanation is reasonable. Human behavior and experiences are multifaceted and challenging to predict accurately. Nonetheless, this model provides valuable insights.
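A low MSE alongside a modest R-squared is easier to interpret once the two metrics are related. Since R-squared equals 1 minus the ratio of MSE to the variance of the target, the reported numbers imply how much the BRS itself varies. The sketch below assumes both metrics were computed on the same evaluation set, which the article does not state explicitly; the helper functions are plain re-implementations for illustration.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# R^2 = 1 - MSE / Var(y), so the reported metrics imply the target's variance.
reported_mse, reported_r2 = 0.0143, 0.3479
implied_variance = reported_mse / (1 - reported_r2)
implied_std = implied_variance ** 0.5
print(round(implied_variance, 4), round(implied_std, 3))  # 0.0219 0.148
```

Under that assumption, always predicting the mean BRS would give an MSE of roughly 0.0219, so an MSE of 0.0143 is a real but moderate improvement. The error looks small in absolute terms largely because the BRS is confined to the 0 to 1 range.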

XGBoost Regressor

Similarly, I applied a grid search and feature selection to the XGBoost Regressor. After retaining the best features, I fine-tuned the model to the data. The XGBoost Regressor yielded a slightly higher MSE of 0.0146, and its R-squared value dropped to 0.3351, indicating that this model explains a lower proportion of the variance in BRS than the random forest regressor. For this reason, I chose the random forest’s feature importance values.

Figure 8: Comparison of the tuned random forest regressor and XGBoost regressor on their respective top 20 important features.

These regression models offer an explanation of the factors influencing success as measured by the BRS. Although they may not capture the entire picture, they provide a foundation for understanding key determinants and offer a reasonable level of predictive power given the inherent complexities of human behavior and restaurant dynamics.

Figure 9: A scatter plot depicting the predicted values vs. the actual values of the Balanced Rating Score (BRS) from the random forest regressor, alongside the distribution of residuals calculated as the actual BRS minus the predicted BRS.

An examination of the plots above reveals numerous predictions that fall above their actual values. These points represent businesses with attributes that defy the typical patterns observed in the dataset.

This phenomenon highlights a fundamental challenge encountered when attempting to model performance. It underscores the presence of characteristics that can significantly influence a BRS, ones that may not be adequately captured by our existing model. These deviations remind us of the diversity of Boston’s dining scene, where each restaurant is unique and shaped by a multitude of factors beyond what the models can encompass.

Key Findings

Figure 10: The random forest regressor feature importance visualized.

Within our modeling efforts, several features emerged as the most critical determinants of BRS. It’s important to note that a high feature importance score does not necessarily indicate that possessing it is good. To assess the impact of these features on BRS, we will need to examine the distributions of each feature to understand how they contribute to a restaurant’s success.

Figure 11: BRS Distributions across Price Ranges.

It’s evident that restaurants in higher price categories, those above the lowest ($) tier, tend to exhibit a higher median BRS compared to their more budget-friendly counterparts. However, it’s worth noting an interesting anomaly: restaurants in the most expensive price range don’t boast the highest median BRS. This observation could be attributed to several factors. One possibility is the limited sample size of expensive restaurants. Additionally, gourmet and high-end dining establishments may cater to a niche market, resulting in lower review counts due to their relative inaccessibility to the broader population.

To enhance understanding, outliers are marked in the distributions with diamond-shaped indicators. This visual representation helps identify businesses with exceptionally low BRS within each price category.
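Box plots in common plotting libraries usually flag a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The article does not say which rule produced the diamonds, so the sketch below assumes that default; the function name and sample values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Made-up BRS values for one price category.
sample = [0.10, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75]
print(iqr_outliers(sample))  # [0.1]
```

Here the quartiles are 0.61 and 0.71, so anything below 0.46 or above 0.86 is flagged, and only the 0.10 score qualifies.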

One category that stands out is ‘No Price’, which exhibits the largest range. This suggests that restaurants without listed prices on their Yelp pages encompass a wide spectrum of BRS scores, highlighting the unpredictability within this category.

Figure 12: BRS Distributions by whether the restaurant serves hotdogs or not.

While it’s important to acknowledge the presence of outliers in the ‘No’ category, a notable difference lies in the medians. Restaurants tagged with hotdogs on their Yelp pages exhibit a considerably lower median BRS compared to their counterparts. This finding suggests that if you’re considering opening a hotdog business in Boston, there’s a clear message: strive to stand out!

Figure 13: BRS Distributions by whether the restaurant’s Yelp page has at least one image or not.

The data shows that restaurants without any images tend to score significantly lower compared to their image-bearing counterparts.

Several factors may contribute to this. Firstly, the presence of images on a Yelp page can play a pivotal role in attracting customers. Images provide potential diners with a visual glimpse of the dining experience, enhancing the appeal and piquing curiosity.

Secondly, the correlation between having at least one image and higher ratings could also be attributed to the fact that restaurants with images tend to accumulate more reviews. A robust online presence, bolstered by customer-contributed images, may encourage more diners to share their experiences, resulting in increased review counts.

In essence, this underscores the significance of visual content in the digital age of dining. Restaurants that embrace imagery and provide a captivating visual narrative of their offerings tend to enjoy higher ratings and a competitive edge in attracting customers.

Figure 14: BRS distributions vs. the remaining 17 most important features.

In our analysis, we’ve examined the distributions of Balanced Rating Scores across the 20 most important features, offering valuable insights for restaurants in Boston. Here are the key takeaways:

  1. Cuisine Focus: Restaurants specializing in Italian, seafood, cocktails, salad, breakfast and brunch, new American, or wine tend to achieve higher BRS values. Conversely, establishments focusing on pizza, chicken wings, Chinese, burgers, traditional American, sandwiches, or coffee show less consistent BRS performance.
  2. Pricing Flexibility: Offering menu prices between moderate ($$) and expensive ($$$$) appears to be advantageous for success.
  3. Visual Appeal: Adding images to your Yelp page and creating an enticing visual narrative can positively impact ratings and attract diners.
  4. Convenient Services: Providing delivery and pickup options enhances a restaurant’s appeal, making it more convenient for customers.
  5. Accessibility: Adding a phone number to your Yelp page can improve accessibility and communication with potential diners.
  6. Customer-Centric Approach: While correlations exist between certain features and BRS, it’s crucial to remember that correlation does not imply causation. Ultimately, prioritize the customer experience and the quality of the food, as these factors remain paramount in achieving success.

These insights provide a valuable roadmap for both existing and prospective restaurants in the Boston area. While data can offer guidance, it’s essential to blend these insights with a commitment to exceptional customer service and culinary excellence to thrive in this competitive industry.

Source: https://www.flickr.com/photos/werkunz/4608613719

Conclusion

Using the Yelp Fusion API, we’ve uncovered valuable insights into the composition of Boston’s restaurant landscape. Our analysis reveals how specific business characteristics can influence restaurant ratings within the city. Neighborhoods have a discernible impact on the dining experience, certain cuisines stand out as favorites, and the presence of images and tags on Yelp pages proves to be invaluable.

While significant discoveries were made in this project, there remains room for further exploration and utilization of Yelp’s data.

Next Steps

It’s clear that more research is needed to gain a deeper understanding. While the business search endpoint of the Yelp Fusion API provides valuable data, it falls short in delivering fine-grained details about Yelp pages. The API constraints resulted in some neighborhoods being underrepresented, while others were overrepresented. Daily data requests led to variations in the restaurants received, causing inconsistencies in model performance.

Two potential options exist to overcome this challenge. One is web scraping, but Yelp prohibits this in its Terms of Service. The other option is to explore Yelp’s Open Dataset, which contains information on 150,346 businesses in Montreal, Calgary, Toronto, Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland.

As I continue to learn and develop my data analysis skills, I welcome any suggestions for improvement or questions from the community! Thank you!

Links:

Yelp Fusion

GitHub

Sklearn

XGBoost
