Utilizing Yelp Cost Estimates to Predict Affluence

Kelly
Published in Analytics Vidhya
5 min read · Mar 2, 2020

When you think about data science, what comes to mind? Probably something to do with numbers or data, right? Well, data science is essentially the study of data! It involves developing methods of recording, storing, and analyzing data so that useful information can be extracted effectively. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured. Let's dive into an example: can Yelp cost estimates ($, $$, $$$) determine neighborhood affluence in the Manhattan borough of New York?

Web Scraping

Web scraping is one of the most powerful tools available to Python and to data scientists. Through company API keys, we can gather data that is publicly available on the web. It's certainly easier to be handed a data file that's already nicely organized and cleaned, but that's not always the case. Using the Yelp Fusion API, we gathered data on Manhattan restaurants, including each one's name, location, Yelp cost estimate (of course), and categories.

Code for scraping Yelp data
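The original post shows this step only as a screenshot. As a stand-in, here is a minimal sketch of what such a request might look like against the Yelp Fusion API (the endpoint and response fields follow Yelp's public documentation; the API key, helper name, and zip codes are placeholders, not the project's actual code):

```python
import requests
import pandas as pd

# Placeholder credentials: substitute a real Yelp Fusion API key.
API_KEY = "YOUR_YELP_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"

def get_restaurants(zip_code, limit=50):
    """Pull one page of restaurants for a zip code from the Yelp Fusion API."""
    params = {
        "location": f"{zip_code}, Manhattan, NY",
        "categories": "restaurants",
        "limit": limit,
    }
    response = requests.get(SEARCH_URL, headers=HEADERS, params=params)
    response.raise_for_status()
    # Keep only the fields the article mentions: name, location,
    # Yelp cost estimate ("price"), and categories.
    return [
        {
            "name": b.get("name"),
            "zip_code": b.get("location", {}).get("zip_code"),
            "price": b.get("price"),  # the $, $$, $$$ cost estimate
            "categories": ", ".join(c["title"] for c in b.get("categories", [])),
        }
        for b in response.json().get("businesses", [])
    ]

# Gather a couple of zip codes into one DataFrame.
frames = [pd.DataFrame(get_restaurants(z)) for z in ["10007", "10019"]]
restaurants = pd.concat(frames, ignore_index=True)
```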

The code seems intense and confusing, right? Well, it's not! As a whole it might look intimidating (especially to someone who isn't familiar with the language), but it's actually quite simple once you read the explanations in the comments (the text following each # symbol). All in all, that giant code block lets us gather the same data shown on the familiar, user-friendly Yelp website. After some minor cleaning, we end up with the before-and-after shown below.

Before: Yelp website
After: Yelp dataset

Cleaning the Data

Now comes the fun part: the 80/20 rule comes to life! There's a common saying that data scientists spend 80% of their time on data wrangling and 20% on analysis and modeling. This project was no exception: we spent a large chunk of our time on the cleaning process. Between handling null values, imputing missing ones, combining datasets, and dummying columns, we end up (after some blood, sweat, and tears) with a data frame of features and values that speak directly to our problem statement and classification model.

Final result!
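As a rough illustration, the cleaning steps above might look something like this in pandas. This sketch assumes the `restaurants` frame from the scraping example; the column names are illustrative, not the project's actual schema:

```python
import pandas as pd

# Assumes the `restaurants` DataFrame from the scraping sketch above.
df = restaurants.copy()

# Drop rows missing a zip code, since we can't place them in a neighborhood.
df = df.dropna(subset=["zip_code"])

# Impute missing Yelp cost estimates with the most common tier.
df["price"] = df["price"].fillna(df["price"].mode()[0])

# One-hot encode ("dummy") the cost estimate so models can use it.
df = pd.get_dummies(df, columns=["price"], prefix="cost")

# Aggregate to one row per zip code: a count of restaurants in each tier.
by_zip = df.groupby("zip_code").sum(numeric_only=True).reset_index()
```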

Next, we can move on to the real fun part: visuals! Everyone loves a pretty chart that's easier on the eyes than a wall of numbers. With some tweaking and playing around in Seaborn and Matplotlib, we can develop some fancy graphs that highlight standout zip codes.

Visualizations

With a few lines of code, we end up with the following visualizations. As shown by the orange and blue bars in Chart 1, there is a trend across Manhattan: cheaper restaurants are more popular than more expensive ones. Lower-income areas show this trend sharply, while higher-income areas show it less drastically. However, Chart 2 reveals that some zip codes don't follow the trend at all. As the green bars indicate, those zip codes display practically the reverse of the pattern in Chart 1.

Chart 1 — Yelp Cost Estimates: Highest Versus Lowest Incomes
Chart 2 — Yelp Cost Estimates: Standout Zip Codes
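Charts like these can be recreated with a few lines of Seaborn and Matplotlib. A sketch, again assuming the `restaurants` frame from earlier (the grouping and labels are illustrative, not the post's exact plotting code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count restaurants per (zip code, cost tier) and plot tiers side by side.
plot_df = (
    restaurants.dropna(subset=["price"])
               .groupby(["zip_code", "price"])
               .size()
               .reset_index(name="count")
)

sns.barplot(data=plot_df, x="zip_code", y="count", hue="price")
plt.title("Yelp Cost Estimates by Zip Code")
plt.xlabel("Zip code")
plt.ylabel("Restaurant count")
plt.tight_layout()
plt.show()
```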

There is one key component that isn't as concrete as the data we can scrape from a website or easily Google and attribute to a credible author or organization. For this project, that key component is affluence. Affluence is a measure of wealth, and what counts as wealthy is subjective. So we settled on something that can be agreed upon as a reasonable proxy for affluence: income. Using income data from Statistical Atlas, we mapped each income range to a bin numbered 1–6.

Using income ranges to determine bins of affluence.
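In pandas, that binning might look like the sketch below. The cut points and income figures are hypothetical; the project's actual ranges came from Statistical Atlas:

```python
import pandas as pd

# Hypothetical income figures for illustration only.
income = pd.DataFrame({
    "zip_code": ["10007", "10019", "10027"],
    "median_income": [250_000, 90_000, 45_000],
})

# Six bins, numbered 1 (least affluent) to 6 (most affluent).
# These cut points are illustrative, not the ones used in the project.
cut_points = [0, 30_000, 60_000, 90_000, 120_000, 200_000, float("inf")]
income["affluence"] = pd.cut(
    income["median_income"], bins=cut_points, labels=[1, 2, 3, 4, 5, 6]
)
```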

Modeling

Finally, we can move on to the 20% part of our role as data scientists. This is the exploration phase: running numerous models to determine the best parameters and the best model to draw conclusions with. To sum up, our best model was a Voting Classifier built from four other models: Decision Tree, Random Forest, Adaptive Boost, and Gradient Boost. We ended up with the following scores:

Training Score: 100%

Testing Score: 71.43%
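For reference, a minimal sketch of such an ensemble in scikit-learn. The hyperparameters are left at their defaults rather than the project's tuned values, and the toy `X`/`y` below stand in for the real zip-code features and affluence labels:

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins: ~40 Manhattan zip codes with a few features each,
# labeled with affluence bins 1-6. The real features were the cleaned
# zip-code-level restaurant counts.
rng = np.random.default_rng(42)
X = rng.random((40, 4))
y = rng.integers(1, 7, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard-voting ensemble over the four models named above, at default settings.
voter = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier()),
        ("rf", RandomForestClassifier()),
        ("ada", AdaBoostClassifier()),
        ("gb", GradientBoostingClassifier()),
    ]
)
voter.fit(X_train, y_train)

print(f"Training score: {voter.score(X_train, y_train):.2%}")
print(f"Testing score:  {voter.score(X_test, y_test):.2%}")
```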

Yes, our model is very overfit, and one major limitation is our sample size: we only analyzed the Manhattan borough, which consists of about 40 zip codes. Upon closer analysis of our predictions, we noticed some standout zip codes.

Affluence vs. $$$$ Restaurant Counts

Two zip codes stand out: 10007 and 10019. From our data analysis, the likely reasons for the misclassification of affluence are that 10007 is the highest-income neighborhood yet has the fewest restaurants overall, and that 10019 sits on the cusp between income bins while having the highest count of expensive restaurants. Overall, our model can use Yelp cost estimates to estimate a neighborhood's affluence. It isn't perfect, as there may be other underlying factors that affect affluence more than the Yelp cost estimates of its restaurants.

Conclusions

While our model isn't perfect, it could be improved by expanding our sample data beyond Manhattan. We could also explore more features to better predict affluence, such as real estate prices or population density by neighborhood. In short, data science is a powerful field that lets us explore relationships that may not seem obvious at first. The tools available through programming languages like Python also let us handle every aspect of a project without having to jump from one program to another.
