Data Science Project — The Hidden Tastes of US Zip Codes

Ariel · Published in The Startup · Dec 18, 2020 · 10 min read

As most of us found ourselves at home during the COVID-19 crisis, I decided to use the time to learn a subject I had always wanted to explore but never found the time for.

I joined a course on Coursera that teaches both Python and some machine learning. As the final part of this course, our task was to choose a project that uses location data and research an area based on that data.

I’m passionate about food, so why not combine the two and try to find some patterns in the locations of restaurants?

The project explores different US states by zip code. For each zip code, I used the Foursquare API to retrieve information about the restaurants around it, hoping to identify differences in the taste of the population between states and zip codes based on the types of restaurants.

I wanted to add some economic information on top of that but was unable to find any up-to-date data source by zip code. So I took a different approach and used house values by zip code instead; that data is available from Zillow.

Using house values instead of income, I’ll try to identify any correlation between the price of a house and the kinds of restaurants around it. Maybe the kind has no correlation but the number does; both things can be tested with this data set.

This will be a three-part post. This part is a summary of the work; in the next two posts I will explain the work in detail with code examples.

Data:

I found a list of all US zip codes with their geolocation data on the OpenDataSoft website.

The zip code data includes US territories that are not part of the 50 states; I removed them for the purpose of this research.

I used the Zillow research section to find a list of house values by zip code. The data includes prices dating back to 1998; I decided to use the latest values.

Finally, the restaurant data. In the course we had to use the Foursquare API, and since it has a limit on the number of requests, I decided to limit the number of states I would work with, eventually choosing the following: CA, FL, NY, MD, CO, MA, WA, IN, SC, GA, MS, MO.
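To make the retrieval step concrete, here is a minimal sketch of how one zip-code centroid can be queried for nearby food venues through the Foursquare v2 `venues/explore` endpoint. The credentials are placeholders, the category id is Foursquare’s top-level "Food" category, and the helper names (`build_explore_params`, `fetch_restaurants`) are my own, not from the original project:

```python
import requests

FOOD_CATEGORY = "4d4b7105d754a06374d81259"  # Foursquare's top-level "Food" category

def build_explore_params(lat, lng, client_id, client_secret,
                         radius=1000, limit=50):
    """Assemble the query parameters for one venues/explore request."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": "20201218",           # API version date
        "ll": f"{lat},{lng}",      # zip-code centroid
        "radius": radius,          # metres around the centroid
        "limit": limit,            # Foursquare caps this at 50 per call
        "categoryId": FOOD_CATEGORY,
    }

def fetch_restaurants(lat, lng, client_id, client_secret):
    """Return (name, category) pairs for venues near one centroid."""
    params = build_explore_params(lat, lng, client_id, client_secret)
    resp = requests.get("https://api.foursquare.com/v2/venues/explore",
                        params=params)
    resp.raise_for_status()
    items = resp.json()["response"]["groups"][0]["items"]
    return [(i["venue"]["name"],
             i["venue"]["categories"][0]["name"]
             if i["venue"]["categories"] else None)
            for i in items]
```

Looping this over every zip-code centroid, with a short sleep between calls, is how the rate limit forces the state list to stay small.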

A quick summary of the final data:

150,000 restaurants across those 12 states, with 184 different categories; the categories are based on what the API returned.

Unfortunately, the tagging of those restaurants into categories is not the best; some of the categories were too general, like Food or Restaurant. I replaced some of them by looking through the data for other locations with the same name, eventually dropping almost 8,000 data points since I had no way to recover the missing categories.
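The same-name replacement step can be sketched roughly like this, assuming a pandas DataFrame with `name` and `category` columns (the column names and the `repair_categories` helper are illustrative, not the project’s actual code):

```python
import pandas as pd

GENERIC = {"Food", "Restaurant"}  # labels too vague to keep

def repair_categories(df):
    """Replace generic labels with the most common specific category
    used by other venues of the same name; drop what can't be fixed."""
    specific = df[~df["category"].isin(GENERIC)]
    best = specific.groupby("name")["category"].agg(lambda s: s.mode().iloc[0])
    fixed = df.copy()
    mask = fixed["category"].isin(GENERIC)
    fixed.loc[mask, "category"] = fixed.loc[mask, "name"].map(best)
    return fixed.dropna(subset=["category"])

df = pd.DataFrame({
    "name": ["Joe's Pizza", "Joe's Pizza", "Mystery Diner"],
    "category": ["Pizza Place", "Food", "Restaurant"],
})
print(repair_categories(df))
```

In this toy frame the second Joe’s Pizza inherits "Pizza Place", while Mystery Diner has no specific sibling and is dropped, mirroring the ~8,000 unrecoverable points.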

As a quick sanity check, I generated a map with all the restaurants clustered by state (only the states I chose to work with):

How does the data look?

Looking at the housing prices and plotting a price distribution for each state, we can see how prices vary; here is an example of the prices in NY and AK.
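A per-state histogram like the one shown can be produced with matplotlib along these lines; the log-normal toy data below merely stands in for the real Zillow frame, which has one latest value per zip code plus a state column:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just writes the file
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the Zillow values per state
rng = np.random.default_rng(0)
prices = {"NY": rng.lognormal(13.0, 0.5, 500),
          "AK": rng.lognormal(12.3, 0.3, 200)}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (state, vals) in zip(axes, prices.items()):
    ax.hist(vals / 1000, bins=30)           # plot in thousands of dollars
    ax.set_title(f"House values in {state}")
    ax.set_xlabel("Price (thousands of $)")
    ax.set_ylabel("Zip codes")
fig.tight_layout()
fig.savefig("price_distributions.png")
```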

Another exploration step was to find the most frequent categories among the different zip codes and states.

The data shows that the most frequent type of restaurant is an American Restaurant, which is probably not a big surprise since we are looking at US data. But the second most frequent is a Pizza Place (who doesn’t love a hot slice of pizza? :)

Wing places share second place with pizza, which probably makes pizza one of the most popular food types across the states.

I plotted the same idea segmented by state, revealing the most common places in each one. Here is the top place in each state:

  • California - Mexican Restaurant
  • Colorado - Mexican Restaurant
  • Florida - Pizza Place
  • Georgia - American Restaurant
  • Indiana - Pizza Place
  • Massachusetts - Pizza Place
  • Maryland - Pizza Place
  • Missouri - Pizza Place
  • Mississippi - Fast Food
  • New Jersey - Pizza Place
  • New York - Pizza Place
  • South Carolina - American Restaurant

The image below is a summary of the 15 most common places by state, in percentages (it won’t sum to 100% since these are just 15 categories out of 184):
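A table like that is a short groupby away, assuming the cleaned venue frame has `state` and `category` columns (the tiny frame below is only for illustration):

```python
import pandas as pd

venues = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY", "NY"],
    "category": ["Mexican Restaurant", "Mexican Restaurant", "Pizza Place",
                 "Pizza Place", "Pizza Place", "American Restaurant"],
})

# Share of each category within a state, in percent
shares = (venues.groupby("state")["category"]
                .value_counts(normalize=True)
                .mul(100)
                .rename("pct"))

# value_counts sorts within each state, so head(1) is the top category
top_per_state = shares.groupby("state").head(1)
print(top_per_state)
```

Taking `head(15)` instead of `head(1)` and unstacking gives the 15-category percentage summary per state.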

Evaluating the data

The evaluation was done in two parts. In the first part, I tried to cluster the different zip codes into segments based on the restaurant categories.

During the process I used two clustering algorithms:

K-means — a simple but useful clustering algorithm that can cluster unlabelled data. Its drawback is that we have to specify the number of clusters in advance.

DBSCAN — a clustering algorithm that works without us specifying the number of clusters we want. The algorithm calculates the distance between data points and segments them into clusters based on the density of points within that distance.
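Both clusterers are available in scikit-learn; here is a minimal sketch on a stand-in feature matrix (rows would be zip codes, columns the per-category shares; the two synthetic blobs and all parameter values are mine, not the project’s):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic groups of "zip codes" with 5 features each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.05, (40, 5)),
               rng.normal(0.5, 0.05, (40, 5))])
X = StandardScaler().fit_transform(X)

# K-means needs the cluster count up front
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the count; eps is the neighbourhood radius,
# min_samples the density threshold, and -1 marks noise points
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print("k-means clusters:", np.unique(km_labels))
print("DBSCAN clusters :", np.unique(db_labels))
```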

I ran both of them on a few variations of the data set:

  1. The full list of features, all 183 categories.
  2. A PCA-selected list of features. The algorithm found that 44 features can explain 95% of the variance, so I checked whether this makes a good predictor.
    * PCA is a dimensionality-reduction algorithm; it calculates what share of the variance each component explains, and we can use this to reduce the number of features the algorithm requires in order to predict the data.
  3. Non-PCA — since the idea is to segment zip codes by different preferences, I assumed that using the variables that are not the most explanatory would yield a better clustering result.
  4. Removing common categories. In all the feature sets above we had a bias toward a small number of leading categories (Fast Food, Pizza, American Restaurant); they are very common in the data set, and the clusters were almost entirely built around them. I assumed that removing them might give a better result.
  5. Clustering by cities — the result by zip codes wasn’t good enough. The data set also includes the name of the city each zip code is located in, so another attempt was to cluster the cities by food categories; this actually led to the worst results.
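The PCA step in option 2 can be sketched like this; scikit-learn accepts a fraction as `n_components` and keeps just enough components to reach it (the 3-factor synthetic matrix is only a stand-in for the real category shares):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 noisy features driven by only 3 underlying directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + rng.normal(0, 0.05, (300, 20))

# A float n_components asks for the smallest k reaching that variance share
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], "components explain",
      round(pca.explained_variance_ratio_.sum(), 3), "of the variance")
```

On the real 183-category matrix the same call is what yields the 44 components mentioned above.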

To evaluate the success of each iteration I checked two things: the silhouette coefficient of the result, and how the data is spread among the clusters, by examining the top categories in each cluster.

**The silhouette coefficient measures how far apart the different clusters are from each other. The score ranges from -1 to 1; the closer it gets to 1, the better the result. A score below zero and closer to -1 means something is wrong. (A great post on Towards Data Science explains it in more detail.)
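scikit-learn computes it directly with `silhouette_score`; a toy comparison shows how a sensible cluster count scores much closer to 1 than an oversplit one (the two-blob data is illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two clean blobs: k=2 should fit them well, k=5 should oversplit them
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.3, (50, 2)),
               rng.normal(2, 0.3, (50, 2))])

scores = {}
for k in (2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")
```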

A summary of those data sets and the algorithms’ results:

Here is an example of the clustering with non-PCA features; I decided to show this one since it received the best coefficient result. Even in this case it doesn’t provide good information about the differences between zip codes.

The food venue data turned out not to be significant for clustering zip codes.

A relationship between restaurant categories and house values

I did this step only after evaluating the different clustering options explained above, hoping to have clustered data that could be used during this step as well. Unfortunately, the clustering wasn’t helpful, so this is a stand-alone analysis.

A basic linear regression is a great start to understand what we are looking at. Regressing on the full list of features, the data showed no significant relationship; the linear regression R² was 0.25.

Maybe a feature-selection algorithm can help. I used two selection algorithms from the Python scikit-learn package, mutual_info_regression and f_regression. We could use just one of them, but I decided to check which features would be selected by each; since most of them overlapped, I combined the results of both algorithms.
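Combining the two selectors can look like this: take the union of the top-k features each one picks. The synthetic data (only columns 0 and 4 actually drive the target) and the `top_k` helper are illustrative, not the project’s code:

```python
import numpy as np
from sklearn.feature_selection import (SelectKBest, f_regression,
                                       mutual_info_regression)

# Synthetic data where only features 0 and 4 matter
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(0, 0.5, 400)

def top_k(score_func, k=3):
    """Indices of the k best features according to one scoring function."""
    selector = SelectKBest(score_func, k=k).fit(X, y)
    return set(np.flatnonzero(selector.get_support()))

# Union of both selectors' picks, mirroring the combine-the-results idea
selected = top_k(mutual_info_regression) | top_k(f_regression)
print("selected feature indices:", sorted(selected))
```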

After selecting 21 categories out of the total list, the algorithm’s performance worsens slightly; our new R² is 0.239. However, in percentage terms it worsens by only 4.4% while we are using just 11% of the total number of categories.

Ridge / Lasso / Gradient Boosting Regression

To try to solve the problem I decided to experiment with these three regression algorithms. Let me explain them briefly:

Ridge — adds a penalty term to the loss function while trying to minimize it. The penalty is the sum of the squared coefficients, multiplied by a value called “lambda”; changing that value controls how much noise is captured by the model.

Lasso — very similar to the Ridge algorithm, but with a different penalty (the sum of absolute coefficients). This algorithm can be useful for feature selection, since it shrinks the coefficients of features found to be unimportant down to zero.

Gradient Boosting — boosting is a machine learning technique that can be used in both classification and regression. It builds an ensemble of weak learners step by step, each fitted to reduce the loss of the previous ones; in this way, even features that seem weak can become significant if they help minimize the loss function.
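The three-model comparison can be sketched with scikit-learn like this, on stand-in data (rows as zip codes, features as category counts, target as house value in thousands; the alpha values and synthetic coefficients are mine):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: category counts per zip code, price driven by two of them
rng = np.random.default_rng(4)
X = rng.poisson(3, size=(300, 15)).astype(float)
y = 20 * X[:, 2] + 10 * X[:, 7] + rng.normal(0, 30, 300)

models = {
    "ridge": Ridge(alpha=1.0),   # alpha plays the role of "lambda"
    "lasso": Lasso(alpha=1.0),
    "gbr": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {results[name]:.3f}")
```

Swapping `scoring="r2"` for `"neg_mean_squared_error"` gives negative cross-validation scores like the -143,274.92 reported below.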

None of these regressions was able to find a significant relationship between restaurant type and price.

Gradient boosting had the best result compared to the other algorithms, but it is still not significant: an R² of 0.28 and a cross-validation score of -143,274.92, which is just 32,814 better than the Lasso model, which ended with the worst result.

A summary of the different regressions:

  • The house prices were divided by 1,000, so the cross-validation and MSE results are scaled accordingly.

A quick look at the coefficients shows that some of them may actually have an impact on the price of a house, like a coffee shop or a juice bar nearby. But since the scores of the regressions are low, I won’t count that as a significant impact.

My last question was: can the price of a house be correlated with the number of restaurants in the same zip code? To answer it, I summed up the number of restaurants in each zip code and ran a regression against the house prices.

The result is even less satisfying than trying to correlate the categories with the price: an R² of 0.1 for the linear regression and 0.11 for a polynomial regression.
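That last check fits in a few lines; the weak synthetic relationship below (price barely driven by the count) is only a stand-in for the real counts-vs-Zillow data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in: restaurant count per zip code, weakly linked to price
rng = np.random.default_rng(5)
counts = rng.integers(1, 120, size=(500, 1)).astype(float)
prices = 400 + 0.8 * counts[:, 0] + rng.normal(0, 80, 500)

linear = LinearRegression().fit(counts, prices)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(counts, prices)

print("linear R^2  :", round(linear.score(counts, prices), 3))
print("degree-2 R^2:", round(poly.score(counts, prices), 3))
```

The degree-2 fit can only match or slightly beat the linear one on the same data, which is exactly the 0.1 vs 0.11 pattern seen in the real numbers.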

Conclusion

The research tried to answer the question of whether we can identify different food tastes across zip codes in the US, and, while doing so, whether we can find any correlation between the food venue categories and the price of a house in the same zip code.

Clustering

The research couldn’t identify any good way to partition the zip codes by restaurant types. It found that American and Fast Food restaurants lead in almost all zip codes, but this doesn’t help us identify different preferences.

Regression

Trying to understand whether the price of a house can depend on the types of restaurants around it seems more promising. We managed to find a few coefficients that, in our data, have some explanatory power over the price of a house. But all regression scores were low, and for that reason I will not treat this result as a significant indicator.

Future work

In future work, the more common house features, like size, number of bedrooms, condition, etc., can be combined with the top features found in this research. By combining them, we can try to answer whether those features actually help predict a price alongside the usual predictors. Only by doing so, I believe, can we understand whether this is really a significant factor or just a bias of this stand-alone research.

As for the restaurant clustering, a similar test can be done across different countries, to remove the home bias toward national preferences that we see within a single country. Another thing that can be tested is reducing the number of categories by merging them into more high-level ones, like Asian for all kinds of Asian food, street food, etc.

I’m open to hearing any recommendations or improvements to the idea.

In the next two posts, I’ll walk through the process of working on this project.

If anyone is interested, a link to the GitHub repository with the notebooks.
