K-Means Clustering on neighborhoods and real estate: Minneapolis Edition

Sep 9, 2020

Introduction

This is (hopefully) the first of many blog posts I will be writing on Medium!

For the last few weeks, I have been working on a machine learning project as part of IBM’s Professional Data Science Certificate on Coursera. In this post, I will summarize my report on how I used the K-Means clustering method to understand the city of Minneapolis in terms of its neighborhoods, the venues in each neighborhood, Zillow’s Home Value Index (HVI), data on individual homes (number of bedrooms, bathrooms, year built, sale price, etc.), and the venues near each home.

This post is for:

Buyers looking for a new home in Minneapolis
Property investors, realtors, and agents
Data scientists looking for ideas on how to get, scrape, and use real estate data
The curious ones

Data

I used a Python library called BeautifulSoup to scrape the most up-to-date list of neighborhoods in Minneapolis.
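As a rough illustration of the scraping step, here is a minimal sketch of pulling list items out of an HTML page. The post used BeautifulSoup; this sketch swaps in Python's built-in html.parser so it runs without extra installs. The HTML snippet, the ListItemParser class, and the extract_neighborhoods helper are all made up for the example and are not the project's code.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a page that lists neighborhoods.
SAMPLE_HTML = """
<ul id="neighborhoods">
  <li><a href="/wiki/Kenwood">Kenwood</a></li>
  <li><a href="/wiki/East_Isles">East Isles</a></li>
  <li><a href="/wiki/Lowry_Hill">Lowry Hill</a></li>
</ul>
"""


class ListItemParser(HTMLParser):
    """Collect the visible text of every <li> element (no nested lists)."""

    def __init__(self):
        super().__init__()
        self.in_item = False   # True while we are inside an <li>
        self.buffer = []       # text fragments of the current <li>
        self.items = []        # finished list items

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_item:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self.in_item:
            self.in_item = False
            text = "".join(self.buffer).strip()
            if text:
                self.items.append(text)


def extract_neighborhoods(html):
    """Return the text of each list item found in the HTML string."""
    parser = ListItemParser()
    parser.feed(html)
    return parser.items


neighborhoods = extract_neighborhoods(SAMPLE_HTML)
print(neighborhoods)
```

On a real page, the HTML string would come from an HTTP request, and the tag selection would need to match that page's actual markup.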

To cluster the neighborhoods on real estate information, such as each neighborhood’s median home price and the estimated change in home price over five and ten years, I retrieved Minneapolis home prices and values by neighborhood from Zillow.

To look more closely at individual homes in Minneapolis, I used a web scraping platform called Apify to get data on each home within a neighborhood, such as the number of bedrooms, number of bathrooms, home size, year built, and more.

I also used the Foursquare API to retrieve the venues in each neighborhood. I used this API again to get the venues near each home, which gave me a better idea of how homes within a neighborhood differ from each other.

To create choropleth maps, I found a GeoJSON file of Minneapolis neighborhoods on GitHub, which I later edited to match the neighborhood names on Wikipedia.

Methodology

Part A

Here are the first five rows of the neighborhood data for Minneapolis, which give a general sense of the homes in each neighborhood and the estimated future home values.

Minneapolis Real Estate Market Overview by neighborhood

Using the Foursquare API, I got a total of 220 unique venue categories across 63 neighborhoods, including school, museum, bar, restaurant, shopping mall, park, and many more.

Top 10 common venues for each neighborhood
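A table like this can be built by counting venue categories per neighborhood and keeping the most frequent ones. The sketch below assumes the venue data has already been flattened into (neighborhood, category) pairs; the sample pairs and the top_venues helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs, one per venue returned.
venue_pairs = [
    ("Kenwood", "Park"), ("Kenwood", "Park"), ("Kenwood", "Coffee Shop"),
    ("East Isles", "Bar"), ("East Isles", "Restaurant"),
    ("East Isles", "Restaurant"), ("East Isles", "Restaurant"),
    ("East Isles", "Bar"),
]


def top_venues(pairs, n=10):
    """Map each neighborhood to its n most common venue categories."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        neighborhood: [category for category, _ in counter.most_common(n)]
        for neighborhood, counter in counts.items()
    }


for neighborhood, categories in top_venues(venue_pairs, n=10).items():
    print(neighborhood, categories)
```

The same per-neighborhood counts, divided by the number of venues in the neighborhood, could also serve as numeric features for clustering.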

I merged the venue data for each neighborhood with the real estate data to cluster the neighborhoods. Using the Elbow method, I found that k = 5 is the optimal value of k for the K-Means clustering algorithm.

Elbow Method to find the optimal k to cluster neighborhoods
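For readers new to the Elbow method: the idea is to run K-Means for several values of k, record the within-cluster sum of squared distances (the inertia), and look for the k after which the curve flattens. Below is a dependency-free sketch on made-up 2-D points with three obvious groups. It uses a deterministic farthest-point initialization, which is an assumption made here to keep the example reproducible; in practice a library such as scikit-learn (whose KMeans exposes the same quantity as inertia_) would be the usual choice.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_inertia(points, k, iterations=50):
    """Run Lloyd's algorithm and return the final inertia."""
    centers = farthest_point_init(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Made-up data: three tight groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

On this toy data the inertia drops sharply up to k = 3 and barely changes afterwards, which is the "elbow". Real data rarely gives such a clean bend, so the choice of k usually involves some judgment.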

The table below shows the average values for each cluster, ordered by the HVI column.

The final clusters of the neighborhoods are shown in the picture below. Because of the unfortunate default colors, here are the clusters with their colors (in the same order as the table above):

Cluster 4: Light green (upper left and center)
Cluster 1: Purple
Cluster 3: Turquoise (middle left)
Cluster 0: Red
Cluster 2: Blue (only one neighborhood: Kenwood, Minneapolis)

Clusters of neighborhoods in Minneapolis

Part B

From Apify, I retrieved a total of over 800 homes from all the neighborhoods in Minneapolis.

Home Data by Street address (street column is unique for each home)

I repeated the same steps for the homes: I found the nearby venues for each home, found the optimal k (k = 6), and clustered the homes on bedrooms, bathrooms, sqft (living area), price (asking price of the real estate), year built, and the nearby venues of each home. The table below shows how the 6 clusters differ by number of bathrooms, number of bedrooms, living space, house sale price, and year built. Note that living area (sqft) increases with price. In general, newer homes (by year built) seem to be more expensive than older ones. Then again, more analysis is needed to decide whether these observations are noteworthy.

The seemingly positive correlation between the sale price and living area is confirmed in the picture below (correlation: 0.81).

Relationship between living area and house sale price
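The correlation quoted above is the kind of number a Pearson correlation coefficient gives. A minimal sketch of that calculation follows; the sqft and price values are made up for illustration and are not the project's data, and the pearson helper is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up living areas (sqft) and prices (USD).
sqft = [900, 1200, 1500, 2100, 3000]
price = [180_000, 240_000, 310_000, 400_000, 650_000]

r = pearson(sqft, price)
print(round(r, 2))
```

A value close to 1 means the two columns rise together almost linearly; it says nothing about which one drives the other.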

The map below displays the clusters of homes on top of a choropleth map: the darker the shade of red of a neighborhood, the higher the number of venues in it.

Again, for clarification, the clusters are colored as follows (in the same order as the table above):

Cluster 0: Red
Cluster 4: Light green
Cluster 2: Blue
Cluster 1: Purple
Cluster 5: Orange
Cluster 3: Turquoise (only two homes)

It is important to note that the data used for clustering the homes is not standardized. As seen on the map, many homes belong to Cluster 0 (red, 501 homes). Cluster 4 has 181 homes, Cluster 2 has 81 homes, Cluster 1 has 24 homes, Cluster 5 has 14 homes, and Cluster 3 has 2 homes. Coincidentally, the order of the number of homes in each cluster matches the order of the clusters’ average home sale price. We definitely want more data (and to standardize it) to better understand whether the homes are grouped optimally.
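To see why standardization matters here: K-Means relies on Euclidean distance, so a column measured in hundreds of thousands (price) can swamp a column measured in single digits (bedrooms). Below is a minimal sketch of z-score standardization, assuming each home is a row of (bedrooms, sqft, price). The numbers and the standardize helper are made up; scikit-learn's StandardScaler would do the same job on real data.

```python
from math import dist
from statistics import mean, pstdev

# Made-up homes: (bedrooms, sqft, price in USD).
homes = [
    (2, 900, 180_000),
    (3, 1_500, 310_000),
    (3, 1_200, 240_000),
    (5, 3_000, 650_000),
]


def standardize(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stdevs = [pstdev(column) for column in columns]
    return [
        tuple((value - m) / s if s else 0.0
              for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


scaled = standardize(homes)

# Distance between the first two homes, before and after scaling.
raw_gap = dist(homes[0], homes[1])
scaled_gap = dist(scaled[0], scaled[1])
print(round(raw_gap, 1), round(scaled_gap, 3))
```

Before scaling, the distance between two homes is almost entirely the price difference. After scaling, bedrooms, living area, and price each contribute on a comparable footing.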

Since Cluster 3 is hard to notice because of the (again) unfortunate colors, here’s a table showing the two homes in the neighborhoods East Isles and Lowry Hill.

Cluster 3 (the cluster with the highest average home sale price)

Discussion

This project can be taken even further by finding more data on homes. Some parameters that many people consider when buying a home (that are found to affect the property value significantly) are usable space, upgrades, and local market among others. The program I wrote to scrape data such as commute and walk scores from the Zillow website had some web crawling issues. There might be APIs that also provide the year that a home was renovated, condition of the home, view/commute/walk scores — all of which are important factors to consider for buyers and agents alike.

It would also be a natural next step to predict home prices using regression analysis. I would try to find more data for each cluster. Currently, the number of homes varies widely between clusters. It might also help to standardize the dataset used for clustering.

It might have been easier to do this analysis on bigger cities with more available data. However, Minnesota has now become a second home for me and Minneapolis is the biggest I could get. Moreover, Minneapolis is the second most densely populated city in the Midwest region behind Chicago. The city, along with St. Paul, makes up the ‘Twin Cities.’

It would have been interesting to look into the Twin Cities in general as well :)

Another thought I had for this project was to use Natural Language Processing tools to look at the descriptions of each home. I like word clouds for readability. Since this step is only a few lines of code, stay tuned to see it eventually on my GitHub.

Conclusion

If you found this article, I probably asked you to check it out, or Medium suggested it to you (in that case, #DataScience #MachineLearning #InsertPopularHashtags), or you looked up homes to buy in Minneapolis (and Google recommended this to you! whaaaa). Either way, I hope this article gave you an idea of what you are looking for in a home, which neighborhoods you like the most based on different factors, or which home cluster you want to belong to in the future. The code I wrote can be adapted to fit any city of your choice, so get coding!

Learn more about this project on my GitHub account.