Exploratory Analysis on Housing Data in Python

Julia Wu
Web Mining [IS688, Spring 2021]
11 min read · Feb 4, 2021
Image source: https://bostonpads.com/

Hello, everyone! Since we are in an era when everything is becoming more expensive, I always try to save every penny in my pocket. However, I have a dream: to buy a house so that there is always somewhere I belong. When it comes to buying a house, it is really hard to decide which one to buy, since there are several things I have to consider, such as the location, the size, and the number of bathrooms and bedrooms for my whole family. Since it will cost me hundreds of thousands of dollars, I do not want to overpay for a house when there is a cheaper one with similar features. I probably won't be able to afford another house if I don't buy the most ideal one the first time.

How could I spend the least money on my one and only dream house without doing tons of searches?

If you also feel like you need to buy the right house as your one and only, this article will show you how to pull the attributes you want from API data into a data frame and run some exploratory analysis to get a basic understanding of it! I will guide you through the journey using RapidAPI and a Python Jupyter notebook. Let's go!

Source code is in my GitHub!

Data Source from RapidAPI

Unlike a downloaded file, extracting data through an API ensures that you always get recent data on demand. For those who are not comfortable with coding, I recommend retrieving the data we are going to manipulate from the website RapidAPI. This website hosts thousands of APIs in more than forty categories and provides code snippets for nineteen programming languages. Today we are going to retrieve data from Realtor through RapidAPI using the requests library in Python.

Step 1 Sign up for an account in RapidAPI to get an API key.

Step 2 Search API: Realtor.

Step 3 Tap ‘properties/v2/list-for-sale’.

Step 4 Tap ‘subscribe to test’ and subscribe to the free plan.

Step 5 Fill out the required parameters and optional parameters for your needs.

Step 6 Select ‘Python (Requests)’ under code snippets.

Step 7 Tap ‘copy code’ and paste it into your Jupyter notebook.

Since the retrieved objects include location attributes at several levels, as well as the numbers of bedrooms and bathrooms, I can run further analysis at different levels.

Clean Data in Python

I make a minor change to the code snippet from RapidAPI so that it makes a ‘get’ request through the API to retrieve the two hundred most recent records and returns them to me in JSON format.
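Here is a minimal sketch of that modified snippet, assuming the standard RapidAPI ‘x-rapidapi-key’/‘x-rapidapi-host’ headers; the query-string keys and values are illustrative, and ‘YOUR_API_KEY’ is a placeholder for your own key.

import requests

# Sketch of the modified RapidAPI snippet (parameter names/values are illustrative)
url = "https://realtor.p.rapidapi.com/properties/v2/list-for-sale"

querystring = {
    "city": "New York City",
    "state_code": "NY",
    "radius": "1",        # one-mile radius
    "limit": "200",       # the two hundred most recent records
    "offset": "0",
    "sort": "newest"
}

headers = {
    "x-rapidapi-key": "YOUR_API_KEY",
    "x-rapidapi-host": "realtor.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)
data = response.json()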

The response in JSON format is fairly readable. From the output, I find several useful attributes, such as the location and the numbers of bedrooms and bathrooms for each property.

To organize the data in a nicer, cleaner format, I import the ‘pandas’ library and load the data into a data frame.

import pandas as pd

To make the code reusable in case I want to query the data by different locations or property types in the future, I wrap the code in a function.

The tricky part is that the attributes ‘city’, ‘county’, ‘state code’, and ‘zipcode’ are nested inside the ‘address’ attribute. To put them in my data frame, I use a ‘for’ loop to pull them out of the ‘address’ attribute of each property and create a new attribute, ‘combined_loc’, that contains ‘city’, ‘county’, and ‘state code’. I do the same thing to pull the attribute ‘size’ out of the ‘building_size’ attribute, so the data frame now contains all the information we want to explore further.

Since JSON maps naturally onto a Python dictionary of keys and values, I can convert each row into a data frame with the method ‘DataFrame.from_dict’ and add it to a list. After all the rows are appended to the list, I concatenate the list into one data frame and reset its index. Finally, I convert the attribute ‘price’ to integer type without scientific notation, the attributes ‘beds’ and ‘baths’ to category type, and the attribute ‘last_update’ to date type, and return the data frame.
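The full function is in my GitHub repo; the sketch below only captures the logic described above. The nested JSON field names (‘state_code’, ‘postal_code’, ‘lon’, ‘lat’, ‘line’, ‘building_size’, and the top-level ‘properties’ key) are assumptions about the Realtor API response, so adjust them to match the actual output.

import requests
import pandas as pd

def getDataFromProperty(api_key, city, state, prop_type, radius):
    """Sketch of the query-and-clean function described above."""
    url = "https://realtor.p.rapidapi.com/properties/v2/list-for-sale"
    headers = {"x-rapidapi-key": api_key,
               "x-rapidapi-host": "realtor.p.rapidapi.com"}
    params = {"city": city, "state_code": state, "prop_type": prop_type,
              "radius": radius, "limit": "200", "offset": "0", "sort": "newest"}
    # Assumes the listings are nested under a 'properties' key in the response
    properties = requests.get(url, headers=headers, params=params).json()["properties"]

    rows = []
    for prop in properties:
        address = prop.get("address", {})           # nested location attributes
        row = {
            "price": prop.get("price"),
            "beds": prop.get("beds"),
            "baths": prop.get("baths"),
            "last_update": prop.get("last_update"),
            "address": address.get("line"),
            "city": address.get("city"),
            "county": address.get("county"),
            "state": address.get("state_code"),
            "zipcode": address.get("postal_code"),
            "longitude": address.get("lon"),
            "latitude": address.get("lat"),
            "size(sqft)": (prop.get("building_size") or {}).get("size"),
        }
        # Combine the nested location fields into one attribute
        row["combined_loc"] = f"{row['city']}, {row['county']}, {row['state']}"
        # Convert each row (a dict) into a one-row data frame and collect it
        rows.append(pd.DataFrame.from_dict(row, orient="index").T)

    # Concatenate the list into one data frame and reset the index
    table = pd.concat(rows).reset_index(drop=True)
    # Type conversions: price to integer (assumes every listing reports a price),
    # beds/baths to category, last_update to date, and the remaining numeric
    # columns to numeric so that describe() picks them up
    table["price"] = table["price"].astype(int)
    table["beds"] = table["beds"].astype("category")
    table["baths"] = table["baths"].astype("category")
    table["last_update"] = pd.to_datetime(table["last_update"])
    for col in ["size(sqft)", "longitude", "latitude"]:
        table[col] = pd.to_numeric(table[col])
    return table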

Hooray! Now I can call the function to get the two hundred most recent records.

Table = getDataFromProperty(api_key, city, state, prop_type, radius)
Table

Exploratory Analysis in Python

%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib as mat

sns.set(color_codes = True)

To explore the data, I also import the ‘numpy’, ‘seaborn’, and ‘matplotlib’ libraries. Before we make nice figures, there are two pandas methods that are extremely helpful for telling us what the data looks like in general. The first is ‘.info()’, which shows the number of non-null cells and the data type of each attribute. In this case, the two hundred most recent records I retrieved through the API do not contain any null values, and I have four numeric attributes, two ‘category’ attributes, one ‘date’ attribute, and six ‘string’ attributes.

Table.info()

The second is ‘.describe()’, which shows the min, max, mean, percentiles, and standard deviation for each numeric attribute. In this case, it shows all of those details for the attributes ‘price’, ‘size(sqft)’, ‘longitude’, and ‘latitude’, and I suppress the scientific notation for ease of reading.

For the attribute ‘price’, the mean is around $1,000,000 and the standard deviation is more than $850,000, which means the prices of the properties we retrieved are spread far from the average.

For the attribute ‘size(sqft)’, the mean is around 2,000 sqft and the standard deviation is slightly less than 1,200 sqft, which means the sizes of the properties we retrieved are also spread far from the average.

For the attributes ‘longitude’ and ‘latitude’, the standard deviation is very small, which means the properties are located close to each other.

I use matplotlib’s ‘FuncFormatter’ to add thousands separators in some of the figures below for ease of reading.

Table.describe().apply(lambda s: s.apply('{0:.2f}'.format))
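Here is a minimal sketch of that formatter, assuming matplotlib’s FuncFormatter is what I meant above.

from matplotlib.ticker import FuncFormatter

# Tick formatter that renders 1500000 as '1,500,000'
thousands = FuncFormatter(lambda x, pos: f'{x:,.0f}')

# Example use on a later figure:
# ax = sns.boxplot(x=Table['price'])
# ax.xaxis.set_major_formatter(thousands)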

Box plot — price

Now I draw a box plot to show the center, spread, and outliers of the price. The line inside the interquartile-range box shows the median, around $800,000. The box itself represents the middle 50% of the data, and the whiskers represent the bottom 25% and the top 25% of the values, excluding outliers. From the box plot, we find that prices higher than around $2,000,000 are marked with asterisks, which means they are identified as outliers. The box plot is right-skewed with a narrow interquartile-range box and short whiskers, so the range of prices is relatively small once we exclude the outliers.

#Boxplot - price
sns.boxplot(x = Table['price'])

Histogram — state

It simply shows that most of the for-sale properties are located in New York and only a few are in New Jersey, since I used the API to pull only the properties within a one-mile radius of New York City that meet the requirements.

#Histogram - state
Table.state.value_counts().nlargest(40).plot(kind = 'bar', figsize=(10, 5))

Scatter plot — price, baths

It shows that most of the for-sale properties within a one-mile radius of New York City have fewer than five bathrooms. Since the price ranges for different numbers of bathrooms mostly overlap, the number of bathrooms is not a critical factor in the price of most of the properties.
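A figure along these lines can be drawn with a single seaborn call (a minimal sketch; the exact styling of my original figure may differ).

# Scatter plot - price vs. number of bathrooms
sns.scatterplot(x=Table['baths'], y=Table['price'])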

Scatter plot — price, beds

It shows that most of the for-sale properties within a one-mile radius of New York City have two to five bedrooms. Since the price ranges for different numbers of bedrooms also mostly overlap, the number of bedrooms is not a critical factor in the price of most of the properties either.
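The bedroom version of the sketch only swaps in the ‘beds’ column.

# Scatter plot - price vs. number of bedrooms
sns.scatterplot(x=Table['beds'], y=Table['price'])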

Scatter plot — state, price, size(sqft)

It shows that most of the properties are smaller than 2,500 sqft and that, overall, the size of a property is positively correlated with its price. When two properties are the same size and smaller than 1,500 sqft, the one in New York is usually more expensive than the one in New Jersey. However, when two properties are the same size and larger than 1,500 sqft, the one in New York is usually cheaper than the one in New Jersey.
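A sketch of this figure, using the ‘state’ column as the hue (again, the original styling may differ).

# Scatter plot - price vs. size, colored by state
sns.scatterplot(data=Table, x='size(sqft)', y='price', hue='state')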

Scatter plot — longitude, latitude, size(sqft), price

Since the ‘get’ request I make to the API searches for all the for-sale properties within a one-mile radius of New York City, the background of this figure is centered on NYC.

It shows that the most expensive and largest property is located in the center of New York City. Only a few of the for-sale properties are located there, and some of them are either more expensive or smaller than the average.

The prices of many for-sale properties in the bottom-left corner are lower than $2,000,000, and some of those properties may be larger than average. A few properties near the bottom of the middle-right area may be cheaper than average for their size. Both areas are worth paying more attention to, since my dream house is probably out there!
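A sketch of this final figure, mapping longitude and latitude to position, square footage to marker size, and price to color.

# Scatter plot - location, sized by square footage and colored by price
sns.scatterplot(data=Table, x='longitude', y='latitude',
                size='size(sqft)', hue='price')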

Conclusion

There are three major limitations to this project, concerning the API, the data, and my exploratory analysis. First, the API allows only three hundred requests per month, and each request retrieves at most the two hundred most recent properties. Second, the data does not contain some features I also care about, such as traffic and public safety, so I have to look up that information separately for the properties I am interested in. Third, I could not find an efficient way to make my last figure dynamically show a county- or state-level map as its background, so I have to spend time comparing it with a real map.

In general, this method may narrow down my searches significantly, since it helps me decide on the number of bedrooms and bathrooms without overthinking, identify the critical size with respect to price and state, and focus on smaller areas. I have three major findings from my exploratory analysis. First, the numbers of bedrooms and bathrooms may not have a huge impact on the price of most of the for-sale properties within a one-mile radius of New York City, so I can simply consider how many bedrooms and bathrooms my family needs. Second, when comparing by state, the size of most of those properties may have an impact on their price, and 1,500 sqft may be the turning point between the two states. If I need a property larger than 1,500 sqft, I may only need to pay attention to the for-sale properties in New York; if I need one smaller than 1,500 sqft, I may only need to pay attention to the for-sale properties in New Jersey. Third, most of the for-sale properties within a mile southwest of New York City, and a few within a mile east of it, are relatively cheap per sqft, so I may pay more attention to both areas.

Additionally, I now know how to use an API to retrieve data, clean it, and run exploratory analysis on it in Python. Cheers!

