Time Series Analysis on Zillow’s Housing Data

Fernando Aguilar
7 min read · Jul 15, 2019


Introduction

In this project, I will perform time series analysis using Zillow’s historical median house prices for the US. The aim of the project is to identify the best zip codes to invest in for a national real estate developer looking for a place to develop a new apartment building.

The Data

Housing data in long format; 14,723 unique zip codes.

The data contains descriptive variables indicating the zip code, city, county, state, and Region ID. It also contains an ordinal variable, SizeRank, which, after some data exploration, I assume ranks the zip codes by urbanization: more urbanized zip codes have a higher rank. The data also provides median house prices for the period from April 1996 to April 2018. These are the values I will use for the time series analysis. Before jumping straight into modeling and forecasting, there are prerequisites the zip codes need to meet to be considered for investment by the real estate company.

Zip Code Selection

The real estate company has the following preferences:

  • Urbanization: the zip code should be in the top 20% according to the SizeRank variable.
Filter top 20% by SizeRank
  • Median house price: the zip code’s average house price should be between 1.5 deciles below and 1 decile above the zip codes’ 1-year median value. I chose a 1-year window to be realistic about the prices the real estate company will most likely face in the near future. This also reflects the purchasing power of the target market.
Get zip codes within certain percentiles of 1-year average values
  • Risk: the real estate company has a risk-averse investment strategy. Hence, the zip codes have to be selected with a coefficient of variation below the 60th percentile.
  • Returns: once I had filtered the zip codes according to the company’s preferences, I chose the top 5 zip codes with the highest ROI.
Calculate COV and ROI and rank zip codes
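The four selection steps above could be sketched roughly as follows. This is a minimal sketch, not the exact gist from the project: the frame `df` and its column names (`RegionName`, `SizeRank`, `value`) are assumptions about the long-format Zillow layout, and the price-band criterion is interpreted as deciles of the distribution of 1-year average prices.

```python
import pandas as pd


def select_zip_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the four filters; assumes a long-format frame with columns
    RegionName (zip code), SizeRank, and value (monthly median price),
    sorted chronologically within each zip code."""
    # 1. Urbanization: keep the top 20% by SizeRank.
    urban = df[df["SizeRank"] <= df["SizeRank"].quantile(0.20)]

    # Per-zip statistics over the last 12 months of prices.
    last_year = urban.groupby("RegionName").tail(12)
    stats = last_year.groupby("RegionName")["value"].agg(["mean", "std"])

    # 2. Price band: keep zips whose 1-year average sits between 1.5 deciles
    #    below and 1 decile above the median of all 1-year averages.
    lo = stats["mean"].quantile(0.50 - 0.15)
    hi = stats["mean"].quantile(0.50 + 0.10)
    stats = stats[stats["mean"].between(lo, hi)].copy()

    # 3. Risk: coefficient of variation below the 60th percentile.
    stats["cov"] = stats["std"] / stats["mean"]
    stats = stats[stats["cov"] < stats["cov"].quantile(0.60)].copy()

    # 4. Returns: rank the survivors by historical ROI (last / first - 1).
    first = df.groupby("RegionName")["value"].first()
    last = df.groupby("RegionName")["value"].last()
    stats["roi"] = last / first - 1
    return stats.sort_values("roi", ascending=False).head(5)
```

The exact decile arithmetic in step 2 is one plausible reading of the criterion; the original gist may have implemented it differently.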

After filtering and ranking the zip codes that met all of the criteria required by the real estate company, I ended up with the following 5 zip codes:

Top 5 zip codes by ROI

The zip codes pertain to the following locations:

Zipcode : 70808 
Location: Baton Rouge, LA

Zipcode : 29461
Location: Moncks Corner, SC

Zipcode : 3820
Location: Dover, NH

Zipcode : 52722
Location: Bettendorf, IA

Zipcode : 70809
Location: Baton Rouge, LA

Now that I have 5 potential zip codes for the company to develop its new apartment building complex, I will proceed to perform time series analysis to rank these zip codes according to forecasted 10-year returns.

Time Series Analysis

Now let’s plot the time series and visually inspect the data.

for i in range(5):
    dfs_ts[i].value.plot(label=dfs_ts[i].RegionName[0], figsize=(15, 6))
plt.legend()
plt.show()
Time series plot of the 5 selected zip codes

Median house prices have a positive trend and are very likely not stationary, since next-period prices rely heavily on the previous period’s price. Hence, I will calculate monthly returns and perform the model fit and forecast on those returns. Returns are more likely to be stationary and usually have a constant mean of zero. They are also a better feature than prices for ranking and comparing zip codes. I will proceed to calculate and plot the monthly returns for the 5 zip codes.

Calculate and plot historical monthly returns
Monthly returns for zip code 70808
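Assuming each zip code’s prices sit in a pandas Series, the monthly-return step above is a one-liner with `pct_change`; this sketch stands in for the original gist:

```python
import pandas as pd


def monthly_returns(prices: pd.Series) -> pd.Series:
    """One-period simple return: r_t = p_t / p_{t-1} - 1.
    The first value is NaN (no prior period) and is dropped."""
    return prices.pct_change().dropna()
```

The resulting series can then be plotted directly with `.plot()` as in the figure above.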

From this plot, there is no clear trend in the data, and the series is likely stationary, which is an assumption needed to build the model. A helpful visualization is to plot the monthly returns along with their rolling mean and standard deviation, which should not display a trend if the data is stationary.

# Plot each zip code's returns with their respective rolling mean
# and rolling standard deviation. Visually test for stationarity.
for i in range(len(dfs_ts)):
    rolmean = dfs_ts[i].ret.rolling(window=12, center=False).mean()
    rolstd = dfs_ts[i].ret.rolling(window=12, center=False).std()
    fig = plt.figure(figsize=(11, 5))
    orig = plt.plot(dfs_ts[i].ret, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title(f'Rolling Mean & Standard Deviation for Zipcode: {dfs_ts[i].RegionName[0]}')
    plt.show()
Rolling mean and standard deviation of monthly returns

From the rolling mean and standard deviation, I notice no clear trend, so the time series is most likely stationary. I performed the same visualizations and analysis on the remaining zip codes. However, a visual inspection alone is not rigorous enough to justify fitting an ARIMA model. Therefore, it is necessary to perform an Augmented Dickey-Fuller test for stationarity.

ADFuller test p-value for zipcode: 70808
p-value: 0.024753431146719584
Reject the null hypothesis. Data is stationary.

ADFuller test p-value for zipcode: 29461
p-value: 0.3548707714514657
Fail to reject the null hypothesis. Data is not stationary.

ADFuller test p-value for zipcode: 3820
p-value: 0.36376171386715545
Fail to reject the null hypothesis. Data is not stationary.

ADFuller test p-value for zipcode: 52722
p-value: 1.780143393207466e-07
Reject the null hypothesis. Data is stationary.

ADFuller test p-value for zipcode: 70809
p-value: 0.1059872001451227
Fail to reject the null hypothesis. Data is not stationary.

From the results, three of the 5 zip codes turned out not to be stationary at a 95% confidence level. Therefore, for those 3 non-stationary zip codes, the ‘I’ (integration) parameter of the ARIMA model is going to be 1.

SARIMA Model

SARIMA stands for Seasonal AutoRegressive Integrated Moving Average. So far, I already have the value of the integration parameter for each model. Now it’s time to search for the seasonal (S), AR (p), and MA (q) parameters that will provide the best fit for each of the zip codes.

The AR and MA parameters can be estimated using the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the stationary time series.

ACF and PACF plot function

Plots are created and visualized for the 5 zip codes to attempt to find the best estimates for the AR and MA parameters for the model. The following is a reference table to try to estimate those parameters:

Source: StackExchange

For most of the zip codes, both plots follow a decaying pattern; that is, both the ACF and PACF tail off. Therefore, the models are likely to include both parameters. To aid in finding the best values, I used pmdarima’s auto_arima function.

results = pm.auto_arima(TS_70808, information_criterion='aic', m=12, d=0,
                        start_p=1, start_q=1, max_p=3, max_q=3,
                        stepwise=True, trace=True, error_action='ignore',
                        suppress_warnings=True)
results
ARIMA(callback=None, disp=0, maxiter=None, method=None, order=(1, 0, 1),
out_of_sample_size=0, scoring='mse', scoring_args={},
seasonal_order=(0, 0, 1, 12), solver='lbfgs', start_params=None,
suppress_warnings=True, transparams=True, trend=None,
with_intercept=True)

From auto_arima I got the ARIMA parameters (p, I, q), which are (1, 0, 1), and the parameters of the seasonal component of the time series (P, D, Q, S), which are (0, 0, 1, 12). Hence, the data has a seasonality of 12 periods. This was repeated for all zip codes, and only this one showed seasonality. The parameters were passed to the model, and the summary is as follows:

Model summary and residual plots for zip code 70808

Ideally, the residuals end up being white noise, since all of the signal in the data should be captured by the model. The residuals should be normally distributed and uncorrelated. From the residual plots, it’s clear the residuals are normally distributed; however, there is still some autocorrelation. The model could be further improved by adding exogenous variables that capture more of the remaining signal, but that is outside the scope of this project, so I will accept these results as the best model using only the data provided.

Functions for model fit, RMSE calculation and plots

I decided to leave the last 3 years of data as the test set and train the model on everything before that. Once I arrived at the model, I calculated the RMSE on the training set and then performed one-step-ahead forecasts on the test set to assess the goodness of fit of the model.

RMSE calculation and goodness of fit

The RMSEs are very similar between the train and test sets, and small in general. Hence, the predictive accuracy of the model is adequate for producing out-of-sample forecasts.

Forecasting

Once more the model was trained, but now using all of the data. Then, I used the model to generate forecasted monthly returns for the next 10 years. With this information, I calculated total returns for horizons of 1, 3, 5, and 10 years.

Forecasting model, calculations and plot
10-year monthly returns forecast
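Turning forecasted monthly returns into horizon totals is just compounding; a sketch of that last calculation (the function name and horizons tuple are mine):

```python
import numpy as np


def total_returns(monthly_forecast, horizons_years=(1, 3, 5, 10)):
    """Compound forecasted monthly returns into a total return per horizon:
    total = prod(1 + r_t) - 1 over the first 12 * h months."""
    out = {}
    for h in horizons_years:
        r = np.asarray(monthly_forecast)[: 12 * h]
        out[h] = float(np.prod(1 + r) - 1)
    return out
```

For example, a constant 1% monthly return compounds to about 12.7% over one year, not 12%.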

Summary and Findings

After performing time series analysis on the 5 zip codes and forecasting total returns for up to ten years, I recommend the company invest in the following 3 zip codes:

  1. 3820: Dover, NH (216.85% 10-year total return)
  2. 29461: Moncks Corner, SC (216.44% 10-year total return)
  3. 70809: Baton Rouge, LA (184.64% 10-year total return)

Note: Forecasts are solely based on historic monthly returns, and past performance does not necessarily predict future results. To improve the models and ultimately the quality of the forecasts, other external factors should be taken into account.


Fernando Aguilar

Data Analyst at Enterprise Knowledge, currently pursuing an MS in Applied Statistics at PennState, and Flatiron Data Science Bootcamp Graduate.