Using Zillow Housing Data to Forecast Home Values in Washington State


By Salome Grasland

04/13/2023

Business Understanding

Our client Steady has had great success with their sustainable housing community in King County, WA. They are looking to build a new community elsewhere in the state of Washington. They want the community to be affordable to first-time homeowners, so they are looking for the best zip codes to invest in, with homes valued at 500,000 USD or less.

The Problem

Steady, a housing development agency, is unsure of where to build their next housing community. It has to be in the state of Washington and affordable to first-time homeowners, so home values need to be 500,000 USD or less.

The Solution

Zillow data from April 1996 to April 2018 was used to build a time series model, which was then used to forecast the best zip codes to build in.

Metric: ROI

The metric used is return on investment (ROI). The top 5 zip codes with the best average one-year ROI over the past 5 years were isolated. Models were trained on the zip code with the highest ROI, and the best model was then used to forecast which 3 of those zip codes would have the best ROI for the next year.

Data Understanding

Zillow Housing Data

This data represents median monthly housing sales prices for 14,723 zip codes over the period April 1996 through April 2018, as reported by Zillow. The data was originally created by Zillow and obtained from Kaggle for this project.

Each row represents a unique zip code. Each record contains location information, such as city, zip code, and metro area, along with median housing sales prices for each month.

The dates will be used as the index for time series analysis of monthly housing prices spanning April 1996 to April 2018.

There are 14,723 rows and 272 variables:

  • RegionID: Unique index, 58196 through 753844
  • RegionName: Unique Zip Code, 1001 through 99901
  • City: City in which the zip code is located
  • State: State in which the zip code is located
  • Metro: Metropolitan Area in which the zip code is located
  • CountyName: County in which the zip code is located
  • SizeRank: Numerical rank of size of zip code, ranked 1 through 14723
  • 1996–04 through 2018–04: refers to the median housing sales values for April 1996 through April 2018, that is 265 data points of monthly data for each zip code
  • Value: refers to the median housing price
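
To ground the description above, here is a minimal sketch of loading the data and narrowing it to Washington (the file name zillow_data.csv is an assumption; the column names follow the list above):

import pandas as pd

# load the Zillow dataset obtained from Kaggle (file name is an assumption)
df = pd.read_csv('zillow_data.csv')

# keep only Washington zip codes, per the business problem
df = df[df['State'] == 'WA']
print(df.shape)
df.head()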

Exploring Return on Investments

Let’s look at the 20 year, 10 year, 5 year, and 1 year ROI for each zip code in Washington. ROI is equal to ((current price - original price) / original price) * 100.

Highest 20 year ROI: 230%
Highest 10 year ROI: 66%
Highest 5 year ROI: 115%
Highest 1 year ROI: 35%

It is expected that there will be a relationship between time and ROI, since real estate typically increases in value over time. However, the 10 year ROI was lower than the 5 year ROI. This is most likely due to the 2008 housing crisis. If the housing crisis had not occurred, we would expect the 10 year ROI to be around 180%. We will need to consider the impact of the 2008 housing crisis on our time series model.

# Create 20 year, 10 year, 5 year, and 1 year ROI
# ROI is equal to ((current price - original price) / original price) * 100
df['20_yr_ROI'] = (df['2018-04'] - df['1998-04'])/(df['1998-04'])*100
df['10_yr_ROI'] = (df['2018-04'] - df['2008-04'])/(df['2008-04'])*100
df['5_yr_ROI'] = (df['2018-04'] - df['2013-04'])/(df['2013-04'])*100
df['1_yr_ROI'] = (df['2018-04'] - df['2017-04'])/(df['2017-04'])*100

Average one year ROI over the past 5 years

The top 5 zipcodes with the highest average one year ROI over the past 5 years are:

  • 98168 = 27.78%
  • 98146 = 25.96%
  • 98043 = 25.77%
  • 98294 = 25.35%
  • 98251 = 25.20%
# Let's look at the ROI from 2018, 2017, 2016, 2015, 2014 
# And average the 5 values together
df['2017_ROI'] = (df['2017-04'] - df['2016-04'])/(df['2016-04'])*100
df['2016_ROI'] = (df['2016-04'] - df['2015-04'])/(df['2015-04'])*100
df['2015_ROI'] = (df['2015-04'] - df['2014-04'])/(df['2014-04'])*100
df['2014_ROI'] = (df['2014-04'] - df['2013-04'])/(df['2013-04'])*100
df['avg_1_yr_ROI'] = ((df['1_yr_ROI'] + df['2017_ROI'] + df['2016_ROI'] + df['2015_ROI'] + df['2014_ROI'])/5)
df.drop(['2017_ROI', '2016_ROI', '2015_ROI', '2014_ROI'], axis=1, inplace=True)

df.head()

Date Indexing

When working with time series data in Python it is helpful to have the dates in the index, so we will reshape the dataset to be indexed by date.

# creating a function to melt the data from wide to long format
import pandas as pd

def melt_data(df):
    melted = pd.melt(df, id_vars=['Zipcode'], var_name='Date')
    melted['Date'] = pd.to_datetime(melted['Date'], infer_datetime_format=True)
    melted = melted.dropna(subset=['value'])
    return melted
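
After melting, we set the dates as the index. A minimal sketch, assuming a Washington subset df_wa (a hypothetical name) with a Zipcode column plus the monthly value columns expected by melt_data:

# melt to long format, then move the dates into the index
melted_df = melt_data(df_wa)
melted_df = melted_df.set_index('Date')
melted_df.head()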

Time-series Index Slicing for Data Selection

Given the 2008 housing bubble, I have decided to slice the dates and start the analysis in April 2013. As we can see from the plot below, around 2013 is when home prices start increasing again and displaying more normal behavior. This leaves us with a five year time window (April 2013 to April 2018).
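
As a quick sketch of that slice (reusing the hypothetical date-indexed melted_df from above):

# keep only April 2013 onward, leaving a five year window through April 2018
melted_df = melted_df.sort_index()
melted_df = melted_df['2013-04-01':]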

Exploring Home Value

Top 5 Zipcode Analysis

The top 5 zip codes with the highest average one year ROI over the past 5 years are:

  • 98168 = 27.78%
  • 98146 = 25.96%
  • 98043 = 25.77%
  • 98294 = 25.35%
  • 98251 = 25.20%

Let’s visualize each of our top zip codes and run a Dickey-Fuller test, a statistical test for stationarity. The null hypothesis in this case is that the time series is not stationary, so if the test statistic is less than the critical value, the null hypothesis can be rejected and we can say the series is stationary.

We can also see from the previous plot that all the zip codes exhibit an upward linear trend. We will need to attempt to detrend each series by differencing the data. Differencing can be used to remove a series’ dependence on time; it is performed by subtracting the previous observation from the current observation.

# Visualizing The Value of the Top Five Zip Codes

# to set the plot size
plt.figure(figsize=(16, 8), dpi=150)

# using plot method to plot values.
# in plot method we set the label and color of the curve.
zip_98168['value'].plot(label='98168', color='orange')
zip_98043['value'].plot(label='98043')
zip_98146['value'].plot(label='98146')
zip_98251['value'].plot(label='98251')
zip_98294['value'].plot(label='98294')

# adding title to the plot
plt.title('Top 5 Zipcodes Value')

# adding label to the x-axis
plt.xlabel('Year')

# adding legend to the curve
plt.legend();
# Generate a box and whiskers plot for the top 5 zipcodes
box = zip_df.boxplot(figsize=(12, 7))
# adding title
plt.title('Top 5 Zipcode Value')
# adding xlabel
plt.xlabel('Zipcode')
# adding ylabel
plt.ylabel('Value');


Let’s take a look at zip code 98168. The p-value is 0.94, so we cannot reject the null hypothesis (equivalently, the test statistic is greater than the 5% critical value of -2.91), meaning the series is not stationary.

from statsmodels.tsa.stattools import adfuller

# Perform Dickey-Fuller test:
print('Results of Dickey-Fuller Test: \n')
dftest = adfuller(zip_98168['value'])

# Extract and display test results in a user friendly manner
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)

# differencing in order to remove trends
zip_98168_w = zip_98168.diff(periods=1)

# plotting the differenced zipcode
fig = plt.figure(figsize=(11, 7))
plt.plot(zip_98168_w, color='blue', label='Differenced value')
plt.legend(loc='best')
plt.title('Zipcode 98168 Differenced')
plt.show(block=False)

Stationary Zip Code Analysis

Now that we’ve differenced our top 5 zip codes and attempted to make them stationary, let’s do some analysis.

#Plotting the top 5 differenced zipcodes
zip_df_w.plot(figsize=(13,8), subplots=True, legend=True);

ACF and PACF

ACF and PACF are tools we can use to help us find the order of AR, MA, and ARMA models.

PACF

The PACF (partial autocorrelation function) measures the correlation between a variable and a given lag of itself that is not explained by lower-order lags.

The shaded area on the graph is our confidence interval. If the correlation at a lag falls inside the confidence interval, it is not statistically significant.

ACF

The ACF (autocorrelation function) is the correlation of the time series with a lagged version of itself.

Results

  • The ACF plot showed a strong correlation at lag 1
  • Both the ACF and PACF plots show geometric decay
  • The PACF plot showed strong correlations at lags 1 and 2

This suggests our time series could best be modeled with an ARMA model with AR order p = 2 and MA order q = 1.

# plotting ACF
# strong correlation at lag 1
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(zip_98168.diff().dropna());

# plotting PACF
# strong correlation at lags 1 and 2
plot_pacf(zip_98168.diff().dropna());

Modeling

Train/Test Split

When doing a train/test split on a time series you cannot randomly split the data, since it is time sensitive. Instead, we will use the first four years of data for training and the last year of data for testing.

#Creating training data based on 2013-04-01 to 2017-04-01
zip_train = zip_df_w['2013-04-01': '2017-04-01']
train = zip_98168_w[:'2017-04-01']
zip_98043_train = zip_98043_w[: '2017-04-01']
zip_98146_train = zip_98146_w[: '2017-04-01']
zip_98251_train = zip_98251_w[: '2017-04-01']
zip_98294_train = zip_98294_w[: '2017-04-01']

#Creating test data based on the last year of our data
zip_test = zip_df_w['2017-04-01':]
valid = zip_98168_w['2017-04-01':]
zip_98043_test = zip_98043_w['2017-04-01':]
zip_98146_test = zip_98146_w['2017-04-01':]
zip_98251_test = zip_98251_w['2017-04-01':]
zip_98294_test = zip_98294_w['2017-04-01':]
#observing the split
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(train, label='train')
# ax.plot(valid)
ax.plot(valid, label='valid')
ax.set_title('Train-Validation Split');
plt.legend();

Modeling

RMSE

Root mean square error (RMSE) is a commonly used measure for evaluating the quality of a model’s predictions. It measures how far predictions are from the true values using Euclidean distance.

The RMSE is computed for the naive model to compare against our later models.

AIC

The Akaike information criterion (AIC) is an estimator of prediction error. A lower AIC generally indicates a better model. AIC will be used to compare our models.

Baseline Model
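
The naive (random walk) baseline simply shifts the series forward by one step, so each prediction is the previous observation. A minimal sketch of building it and scoring it with RMSE, assuming train is the differenced training series created in the split above (the original’s construction of naive is not shown):

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# naive baseline: predict each value as the previous observation (assumption)
naive = train.shift(1)

# align actuals and predictions, dropping NaNs from differencing and shifting
aligned = pd.concat([train, naive], axis=1, keys=['actual', 'naive']).dropna()

# RMSE of the naive baseline on the training window
baseline_rmse = np.sqrt(mean_squared_error(aligned['actual'], aligned['naive']))
print(f'Naive baseline RMSE: {baseline_rmse:.2f}')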

# visualizing the shift
fig, ax = plt.subplots()

train[0:30].plot(ax=ax, c='r', label='original')
naive[0:30].plot(ax=ax, c='b', label='shifted')
ax.set_title('naive')
ax.legend();

The Autoregressive Model (AR)

The autoregressive model predicts each value from the series’ own past values. We fit it with the ARIMA class, which takes three values for its order:

  • p: represents the AR term
  • d: represents the differencing order
  • q: represents the MA (lagged error) term
# ARIMA here is assumed to be statsmodels' ARIMA implementation
from statsmodels.tsa.arima.model import ARIMA

# AR model with order of 1, 1, 0
ar_1 = ARIMA(train, order=(1, 1, 0)).fit()

# We pass typ='levels' to convert the predictions back from the differenced scale.
ar_1.predict(typ='levels')
ar_1.summary()

Moving Average Model (MA)

This model is based on error: it makes predictions based on how far off the model’s forecasts were at previous points in time.

#moving average model with order of 0, 0, 1
ma_1 = ARIMA(train, order=(0, 0, 1)).fit()
y_hat = ma_1.predict(typ='levels')
y_hat
ma_1.summary()

ARMA

Let’s combine our AR and MA values to fine-tune our model. Our best performing AR model had an order of 2, 1, 0 and our best performing MA model had an order of 0, 1, 2. This is in line with the results of our ACF and PACF plots from earlier.

#ARMA model with order of 2, 1, 2
arma_21 = ARIMA(train, order=(2, 1, 2)).fit()

#pulling up summary
arma_21.summary()
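
To support the AIC comparison that follows, here is a short sketch assuming the models fit above (the article also references ar_2 and ma_2 fits that are not shown):

# compare the AIC of the fitted models; lower generally indicates a better model
for name, model in [('AR (1,1,0)', ar_1), ('MA (0,0,1)', ma_1), ('ARMA (2,1,2)', arma_21)]:
    print(f'{name}: AIC = {model.aic:.2f}')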

Auto ARIMA

According to the AIC, our best model so far is an ARMA model with order 2, 1, 2.

However, let’s use an Auto ARIMA model, which automatically searches for the best parameters and model to use.

According to our Auto ARIMA results a SARIMAX model with order 0, 1, 3 and seasonal order of 0, 0, 0, 0 should be tried.

# In order for this cell to run, you may need to install pmdarima if you haven't already.
import pmdarima as pm
auto_model = pm.auto_arima(train, start_p=0, start_q=0)
auto_model.summary()

SARIMAX

# SARIMAX model with order of 0, 1, 3
from statsmodels.tsa.statespace.sarimax import SARIMAX

sari_mod = SARIMAX(train,
                   order=(0, 1, 3),
                   seasonal_order=(0, 0, 0, 12),
                   enforce_stationarity=False,
                   enforce_invertibility=False).fit()
sari_mod.summary()

Comparing Models

Let’s find the best model by comparing the RMSE on the training and test data. The model with the best training RMSE was the ARMA (2, 1, 2) model with a value of 665; the SARIMAX model had a value of 697.

The model with the best testing RMSE was the SARIMAX model, with a value of 2887.

Since the SARIMAX model performed better on our testing data, this is the model we will use to forecast our time series.

# Finding RMSE
import numpy as np
from sklearn.metrics import mean_squared_error

def find_rmse(model, train_data=train):
    y_hat = model.predict(typ='levels')
    return np.sqrt(mean_squared_error(train_data, y_hat))

# compare RMSE of training data
print(find_rmse(random_walk_model))
print(find_rmse(ar_1))
print(find_rmse(ar_2))
print(find_rmse(ma_1))
print(find_rmse(ma_2))
print(find_rmse(arma_21))
print(find_rmse(sari_mod))
# compare RMSE of testing data
def find_rmse_test(model, test_data=valid):
    y_hat = model.predict(start=test_data.index[0], end=test_data.index[-1], typ='levels')
    return np.sqrt(mean_squared_error(test_data, y_hat))
print(find_rmse_test(random_walk_model))
print(find_rmse_test(ar_1))
print(find_rmse_test(ar_2))
print(find_rmse_test(ma_1))
print(find_rmse_test(ma_2))
print(find_rmse_test(arma_21))
print(find_rmse_test(sari_mod))
3194.9479350518855
3416.0085509530472
3023.8846225127754
2405.197920753284
2995.382779587026
3612.2246646728277
2887.381390559421

Best Model: SARIMAX

Our SARIMAX model with an order of 0, 1, 3 and seasonal order of 0, 0, 0, 0 performed the best on our testing data.

# visualizing observed versus predicted data
y_hat_train = sari_mod.predict(typ='levels')
y_hat_test = sari_mod.predict(start=valid.index[0], end=valid.index[-1],typ='levels')

fig, ax = plt.subplots()
ax.plot(train, label='train')
ax.plot(valid, label='test')
ax.plot(y_hat_train, label='train_pred')
ax.plot(y_hat_test, label='test_pred')

plt.legend();

Data Interpretation

Forecasting Each Zip

The best model will be fit to each of our top 5 zip codes and used to forecast home prices for the next year. ROI will be calculated by comparing the average forecast value to the average home value over the last year of our data.

98168

# fitting the best model to zip code 98168
sari_mod = SARIMAX(zip_98168_w,
                   order=(0, 1, 3),
                   seasonal_order=(0, 0, 0, 0),
                   enforce_stationarity=False,
                   enforce_invertibility=False).fit()

# getting the forecast for 1 year out
forecast = sari_mod.forecast(steps=12)

# visualizing the forecast
fig, ax = plt.subplots()
ax.plot(zip_98168_w, label='so far')
ax.plot(forecast, label='forecast')
ax.set_title('Zipcode 98168 Home Value Prediction\n One Year out')
plt.legend();

# Getting the yearly average of the forecast
y_pred = forecast.mean()
print(y_pred)

# Getting the yearly average of the last year of data
y = zip_98168_w.tail(12).mean()
print(y)

# Finding ROI
ROI_98168 = ((y_pred - y) / y) * 100
print(ROI_98168)

We then repeat this process for the other four zip codes, as sketched below.
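
A minimal sketch of that repetition, assuming each differenced series is a pandas Series and collecting them in a dict (the helper forecast_roi and the dict are hypothetical; the original repeats the cells by hand):

# hypothetical helper: fit the chosen SARIMAX order and return the forecast ROI
def forecast_roi(series, steps=12):
    model = SARIMAX(series,
                    order=(0, 1, 3),
                    seasonal_order=(0, 0, 0, 0),
                    enforce_stationarity=False,
                    enforce_invertibility=False).fit()
    forecast = model.forecast(steps=steps)
    last_year_avg = series.tail(steps).mean()
    return ((forecast.mean() - last_year_avg) / last_year_avg) * 100

# differenced top-5 series keyed by zip code
top_zips = {'98168': zip_98168_w, '98146': zip_98146_w, '98043': zip_98043_w,
            '98294': zip_98294_w, '98251': zip_98251_w}

predicted_roi = {zipcode: forecast_roi(series) for zipcode, series in top_zips.items()}
print(predicted_roi)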

Predicted ROI of The Top Five Zip Codes

Predicted one-year ROI, in percent, by zip code:

  • 98146: 18.630165
  • 98294: -0.095713
  • 98043: -22.763217
  • 98168: -29.933986
  • 98251: -35.449640

#Visualizing predicted ROI
ROI_df_sort.plot(figsize= (10, 10), color = 'teal', x = 'Zipcode', y= 'Predicted ROI', kind='bar', title='Predicted ROI of Top 5 Zipcodes', ylabel = 'ROI (in percent)', xlabel = 'Zipcode');

Recommendations and Conclusion

Steady, our client, wants to know which zip code in Washington they should look at for their next housing community.

The top three zip codes we recommend Steady to invest in are:

  1. 98146, Seattle, WA which has a predicted 1 year ROI of 18.63%
  2. 98294, Sultan, WA which has a predicted 1 year ROI of -0.09%
  3. 98043, Mountlake Terrace, WA which has a predicted 1 year ROI of -22.76%

Future Work

In the future it would be great to have more recent data to work with. This data is limited due to the housing market crash of 2008. It would be interesting to see what this data looks like once we have some distance from the crash.

For More Information

Thank you for looking at this project. If you have any further questions please contact Salome “Sam” Grasland at samgrasland@gmail.com. Check out the full repo here: https://github.com/SSGrasland/Phase4Project-
