Using Statsmodels’ SARIMAX to Model Housing Data Pulled from Zillow

Andy Martin del Campo
Published in Analytics Vidhya · 3 min read · Nov 6, 2019

A time series is a series of data points indexed in time order. The following time series data is from zip code 94621. If you are familiar with the Bay Area, you may notice that this zip code is in Oakland, CA. This time series contains the average sale price per month of real estate in 94621 from July 2013 until April 2018.

Now let's get a picture of our data from this time period:

As you can see, the average housing sale price has more than doubled, for an ROI of about 140%. If you are not familiar with the term, the formula for ROI is [(current price) - (original price)] / (original price). You can draw many conclusions, and even make a few predictions about future prices in 94621, just by looking at the plot. A more rigorous approach is to use Statsmodels’ SARIMAX, or Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors model.
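To make the ROI formula concrete, here is a minimal sketch in Python. The two prices are hypothetical, picked only so the result comes out to 140%; they are not the actual Zillow figures for 94621.

```python
def roi(original_price: float, current_price: float) -> float:
    """Return ROI as a fraction: (current - original) / original."""
    if original_price <= 0:
        raise ValueError("original_price must be positive")
    return (current_price - original_price) / original_price


# Hypothetical prices, chosen only to illustrate a 140% ROI.
start_price = 200_000
end_price = 480_000
print(f"ROI: {roi(start_price, end_price):.0%}")
```

An ROI above 100% means the price has more than doubled, which matches what the plot shows.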

If you are familiar with time series modeling, you can see that there is essentially no seasonality in our curve. I will therefore supply only the non-seasonal order (p, d, q) and leave the seasonal order at its default of (0, 0, 0, 0). For more information on SARIMA models, try out this blog.

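import matplotlib.pyplot as plt
import statsmodels.api as sm

# ts: the monthly average sale prices for 94621, indexed by date
best_order = (1, 0, 2)  # (p, d, q)

ARIMA_MODEL = sm.tsa.statespace.SARIMAX(ts,
                                        order=best_order,
                                        enforce_stationarity=False,
                                        enforce_invertibility=False)
output = ARIMA_MODEL.fit()
output.plot_diagnostics(figsize=(14, 18))  # four residual diagnostic plots
plt.gcf().autofmt_xdate()
plt.show()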

The two main components of SARIMAX are the time series (ts) you put in and the order of (p, d, q). In this example, I found that (1, 0, 2) worked well enough after some testing (feel free to reach out to ask how I got those values). The diagnostic plots appear as follows:
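If the order (p, d, q) feels abstract, the sketch below shows what (1, 0, 2) means in the textbook ARMA form: one autoregressive term (p = 1), no differencing (d = 0), and two moving-average terms (q = 2). The function name, coefficients, constant, and data are all hypothetical, and statsmodels' internal state-space parameterization differs in detail, so treat this as an illustration rather than what SARIMAX literally computes.

```python
def arma_one_step(history, errors, const, ar_coefs, ma_coefs):
    """One-step-ahead ARMA prediction (d = 0, so no differencing).

    history: past observations, most recent last
    errors:  past one-step forecast errors, most recent last
    """
    p, q = len(ar_coefs), len(ma_coefs)
    ar_part = sum(ar_coefs[i] * history[-1 - i] for i in range(p))
    ma_part = sum(ma_coefs[j] * errors[-1 - j] for j in range(q))
    return const + ar_part + ma_part


# Hypothetical coefficients and data, for illustration only.
ar_coefs = [0.9]        # p = 1: one autoregressive term
ma_coefs = [0.5, 0.2]   # q = 2: two moving-average terms
history = [100.0, 104.0, 109.0]
errors = [1.0, -2.0, 3.0]

next_value = arma_one_step(history, errors, const=12.0,
                           ar_coefs=ar_coefs, ma_coefs=ma_coefs)
print(next_value)
```

The prediction uses only the most recent observation and the two most recent forecast errors, which is exactly what p = 1 and q = 2 say.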

These four plots can help you determine whether your SARIMAX model can actually yield meaningful information. Each diagnostic plot is titled (standardized residuals, a histogram with estimated density, a normal Q-Q plot, and a correlogram), and you can do more research on each one to see exactly what it shows. For now, just know that in this scenario the residuals are consistent with the assumptions behind a SARIMAX model.
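As a rough idea of what the correlogram checks, the sketch below computes sample autocorrelations by hand and compares them with the approximate 95% band for white noise. The residuals here are simulated stand-ins, not the output of the model above, and the helper name is made up for this example.

```python
import math
import random


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom


# Stand-in residuals: simulated white noise, NOT the article's model output.
random.seed(0)
residuals = [random.gauss(0, 1) for _ in range(500)]

bound = 1.96 / math.sqrt(len(residuals))  # approximate 95% band for white noise
acf = {lag: autocorrelation(residuals, lag) for lag in range(1, 11)}
outside = [lag for lag, r in acf.items() if abs(r) > bound]
print(f"lags outside ±{bound:.3f}: {outside}")
```

With a fitted model you would run the same check on its residuals (in statsmodels, the results object exposes them as `resid`). If many lags fall outside the band, the model has probably left structure in the data unexplained.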

prediction = output.get_forecast(steps=60)  # 60 monthly steps = 5 years
pred_conf = prediction.conf_int()  # lower and upper bounds (95% by default)

# df: the observed data being plotted
ax = df.plot(label='observed', figsize=(20, 15))
prediction.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_conf.index,
                pred_conf.iloc[:, 0],
                pred_conf.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Values')
plt.legend()
plt.show()

Now, for what everyone really wants out of time series models: forecasting. The above code takes the fitted model, uses it to forecast 60 months (five years) into the future, and then plots this information:

The yellow line shows the forecast values for dates the model has not yet seen. The grey area around the line is the cone of uncertainty: the confidence interval returned by conf_int(), which widens the further out the forecast goes. As you can see, this model predicts that housing prices will rise rather rapidly in this area over the next five years.
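To see why the uncertainty takes the shape of a cone, here is a simplified sketch using a zero-mean AR(1) process as a stand-in. This is not the (1, 0, 2) model fitted above, and the function name and parameters are hypothetical. The idea is that each extra step ahead adds another round of noise, so the forecast variance accumulates.

```python
import math


def ar1_forecast_interval(last_value, phi, sigma, steps, z=1.96):
    """Return (mean, lower, upper) for each horizon of a zero-mean AR(1) forecast."""
    rows = []
    mean, variance = last_value, 0.0
    for _ in range(steps):
        mean = phi * mean                            # point forecast decays by phi
        variance = phi ** 2 * variance + sigma ** 2  # noise accumulates each step
        half_width = z * math.sqrt(variance)
        rows.append((mean, mean - half_width, mean + half_width))
    return rows


# Hypothetical parameters, for illustration only.
intervals = ar1_forecast_interval(last_value=10.0, phi=0.9, sigma=1.0, steps=5)
for h, (mean, lower, upper) in enumerate(intervals, start=1):
    print(f"h={h}: {mean:.2f} [{lower:.2f}, {upper:.2f}]")
```

The interval gets wider at every horizon, which is the same behavior the grey band shows in the forecast plot.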

ARIMA and SARIMA models are just the tip of the iceberg in time series analysis, and things can quickly get much more complicated. Still, who doesn’t want to be able to predict the future?

Thank you for reading this blog post. Send me a message if you have any questions or concerns.

Aspiring Data Scientist with a background in Electrical Engineering and Networking. Passionate about motorcycles and coffee.