Chicago Real Estate Time Series

Allison Kelly
Published in The Startup · Oct 23, 2019

A love story in Python.

Though it is one of the most populous metropolitan areas in the United States, Chicago is overshadowed by New York and LA and can seem like the forgotten middle child. But just as the overlooked sibling can get away with mischief, Chicago can provide a higher standard of living to its residents without the absurd cost of living of its coastal counterparts.

Chicago is my hometown, and after more than six years away, I am homesick. With a burgeoning career in data science, however, Chicago is looking like a very real — and affordable — place to consider moving back to.

Why live in Chicago?

Data science and machine learning roles are the fastest-growing jobs in the U.S., according to LinkedIn’s 2017 U.S. Emerging Jobs Report, and are concentrated in the ten largest metropolitan areas. Chicago tied with Boston as the second most technologically innovative city in the U.S. based on a 2018 global poll conducted by KPMG.

So we know the job market is there, but how does Chicago rank for affordability? Kiplinger, a D.C.-based publisher of business forecasts and personal finance advice, rated Chicago a four on a ten-point scale of affordability, with ten being the least affordable. Plus, NerdWallet’s cost of living calculator estimates Chicago to be 50% more affordable than New York City and 40% more affordable than San Francisco.

So now that I’ve validated my choice to apply for positions in my hometown, let’s look at the state of Chicago’s real estate market and analyze where I should invest in property.

Time Series Analysis in Python

Living somewhat nomadically for the past six years has left me yearning for a place to call home, free of landlords and noisy neighbors. But Chicago is a big city, so I narrowed my analysis down to 11 zipcodes that encompass neighborhoods such as Logan Square, Ukrainian Village, and Pilsen.

We’ll be using Zillow’s housing data, which ranges from 1996 to 2018 and includes the median housing price for every zipcode in the U.S. You can find the dataset here.

Below is the dataframe narrowed down to the chosen zipcodes, with all irrelevant data removed. We’ll need to get this into a format that can be modeled.
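The filtering step itself isn’t shown in the article; here is a minimal sketch of how it might look, assuming the raw export is saved as zillow_data.csv and carries Zillow’s usual metadata columns (both names are assumptions, not from the original):

import pandas as pd

# Assumed filename and column names for the raw Zillow export
df = pd.read_csv('zillow_data.csv')

# Keep only Chicago rows
chicago_df = df[(df['City'] == 'Chicago') & (df['State'] == 'IL')]

# Drop metadata columns, leaving the zipcode ('RegionName') and the
# monthly median-value columns
chicago_df = chicago_df.drop(['RegionID', 'City', 'State', 'Metro',
                              'CountyName', 'SizeRank'], axis=1)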

Data Preprocessing

First, we’ll change the column names to datetime format with this function:

import pandas as pd

def get_datetimes(df):
    # Convert the monthly column labels (e.g. '1996-04') to datetimes,
    # skipping the first column, which holds the zipcode
    return pd.to_datetime(df.columns.values[1:], format='%Y-%m')

get_datetimes(chicago_df)
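As written, the function returns a new DatetimeIndex without attaching it to the dataframe; one way to apply it (an assumption about the intended use) is to assign it back over the monthly columns:

# Assumed usage: relabel the monthly columns, keeping 'RegionName' first
chicago_df.columns = list(chicago_df.columns[:1]) + list(get_datetimes(chicago_df))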

Next, we’ll save each zipcode into separate CSVs and create a dictionary where the keys represent the zipcode and the values are the corresponding timeseries dataframe. The “RegionName” column holds the zipcodes.

chosen_zips = [60607, 60661, 60612, 60622, 60642, 60647,
               60614, 60625, 60618, 60657, 60608, 60616]
chi_dict = {}

for zipcode, data in chicago_df.groupby('RegionName'):
    if zipcode in chosen_zips:
        data.to_csv('{}.csv'.format(zipcode), header=True, index_label=False)
        chi_dict[zipcode] = pd.read_csv('{}.csv'.format(zipcode))
    else:
        continue

Now we’ll reshape each of the dataframes into long format using the pandas melt() function, iterating through the dictionary and calling each zipcode individually.

def melt_data(df):
    # Reshape wide monthly columns into a long (time, value) format
    melted = pd.melt(df, id_vars=['RegionName'], var_name='time')
    melted['time'] = pd.to_datetime(melted['time'], infer_datetime_format=True)
    melted = melted.dropna(subset=['value'])
    return melted.groupby('time').aggregate({'value': 'mean'})

for zipcode in chi_dict.keys():
    chi_dict[zipcode] = melt_data(chi_dict[zipcode])

So we’re left with the data in time series format.
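The screenshot isn’t reproduced here; a quick peek at one zipcode (60647, which we’ll model later) confirms the shape:

print(chi_dict[60647].head())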

Exploratory Data Analysis

To make this time series analysis meaningful, we’ll need to narrow things down to affordable neighborhoods. Looking at the distribution of home values in each zipcode side-by-side, we see that most of the median values sit below $400,000, but five in particular are above it. As this would be a starter home for me, I’m going to remove the more expensive zipcodes.
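The side-by-side comparison appeared as a chart in the original; here is a minimal sketch of one way to draw it, using a boxplot per zipcode with a $400,000 reference line (the plotting approach is an assumption):

import matplotlib.pyplot as plt

# One boxplot per zipcode, with the $400k affordability line for reference
fig, ax = plt.subplots(figsize=(14, 6))
ax.boxplot([chi_dict[zc]['value'].dropna() for zc in chi_dict],
           labels=[str(zc) for zc in chi_dict])
ax.axhline(400000, color='r', linestyle='--', label='$400k')
ax.set_ylabel('Median Home Value ($)')
ax.legend()
plt.show()

With the five priciest zipcodes identified, we drop them: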

for zipcode in [60607, 60614, 60622, 60642, 60657]:
    chi_dict.pop(zipcode)

Now we’ll look at how the values have changed over the years. The values have been normalized to put them on the same scale:
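That chart isn’t reproduced here; a sketch of one way to build it, assuming each series is scaled to its first observation (the article doesn’t specify the normalization):

# Scale each zipcode's series to its first value so all start at 1.0
fig, ax = plt.subplots(figsize=(14, 6))
for zc in chi_dict:
    series = chi_dict[zc]['value'].dropna()
    (series / series.iloc[0]).plot(ax=ax, label=zc)
ax.set_ylabel('Value (relative to first observation)')
ax.legend()
plt.show()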

We can see a marked drop around the time of the housing bubble burst, and for some of the zipcodes, the values were able to rebound to pre-burst levels. I’m interested in investing in property that will be more likely to withstand another crash. To assess this, we can simply find how far the value decreased during that time.

# Measure the peak-to-trough decline over the 2007-2013 crash window
for zc in chi_dict:
    crash = chi_dict[zc]['2007':'2013']
    percent_change = (crash.max() - crash.min()) / crash.max()
    print(f'{zc} min: {crash.min()} \n {zc} max: {crash.max()} \n Percent change: {percent_change * -100} \n\n')

60608 and 60612 both decreased in value by over 40%, so we’ll avoid those neighborhoods. Now that we’ve narrowed our options down to four zipcodes, we’ll begin the modeling phase.
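The removal of those two zipcodes isn’t shown in the article; presumably it mirrors the earlier step:

for zipcode in [60608, 60612]:
    chi_dict.pop(zipcode)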

SARIMAX Modeling

Why SARIMAX?

SARIMAX stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors. The model accounts for seasonality and other factors that keep a time series from being stationary, and stationarity is an assumption of many other time series models. Because of this, we don’t have to detrend the time series before we begin modeling. More on detrending here.
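As a side check (not part of the original workflow and not required for SARIMAX), the augmented Dickey-Fuller test can confirm the point; its null hypothesis is that the series is non-stationary:

from statsmodels.tsa.stattools import adfuller

# ADF test on one zipcode's raw series; a large p-value means we cannot
# reject non-stationarity, which is typical for trending home prices
adf_stat, p_value = adfuller(chi_dict[60647]['value'].dropna())[:2]
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}')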

Parameter Selection

The parameters of the ARIMA model are defined as follows:

p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced, also called the degree of differencing.
q: The size of the moving average window, also called the order of moving average. — Jason Brownlee, PhD, via Machine Learning Mastery

To find the optimal parameters for the model, we can conduct a “grid search” by iterating over every combination of p, d, and q, including an option for seasonality, denoted by s.

import itertools

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
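A quick peek at the grids shows what the search will cover: eight (p, d, q) combinations, each paired with eight seasonal combinations on a 12-month period:

print(pdq[:3])           # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
print(seasonal_pdq[:3])  # [(0, 0, 0, 12), (0, 0, 1, 12), (0, 1, 0, 12)]
print(len(pdq) * len(seasonal_pdq), 'combinations per zipcode')  # 64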

Optimizing for AIC

The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection. — Wikipedia
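Concretely, AIC = 2k − 2ln(L̂), where k is the number of estimated parameters and L̂ is the model’s maximized likelihood. The score rewards fit while penalizing complexity, and lower is better, so for each zipcode we’ll keep the parameter combination that minimizes it.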

We’ll look at values from 2011 onward, as an anomaly such as the housing bubble may affect future predictions.

for zc in chi_dict.keys():
    chi_dict[zc] = chi_dict[zc]['2011':]

In order to optimize for AIC, we’ll model each zipcode with each combination of parameters and place the results into a dataframe.

import warnings
import numpy as np
import statsmodels.api as sm

warnings.filterwarnings('ignore')

AIC = []
for zipcode in chi_dict.keys():
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(chi_dict[zipcode],
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                results = mod.fit()
                AIC.append([zipcode, param, param_seasonal, np.abs(results.aic)])
            except:
                continue

# Putting the information into a dataframe
AIC_df = pd.DataFrame(AIC, columns=['zip', 'pdq', 'pdqs', 'aic'])

Now we’ll group by zipcode and build a dictionary with the zipcode as the key and the dataframe of AIC results as the value. This may seem superfluous, but it will make sorting and modeling simpler later on.

AIC_dict = {}
for i, g in AIC_df.groupby('zip'):
    AIC_dict[i] = g

for zipcode in AIC_dict.keys():
    AIC_dict[zipcode].sort_values('aic', axis=0, inplace=True)
    AIC_dict[zipcode] = AIC_dict[zipcode].reset_index()
    AIC_dict[zipcode].drop(['index'], axis=1, inplace=True)

Now that they’re separate and sorted lowest to highest, let’s check out the first three parameter combinations for each zipcode:

for zipcode in AIC_dict.keys():
    print(AIC_dict[zipcode].iloc[0:3, :])

Modeling

To keep the rest of the article simple, we’ll look at one zipcode, 60647. My very first apartment was in this zipcode, with a bedroom so small my bed fit snugly with a wall on three sides and a closet too small to have a door!

def SARIMAX_model(zipcode):
    # Pull the parameter combination with the lowest AIC for this zipcode
    pdq = AIC_dict[zipcode].loc[AIC_dict[zipcode]['aic'].idxmin()]['pdq']
    pdqs = AIC_dict[zipcode].loc[AIC_dict[zipcode]['aic'].idxmin()]['pdqs']

    mod = sm.tsa.statespace.SARIMAX(chi_dict[zipcode]['value'],
                                    order=pdq,
                                    seasonal_order=pdqs,
                                    enforce_stationarity=False,
                                    enforce_invertibility=False)
    results = mod.fit()
    return results

model = SARIMAX_model(60647)
model.summary()

Now we’ll plot the diagnostics:

import matplotlib.pyplot as plt

def plot_diagnostics(model, zipcode):
    model.plot_diagnostics(figsize=(16, 8))
    plt.title(zipcode)
    plt.show()

plot_diagnostics(model, 60647)

The residuals appear to follow a Gaussian distribution.

Forecasting

From here, we can validate the model and predict future values.

from matplotlib import rcParams
from sklearn.metrics import mean_squared_error

def plot_predictions(model, zipcode):
    # Get one-step-ahead predictions from 2015 onward
    pred = model.get_prediction(start=pd.to_datetime('2015'), dynamic=False)
    pred_conf = pred.conf_int()

    rcParams['figure.figsize'] = 15, 6
    ax = chi_dict[zipcode]['2011':].plot(label='observed')  # Plot observed values
    pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.9)  # Plot predicted values
    ax.fill_between(pred_conf.index,
                    pred_conf.iloc[:, 0],
                    pred_conf.iloc[:, 1], color='g', alpha=.4)  # Shade the confidence interval

    # Set axes labels
    ax.set_xlabel('Date')
    ax.set_ylabel(f'Home Values for {zipcode}')
    plt.legend()
    plt.show()

    # Compute the MSE and RMSE against the observed values
    forecast = pred.predicted_mean
    real = chi_dict[zipcode]['value']['2015':]
    mse = mean_squared_error(real, forecast)
    rmse = np.sqrt(mse)
    print(f'The Mean Squared Error of our forecasts for {zipcode} is {round(mse, 2)}\n\n')
    print(f'The Root Mean Squared Error of our forecasts for {zipcode} is {round(rmse, 2)}\n\n')

plot_predictions(model, 60647)
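The printed metrics aren’t reproduced here. To put the error in context, one option (a sketch; this step isn’t in the original article) is to recompute the RMSE and express it as a share of the average observed value over the validation window:

# Recompute the one-step-ahead RMSE and compare it to the series level
pred = model.get_prediction(start=pd.to_datetime('2015'), dynamic=False)
real = chi_dict[60647]['value']['2015':]
rmse = np.sqrt(mean_squared_error(real, pred.predicted_mean))
print(f'RMSE as a share of the average value: {rmse / real.mean():.2%}')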

The predictions track the observed values, and the Root Mean Squared Error indicates the model predicts values within roughly $1,765 of the actual value on average. These aren’t the best results, but they’re not terrible either. Here, we’ll plot predictions for three years into the future:

def plot_forecast(model, zipcode):
    # Forecast 36 months (three years) ahead
    prediction = model.get_forecast(steps=36)
    pred_conf = prediction.conf_int()

    ax = chi_dict[zipcode]['value'].plot(label='observed', figsize=(20, 15))
    prediction.predicted_mean.plot(ax=ax, label='Forecast')
    ax.fill_between(pred_conf.index,
                    pred_conf.iloc[:, 0],
                    pred_conf.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('Year')
    ax.set_ylabel('Home Value')
    plt.legend()
    plt.show()

plot_forecast(model, 60647)

As expected, the confidence interval widens the further into the future we forecast. The median home value is projected to rise above $500,000, so now would be a great time to invest in property in Chicago’s 60647 zipcode.

View the GitHub notebook here.
