Real Estate Investment — New York or San Francisco?

A time-series ROI-based analysis from Zillow Home Value Index, using auto_arima for Python and SARIMAX modeling.

Oct 31, 2019

Before anything else, note that this article is an exercise in time series modeling, so please don’t take any of its conclusions as actual real estate investment recommendations.

That said, let’s set the scene. Suppose our stakeholders are investors deciding where to put their real estate money: East Coast or West Coast? More precisely, San Francisco, CA or New York City, NY?

They are interested in learning which city shows the best predicted returns for both a 2-year investment and a 5-year investment.

We will use data from the Zillow Research Page to recommend the soundest financial decision, based on our analysis and modeling of the data.

Our first steps were to download the data, filter it down to our cities of interest, reshape it from wide to long format, and finally rename the column ‘RegionName’ to ‘Zipcode’. To keep the article short, we won’t go over these procedures here; we assume the reader is already familiar with these initial steps of downloading and cleaning data.

Let’s start!

First Step — Data Visualization

The first thing we did once we had our data of interest was to collapse all the zip codes for each city into a single series by averaging them. We created a function for this and passed it the data for both San Francisco and NYC.
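The function itself is not reproduced here. As a minimal sketch of what such a helper might look like, assume a long-format data frame with one row per zip code and month. The column names `Date` and `Value`, the function name `mean_value_by_date`, and the toy numbers are all assumptions made for illustration, not the project's actual code or Zillow data.

```python
import pandas as pd


def mean_value_by_date(df, date_col="Date", value_col="Value"):
    """Average a long-format frame across zip codes, giving one value per date."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    # Group all zip codes that share a date and take their mean value.
    series = out.groupby(date_col)[value_col].mean().sort_index()
    series.name = "mean_zhvi"
    return series


# Made-up numbers for two hypothetical zip codes over two months.
toy = pd.DataFrame({
    "Zipcode": ["94110", "94110", "94117", "94117"],
    "Date": ["2019-01", "2019-02", "2019-01", "2019-02"],
    "Value": [100.0, 110.0, 200.0, 210.0],
})
sf_mean = mean_value_by_date(toy)
print(sf_mean)
```

Calling the same helper on the New York City frame would give the second series used in the comparison.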

We now have a time series for each city with the mean Zillow Home Value Index for all zip codes present in our data frame for each city.

Our first step was to have a look at both cities’ overall Zillow Home Value Index (ZHVI) throughout the years with a simple line plot of both time series.

What we can notice from this plot is that San Francisco’s mean home index is higher than that of New York City. We can also see that New York apparently suffered less from the housing bubble crisis, with mean prices flattening rather than falling during its aftermath.

We can also see that San Francisco house prices have shown a steep positive trend more recently. With the growth of tech companies and the increase in the number of their well-paid employees, it is no surprise that the San Francisco real estate market has seen prices rise due to higher demand.

However, what we are really interested in learning is the return on investment (ROI) for both of these cities for 2-year investments and 5-year investments.

To assess this, we created a function to calculate these returns throughout our time series.
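The original function is not shown here, so the following is only a sketch of one common way to compute such returns. It assumes monthly observations and defines ROI at each date as the percent change relative to the value `years` years earlier. The name `rolling_roi` and the toy series are hypothetical.

```python
import pandas as pd


def rolling_roi(series, years):
    """Percent return from buying `years` years earlier and selling at each date.

    Assumes one observation per month, so the holding period is 12 * years rows.
    """
    periods = 12 * years
    purchase_price = series.shift(periods)
    roi = (series / purchase_price - 1) * 100
    # The first `periods` dates have no purchase price to compare against.
    return roi.dropna()


# Toy series: flat at 100 for two years, then 120 in the 25th month.
dates = pd.date_range("2015-01-01", periods=25, freq="MS")
toy = pd.Series([100.0] * 24 + [120.0], index=dates)
toy_roi2 = rolling_roi(toy, years=2)
print(toy_roi2)
```

With this definition, the 2-year and 5-year series are simply `rolling_roi(series, 2)` and `rolling_roi(series, 5)`.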

Then, we applied the function to our time series and plotted our results side-by-side so that we could compare their returns.

Line plot of ROI for 2-year investment for San Francisco and New York City

We can see that investing in either city has rarely meant a loss within 2 years, except during the housing bubble crisis years. Let’s also look at the 5-year returns.

Over a 5-year investment, San Francisco seems to present better returns more recently, although there might be a declining trend. What we will do is fit a model and get predictions of these returns for 2-year and 5-year investments, and then compare both cities’ results in order to pick the best investment choice.

Now, We Look Into the Future!

Before modeling our data, we want to check whether it is stationary, since many time series models (ARMA-type models in particular) assume stationarity. Simply inspecting the data tells us it is not stationary (there is a positive trend), but we can go further and evaluate it for trends and seasonality. We will create a function that calculates and plots rolling statistics (mean and standard deviation) and performs a Dickey-Fuller test for stationarity.

Plot of rolling mean and rolling standard deviation for San Francisco 2-year ROI

We are showing the results for San Francisco’s 2-year returns, but we applied the function to all four of our returns time series, and none of them is stationary.

Given this, we will use a SARIMAX model. Its differencing orders (d and the seasonal D) handle trend and seasonality inside the model, so we can model our data without differencing it to make it stationary beforehand.

We will use the function auto_arima, which performs a search over possible orders and helps us select the parameters that minimize a given metric. In this case, we are using the AIC (Akaike Information Criterion) value to measure the quality of our model.

(You can learn more about the auto_arima here.)

Our function returned the orders ((2, 1, 3), (0, 0, 2, 12)). We created a function to fit a SARIMAX model.

We then call the function with the orders provided by auto_arima, starting with our San Francisco 2-year returns: sf2_output = fit_sarimax_model(sf_roi2, order=(2, 1, 3), seasonal_order=(0, 0, 2, 12))

Here are our results:

We also created a function to get predictions from our models and plot the results.

We applied the function to San Francisco’s 2-year returns model: sf2_predictions = get_predictions(sf_roi2, sf2_output) and these are our results:

Forecast for San Francisco 2-year returns

The RMSE of our forecast is relatively small, as are the AIC and BIC of our model, and our prediction range is not too wide: the model predicts a 9.4% return, with a 95% confidence interval ranging from -18% to 37%.

What we want to do now is to streamline this process and apply it to all our time series so that we can compare which one is the best. Let’s create functions to do all this at once.

Once we apply the function to our 2-year return time series for both cities, these are our results:

This is a very interesting result! What our model results show us is that if our stakeholders are looking to buy and sell within 2 years, New York City shows the best predicted gain. Although NYC’s maximum potential gain is similar to San Francisco’s, the latter carries a larger potential loss. We can conclude that New York City offers more gain potential and presents less risk over a 2-year investment.

We created another function to get 5-year returns as well:

And here are our results:

When the investment horizon extends to 5 years, the predictions for San Francisco do not look good either. There is a loss predicted for San Francisco, and it could potentially be as large as almost 60%. New York City, although it also shows the possibility of a relatively high loss at the lowest end of our confidence interval, nonetheless looks like a more consistent investment option, with a better predicted return and higher potential gains over a 5-year investment.

In conclusion, what our model and predictions show us is that New York City is a better investment option than San Francisco when we compare predicted returns for both 2-year and 5-year investments. Go Big Apple!

If you’d like to see the entirety of this project, you can go to the GitHub repo.

Written by Giovanna Fernandes