Prophet vs Linear Regression on Real Estate: The Zillow Case
By Nixtla Team.
TL;DR: Recently there has been controversy in the data science community about the Zillow case. There has been speculation that the Zillow team may have used Prophet to generate forecasts of their time series. Although we do not know whether that is true, we contribute to the discussion by showing that building good benchmarks is fundamental in forecasting tasks. Furthermore, we show that Prophet does not turn out to be a good solution on Zillow Home Value Index data. Better alternatives are simpler and faster models like auto.arima or statsforecast; to improve on them, mlforecast is an excellent option because it makes forecasting with machine learning fast and easy, allowing practitioners to focus on the model and features instead of implementation details.
Introduction
Recently, Zillow announced that it would close its home-buying business because its models were unable to correctly anticipate price changes. Zillow CEO Rich Barton said, “We’ve determined the unpredictability in forecasting home prices far exceeds what we anticipated.” Since this news, several opinions have been published about the technology Zillow allegedly used for forecasting. In particular, critics point to the fact that Zillow listed Prophet in its job offers.
Forecasting time series is a complicated task, and there is no single model that fits all business needs and data characteristics. Best practices always suggest starting with a simple model as a benchmark; such a model allows you, on the one hand, to build models with better performance and, on the other hand, to measure the value added by those models (a more complex model should achieve a lower loss than the benchmark to justify its cost).
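The benchmark-first workflow is simple enough to sketch in a few lines. Below is an illustrative (not from the repo) naive baseline and the comparison a more complex model would have to win:

```python
import numpy as np

def naive_forecast(series, horizon):
    # The naive model repeats the last observed value for each future step.
    return np.full(horizon, series[-1])

def mae(actual, predicted):
    # Mean absolute error between the held-out values and the forecasts.
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

# Toy series standing in for a home-value index (hypothetical numbers).
history = np.array([100.0, 101.0, 102.5, 103.0, 104.2, 105.1])
test = np.array([105.8, 106.3])

baseline_preds = naive_forecast(history, horizon=2)
baseline_mae = mae(test, baseline_preds)
# A more complex model is only worth keeping if its error beats baseline_mae.
```

Any candidate model that cannot beat `baseline_mae` on the held-out data adds complexity without value.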
In this blog post, we set ourselves the goal of empirically determining whether Prophet is a good choice (or at least a good benchmark) for modeling the data used in the context of Zillow. As we will see, auto.arima and even the naive model turn out to be better baseline strategies than Prophet for the particular dataset we use. We find that Prophet does not perform well compared to other models, which is consistent with the evidence found by other practitioners (for example here and here). We also show how mlforecast (with LinearRegression from sklearn as the training model) can beat auto.arima and Prophet in no more than 3 seconds.
Dataset
The dataset we use to evaluate Prophet is the Zillow Home Value Index (ZHVI), which can be downloaded directly from the Zillow research website. According to the page, the ZHVI is “a smoothed, seasonally adjusted measure of typical home value and market changes for a given region and housing type. It reflects the typical value of homes in the 35th to 65th percentile range” and “represents the ‘typical’ home value for a region”.
The dataset reflects price changes, so we decided to experiment with it because a stakeholder could plausibly use it to make decisions. The dataset consists of 909 monthly series for different aggregations of regions and states. We downloaded it on November 4, 2021, and anybody interested can find a copy of it here.
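ZHVI downloads ship in a wide layout (one row per region, one column per month), while most forecasting libraries expect a long format with one (series, timestamp, value) row. A minimal reshape, using a hypothetical two-region miniature of the file rather than the real 909-series CSV, might look like:

```python
import pandas as pd

# Hypothetical miniature of the ZHVI layout: one row per region,
# one column per month-end date (the real file has 909 series).
wide = pd.DataFrame({
    "RegionName": ["California", "Texas"],
    "2021-08-31": [678000.0, 247000.0],
    "2021-09-30": [682000.0, 249000.0],
})

# Melt to long format: one (region, date, value) triple per row.
long = wide.melt(id_vars="RegionName", var_name="ds", value_name="y")
long["ds"] = pd.to_datetime(long["ds"])
```

Column names `ds` and `y` follow the convention several forecasting libraries use, but any names work for the reshape itself.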
Experiments
To test the effectiveness of Prophet in forecasting the ZHVI, we use the last 4 observations of each series as the test set and the remaining observations as the training set. For Prophet, we performed hyperparameter optimization over each time series, using the last 4 observations of the training set as a validation set. In addition to Prophet, we ran auto.arima from R, several models from statsforecast (random walk with drift, naive, simple exponential smoothing, window average, seasonal naive, and historic average), and mlforecast.
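The hold-out scheme above is easy to express directly. The helper below is an illustrative sketch (not the repo's actual code) of splitting off the last 4 observations of a series:

```python
import pandas as pd

def split_last_n(series_df, n=4):
    """Hold out the last n observations of a series as the test set."""
    series_df = series_df.sort_values("ds")
    return series_df.iloc[:-n], series_df.iloc[-n:]

# Toy monthly series standing in for one ZHVI region.
df = pd.DataFrame({
    "ds": pd.date_range("2021-01-31", periods=10, freq="M"),
    "y": range(10),
})
train, test = split_last_n(df, n=4)
```

The same split applied to the tail of `train` yields the validation set used for Prophet's hyperparameter search.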
mlforecast is a framework that helps practitioners forecast time series using machine learning models. You give it a model (in this case, LinearRegression from sklearn), define which features to use, and let mlforecast do the rest.
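The core idea mlforecast automates is turning a series into a supervised-learning problem via lag features. Here is a rough sketch of that idea written against sklearn directly (this is not mlforecast's API, just an illustration of the mechanism):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lag_features(y, n_lags):
    # Build a supervised matrix: each row holds the previous n_lags values,
    # and the target is the value that followed them.
    X, target = [], []
    for i in range(n_lags, len(y)):
        X.append(y[i - n_lags:i])
        target.append(y[i])
    return np.array(X), np.array(target)

# Toy upward-trending series (hypothetical index values).
y = np.array([100, 102, 104, 106, 108, 110, 112, 114], dtype=float)
X, target = make_lag_features(y, n_lags=2)

model = LinearRegression().fit(X, target)
# One-step-ahead forecast from the last two observations.
next_value = model.predict(y[-2:].reshape(1, -1))[0]
```

mlforecast wraps this feature engineering (plus date features, transformations, and recursive multi-step prediction) so you only choose the model and the features.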
Reproducing results
You can reproduce the results using this repo; just follow the steps below. The whole process is automated using Docker, conda, and Make.

1. make init. This instruction will create a Docker container based on environment.yml, which contains the needed R and Python libraries.
2. make run_module module="python -m src.prepare_data". The module splits data into train and test sets. You can find the generated data in data/prepared-data-train.csv and data/prepared-data-test.csv, respectively.
3. make run_module module="python -m src.forecast_prophet". Fits the Prophet model (forecasts in data/prophet-forecasts.csv).
4. make run_module module="python -m src.forecast_statsforecast". Fits the statsforecast models (forecasts in data/statsforecast-forecasts.csv).
5. make run_module module="Rscript src/forecast_arima.R". Fits the auto.arima model (forecasts in data/arima-forecasts.csv).
6. make run_module module="python -m src.forecast_mlforecast". Fits the mlforecast model using LinearRegression (forecasts in data/mlforecast-forecasts.csv).
Results
Performance
The following table summarizes the results in terms of performance.
As we can see, the best model is mlforecast.linear_regression for the mape, rmse, smape, and mae metrics. Surprisingly, a very simple model such as naive (which takes the last value as the forecast) turns out to be better in this experiment than Prophet.
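For reference, the four metrics in the table can be computed as follows (smape has several variants in the literature; this sketch uses one common definition, which may differ from the exact formula used in the repo):

```python
import numpy as np

def mae(y, yhat):
    # Mean absolute error.
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    # Root mean squared error.
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    # Mean absolute percentage error (in percent).
    return np.mean(np.abs((y - yhat) / y)) * 100

def smape(y, yhat):
    # Symmetric MAPE, one common definition (in percent).
    return np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat))) * 100

y = np.array([100.0, 200.0])
yhat = np.array([110.0, 190.0])
# mae = 10.0, rmse = 10.0, mape = 7.5
```

mape weights errors relative to the actual value, which matters here because home values differ by an order of magnitude across regions.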
Computational cost
The following table summarizes the results in terms of computational cost.
To run our experiments, we used a c5d.24xlarge AWS instance (96 vCPUs, 192 GB RAM), which costs 4.608 USD per hour. As we can see, mlforecast takes no more than 3 seconds and beats Prophet and auto.arima in performance.
Conclusion
This post showed, in the context of the Zillow controversy, that building benchmarks is fundamental to addressing any time series forecasting problem. Those benchmarks must be computationally efficient so that you can iterate fast and build more complex models on top of them. The libraries statsforecast and mlforecast are excellent tools for the task. We also showed better options than Prophet for running benchmarks, which is consistent with previous findings by the data science community.
Build benchmarks. Always.