Prophet vs Linear Regression on Real Estate: The Zillow Case
By Nixtla Team.
TL;DR: Recently there has been controversy in the data science community about the Zillow case. There has been speculation that the Zillow team may have used Prophet to generate forecasts of their time series. Although we do not know whether that is true, we contribute to the discussion by showing that building good benchmarks is fundamental in forecasting tasks. Furthermore, we show that Prophet does not turn out to be a good solution on Zillow Home Value Index data. Better alternatives are simpler and faster models like auto.arima or statsforecast; to improve on them, mlforecast is an excellent option because it makes forecasting with machine learning fast and easy, allowing practitioners to focus on the model and features instead of implementation details.
Introduction
Recently, Zillow announced that it would close its home-buying business because its models were unable to correctly anticipate price changes. Zillow CEO Rich Barton said, “We’ve determined the unpredictability in forecasting home prices far exceeds what we anticipated.” Since this news, several opinions have been published about the technology Zillow allegedly used for forecasting. In particular, critics point to the fact that Zillow listed Prophet in its job offers.
Forecasting time series is a complicated task, and there is no single model that fits all business needs and data characteristics. Best practices always suggest starting with a simple model as a benchmark; such a model allows you, on the one hand, to build models with better performance and, on the other hand, to measure the value added by those models (a more complex model should achieve a lower loss than the benchmark to justify its cost).
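The benchmark-first workflow is simple enough to sketch in a few lines. Below is an illustrative (not from the repo) naive baseline and the comparison a more complex model would have to win:

```python
import numpy as np

def naive_forecast(series, horizon):
    # The naive model repeats the last observed value for each future step.
    return np.full(horizon, series[-1])

def mae(actual, predicted):
    # Mean absolute error between the held-out values and the forecasts.
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

# Toy series standing in for a home-value index (hypothetical numbers).
history = np.array([100.0, 101.0, 102.5, 103.0, 104.2, 105.1])
test = np.array([105.8, 106.3])

baseline_preds = naive_forecast(history, horizon=2)
baseline_mae = mae(test, baseline_preds)
# A more complex model is only worth keeping if its error beats baseline_mae.
```

Any candidate model that cannot beat `baseline_mae` on the held-out data adds complexity without value.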
In this blog post, we set ourselves the goal of empirically determining whether Prophet is a good choice (or at least a good benchmark) for modeling the data used in the context of Zillow. As we will see, auto.arima and even the naive model turn out to be better baseline strategies than Prophet for the particular dataset we use. We find that Prophet does not perform well compared to other models, which is consistent with the evidence found by other practitioners (for example here and here). We also show how mlforecast (with LinearRegression from sklearn as the training model) can beat auto.arima and Prophet in no more than 3 seconds.
Dataset
The dataset we use to evaluate Prophet is the Zillow Home Value Index (ZHVI), which can be downloaded directly from the Zillow research website. According to the page, the ZHVI is “a smoothed, seasonally adjusted measure of typical home value and market changes for a given region and housing type. It reflects the typical value of homes in the 35th to 65th percentile range” and “represents the ‘typical’ home value for a region”.
The dataset reflects price changes, so we decided to experiment with it because a stakeholder could plausibly use it to make decisions. The dataset consists of 909 monthly series for different aggregations of regions and states. We downloaded it on November 4, 2021, and anybody interested can find a copy of it here.
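ZHVI downloads ship in a wide layout (one row per region, one column per month), while most forecasting libraries expect a long format with one (series, timestamp, value) row. A minimal reshape, using a hypothetical two-region miniature of the file rather than the real 909-series CSV, might look like:

```python
import pandas as pd

# Hypothetical miniature of the ZHVI layout: one row per region,
# one column per month-end date (the real file has 909 series).
wide = pd.DataFrame({
    "RegionName": ["California", "Texas"],
    "2021-08-31": [678000.0, 247000.0],
    "2021-09-30": [682000.0, 249000.0],
})

# Melt to long format: one (region, date, value) triple per row.
long = wide.melt(id_vars="RegionName", var_name="ds", value_name="y")
long["ds"] = pd.to_datetime(long["ds"])
```

Column names `ds` and `y` follow the convention several forecasting libraries use, but any names work for the reshape itself.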
Experiments
To test the effectiveness of Prophet in forecasting the ZHVI, we use the last 4 observations of each series as the test set and the remaining observations as the training set. For Prophet, we performed hyperparameter optimization over each time series, using the last 4 observations of the training set as a validation set. In addition to Prophet, we ran auto.arima from R, several models from statsforecast (random walk with drift, naive, simple exponential smoothing, window average, seasonal naive, and historic average), and mlforecast.
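The hold-out scheme above is easy to express directly. The helper below is an illustrative sketch (not the repo's actual code) of splitting off the last 4 observations of a series:

```python
import pandas as pd

def split_last_n(series_df, n=4):
    """Hold out the last n observations of a series as the test set."""
    series_df = series_df.sort_values("ds")
    return series_df.iloc[:-n], series_df.iloc[-n:]

# Toy monthly series standing in for one ZHVI region.
df = pd.DataFrame({
    "ds": pd.date_range("2021-01-31", periods=10, freq="M"),
    "y": range(10),
})
train, test = split_last_n(df, n=4)
```

The same split applied to the tail of `train` yields the validation set used for Prophet's hyperparameter search.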
mlforecast is a framework that helps practitioners forecast time series using machine learning models. You give it a model (in this case, LinearRegression from sklearn), define which features to use, and let mlforecast do the rest.
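The core idea mlforecast automates is turning a series into a supervised-learning problem via lag features. Here is a rough sketch of that idea written against sklearn directly (this is not mlforecast's API, just an illustration of the mechanism):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lag_features(y, n_lags):
    # Build a supervised matrix: each row holds the previous n_lags values,
    # and the target is the value that followed them.
    X, target = [], []
    for i in range(n_lags, len(y)):
        X.append(y[i - n_lags:i])
        target.append(y[i])
    return np.array(X), np.array(target)

# Toy upward-trending series (hypothetical index values).
y = np.array([100, 102, 104, 106, 108, 110, 112, 114], dtype=float)
X, target = make_lag_features(y, n_lags=2)

model = LinearRegression().fit(X, target)
# One-step-ahead forecast from the last two observations.
next_value = model.predict(y[-2:].reshape(1, -1))[0]
```

mlforecast wraps this feature engineering (plus date features, transformations, and recursive multi-step prediction) so you only choose the model and the features.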
Reproducing results
You can reproduce the results using this repo; just follow the steps below. The whole process is automated using Docker, conda, and Make.

1. make init. This instruction will create a Docker container based on environment.yml, which contains the needed R and Python libraries.
2. make run_module module="python -m src.prepare_data". The module splits data into train and test sets. You can find the generated data in data/prepared-data-train.csv and data/prepared-data-test.csv, respectively.
3. make run_module module="python -m src.forecast_prophet". Fits the Prophet model (forecasts in data/prophet-forecasts.csv).
4. make run_module module="python -m src.forecast_statsforecast". Fits the statsforecast models (forecasts in data/statsforecast-forecasts.csv).
5. make run_module module="Rscript src/forecast_arima.R". Fits the auto.arima model (forecasts in data/arima-forecasts.csv).
6. make run_module module="python -m src.forecast_mlforecast". Fits the mlforecast model using LinearRegression (forecasts in data/mlforecast-forecasts.csv).
Results
Performance
The following table summarizes the results in terms of performance.
As we can see, the best model is mlforecast.linear_regression for the mape, rmse, smape, and mae metrics. Surprisingly, a very simple model such as naive (which takes the last value as the forecast) turns out to be better in this experiment than Prophet.
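For reference, the four metrics in the table can be computed as follows (smape has several variants in the literature; this sketch uses one common definition, which may differ from the exact formula used in the repo):

```python
import numpy as np

def mae(y, yhat):
    # Mean absolute error.
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    # Root mean squared error.
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    # Mean absolute percentage error (in percent).
    return np.mean(np.abs((y - yhat) / y)) * 100

def smape(y, yhat):
    # Symmetric MAPE, one common definition (in percent).
    return np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat))) * 100

y = np.array([100.0, 200.0])
yhat = np.array([110.0, 190.0])
# mae = 10.0, rmse = 10.0, mape = 7.5
```

mape weights errors relative to the actual value, which matters here because home values differ by an order of magnitude across regions.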
Computational cost
The following table summarizes the results in terms of computational cost.
To run our experiments, we used a c5d.24xlarge AWS instance (96 vCPUs, 192 GB RAM), which costs 4.608 USD per hour. As we can see, mlforecast takes no more than 3 seconds and beats Prophet and auto.arima in performance.
Conclusion
This post showed, in the context of the Zillow controversy, that building benchmarks is fundamental to addressing any time series forecasting problem. Those benchmarks must be computationally efficient so that you can iterate fast and build more complex models on top of them. The libraries statsforecast and mlforecast are excellent tools for the task. We also showed better options than Prophet for running benchmarks, which is consistent with previous findings by the data science community.
Build benchmarks. Always.