Automated Additional Regressor Selection for Forecasting with FBProphet

Aldo Pradana
Traveloka Engineering Blog
Jun 23, 2022

Editor’s Note: When you create a machine learning model to predict an outcome, feature selection is a vital step that can make or break your model, but it often requires fine-tuning to get right. Aldo shares a procedure that automates this process for a multivariate time series forecasting use case.

Aldo Pradana is a Data Analyst with the Flight Transport team, whose day-to-day responsibilities include, but are not limited to, supporting the flight business and the product team in feature development as well as in insight analysis and monitoring.

Forecasting is a statistical problem that commonly arises across business domains. It uses time-based information, such as historical data or knowledge of upcoming events, to predict an uncertain future as accurately as possible. These predictions then serve as an additional input to improve decision-making.

Univariate vs Multivariate Time Series
Generally, time series forecasting falls into two approaches, depending on how many variables the model considers: univariate and multivariate time series.

A univariate time series, as the name suggests, is a data series with one time-dependent variable. For example, the data below shows the number of daily bike rentals.

Figure 1. An example of a univariate time series dataset

A univariate forecasting approach with this dataset would forecast that single time-dependent variable using only the information contained in its own past values, extracting a pattern from them.

On the other hand, a multivariate time series could be used when a time series dataset has two or more time-dependent variables. Consider the example below:

Figure 2. An example of a multivariate time series dataset

This is the same dataset as in Figure 1, but with the additional variables season, workingday, holiday, temp(erature), hum(idity), and windspeed. In this case, a multivariate time series approach looks not only at how each variable depends on its own past values (as in a univariate time series), but also at the dependencies between those variables to forecast future values.

In this article, I would like to share my experience in approaching a time series forecasting problem with FBProphet, an easy-to-use model that is capable of taking additional variables (regressors) as inputs. The deeper aspects of how FBProphet works will not be covered in this article as I will instead focus on how to automate regressor selection in a multivariate time series with FBProphet.

As shown in the dataset snapshot in Figure 2 above, the dataset contains additional variables (regressors). When fed into a multivariate time series model, they can either improve or harm its prediction performance. With that in mind, it comes down to one question: from the combinations of variables in the dataset, which additional regressor (or set of additional regressors) predicts the future values with the least error?

The approach that we took to answer that question was inspired by a commonly used feature selection method in linear regression modeling: Stepwise Regression. It is a method of fitting regression models where the choice of predictive variables is carried out by an automatic procedure.

Data Preparation
Let us start by preparing the data source. In this case, we use the Bike Sharing in Washington D.C. dataset from Kaggle, which contains daily bike rental counts in Washington D.C. between 2011 and 2012 along with the corresponding daily weather and seasonal information. Our use case is to create a model with the best regressors to predict usage patterns for the next two weeks (1–14 January 2013).

Data Exploration
The columns in the dataset contain the following information:

Figure 3. DataFrame information on the dataset
  • season: The season of the date (1:spring, 2:summer, 3:fall, 4:winter)
  • mnth: month (1 to 12)
  • holiday: if day is holiday then 1, otherwise 0
  • weekday: The day of the week
  • workingday: If the day is not weekend or holiday then 1, otherwise 0
  • weathersit:
    1: clear, few clouds, partly cloudy
    2: mist + cloudy, mist + broken clouds, mist + few clouds, mist
    3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
    4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
  • temp: The normalized temperature in Celsius. The values are divided by 41 (max)
  • atemp: The normalized feeling temperature in Celsius. The values are divided by 50 (max)
  • hum: The normalized humidity. The values are divided by 100 (max)
  • windspeed: The normalized wind speed. The values are divided by 67 (max)
  • cnt: The count of total rental bikes

Setup List of Regressor Variables Combinations
In this case, to keep the computational load low and the illustration simple, we narrow down the 11 available variables and try to identify the best combination from the following five: holiday, weathersit, atemp, hum, and windspeed. Intuitively, we feel that these five variables provide sufficient coverage of the information in the dataset.

Figure 4. Creating a list object with all the possible regressor combinations

From those five variables, there are 32 possible combinations (2⁵, counting the option of using no additional regressor at all):

Figure 5. List of all the possible regressor combinations
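
As a rough illustration of how such a list could be built, here is a minimal sketch using Python's standard `itertools.combinations`. The variable names `candidates` and `regressor_configs` are my own placeholders and may differ from the code shown in Figure 4.

```python
from itertools import combinations

# The five candidate regressors named in the article.
candidates = ["holiday", "weathersit", "atemp", "hum", "windspeed"]

# Every subset of the candidates, from the empty set (no additional
# regressor, i.e. the univariate baseline) up to all five at once.
regressor_configs = [
    list(subset)
    for size in range(len(candidates) + 1)
    for subset in combinations(candidates, size)
]

print(len(regressor_configs))  # 32, i.e. 2 ** 5
```

Keeping the empty combination in the list is useful because it gives a univariate baseline to compare every other configuration against.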

Training the Model
First, we need to set up the time frames for the train-test cross-validation. In this case, we use three iterations of two-week rolling cross-validation for model training, as illustrated below:

Figure 6. Dataset train-test split
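
The rolling windows can be generated programmatically. The sketch below assumes that the three test windows are the last three two-week blocks of the data (ending on 31 December 2012); the exact dates in Figure 6 may differ, and `rolling_windows` is a hypothetical helper name.

```python
from datetime import date, timedelta


def rolling_windows(last_day, n_splits=3, horizon_days=14):
    """Return (train_end, test_start, test_end) tuples, oldest window first."""
    windows = []
    for i in range(n_splits, 0, -1):
        # Step back (i - 1) horizons from the last available day.
        test_end = last_day - timedelta(days=(i - 1) * horizon_days)
        test_start = test_end - timedelta(days=horizon_days - 1)
        # Training data is everything strictly before the test window.
        train_end = test_start - timedelta(days=1)
        windows.append((train_end, test_start, test_end))
    return windows


for train_end, test_start, test_end in rolling_windows(date(2012, 12, 31)):
    print(f"train up to {train_end}, test {test_start} to {test_end}")
```

Each iteration trains on all data before its test window, so the training set grows by two weeks from one iteration to the next.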

In each iteration, we run the FBProphet model on each regressor combination to create predictions for the next two weeks. Below is the code to run this loop. Note the `add_regressor()` method (underlined in red), where we add each additional regressor to the model.

Figure 7. Looping the model on each regressor combination
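
For readers who cannot see the figure, here is a minimal sketch of what such a loop could look like. It is not the exact code from Figure 7: the helper names `forecast_config` and `run_all`, the `model_factory` parameter, and the `no_regressor` label are assumptions of mine. It relies only on Prophet's `add_regressor()`, `fit()`, and `predict()` methods and on Prophet's convention of a `ds` date column and a `y` target column.

```python
import pandas as pd


def forecast_config(train, test, regressors, model_factory=None):
    """Fit one model on `train` with the given regressors and predict `test`."""
    if model_factory is None:
        # Imported lazily; older releases use `from fbprophet import Prophet`.
        from prophet import Prophet
        model_factory = Prophet

    model = model_factory()
    for name in regressors:
        model.add_regressor(name)  # must be called before fit()
    model.fit(train[["ds", "y"] + list(regressors)])

    # The regressor values must also be known for the forecast dates.
    forecast = model.predict(test[["ds"] + list(regressors)])
    return forecast[["ds", "yhat"]].copy()


def run_all(df, regressor_configs, windows, model_factory=None):
    """Loop over every cross-validation window and regressor combination."""
    results = []
    for cv_period, (train_end, test_start, test_end) in enumerate(windows, start=1):
        train = df[df["ds"] <= train_end]
        test = df[(df["ds"] >= test_start) & (df["ds"] <= test_end)]
        for regressors in regressor_configs:
            pred = forecast_config(train, test, regressors, model_factory)
            pred["config_name"] = ", ".join(regressors) or "no_regressor"
            pred["cv_period"] = cv_period
            results.append(pred)
    return pd.concat(results, ignore_index=True)
```

This sketch assumes that `df["ds"]` is a datetime column and that the window boundaries are `pandas.Timestamp` values, so that the date comparisons are valid. With 32 combinations and three windows, it fits 96 models in total.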

Below is a snapshot of the output from the above code. It shows the predicted values for each future date, for each combination of additional regressors (config_name), and for each iteration (cv_period).

Figure 8a. Predicted number of bike rentals from each configuration and cross-validation iteration
Figure 8b. Predicted number of bike rentals from each combination and cross-validation iteration

Testing Prediction with Evaluation Metrics
Then, we compare each two-week forecast against the actual data by joining the predicted values with the actual values on date to get the error of each prediction. We can then compute per-observation metrics such as the absolute error, |yhat - actual_y|, which leads to the MAE (mean absolute error), or the squared error, (yhat - actual_y)², which leads to the RMSE (root mean squared error).

Figure 9. Error of predicted values against actual values
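
The join and the two per-observation errors described above could be sketched as follows with pandas. The function name `add_errors` and the output column names `abs_error` and `sq_error` are placeholders of mine; `yhat` and `actual_y` follow the naming used in the text.

```python
import pandas as pd


def add_errors(predictions, actuals):
    """Join predictions with actuals on date and add per-row error columns.

    `predictions` needs the columns ds and yhat (plus any config columns);
    `actuals` needs the columns ds and actual_y.
    """
    joined = predictions.merge(actuals[["ds", "actual_y"]], on="ds", how="inner")
    diff = joined["yhat"] - joined["actual_y"]
    joined["abs_error"] = diff.abs()   # |yhat - actual_y|
    joined["sq_error"] = diff ** 2     # (yhat - actual_y)^2
    return joined


# Tiny made-up example, only to show the shape of the result.
example_pred = pd.DataFrame({
    "ds": pd.to_datetime(["2012-12-18", "2012-12-19"]),
    "yhat": [10.0, 20.0],
})
example_actual = pd.DataFrame({
    "ds": pd.to_datetime(["2012-12-18", "2012-12-19"]),
    "actual_y": [12.0, 17.0],
})
print(add_errors(example_pred, example_actual))
```

An inner join keeps only the dates that have both a prediction and an actual value, so dates without a matching actual do not affect the metrics.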

Getting Config Performance
Lastly, after getting the error of each test observation, we aggregate these errors into the MAE and the RMSE for each config and compare the configs against each other.

Figure 10. Aggregating error metric for each regressor combination (config_name)
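
A possible pandas version of this aggregation is sketched below. It assumes a DataFrame with one row per test observation and the columns `config_name`, `abs_error`, and `sq_error`; the latter two names are my own and may differ from those in Figure 10.

```python
import pandas as pd


def config_performance(errors):
    """Aggregate per-observation errors into MAE and RMSE per config."""
    perf = (
        errors.groupby("config_name")
        .agg(mae=("abs_error", "mean"), mse=("sq_error", "mean"))
        .reset_index()
    )
    # RMSE is the square root of the mean squared error.
    perf["rmse"] = perf["mse"] ** 0.5
    return perf.sort_values("rmse").reset_index(drop=True)


# Tiny made-up example with two configs and two observations each.
example_errors = pd.DataFrame({
    "config_name": ["a", "a", "b", "b"],
    "abs_error": [1.0, 3.0, 1.0, 1.0],
    "sq_error": [1.0, 9.0, 1.0, 1.0],
})
print(config_performance(example_errors))
```

Here the errors are pooled across all cross-validation iterations before averaging, which is one reasonable choice; averaging the per-iteration RMSE values instead would give slightly different numbers.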

And here’s the output (the top 10 regressor combinations, sorted by lowest RMSE):

Figure 11. List of regressor combinations sorted by lowest error metric

Finally, using RMSE as our error metric, we find that the combination of holiday, weathersit, atemp, and windspeed has the lowest error among all regressor combinations. We can then use this combination to predict the next two weeks in 2013.

Conclusion
In this article, we have explored the steps to automate the selection of additional regressors from multivariate time series data. This procedure allows us to identify the subset of regressors that produces the least test error compared to other combinations. With the right regressors, a multivariate model could potentially improve the predictive power of a time series forecast compared to a univariate model.

Here at Traveloka, time series modeling is one of the many facets of data analytics that we use to answer questions and deliver value. If you want to learn more, check out Traveloka’s Career Page and join us on our journey.
