How to Forecast Time Series with Tree-Based Methods Beyond the Average

Henkel Data & Analytics
Henkel Data & Analytics Blog
6 min read · Oct 31, 2023


By Andrii Kleshchonok.

Why predict a distribution?

Chronologically ordered data underpins a wide range of industrial applications, from sales and marketing figures to sensor readings and robotics. A natural modeling choice for such data is a technique that captures its time-series nature, whether a classical method like ARIMA, feature engineering combined with standard regression, or deep learning. In production applications of machine learning models, however, it is often not enough to reach a certain level of accuracy; one also needs to analyze the errors and uncertainty of the models. Typical use cases where uncertainty plays a crucial role are business decision-making, anomaly and out-of-distribution detection, hints towards under- and over-representation in the data, and the application of existing models to new scenarios. The natural choices for uncertainty estimation are bootstrap techniques or, alternatively, a Bayesian approach. The latter, besides being a probabilistic modeling framework by design, can account for prior knowledge, work with limited amounts of data, and provide regularization. Bootstrap analysis, in contrast to explicit distribution modelling, is a computationally intensive technique that depends on the estimator applied and is limited by the sampling size and representativeness. The approach described below aims to combine statistical tree-based modelling, which focuses on prediction accuracy, with the stochastic nature of the data and an estimate of its underlying distribution.

Decision tree-based distribution modelling approach

In this article, we focus on distribution modeling of time series data with boosting methods as part of the Generalized Additive Model for Location, Scale, and Shape (GAMLSS) framework (see more details here). The GAMLSS approach extends conventional regression of the expected mean to parametrized modelling of the whole underlying distribution of the dependent variable, including its location, scale, shape, etc. For a Normal distribution these parameters would be the mean and variance; for a Poisson distribution, the rate parameter; for a Dirichlet distribution, the concentration parameters. Implementations of GAMLSS exist in R (see the publication by Stasinopoulos, D. M., & Rigby, R. A.) and, more recently, implementations built around XGBoost (XGBoostLSS) and LightGBM (LightGBMLSS) have become available in Python. For details, we refer the reader to the papers by Alexander März and Thomas Kneib here, here, and here.
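Schematically, where conventional regression models only the conditional mean E[y | x], GAMLSS assumes a parametric distribution D and models each of its parameters as a function of the features:

y | x ~ D(θ₁(x), …, θₖ(x))

For the Normal case this means fitting both the mean μ(x) and the scale σ(x), each with its own (here, boosted-tree) model.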

Noisy sine wave modelling

Before we dive into model training for a specific data set, let’s remind ourselves of the steps and limitations of tree-based methods applied to time series. First, they require manual feature engineering to account for the sequential nature of the signal. Second, tree-based methods are not suited for extrapolation to out-of-distribution data. To illustrate this, we first look at a synthetic example: a sine wave with noise (see Fig. 1 below).

Fig. 1. Illustrative example of a noisy sine function whose noise level increases on every 3rd period.
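For concreteness, such a series could be generated along the following lines (a sketch; the exact noise levels and number of periods are illustrative assumptions, not the values behind the figures):

import numpy as np

rng = np.random.default_rng(42)

n_periods = 9
points_per_period = 100
t = np.linspace(0, n_periods * 2 * np.pi, n_periods * points_per_period)

# Index of the period each point belongs to; noise is larger on every 3rd period.
period_idx = (t // (2 * np.pi)).astype(int)
noise_scale = np.where(period_idx % 3 == 2, 0.3, 0.1)

y = np.sin(t) + rng.normal(0.0, noise_scale)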

Since we would like to go beyond mean estimates and instead learn the underlying distribution, the data is generated with a noise level that increases on every 3rd period. Let’s take a look at two different cases: with and without feature engineering. Even for a simple sine function, tree-based methods fail to extrapolate to new values of the independent variables (see Fig. 2). Here, time as the independent variable is out-of-distribution with respect to the training set, and therefore the tree-based methods predict a constant.

Fig. 2. Fit of a tree-based model to the noisy sine function without explicit feature engineering. The model fails to extrapolate outside of the training set and predicts just a constant value.

In order to correct this behavior and obtain reliable predictions for future time intervals, we need to explicitly engineer time-based features. Typically these would be features like time of day, time lags, statistics over previous time intervals, derivatives or integrals of the sensor data, etc. For our illustrative example, it is enough to encode the phase of the sine wave and the period index modulo 3 as features (see the sketch below). These two variables suffice to predict the amplitude of the sine wave and to learn the noise level as a function of the period. We assume that our measured variable follows a Gaussian distribution, so after fitting the tree-based GAMLSS model we can estimate both the mean and the variance as functions of time. The results of this fit are shown in Fig. 3: the model correctly captures the desired behavior of the mean amplitude as well as the increase of the Gaussian scale parameter with the period. The same approach of fitting a parametrized distribution of the dependent variable, rather than producing only point estimates, can then be applied to more realistic data.
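Continuing the generation sketch above, the two features could be computed as follows (the names are our own):

# Phase within one sine period, plus the period index modulo 3.
phase = t % (2 * np.pi)
period_mod3 = period_idx % 3

X = np.column_stack([phase, period_mod3])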

Fig. 3. Modelling of the noisy sine function with a GAMLSS tree-based model with feature engineering. Top: distribution modelling of the sine time series, where the shaded region shows the fitted Gaussian scale. Bottom: scale of the fitted Gaussian distribution plotted as a function of time. The model extends to the test set and captures the larger noise level on every 3rd period.

Time series energy consumption predictions

Finally, let’s look at the Hourly Energy Consumption dataset, which captures the hourly energy consumption of the Eastern US Interconnection grid in megawatts. The problem can be formulated as the estimation of future energy consumption, which is time- and season-dependent. This well-studied problem can also be addressed with the XGBoostLSS approach. One only needs to specify a few extra parameters of the model on top of the standard XGBoost setup:

from xgboostlss.model import xgboostlss
from xgboostlss.distributions.Gaussian import Gaussian

distribution = Gaussian        # assumed distribution of the target
distribution.stabilize = "L2"  # stabilization of the derivative calculation

Here, we specified a Gaussian distribution and an L2 stabilization method for the derivative calculation. Afterwards, we can use the wrapper provided by XGBoostLSS for training with XGBoost parameters and the training data given as a DMatrix:

xgboostlss_model = xgboostlss.train(param,               # XGBoost hyperparameters
                                    dtrain,              # training data as DMatrix
                                    dist=distribution,
                                    num_boost_round=1000)
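For completeness, param here is the usual XGBoost hyperparameter dictionary and dtrain the training DMatrix. A minimal sketch (the hyperparameter values, and the X_train/y_train names, are illustrative):

import xgboost as xgb

param = {
    "eta": 0.1,      # learning rate
    "max_depth": 5,  # tree depth
}

# X_train / y_train: engineered features and (scaled) target.
dtrain = xgb.DMatrix(X_train, label=y_train)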

For inference, one specifies the model, the test DMatrix, the distribution, the number of samples drawn from this distribution, and a random seed:

pred_y = xgboostlss.predict(xgboostlss_model,
                            dtest,
                            dist=distribution,
                            pred_type="response",  # draw samples from the fitted distribution
                            n_samples=n_samples,
                            seed=123)

As a result, on top of point estimates of the hourly energy consumption, we can obtain the location and scale parameters of the fitted Normal distribution:

pred_params = xgboostlss.predict(xgboostlss_model,
                                 dtest,
                                 dist=distribution,
                                 pred_type="parameters")
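Assuming the returned object contains one column per distribution parameter (the column names "location" and "scale" below are assumptions that may vary by package version), a symmetric 95% interval around the mean could then be formed as:

# Column names are an assumption; inspect pred_params.columns for your version.
mean = pred_params["location"]
scale = pred_params["scale"]

lower = mean - 1.96 * scale  # ~95% interval under the Normal assumption
upper = mean + 1.96 * scale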

In the example below, we split the train and test sets in the same fashion as shown above for the sine wave, i.e., with the test set lying in the future with respect to the train set. In addition, we apply standard scaling to the target to set the scale of the problem, and we explicitly generate seasonal features (hour, week, day of year, day of month, day of week, etc.) and time-based statistics (lagged features over the last 6, 12, and 24 hours) to improve model accuracy; a sketch of this feature engineering is shown below. After this preprocessing and feature engineering, one can train XGBoostLSS and obtain the scale estimate of the model (see Fig. 4).
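A minimal sketch of this kind of feature engineering with pandas (the file name and the PJME_MW column name are assumptions about the dataset layout):

import pandas as pd

df = pd.read_csv("PJME_hourly.csv", index_col="Datetime", parse_dates=True)

# Seasonal / calendar features derived from the timestamp.
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
df["dayofyear"] = df.index.dayofyear

# Lagged features over the last 6, 12, and 24 hours.
for lag in (6, 12, 24):
    df[f"lag_{lag}h"] = df["PJME_MW"].shift(lag)

df = df.dropna()  # drop rows without a full lag history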

Fig. 4. Top: distribution modelling of the first 500 points of the test set of the energy consumption time series, where the shaded region shows the fitted Gaussian scale. Bottom: scale of the fitted Gaussian distribution plotted as a function of time.

An interesting question is whether the modelled Gaussian scale correlates with the absolute error of the model fit and can therefore represent the model uncertainty. Indeed, focusing on the test data set from Fig. 4, one obtains a correlation above 0.6 that is statistically significant at the 95% confidence level (see Fig. 5). This result is not perfect, and the correlation decreases as we model data further into the future, but it can still give an indication of the model uncertainty.

Fig. 5. GAMLSS-modelled Gaussian scale plotted against the absolute error of the mean prediction. In this example, the correlation reaches 0.6.
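Such a check can be reproduced in a few lines (a sketch; mean and scale come from pred_params above, and y_test is assumed to be the test target aligned with dtest):

from scipy import stats

abs_error = (y_test - mean).abs()

# Pearson correlation between the predicted scale and the absolute error,
# with a p-value for the significance test.
r, p_value = stats.pearsonr(scale, abs_error)
print(f"correlation: {r:.2f}, p-value: {p_value:.3g}")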

Conclusion

In general, the accuracy of the GAMLSS scale estimate and its high correlation with the model error are not guaranteed, but they can serve as a good indicator for analyzing model errors and for spotting areas where a model is under- or overconfident, or where more data collection is needed. Naturally, the mean error of the GAMLSS model is not smaller than that obtained with the original boosting methods; however, if one aims for estimation beyond point averages, this approach offers an out-of-the-box wrapper that does not require large modifications to the code.

Whether shampoo, detergent, or industrial adhesive — Henkel stands for strong brands, innovations, and technologies. In our data science, engineering, and analytics teams we solve modern data challenges for the benefit of our customers.
Learn more at henkel.com/digitalization.
