LightGBM for Time Series Forecasting

Michele Pace
Data Reply IT | DataTech
6 min read · Jan 19, 2022
SolarSeven / Getty Images

Is it possible to use a regression model for time series forecasting? And is LightGBM a good candidate for this task?

Well, the simplest answer is: Yes, sure!

But in order to choose the best possible solution for our problem, we need to better understand both LightGBM itself and how to use a regression model for time-series forecasting.

Let’s start with a few basic notions

Gradient boosting

Gradient boosting is a machine learning technique for regression, classification, and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

The idea of boosting came from the question of whether a weak learner can be modified to become better.

A weak hypothesis or weak learner is defined as one whose performance is at least slightly better than random chance.

Hypothesis boosting was the idea of filtering observations, keeping those that the weak learner can handle and focusing on developing new weak learners to handle the remaining difficult observations.
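To make the idea concrete, here is a minimal sketch of gradient boosting for regression with squared loss: each new weak learner (a shallow tree) is fit to the residuals, i.e. the negative gradients, of the current ensemble. The toy data and hyperparameters are made up purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine wave (made up for illustration)
rng = np.random.RandomState(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                 # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=3)  # weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```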

LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is an open-source library that has gained tremendous popularity among machine learning practitioners.

The two novel ideas introduced by LightGBM are:
• Gradient-based One-Side Sampling (GOSS)
• Exclusive Feature Bundling (EFB)

Besides these, LGBM also uses an efficient histogram-based method to identify splitting points in continuous features. Split points are the feature values at which the data is divided at a tree node.

GOSS: a novel sampling method that downsamples instances on the basis of their gradients. Instances with small gradients are already well trained (small training error), while those with large gradients are undertrained. A naive way to downsample would be to keep only the instances with large gradients and discard those with small gradients, but this would alter the data distribution. In a nutshell, GOSS retains all instances with large gradients while performing random sampling on instances with small gradients.

EFB: if we can reduce the number of features, we will speed up tree learning. LightGBM achieves this by bundling features together. We often work with high-dimensional data, which has many features that are mutually exclusive, i.e., they rarely take non-zero values simultaneously. LightGBM safely identifies such features and bundles them into a single feature to reduce the complexity.

The level-wise strategy grows the tree level by level: each node splits the data, prioritizing the nodes closer to the tree root. The leaf-wise strategy instead grows the tree by splitting at the node with the highest loss reduction: it finds the leaf whose split will reduce the loss the most and splits only that leaf. Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit; leaf-wise growth tends to excel on larger datasets, where it is also considerably faster.

Some key points of LightGBM:

  1. Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e., it buckets continuous feature values into discrete bins, which speeds up the training procedure.
  2. Lower memory usage: replacing continuous values with discrete bins results in lower memory usage.
  3. Native support for categorical features.
  4. Often better accuracy than other boosting algorithms: it produces more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.
  5. Compatibility with large datasets: it performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.
  6. Support for parallel learning.
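As a quick reference, here is a minimal LightGBM regression example through its scikit-learn API; the synthetic data, parameter values and callbacks are just a reasonable starting point, not a recommendation.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Synthetic regression data (placeholder)
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,   # leaf-wise growth: controls tree complexity
    max_depth=-1,    # set e.g. 8 to limit depth and reduce overfitting
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)
print("Best iteration:", model.best_iteration_)
```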

Forecasting as Supervised Learning

There are multiple ways to approach this problem, but we will focus on the most commonly used ones with a single-output algorithm like LightGBM.

Single-Step Forecasting

We take the previous k time-step values as our regressors and predict the value at step k+1. This is a one-step-ahead forecaster.
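A minimal sketch of this setup: build a design matrix from the previous k values and fit a one-step-ahead LightGBM regressor. The series, the value of k and the model settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Toy series (illustrative)
rng = np.random.RandomState(0)
series = pd.Series(np.sin(np.arange(300) / 10) + rng.normal(scale=0.05, size=300))

k = 12  # number of previous time steps used as regressors
df = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, k + 1)})
df["target"] = series
df = df.dropna()

X, y = df.drop(columns="target"), df["target"]
model = lgb.LGBMRegressor(n_estimators=200).fit(X, y)

# One-step-ahead forecast from the last k observed values (lag_1 = most recent)
last_window = pd.DataFrame([series.iloc[-k:].values[::-1]], columns=X.columns)
print("Forecast for the next step:", model.predict(last_window)[0])
```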

Multi-Step Forecasting (Recursive Strategy)

One way to achieve this is by iterating over multiple steps (the number of steps is known as the forecasting horizon) and using each forecasted value as input for forecasting the next one. This is known as the Recursive strategy and, despite some of its limitations (sensitivity to error accumulation), it works well in real-world settings.
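A sketch of the recursive strategy, reusing a one-step-ahead model trained on k lags like the one sketched above (function and variable names are hypothetical): each prediction is appended to the history and the lag features are recomputed before the next step.

```python
import numpy as np

def recursive_forecast(model, last_k_values, horizon):
    """Forecast `horizon` steps ahead with a one-step-ahead model trained on k lags.

    `last_k_values` holds the k most recent observations, oldest first.
    """
    k = len(last_k_values)
    history = list(last_k_values)
    forecasts = []
    for _ in range(horizon):
        # Rebuild the lag features from the (partly predicted) history
        window = np.array(history[-k:][::-1]).reshape(1, -1)  # lag_1 = most recent
        y_hat = model.predict(window)[0]
        forecasts.append(y_hat)   # store the forecast for this step...
        history.append(y_hat)     # ...and feed it back in as an input
    return forecasts

# e.g. recursive_forecast(model, series.iloc[-k:].values, horizon=24)
```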

Multi-Step Forecasting (Direct Strategy)

A second way to achieve multi-step forecasting is by learning N models independently, where N is the number of steps that we want to forecast. Since the Direct strategy does not use any approximated values to compute the forecasts, it is not prone to any accumulation of errors. But it makes a simplistic assumption of conditional independence between future values.
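A sketch of the direct strategy under the same lag-feature setup (names and parameters are illustrative): one model is trained per forecast step h, each one predicting the value h steps ahead directly from the last observed lags.

```python
import pandas as pd
import lightgbm as lgb

def fit_direct_models(series, k, horizon):
    """Direct strategy: fit one LGBM model per forecast step h = 1..horizon."""
    lags = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, k + 1)})
    models = {}
    for h in range(1, horizon + 1):
        target = series.shift(-(h - 1))  # value h steps after the last lag
        data = lags.assign(target=target).dropna()
        models[h] = lgb.LGBMRegressor(n_estimators=200).fit(
            data.drop(columns="target"), data["target"]
        )
    return models

def direct_forecast(models, series, k):
    """Forecast each step with its own model, from the last k observed values."""
    last_row = pd.DataFrame([series.iloc[-k:].values[::-1]],
                            columns=[f"lag_{i}" for i in range(1, k + 1)])
    return [models[h].predict(last_row)[0] for h in range(1, len(models) + 1)]
```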

Multi-Step Forecasting (DirRec Strategy)

The DirRec strategy combines the architectures and the principles underlying the Direct and the Recursive strategies. DirRec computes the forecasts with different models for every horizon (like the Direct strategy) and, at each time step, it enlarges the set of inputs by adding variables corresponding to the forecasts of the previous step (like the Recursive strategy). However, note that, unlike the two previous strategies, the embedding size n is not the same for all the horizons.
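A sketch of DirRec under the same assumptions (names are illustrative): as in the Direct strategy there is one model per step, but model h additionally receives the in-sample forecasts produced for steps 1..h-1 as extra inputs, so its input set grows with the horizon.

```python
import pandas as pd
import lightgbm as lgb

def fit_dirrec_models(series, k, horizon):
    """DirRec sketch: one model per step h; model h also uses the forecasts
    already produced for the earlier steps as additional input features."""
    base = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, k + 1)})
    extra = pd.DataFrame(index=series.index)   # forecasts of the earlier steps
    models = {}
    for h in range(1, horizon + 1):
        target = series.shift(-(h - 1))
        data = pd.concat([base, extra], axis=1).assign(target=target).dropna()
        X, y = data.drop(columns="target"), data["target"]
        models[h] = lgb.LGBMRegressor(n_estimators=200).fit(X, y)
        # Enlarge the next model's input set with this step's in-sample forecasts
        extra[f"forecast_{h}"] = pd.Series(models[h].predict(X), index=X.index)
    return models
```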

Feature engineering

Feature engineering is really helpful when dealing with supervised algorithms for time series. It helps the model discover relations between features (and combinations of them) and the values to forecast.

Simple features that are frequently used (see the sketch after this list):
•Lags
•Window functions (max, min, mean, exp…)
•Date decomposition (year, quarter, month, day)
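A small pandas sketch of these three kinds of features, assuming a DataFrame with a DatetimeIndex and a `value` column (the column names and window sizes are illustrative assumptions):

```python
import pandas as pd

def add_features(df, lags=(1, 7, 28), windows=(7, 28)):
    """Add lag, rolling-window and date-decomposition features to `df`.

    Assumes `df` has a DatetimeIndex and a `value` column (illustrative names).
    """
    out = df.copy()

    # Lags
    for lag in lags:
        out[f"lag_{lag}"] = out["value"].shift(lag)

    # Window functions (shifted by 1 to avoid leaking the current value)
    for w in windows:
        rolling = out["value"].shift(1).rolling(w)
        out[f"rolling_mean_{w}"] = rolling.mean()
        out[f"rolling_min_{w}"] = rolling.min()
        out[f"rolling_max_{w}"] = rolling.max()

    # Date decomposition
    out["year"] = out.index.year
    out["quarter"] = out.index.quarter
    out["month"] = out.index.month
    out["day"] = out.index.day

    return out
```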

Recap

Recursive

Pros
• Single model
• Needs less memory
• You can change the number of steps to predict without retraining

Cons
• You need to manage a cycle to reuse previous predictions
• You need to recompute the feature engineering process at every step
• You need to forecast exogenous variables if present and not available

Direct

Pros
• You don’t need a cycle to reuse previous predictions
• Feature engineering is computed only once
• You don’t need to forecast exogenous variables

Cons
• Multiple models to train
• You need to retrain if you want to change the number of steps to predict
• More memory required
• Each step has a different number of samples in the training set

Conclusion

Now that you have a basic understanding of how LightGBM works and when it can be useful, and you know the main approaches for using a regression model to address a time series forecasting problem, you have a few more tools to solve your task in the best way possible! Good luck!
