Unlock the NeuralProphet potential: main components

Analyzing the key differences from Prophet, examining its main components, and highlighting AR-Net's advantages and its seamless integration within the algorithm

Lavinia Guadagnolo
Eni digiTALKS

--

Figure 1 — Image from Adobe Stock

Are you tired of traditional time series forecasting models that require onerous tuning and have limited explainability? Look no further than NeuralProphet, a cutting-edge forecasting library that leverages the power of neural networks and decomposable time series to generate accurate predictions.

Whether you’re working in finance, marketing or any other industry which relies on forecasting, NeuralProphet is the tool you need to take your predictions to the next level.

This article will take a deep dive into NeuralProphet's components and guide you in taking advantage of them, so grab a cup of coffee and let's explore the future of time series forecasting.

Differences with Prophet

As a data scientist, you might be familiar with Prophet, a well-known algorithm developed in 2017 by Facebook (if you want to dive deeper into it, check this article). NeuralProphet can be seen as an evolution of the Prophet algorithm which incorporates the power of neural networks. In this section we will quickly illustrate the main differences between the two algorithms.

NeuralProphet was developed in 2020 by Facebook [1, 2]. Like Prophet, it is based on decomposable time series but, in addition, it includes an autoregressive component based on a neural network: AR-Net. The latter can also be extended to include external regressors, which are modelled with a separate AR-Net from the one used for the target time series.

NeuralProphet uses PyTorch as its backend, which allows it to exploit all the latest innovations and discoveries of the deep learning community, such as:

  • a modern version of mini-batch stochastic gradient descent (SGD)
  • advanced optimizers (e.g., Adam)
  • the possibility to customize all the components, from layers to loss functions (you can find the implementation at the end of the article).

Last but not least, one of the greatest novelties introduced with NeuralProphet is the support for global modelling [4, 5], that is, using historical observations of multiple time series to fit the forecasting model. This feature is particularly suitable for time series made up of smaller ones, as it allows training a single model on different related series.

NeuralProphet’s main components

The NeuralProphet documentation is highly concise; therefore, in the upcoming paragraphs we will try to fully explore its main concepts and provide the essential elements that simplify its usage and comprehension. So, get ready as we take a closer look at each component.

Figure 2 — Image by the authors

As previously mentioned, NeuralProphet is based on decomposable time series. According to this approach, any time series can be broken down into various unobservable components that exhibit distinct patterns. These components are trend, seasonality, and random variation. NeuralProphet also includes additional terms: special events, future regressors, autoregressive components and lagged regressors. Therefore, predictions at time t are described by the following formula:
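
In the notation of the NeuralProphet paper [1], the decomposition reads:

ŷₜ = T(t) + S(t) + E(t) + F(t) + A(t) + L(t)

where T(t) is the trend, S(t) the seasonal effects, E(t) the event and holiday effects, F(t) the future-regressor effects, A(t) the autoregressive effects and L(t) the lagged-regressor effects.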

Let’s explore how each component is computed in NeuralProphet.

Trend

The trend represents the long-term movement of the series. Its most classic representation is the combination of an offset and a growth rate.
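
In its simplest form, this is just a straight line:

T(t) = k · t + m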

Where k represents the growth rate and m represents the offset.

Unfortunately, in the real world only a few phenomena can be represented through this linear trend. NeuralProphet starts from this simple idea of trend modelling, but allows the growth rate to change at a finite number of locations, named changepoints. Between two changepoints the growth rate is kept constant. By doing this, the trend is modelled as a continuous piece-wise linear function, providing an interpretable, yet non-linear model.

So, let C be a set of n𝒸 changepoints, the trend can be described by the following formula:
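
In the notation used below (and in the NeuralProphet paper [1]), it reads:

T(t) = (δ₀ + Γ(t)ᵀ δ) · t + (ρ₀ + Γ(t)ᵀ ρ)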

Let’s go step by step to find out its meaning.

The first term (δ₀ + Γ(t)ᵀ δ) represents the growth rate at time t, which is determined by adding to the initial growth rate δ₀ the sum of the rate adjustments at all changepoints up to time step t.

The second term (ρ₀ + Γ(t)ᵀ ρ) represents a time-dependent offset. Again, ρ₀ represents the first segment's offset, and the offset at time t is given by the sum of ρ₀ and all the adjustments at each changepoint up to time t. The piece-wise constant vector Γ(t) ∈ {0, 1}^n𝒸 indicates whether time t is past each changepoint.

In the following picture we can observe how NeuralProphet is modelling the trend for a sample time series. The dotted lines represent the changepoints position.

Figure 3 — Example of how the changing trend is built starting from the changepoints

Seasonality

The idea behind seasonality modelling in NeuralProphet is the Fourier series, which allows decomposing a continuous periodic function f(x) into a sum of sine and cosine terms. These terms are called Fourier harmonics and are characterized by their own amplitude and frequency.

To model seasonality, we need as many Fourier terms (N) as there are macroscopic cycles in the series, as shown in the following pictures:

Figure 4 — Seasonality modelling. Right side: representation of the time series signal; left side: Fourier transform of the series, representing the frequencies.

Seasonality is computed through the following relation:
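
For a single periodicity p, following the NeuralProphet paper [1], it can be written as:

Sₚ(t) = Σ_{j=1..k} ( aⱼ · cos(2πjt / p) + bⱼ · sin(2πjt / p) )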

Where k represents the number of Fourier terms and p represents a periodicity. Usually there are different periodicities; for instance, a series might depend on the month of the year but also on the day of the week. Therefore, the complete formula for modelling seasonality in NeuralProphet is:
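
Summing the contribution of each periodicity:

S(t) = Σ_{p ∈ P} Sₚ(t)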

Where P represents the set of all the periodicities. NeuralProphet supports both additive and multiplicative seasonal patterns:

Where, in the multiplicative case, the seasonality is multiplied by the trend T(t).

NeuralProphet allows yearly, weekly, and daily seasonality. The default numbers of Fourier terms per seasonality are k = 6 for p = 365.25 (yearly), k = 3 for p = 7 (weekly), and k = 6 for p = 1 (daily). However, it is also possible to customize the number of Fourier terms.
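
As a minimal sketch (these constructor arguments exist in NeuralProphet, although their exact behaviour may vary slightly across library versions), the number of Fourier terms can be set directly when creating the model:

```python
from neuralprophet import NeuralProphet

# Each integer sets the number of Fourier terms for that seasonality;
# seasonality_mode switches between additive and multiplicative patterns.
m = NeuralProphet(
    yearly_seasonality=10,    # override the default k = 6
    weekly_seasonality=3,     # default for p = 7
    daily_seasonality=False,  # disable daily seasonality entirely
    seasonality_mode="multiplicative",
)
```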

Events and holidays

Events and holidays occur sporadically; if their presence has an impact on the examined phenomenon, they can be included in the model as binary variables. Similarly to the other features, their impact can be trend dependent or independent:

The NeuralProphet library already includes a set of predefined holidays for each country, but users can define and add further events. Moreover, it is possible to choose an interval around the event date. For example, considering Christmas, you may want to treat the day before and the day after as part of the event too, so that your Christmas event spans from the 24ᵗʰ to the 26ᵗʰ included.
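
A minimal sketch of how this looks in code (the event dates and the dataframe here are made up for illustration):

```python
import pandas as pd
from neuralprophet import NeuralProphet

# Toy history with the two columns NeuralProphet expects: "ds" and "y"
df = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=730, freq="D"),
    "y": range(730),
})

m = NeuralProphet()

# Predefined holidays for a given country
m.add_country_holidays("IT")

# A user-defined event, extended one day before and one day after,
# so that Christmas spans the 24th to the 26th of December
m.add_events(["christmas"], lower_window=-1, upper_window=1)

christmas = pd.DataFrame({
    "event": "christmas",
    "ds": pd.to_datetime(["2021-12-25", "2022-12-25"]),
})

history = m.create_df_with_events(df, christmas)
metrics = m.fit(history, freq="D")
```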

Future regressors

Regressors are external variables whose effect impacts how the target variable (in our case, the time series) varies. In NeuralProphet, regressors whose values are known in the future are called future regressors, whereas regressors whose values are unknown in the future are called lagged regressors.

Let’s start exploring future regressors. They are represented by the following formula:

Where:

T(t) represents the trend and Ff(t) denotes the effect of feature f at time t. If the effect of the regressor f is amplified as the trend varies, then the regressor should be modelled as multiplicative.
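
A minimal sketch (the regressor names are hypothetical; add_future_regressor is the NeuralProphet method for this component):

```python
from neuralprophet import NeuralProphet

m = NeuralProphet()

# A regressor whose future values are known and whose effect is simply added to the forecast
m.add_future_regressor("temperature", mode="additive")

# A regressor whose effect is amplified as the trend varies
m.add_future_regressor("price_index", mode="multiplicative")

# The training dataframe must then contain the columns "ds", "y",
# "temperature" and "price_index", and future values of the regressors
# must be supplied when building the prediction dataframe.
```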

NeuralProphet AutoRegressive (AR) components

AR-net

Before diving into the model’s components based on autoregression, it is advisable to focus on AR-net.

The idea of leveraging neural networks to perform autoregression comes from a recent work by Stanford and Facebook researchers [3].

They proposed a new framework combining traditional statistical models with neural networks. Unlike the neural networks most commonly used for time series, such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), or attention-based models, the authors adopt a simple Feed-Forward Neural Network (FFNN), usually without any hidden layers.

A NN architecture equivalent to an AR model can be seen in Figure 5: the output of the network is a linear combination of the inputs with the weights learned by the model during training. This architecture has two main advantages: interpretability and a reduction in the number of parameters to be tuned.

By simply using the past values as input to predict future value, the NN mimics a traditional AR process, but with the following advantages:

1. Keeps the same level of interpretability as the classic AR model;

2. The computational complexity scales linearly (the computational complexity of the classical AR model scales at least quadratically), thus AR-net can scale well to large orders, allowing estimates of long-range dependencies;

3. AR-net can automatically select and learn sparse coefficients.

Furthermore, the NN is a non-parametric, data-driven model which does not require restrictive assumptions about the underlying process.

Figure 5 — NN architecture equivalent to an AR model — image from the AR-Net original paper [3]
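
To make the architecture of Figure 5 concrete, here is a minimal PyTorch sketch of an AR(p) model expressed as a single linear layer (a simplified illustration of the idea, not the NeuralProphet implementation):

```python
import torch
import torch.nn as nn

p = 7  # AR order: number of past values used as input

# A single linear layer with no bias and no activation is equivalent to a
# classic AR(p) model: the output is a linear combination of the p lags.
ar_net = nn.Linear(p, 1, bias=False)

# Forward pass on a batch of lag vectors [y_{t-1}, ..., y_{t-p}]
lags = torch.randn(32, p)
one_step_forecast = ar_net(lags)  # shape (32, 1)

# After training (e.g., mini-batch SGD on an MSE loss), the layer weights
# can be read directly as AR coefficients, which keeps the model interpretable.
print(ar_net.weight.detach().numpy())
```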

To explain the first advantage, the authors simulated an AR process, then fitted both a classic AR model and the FFNN. By looking at Figure 6, we can see that, despite using different optimization techniques (i.e., least-squares for classic AR and SGD for AR-Net), the weights learned by the two models are almost the same, and both are quite similar to the true weights.

Figure 6 — Weights learned by classic AR and AR-Net, along with the true underlying ones — image from the AR-Net original paper [3]

To show the second advantage, the authors measured the actual training time of both classic AR and AR-Net for several values of the order p (i.e., the number of previous values of the time series used to compute the forecast). The training time of classic AR scales up almost exponentially compared to the flat line of AR-Net, as can be seen in fig. 7. This makes AR-Net the only viable option when the order is high.

Figure 7 — Training time of the classic AR model and AR-Net as a function of the order p of the AR model — image from the AR-Net original paper [3]

The third advantage is achieved by introducing a regularization term, called R, to the loss. There are two major consequences:

  • The possibility to fit a larger model without knowing in advance the true AR order
  • The end of the assumption that AR coefficients must consist of consecutive lags. For example, typical business time series have missing data during weekends; such series cannot be modelled directly with a classic AR process, because the lags are not consecutive: there are five consecutive lags, then a jump skipping two days.

In figure 8 you can see an example of sparse AR. Although the order is p = 7 (i.e., there are seven past points as inputs), the regularization shrinks some lags to 0.

Figure 8 — Example of Sparse AR-net, the regularization term reduces the order from 7 to 3

In figure 9 there is an example taken directly from the documentation.

Figure 9 — Example of the AR relevance of the same model with different regularization terms. Above the regularization term is set to 0.1, while below there is a case of extreme sparsity with the regularization set to 10 — Image from the NeuralProphet official documentation [2]

Here, the regularization parameter is increased in the bottom plot, and as a consequence the relevance of some lags is reduced to 0. It is worth noting that these plots represent the average lag relevance, since for each future point we want to forecast, the influence of past lags is different. For instance, if we want to predict tomorrow's value, the AR weight associated with the current day takes one value, but when we forecast the day after tomorrow, the weight associated with the current day assumes a different value. Thus, what is plotted is a sort of average value (over all the forecast points) for each lag.
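
In NeuralProphet, this behaviour is controlled by the ar_reg argument (a minimal sketch; larger values push more AR weights towards zero):

```python
from neuralprophet import NeuralProphet

# n_lags enables the AR component (order p = 14 here);
# ar_reg sets the strength of the sparsity-inducing regularization.
m_mild = NeuralProphet(n_lags=14, ar_reg=0.1)   # mild sparsity, as in the top plot
m_sparse = NeuralProphet(n_lags=14, ar_reg=10)  # extreme sparsity, as in the bottom plot
```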

Lastly, in order to compare such AR-Net with the classic AR, the authors fitted classic AR and sparse AR-Net with varying model sizes to an AR process of order 3 (fig. 10). Without going into details, the y-axis shows a measure of the difference between learned and actual weights/coefficients; thus, the lower the metric, the better the model. Sparse AR-Net is able to maintain good performance even with larger model sizes, while classic AR performance degrades as soon as the order drifts from the actual one.

Figure 10 — Performance of classic AR and AR-Net with varying model sizes — image from the AR-Net original paper [3]

Deep AR-net

We could have concluded our digression on AR-Net at this point, but the authors chose to explore further, prompting us to do the same. Indeed, they also considered the possibility of inserting hidden layers (see Figure 11, left side). In this case there is a tradeoff between forecast accuracy and interpretability. By adding hidden layers, together with an activation function (e.g., a ReLU), it is possible to model non-linear relationships between the lags and the response variable, but the clear relationship between AR-Net weights and classic AR coefficients is lost. However, it is still possible to use the weights after the first layer as a proxy for the relevance of the lags (Figure 11, right side).

Figure 11 — Example of how to get a measure of relevance for each input in the case of AR-Net with hidden layers — images from the AR-Net original paper, modified by the authors [3]

AR components implemented in NeuralProphet

The autoregressive component implemented in NeuralProphet strongly relies on AR-Net, but it is not a plain, plug-and-play copy. For most users these changes are not relevant, but to provide the full picture and make you aware of the tool you are using, we briefly describe the variations. First, concerning the deep AR version, the ReLU activation function and the bias after the hidden layers were removed. Also, the regularization function at the basis of sparse AR was modified; interested readers can refer to the AR-Net and NeuralProphet papers for the details [1, 3].

Finally, the most important development concerns the forecast horizon, i.e., how many steps ahead we can forecast. The standard AR-Net can forecast a single point, not necessarily the first step ahead. Thus, if we wanted to forecast tomorrow and the day after tomorrow, we would need to build two NNs, one for each step. The NeuralProphet authors managed to produce multiple forecasts with the same NN, which can forecast all the steps at once. Unfortunately, there are no details on how this was achieved, and, in general, there are no published studies on how those differences affect performance. Again, we just wanted to make you aware of the model you are using.

As stated by the authors, it is important to keep in mind that for each forecast point there are multiple predictions, made at different origins. Look at fig. 12 for an example. The first steps are used as lags, therefore there is no forecast associated with them. The first forecast available is for 2020–09–06 and was made the step before; it is located in the column yhat1. The next step (2020–09–13) has two predictions: yhat1, made the step before, and yhat2, made two steps before. The further ahead the step, the more forecasts there are, up to the number of steps to forecast.

Figure 12 — Example of forecasting results — Image made by the authors
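
A minimal sketch of such a multi-step AR configuration (the dataframe here is synthetic; the column names follow the figure):

```python
import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

# Synthetic weekly series with the expected "ds" and "y" columns
df = pd.DataFrame({
    "ds": pd.date_range("2019-01-06", periods=100, freq="W"),
    "y": np.random.rand(100),
})

# n_lags: how many past observations feed the AR-Net;
# n_forecasts: how many steps ahead the same network predicts.
m = NeuralProphet(n_lags=6, n_forecasts=3)
m.fit(df, freq="W")

forecast = m.predict(df)
# yhat1, yhat2, yhat3: for each date, the predictions made 1, 2 and 3 steps earlier
print(forecast[["ds", "y", "yhat1", "yhat2", "yhat3"]].tail())
```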

Lagged regressors

The Lagged Regressor component takes advantage of the above-explained neural network to forecast regressors whose future values are unknown. Then, they can be modelled as if they were future regressors. Each regressor, also known as a covariate, is modelled by a dedicated AR-Net, whose inputs are the last p observations of the regressor.

In line with the authors' easy-to-use policy, the ability to effortlessly add regressors with unknown future values, without building a dedicated model, is great. Nonetheless, the control you have over those regressors' forecasts is limited, both in terms of the NN parameters and the model category, since you can only mimic an AR process, which is not guaranteed to be the best choice.
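
Declaring a lagged regressor is again a one-liner (a sketch; the column name is hypothetical and the AR component must be enabled via n_lags):

```python
from neuralprophet import NeuralProphet

m = NeuralProphet(n_lags=7, n_forecasts=1)

# "temperature" is a column of the training dataframe whose future values are
# unknown; its past observations are fed to a dedicated AR-Net.
m.add_lagged_regressor("temperature")
```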

An example of the new flexibility: Customizing the loss function

We highlighted the improvements of NeuralProphet compared to Prophet and its main components. Here we would like to give a real-world example of the potential of NeuralProphet in terms of flexibility. Indeed, without the need to twist the model, we can simply plug in a custom loss function, as long as it is written in PyTorch.

In a time-series-based use case we developed, due to specific business needs, overshooting was preferred over undershooting; we thus experimented with a customized loss function. As you can see in the formulas below, we start by defining a relative error measure E. Then, in those cases where E < -0.05, which means that we are underestimating the actual value by more than 5%, the loss is the MSE multiplied by a constant k. Otherwise the loss is simply the MSE. In this way the loss function stays differentiable, while penalizing undershooting. On the right side of figure 13, the Python implementation of this function can be found. c stands for condition: it is the relative error when the target is not 0, otherwise it is the target itself. Then, l is the MSE multiplied by k if c is below -0.05, otherwise it is just the MSE.

Figure 13 — Customized loss function and its code implementation
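
A sketch of such a loss in PyTorch, following the description above (k = 2 and the -5% threshold are the values from our experiments; the exact reduction expected by the library may differ):

```python
import torch
from neuralprophet import NeuralProphet

def penalized_mse(output, target, k=2.0, threshold=-0.05):
    # c: relative error where the target is not 0, the raw target otherwise
    safe_target = torch.where(target != 0, target, torch.ones_like(target))
    c = torch.where(target != 0, (output - target) / safe_target, target)
    mse = (output - target) ** 2
    # Penalize undershooting: scale the squared error by k when we are
    # underestimating the actual value by more than 5%
    loss = torch.where(c < threshold, k * mse, mse)
    return loss.mean()

# The custom loss is plugged into the model through the loss_func argument
m = NeuralProphet(loss_func=penalized_mse)
```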

In Figure 14 there is an example of the experiments we carried out. Specifically, there are three different forecasts, made respectively with these loss functions: the classic MSE loss, the penalized MSE with k = 2, and the penalized MSE with k = 3. Increasing k slightly shifts the forecast curve up, as expected.

Figure 14 — Effect of customized loss

NeuralProphet was developed with the aim of being a simple tool, usable also by non-technical people, without the need to finely tune the model. However, we showed how this easiness does not limit the flexibility and potential of the model. Here we have just defined a simple custom loss function, but advanced users could also dig into the neural network itself and adapt it to their needs.

Conclusion

To sum up, NeuralProphet is an exciting and rapidly evolving tool that combines the best of classical statistical models and deep learning when it comes to time-series forecasting. Its unique blend of classical and modern methods, as well as its straightforward and easy-to-use interface, make it an ideal choice for anyone looking to build accurate and interpretable forecasting models.

After this theoretical overview, if you want to deep dive into how to tune hyperparameters, check out the second part of this article.

This article has been written with the valuable collaboration of Riccardo Tambone.

References

[1] O. Triebe, H. Hewamalage, P. Pilyugina, N. Laptev, C. Bergmeir and R. Rajagopal, “NeuralProphet: Explainable Forecasting at Scale,” arXiv, 2021.

[2] “NeuralProphet Documentation,” [Online]. Available: https://neuralprophet.com/.

[3] O. Triebe, N. Laptev and R. Rajagopal, “AR-Net: A simple Auto-Regressive Neural Network for time-series,” arXiv, 2019.

[4] P. Montero-Manso and R. J. Hyndman, “Principles and algorithms for forecasting groups of time series: Locality and globality,” International Journal of Forecasting, vol. 37, no. 4, pp. 1632–1653, 2021.

[5] H. Hewamalage, C. Bergmeir and K. Bandara, “Global models for time series forecasting: A Simulation study,” Pattern Recognition, vol. 124, 2022.
