Explain it to me like I’m 5: how Facebook Prophet works and how to tune it

Paolo Finardi
Published in Eni digiTALKS
9 min read · Jul 5, 2022

A better understanding of this powerful tool will give you more control over your time series forecasting projects.

Figure 1 Source https://stock.adobe.com

The basic idea of Prophet is to generate a continuous function that fits our historical data. The key feature of this approach is that the function depends only on time, so it can describe our time series at any moment.

This approach kills two birds with one stone:

1. Prophet can extend the forecast arbitrarily far into the future.

2. Since the function works in the past as well, it can fill in the missing data in the historical series.

Or, put another way, you can predict the future even if you are missing some data! …ARIMA users like this post.

This beautiful magic is possible thanks to the GAM approach, also known as the Generalized Additive Model. Basically, it is a decomposable time series model or, more simply, a sum of functions.
Each function represents a different component of our time series:

Formula [1] Prophet time series model with three components: trend, seasonality, and holidays

This decomposition should be very clear to anyone who has used Prophet at least once, and it’s the main reason for Prophet’s popularity:

  • g(t): The trend function
  • s(t): The seasonality function
  • h(t): The holidays function
  • 𝜀: The error term
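Written out, the decomposition described by Formula [1] takes the form below. The symbol h(t) for the holidays component and the time index on the error term are assumptions based on the caption and on the usual notation for this kind of model.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```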

What, I’m sure, is not that clear is the math behind every component and how Prophet deals with the ancient data science quest:

Fit the model but not too much, we don’t like overfitting!

Reading through the next lines will clarify the meaning of all those model parameters that we usually (and sometimes randomly) set, hoping to find the best model fit.

The Trend Function

Formula [2] Trend function — linear growth
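As a reference for the walkthrough that follows, Formula [2] can be written out as below. This is a reconstruction from the symbols used in this section (k, m, a(t), δ and γ), so treat the exact layout as an assumption.

```latex
g(t) = \left(k + a(t)^{\top}\delta\right)\, t + \left(m + a(t)^{\top}\gamma\right)
```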

Researchers have proven that human beings try to understand a formula only as long as it contains fewer characters than their name; otherwise we tend to skip it and ignore its meaning altogether.

Figure 2 A better use for mathematical symbols from https://xkcd.com/

Yes, that’s fake news, but anyway let’s go step by step. We can start by assuming that a(t), the transposed vector in formula[2], is equal to zero.

Now everything is brighter: the formula is reduced to g(t)=kt+m.
It is the first formula we learn in geometry class! That’s a straight line where m is the offset from 0 and k is the angular coefficient.

Figure 3 Linear trend with constant growth in time

It is not uncommon to have a trend with this behavior, and this is fine if the time series has constant growth. Yet the real world is more complicated: the trend can change over time, going up and down, which is why we need the rest of the formula. We should consider a(t) different from zero and deal with changepoints.

Changepoints represent the moments at which the trend component changes significantly. Every adjustment of the trend is described by two variables:

1. Sⱼ, the moment when the jᵗʰ changepoint occurs

2. δⱼ, the amount by which the angular coefficient changes at the jᵗʰ changepoint

S and δ are the vectors that contain these values for all the changepoints.

To sum up, our simple trend will change its initial angular coefficient k by an adjustment δ₁ at time t=S₁.

How can we add this simple idea to the straight-line formula? We need the vector a(t) to sum, at any time t, all the adjustments δ that have occurred up to that point in time, or, to say it with a formula:

Formula [3] Linear trend angular coefficient at time t

The math works if we define the vector a(t) ∈ {0,1}ˢ such that:

Formula [4] Definition of the a(t) vector

This definition of a(t) picks out exactly the adjustments of the changepoints that have already occurred at time t, so we can rewrite formula[3] as k + a(t)ᵀδ.
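Formulas [3] and [4] can be reconstructed from the definitions above as follows. Whether the comparison at the changepoint itself is strict is an assumption here; because the trend is kept continuous, that choice does not change the value of g(t).

```latex
% Formula [3]: angular coefficient at time t
k + \sum_{j\,:\,t \ge S_j} \delta_j

% Formula [4]: definition of the indicator vector a(t)
a_j(t) =
\begin{cases}
1 & \text{if } t \ge S_j \\
0 & \text{otherwise}
\end{cases}
\qquad\Longrightarrow\qquad
k + \sum_{j\,:\,t \ge S_j} \delta_j = k + a(t)^{\top}\delta
```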

The last guy in formula[2] is 𝛾, and it is the simplest: it’s the adjustment of the offset that keeps the function continuous.
Its jᵗʰ element is equal to -Sⱼδⱼ, where δⱼ is the angular coefficient adjustment and Sⱼ is the time of changepoint j.
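To check that the pieces fit together, here is a minimal sketch of the trend function in plain Python. The function names and the example numbers are hypothetical; the only ingredients taken from the text are k, m, S, δ, a(t) and the offset correction -Sⱼδⱼ.

```python
def indicator(t, S):
    """a(t): 1.0 for every changepoint that has already occurred, else 0.0."""
    return [1.0 if t >= s_j else 0.0 for s_j in S]


def trend(t, k, m, S, delta):
    """Piecewise linear trend g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma)."""
    a = indicator(t, S)
    gamma = [-s_j * d_j for s_j, d_j in zip(S, delta)]  # offset corrections
    slope = k + sum(a_j * d_j for a_j, d_j in zip(a, delta))
    offset = m + sum(a_j * g_j for a_j, g_j in zip(a, gamma))
    return slope * t + offset


# Hypothetical example: slope 1.0, adjusted by -0.5 at t=10 and by +1.0 at t=20.
S = [10.0, 20.0]
delta = [-0.5, 1.0]
values = [trend(t, k=1.0, m=0.0, S=S, delta=delta) for t in (5.0, 10.0, 20.0, 30.0)]
print(values)
```

Evaluating the function just before and just after a changepoint gives practically the same value, which is exactly the job of 𝛾: the slope changes, but the line does not jump.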

Everything is easy so far; now we just need to know where and by how much our trend changes.

When we don’t know something, we usually try to guess it. Facebook Prophet does the same. It sets many potential changepoints in our historical time series, evenly distributed.

How many? You decide! The n_changepoints parameter defines the number of elements of the vector S.
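The even placement of the candidates can be sketched in a few lines. The helper name candidate_changepoints is hypothetical and the index arithmetic is a simplification, not Prophet’s own code. The defaults of 25 candidates spread over the first 80% of the history mirror Prophet’s n_changepoints and changepoint_range defaults; the 80% restriction is a detail the text above does not mention.

```python
def candidate_changepoints(timestamps, n_changepoints=25, changepoint_range=0.8):
    """Spread n_changepoints evenly over the first part of the history."""
    hist_size = int(len(timestamps) * changepoint_range)
    # Never ask for more changepoints than there are usable observations.
    n_changepoints = min(n_changepoints, hist_size - 1)
    if n_changepoints <= 0:
        return []
    step = (hist_size - 1) / n_changepoints
    # Skip index 0: a changepoint there would be redundant with the initial slope k.
    indexes = [round(i * step) for i in range(1, n_changepoints + 1)]
    return [timestamps[i] for i in indexes]


days = list(range(100))  # hypothetical daily history, t = 0..99
S = candidate_changepoints(days, n_changepoints=4)
print(S)
```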

Figure 4 Linear trend with many changepoints

Setting the S vector is trivial. Unfortunately, to initialize the δ vector we need a little bit more math.

Again, we start with an initial guess: each δⱼ is drawn from a Laplace distribution, δⱼ~Laplace(0,𝜏).

Do you remember the shape of the Laplace(0,𝜏) distribution? You should, because that 𝜏 is one of the parameters you can change to tune your model: it is called changepoint_prior_scale.

Figure 5 Laplace(μ,b) probability density function

A small 𝜏 increases the probability that the adjustment δⱼ is zero or close to zero, meaning that the trend at time Sⱼ is unchanged.

This is the reason why we called the changepoints potential: when an adjustment is equal to zero, there is no change in the angular coefficient of the trend.

In contrast, if 𝜏 is large, the trend will have many visible changepoints, like the ones in figure 4.

TLDR: does your trend component seem too rigid? Try increasing the changepoint_prior_scale!
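The effect of 𝜏 is easy to see by sampling. The sketch below draws adjustments from Laplace(0, 𝜏) for a small and a large scale and counts how many are negligible. The threshold of 0.01 and the two scales are arbitrary choices for illustration, and the sampling only mimics the initial guess described above, not Prophet’s fitting.

```python
import random


def sample_laplace(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def share_near_zero(tau, n_draws=10_000, threshold=0.01, seed=0):
    """Fraction of sampled adjustments with |delta_j| below the threshold."""
    rng = random.Random(seed)
    draws = [sample_laplace(tau, rng) for _ in range(n_draws)]
    return sum(abs(d) < threshold for d in draws) / n_draws


rigid = share_near_zero(tau=0.01)    # small prior scale
flexible = share_near_zero(tau=0.5)  # large prior scale
print(f"tau=0.01: {rigid:.0%} of adjustments are negligible")
print(f"tau=0.5:  {flexible:.0%} of adjustments are negligible")
```

With the small scale most candidate changepoints do nothing, which is what a rigid trend looks like; with the large scale almost every candidate bends the line.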

The generative model for the trend

The recipe is ready, everything is initialized, and we “just” need to fit the model. Let’s suppose we have already fitted our trend component and we know the best parameters describing past data. How can Prophet use such knowledge as a generative model to forecast future behavior?

The (strong) assumption here is that the trend in the future is going to change at the same rate seen in the past. For example, if the algorithm found four changepoints in a one-year time series, the forecast for the next six months will include, on average, two changepoints. Pretty easy, isn’t it?

At this point, the algorithm must set the size of each future adjustment, and it does so by literally guessing a random number. Of course, this randomness is driven by a probability distribution with a shape that we already know, the Laplace(0, λ) distribution.

Note that the parameter of the Laplace distribution is not τ but λ.

  • With τ we describe the prior distribution of δ, the initial guess. The fitting process then modifies this distribution to adapt the changepoints to the actual historical data.

  • With λ we describe the distribution of the changepoint adjustments (δ) after the fitting.

λ is chosen to maximize the likelihood of the Laplace(0, λ) distribution given the fitted changepoint adjustments. This is really important because now we have a distribution from which to generate the changepoints in the future forecast.
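For a Laplace distribution centred on zero, the scale that maximizes the likelihood is simply the mean of the absolute values. A minimal sketch, with a hypothetical helper name and made-up fitted adjustments:

```python
def fit_laplace_scale(deltas):
    """Maximum-likelihood scale of a zero-centred Laplace: mean absolute value."""
    if not deltas:
        raise ValueError("need at least one fitted adjustment")
    return sum(abs(d) for d in deltas) / len(deltas)


fitted_deltas = [0.02, -0.10, 0.0, 0.04]  # hypothetical fitted adjustments
lam = fit_laplace_scale(fitted_deltas)
print(lam)
```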

Figure 6 Example of forecast with high uncertainty

Well, now that we have a clear idea of the forecasting mechanics, can you figure out how this uncertainty shadow is drawn? The generative model is used to simulate hundreds of possible future trends, and those simulated trends are used to compute the uncertainty intervals (the light-blue cone).

Ok then, if you have read this far, you should understand why a large value of τ gives huge flexibility to your model. A flexible function can describe the historical time series with precision and the training error drops. However, when projected forward this flexibility will produce wide uncertainty intervals.
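The simulation can be sketched as follows. This is a simplified toy, not Prophet’s implementation: time advances in unit steps, a changepoint appears at each step with a fixed probability equal to the historical frequency, and the band is taken between the 10th and 90th percentiles. All names and numbers are hypothetical.

```python
import random


def sample_laplace(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulate_trend(last_level, last_slope, horizon, cp_rate, lam, rng):
    """One possible future: each step may carry a new slope adjustment."""
    level, slope, path = last_level, last_slope, []
    for _ in range(horizon):
        if rng.random() < cp_rate:  # same changepoint frequency as the past
            slope += sample_laplace(lam, rng)
        level += slope
        path.append(level)
    return path


def quantile(sorted_values, q):
    """Simple index-based quantile of an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def trend_interval(paths, lower=0.1, upper=0.9):
    """Per-step (lower, upper) band across all simulated paths."""
    bands = []
    for step_values in zip(*paths):
        ordered = sorted(step_values)
        bands.append((quantile(ordered, lower), quantile(ordered, upper)))
    return bands


rng = random.Random(42)
# Hypothetical fit: 4 changepoints in 365 days, lambda = 0.04, 180-day horizon.
paths = [simulate_trend(100.0, 0.1, 180, 4 / 365, 0.04, rng) for _ in range(500)]
bands = trend_interval(paths)
widths = [high - low for low, high in bands]
print(f"band width after 1 day: {widths[0]:.2f}, after 180 days: {widths[-1]:.2f}")
```

The band starts narrow and widens with the horizon, because every simulated slope change keeps accumulating: that is the cone.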

The Seasonality Function

Prophet’s second component is seasonality. Its aim is to model the periodic cycles that occur in the time series. The period can range from yearly down to daily; all of them are handled in the same way, since they are all continuous periodic functions.

That’s the key to the math behind the seasonality component: it’s a continuous periodic function.

Thanks to the Fourier theorem, we know that a periodic function f(x), which is reasonably continuous, may be expressed as the sum of a series of sine and cosine terms.

If you find this hard to believe, check this beautiful video of a guy who manages to draw Fourier’s portrait as a sum of sine and cosine terms!

Formula [5] Seasonality component function

A sum of sine and cosine, that’s all we need.

We’ll take N pairs of sine and cosine, each one with a different period and with the proper amplitude: [a₁, …, aₙ] for sine and [b₁, …, bₙ] for cosine.
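Formula [5] can be written out as below. The symbol P for the length of the seasonal period (for example 365.25 when t is measured in days and the cycle is yearly) is an assumption, and the assignment of a to the sine terms and b to the cosine terms follows the sentence above.

```latex
s(t) = \sum_{n=1}^{N} \left( a_n \sin\!\left(\frac{2\pi n t}{P}\right) + b_n \cos\!\left(\frac{2\pi n t}{P}\right) \right)
```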

As a Prophet user, you must define the number of pairs, also known as the Fourier order. For example, to model yearly seasonality in Prophet, you can change the Fourier order with the yearly_seasonality parameter.
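A minimal sketch of such a seasonal function, assuming t is measured in days and using made-up amplitudes for a Fourier order of 3. The function name and the period constant are hypothetical.

```python
import math


def seasonality(t, sine_amps, cosine_amps, period):
    """Truncated Fourier series; the Fourier order is len(sine_amps)."""
    total = 0.0
    for n, (a_n, b_n) in enumerate(zip(sine_amps, cosine_amps), start=1):
        angle = 2.0 * math.pi * n * t / period
        total += a_n * math.sin(angle) + b_n * math.cos(angle)
    return total


# Hypothetical amplitudes for a Fourier order of 3 on a yearly cycle (t in days).
a = [1.0, 0.3, 0.1]
b = [0.5, -0.2, 0.05]
PERIOD = 365.25
n_parameters = len(a) + len(b)  # two coefficients per Fourier order
print(n_parameters, seasonality(10.0, a, b, PERIOD))
```

Shifting t by one full period returns the same value, and each extra Fourier order adds one sine and one cosine amplitude for the fit to estimate.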

The amplitude coefficients of each sine and cosine are a different story. We don’t have a parameter to set those coefficients directly but, as we have already seen for the changepoint coefficients, the user defines the prior distribution that serves as a starting point for model fitting.

We will not deal with the fitting math, but we can use a clear and useful intuition: the goal of the fitting algorithm is to find a local minimum value for the cost function.

If we are looking for a smooth seasonal function, we need small amplitude coefficients. We set the prior scale to a small value and the fitting algorithm will look for a solution with coefficients close to zero.

Figure 7 Seasonality function with overfitting

Once again, we face the overfitting problem here. In fact, large amplitude coefficients can result in a noisy seasonality function [figure 7]. In the example, the typical winter-summer cycle is mixed with other components, probably an intra-month seasonality, which results in a noisy function.

An important takeaway is that if your seasonal component is too noisy, you should consider lowering the seasonality_prior_scale parameter. Just remember that the prior distribution is a Normal(0,σ²) distribution and that the seasonality_prior_scale parameter sets its scale σ: a high prior scale leads to big amplitude coefficients and a noisy seasonality.
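To see why a narrow Normal prior pushes amplitudes towards zero, consider a toy with a single sine feature and a single amplitude β. With Gaussian noise and a Normal(0, σ²) prior, the most probable β has a closed form in which the prior adds a penalty to the denominator. This is a ridge-style illustration under those assumptions, not Prophet’s fitting routine; the function name and data are hypothetical.

```python
import math


def map_amplitude(x, y, prior_sd, noise_sd=1.0):
    """MAP estimate of beta in y = beta * x + noise, beta ~ Normal(0, prior_sd^2)."""
    sxy = sum(x_i * y_i for x_i, y_i in zip(x, y))
    sxx = sum(x_i * x_i for x_i in x)
    return sxy / (sxx + (noise_sd / prior_sd) ** 2)


# Hypothetical data: one sine feature with a true amplitude of 3.
x = [math.sin(2.0 * math.pi * i / 50.0) for i in range(100)]
y = [3.0 * x_i for x_i in x]

loose = map_amplitude(x, y, prior_sd=10.0)  # wide prior: amplitude survives
tight = map_amplitude(x, y, prior_sd=0.01)  # narrow prior: amplitude shrinks
print(f"prior_sd=10:   {loose:.3f}")
print(f"prior_sd=0.01: {tight:.3f}")
```

The data are identical in both runs; only the prior scale changes, and with it how much of the amplitude the fit is allowed to keep.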

Figure 8 Normal distribution. Source: https://en.wikipedia.org/wiki/Normal_distribution

Lowering the prior scale is not the only way to smooth the curve. Instead of leveraging your knowledge of the prior distribution to push the coefficients close to zero, you can simply limit the Fourier order to just two or three components. Here is a visual example to guide your parameter guessing.

Figure 9 Two ways to reduce overfitting in the seasonality function

Which one is the best? Answering this question is not the purpose of this article, but now you know that adding one Fourier order means adding 2 more parameters to the fitting algorithm.
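As a recap, the knobs discussed in this article can be collected in one place. The sketch below assumes the prophet package for the final model construction and imports it lazily, so the dictionary can be inspected without it. The starting values mirror Prophet’s defaults as far as I know (an order of 10 is what the yearly seasonality uses when left on automatic), and the helper names and the "smoother" values are hypothetical, not recommendations.

```python
# Starting values intended to mirror Prophet's defaults; verify for your version.
tuning_knobs = {
    "n_changepoints": 25,             # size of the vector S
    "changepoint_prior_scale": 0.05,  # tau: trend flexibility
    "yearly_seasonality": 10,         # Fourier order of the yearly cycle
    "seasonality_prior_scale": 10.0,  # scale of the Normal prior on amplitudes
}


def build_model(**overrides):
    """Create a Prophet model; imported lazily so the dict is usable on its own."""
    from prophet import Prophet  # requires the `prophet` package

    return Prophet(**{**tuning_knobs, **overrides})


def smoother_seasonality(knobs):
    """Hypothetical helper: apply both seasonality-smoothing moves from the text."""
    return {**knobs, "yearly_seasonality": 3, "seasonality_prior_scale": 1.0}


candidate = smoother_seasonality(tuning_knobs)
print(candidate)
```

Going from a Fourier order of 10 to 3 removes 14 amplitude coefficients from the fit, while lowering the prior scale keeps all of them but shrinks them.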

Conclusion

We have seen here just the tip of the iceberg, starting from the simplest part, but you know: great acts are made of small deeds.

This version of Prophet is a good tool to prototype a forecast model. It is simple, quick and explainable, a great starting point for your forecasting project. For further details you can check the paper at this link.

I hope this reading made the Prophet fundamentals a little bit clearer, and I’ll be very happy if it somehow saves you a few minutes in your tuning phase.
