2,000+ Words on Time Series Forecasting

Charlie Bonfield
10 min read · Jan 9, 2018

--

With all of the literature out in the world about time series forecasting, it is easy to feel overwhelmed. The blog post that follows is intended to be a synthesis of the oodles of articles that I have read (and am still reading) on the subject, and my hope is that it is helpful to fellow time series debutant(e)s.

STL Decomposition

STL (seasonal and trend decomposition using LOESS) is a very popular method used in time series forecasting that effectively decomposes a time series into three components — seasonal (systematic/periodic structure), trend (long-term behavior), and residual (the leftovers).

Advantages:

  • An STL decomposition is quick/flexible for anomaly detection, as one can often identify additive outliers directly from the residual terms (provided that STL is indeed good for your time series, of course).
  • One can examine the three (seasonal, trend, residual) components separately, which will often provide a great deal of insight into the nature of your time series. For that reason, an STL decomposition can be useful to run even during exploratory data analysis.

Disadvantages:

  • This approach is rather inflexible “out-of-the-box”. If you have multiple trends/seasonalities, for instance, the lesser ones will still be present after running an STL decomposition. Thus, one would need to be careful about using the residuals to identify anomalies if there are, in fact, other significant seasonalities/trends in the time series.
  • If you are interested in forecasting, you will likely have to rely on one of the other methods (ARIMA, for instance) to forecast the components after running an STL decomposition.

Steps:

  • Decompose time series into seasonal, trend, and residual components.
  • If interested in forecasting, use any method of choice to forecast each component and recombine them.
  • If interested in anomaly detection, use any method of choice (e.g., a generalized ESD test) to identify outliers from the residual component (a minimal sketch follows this list).
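
Here is a minimal sketch of those steps in Python, assuming the statsmodels library and a hypothetical hourly series with daily seasonality; the three-standard-deviation cutoff on the residuals is just an illustrative stand-in for a proper outlier test such as generalized ESD.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical hourly series with a daily cycle and a slow upward trend.
rng = pd.date_range("2018-01-01", periods=24 * 60, freq="H")
y = pd.Series(
    10
    + 0.01 * np.arange(len(rng))
    + np.sin(2 * np.pi * np.arange(len(rng)) / 24)
    + np.random.normal(0, 0.3, len(rng)),
    index=rng,
)

# Decompose into trend, seasonal, and residual components.
result = STL(y, period=24).fit()

# Flag residuals that sit far from the rest (a crude stand-in for a formal test).
resid = result.resid
anomalies = resid[np.abs(resid - resid.mean()) > 3 * resid.std()]
print(anomalies)
```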

Ideal For:

The obvious ones — time series containing a single strong trend/seasonal component and no significant structural changes.

ARIMA

ARIMA (autoregressive integrated moving average) models are dependent on lagged observations (AR) and error terms (MA) after differencing (I). Standard notation for an ARIMA model is ARIMA(p,d,q), where p is the lag order, d is the degree of differencing, and q is the order of the moving average. (If extended to include seasonality, you’ll see an additional (P,D,Q)s term, where (P,D,Q) mean the same thing for the seasonal component and s is the number of periods associated with the seasonal behavior.)

Advantages:

  • The model is (relatively) simple to interpret and is only dependent on the observed time series.
  • Additionally, selecting the model parameters (p,d,q) (also (P,D,Q)s if incorporating seasonality) is relatively simple to do with the ACF/PACF/various tests of stationarity. (R users need look no further than the auto.arima() function.)

Disadvantages:

  • The time series must be stationary after (d orders of) differencing. If differencing alone does not induce stationarity, one must reach into his/her bag of tools to make the series stationary prior to modeling (in which case, with d = 0, you'll just be using an ARMA model).
  • There is no underlying model that attaches physical significance to the moving parts in an ARIMA model. If one is comfortable with using the past to inform values of the future without any sort of interpretation, this is a nonissue! If you want to glean something more from your analysis, however, you may be better served by an alternative method.
  • ARIMA models are rather inflexible in the sense that one needs to fit a new model each and every time one introduces new data.

Steps (if not using automated procedure):

  • Make time series stationary (via differencing, log transforms, etc.). This will provide the user with the parameter d.
  • Look to the ACF/PACF to determine p and q (use the ACF for MA(q), PACF for AR(p); if you have a model with nonzero p and q, it often makes sense to do a bit of experimentation after examining the ACF/PACF).
  • If you suspect seasonality may be present, difference your time series at a lag equal to the expected period. For example, if you have daily data with expected weekly seasonality, you would take lag-7 (seasonal) differences. From there, examine the ACF/PACF. Seasonal behavior will be evident at multiples of the expected period, whereas non-seasonal behavior will crop up at early lags (if any).
  • If interested in anomaly detection, forecast future values (ideally, with prediction intervals) and identify outliers with any method of choice. Note that these intervals will naturally increase the further we look out into the future, potentially limiting the realm of applicability for anomaly detection.
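
As a rough illustration of these steps in Python (assuming statsmodels and a hypothetical daily series with weekly seasonality), here is a seasonal ARIMA fit with prediction intervals; the (1,1,1)x(1,1,1,7) order is an arbitrary example rather than a recommendation, and in practice you would choose it from the ACF/PACF or an automated search.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical daily series with weekly seasonality.
rng = pd.date_range("2017-01-01", periods=365, freq="D")
y = pd.Series(
    20
    + 5 * np.sin(2 * np.pi * np.arange(len(rng)) / 7)
    + np.random.normal(0, 1, len(rng)),
    index=rng,
)

# Fit a seasonal ARIMA; the order here is illustrative only.
fit = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7)).fit(disp=False)

# Forecast two weeks ahead with 95% prediction intervals; observed values
# falling outside these intervals could be flagged as anomalous.
forecast = fit.get_forecast(steps=14)
print(forecast.predicted_mean)
print(forecast.conf_int(alpha=0.05))
```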

Ideal For:

Broadly applicable.

Exponential Smoothing

The term “exponential smoothing” refers to a class of models that use past observations and exponential weights to smooth a time series. There is an entire taxonomy of exponential smoothing models (here is an excellent reference) that enables a user to inject as much or as little sophistication into his/her model as necessary. It is also worth pointing out that there are equivalences/overlaps between ARIMA and exponential smoothing that make the advantages/disadvantages of each fairly similar.

Advantages:

  • The basic idea behind exponential smoothing is perhaps more intuitive than ARIMA — we assign more weight to past values that occur closer to the present, making our predictions less sensitive to older observations.
  • The number of smoothing parameters is small and can be optimized efficiently, allowing for quick, yet trustworthy, forecasts.

Disadvantages:

  • When forecasting with basic exponential smoothing, one is only able to obtain point forecasts. As such, the inability to generate a prediction interval (or some semblance of one) makes this approach less viable (if usable at all) for anomaly detection. (However, ETS models, which fold in error terms, would be appropriate here.)
  • The taxonomy of exponential smoothing models requires the user to think a bit about what model is appropriate for his/her time series before diving right into forecasting — do we need to account for trends? Seasonality? Are those components additive or multiplicative? Although exponential smoothing is fast/flexible, this extra wrinkle may cause users to seek out an alternative, “hands-off” method.

Steps:

  • Explore time series to determine appropriate exponential smoothing model (seasonality/trend/additive/multiplicative?).
  • Optimize smoothing parameters.
  • Generate forecasts (the way in which you do so depends on the specific model).
  • If interested in anomaly detection, ETS models would be more appropriate here, as you can generate prediction intervals that provide a systematic way of identifying anomalies.
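
For a concrete (if simplified) picture of this workflow in Python, here is a Holt-Winters fit with statsmodels; the monthly frequency and the additive trend/seasonal choices are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with an additive trend and yearly seasonality.
rng = pd.date_range("2013-01-01", periods=60, freq="MS")
y = pd.Series(
    50
    + 0.5 * np.arange(len(rng))
    + 10 * np.sin(2 * np.pi * np.arange(len(rng)) / 12)
    + np.random.normal(0, 2, len(rng)),
    index=rng,
)

# Additive trend and seasonality; the smoothing parameters are optimized
# automatically inside .fit().
fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()

# Point forecasts for the next year (for prediction intervals, an ETS state
# space model would be the more appropriate choice, as noted above).
print(fit.forecast(12))
```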

Ideal For:

Broadly applicable.

Bayesian Structural Time Series (BSTS) Models

Bayesian structural time series (BSTS) models are just as you might imagine from their name — a happy marriage of Bayesian statistics with structural time series models. In essence, structural time series models are characterized by two overarching equations: (1) the observation equation, which relates the observed data to a set of latent state variables, and (2) the transition equation, which encodes the time dependence of the state variables. Through the use of priors on all model parameters and a conditional distribution on the parameters that govern the initial state, we put the “B” in BSTS.

Advantages:

  • Within a BSTS framework, one is able to turn all of the knobs and levers, if you will. Because you literally write down your model explicitly, you can insert any/all of the moving parts that you would like under the hood (suitable priors, seasonal/trend terms, etc.). This offers an added layer of interpretability often missing from more traditional approaches.
  • Tying into the last point, two particular strengths unique to BSTS are (1) the control that one has over uncertainty and (2) the ability to incorporate feature selection via spike-and-slab priors. For anomaly detection, (1) is particularly relevant, as what is classified as an anomaly is often dependent upon how far predicted values stray from the observed values.
  • All of the components of the underlying model are modeled simultaneously, and the user has the ability to explore these components independent of each other by appropriately marginalizing over the posterior distribution.

Disadvantages:

  • Arguably, the only downside to this approach is the bit of extra thought/effort that crafting an appropriate model takes. Other popular approaches can be “blackboxed” and used to generate predictions fairly quickly; this, however, requires one to (1) specify an appropriate model, (2) sample the posterior distribution, and (3) use the posterior samples to get what you need to tell a story.

Steps:

  • Write down your statistical model (likely motivated by something previously discussed, such as one from the taxonomy of exponential smoothing models) with suitable priors.
  • Use MCMC to sample posterior distribution.
  • If using BSTS for forecasting, make forecast via the posterior predictive distribution.
  • If using BSTS for anomaly detection, use either the mean/median of or quantiles from the posterior predictive distribution to get “predicted” values/intervals. If using intervals, one might flag observed values that fall outside the prediction intervals as anomalous.
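
As one hedged example of these steps in Python, here is a bare-bones local-level model (the simplest structural time series) written for PyMC; the priors and noise scales are arbitrary choices for illustration, and a real application would add seasonal/trend components, regression terms with spike-and-slab priors, and a posterior predictive forecast.

```python
import numpy as np
import pymc as pm

# Hypothetical observed series: a slowly drifting level plus observation noise.
true_level = 10 + np.cumsum(np.random.normal(0, 0.3, 200))
y = true_level + np.random.normal(0, 1.0, 200)

with pm.Model():
    # Priors on the transition and observation noise scales (arbitrary choices).
    sigma_level = pm.HalfNormal("sigma_level", sigma=1.0)
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=1.0)

    # Transition equation: the latent level follows a Gaussian random walk.
    level = pm.GaussianRandomWalk(
        "level", sigma=sigma_level, init_dist=pm.Normal.dist(0, 10), shape=len(y)
    )

    # Observation equation: observed data is the latent level plus noise.
    pm.Normal("obs", mu=level, sigma=sigma_obs, observed=y)

    # Sample the posterior with MCMC (NUTS by default).
    trace = pm.sample(1000, tune=1000, chains=2)
```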

Ideal For:

Broadly applicable, but probably the most practical when forecasts come with actual risk and/or the user wants more control over uncertainty.

Tree-Based Methods (CART, RF)

Tree-based methods (classification and regression trees, random forests, etc.) are also relevant for time series data. The literature on these methods is extensive, but the basic idea is that we can train a tree/trees using features constructed from our time series (lagged terms, Fourier terms from seasonal components, other exogenous predictors, etc.) to predict future values.

Advantages:

  • The user has the ability to introduce additional features outside of the observed time series values.
  • Trees are relatively simple to construct/train (vanilla scikit-learn and/or similar R packages will do the trick) with a small number of parameters that can be tuned without much trouble. These packages often handle missing data internally as well (other approaches may barf if the data is not evenly spaced and/or require imputation).
  • They are also interpretable. One can examine the splits within trees and/or check out feature importance to provide context for predictions.

Disadvantages:

  • One of the biggest criticisms of tree-based methods is that they cannot predict values that fall outside the range of values contained in the training set. For volatile time series, this would result in poor performance if one is solely interested in forecasting. For anomaly detection, however, this could, in theory, be advantageous, as the model would be unable to reproduce anomalous values (provided that you haven’t included anomalous values during training, of course) while still performing reasonably well on the rest of the dataset.

Steps:

  • Construct set of features from training set. As mentioned earlier, this could include lagged values of the response variable, features associated with seasonal/trend components, or other exogenous features.
  • Use the trained tree to make predictions for the test set. If forecasting with lagged values beyond the first time point after the training set, one can either feed earlier predictions back in as lagged values or use the observed values of the target variable as they become available (a minimal sketch follows this list).
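
To make that concrete, here is a small sketch in Python with pandas and scikit-learn; the seven lagged values and the day-of-week feature are arbitrary choices for a hypothetical daily series with weekly seasonality.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical daily series with weekly seasonality.
rng = pd.date_range("2017-01-01", periods=400, freq="D")
df = pd.DataFrame(
    {
        "y": 20
        + 5 * np.sin(2 * np.pi * np.arange(len(rng)) / 7)
        + np.random.normal(0, 1, len(rng))
    },
    index=rng,
)

# Construct features: the previous seven values plus a day-of-week indicator.
for lag in range(1, 8):
    df[f"lag_{lag}"] = df["y"].shift(lag)
df["dayofweek"] = df.index.dayofweek
df = df.dropna()

# Chronological train/test split (no shuffling for time series).
train, test = df.iloc[:-30], df.iloc[-30:]
features = [c for c in df.columns if c != "y"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[features], train["y"])

# One-step-ahead predictions on the held-out window; feature importances give
# a rough sense of which inputs drive the predictions.
preds = pd.Series(model.predict(test[features]), index=test.index)
print(preds.head())
print(dict(zip(features, model.feature_importances_.round(3))))
```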

Ideal For:

Stable, bounded series (as one would not need to worry about any potentially funky behavior near the boundaries of the training set).

Neural Networks (RNNs, LSTMs)

(Author’s Note: I am currently trying to learn deep learning deeply (pun intended?), so this bit is a continual work in progress.)

RNNs (recurrent neural networks)/LSTMs (long short-term memory networks) are neural networks that contain connections between nodes that allow them to “remember” things about values that have been fed into them previously, rendering them relevant for any task in which we want to make use of any underlying temporal structure. The difference between RNNs and LSTMs is that the latter is better suited for learning long-term dependencies.

Advantages:

  • With RNNs/LSTMs, we are (potentially) able to capture all of the complex structure (trends, seasonality, deviations from stationarity, nonlinear behavior, etc.) that we work hard to find/remove using more traditional approaches without having to do it ourselves.

Disadvantages:

  • One will likely sacrifice model interpretability for accuracy, and in many situations, the former will outrank the latter.
  • To really have an edge over more traditional methods, one would need a fairly long (i.e., well sampled) time series with a good amount of complex structure. Otherwise, you may only see marginal improvement.
  • Neural networks are particularly prone to overfitting and require the hand of an experienced user. This can be mitigated, of course, if one exercises caution while training the network (proper regularization, decent amount of data, etc.), but other methods don’t carry quite as many traps. In the context of anomaly detection, this is particularly relevant, as one would not want to have a model that identifies anomalous behavior as “normal”.

Steps:

  • Prior to feeding your data into an LSTM, you’ll want to normalize/standardize your input features — the choice of how you do this is yours, but putting the features on a common scale is necessary for the gradient calculations during backpropagation to behave well.
  • After splitting your time series up into training/test sets, you’ll also want to further divide it into samples to feed into your LSTM. Let’s say you have hourly data for a year’s worth of some time series. Rather than stick it into your LSTM in one chunk (you wouldn’t be making use of the “M” component of your LSTM), you’ll need to slice it into smaller pieces. For instance, you might try doing this in one-day pieces (the features are twenty-three hours’ worth of data, and you predict the value of the target variable at the twenty-fourth hour). I’m sure that there are some rules of thumb that folks use to determine the optimal sample size, but it seems reasonable to choose one that makes sense for your data or do some trial and error.
  • When it comes to properly defining a neural network architecture (number of hidden layers/nodes, regularization parameters, etc.), I admit that I still have a lot to learn. I have used Keras/TensorFlow in my travels, and when I have used LSTMs, I have just tinkered around with the architecture until I got something reasonable.
  • Forecast uncertainty with LSTMs also looks to still be an active area of research (check out this sweet article from Uber). This added wrinkle may be problematic when it comes to using LSTMs for anomaly detection, but it may not be as much of an issue if one is purely interested in forecasting.
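
Here is a stripped-down version of that workflow in Python using Keras; the 23-hour window, single LSTM layer, and training settings are placeholders to tinker with rather than a recommended architecture.

```python
import numpy as np
from tensorflow import keras

# Hypothetical hourly series for one year, standardized before windowing.
n_hours = 24 * 365
series = (
    10
    + np.sin(2 * np.pi * np.arange(n_hours) / 24)
    + np.random.normal(0, 0.2, n_hours)
)
series = (series - series.mean()) / series.std()

# Slice into samples: 23 hours of history to predict the 24th hour.
window = 23
X = np.array([series[i : i + window] for i in range(len(series) - window)])
y = np.array([series[i + window] for i in range(len(series) - window)])
X = X[..., np.newaxis]  # shape: (samples, timesteps, features)

# A small LSTM; layer sizes and epochs are starting points, not recommendations.
model = keras.Sequential(
    [
        keras.Input(shape=(window, 1)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1),
    ]
)
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.1, verbose=0)

# Predict the next (standardized) value from the most recent window.
print(model.predict(series[-window:].reshape(1, window, 1)))
```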

Ideal For:

Long time series with a fair amount of complex structure (perhaps if/when all of the other approaches have proved to be futile, too).

Conclusion

While these are some of the most trendy statistical methods used for working with time series data, it is important to remember to think carefully about the problem that you are trying to solve before choosing which one (if any) to use. This list is also not meant to be exhaustive — this area is an active field of research, so keep an eye out for the latest and greatest approaches to these types of problems!
