A Brief History of Time Series Models

13 min readOct 15, 2022

[updated on December 11, 2023]

TL;DR: For folks who are interested in learning more about time series models, below is an incomplete roadmap that attempts to summarize the development of this complex, fast evolving field.

M Competition is the equivalence of ImageNet to computer vision for time series model and deep learning beat traditional statistical models for the first time in M4 that took place in 2018 despite all the advancement in computer visions and NLP prior to that. Two latest deep learning models from Google, i.e., TSMixer and TiDE, appear to show promising results. However, the best model seems to depend on the dataset. So, we need to try multiple models. If your time series is too short for deep learning model, you may want to try cross learning by using the M4 data as shown in ESRNN. As always, if the simpler approach such as Exponential Smoothing and ARIMA(X) works, it is unnecessary to go nuclear with deep learning. For instance, the Theta model won the M3 competition. Finally, if you can only afford one Python package for time series model, you won’t regret by going with Darts.

Competitions

M Competition is the equivalence of ImageNet to computer vision for time series model. Traditional statistical models had always dominated the competitions until M4 where a hybrid approach of Exponential Smoothing and RNN proposed by Uber called ESRNN won the competition. There are typically several publications after the M competitions that summarize the findings and we can learn a lot from them. M6 is the latest competition focusing on financial time series. Below is a list of relevant competitions (for M competition, the year refers to the year of publication). Obviously, there are also multiple relevant competitions on Kaggle but they are not included here.

1982 — M1
1993 — M2
2000 — M3: Theta model was the winner
2006 — NN3
2008 — NN5
2011 — Tourism forecasting competition
2020 — M4: ESRNN was the winner (ended in May 2018)
2021 — M5: LightGBM was the winner on Walmart hierarchical time series sales data (ended in June 2020)
2023 — M6: Competition on financial time series to study Efficient Market Hypothesis (EMH)
202X — M7: Coming soon

M6 competition consisted of three categories which were won using different techniques:

Best Overall — a meta-learning approach that is based on an encoder-decoder hypernetwork with rank optimization strategy
Best Forecasting — Bayesian model using PyMC
Best Investment Decision — AutoML using AutoTS

Statistical and ML Models

Below is a chronology of both statistical and ML time series models. It is not a comprehensive list but should capture the key development. Also, the timeline is not exact as the year when something was “formally” released or published can sometimes be ambiguous.

1950s — Exponential Smoothing
1970s — ARIMA(X)
1980 — VAR
1980s — GARCH
2000 — Theta model
2011 — TBATS
2013 — Bayesian Structural Time Series — BSTS (Google)
2014 — XGBoost (University of Washington)
2016 — LightGBM (Microsoft)
2017 — CatBoost (Yandex)
2017 — Prophet (Facebook)
2017 — DeepAR (Amazon)
2017 — Multi-Horizon Quantile Recurrent Forecaster — MQRNN (Amazon)
2018 — Deep State Space Model — DSSM (Amazon)
2018 — Multivariate Bayesian Structural Time Series — MBSTS (UC Santa Barbara)
2018 — Temporal Convolution Network — TCN (CMU)
2018 — Temporal Graph Convolutional Network — T-GCN (Central South University, Changsha, China)
2018 — ESRNN (Uber)
2019 — N-BEATS (ElementAI)
2019 — AR-Net (Facebook)
2020 — NeuralProphet (Facebook)
2020 — Temporal Fusion Transformer — TFT (Google)
2021 — ThymeBoost
2021 — Greykite/Silverkite (Linkedin)
2021 — Orbit (Uber)
2022 — N-HiTS (CMU/Unity Technologies/Nixtla)
2023 — PatchTST (Princeton/IBM)
2023 — TimesNet (Tsinghua University, Beijing, China)
2023 — TSMixer (Google)
2023 — Time-series Dense Encoder — TiDE (Google)
2023 — TimeGPT (Nixtla)
2023 — Lag-Llama (Morgan Stanley/ServiceNow/University of Montreal/McGill University/Mila)
2024 — TimesFM (Google)
2024 — MOIRAI (Salesforce)
2024 — Chronos (Amazon)
2024 — MOMENT (CMU and UPenn)

From Exponential Smoothing in the 1950s and ARIMA(X) in the 1970s to TBATS in 2011 and BSTS in 2013, traditional statistical models dominated the time series field especially for univariate forecasting until ESRNN won M4 in 2018. Although NLP essentially works with sequences, a lot of the advancement in NLP cannot be translated to time series effectively because regular time series data lacks the deep structure prevalent with text data. So, in order to make deep learning works for time series, specific neural network architecture is required. N-BEATS is one such example and it outperforms ESRNN for M4 data while N-HiTS subsequently become the state-of-the-art.

Deep learning models for time series generally fall into one of the following categories:

Extending classical model typically with RNN, e.g., DeepAR, ESRNN, AR-Net
Architecture based on multi-layer perceptron (MLP) typically including a residual connection, e.g., N-BEATS, N-HiTS, TSMixer, TiDE; I found this group of models to be the most useful in practice
Transformer-based architecture includes many proposed models such as Autoformer, Informer and FEDformer but recent work (Zeng et al., 2022) has shown that they can be easily outperformed by simple linear model; PatchTST is the latest transformer model introduced to address the shortcoming of previous transformer-based forecasting approaches; Temporal Fusion Transformer (TFT) is useful for multivariate forecasting using covariates and auxiliary features which PatchTST cannot do
CNN-based architecture, e.g., TCN and TimesNet but I have not found them to be useful in the past
Amid the AI frenzy generated by ChatGPT, Nixtla introduced TimeGPT (which is currently still in closed beta); it is a foundation model pre-trained on 100 billion data points from a broad array of domains; it supports zero-shot inference and fine-tuning; Lag-Llama, TimesFM, MOIRAI, Chronos and MOMENT are other examples of foundation model.

ESRNN introduced cross learning that is critical for common business application. Deep learning typically requires a lot of data. While an electricity demand forecasting problem with minute by minute weather data for a large region over the last ten years can easily satisfy the most complex deep neural network, most business applications deal with monthly or quarterly data say for the last ten years (if you are lucky). Instead of building one model for each time series like a traditional statistical model would do, ESRNN feed all the data into one complex model to forecast multiple time series. For example, in M4, there are 24,000 quarterly time series that originated from different domains covering different time period. This is particularly relevant when you have hierarchical data such as different products in the same store or same product across many stores.

N-BEATS introduced another important approach of ensembling models with different input horizons (2x to 7x of forecast horizon), metrics and random initializations for bagging. They found that it is a more helpful regularization technique compared to using dropout or L2 norm penalty.

Gradient boosting algorithms such as XGBoost and LightGBM are not exactly time series models but one can often reduce a time series problem into something more cross-sectional-like. For example, LightGBM won the M5 competition for hierarchical time series. ThymeBoost is a gradient boosting model designed specifically for time series but it is still relatively new. Currently, it does not consistently outperform ESRNN for M4 data.

If you are working with time series that has a heavy seasonal component with strong holiday effect (think Facebook and Linkedin), you may find Prophet and Greykite helpful but they are not based on deep learning. Facebook subsequently adapted AR-Net into Prophet and created NeuralProphet which is essentially a deep learning extension of ARIMA just like how ESRNN is a deep learning extension of Exponential Smoothing.

Empirical Results

Input length L = 512 for TSMixer, TFT and PatchTST but 720 for TiDE and 336 for DLinear; prediction length of T = 96; MSE metrics reported in “TSMixer: An All-MLP Architecture for Time Series Forecasting”, Chen et al. (2023) and “Long-term Forecasting with TiDE: Time-series Dense Encoder”, Das et al. (2023); for information on the datasets, please refer to (Wu et al., 2021); best metrics are in bold

Table above summarizes the performance comparison for a few important models using multivariate time series without covariates and auxiliary features. The latest MLP-based models, i.e., TSMixer and TiDE, tend to produce the best forecast. DLinear (Zeng et al., 2022) is a simple linear model that serves as an important baseline.

M5 competition metric (WRMSSE) from Chen et al., 2023 and Das et al., 2023. The results for ES_bu and the M5 winner’s LightGBM model are reported in “The M5 Accuracy competition: Results, findings and conclusions”, Makridakis et al. (2020); prediction length T = 28

Table above shows the results using M5 data that contain covariates and auxiliary features. TiDE is the best among all the deep learning models but did not outperform the winner of M5 who used LightGBM. ES_bu (Exponential Smoothing with bottom-up reconciliation) is the top performing benchmark in M5.

The lesson learned here is that we need to try multiple models because the best model seems to depend on the dataset as concluded by this paper titled “Unified Long-term Time-series Forecasting Benchmark” (still under review as a conference paper at ICLR 2024). To efficiently try multiple models, you will need a good Python package to make your life easier.

Packages

Below are some Python packages that are useful for time series forecasting and they implement some of the algorithms mentioned above. Darts is particularly useful for trying multiple advanced algorithms along with helpful functions such as rolling cross validation etc. which are also provided by sktime. AutoML is a relatively new entrant for time series models with AutoTS, PyCaret and some automated feature engineering libraries. Note that one interesting feature engineering method is to apply Fourier transformation on calendar dates (more discussion here).

If you miss the old-school econometrics concepts such as Augmented Dickey Fuller test for stationarity, Breusch-Pagan test for heteroskedasticity and Ljung-Box test for autocorrelation, statsmodels will prove to be handy. “Forecasting: Principles and Practice” is a classic text book for old-school time series models.

statsmodels — Standard statistical models such as ARIMA(X), exponential smoothing, Theta, VAR and useful tools such as various statistical tests, acf/pacf plots, time series decomposition etc.
pmdarima — Automated ARIMA
sktime (2019) — Sklearn-like library for time series with AutoETS and AutoARIMA including utility for rolling cross validation and grid search
GluonTS (2019) — Similar to Darts but with less model options
PyTorch Forecasting (2020) — State-of-the-art time series forecasting with deep neural networks
Darts (2020) — A hybrid of sktime and PyTorch Forecasting
AutoTS (2020) — AutoML package that offers sklearn-style interface; won the decision category in M6
PyCaret (2021) — AutoML for time series recently integrated into the stable release
NeuralForecast (2022) — User friendly state-of-the-art neural forecasting models; tend to include the implementation for the model from the latest research
tsfresh and tsfeatures — Automated feature engineering for time series

Appendix

Here is a quick summary for a few selected models mentioned above. It is impossible to do justice to them with the short descriptions below. So, I would recommend that you do more research to further your understanding.

Exponential Smoothing

One naive forecast approach is to simply use the last observed value as the next prediction. One could also take the simple average of all observed data to make the forecast. Exponential smoothing is in between these two extreme approaches giving larger weights to more recent observations.

This simple concept was subsequently extended to have an error/level, trend and seasonal component with either additive or multiplicative formulation.

“Forecasting: Principles and Practice”, Hyndman et al. (2021)

ARIMA(X)

An autoregressive model uses the lagged values of the target as predictors in a regression. An autoregressive model of order p can be written as:

Rather than using the past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.

These two approaches can be combined into one framework which was subsequently extended to capture seasonality and additional regressors. Note that the time series has to be stationary to satisfy the underlying assumptions of the model. Typically taking the first difference of the time series would render it stationary.

Theta Model

The original Theta method proposes the decomposition of the de-seasonalized data into two theta lines. The first theta line removes completely the curvature of the data, thus being a good estimator of the long term trend component. The second theta line doubles the curvature of the series, as to better approximate the short-term behavior. A generalization of the Theta method was subsequently proposed to optimize the selection of the second theta line.

TBATS

This model forecasts time series with complex multiple seasonal patterns using exponential smoothing. The acronym stands for:

T — Trigonometric seasonality
B — Box-Cox transformation
A — ARIMA errors
T — Trend
S — Seasonal components

Prophet

A modular regression model with interpretable parameters that can be intuitively adjusted by analysts with domain knowledge about the time series. The specification is similar to a generalized additive model (GAM) framing the forecasting problem as a curve fitting exercise with interpretable parameters and components. Also, additional regressors can be accommodated. Note that the default behavior is to rely on the first 80% of the data for the trend component when making future forecast. This may or may not make sense for your application.

“Forecasting at Scale”, Taylor et al. (2017)

ESRNN

ESRNN is a hybrid approach combining Exponential Smoothing with RNN. Each time series is first decomposed into its level, trend and seasonality components by the multiplicative Exponential Smoothing method. Then, the RNN focuses on learning non-linear trends on the de-seasonalized and normalized values. At a high level the model consists of dilated LSTM-based stacks with Resnet-style shortcut when there are two or more blocks in the LSTM stacks.

The model uses the quantile/pinball loss function which minimizes the quantile of the target variable. It also adds a penalty on the variance or wiggliness of the predictions. Smyl (2019) suggested that he would not have won the M4 competition without it. Intuitively, the level should be a smooth version of the time series, with no seasonality patterns. It turned out that the smoothness of level helped substantially the forecasting accuracy. It appears that when the input to the NN was smooth, the NN concentrated on predicting the trend, instead of overfitting on some spurious, seasonality-related patterns. A smooth level also means that the seasonality components properly absorbed the seasonality.

“A hybrid method of Exponential Smoothing and Recurrent Neural Networks for time series forecasting”, Smyl (2019)

N-BEATS

This is the first work to empirically demonstrate that pure DL using no time-series specific components outperforms well-established statistical approaches on M3, M4 and TOURISM datasets. It focuses “on solving the univariate time series point forecasting problem using deep learning”. Subsequently, the Darts package adapts the original N-BEATS architecture to multivariate time series by flattening the source data to a 1-dimensional series. So, you can include additional regressors as features.

Note that N-BEATS is not based on a recurrent architecture such as LSTM. N-BEATS uses a simple but powerful architecture of ensembled feed-forward networks with a novel hierarchical doubly residual topology of forecasts and ‘backcasts’. Previous block removes the portion of the signal that it can approximate well, making the forecast job of the downstream blocks easier. The proposed architecture design generalizes well across time series of different nature while ESRNN had to use very different architectures hand crafted for different horizons. Finally, if interpretability is important for your application, this model offers an “interpretable” architecture consisting of two stacks: A trend stack and a seasonality stack.

“N-BEATS: Neural basis expansion analysis for interpretable time series forecasting” , Oreshkin et al. (2020)

Temporal Fusion Transformer (TFT)

TFT is an attention-based DNN designed to explicitly align the model with the general multi-horizon forecasting task, i.e. predicting variables-of-interest at multiple future time steps, for both accuracy and interpretability. TFT supports 3 types of features: i) temporal data with known inputs into the future ii) temporal data known only up to the present and iii) exogenous categorical/static variables. See more discussion here.

The architecture integrates the mechanisms of several other neural architectures:

A temporal multi-head attention block that identifies the long-range patterns
LSTM sequence-to-sequence encoders/decoders to summarize shorter patterns
Gated residual network blocks, GRNs, that are to weed out the unimportant, unused inputs

“Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”, Lim et al. (2020)

DLinear

DLinear is a combination of a Decomposition scheme used in Autoformer and FEDformer with linear layers that directly regresses historical time series for future prediction via a weighted sum operation as shown below. It first decomposes a raw data input into a trend component by a moving average kernel and a remainder (seasonal) component. Then, two one-layer linear layers are applied to each component, and we sum up the two features to get the final prediction. Note that DLinear shares weights across different variates and does not model any spatial correlations. This simple linear model serves as an important baseline for more complex models.

“Are Transformers Effective for Time Series Forecasting?”, Zeng et al. (2022)

TSMixer

Extending on simple linear model, TSMixer contains interleaving time-mixing and feature-mixing MLPs to aggregate information. The number of mixer layer is denoted as N. The time-mixing MLPs are shared across all features and the feature-mixing MLPs are shared across all of the time steps. The design allow TSMixer to automatically adapt the use of both temporal and cross-variate information with limited number of parameters. It also includes residual connection and normalization layer. The author further extended it to capture auxiliary features such as static and future time-varying features using an align stage to concatenate all the features.

TiDE

Motivated by simple linear model, TiDE encodes the past of a time-series along with covariates using dense MLPs and then decodes the encoded time-series along with future covariates. The encoding section has a novel feature projection step followed by a dense MLP encoder. The feature projection step uses a residual block to perform dimensionality reduction. The decoder section consists of a dense decoder followed by a novel temporal decoder, which is a residual block that concatenate the decoded vector with future covariates. It serves as a “highway” for future covariates that have strong influence on the time series such as holiday effect on sales. Finally, the global residual connection ensures the simple linear model is always a subclass of this model.

Disclaimer

This article represents my own opinion and does not necessarily represent the opinions of my current and former employers.