An Overview of Time Series Forecasting Models Part 1: Classical Time Series Forecasting Models

Shailey Dash
21 min read · Mar 7, 2020


“The best qualification of a prophet is to have a good memory”

— Marquis of Halifax

Time Series forecasting of electricity

Time series forecasting is an important business application of forecasting. Practically everything a business or an enterprise does requires a prediction of future requirements, whether for sales, inputs and intermediates, or manpower, so that these can be budgeted and planned for. The techniques of time series forecasting are well developed, spanning both classical statistical approaches such as ARIMA modelling and newer deep learning approaches. This article is part of a sequence of articles on time series methods. It overviews the important classical techniques available for time series forecasting and delineates the key assumptions behind each technique and the scenarios where it can be applied. The focus is on the theory and intuition behind the various methods, the underlying assumptions and caveats, and relevant applications for each. It is not, however, a ‘how to’ guide: detailed Python code and discussion of libraries are kept for another day in the interest of keeping the discussion tractable, although a few brief illustrative sketches are included along the way.

There are a large number of techniques available for time series forecasting, including the ever-popular ARIMA model. With the advent of large volumes of digitized data, new methods of time series forecasting that leverage neural networks are becoming popular alongside tried and trusted methods such as exponential smoothing and ARIMA. In this scenario, for someone new to the area, which method to use, and the assumptions underlying that method, can be very confusing.

Importance of time series forecasting

Time series forecasting is one of the most applied data science techniques in business and is used extensively in finance, supply chain management, and production and inventory planning. Generating accurate and reliable forecasts is an important endeavour for many organisations, as it can lead to significant savings and cost reductions. With the advent of sensor data and advanced data storage capabilities, time series with higher sampling rates (sub-hourly, hourly, daily) are becoming more common in many industries. Below are a few examples of how the availability of more granular data can enable better forecasting:

· Time series forecasting can be used for demand management in the utilities industry (electricity and water usage). Such series may exhibit complex seasonality, including multiple seasonal patterns, non-integer seasonality, and calendar effects. Forecasting that takes these patterns into account can have a significant impact on demand management, both in the short run and over a longer horizon, leading to more efficient resource management

· Demand in the transportation, tourism, and healthcare industries can also be strongly influenced by multiple seasonal cycles, which can be forecasted and captured in the planning process

We now move to the basics of time series analysis.

What is Time Series Analysis?

A time series is a sequence of data points taken at successive, equally spaced points in time. Time series analysis comprises many different techniques, but it originally started as a special case of regression analysis. It differs from ordinary regression analysis in that the dependent variable is modelled as a function of its own past values. Hence, time series analysis can be described as a type of univariate analysis where the data set has only two dimensions: the variable itself and a time index.

A time series consists of four components:

(1) Seasonal variations: patterns that repeat over a specific period such as a day, week, month, or season

(2) Trend: a long-term increase or decrease in the data. It can be linear or non-linear

(3) Cyclical variations: rises and falls in the data that are not of a fixed period

(4) Random variations: fluctuations that do not fall under any of the above three classifications

The complex nature of time series data, with its seasonality, trend, and level components, poses numerous challenges for producing accurate forecasts. Figure 1 below shows just how noisy time series data can be. For forecasting purposes, it is important to identify and work with the parts of the series that are more systematically driven and hence can be forecasted.

Example of a noisy series (Source: https://en.wikipedia.org/wiki/Time_series)
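To make these components concrete, below is a minimal illustrative sketch (not code from this article's sources) that decomposes a synthetic monthly series into trend, seasonal, and residual parts using statsmodels' seasonal_decompose; all data in it are simulated.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Simulated monthly data: trend + yearly seasonality + random noise (hypothetical)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
values = (0.5 * np.arange(96)                                  # trend component
          + 10 * np.sin(2 * np.pi * np.arange(96) / 12)        # seasonal component
          + np.random.normal(0, 2, 96))                        # random variation
series = pd.Series(values, index=idx)

# Additive decomposition; period=12 because the seasonality repeats every 12 months
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))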

Univariate vs Multivariate Forecasting

Traditionally, most time series analyses are univariate in approach. Other variables are not incorporated because those features would themselves have to be predicted, and errors in their predicted values would propagate into the forecast of the variable of interest. Further, it has often been found empirically that pure time series models predict better than models that incorporate additional features. Given this, most time series models are based on historical values of the dependent variable alone. A key point to note is that this kind of time series analysis is used only for forecasting; it does not provide the descriptive or diagnostic insight offered by multiple regression techniques that incorporate other variables as well.

With the advent of deep learning, time series forecasting has become more complex and many more variants are possible. A univariate time series forecasting problem has only two variables: one is the date-time index, and the other is the field we are forecasting. For example, if we want to predict average temperature, univariate forecasting will consider only past values of the temperature variable. Multivariate forecasting would include not just lagged values of the temperature variable but also other relevant variables such as humidity, air pressure, and rainfall.
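For concreteness, here is a small sketch of how the two framings might look as pandas objects; the column names and values are hypothetical.

import pandas as pd

dates = pd.date_range("2020-01-01", periods=5, freq="D")

# Univariate framing: a date-time index plus the single field being forecast
univariate = pd.Series([21.3, 22.1, 20.8, 23.0, 22.5], index=dates, name="avg_temp")

# Multivariate framing: the target plus other relevant variables (all hypothetical)
multivariate = pd.DataFrame({
    "avg_temp": [21.3, 22.1, 20.8, 23.0, 22.5],
    "humidity": [0.61, 0.58, 0.65, 0.55, 0.60],
    "rainfall": [0.0, 1.2, 3.4, 0.0, 0.5],
}, index=dates)

print(univariate.head())
print(multivariate.head())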

Techniques of Forecasting

There are many techniques of time series forecasting: some are very simple, while others, such as neural networks, are more complex. This article overviews the most popular classical techniques. The objective is to provide you with enough knowledge to be able to choose when to use which technique.

We can broadly categorize the approaches to forecasting into four buckets. Within each there are many variants and hybrid varieties:

1. Simple Moving Average (SMA)

2. Exponential Smoothing (SES)

3. Autoregressive Integrated Moving Average (ARIMA)

4. Neural Network (NN) based methods such as LSTM

We begin by overviewing the standard or classical time series models, which are primarily univariate and have a strong base in statistics. Time series forecasting with neural networks is a relatively new technique and requires its own elaboration. In this article we focus on the classical methods of time series forecasting; the second article in this series picks up time series methods using deep learning.

There are also multivariate approaches, such as the Vector Autoregression (VAR) model, which is essentially a generalization of the univariate ARIMA method to multiple variables. For simplicity we focus on univariate forecasting, as that is the main base of classical forecasting methods. We discuss each of these models briefly, highlighting the theoretical underpinnings, the assumptions, the scenarios where it is best suited, and any caveats that need to be kept in mind when using it.

1. Simple Moving Average (SMA)

The simple moving average is the easiest method of forecasting. It is the average of a fixed-size subset of periods in a time series. The average ‘moves’ through the series by dropping the oldest observation of the previously averaged group and adding the next observation for each successive average.

Moving averages are usually plotted as a line chart to give an idea of the overall trend in the series. They can be useful in confirming the direction of a trend or visualising its magnitude. The basic assumption behind averaging and smoothing models is that the time series is locally stationary with a slowly varying mean. Hence, we take a moving (local) average to estimate the current value of the mean and then use that as the forecast for the near future, i.e., for very short-term forecasting.

The forecast for the value of Y at time t+1 that is made at time t equals the simple average of the most recent m observations:

Ŷt+1 = (Yt + Yt-1 + … + Yt-m+1) / m

Moving averages are lagging indicators: when the variable is in an uptrend, the moving average will underestimate it, since it averages in earlier, lower values; the reverse holds in a downtrend. The average age of the data in a simple moving average is (m+1)/2 periods relative to the period for which the forecast is computed; this is the amount of time by which forecasts will tend to lag behind turning points in the data. Equivalently, the average in an SMA is said to be centred at period t-(m+1)/2, so the estimate of the mean, and hence the forecast value, tends to lag the true value by about (m+1)/2 periods.

The figure below shows a series that appears to exhibit random fluctuations around a slowly varying mean, with a 5-term simple moving average shown by the blue line. The average age of the data in this forecast is 3 (=(5+1)/2), so it tends to lag behind turning points by about three periods. For example, a downturn seems to have occurred at period 21, but the forecasts do not turn around until several periods later. If m=1, the simple moving average (SMA) model is equivalent to the random walk model (without growth). If m is very large (comparable to the length of the estimation period), the SMA model is equivalent to the mean model.

Source: https://people.duke.edu/~rnau/411avg.htm

There is no theoretical rule for working out the exact window for the moving average. Typically, the longer the averaging period, the greater the smoothing effect of the moving average, but also the greater the lag behind turning points. The window length, m, can be treated as a parameter of the SMA forecasting model and adjusted to obtain the best “fit” to the data, i.e., the smallest forecast errors on average. A practical way to choose it is to compute forecasts for several candidate windows, calculate the RMSE (Root Mean Square Error) for each, and select the window with the lowest RMSE.
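A minimal sketch of this window-selection idea is shown below, assuming the data sit in a pandas Series; the synthetic series here is only a placeholder.

import numpy as np
import pandas as pd

def sma_rmse(series: pd.Series, m: int) -> float:
    # RMSE of forecasting Y[t+1] with the mean of the previous m observations
    forecast = series.rolling(window=m).mean().shift(1)  # forecast made at t for t+1
    errors = (series - forecast).dropna()
    return float(np.sqrt((errors ** 2).mean()))

# Hypothetical noisy series with a slowly varying mean
series = pd.Series(np.random.normal(100, 5, 200)).cumsum() / 10

for m in (3, 5, 7, 12):
    print(f"m={m:>2}  one-step RMSE={sma_rmse(series, m):.3f}")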

When and why to use SMA?

1. The SMA is a simple method that is easy to understand, and it is often preferred to more rigorously statistical methods

2. It gives a good visual of the trend and smooths out short-term fluctuations. It also reduces the effect of extreme values

3. On the con side, the method offers no statistical methodology for determining the averaging window

2. Exponential Smoothing (SES)

Simple exponential smoothing (SES) methods are the next step up from the SMA technique. This is another popular technique for smoothing a time series. A moving average is a simple average in which all observations receive equal weight; the exponential average assigns weights that decrease over time. Intuitively, this makes sense, as past data should be discounted in a more gradual fashion: for example, the most recent observation should get a little more weight than the second most recent, and so on. The simplest form of SES is given by the formula:

Lt = αYt + (1-α) Lt-1

Where α is the smoothing factor, and 0 < α < 1. The smoothed statistic Lt is a simple weighted average of the current observation, Yt, and the previous smoothed statistic, Lt-1. The parameter α controls the closeness of the interpolated value to the most recent observation. Values of α close to one have less of a smoothing effect and give greater weight to recent changes in the data, while values of α closer to zero have a greater smoothing effect and are less responsive to recent changes. This is shown in the figure below, where the α = 0.2 line (red) is much smoother than the α = 0.8 line (blue).

Source: https://rstudio-pubs-static.s3.amazonaws.com/362102_b6f0aa9a1e5a4192b444ee9ff77be8b5.html
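To make the recursion concrete, here is a small hand-rolled SES sketch (illustrative only, not production code) showing how two values of α trade smoothness against responsiveness; the input data are synthetic.

import numpy as np

def ses(data, alpha):
    # Lt = alpha * Yt + (1 - alpha) * Lt-1, with L0 initialised to the first observation
    smoothed = [data[0]]
    for y in data[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

data = np.random.normal(50, 3, 100).cumsum() / 5   # hypothetical noisy series

smooth_low = ses(data, alpha=0.2)    # heavier smoothing, slower to react to changes
smooth_high = ses(data, alpha=0.8)   # lighter smoothing, tracks recent changes closely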

Is the SES model an improvement on the SMA model? For a given average age (i.e., amount of lag), the simple exponential smoothing (SES) forecast is somewhat superior to the simple moving average (SMA) forecast because it places relatively more weight on the most recent observation, i.e., it is slightly more “responsive” to changes occurring in the recent past. For example, an SMA model with 9 terms and an SES model with α=0.2 both have an average age of 5 for the data in their forecasts, but the SES model puts more weight on the last 3 values than the SMA model does, and at the same time it does not entirely “forget” about values more than 9 periods old, as shown in the figure below.

Source: https://people.duke.edu/~rnau/411avg.htm

Variants and extensions of SES model

The SES model is only one type of exponential smoothing; there are others that address some of its limitations. The key limitation of SES forecasting is that it assumes there is no trend of any kind in the data. The long-term forecasts from the SES model are a horizontal straight line, as in the SMA model and the random walk model without growth. This is a limitation for anything except very short-term forecasting, particularly if there is a short-term or long-term trend in the data. Various extensions of the SES model have been developed to get around this problem. The main ones are:

1. Holt’s Linear Exponential Smoothing (LES), or Double Exponential Smoothing

2. The Holt-Winters Model, or Triple Exponential Smoothing

The SES model can also be modified to incorporate a constant linear trend; this is known as Brown’s linear exponential smoothing. An SES model is actually a special case of an ARIMA model, a point explained further on.

Holt’s Linear Exponential Smoothing

To incorporate a linear trend into the model, Holt’s LES model uses two smoothing constants: one for the level (α), which was already there in SES, and one for the trend (β). At any time t, there is an estimate Lt of the local level and an estimate Tt of the local trend. These are computed recursively from the value of Y observed at time t and the previous estimates of the level, Lt-1, and trend, Tt-1, by two equations that apply exponential smoothing to them separately.

If the estimated level and trend at time t-1 are Lt-1 and Tt-1, respectively, then the forecast for Yt that would have been made at time t-1 is equal to Lt-1 + Tt-1. The model is set up as follows:

Equation to estimate the level:

Lt = αYt + (1-α)(Lt-1 + Tt-1)

Equation to estimate the trend:

Tt = β(Lt - Lt-1) + (1-β)Tt-1

The change in the estimated level, Lt - Lt-1, can be interpreted as a noisy measurement of the trend at time t. The updated estimate of the trend is then computed recursively by interpolating between Lt - Lt-1 and the previous estimate of the trend, Tt-1, using weights of β and 1-β, as in the trend equation above.

The forecasts for the near future that are made from time t are obtained by extrapolating the updated level and trend:

Ŷt+k = Lt + kTt

The interpretation of the trend-smoothing constant, β, is analogous to that of the level-smoothing constant, α. Models with small values of β assume that the trend changes very slowly over time, while models with larger β assume that it is changing more rapidly.
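A minimal sketch of Holt’s linear method follows, implementing the level and trend updates above directly; the initialisation choice and the toy data are assumptions made purely for illustration.

import numpy as np

def holt_linear(data, alpha, beta, horizon=5):
    level, trend = data[0], data[1] - data[0]   # simple (assumed) initialisation
    for y in data[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)     # level update
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend update
    # Forecasts extrapolate the final level and trend: Lt + k*Tt
    return [level + k * trend for k in range(1, horizon + 1)]

data = np.linspace(10, 30, 60) + np.random.normal(0, 1, 60)  # trending toy series
print(holt_linear(data, alpha=0.5, beta=0.3, horizon=3))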

Which type of trend extrapolation is best: horizontal or linear? Empirical evidence suggests that, if the data have already been adjusted (if necessary) for inflation, it may be imprudent to extrapolate short-term linear trends very far into the future. Trends evident today may slacken in the future due to varied causes such as product obsolescence, increased competition, and cyclical downturns or upturns in an industry. For this reason, simple exponential smoothing often performs better out-of-sample than might otherwise be expected, despite its “naive” horizontal trend extrapolation. The LES model therefore needs to be used with care, as it may not make sense to extrapolate short-term trends over many periods.

Holt Winters Model or Triple Exponential Smoothening

Triple exponential smoothing, also called the Holt-Winters method, is usually more reliable for data that shows both trend and seasonality. Holt (1957) and Winters (1960) extended Holt’s method to capture seasonality. The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level, Lt, one for the trend, Tt, and one for the seasonal component, St, with corresponding smoothing parameters α, β and γ. We use m to denote the frequency of the seasonality, i.e., the number of seasons in a year; for example, m=4 for quarterly data and m=12 for monthly data.

There are two variations of this method that differ in the nature of the seasonal component. The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations change in proportion to the level of the series. The Holt-Winters method is fairly complex and probably deserves its own article; I mention it here so that you are aware of its existence and of the scenarios in which it is used.
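As a brief illustrative sketch, the snippet below fits an additive Holt-Winters model with statsmodels on a synthetic monthly series; in real use the additive versus multiplicative choice should follow the guideline above.

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with trend and yearly seasonality
idx = pd.date_range("2016-01-01", periods=72, freq="MS")
y = pd.Series(100 + 0.8 * np.arange(72)
              + 12 * np.sin(2 * np.pi * np.arange(72) / 12)
              + np.random.normal(0, 3, 72), index=idx)

model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()            # estimates alpha, beta and gamma from the data
print(fit.forecast(12))      # forecast the next 12 months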

Which model to use can be tricky and depends on what we want to do. Some guidelines:

a. If we believe that trends in the past are important for determining current trends, we can use the LES model. If we are not sure whether there is a trend or not, we can use the SES model.

b. What type of trend extrapolation makes sense? Extrapolating trends over very long periods may not make sense, given that trends change with product life cycles, increased competition in a market, or a cyclical downturn. Hence we frequently find that SES performs better despite its naive assumption of a horizontal trend.

When to use the different types of exponential smoothing, and why?

SES

1. SES is usually used to make short-term forecasts. It is more effective than SMA because it gives higher weight to more recent data points, versus the equal weight given by SMA

2. SES cannot produce reliable longer-term forecasts, primarily because the method does not consider any trend in the data. Hence, we need to look at extensions of the SES model such as Holt’s double smoothing and the Holt-Winters triple smoothing

Holt’s double smoothing

1. If we know that there is a trend in the data, then this method can be used.

2. Extrapolating trends over very long periods may not make sense, given that trends change with product life cycles, increased competition in a market, or a cyclical downturn. Hence we frequently find that SES performs better despite its naive assumption of a horizontal trend

Holt Winters Model

1. Use this when there is seasonality in the data

We now move on to the third set of modelling techniques, which is extremely popular for time series forecasts, i.e., ARIMA models.

3. ARIMA Models

Univariate ARIMA(p,d,q) is a forecasting technique that projects the future values of a series based entirely on its own inertia, i.e., its lagged values. Its main application is in short-term forecasting, and it requires at least around 40 historical data points. It works best when the data exhibit a consistent pattern over time with a minimal number of outliers. Sometimes called Box-Jenkins (after the original authors), ARIMA is usually superior to exponential smoothing techniques when the series is reasonably long and the correlation between past observations is stable. If the data is short or highly volatile, some smoothing method may perform better; if you do not have at least 38 data points, you should consider some other method than ARIMA.[1]

Stationarity

Intuitively, a time series can be thought of as a sequence of realisations of a random variable over time. How do we forecast it? If everything is different tomorrow, then forecasting is impossible; we need to find the constant (time-invariant) component in the series. Stationarity is essentially this invariance of the series over time. Statistically, it is defined as the joint probability distribution remaining constant over time, which requires the parameters of the distribution, i.e., its mean and variance, to be constant. For example, if a series is consistently increasing over time, the sample mean and variance grow with the size of the sample, and they will always underestimate the mean and variance in future periods; hence such a series would be difficult to forecast.

The components of an ARIMA model:

When we talk about ARIMA (autoregressive integrated moving average), it is important to understand that this model is a generalization of the autoregressive moving average (ARMA) model. Non-seasonal ARIMA models are generally denoted ARIMA(p,d,q), where the parameters p, d, and q are non-negative integers.

Let us define the three components:

AR (autoregressive):

The AR part of ARIMA indicates that the evolving variable of interest is regressed on its own lagged (i.e., prior) values

p is the order (number of time lags) of the autoregressive model

I (Integrated Differencing):

The I (for “integrated”) indicates that the data values have been replaced with the differences between their values and the previous values (this differencing process may be performed more than once). This step is carried out to make the series stationary

d is the degree of differencing (the number of times the data have had past values subtracted)

MA (Moving average):

The MA part indicates that the regression error is actually a linear combination of error terms whose values occurred contemporaneously and at various times in the past

q is the order of the moving-average model.

The ARIMA forecasting equation for a stationary time series is a linear regression equation in which the predictors consist of lags of the dependent variable and/or lags of the forecast errors. That is:

Predicted value of Y = Constant + a weighted sum of one or more recent values of Y + a weighted sum of one or more errors

1. If the predictors consist only of lagged values of Y, it is a pure autoregressive model. For example, a first-order autoregressive (“AR(1)”) model for Y is a simple regression model in which the independent variable is just Y lagged by one period

2. If some of the predictors are lags of the errors, then it is no longer a linear regression model as “last period’s error” cannot be specified as an independent variable. The problem with using lagged errors as predictors is that the model’s predictions are not linear functions of the coefficients, even though they are linear functions of the past data. So, coefficients in ARIMA models that include lagged errors must be estimated by nonlinear optimization methods (“hill-climbing”) rather than by just solving a system of equations.

The ARIMA forecasting equation

The forecasting equation is constructed as follows. First, let y denote the dth difference of Y, which means:

If d=0: yt = Yt

If d=1: yt = Yt - Yt-1

If d=2: yt = (Yt - Yt-1) - (Yt-1 - Yt-2) = Yt - 2Yt-1 + Yt-2

In terms of y, the general forecasting equation is:

ŷt = μ + ϕ1 yt-1 + … + ϕp yt-p - θ1 et-1 - … - θq et-q

Note the moving average parameters (θ’s) are defined so that their signs are negative in the equation, following the convention introduced by Box and Jenkins.

Steps in implementing an ARIMA model

For simplicity, the steps are illustrated for a non-seasonal model. Once that process is understood, extending the model for seasonality is straightforward.

Step 1

Stationarize the series: we first need to determine the order of differencing (d) required to make the series stationary. This step is usually done in conjunction with a variance-stabilizing transformation such as logging or deflating. If no further terms are needed once the differenced series fluctuates around a constant mean, we have effectively fitted a random walk or random trend model.

1a. Check for stationarity of the series. There are many methods to check whether a time series (direct observations, residuals, otherwise) is stationary or non-stationary.

A. The simplest is to eyeball the plotted graph of the data

B. More rigorously, we can use the Augmented Dickey-Fuller (ADF) test, a type of statistical test called a unit root test. The intuition behind a unit root test is that it determines how strongly a time series is defined by a trend.

a. Null Hypothesis (H0): if it fails to be rejected, it suggests the time series has a unit root, meaning it is non-stationary and has some time-dependent structure

i. p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary

b. Alternate Hypothesis (H1): if the null is rejected, it suggests the time series does not have a unit root, meaning it is stationary and does not have a time-dependent structure

i. p-value <= 0.05: reject the null hypothesis (H0); the data does not have a unit root and is stationary

An example of the output of the ADF test for stationarity is shown below. Running the example prints a test statistic of about -4.8. The more negative this statistic, the more strongly we can reject the null hypothesis (i.e., the more likely it is that we have a stationary dataset). Here it suggests that we can reject the null hypothesis at a significance level of less than 1% (i.e., there is a low probability that the result is a statistical fluke).

ADF Statistic: -4.808291

p-value: 0.000052

Critical Values:

1%: -3.449

5%: -2.870

10%: -2.571

Thus we see that the series has an ADF statistic of -4.81, which is below the 1% critical value of -3.449. This means that we can reject H0 at the 1% level and infer that the data is stationary.
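For reference, output in this format can be produced with the adfuller function from statsmodels; the sketch below uses a synthetic series purely for illustration.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical stationary series: white noise around a constant mean
series = pd.Series(np.random.normal(20, 2, 300))

result = adfuller(series)
print(f"ADF Statistic: {result[0]:.6f}")
print(f"p-value: {result[1]:.6f}")
print("Critical Values:")
for level, value in result[4].items():
    print(f"  {level}: {value:.3f}")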

The ADF test tells us whether the series needs differencing. Determining the correct level of differencing is somewhat of an art, and various rules of thumb are available as guides for determining the level of differencing required. [2]

Rule 1: If the series has positive autocorrelations out to a high number of lags, then it probably needs a higher order of differencing

The correct amount of differencing is the lowest order of differencing that yields a time series which fluctuates around a well-defined mean value and whose autocorrelation function (ACF) plot decays fairly rapidly to zero, either from above or below. If the series still exhibits a long-term trend, or otherwise lacks a tendency to return to its mean value, or if its autocorrelations are positive out to a high number of lags (e.g., 10 or more), then it needs a higher order of differencing [2]

Differencing tends to introduce negative correlation: if the series initially shows strong positive autocorrelation, then a nonseasonal difference will reduce the autocorrelation and perhaps even drive the lag-1 autocorrelation to a negative value. If you apply a second nonseasonal difference (which is occasionally necessary), the lag-1 autocorrelation will be driven even further in the negative direction.

If the lag-1 autocorrelation is zero or even negative, then the series does not need further differencing. You should resist the urge to difference it anyway just because you don’t see any pattern in the autocorrelations!

Rule 2: If the lag-1 autocorrelation is zero or negative, or the autocorrelations are all small and patternless, then the series does not need a higher order of differencing. If the lag-1 autocorrelation is -0.5 or more negative, the series may be over-differenced.
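A small sketch of how these rules might be checked in practice, using pandas differencing and the ACF utilities from statsmodels on a synthetic, random-walk-like series:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf

y = pd.Series(np.random.normal(0, 1, 300)).cumsum()   # toy non-stationary series

diff1 = y.diff().dropna()                  # first nonseasonal difference (d=1)
lag1 = acf(diff1, nlags=1)[1]              # lag-1 autocorrelation after differencing
print(f"lag-1 autocorrelation after one difference: {lag1:.3f}")
# Near zero or mildly negative: no further differencing needed;
# around -0.5 or below: the series may be over-differenced.

plot_acf(diff1, lags=20)                   # visual check that the ACF decays quickly
plt.show()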

Step 2

The stationarized series may still have autocorrelated errors, i.e., correlation of the forecast variable with its own past values. This suggests that some number of AR terms (p ≥ 1) and/or some number of MA terms (q ≥ 1) are also needed in the forecasting equation. This step requires determining whether to add AR terms, MA terms, or both to the equation.

Which approach is best for dealing with autocorrelation? A rule of thumb for this situation is that positive autocorrelation is usually best treated by adding an AR term to the model, and negative autocorrelation is usually best treated by adding an MA term. In business and economic time series, negative autocorrelation often arises as an artefact of differencing. (In general, differencing reduces positive autocorrelation and may even cause a switch from positive to negative autocorrelation.) So the ARIMA(0,1,1) model, in which differencing is accompanied by an MA term, is used more often than the ARIMA(1,1,0) model.

An important point to note is that the ARIMA(0,1,1) model is essentially equivalent to the SES model, with some added flexibility with regard to the smoothing factor and, when a constant term is included, in the trend.
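As an illustration, the ARIMA(0,1,1) model discussed above can be fitted with statsmodels as sketched below; the series is synthetic and this is a minimal example rather than a full Box-Jenkins workflow.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical non-stationary series (random walk with drift)
y = pd.Series(np.random.normal(0.2, 1, 200)).cumsum() + 50

model = ARIMA(y, order=(0, 1, 1))   # (p, d, q): one difference plus one MA term
fit = model.fit()
print(fit.summary())
print(fit.forecast(steps=5))        # five-step-ahead forecasts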

When to use ARIMA models and Why?

1. ARIMA models are more flexible than other statistical models such as exponential smoothing or simple linear regression. In fact, some exponential smoothing models are special cases of ARIMA models; for example, a simple exponential smoothing model is equivalent to an ARIMA(0,1,1) model. ARIMA models occupy the middle ground of being simple enough not to overfit while being flexible enough to capture many of the types of relationships you see in the data

Limitations of Classical Models (Exponential Smoothing and ARIMA-based Models)

Classical forecasting models have several limitations:

1. Missing values are not supported

2. An assumption of linearity in the relationships. This problem is partly overcome by transforming the data, for example with log transformations

3. These models work on univariate data; most classical time series forecasting models do not support multiple variables as inputs

Due to the above limitations, there is a strong case for using deep learning techniques in forecasting. ARIMA has long been a standard method for time series forecasting. Even though ARIMA models are very prevalent in modelling economic and financial time series, they have some major limitations. For instance, a simple ARIMA model cannot easily capture nonlinear relationships between variables. Furthermore, an ARIMA model assumes a constant standard deviation of the errors, which may not hold in practice; when an ARIMA model is integrated with a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model, this assumption can be relaxed.

All this means that newer forecasting techniques using neural networks can overcome some of the problems associated with the classical methods. I will overview neural network methods for forecasting in the second article in this series.

References

1. https://people.duke.edu/~rnau/Notes_on_the_random_walk_model--Robert_Nau.pdf

2. https://people.duke.edu/~rnau/411avg.htm

3. https://en.wikipedia.org/wiki/Moving-average_model

4. https://en.wikipedia.org/wiki/Time_series

5. https://machinelearningmastery.com/taxonomy-of-time-series-forecasting-problems/

6. https://blogs.oracle.com/datascience/7-ways-time-series-forecasting-differs-from-machine-learning

7. https://machinelearningmastery.com/how-to-develop-convolutional-neural-network-models-for-time-series-forecasting/

8. https://rstudio-pubs-static.s3.amazonaws.com/362102_b6f0aa9a1e5a4192b444ee9ff77be8b5.html

9. https://jeddy92.github.io/JEddy92.github.io/ts_seq2seq_conv/

10. https://bair.berkeley.edu/blog/2018/08/06/recurrent/

11. https://medium.com/@satyam.kumar.iiitv/understanding-wavenet-architecture-361cc4c2d623

12. Erogol, “Dilated Convolution”, February 6, 2017, http://www.erogol.com/dilated-convolution/

13. https://towardsdatascience.com/neural-networks-over-classical-models-in-time-series-5110a714e535

14. Jason Brownlee, Machine Learning Algorithms in Python, Machine Learning Mastery, https://machinelearningmastery.com/machine-learning-with-python/, accessed April 15th, 2018.

15. https://machinelearningmastery.com/time-series-data-stationary-python/


Shailey Dash

AI Researcher, Writer and Teacher passionate about making AI accessible to everyone