It’s time to talk about Time Series Forecasting - Math edition

We can forecast the future without time machines, but sorry, not without math.

Published in SRM MIC, Dec 6, 2020 (9 min read)

Time series forecasting is an important aspect of machine learning and data science that hasn’t been given its due. It deals with making predictions about the future from historical data in which time is recorded as a separate feature. Part of the task is deciding how much weight past observations deserve: a value recorded 2 decades ago usually can’t be given the same weight as one recorded 2 years ago.


This blog is part one of a two-part series and covers the intuition and math behind time series forecasting. In the second part, we’ll cover some more math and finally delve into some Python code.

First of all, let’s conceptualize what a time series means; a time series is represented by a mathematical relationship between an output or target variable and time. It is quite simply written as shown below:
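The equation image from the original post is not reproduced here. A minimal way to write the relationship, assuming y stands for the target variable and t for time (both symbols are assumptions), would be:

```latex
y_t = f(t)
```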

In a graphical sense, a time series is basically a set of points that represents a quantity (target variable) and how it changes with time.

Time runs along the x-axis and the value of the target variable along the y-axis. For example, here is a time series plot that we will be making in Python (in part 2) for the monthly beer production dataset obtained from Kaggle:

As we can see, the data points have been plotted over a span of 40 years and the beer production value keeps changing with respect to time. Now let’s dive into the components of a time series and familiarize ourselves with the details.

Components of a time series

Trend:

Trend is the general tendency of the target variable to increase or decrease over time. It describes the long-term direction of the series over a given time interval, not its short-term ups and downs.

Trends can be either linear or non-linear.

How do we distinguish them? When we plot the data points over a given time period, the cluster formed by them will trace a rough outline. If this outline resembles a straight line, the trend is said to be linear; if it resembles a parabola or any other curve, the trend is said to be non-linear.
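The eyeball test above can be roughly mimicked in code. This is only a sketch: the two series are synthetic, and the "at least halves the error" rule is an assumed rule of thumb, not a standard test. It fits a straight line and a parabola to each series and compares how well they fit.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(50, dtype=float)

# Hypothetical series: one with a straight-line trend, one with a curved trend.
linear_series = 3.0 + 0.5 * t + rng.normal(0.0, 2.0, size=t.size)
curved_series = 3.0 + 0.5 * t + 0.2 * t**2 + rng.normal(0.0, 2.0, size=t.size)


def fit_error(y, degree):
    """Sum of squared residuals of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(t, y, degree)
    return float(np.sum((y - np.polyval(coeffs, t)) ** 2))


def classify_trend(y, ratio=0.5):
    """Assumed rule of thumb: call the trend non-linear if a parabola
    at least halves the error of a straight line."""
    return "non-linear" if fit_error(y, 2) < ratio * fit_error(y, 1) else "linear"


print(classify_trend(linear_series))
print(classify_trend(curved_series))
```

A parabola always fits at least as well as a line, which is why the sketch asks for a large improvement rather than any improvement.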

Trend, seasonality and irregular fluctuations — graphical representation

Seasonal variations (seasonality):

Seasonal variations are patterns of change in the data points that repeat at regular, fixed intervals. Each pattern completes within a period of at most one year.


For example, in India, mangoes sell the most in the summer, so mango sales are a good example of a target variable that possesses seasonality. Seasonality may be hourly, weekly, monthly or quarterly.

Cyclic variations:

Cyclic variations are rises and falls in the data points that play out over periods longer than one year and, unlike seasonality, do not repeat at a fixed interval. They follow a standard pattern of peak, recession, trough and recovery. In simpler terms, let’s say that the sales of a given product hit a peak in a given year, then slowly recede and hit rock bottom over the next 2 years. If the sales then recover over the following years to a level comparable to the former peak, the series is said to possess a component called cyclic variation.

Irregular fluctuations:

The sudden or unexpected fluctuations in data points at certain time intervals are referred to as irregular or random variations in the time series. These changes are driven by external or independent forces and can’t be predicted. The adverse impact of the ongoing Covid-19 pandemic on various businesses and the global economy as a whole is a good example of an irregular fluctuation.

Note: To represent irregular fluctuations, we use a term called ‘white noise’: a sequence of uncorrelated random values with zero mean and constant variance, added to the time series to account for such unforeseen circumstances. The forecast errors of an ideal time series model also behave like white noise. It is usually represented by ϵt.
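To make the components concrete, here is a small sketch that builds a synthetic monthly series out of them. Everything in it is an assumption for illustration: the numbers are made up, the components are simply added together (an additive model, which the text above does not prescribe), and the cyclic component is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)
months = np.arange(120)  # 10 years of monthly observations

trend = 100 + 0.8 * months                              # linear upward trend
seasonality = 10 * np.sin(2 * np.pi * months / 12)      # pattern repeating every 12 months
white_noise = rng.normal(0.0, 3.0, size=months.size)    # irregular fluctuations (ϵt)

# Assumed additive model: the series is the sum of its components.
series = trend + seasonality + white_noise

print(series[:12].round(1))
```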

Before moving on to understanding the statistical algorithms used to forecast a time series, let’s first explore what stationary and non-stationary time series mean, to better comprehend the working and necessity of all the different algorithms.

Stationary vs. Non-stationary time series

Quite simply put, a time series with no trend or seasonal variations is said to be stationary. This means that the mean and variance of the data points do not change over time, and that the covariance between 2 data points depends only on the time gap (lag) between them, not on when they were observed.
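A rough, informal way to see the "constant mean and variance" idea is to split a series in half and compare the two halves. The sketch below does that on two synthetic series (both made up for illustration). It is a sanity check only, not a formal test for stationarity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

stationary = rng.normal(0.0, 1.0, size=n)                        # white noise
trending = 0.05 * np.arange(n) + rng.normal(0.0, 1.0, size=n)    # white noise plus a trend


def half_stats(y):
    """Return ((mean of 1st half, mean of 2nd half), (variance of 1st half, variance of 2nd half))."""
    first, second = y[: len(y) // 2], y[len(y) // 2:]
    return (first.mean(), second.mean()), (first.var(), second.var())


for name, y in [("stationary", stationary), ("trending", trending)]:
    means, variances = half_stats(y)
    print(name, "means:", np.round(means, 2), "variances:", np.round(variances, 2))
```

For the trending series the mean of the second half is far above that of the first half, which is exactly the kind of change a stationary series does not show.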

Why are stationary time series preferred by most algorithms to make predictions? Because the statistical properties of stationary data stay the same over long periods of time, a pattern learned from the past can be expected to hold in the future. For the algorithm, stationary data is as good as data served on a plate.

But as we all know, the whole point of a time series is that the target variable keeps changing with respect to time. So, given below is the plot of a relatively more realistic time series that isn’t stationary.

This plot clearly depicts the variations in a time series and shows that it does, in fact, contain the trend and seasonality components. But such a non-stationary time series makes it difficult for algorithms to make fair predictions, so it needs to be made stationary first.

So, before moving on, how can we, without graphically plotting a time series, figure out if it is stationary or not?

Test for stationarity

The Augmented Dickey Fuller test, or ADF test, is used to tackle the aforementioned problem and falls under the category of unit root tests. A time series is said to have a unit root when the coefficient of the data point from the previous time period (lag) equals 1. A unit root means that random shocks never die out, so the series carries a persistent trend and is non-stationary.

The lag value or lag order denotes how many previous time steps are taken into account.
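The equation image from the original post is not reproduced here. A minimal form consistent with the discussion, with α as the coefficient of the previous value and ϵt as white noise, is the one below. The original equation may well include further terms.

```latex
Y_t = \alpha \, Y_{t-1} + \epsilon_t
```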

All we need to know from this equation is that if the coefficient of Yt-1 (the value of the data point in the previous time period) is α = 1, a unit root exists and so the time series is said to be non-stationary.
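To get a feel for what α = 1 means, the sketch below simulates two synthetic series, one with α = 1 (a random walk) and one with α = 0.5, and estimates α back from the data with a simple least-squares fit. This is not the ADF test itself, which needs special critical values; the function names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
noise = rng.normal(0.0, 1.0, size=n)  # white noise ϵt


def simulate(alpha):
    """Generate Y_t = alpha * Y_{t-1} + ϵt, starting from Y_0 = 0."""
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = alpha * y[t - 1] + noise[t]
    return y


def estimate_alpha(y):
    """Least-squares estimate of the coefficient of Y_{t-1} (no constant term)."""
    prev, curr = y[:-1], y[1:]
    return float(prev @ curr / (prev @ prev))


random_walk = simulate(1.0)   # unit root: non-stationary
stationary = simulate(0.5)    # no unit root: stationary

print("random walk:", round(estimate_alpha(random_walk), 3))
print("stationary: ", round(estimate_alpha(stationary), 3))
```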

The ADF test is a modified version of the Dickey Fuller test that, unlike the original, also takes into account lag values from many time steps in the past.

The ADF test churns out an important value called the p-value. Its null hypothesis is that the series has a unit root, i.e. is non-stationary. If the p-value is less than 0.05, we reject the null hypothesis and treat the time series as stationary. A p-value greater than 0.05 means we cannot reject it, so the time series is treated as non-stationary.

The image above shows the Python implementation of the Augmented Dickey Fuller test. The p-value is obtained by printing the value at index 1 of the result stored in the adf variable (with statsmodels’ adfuller, index 0 is the test statistic and index 1 is the p-value). Since this value is greater than 0.05, the time series isn’t stationary. We’ll discuss this in detail in the next blog.

Now, let’s deal with some mind blowing statistical algorithms that might, well, blow your mind.

Do note that this blog covers 3 algorithms that assume the input time series is stationary. We will deal with algorithms that accept non-stationary time series in the second part.

3 algorithms for (stationary) time series forecasting

Autoregression (AR):

This statistical algorithm uses a dependency or relationship between observations from the present as compared to their values in the past. To make it clearer, it uses data values from the past (using lag order) to predict values for the future.

It is represented as AR(p) where p represents the lag order of the model. This means that we consider data points from p previous time steps.

Mathematical representation
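The equation image from the original post is not reproduced here. The standard AR(p) equation, written with the symbols this section uses, is:

```latex
y_t = c + \sum_{i=1}^{p} \Phi_i \, y_{t-i} + \epsilon_t
```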

Let’s break this down by decoding the meanings of all the variables:

What is yt? The value of the data point to be predicted.
What is c? A constant.
What is the symbol Φ? The coefficient (weight) of each data point from a previous time period.
What is p? The lag order used for autoregression.
What is ϵt? The white noise added to the expression to account for any random/irregular variations.

Let’s just say that Autoregression uses the values of past data points to forecast the future.
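Here is a small sketch of that idea using only NumPy. It simulates a synthetic AR(2) series with made-up parameters, recovers c and the Φ coefficients with ordinary least squares, and makes a one-step forecast. Least squares is just one simple way to fit an AR model; libraries may use other estimation methods.

```python
import numpy as np

rng = np.random.default_rng(3)
c, phi = 1.0, np.array([0.6, -0.3])   # hypothetical true parameters
p, n = len(phi), 5000

# Simulate y_t = c + phi_1 * y_{t-1} + phi_2 * y_{t-2} + eps_t
y = np.zeros(n)
for t in range(p, n):
    # y[t - p:t][::-1] is [y_{t-1}, y_{t-2}, ..., y_{t-p}]
    y[t] = c + phi @ y[t - p:t][::-1] + rng.normal()

# Design matrix: a column of ones (for c) plus one column per lagged value
X = np.column_stack([np.ones(n - p)] + [y[p - i:n - i] for i in range(1, p + 1)])
target = y[p:]
params, *_ = np.linalg.lstsq(X, target, rcond=None)
c_hat, phi_hat = params[0], params[1:]

# One-step-ahead forecast from the last p observations
forecast = c_hat + phi_hat @ y[-p:][::-1]

print("estimated c:", round(float(c_hat), 2))
print("estimated phi:", np.round(phi_hat, 2))
print("forecast for the next step:", round(float(forecast), 2))
```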

Moving Average (MA):

This is yet another statistical algorithm that, unlike AR, uses the dependency between a given observation and the residual errors from previous time steps. Note: despite the name, this is not the same as smoothing the data with a rolling (moving) average. The errors are the differences between the actual data values at earlier time steps and the values the model predicted for them, i.e. the white noise terms from those steps.

It is represented as MA(q) where q is the order of the moving average and represents how many previous time steps to traverse in order to obtain their error values.
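The equation image from the original post is not reproduced here. The standard MA(q) equation, written with the symbols this section uses, is:

```latex
y_t = c + \epsilon_t + \sum_{j=1}^{q} \theta_j \, \epsilon_{t-j}
```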

Let’s break this equation down in a way similar to that of Autoregression:

The values of yt, c and ϵt represent exactly the same quantities they did in the AR equation. θ plays the role that Φ played there: it is a coefficient, but here it weights a past error term instead of a past data value.

The only alien variables are the ϵt-1, ϵt-2, …, ϵt-q terms. These terms are the error terms obtained from previous data points and they distinguish the MA algorithm from that of AR. Now let’s move on to the final algorithm for this blog.
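To see the MA terms in action, the sketch below generates a synthetic MA(2) series directly from white noise. The parameter values are made up for illustration. Since the error terms average out to zero, the series should hover around the constant c.

```python
import numpy as np

rng = np.random.default_rng(5)
c, theta = 10.0, np.array([0.5, 0.3])   # hypothetical parameters
q, n = len(theta), 20000

# White noise, with q extra values so the first observation has past errors to use
eps = rng.normal(0.0, 1.0, size=n + q)

# y_t = c + eps_t + theta_1 * eps_{t-1} + theta_2 * eps_{t-2}
y = np.array([
    c + eps[t] + theta @ eps[t - q:t][::-1]   # eps[t - q:t][::-1] is [eps_{t-1}, ..., eps_{t-q}]
    for t in range(q, n + q)
])

print("sample mean:", round(float(y.mean()), 2))
```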


Autoregressive Moving Average (ARMA):

As you might’ve guessed, Autoregressive Moving Average is a combination of both the Autoregression and Moving Average algorithms. Hence, it uses both dependencies, i.e. on data values and on error values from the past, to make its predictions.

It is represented as ARMA(p, q) where p is the lag order of Autoregression and q is the moving average order.
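The equation image from the original post is not reproduced here. The standard ARMA(p, q) equation, written with the symbols used for AR and MA in this blog, is:

```latex
y_t = c + \sum_{i=1}^{p} \Phi_i \, y_{t-i} + \sum_{j=1}^{q} \theta_j \, \epsilon_{t-j} + \epsilon_t
```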

We are already familiar with all the terms in the equation given above. Here, the two sigma signs are sums (think of them as loops) running from 1 to p and from 1 to q. They add up the weighted past observations (AR) and the weighted past errors (MA) respectively. So in conclusion, ARMA forecasts future values using the past data values and the errors calculated, i.e. it combines the methodologies used in AR and MA for making better predictions.
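The two loops can be sketched in plain Python. The function below is a hypothetical helper, not a library API. It assumes the parameters c, Φ and θ are already known, and it takes the shortcut of treating values and errors from before the start of the series as zero.

```python
def arma_forecast(y, c, phi, theta):
    """Return (residuals, next_forecast) for an ARMA(p, q) model with known parameters.

    Simplification: observations and errors before the start of the series count as zero.
    """
    residuals = []
    for t in range(len(y) + 1):
        # AR part: loop over the p most recent observations
        ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        # MA part: loop over the q most recent errors
        ma = sum(theta[j] * residuals[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        prediction = c + ar + ma
        if t == len(y):
            # One step past the end of the data: this is the forecast
            return residuals, prediction
        residuals.append(y[t] - prediction)


# Made-up example: ARMA(1, 1) with c = 0.5, Φ1 = 0.5, θ1 = 0.4
residuals, next_value = arma_forecast([1.0, 2.0, 3.0], c=0.5, phi=[0.5], theta=[0.4])
print(round(next_value, 3))  # 2.472
```

Working it through by hand: the predictions for the three observed points are 0.5, 1.2 and 1.82, giving errors 0.5, 0.8 and 1.18, so the forecast is 0.5 + 0.5 × 3 + 0.4 × 1.18 = 2.472.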

Conclusion

So in this blog we defined a time series, studied its 4 components, categorized it into stationary/non-stationary types and finally tried to decode the working of 3 statistical algorithms that forecast the future using only stationary time series, i.e. time series without trend or seasonality.

In the next blog, i.e. part 2 of this series, we’ll study some more algorithms like ARIMA, Seasonal ARIMA (SARIMA) and SARIMAX that don’t necessarily assume stationarity of a time series. We’ll also delve into some Python code and explore the monthly beer production dataset to successfully implement time series forecasting in Python. So stay tuned for part 2!