
Statistics: Time Series Analysis — Compilation of the fundamental concepts

Yuki Shizuya · Published in Intuition · 15 min read · Jul 6, 2024

A gentle introduction to the basics of time series analysis, with visualizations and detailed mathematical derivations


I have recently been studying fundamental statistical time series analysis for an examination [1]. However, the mathematical formulas are intimidating, and nothing stays in my head. So, I decided to summarize the basic concepts, such as autocovariance, autocorrelation, stationarity, and some representative time series processes, with visualizations. In this blog, I will explain these concepts using visualizations and Python implementations in as much detail as possible.

Table of Contents

  1. What are time series? — autocovariance, autocorrelation and stationarity
  2. Representative time series processes
     - 2.1 White noise
     - 2.2 Autoregressive (AR) process
     - 2.3 Moving average (MA) process
     - 2.4 Autoregressive-moving average (ARMA) and ARIMA processes
  3. Statistical tests for time series
     - 3.1 Augmented Dickey-Fuller (ADF) test
     - 3.2 Durbin-Watson test

1. What are time series? — autocovariance, autocorrelation and stationarity

As its name suggests, a time series is related to time. Data observed as time passes is called time series data: for example, heart rate monitoring, the daily maximum temperature, etc. Although these examples are observed at regular intervals, there is also time series data observed at irregular intervals, such as intraday stock trades, clinical trials, etc. This blog will use time series data observed at regular intervals with only a single variable (a univariate time series). We can mathematically define a time series as follows:

The definition of the time series data

If we regard Xₜ as a random variable, we can define a mean and a variance dependent on the observed time t.

The time series statistics
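Since the original formulas appear only as figures, here is a plausible LaTeX rendering of the time-dependent mean and variance (the notation in the original image may differ slightly):

\mu_t = \mathbb{E}[X_t], \qquad \sigma_t^2 = \mathrm{Var}(X_t) = \mathbb{E}\big[(X_t - \mu_t)^2\big]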

For time series data, we might want to compare the data between the past and the current one. To do this, two essential concepts, autocovariance and autocorrelation, are necessary for further discussion. So, let’s dive into them!

Autocovariance

Technically, the autocovariance is the same as the covariance, just applied to a series and a lagged copy of itself. Please remember that the covariance has the following formula:

Covariance formula

In this case, the covariance measures the relationship between two variables, X and Y. Note that when calculating the sample covariance, we divide the sum of the products of deviations from the means by n - 1, just as with the sample variance. For the autocovariance, we calculate the sample covariance between the previous observations and the current ones. The formula is as follows:

The autocovariance formula

h is called the lag. The lagged X is the X series shifted back by h steps. As you can see, the formula is the same as the covariance.

Autocorrelation

Like autocovariance, autocorrelation is simply correlation applied to the series and its lagged copy. The correlation has the following formula.

The correlation formula

The correlation divides the covariance by the standard deviations of the variables X and Y, so we can think of correlation as covariance normalized by the standard deviations. For autocorrelation, we calculate the correlation between the previous and current observations. h again refers to the lag in this formula.

The autocorrelation formula

When the covariance and correlation take large positive values, the two variables X and Y have a strong positive relationship. What about autocovariance and autocorrelation? Let's check with a visualization.

For the first example, I generated data from the AR(1) process (we will see it later). It looks like noisy data.

The sample time series plot

In this case, the autocovariance and autocorrelation graphs look like the ones below. The x-axis refers to the lag.

The autocovariance and the autocorrelation for the sample time series

As you can see, the autocovariance and the autocorrelation have similar tendencies. Thus, you can imagine that the autocorrelation can be considered the normalized autocovariance.
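If you want to compute these quantities yourself, here is a minimal Python sketch using statsmodels; the AR(1)-style series below is simulated only for illustration and stands in for the sample plotted above.

import numpy as np
from statsmodels.tsa.stattools import acovf, acf

# simulate an AR(1)-style series purely for illustration
rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()

# sample autocovariance and autocorrelation up to lag 20
gamma = acovf(x, nlag=20, fft=True)  # gamma[h] = sample autocovariance at lag h
rho = acf(x, nlags=20, fft=True)     # rho[h] = gamma[h] / gamma[0]
print(rho[:5])

Plotting gamma and rho against the lag h reproduces graphs of the kind shown above.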

For the next example, I will use real-world data: the AirPassengers dataset [4]. The AirPassengers data has a clear upward trend.

AirPassenger data plot

In this case, the autocovariance and autocorrelation graphs look like the ones below. The x-axis refers to the lag.

The autocovariance and the autocorrelation of the AirPassenger data

Likewise, autocovariance and autocorrelation have similar tendencies. This data shows stronger correlations, even at larger lags, than the first example.

Now we understand the two key concepts, autocovariance and autocorrelation. Next, we discuss a new concept called stationarity. A stationary time series is one whose properties, such as the mean, variance, and covariance, do not depend on the time of observation. There are two types of stationarity:

  1. Weak stationarity(Second-order stationarity)
  2. Strict stationarity(Strong stationarity)

Let’s check the definitions.

1. Weak stationarity(Second-order stationarity)

A process with the relationships below is called weakly stationary, second-order stationary, or covariance stationary. (There are many names for the same property.)

Weak stationarity

where µ is a constant and the autocovariance 𝛾 depends only on the lag 𝛕, not on the time t. These formulas show that the mean and the variance are stable over time and that the covariance depends only on the time lag. For instance, the first example in the previous section is weakly stationary.
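Written out, a plausible LaTeX rendering of the weak stationarity conditions shown in the figure is:

\mathbb{E}[X_t] = \mu, \qquad \mathrm{Var}(X_t) = \gamma_0 < \infty, \qquad \mathrm{Cov}(X_t, X_{t+\tau}) = \gamma_\tau \ \ \text{(a function of the lag } \tau \text{ only)}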

2. Strict stationarity(Strong stationarity)

When we let Fx(・) represent the joint density function, strict stationarity is described as:

Strict stationarity

If the joint distribution of the time series is invariant to shifts in time, that time series is strictly stationary. Strict stationarity implies weak stationarity (as long as the first two moments exist). This property is very restrictive in the real world, so many applications rely on weak stationarity instead.
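A plausible LaTeX rendering of the strict stationarity condition in the figure is:

F_X(x_{t_1}, \dots, x_{t_k}) = F_X(x_{t_1+\tau}, \dots, x_{t_k+\tau}) \quad \text{for all } k, \ t_1, \dots, t_k \ \text{and shifts } \tau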

There are statistical tests to check whether time series data is stationary. They will be easier to understand after we learn about the representative time series processes in section 2, so I will introduce them in section 3.

2. Representative time series processes

From now on, we will look at representative time series processes: the white noise, autoregressive (AR), moving average (MA), ARMA, and ARIMA processes.

2.1 White noise

Time series data with the following properties is called white noise.

White noise formula

White noise has zero mean, and its variance is identical across time steps. Furthermore, it has zero autocovariance at nonzero lags, meaning that the series and its lagged versions are uncorrelated, so the autocorrelation is also zero. White noise is commonly used as the assumption that the residual term in a time series regression should satisfy. The plot of white noise looks like the one below.

The graph of white noise

We can easily generate a white noise series by sampling from the standard normal distribution. As you can see, there seems to be no correlation at any lag other than 0, the variance looks almost identical across time steps, and the mean looks close to zero.
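As a minimal sketch, the white noise plotted above can be generated and sanity-checked like this (the seed and sample size are arbitrary):

import numpy as np
from statsmodels.tsa.stattools import acf

# white noise: i.i.d. draws from the standard normal distribution
rng = np.random.default_rng(42)
eps = rng.standard_normal(500)

print("mean:    ", eps.mean())             # close to 0
print("variance:", eps.var())              # close to 1
print("acf 1-3: ", acf(eps, nlags=3)[1:])  # close to 0 for lags >= 1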

2.2 Autoregressive(AR) process

Some time series take values similar to those of the previous steps. In such cases, the autoregressive (AR) process can describe the data well. The AR process has an order that represents how many previous values are used to predict the present value. We denote it as AR(order). For simplicity, let's look at the AR(1) process, represented by the following formula.

AR(1) process

Uₜ is assumed to be white noise, and 𝜙₁ is an unknown parameter attached to the one-step-previous value. Uₜ is also called a shock. We can get the formula below when we recursively substitute formula (1) back through the past steps.

Transformation of the AR(1) process

From the above formula, the past values affect the current Y only through powers of 𝜙₁. Thus, we can see the following:

  1. If |𝜙₁| < 1, the influence of the past values becomes smaller as the steps go by.
  2. If |𝜙₁| = 1, the influence of the past values is constant regardless of the lags.
  3. If |𝜙₁| > 1, the influence of the past values on the current values grows as the steps go by.

Let’s view the visualization of each case.

AR(1) process with various parameter visualization

As 𝜙₁ becomes larger, the current step follows the previous step's values more closely, so the series looks smoother as the value increases toward 𝜙₁ = 1. When 𝜙₁ is larger than 1, the values explode toward infinity, which is why the series looks like the final graph.
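Below is a minimal sketch of how such AR(1) samples can be simulated, assuming a zero constant term and standard-normal shocks (the article's exact simulation settings may differ):

import numpy as np

def simulate_ar1(phi, n=200, seed=0):
    """Simulate Y_t = phi * Y_{t-1} + U_t with standard-normal shocks U_t."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.standard_normal()
    return y

# one series per coefficient, mirroring the comparison in the figure
series = {phi: simulate_ar1(phi) for phi in (0.1, 0.5, 0.9, 1.0, 1.01)}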

Note that only the |𝜙₁| < 1 case gives a weakly stationary process. When the AR(1) process satisfies weak stationarity, the mean and the covariance are as follows:

The statistics of the AR(1) process when it satisfies weak stationarity
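A plausible LaTeX rendering of these statistics, assuming the AR(1) form Y_t = c + \phi_1 Y_{t-1} + U_t with shock variance \sigma^2 (the original figure's notation may differ), is:

\mu = \mathbb{E}[Y_t] = \frac{c}{1-\phi_1}, \qquad \gamma_0 = \mathrm{Var}(Y_t) = \frac{\sigma^2}{1-\phi_1^2}, \qquad \gamma_h = \mathrm{Cov}(Y_t, Y_{t-h}) = \phi_1^{h}\,\gamma_0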

For the mean, we use the fact that it is constant over time. Using the fact that white noise has zero mean, we can derive the formula as follows.

Derivation of the mean (= expected value)

For the covariance, we first need to rewrite formula (1).

Transformation of the formula (1)

Then, we derive the variance and the covariance, in that order. For the variance, we square the rewritten formula and take the expectation.

The derivation of the variance

For the covariance, we multiply by the deviation of the previous value from the mean and take the expectation.

The derivation of the covariance

We can consider the AR(p) process similarly.

AR(p) process

Generally, the AR(p) process is weakly stationary when it satisfies the following conditions (5) and (6).

Weak stationarity condition for the AR(p) process

Formulas (5) and (6) mean that all roots of equation (5) must lie outside the unit circle. Although we can make p as large as we like, considering just a few previous steps is usually enough in practice.
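A common way to state this condition, which is likely what formulas (5) and (6) express, uses the characteristic (lag-polynomial) equation:

\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = 0, \qquad |z_i| > 1 \ \text{for every root } z_i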

2.3 Moving average (MA) process

The moving average (MA) process is composed of the sum of the current and previous shocks. The MA process has an order that represents the number of previous shocks (Uₜ) included. We denote it as MA(order). For simplicity, let's look at the MA(1) process, represented by the following formula.

MA(1) process

Uₜ is assumed to be white noise, and θ₁ is an unknown parameter attached to the one-step-previous shock. The MA(1) process is composed of white noise terms, so its mean is always µ. On the other hand, the variance and the covariance can be derived as follows:

The statistics for the MA(1) process
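A plausible LaTeX rendering of these statistics, assuming the MA(1) form Y_t = \mu + U_t + \theta_1 U_{t-1} with shock variance \sigma^2 (the original figure's notation may differ), is:

\mathbb{E}[Y_t] = \mu, \qquad \gamma_0 = (1+\theta_1^2)\,\sigma^2, \qquad \gamma_1 = \theta_1\,\sigma^2, \qquad \gamma_h = 0 \ \ (h \ge 2)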

We can derive the variance as follows:

The variance derivation

Likewise, we can derive the covariance as follows:

The covariance derivation

White noise terms at different time steps are uncorrelated, so the cross terms cancel. Thus, the MA(1) process is weakly stationary for any parameter θ₁. Now, let's check this with a visualization.

MA(1) process with various parameters

As you can see, the mean and variance remain stable over time compared to the AR(1) process. As the parameter value increases, the series becomes relatively smoother. Note that the variance of the MA(1) process differs from that of white noise.

Generally, the MA(q) process is also weakly stationary.

MA(q) process

The mean and the covariance can be formulated as follows.

The statistics for the MA(q) process

Although we can make q as large as we like, considering just a few previous steps is usually enough in practice, as with the AR process.

2.4 Autoregressive-moving average (ARMA) and ARIMA processes

As its name suggests, the autoregressive-moving average (ARMA) process combines the AR and MA processes. Intuitively, the two parts compensate for each other's weaknesses and give more flexibility in representing the data. The mathematical representation is as follows:

ARMA(p,q) process

We denote the ARMA process as ARMA(p, q). The parameters p and q are the orders of the AR and MA parts. Since the MA part always has weak stationarity, the ARMA process's weak stationarity depends on the AR part. Thus, if the AR part of formula (14) satisfies formulas (5) and (6), the process is weakly stationary.

Let's check what the ARMA process looks like with a visualization. The ARMA(p=1, q=1) process looks as follows:

ARMA(1, 1) process with various parameters

The ARMA(p=3, q=2) process looks like the graph below.

ARMA(3,2) process with some parameters

As you can see, it can capture more complex data structures than the AR and MA processes alone. As the parameter values increase, the graph becomes smoother.
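Here is a minimal sketch of simulating an ARMA(1, 1) series with statsmodels; the coefficients 0.6 and 0.4 are arbitrary choices for illustration, not the ones used in the figures.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# ARMA(1, 1): Y_t = 0.6 * Y_{t-1} + U_t + 0.4 * U_{t-1}
# statsmodels expects lag-polynomial coefficients, with the AR signs flipped
ar = np.array([1, -0.6])  # 1 - 0.6L
ma = np.array([1, 0.4])   # 1 + 0.4L
process = ArmaProcess(ar, ma)

print(process.isstationary)               # True, because |0.6| < 1
y = process.generate_sample(nsample=200)  # simulated ARMA(1, 1) series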

Finally, the autoregressive integrated moving average (ARIMA) process shares its AR and MA parts with the ARMA process. The difference is that ARIMA has an integrated part (I), which refers to the number of times the data needs to be differenced to become stationary. What does that mean? First, we define the differencing operator ∇ as follows:

Differencing operator

When we want to difference further, we can apply the operator repeatedly, extending it to powers, as in:

The differencing operator with power 2

Now, using the differencing order d, we can define the ARIMA(p, d, q) process as:

ARIMA(p,d,q) process

p is the order of the AR part, d is the number of times the data is differenced, and q is the order of the MA part. Thus, after differencing the data, the ARIMA process becomes an ARMA process. The ARIMA process is useful, for instance, when the mean changes over the time series, which means the series is not stationary. Recall the AirPassengers dataset: its mean is not the same across the series. When we apply the differencing operator ∇ to this series, the graph looks as follows:

The comparison between the original and the differenced data
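As a quick sketch, first-order differencing can be done with pandas; the CSV path and column name below are assumptions, so adjust them to your copy of the dataset.

import pandas as pd

# load the AirPassengers data (hypothetical path and column name)
air = pd.read_csv("AirPassengers.csv", index_col=0, parse_dates=True)
y = air["#Passengers"]

# first-order differencing, i.e. the nabla operator: y_t - y_{t-1}
y_diff = y.diff().dropna()
print(y_diff.head())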

The mean of the differenced series on the right seems stable over time compared to the original data on the left. Now, we want to fit an ARMA process after differencing. However, there is a problem: how do we choose the orders? There are a few methods to determine them, listed below.

  1. Using the autocorrelation function (ACF) plot to determine the order (q) of the MA part and the partial autocorrelation function (PACF) plot to determine the order (p) of the AR part.
  2. Using AIC or BIC to determine the best-fitting orders.

With the first method, we use ACF and PACF plots to determine the orders of the MA and AR parts. The PACF is also an autocorrelation, but it removes the indirect correlation between Yₜ and Yₜ₊ₖ that passes through the intermediate lags 0 < n < k. For further information, you can refer to these lecture notes [5][6]. However, we sometimes cannot decide the orders from the plots alone, so I recommend the second method. AIC and BIC are information criteria used to estimate a model's quality relative to other models. Thanks to the pmdarima library [7], we can easily find the best parameters based on these information criteria. For instance, the result looks like the one below when we use pmdarima to estimate the AirPassengers data.

import pmdarima as pm

# y_train: the AirPassengers training series (a 1-D array or pandas Series)
d = pm.arima.ndiffs(y_train)  # estimate how many times to difference

# fit stepwise auto-ARIMA
arima = pm.auto_arima(y_train, start_p=1, start_q=1,
                      max_p=3, max_q=3,  # m=12,
                      seasonal=False,
                      d=d, trace=True,
                      error_action='ignore',   # don't want to know if an order does not work
                      suppress_warnings=True,  # don't want convergence warnings
                      stepwise=True)           # set to stepwise
arima.summary()
AirPassengers prediction using auto_arima

By writing only a few lines, we can fit and forecast the data with good quality. Furthermore, pmdarima can estimate more advanced models such as SARIMA, so it is very helpful in practical use cases.

3. Statistical tests for time series

In this final section, I will introduce two famous statistical tests for time series. They are often used to check whether the data is stationary or whether the residual term has autocorrelation. Before diving into each test, there is an important concept called a unit root. If a time series has a unit root, it is not stationary. For the AR(p) process, if at least one root of equation (5) equals 1, the process is not stationary, and we say that it has a unit root. A couple of statistical tests use this concept.

3.1 Augmented Dickey-Fuller(ADF) test

The augmented Dickey-Fuller (ADF) test evaluates whether a unit root exists in the given univariate time series.

ADF test uses the following formula derived from the formula (10).

ADF test setting

Then, it sets the following null and alternative hypotheses.

Hypothesis for ADF test

The statistic is the following formula.

ADF statistic

Thus, the numerator must be negative when the time series is stationary. In practice, several libraries calculate the ADF test for us, so we don't need to implement it ourselves. The figure below shows three examples of time series data: the left one is an AR(1) process, the center one is an MA(1) process, and the last one is the AirPassengers dataset. Each graph title shows the process name and the p-value of the ADF test.

The visualization of time series data and the result of the ADF test

As you can see, the p-values of the stationary data (left and center) are smaller than the significance threshold, so we can reject the null hypothesis, which means the data are stationary. On the other hand, the p-value of the non-stationary data (right) is larger than the threshold, so we cannot reject the null hypothesis, which means the data are not stationary.
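Here is a minimal sketch of running the ADF test with statsmodels; the simulated AR(1)-style sample below is only for illustration.

import numpy as np
from statsmodels.tsa.stattools import adfuller

# a stationary AR(1)-style sample, just for illustration
rng = np.random.default_rng(1)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()

adf_stat, p_value, used_lag, n_obs, crit_values, icbest = adfuller(x, autolag="AIC")
print("ADF statistic:", adf_stat)
print("p-value:      ", p_value)  # a small p-value lets us reject the unit-root null hypothesis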

3.2 Durbin-Watson test

The Durbin-Watson test evaluates whether the residual term in a time series regression model has autocorrelation. How do we use it? Assuming the following regression model for the time series, we can estimate the parameters using the least-squares method.

The regression model

If Uₜ is not white noise, the model quality is not good: Uₜ may have autocorrelation (serial correlation) that we should include in our model. To check this, we can use the Durbin-Watson test, which assumes the residual term follows an AR(1) model.

AR(1) process

Then, it sets the following null and alternative hypotheses.

The hypothesis for DW test

We use the following Durbin-Watson statistic.

Durbin-Watson statistic

This formula is not very intuitive, so let's transform it. We assume T is large enough that the following approximations hold.

The approximations for the residual term

Using these approximations, we can transform the Durbin-Watson statistic as follows:

Transformation of the DW statistic
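Written out, a plausible rendering of the statistic and its large-sample approximation, with û_t denoting the regression residuals and 𝛾 their first-order autocorrelation as in the article, is:

DW = \frac{\sum_{t=2}^{T} (\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=1}^{T} \hat{u}_t^{\,2}} \ \approx\ 2\,(1 - \hat{\gamma})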

𝛾 denotes the first-order autocorrelation of the residuals. The DW statistic is close to 2 when the autocorrelation is close to zero, meaning there is little autocorrelation in the residuals. On the other hand, when there is positive autocorrelation, the DW statistic becomes smaller than 2 (and larger than 2 for negative autocorrelation).

The relationship among the DW statistic, the autocorrelation and the hypothesis

Let’s check the DW statistic using the ARIMA model we created in section 2.4.

import pmdarima as pm
from statsmodels.stats.stattools import durbin_watson

# y_train: the AirPassengers training series used in section 2.4
arima = pm.arima.ARIMA(order=(2, 1, 2))
arima.fit(y_train)

dw = durbin_watson(arima.resid())  # DW statistic of the model residuals
print('DW statistic: ', dw)
# DW statistic: 1.6882339836228373

The DW statistic is smaller than 2, so some autocorrelation (serial correlation) remains. The residual plot below shows that the residuals still have some correlation.

Residual plot

In this case, we need a more advanced model to fit the data correctly, such as SARIMA, recurrent neural networks, or Prophet.

I used the code below to produce the results in this blog.

This is the end of this blog. Thank you for taking the time to read it!

References

[1] Japan Statistical Society Certificate, Grade Pre-1 (統計検定準1級)

[2] Wang, D., Lecture Notes, University of South Carolina

[3] Buteikis, A., 02 Stationary time series

[4] AirPassengers dataset, Kaggle

[5] Eshel, G., The Yule Walker Equations for the AR coefficients

[6] Bartlett, P., Introduction to Time Series Analysis, Lecture 12, UC Berkeley

[7] pmdarima: ARIMA estimators for Python
