Statistics: Time Series Analysis — Compilation of the fundamental concepts
Gentle introduction of time series analysis basics with visualization and detailed mathematical derivation
I have recently been studying fundamental statistical time series analysis for an examination [1]. However, the mathematical formulas are intimidating, and little of them stayed in my head. So, I decided to summarize the basic concepts, such as autocovariance, autocorrelation, stationarity, and some representative time series processes, with visualization. In this blog, I will explain these concepts using visualization and Python implementations in as much detail as possible.
Table of Contents
1. What are time series? — autocovariance, autocorrelation and stationarity
2. Representative time series processes
- 2.1 White noise
- 2.2 Autoregressive (AR) process
- 2.3 Moving average (MA) process
- 2.4 Autoregressive moving average (ARMA) and ARIMA processes
3. Statistical tests for time series
- 3.1 Augmented Dickey-Fuller (ADF) test
- 3.2 Durbin-Watson test
1. What are time series? — autocovariance, autocorrelation and stationarity
As the name suggests, time series are related to time. Data observed as time passes is called time series data: for example, heart rate monitoring, the daily maximum temperature, etc. Although these examples are observed at regular intervals, some time series data are observed at irregular intervals, such as intraday stock trades or clinical trials. This blog uses time series data observed at regular intervals and containing only a single variable (a univariate time series). We can define a time series mathematically as follows:
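In standard notation, a univariate series observed at T regular time points is written as:

```latex
\{X_t\}_{t=1}^{T} = \{X_1, X_2, \ldots, X_T\}
```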
If we regard Xₜ as a random variable, we can define a mean and a variance dependent on the observed time t.
For time series data, we might want to compare the data between the past and the current one. To do this, two essential concepts, autocovariance and autocorrelation, are necessary for further discussion. So, let’s dive into them!
Autocovariance
Technically, the autocovariance is the same as the covariance. Please remember that the covariance has the following formula:
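Recall:

```latex
\mathrm{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right],
\qquad
\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```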
In this case, the covariance measures the relationship between two variables, X and Y. Note that when calculating the sample covariance, we divide the sum of the products of the deviations from the means by n-1, similar to the sample variance. For the autocovariance, we calculate the covariance between the previous observations and the current one. The formula is as follows:
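Writing h for the lag, the autocovariance is the covariance between the series and its lagged version:

```latex
\gamma(h) = \mathrm{Cov}(X_t, X_{t+h}) = E\left[(X_t - \mu)(X_{t+h} - \mu)\right]
```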
h is called the lag. The lagged X is the X series shifted back by h steps. As you can see, the formula is the same as the covariance.
Autocorrelation
Autocorrelation is also the same as correlation, like autocovariance. The correlation has the following formula.
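In standard notation:

```latex
\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}
```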
The correlation divides the covariance by the standard deviations of the variables X and Y. We can think of the correlation as the covariance normalized by the standard deviations. For autocorrelation, we calculate the correlation between the previous and current observations. h again refers to the lag in this formula.
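In the same way, the autocorrelation at lag h is the autocovariance normalized by the variance of the series:

```latex
\rho(h) = \frac{\gamma(h)}{\gamma(0)}
```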
When the covariance and correlation take higher positive values, the two variables, X and Y, have a positive relationship. How about autocovariance and autocorrelation? Let’s check visualization.
For the first example, I generated data from the AR(1) process (we will see it later). It looks like noisy data.
In this case, the autocovariance and autocorrelation graphs look like the ones below. The x-axis refers to the lag.
As you can see, the autocovariance and the autocorrelation have similar tendencies. Thus, you can imagine that the autocorrelation can be considered the normalized autocovariance.
For the following example, I will use real-world data, such as AirPassengers [4]. AirPassengers data has a clear upward tendency.
In this case, the autocovariance and autocorrelation graphs look like the ones below. The x-axis refers to the lag.
Likewise, the autocovariance and autocorrelation have similar tendencies. This data shows stronger correlations, even at larger lags, than the first example.
Now, we understand the two key concepts, autocovariance and autocorrelation. Next, we discuss a new concept called stationarity. A stationary time series means the data properties, such as mean, variance, and covariance, are not dependent on the observed time. There are two types of stationarity:
- Weak stationarity(Second-order stationarity)
- Strict stationarity(Strong stationarity)
Let’s check the definitions.
1. Weak stationarity(Second-order stationarity)
A process that satisfies the relationships below is called weakly stationary, second-order stationary, or covariance stationary. (It goes by many names.)
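In standard notation, these relationships are:

```latex
E[X_t] = \mu, \qquad \mathrm{Var}(X_t) = \gamma_0, \qquad \mathrm{Cov}(X_t, X_{t+h}) = \gamma_h \quad \text{for all } t \text{ and } h
```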
where µ is constant and 𝛾 depends only on the lag, not on the time t. These formulas show that the mean and the variance are stable over time, and the covariance depends only on the time lag. For instance, the first example in the previous paragraph is weakly stationary.
2. Strict stationarity(Strong stationarity)
When we let Fx(・) represent the joint distribution function, strict stationarity is described as:
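That is, for any choice of time points t₁, …, tₖ and any shift τ:

```latex
F_X(x_{t_1}, \ldots, x_{t_k}) = F_X(x_{t_1 + \tau}, \ldots, x_{t_k + \tau})
```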
If the joint distribution of all time series data is invariant to shifts in time, that time series has strict stationarity. Strict stationarity implies weak stationarity. This property is very restrictive in the real world. Thus, many applications rely on weak stationarity.
There are some statistical tests to check whether the time series data is stationary. They will be more understandable after we learn about the representative time series process in section 2. So, I will introduce them in section 3.
2. Representative time series processes
From now on, we will see the representative time series processes, such as white noise, autoregressive(AR), moving average(MA), ARMA, and ARIMA processes.
2.1 White noise
When time series data has the following properties, it is called white noise.
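In standard notation:

```latex
E[U_t] = 0, \qquad \mathrm{Var}(U_t) = \sigma^2, \qquad \mathrm{Cov}(U_t, U_{t+h}) = 0 \quad (h \neq 0)
```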
White noise has zero mean, and its variance is identical over the time steps. Furthermore, it has zero covariance between different time steps, meaning that the time series and its lagged version are uncorrelated. So, the autocorrelation is also zero for any nonzero lag. White noise is commonly used as the assumption that the residual term in time series regression analysis should satisfy. The plot of white noise looks like the one below.
We can easily generate white noise series by sampling from the standard normal distribution. As you can see, there seems to be no correlation besides the lag 0, the variance seems almost identical over time steps, and the mean seems zero.
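A minimal sketch (the sample size and seed are arbitrary choices):

```python
import numpy as np

# white noise: i.i.d. draws from the standard normal distribution
rng = np.random.default_rng(0)
u = rng.standard_normal(1000)

print(u.mean(), u.var())  # both close to the theoretical values 0 and 1
```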
2.2 Autoregressive(AR) process
Some time series data take values similar to those of the previous steps. In such cases, the autoregressive(AR) process can interpret the data well. The AR process has an order that represents the number of previous values in the series, which is used to predict the value at present. We denote it as AR(order). For simplicity, let’s see the AR(1) process. The following formula represents the AR(1) process.
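With a constant term c (consistent with the mean derivation later in this section), the AR(1) process, formula (1) in this blog, is:

```latex
Y_t = c + \phi_1 Y_{t-1} + U_t \tag{1}
```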
Uₜ is assumed to be white noise, and 𝜙 is an unknown parameter corresponding to the one-step previous value. Uₜ is also called shock. We can get the formula below when we solve formula (1) towards the past steps.
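Substituting repeatedly for the lagged term gives:

```latex
Y_t = \phi_1^{t} Y_0 + \sum_{i=0}^{t-1} \phi_1^{i} \left(c + U_{t-i}\right)
```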
From the above formula, the past values affect the Y series only through powers of 𝜙₁. Thus, we can see the following:
- If |𝜙₁| < 1, the influence of the past values becomes smaller as the steps go by.
- If |𝜙₁| = 1, the influence of the past values is constant regardless of the lags.
- If |𝜙₁| > 1, the influence of the past values grows as the steps go by.
Let’s view the visualization of each case.
As the value of 𝜙₁ becomes bigger, the current step follows the previous steps' values more closely. Thus, the series looks smoother as the value increases, up to 𝜙₁ = 1. When 𝜙₁ is larger than 1, the values diverge toward infinity, so the series looks like the final graph.
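The three regimes can be generated with a sketch like the following (c is set to 0, the shocks are standard normal, and the specific 𝜙₁ values are illustrative):

```python
import numpy as np

def simulate_ar1(phi, n=200, seed=0):
    """Simulate y_t = phi * y_{t-1} + u_t with standard normal shocks."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.standard_normal()
    return y

stationary = simulate_ar1(0.5)    # |phi| < 1: fluctuates around zero
random_walk = simulate_ar1(1.0)   # |phi| = 1: wanders with no fixed level
explosive = simulate_ar1(1.05)    # |phi| > 1: diverges

print(np.abs(stationary).max(), np.abs(explosive).max())
```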
Note that only the |𝜙₁| < 1 case gives a weakly stationary process. When the AR(1) process satisfies weak stationarity, the mean and the covariance are as follows:
For the mean, we use the fact that it is constant over time. Using the fact that white noise has mean zero, we can derive the formula as follows.
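Taking expectations of both sides of formula (1), with E[Uₜ] = 0 and E[Yₜ] = µ constant:

```latex
\mu = c + \phi_1 \mu \quad \Longrightarrow \quad \mu = \frac{c}{1 - \phi_1}
```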
For the covariance, we need to change the formula (1) first.
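Subtracting the mean from both sides gives the deviation form:

```latex
Y_t - \mu = \phi_1 \left(Y_{t-1} - \mu\right) + U_t
```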
Then, we derive the variance and the covariance, in this order. For the variance, we can derive it by squaring the formula derived above.
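Since Uₜ is uncorrelated with Y_{t−1}, the cross term vanishes when we take expectations, giving:

```latex
\gamma_0 = \phi_1^2 \gamma_0 + \sigma^2 \quad \Longrightarrow \quad \gamma_0 = \frac{\sigma^2}{1 - \phi_1^2}
```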
For the covariance, we can derive it by multiplying both sides by the deviation of the previous step's value from the mean and taking expectations.
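Iterating this over the lag h yields:

```latex
\gamma_h = \phi_1 \gamma_{h-1} = \phi_1^{h} \gamma_0
```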
We can consider the AR(p) process similarly.
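In standard form, the AR(p) process is:

```latex
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + U_t
```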
Generally, the AR(p) process can be weak-stationary when it satisfies the following condition (5)(6).
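These are the characteristic equation and the root condition:

```latex
1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0 \tag{5} \\
|z_i| > 1 \quad \text{for every root } z_i \tag{6}
```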
Formulas (5) and (6) mean that all roots of formula (5) must lie outside the unit circle. Although p can be made arbitrarily large, considering several previous steps is usually enough in the real world.
2.3 Moving average (MA) process
The moving average (MA) process is composed of the summation of the current and previous shocks. The MA process has an order that represents the number of previous residuals, or shocks (Uₜ), included. We denote it as MA(order). For simplicity, let's see the MA(1) process. The following formula represents the MA(1) process.
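In standard notation:

```latex
Y_t = \mu + U_t + \theta_1 U_{t-1}
```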
Uₜ is assumed to be white noise, and θ₁ is an unknown parameter corresponding to the one-step previous shock. MA(1) process is composed of white noise, so its mean is always µ. On the other hand, the variance and the covariance can be derived as follows:
We can derive the variance as follows:
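Using the white noise properties:

```latex
\gamma_0 = \mathrm{Var}\left(U_t + \theta_1 U_{t-1}\right) = \left(1 + \theta_1^2\right)\sigma^2
```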
Likewise, we can derive the covariance as follows:
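Only one shock is shared between adjacent time steps, so:

```latex
\gamma_1 = E\left[(U_t + \theta_1 U_{t-1})(U_{t-1} + \theta_1 U_{t-2})\right] = \theta_1 \sigma^2, \qquad \gamma_h = 0 \quad (h \geq 2)
```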
White noise terms at different time steps are uncorrelated, so the cross terms cancel. Thus, the MA(1) process is a weakly stationary process for any parameter θ₁. Now, let's check this by visualization.
As you can see, the mean and variance remain stable over time compared to the AR(1) process. As the parameter value increases, the series becomes relatively smoother. Note that the variances of the MA(1) process and white noise differ.
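A sketch of simulating an MA(1) series (θ₁ = 0.8 is an arbitrary illustrative value):

```python
import numpy as np

def simulate_ma1(theta, mu=0.0, n=200, seed=0):
    """Simulate y_t = mu + u_t + theta * u_{t-1} with standard normal shocks."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n + 1)
    return mu + u[1:] + theta * u[:-1]

y = simulate_ma1(0.8)
print(y.var())  # close to the theoretical variance (1 + 0.8**2) * 1 = 1.64
```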
Generally, the MA(q) process is also weak-stationary.
The mean and the covariance can be formulated as follows.
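For the MA(q) process Yₜ = µ + Uₜ + θ₁U_{t−1} + ⋯ + θ_qU_{t−q} (with θ₀ = 1):

```latex
E[Y_t] = \mu, \qquad
\gamma_h =
\begin{cases}
\sigma^2 \sum_{i=0}^{q-h} \theta_i \theta_{i+h} & (0 \leq h \leq q) \\
0 & (h > q)
\end{cases}
```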
Although q can be made arbitrarily large, considering several previous steps is usually enough in the real world, as with the AR process.
2.4 Autoregressive moving average (ARMA) and ARIMA processes
As its name suggests, the autoregressive-moving average(ARMA) process combines the AR and MA processes. Intuitively, the ARMA process can reinforce each other’s weaknesses and obtain more flexibility in representing the data. Mathematical representation is as follows:
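This is formula (14) referenced below:

```latex
Y_t = c + \sum_{i=1}^{p} \phi_i Y_{t-i} + U_t + \sum_{j=1}^{q} \theta_j U_{t-j} \tag{14}
```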
We denote the ARMA process as ARMA(p, q). The orders p and q correspond to the orders of the AR and MA parts. Since the MA part always has weak stationarity, the ARMA process's weak stationarity depends on the AR part. Thus, if the AR part of formula (14) satisfies formulas (5)(6), the ARMA process is weakly stationary.
Let's check what the ARMA process looks like through visualization. The ARMA(p=1, q=1) process looks as follows:
The ARMA(p=3, q=2) process looks like the graph below.
As you can see, it can grasp more complex data structures better than the AR and MA processes alone. As the parameter values increase, the graph becomes smoother.
Finally, the Autoregressive-integrated-moving average(ARIMA) process shares some parts with the ARMA process. The difference is that the ARIMA has an integrated part (I). An integrated part refers to the number of times the data needed to be differenced to get stationarity. What does it mean? First, we define the differencing operator ∇ as follows:
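In standard notation:

```latex
\nabla Y_t = Y_t - Y_{t-1}
```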
When we want to difference more, we can extend it to powers by iterating, as in:
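For example, the second difference and the general d-th difference are:

```latex
\nabla^2 Y_t = \nabla(\nabla Y_t) = Y_t - 2 Y_{t-1} + Y_{t-2}, \qquad \nabla^d Y_t = \nabla\left(\nabla^{d-1} Y_t\right)
```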
Now, using the difference parameter, we can define the ARIMA(p, d, q) process as:
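In compact form, Yₜ follows an ARIMA(p, d, q) process when the d-times differenced series follows an ARMA(p, q) process:

```latex
\nabla^d Y_t \sim \mathrm{ARMA}(p, q)
```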
p is the order of the AR part, d is the number of times the data is differenced, and q is the order of the MA part. Thus, after differencing the data, the ARIMA process becomes an ARMA process. The ARIMA process is useful, for instance, when the mean changes over the time series, which means the time series is not stationary. Please remember the AirPassengers dataset. It does not have the same mean over the whole series. When we apply the differencing operator ∇ to this series, the graph looks as follows:
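In code, first differencing is a one-liner with pandas; the numbers below are the first ten monthly values of the AirPassengers series:

```python
import pandas as pd

# first ten monthly values of the AirPassengers dataset
passengers = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])
diffed = passengers.diff().dropna()  # first difference: y_t - y_{t-1}

print(diffed.tolist())  # [6.0, 14.0, -3.0, -8.0, 14.0, 13.0, 0.0, -12.0, -17.0]
```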
The mean in the right graph seems stable over the time series compared to the original data in the left graph. Now, we want to fit the ARMA process after differencing. However, there is a problem: how do we choose the orders? There are some methods to determine them, as follows.
- Using autocorrelation function (ACF) plot to determine the order(q) of the MA process and partial autocorrelation function (PACF) plot to determine the order(p) of the AR process.
- Using the AIC or BIC to determine the best-fitted parameters.
According to the first method, we use ACF and PACF plots to determine the orders of the MA and AR parts. PACF is also an autocorrelation, but it removes the indirect correlations between Yₜ and Yₜ₊ₖ that pass through the intermediate lags 0 < n < k. For further information, you can refer to these lecture PDFs [5][6]. However, we sometimes cannot decide the orders using only the plots. Thus, I recommend using the second method. AIC and BIC are information criteria used to estimate the model quality relative to other candidate models. Thanks to the library pmdarima [7], we can easily find the best parameters based on these information criteria. For instance, the result looks as below when we use pmdarima to estimate the AirPassengers data.
```python
import pmdarima as pm

# y_train: the training slice of the AirPassengers series
# estimate the number of differences d beforehand
d = pm.arima.ndiffs(y_train)

# fit stepwise auto-ARIMA
arima = pm.auto_arima(y_train, start_p=1, start_q=1,
                      max_p=3, max_q=3,  # m=12,
                      seasonal=False,
                      d=d, trace=True,
                      error_action='ignore',    # don't want to know if an order does not work
                      suppress_warnings=True,   # don't want convergence warnings
                      stepwise=True)            # set to stepwise
arima.summary()
```
By writing only a few lines, we can fit and forecast the data with good quality. Furthermore, pmdarima can estimate the time series using more advanced models such as SARIMA. Thus, pmdarima is very helpful in practical use cases.
3. Statistical tests for time series
In the final section, I will introduce two famous statistical tests for time series. These tests are often used to check whether the data is stationary or whether the residual term has autocorrelation. Before diving into each test, there is an important concept called a unit root. If a time series has a unit root, it is not stationary. Thus, if at least one root of formula (5) equals 1, the AR(p) process is not stationary, and we say that the AR(p) process has a unit root. A couple of statistical tests use this concept.
3.1 Augmented Dickey-Fuller(ADF) test
The augmented Dickey-Fuller (ADF) test evaluates whether a unit root exists in the given univariate time series.
The ADF test uses the following regression, derived from formula (10).
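A common form of this regression (the drift α and trend βt terms are included or dropped depending on the test variant):

```latex
\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{i=1}^{p-1} \delta_i \, \Delta Y_{t-i} + \varepsilon_t
```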
Then, it sets the following null hypothesis and the alternate hypothesis.
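In standard form:

```latex
H_0: \gamma = 0 \ (\text{unit root}), \qquad H_1: \gamma < 0
```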
The statistic is the following formula.
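It is the estimated coefficient divided by its standard error:

```latex
\mathrm{DF}_\tau = \frac{\hat{\gamma}}{\mathrm{SE}\left(\hat{\gamma}\right)}
```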
Thus, the numerator must be negative when the time series is stationary. Practically, several libraries calculate the ADF test for us, so you don't need to implement it yourself. The following figure shows three examples of time series data. The left one is the AR(1) process, the center one is the MA(1) process, and the right one uses the AirPassengers dataset. Each graph title shows the process name and the p-value of the ADF test.
As you can see, the p-values of the stationary data (left and center) are smaller than the significance threshold, so we can reject the null hypothesis, which means the data is stationary. On the other hand, the p-value of the non-stationary data (right) is larger than the threshold, so we cannot reject the null hypothesis, which means the data is not stationary.
3.2 Durbin-Watson test
The Durbin-Watson test evaluates whether the residual term in a time series regression model has autocorrelation. How can we use it? When we assume the following regression model for a time series, we can estimate the parameters using the least-squares method.
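For instance, with a single explanatory variable xₜ (the original model may include more regressors):

```latex
Y_t = \beta_0 + \beta_1 x_t + U_t
```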
If Uₜ is not white noise, the model quality is not good. In that case, Uₜ may have some autocorrelation, or serial correlation, and we should include it in our model. To check this, we can use the Durbin-Watson test, which assumes the residual term follows an AR(1) model.
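With εₜ white noise:

```latex
U_t = \rho U_{t-1} + \varepsilon_t
```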
Then, it sets the following null hypothesis and the alternate hypothesis.
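In standard form (the alternative is sometimes stated one-sided as ρ > 0):

```latex
H_0: \rho = 0, \qquad H_1: \rho \neq 0
```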
We use the following Durbin-Watson statistic.
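With û denoting the estimated residuals:

```latex
DW = \frac{\sum_{t=2}^{T} \left(\hat{u}_t - \hat{u}_{t-1}\right)^2}{\sum_{t=1}^{T} \hat{u}_t^2}
```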
This formula is not very intuitive, so let's transform it. We assume T is large enough that the following relationships hold.
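That is, the sums of squared residuals are approximately equal regardless of the offset:

```latex
\sum_{t=2}^{T} \hat{u}_t^2 \approx \sum_{t=2}^{T} \hat{u}_{t-1}^2 \approx \sum_{t=1}^{T} \hat{u}_t^2
```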
Using these formulae, we can transform the Durbin-Watson statistic as follows:
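Expanding the square in the numerator:

```latex
DW \approx \frac{2\sum \hat{u}_t^2 - 2\sum \hat{u}_t \hat{u}_{t-1}}{\sum \hat{u}_t^2}
= 2\left(1 - \frac{\sum \hat{u}_t \hat{u}_{t-1}}{\sum \hat{u}_t^2}\right)
\approx 2\left(1 - \hat{\gamma}_1\right)
```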
𝛾 denotes the first-order autocorrelation of the residuals. The DW statistic is close to 2 when the autocorrelation is close to zero, which means there is little autocorrelation in the time series. On the other hand, if there is positive autocorrelation, the DW statistic becomes smaller than 2.
Let’s check the DW statistic using the ARIMA model we created in section 2.4.
```python
import pmdarima as pm
from statsmodels.stats.stattools import durbin_watson

# fit an ARIMA(2,1,2) model and test its residuals for autocorrelation
arima = pm.arima.ARIMA(order=(2, 1, 2))
arima.fit(y_train)
dw = durbin_watson(arima.resid())
print('DW statistic: ', dw)
# DW statistic: 1.6882339836228373
```
The DW statistic is smaller than 2, so autocorrelation or serial correlation remains. The residual plot below shows that the residual still has some correlation.
In this case, we need a more advanced model to fit the data correctly. We can use SARIMA, recurrent neural networks, Prophet, etc.
I used the code below to produce the results in this blog.
This is the end of this blog. Thank you for taking the time to read it!
References
[1] Japan Statistical Society Certificate, Grade Pre-1 examination (統計検定準一級)
[2] Wang, D., Lecture Notes, University of South Carolina
[3] Buteikis, A., 02 Stationary time series
[4] AirPassengers dataset, Kaggle
[5] Eshel, G., The Yule Walker Equations for the AR coefficients
[6] Bartlett, P., Introduction to Time Series Analysis, Lecture 12, UC Berkeley
[7] pmdarima: ARIMA estimators for Python