Statistics and Mathematics

Statistics: Multivariate time series analysis — fundamental concepts, VMA, VAR and VARMA

The mathematical understanding of VMA, VAR, and VARMA and practical Python implementation

Yuki Shizuya
Intuition


Photo by Maxim Hopman on Unsplash

Multivariate time series is a topic that often goes unmentioned in university classes. However, real-world data usually has multiple dimensions, and we need multivariate time series analysis techniques. In this blog, we will learn about multivariate time series concepts with visualization and Python implementation [1]. I assume that readers already know univariate time series analysis. If not, you can refer to my previous blog [2].

Table of Contents

  1. What are Multivariate time series?
  2. Vector moving average processes (VMA)
  3. Vector autoregressive process (VAR)
  4. Vector autoregressive moving average process (VARMA)
  5. Application: US monthly retail sales revenue

1. What are Multivariate time series?

As its name suggests, a multivariate time series is multi-dimensional data indexed by time. We can define multivariate time series data mathematically as follows:

The multivariate time series notation

where Zᵢ,ₜ is the i-th component variable at time t; note that it is a random variable for each i and t. The vector Zₜ has dimension (m, 1). When we analyze a multivariate time series, we cannot apply standard statistical theory directly. What does that mean? Recall multiple linear regression.

The multiple linear regression formula

When calculating the likelihood of multiple linear regression (1), we assume that each observation (= xᵢ) in the sample is independent of the others. Thus, we can compute the likelihood simply as the product of the probability densities of the individual observations. Generally, we assume the observations follow a normal distribution with the following parameters (2).

The gaussian distribution of the multiple linear regression and its maximum likelihood estimation

However, in a multivariate time series, the observations Zₜ are dependent across both components and time. Thus, we cannot apply the same independence assumption as in multiple linear regression. To analyze multivariate time series, we need to know some fundamental concepts. Let's dive into them.

Stationarity

In the univariate case, a time series is weakly stationary when its mean and variance are constant over time and its autocovariance depends only on the time lag. Similarly, an m-dimensional multivariate time series is stationary if its mean vector is constant over time and its lag-k covariance matrices depend only on the lag k; in particular, each component series is then weakly stationary. The following figure illustrates this situation for intuitive understanding.

The illustration of the stationarity of the multivariate time series

Covariance and Correlation matrix

Now, let’s consider the statistics about the stationary multivariate time series process Zₜ. The mean of the m-dimensional multivariate time series process can be written as:

The expected value of the m-dimensional multivariate time series

The mean vector has (m, 1) dimension. On the other hand, the lag k covariance matrix will be as follows:

The covariance matrix of the m-dimensional multivariate time series

Here is the proof:

The proof of the covariance matrix of the multivariate time series

When k = 0, the matrix Γ(0) is simply the contemporaneous variance-covariance matrix of the component series. When k > 0, the matrix Γ(k) generalizes the autocovariance of a univariate time series: it contains not only the autocovariance of each variable with its own past, but also the cross-covariances between one variable and the lagged values of the others.

For the correlation matrix, we simply normalize the covariance matrix by the component variances.

The correlation matrix of the multivariate time series

where D is the diagonal matrix whose i-th diagonal element is the variance of the i-th component series. Thus, the i-th diagonal element of ρ(k) is the autocorrelation function of the i-th component series Zᵢ,ₜ, whereas the (i, j)-th off-diagonal element of ρ(k) is the cross-correlation function between the component series Zᵢ,ₜ and Zⱼ,ₜ.
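As a small numerical illustration, the sample versions of Γ(k) and ρ(k) can be computed with NumPy as in the sketch below. The function names are placeholders, the 1/n scaling follows the definition above, and the white-noise input is just a toy series.

import numpy as np

def lag_cov_matrix(Z, k):
    # Sample lag-k covariance matrix Gamma(k) of an (n, m) array Z
    n, m = Z.shape
    Zc = Z - Z.mean(axis=0)              # center each component series
    # average of outer products between Z_{t+k} and Z_t
    return Zc[k:].T @ Zc[:n - k] / n

def lag_corr_matrix(Z, k):
    # Sample lag-k correlation matrix rho(k), normalized by component variances
    d = np.sqrt(np.diag(lag_cov_matrix(Z, 0)))   # component standard deviations
    return lag_cov_matrix(Z, k) / np.outer(d, d)

# toy usage on white noise: rho(0) is close to the identity, rho(1) is close to 0
Z = np.random.default_rng(0).normal(size=(500, 3))
print(lag_corr_matrix(Z, 0).round(2))
print(lag_corr_matrix(Z, 1).round(2))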

Vector White noise process

The m-dimensional vector process aₜ is called a vector white noise process if it satisfies the following conditions.

Vector white noise process (VWN)

where Σ is an (m × m) symmetric covariance matrix. This is the extension of white noise to the multivariate case. As with univariate white noise, the process is uncorrelated across time: the covariance between a vector and any of its time-lagged vectors is zero. However, the components may be contemporaneously correlated with each other. Generally, we assume Gaussian white noise, meaning aₜ follows a multivariate Gaussian distribution, and we write it as VWN(0, Σ).
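For instance, a two-dimensional Gaussian VWN(0, Σ) can be simulated as follows; the covariance matrix Σ here is an illustrative choice. The draws are contemporaneously correlated but uncorrelated across time.

import numpy as np

rng = np.random.default_rng(42)

# an illustrative contemporaneous covariance matrix Sigma (m = 2)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

# draw n independent samples a_t ~ N(0, Sigma): white across time,
# but contemporaneously correlated between the two components
n = 1000
a = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=n)

print(np.cov(a.T).round(2))   # close to Sigma (the lag-0 covariance)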

We now understand the fundamental concepts of multivariate time series. In the next section, we will explore representative vector time series processes, such as VMA, VAR, and VARMA.

2. Vector moving average processes (VMA)

The vector moving average (VMA) process is the multivariate version of the moving average (MA) process. Let's quickly recap the MA process. The moving average (MA) process is a sum of the current and previous shocks Uₜ. The MA(q) process can be written as follows.

MA(q) process

Uₜ is usually assumed to be white noise. Intuitively, the time series {Yₜ} of an MA(q) process is built from the current shock and the previous q shocks. Note that the MA(q) process is always weakly stationary: the mean and the variance are constant, and the covariance depends only on the time lag. Using the backshift operator B, we can rewrite the formula as follows:

MA(q) process written by the backshift operator

As you can see, this gives a much more compact expression, and many textbooks use this notation. The MA(q) process has the following properties.

MA(q) process properties

Back to the VMA: it is just the vector (i.e., multivariate) version of the MA process! To understand the VMA process intuitively, let's consider an m-dimensional VMA(1) example.

VMA(1) process

where Zₜ, 𝜇, and aₜ have dimension (m, 1), and 𝚯 has dimension (m, m). Note that 𝚯₀ is the identity matrix. aₜ is a sequence of m-dimensional Gaussian white noise VWN(0, Σ). The VMA(1) process has the following properties.

The properties of the VMA(1) process

The mean of the VMA(1) process is always 𝜇 because the remaining terms are sums of Gaussian white noise, which has mean 0. The covariance matrix, on the other hand, is a bit trickier, so let's derive it in detail. Firstly, 𝛤(0), which is the variance matrix, can be derived as follows:

The variance matrix of the VMA(1) process

The same procedure can be used to calculate the following.

The lagged 1 covariance matrix of the VMA(1) process

We attach the negative sign to 𝚯 in 𝛤(1) for convenience.

The lagged k covariance matrix of the VMA(1) process

As you can see, the mean and autocovariance are analogous to the univariate case; the difference is that the VMA parameters are matrices rather than scalars. This may still feel abstract, so let's check a concrete example with visualization. We assume the following conditions.

The example of the VMA(1) process

For simplicity, we set the mean vector to zero. We try four example coefficient matrices (B1 to B4). The figure below shows 100 simulated samples for each coefficient case.

The VMA(1) process for each coefficient matrix

In the B1 case, the two series are independent MA(1) processes. In the B2 and B3 cases, one series is an independent MA(1) process, while the other is correlated with it. In the B4 case, the two series are correlated with each other. Again, the VMA process is just the multivariate version of the MA process, but since there are more variables, we have to consider correlations between the components.
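To make this concrete, here is a minimal simulation sketch of the VMA(1) cases above, assuming zero mean, identity shock covariance, and the coefficient matrices B1 to B4 shown in the code; they are illustrative choices, not necessarily the exact values behind the figure.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def simulate_vma1(theta, n=100, sigma=np.eye(2)):
    # Z_t = a_t - theta @ a_{t-1}, with a_t ~ VWN(0, sigma)
    a = rng.multivariate_normal(np.zeros(2), sigma, size=n + 1)
    return a[1:] - a[:-1] @ theta.T

# illustrative VMA(1) coefficient matrices
B1 = np.array([[0.5, 0.0], [0.0, 0.5]])  # two independent MA(1) series
B2 = np.array([[0.5, 0.5], [0.0, 0.5]])  # series 1 also loads on series 2's shock
B3 = np.array([[0.5, 0.0], [0.5, 0.5]])  # series 2 also loads on series 1's shock
B4 = np.array([[0.5, 0.5], [0.5, 0.5]])  # both series load on both shocks

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, B, name in zip(axes.ravel(), [B1, B2, B3, B4], ['B1', 'B2', 'B3', 'B4']):
    ax.plot(simulate_vma1(B))
    ax.set_title(name)
plt.tight_layout()
plt.show()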

Now, let’s expand the topic to the VMA(q) process. The m-dimensional VMA(q) process is given by:

The formulation of the VMA(q) process

aₜ is a sequence of the m-dimensional Gaussian white noise process VWN(0, Σ). The VMA(q) process has the following properties.

The properties of the VMA(q) process

The mean of the VMA(q) process is always 𝜇 because the VMA(q) process is built from VWN terms with mean 0. We can calculate the covariance matrix function of the VMA(q) process as follows.

The covariance matrix of the VMA(q) process

Thus, the VMA(q) process is always stationary, and its covariance matrix cuts off after lag q. As with the MA process, we can use the correlation matrices or the AIC to determine the order q. The AIC is more convenient for choosing the order, but note that it requires fitting models for all candidate orders, which can take a lot of computation.

To close the VMA section, let's look at an important concept called invertibility. The VMA(q) process is invertible if we can write it in an autoregressive representation as follows:

The invertibility of the VMA(q) process

Note that 𝚯⁺(B) is the adjugate (adjoint) matrix of 𝚯(B). We can derive equation (4) as follows:

The AR representation of the MA process

Thus, if the stochastic process Zₜ is invertible, it has an infinite autoregressive representation (AR(∞)). If all roots of the determinant equation |𝚯q(B)| = 0 lie outside the unit circle, the series is invertible.
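As a quick numerical check, for a VMA(1) process the invertibility condition reduces to all eigenvalues of 𝚯₁ having modulus less than one, because each root B of |I − 𝚯₁B| = 0 is the reciprocal of a nonzero eigenvalue of 𝚯₁. A minimal sketch with illustrative matrices:

import numpy as np

# illustrative VMA(1) coefficient matrices
theta_ok  = np.array([[0.5, 0.2], [0.0, 0.4]])   # invertible
theta_bad = np.array([[1.2, 0.0], [0.0, 0.5]])   # not invertible (eigenvalue 1.2)

for theta in (theta_ok, theta_bad):
    eigvals = np.linalg.eigvals(theta)
    # all roots of |I - theta*B| = 0 lie outside the unit circle
    # exactly when every eigenvalue of theta has modulus < 1
    print(np.abs(eigvals).round(2), "invertible:", np.all(np.abs(eigvals) < 1))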

3. Vector autoregressive process (VAR)

The vector autoregressive (VAR) process is the multivariate version of the autoregressive (AR) process, just as the VMA is for the MA. Let's quickly recap the AR process. The autoregressive (AR) process uses the previous values of the series to explain the current value. The AR(p) process can be written as follows.

The AR(p) process

Uₜ is assumed to be white noise. The second equation expresses the AR(p) process using the backshift operator. The AR(p) process is weakly stationary if all roots of the determinant equation |𝚽(B)| = 0 lie outside the unit circle.

Back to the VAR: again, it is just the vector (i.e., multivariate) version of the AR process! To understand the VAR process intuitively, let's consider an m-dimensional VAR(1) example.

The VAR(1) process formulation

where Zₜ and aₜ have dimension (m, 1), and 𝚽 has dimension (m, m). aₜ is a sequence of m-dimensional Gaussian white noise VWN(0, Σ). The second equation expresses the VAR(1) process with the backshift operator. The model is always invertible because it is already in autoregressive form. If all roots of the determinant equation |I − 𝚽₁B| = 0 lie outside the unit circle, the VAR(1) process is stationary. Furthermore, we can rewrite the equation as follows:

The stationary formula of the VAR(1) process

Now, let’s check the concrete example with visualization. We assume that we have the following conditions.

The VAR(1) process for each coefficient matrix

In the B1 case, the two series are independent AR(1) processes. In the other cases, one series follows the other. Clearly, the last case is not stationary. To verify stationarity properly, we can compute the eigenvalues from formula (6). The result is shown below.

# sample coefficients of the VAR(1) process
import numpy as np

B1 = np.array([[0.5, 0.0], [0.0, 0.5]])
B2 = np.array([[0.5, 0.5], [0.0, 0.5]])
B3 = np.array([[0.5, 0.0], [0.5, 0.5]])
B4 = np.array([[0.5, 0.5], [0.5, 0.5]])

# calculate the eigenvalues of (I - B); a zero eigenvalue means
# |I - B*z| = 0 at z = 1, i.e. a unit root, so the process is non-stationary
for i, B in enumerate([B1, B2, B3, B4]):
    X = np.eye(2) - B
    w, v = np.linalg.eig(X)
    print(w)
The eigenvalues of each coefficient matrix

Now we can verify that B1, B2, and B3 are stationary, but B4 is non-stationary. You can also check whether each individual series is stationary using the Augmented Dickey-Fuller test, as in the univariate case. However, note that checking each component separately is not sufficient to establish stationarity of the whole VAR process.
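For reference, here is a minimal sketch of the Augmented Dickey-Fuller test from statsmodels applied column by column; the two toy series (a random walk and white noise) are just placeholders for your own component series.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# df is assumed to hold one component series per column
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'z1': rng.normal(size=200).cumsum(),   # random walk: non-stationary
    'z2': rng.normal(size=200),            # white noise: stationary
})

for col in df.columns:
    stat, pvalue = adfuller(df[col])[:2]
    print(f"{col}: ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")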

Again, the VAR process is just the multivariate version of the AR process (like the VMA process), but since there are more variables, we have to consider correlations between components.

Now, let’s expand the topic to the VAR(p) process. The m-dimensional VAR(p) process is given by:

The VAR(p) process formulation

where Zₜ and aₜ have dimension (m, 1), and each 𝚽 has dimension (m, m). aₜ is a sequence of m-dimensional Gaussian white noise VWN(0, Σ). Like the AR(p) process, the VAR(p) process is always invertible, and it is stationary if all roots of |Φₚ(B)| = 0 lie outside the unit circle. These are the same conditions as for the AR(p) process, just in vectorized form.

When the VAR(p) process is stationary, it has the following mean and covariance.

The properties of the VAR(p) process

Firstly, we can derive the mean as follows:

The derivation of the mean of the VAR(p) process

Deriving the covariance is trickier. As the first step, we need to derive 𝛉, the bias term.

The derivation of the bias term

We can derive the second equation because 𝜇 is constant over time. Next, we need to transform the VAR(p) equation.

Does the last equation look familiar? We saw this form in the VMA section: if the VAR(p) process is stationary, it can be written in a VMA representation.

The VMA representation of the VAR process

Then, the covariance matrix is computed as:

The covariance matrix of the VAR(p) process

This formula is derived from the generalized Yule-Walker matrix equations. I skip the full derivation in this blog, but you can refer to this lecture PDF [3]. In summary, when a VAR or VMA process satisfies the appropriate condition, it can be converted into the other representation.

As with the AR process, we can use the partial correlation matrices to find the order. The AIC can also be used and is more convenient, but note that it requires fitting models for all candidate orders, which takes a lot of computation. Therefore, when there are many variables and potentially many lags, using the correlation matrices is still a good method.
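In statsmodels, this AIC-based search over lag orders is available through VAR.select_order, which fits every lag up to maxlags and reports several criteria. A minimal sketch on simulated data; the VAR(2) coefficient matrices are purely illustrative.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# toy two-dimensional VAR(2) data just for demonstration
rng = np.random.default_rng(0)
n = 300
z = np.zeros((n, 2))
phi1 = np.array([[0.5, 0.1], [0.0, 0.4]])
phi2 = np.array([[0.2, 0.0], [0.1, 0.2]])
for t in range(2, n):
    z[t] = phi1 @ z[t - 1] + phi2 @ z[t - 2] + rng.normal(size=2)

model = sm.tsa.VAR(pd.DataFrame(z, columns=['z1', 'z2']))
selection = model.select_order(maxlags=8)
print(selection.summary())            # AIC, BIC, FPE, HQIC for each candidate lag
print(selection.selected_orders)      # best lag under each criterion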

We've learned about the VMA and VAR processes so far. In the univariate setting, there is the ARMA process, which combines the AR and MA processes. Likewise, there is VARMA for multivariate time series. In the final theoretical section, we will learn about the VARMA process.

4. Vector autoregressive moving average process (VARMA)

The VARMA process is the combination of the VAR and VMA processes. The m-dimensional vector autoregressive moving average (VARMA) process has orders p and q, corresponding to the VAR and VMA parts, and can be written as:

The VARMA(p, q) process formulation

We call it VARMA(p, q). We can reformulate this equation as:

As before, Zₜ and aₜ have dimension (m, 1), and the 𝚽 and 𝚯 matrices have dimension (m, m). aₜ is a sequence of m-dimensional Gaussian white noise VWN(0, Σ).

Since the VMA part is always stationary, stationarity depends only on the VAR term: when all roots of |Φₚ(B)| = 0 lie outside the unit circle, the VARMA process is stationary. Meanwhile, invertibility depends only on the VMA term: if all roots of the determinant polynomial |𝚯q(B)| = 0 lie outside the unit circle, the VARMA process is invertible.
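For example, for a VARMA(1, 1) both conditions can be checked from eigenvalues: the process is stationary when all eigenvalues of 𝚽₁ have modulus below one, and invertible when all eigenvalues of 𝚯₁ do. A minimal sketch with illustrative matrices:

import numpy as np

# illustrative VARMA(1, 1) coefficient matrices
phi1   = np.array([[0.6, 0.1], [0.0, 0.5]])   # VAR part
theta1 = np.array([[0.3, 0.0], [0.2, 0.4]])   # VMA part

stationary = np.all(np.abs(np.linalg.eigvals(phi1)) < 1)
invertible = np.all(np.abs(np.linalg.eigvals(theta1)) < 1)
print("stationary:", stationary, "| invertible:", invertible)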

Let's recap the VMA, VAR, and VARMA processes. They follow the same ideas as their univariate counterparts, but they operate on multi-dimensional data. Thus, we need to think not only about the correlation between the current and previous time steps but also about the correlation among variables. This is the end of the theoretical part. Now, let's implement these models on a concrete example!

5. Application: US monthly retail sales revenue

In this final section, we implement multivariate time series models in Python. We will use the US monthly retail sales revenue data referenced in [1]. It contains five industries, AUT (automobile), BUM (building materials), GEM (general merchandise), COM (consumer materials), and HOA (household appliances), from June 2009 to November 2016, so n = 90. Each time series looks like this:

The graphs of each time series

Clearly, they are non-stationary. Thus, let's take the first difference of each series.

The graphs of each time series taken differencing

Now it looks a little better. There seems to be seasonality, i.e., a regular, periodic pattern, but I'll ignore it for simplicity in this blog. The differenced series are still not perfectly stationary, but I will proceed with VAR and VARMA models. In Python, you can easily implement VAR and VARMA modeling using statsmodels (statsmodels can also fit a pure VMA model).
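A minimal sketch of the differencing step is below; the file name and index column are placeholders for wherever the retail sales data is stored, and diff_df is the differenced DataFrame used in the modeling code that follows.

import pandas as pd

# placeholder file and column names: adjust to wherever the retail data is stored
df = pd.read_csv('us_retail_sales.csv', index_col='date', parse_dates=True)

# first difference to remove the trend, dropping the NaN created in the first row
diff_df = df.diff().dropna()
diff_df.plot(subplots=True, figsize=(10, 8))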

VAR modeling

For VAR modeling, we can set the maximum order to consider, and the model automatically chooses the best lag based on a given criterion, such as the AIC. In the following example, I chose 5 as the maximum lag.

import statsmodels.api as sm

var = sm.tsa.VAR(diff_df)
result = var.fit(maxlags=5, ic='aic')
result.summary()
The part of the VAR estimation result
The correlation matrix of residuals

In this case, the best model uses five lags. The correlation matrix of the residuals shows remaining correlation between some variables, such as COM and GRO, so the model does not capture all the dependence in the data. The plot of the autocorrelation function of the residuals is below.

The autocorrelation plot for residuals

As you can see, some correlation remains. Therefore, there is room to improve the model by adjusting its parameters.
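One way to quantify the remaining correlation is a Ljung-Box test on each residual series; a minimal sketch, assuming result is the fitted VAR result from the code above:

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# result.resid holds the VAR residuals, one column per component series
resid = np.asarray(result.resid)
for i in range(resid.shape[1]):
    lb = acorr_ljungbox(resid[:, i], lags=[10])
    print(f"series {i}: Ljung-Box p-value at lag 10 = {lb['lb_pvalue'].iloc[0]:.3f}")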

VARMA modeling

For VARMA modeling, there is no built-in function to select the best order for us, so we need to iterate over combinations of orders ourselves.

# modeling: grid-search the (p, q) orders by AIC
import pandas as pd
import statsmodels.api as sm

rows = []
for p in range(1, 5):
    for q in range(1, 5):
        model = sm.tsa.VARMAX(diff_df, order=(p, q))
        result = model.fit(maxiter=1000, disp=False)
        rows.append({'p': p, 'q': q, 'AIC': result.aic})

res_df = pd.DataFrame(rows, columns=['p', 'q', 'AIC'])
res_df.sort_values(by=['AIC']).head()

It takes several minutes to complete. Note that VARMA estimation is often numerically unstable, so the VAR model is usually preferable in practice. Here are the results.

The result of the VARMA model

As you can see, VARMA(4, 1) is the best model in this case, with an autoregressive order close to that of the VAR model. Based on the VAR and VARMA results, a smaller MA order q is preferable. If we want a better-fitting model, we need to account for the seasonality component; Meta's Prophet library is a convenient option for that and is worth trying.

The code I used is here:

This is the end of this blog. Thank you for reading!
