Cointegration of Time-Series in R

Using the possible relationship between vaccination rate and excess mortality as a showcase.

Dr. Marc Jacobs
15 min read · Sep 6, 2022

This post is not about making a case for a causal relationship between Covid-19 vaccines (booster series) and excess mortality. What it will do is show how to analyze such data using time-series methods such as cointegration. Despite the heaviness of the topic, this is not the first time, nor will it be the last, that I have posted analyses on vaccines and excess mortality, Covid-19 and mortality, ivermectin and mortality, vaccines and stillbirth, variants, and lockdown measures and important epidemiological outcomes. I am sure I am not the only one.

Also, this is not my first post about time-series, nor my first about cointegration, but I will dig a bit deeper into this particular case because it is a challenging one. There are plenty of clean cases out there that give you nice examples of the techniques, so it is for the best if there are also sufficient dirty cases: cases in which data science, modelling, and statistics do not do what they should do.

So let's take a look at trying to connect two time-series, vaccine uptake and excess mortality, via cointegration methods.

First, we load in the libraries.

rm(list = ls())
library(readxl)
library(ggplot2)
library(lattice)
library(tidyverse)
library(lubridate)
library(stringr)
library(xts)
library(urca)
library(dynlm)
library(tseries)
library(quantmod)
library(lmtest)
library(sandwich)
library(vars)

Then we load in the data, which looks like this.

oversterfte <- read_excel("OVERSTERFTE MARC JACOBS AUG 2022.xlsx")
oversterfte%>%print(n=40)
# A tibble: 27 x 4
Year Week VAX Death
<dbl> <dbl> <dbl> <dbl>
1 2022 7 0 -81
2 2022 8 0 -86
3 2022 9 20880 -168
4 2022 10 104770 23
5 2022 11 196340 302
6 2022 12 250430 411
7 2022 13 271740 416
8 2022 14 251170 505
9 2022 15 174670 515
10 2022 16 138460 279
11 2022 17 81470 231
12 2022 18 53040 245
13 2022 19 42050 262
14 2022 20 45300 218
15 2022 21 42050 112
16 2022 22 12820 169
17 2022 23 29230 166
18 2022 24 39390 230
19 2022 25 71770 370
20 2022 26 66660 330
21 2022 27 51750 204
22 2022 28 81440 326
23 2022 29 71770 408
24 2022 30 42050 365
25 2022 31 58220 199
26 2022 32 4520 218
27 2022 33 37530 441

To make it a full-fledged time-series, we need to add a date variable, so let's do that.

oversterfte_tbl <- oversterfte %>% 
  mutate(beg = ymd(str_c(Year, "-01-01")),
         date_var = beg + weeks(Week))
Now we have a time-series.

Since the two series operate on very different scales, and ggplot2 has a rather ridiculous way of handling a second y-axis, I will turn to the lattice package and plot the two time-series in one graph. Next to that, I will draw a simple correlation plot between the two series. Now, there is a lot to criticize about running a simple correlation on data that is already autocorrelated, but it does offer some assistance. In the end, it is just points of data you can plot on an x- and a y-axis.

VAX <- xyplot(VAX ~ date_var,
              data = oversterfte_tbl, type = "l")
Death <- xyplot(Death ~ date_var,
                data = oversterfte_tbl, type = "l")
latticeExtra::doubleYScale(VAX, Death, style1 = 0, style2 = 3, add.ylab2 = TRUE)
ggplot(oversterfte,
       aes(x = VAX,
           y = Death)) +
  geom_point() +
  geom_smooth(fill = "blue", alpha = 0.1) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, col = "red", lty = 2) +
  theme_bw()
The left plot is the most informative, and it does show some heavy mirroring, which seems to disintegrate at the end. We will see later why the end makes this analysis so difficult, and thus also so interesting.

Now, time-series analysis is all about finding patterns and disentangling the data. It is also about making sure the data are stationary, which means that the mean of the series does not depend on time.

“A stationary time series is one whose properties do not depend on the time at which the series is observed.”

What it basically means is that your data should not show a trend or seasonality (cycles are fine). Personally, I have never really liked this, but then again I do not like normalizing, centering, scaling, or any other form of transformation either. However, in time-series the assumption of stationarity cannot really be ignored, since it will affect your predictions, although you see more and more machine learning algorithms being deployed on time-series data that do not care about the assumption.
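
To make the idea concrete, here is a tiny simulated example (purely illustrative, not the vaccine data): white noise is stationary, a random walk is not, and the contrast between their ACF plots is exactly the pattern we will look for below.

set.seed(123)                                  # illustrative simulation only
stationary_sim <- rnorm(200)                   # white noise: mean and variance constant over time
randomwalk_sim <- cumsum(rnorm(200))           # random walk: wanders, so non-stationary
par(mfrow = c(2, 2))
plot.ts(stationary_sim); acf(stationary_sim)   # ACF dies out almost immediately
plot.ts(randomwalk_sim); acf(randomwalk_sim)   # ACF decays very slowly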

So, one of the first things we need to check is whether the data are stationary or not, and that also ties into the definition of cointegration. What cointegration means is that the time-series are fundamentally connected: if one goes up, the other goes up as well, like two tiny boats sailing the waves with an invisible cord connecting them.

Officially, two series A and B are cointegrated if each series on its own is non-stationary (it shows a trend), while the difference between them is stationary. If the series are not cointegrated, the difference is non-stationary as well (and thus still shows a trend or seasonality). This is an assumption that can be tested using a unit root test, which we will do below.

Now, there is more than one way to test for cointegration, and we will explore several, but first let's go with the definition used above. Hence, what we are looking for are two non-stationary series whose difference is stationary.
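
Before we throw the real data at it, here is a quick simulated sketch of that definition (again illustrative only): two series that share the same random-walk component are each non-stationary, but their difference is stationary.

set.seed(42)                           # illustrative simulation only
common_trend <- cumsum(rnorm(200))     # shared stochastic trend
A <- common_trend + rnorm(200, sd = 0.5)
B <- common_trend + rnorm(200, sd = 0.5)
tseries::adf.test(A)                   # typically cannot reject the unit root
tseries::adf.test(B)                   # same here
tseries::adf.test(A - B)               # the difference: small p-value, stationary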

The easiest way to check for non-stationary data is to check for auto-correlation, which means that a value at one point in time depends on previous values. Normally, the way to remove this kind of auto-correlation is to difference the series, so I will do that and add the auto-correlation plots as well. Do note that stationary does not mean that there is NO autocorrelation: for a stationary series the autocorrelation drops off quickly, while for a non-stationary series it decays slowly.

par(mfrow = c(3, 4))
VAXts <- oversterfte_tbl %>%
  dplyr::select(VAX) %>%
  ts(., start = c(2022, 7),
     end = c(2022, 33), frequency = 52)
plot(VAXts)
acf(VAXts)
plot(diff(VAXts))
acf(diff(VAXts))
Deathts <- oversterfte_tbl %>%
  dplyr::select(Death) %>%
  ts(., start = c(2022, 7),
     end = c(2022, 33), frequency = 52)
plot(Deathts)
acf(Deathts)
plot(diff(Deathts))
acf(diff(Deathts))
VECM_ECT <- Deathts - VAXts
plot(VECM_ECT)
acf(VECM_ECT)
plot(diff(VECM_ECT))
acf(diff(VECM_ECT))
The first column shows the original time-series of VAX, Death, and their difference (Death - VAX). The second column shows the auto-correlation for each of those series. The third and fourth columns show the differenced time-series and the auto-correlation plots that go with them.

From the above there is already a hint that the first definition of cointegration will not suffice here. Although both series definitely show a trend, their difference does as well. This type of auto-correlation persists even after differencing, mostly because of that strange first bump and the subsequent fade-off, which will cause quite some difficulty later on. Another thing I can do, which I did earlier graphically, is to look at the correlation and the cross-correlation functions.

cor(VAXts, Deathts)
ccfvalues<-ccf(as.numeric(VAXts), as.numeric(Deathts)); ccfvalues

As you can see, the correlation is not really high, and the ccf values are given per lag. They are strongest at lags close to zero, but fade away very quickly as well, which is what you may expect if you look at the original graph again. In the end, it is very difficult to predict one step ahead by looking multiple steps back.

The plot clearly shows that the majority of the cross-correlation happens in the vicinity of the value itself. There also seems to be something happening about five lags apart, although that does not really show in the graph; it would mean that a value now depends on a value five steps back.

There are of course multiple ways of plotting cross-correlated data, and in the astsa package we find one more.

astsa::lag2.plot(as.numeric(VAXts),
                 as.numeric(Deathts), 10)
astsa::lag2.plot(as.numeric(Deathts),
                 as.numeric(VAXts), 10)

Below you see, both ways around, the correlation between a value in the Death or VAX time-series and a lagged value in the other series.

Correlation is highest the closer the lag is to zero. Further away, it does not really amount to anything.

Now the time has come to use the infamous Dickey-Fuller test, and many others, to look for a unit root. If a time-series has a unit root, it is NOT stationary and needs to be differenced. The Dickey-Fuller test, or here the Augmented Dickey-Fuller (ADF) test, is the go-to test for this. The null hypothesis is that a unit root exists, meaning that if the p-value is lower than a given threshold, say 0.05, the data are stationary or trend-stationary. The latter means that a series becomes stationary once the trend is removed. Hence, the test can be run in multiple ways, removing nothing, a constant, or a trend. We will remove the trend to test for stationarity.

The ur.df function is used, and I will base the number of lags on the previous acf plots, although an automatic selection function is also available.

ADF_VAXts <- ur.df(VAXts,
                   lags = 2,
                   selectlags = "AIC",
                   type = "trend")
summary(ADF_VAXts)
plot.ts(ADF_VAXts@res, ylab = "Residuals")
abline(h = 0, col = "red")
tsm::ac(ADF_VAXts@res, max.lag = 20)
The summary coming from the ur.df function.

Now, what the above shows is the Augmented Dickey-Fuller test for a unit root. The three test statistics are -5.28, 9.37 and 14.06, which are the values of tau3, phi2, and phi3, respectively: tau3 tests the unit root itself in the model with a trend, while phi2 and phi3 are joint tests that also involve the drift and trend terms. What you can see is that all three are beyond their critical values, which means that the data ARE stationary (remember, the null hypothesis assumes non-stationary data). The fact that all three are significant means that the conclusion does not hinge on the deterministic terms: we have a stationary series. This agrees with the earlier observation that the VAX time-series has a quickly declining auto-correlation.

The residuals coming from the test.
Autocorrelation of residuals coming from the test. No remaining auto-correlation to be detected.
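
If you would rather pull the numbers out of the object than read them from the summary, the ur.df result stores them in slots; a small convenience sketch (slot names as provided by the urca package):

ADF_VAXts@teststat   # tau3, phi2 and phi3 in one row
ADF_VAXts@cval       # the corresponding 1%, 5% and 10% critical values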

Of course, there are multiple ways to test for a unit root, and we will apply another one as well: the Elliott-Rothenberg-Stock test via ur.ers. Here, the critical value is lower than the test statistic at the 1% level, so this test also hints at stationary data.

summary(ur.ers(VAXts,
               model = "trend",
               lag.max = 2))

And then, of course, there are many more tests one can apply. Since this is R, there is so much you can test and do, many times over. Personally, I prefer graphics over statistical testing, but since this should be an informative blog I will show you both.

tseries::adf.test(VAXts)
tseries::kpss.test(VAXts, null = "Trend")
Box.test(VAXts, lag = 2,
         type = "Ljung-Box")
The null hypothesis of a unit root at lag 2 is rejected.
KPSS: using this test statistic, the p-value is greater than 0.1, which is in agreement with the rest, because this null hypothesis assumes a stationary series. So, here, H0 means trend-stationary, while for the ADF test H0 means a unit root and thus non-stationarity. They do not make this easy on you.
Box-Ljung: this test looks at the presence of auto-correlation. A low p-value means that the data are not i.i.d. (not independently distributed). There is serial correlation.

So, now we know how to do it, let's apply it to the rest as well.

ADF_Deathts <- ur.df(Deathts,
                     lags = 2,
                     selectlags = "AIC",
                     type = "trend")
summary(ADF_Deathts)
plot(ADF_Deathts)
tsm::ac(ADF_Deathts@res, max.lag = 20)
tsm::gts_ur(ADF_Deathts)
summary(ur.ers(Deathts,
               model = "trend",
               lag.max = 2)) # lag at 2
tseries::adf.test(Deathts)
tseries::kpss.test(Deathts, null = "Trend")
Box.test(Deathts, lag = 2,
         type = "Ljung-Box")
The test statistics exceed the critical values at least at the 5% level.
The residuals show no auto- or partial auto-correlation.
The test statistic does not exceed the critical value at the 5% level, meaning that the data are non-stationary.
The null hypothesis is not rejected, so non-stationary.
The null hypothesis is not rejected, so stationary.
And there is serial correlation.

Now, this time-series, the deaths, is quite difficult to interpret. There is for sure auto-correlation; it drops off fairly quickly, and differencing cancels whatever auto-correlation is left. Hence, one would say that the data are non-stationary and have to be differenced to become so, but not all tests agreed. Of course, statistical tests are sample-size dependent, and this time-series is really not that long, and it is quite funky.

So, what is up next is to check and assess the difference between the two univariate time-series. Now, I do know that they are wildly different in scale, but that should not necessarily matter; it is the behaviour of the difference that matters.

VECM_ECT <- Deathts - VAXts
ADF_DeathtsVAX <- ur.df(VECM_ECT,
                        lags = 2,
                        selectlags = "AIC",
                        type = "trend")
summary(ADF_DeathtsVAX)
plot(ADF_DeathtsVAX)
tsm::ac(ADF_DeathtsVAX@res, max.lag = 20)
tsm::gts_ur(ADF_DeathtsVAX)
summary(ur.ers(Deathts - VAXts,
               type = "DF-GLS",
               model = "trend",
               lag.max = 2))
adf.test(VECM_ECT, k = 6)
adf.test(VECM_ECT, k = 2)
Box.test(VECM_ECT, lag = 6,
         type = "Ljung-Box")
The test statistics beat all the critical values: this is a time-series that is stationary around a constant and a trend.
No auto-correlation remains with such a model, meaning the series is stationary after including the constant and the trend.
The time-series is stationary.
At lag six, non-stationary.
However, at lag 2 it is. So, it really depends on what kind of lags you test with, and I chose my lags based on the acf plots shown way above, the ones of the raw time-series.
The data are serially correlated, which is to be expected looking at the plots.

So, the first definition of cointegration said that two non-stationary series should have a stationary difference if they are cointegrated. That did not happen here. However, looking at the graph, there definitely is some co-movement in the beginning, and it seems some is happening at the end as well, but it becomes fuzzy quite quickly. Luckily, there are multiple ways of testing for cointegration, one of which is dynamic regression. If you want to read more about cointegration, please take a look at this excellent book. It discusses three methods for assessing cointegration, the first of which we have just used. Another blog that talks about the same three approaches, actually using data from the book, can be found here. A quick comparison will reveal that I have certainly borrowed code from that blog.

Now, off to the second method, which involves dynamic linear regression. If two univariate time-series are cointegrated, then the OLS estimator of the coefficient in the cointegrating regression is consistent. Because of the inherent properties of time-series, plain OLS is not sufficient, and one has to turn to dynamic OLS, in which lags and leads can be incorporated (a rough sketch of such a specification follows a bit further below).

So let's turn to the dynlm package and try to cointegrate the VAX and Death time-series. Below you see a simple example of a dynamic regression, without lags included, and its errors. The errors themselves are of particular interest, especially whether they behave like a stationary process, which is what we would expect under cointegration.

VAX_Death <- dynlm(Deathts ~ VAXts)
summary(VAX_Death)
par(mfrow = c(2, 2))
plot(VAX_Death)
z_hat <- resid(VAX_Death)
dev.off();plot(z_hat)
acf(z_hat)
The summary coming from the time-series regression.
Evaluation of the model shows the same attributes that one might expect from a normal linear regression.
The residuals, plotted.
tsm::gts_ur(z_hat)
The residuals are only stationary if we include a constant and a trend; otherwise they are not.
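
As mentioned earlier, a fuller dynamic OLS specification would also add leads and lags of the differenced regressor to the levels regression. A rough sketch of what that could look like (one lead and one lag only; this assumes dynlm's L() accepts negative orders as leads, so treat it as illustrative rather than definitive):

DOLS_sketch <- dynlm(Deathts ~ VAXts + L(d(VAXts), -1:1))  # leads and lags of d(VAX)
summary(DOLS_sketch)
acf(resid(DOLS_sketch))   # under cointegration the residuals should again look stationary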

From the aTSA package, we can also ask for a cointegration test in which the Engle-Granger (EG) test is performed. Here, the null hypothesis is that two or more time series, each of which is I(1), are not cointegrated. So, to run this test, we specify the order of integration, d = 1, and then look at the result.

aTSA::coint.test(Deathts, VAXts, d = 1, nlag = 2, output = TRUE)
It seems that, at lag 2, the series are cointegrated at the 1% level in the specification without a trend.

If we want to compare these results to the previous procedure, we need to augment that procedure by including the differencing and the lags, which were set at d = 1 and lag = 2, respectively. To make it even more fun, we can do this analysis in two ways:

  1. regressing Death on VAX
  2. regressing VAX on Death

The first one, of course, has our main interest.

VECM_EQ1 <- dynlm(d(Deathts) ~ L(d(VAXts), 1:2) + L(d(Deathts), 1:2) + L(VECM_ECT))
VECM_EQ2 <- dynlm(d(VAXts) ~ L(d(VAXts), 1:2) + L(d(Deathts), 1:2) + L(VECM_ECT))
The results for regressing Death on VAX
The results for regressing VAX on Death

Now, the main interest of both models is, and always will be, the residual part. Everything we have done so far has focused each and every time on the error part, and that is what we will do again with the coeftest function.

names(VECM_EQ1$coefficients) <- c("Intercept", "D_VAXts_l1", "D_VAXts_l2",
                                  "D_Deathts_l1", "D_Deathts_l2", "ect_l1")
names(VECM_EQ2$coefficients) <- names(VECM_EQ1$coefficients)
coeftest(VECM_EQ1, vcov. = NeweyWest(VECM_EQ1, prewhite = F, adjust = T))
coeftest(VECM_EQ2, vcov. = NeweyWest(VECM_EQ2, prewhite = F, adjust = T))

One can see from the output that the first lag of VAX and the second lag of Death influence Death, at least in the first model. In the second model, one can see that the first and second lag of VAX influence VAX, that Death is borderline significant for VAX, and that the error-correction term has an influence as well. From here, it really is not that clear that VAX and Death are cointegrated, and certainly not that a change in VAX will lead to a change in Death.

In the first model, the error-correction term is not significantly different from zero, but in the second it is. I am not sure yet how to best interpret this, besides perhaps that the relationship is not so much of VAX on Death but vice versa. Biologically that could make sense, as well as sociologically, although it would mean that the death rate as communicated directly influences the tendency to vaccinate, and the ability to do so.

Another way to assess cointegration is via the car package, which has a linearHypothesis function. In this form, one can test whether the lagged values are relevant or not, based on frequentist statistical testing. So, in the first part, I am testing whether the lagged values of VAX influence Death, by building a restricted model that takes them out. In the next model, we do the same, but then for the influence of the Death lags on VAX. Remember, each time I am building a restricted model.

car::linearHypothesis(VECM_EQ1,
                      hypothesis.matrix = c("D_VAXts_l1",
                                            "D_VAXts_l2"),
                      vcov. = sandwich)
car::linearHypothesis(VECM_EQ2,
                      hypothesis.matrix = c("D_Deathts_l1",
                                            "D_Deathts_l2"),
                      vcov. = sandwich)
The restricted model is not better, meaning that the lags of VAX should be included.
The restricted model is better, meaning that the lags of Death do not add much for VAX. One could already see that from the previous model summary.
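
For completeness, the same machinery can be pointed at the error-correction term itself; a quick sketch using the coefficient name assigned earlier:

car::linearHypothesis(VECM_EQ1,
                      hypothesis.matrix = "ect_l1",   # H0: the error-correction term is zero
                      vcov. = sandwich)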

We could also test whether the error-correction term is zero or not, as sketched just above, but let's move on to the Granger test, which is often dubbed the Granger causality test. I have used this test before when trying to connect lockdown measures to important Covid KPIs. Here, I will set the order to 2, since that is the lag we have used all along.

grangertest(Deathts~VAXts, order = 2)
grangertest(VAXts~Deathts, order = 2)

The first test compares a model in which lags of VAX are included with one in which they are not. It seems that model 2 (the reduced model) is not better, so the lags of VAX do matter for Death. In the second test, we build a lagged and a reduced model again, and see that the lags of Death do not add much for VAX, meaning that the reduced model, containing only lags of VAX to forecast VAX, is sufficient. This is a finding in favor of VAX levels influencing Death levels rather than the other way around.

You see we have two tests, each comparing two models.

Another test for cointegration is the Johansen procedure for VAR models via ca.jo. This is not an easy test and involves ranks. Since we only have two univariate time-series, we can only consider two hypotheses about the cointegration rank: either there is no cointegration (r = 0) or there is at most one cointegrating relationship (r ≤ 1), and the trace statistic is compared to its critical values for each hypothesis in turn. Let's specify such a model, using the same number of lags as before.

jotest <- ca.jo(data.frame(VAXts, Deathts),
                type = "trace",
                K = 2,
                ecdet = "none",
                spec = "longrun")
summary(jotest)
plot(jotest)
The model clearly shows that there is no cointegration present.

Last, but not least, we can build a VAR, or vector autoregressive model, using the VAR function. To do so, we build a multivariate time-series object, difference it, and run the model at lag = 2. The VAR model has no preference for regressing Death on VAX or VAX on Death; it does both simultaneously, as you will see in the summary coming next.

combined<-ts.union(diff(VAXts), diff(Deathts))
plot(combined)
VAR_est <- VAR(y = combined, p = 2)
summary(VAR_est)
The multivariate ts object created.

And here is the summary of the VAR model, looking at VAX on Death and Death on VAX together, on the differenced time-series with two lags. The VAX equation does not offer much, but the Death equation does: the first lag of VAX is once again significant but very small in magnitude. Of course, the time-series have different magnitudes, but still, compared to the constant, the exogenous variables have little influence. The biggest influence, really, comes from the constant and thus from Death itself.
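
As a cross-check, the vars package can also run a Granger-type test directly on the fitted VAR; a small sketch (renaming the columns first so the cause argument stays simple, since the column names are otherwise taken from the ts.union call):

colnames(combined) <- c("dVAX", "dDeath")   # simple names for the differenced series
VAR_named <- VAR(y = combined, p = 2)
causality(VAR_named, cause = "dVAX")        # does dVAX Granger-cause dDeath?
causality(VAR_named, cause = "dDeath")      # and the other way around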

Alright, this concludes the blog. The main aim was not necessarily to connect VAX to Death, but to show how difficult a cointegration analysis can be. We used multiple methods and I could not find a steady pattern of cointegration, meaning a difference of the two series that takes a stationary form. Also, the many metrics applied and the vector modelling do not show a convincing connection, something that was already quite clear from the beginning when looking at the plot. That first curve is strange, and the rest fades off.

If there truly is a connection, the next round of vaccines should lead to another bump.

Hope you enjoyed this post, let me know if something is amiss!
