Statistical Modeling of COVID-19 Vaccination Coverage and Effect in Lund

Ali Bakly
Mar 12, 2022


Introduction

Vaccination against Covid-19 is going well in Sweden: as of 12 November 2021, 81.5% of everyone over the age of 16 had received at least two doses (Public Health Agency of Sweden, 2021). However, vaccination coverage differs considerably between regions. This report builds a model for estimating the vaccination coverage in a municipality from data on various societal factors (explanatory variables). In addition, we analyze the effect of vaccination by looking at the reduction in the number of patients at the predicted vaccination coverage for the given municipality. Specifically, this report addresses the following questions:

  • Which societal factors play a role in vaccination coverage?
  • Given data of societal factors and vaccination coverage of each municipality, what is a suitable model for vaccination coverage?
  • What will be the prediction for vaccination coverage in Lund with this model?
  • How high is the risk reduction for disease in Lund?

The Dataset

The dataset consists of 17 columns: the name of the municipality, the vaccination coverage (0–1), and 15 societal factors. The factors are median income, unemployment, median age, pensioners, immigrants, rural (0 = no, 1 = yes), the election results for each of the political parties M, C, L, KD, S, V, MP, and SD, and the turnout in the last election.

Figure 1: Display of dataset.

Method and Theory

This article does the statistical analysis "from scratch" rather than relying on an existing library (apart from a few MATLAB functions), to help the reader understand the fundamentals.

Multiple linear regression is used to build a model from the explanatory variables that seem to have an impact. The model can then be validated through residual analysis. The vaccination coverage p lies in the interval 0 ≤ p ≤ 1, whereas a linear model can take any real value; therefore, the data is transformed with the logit transform:

x = logit(p) = ln(p / (1 − p))    (1)
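
As an illustration, the logit transform x = ln(p / (1 − p)) and its inverse can be sketched in a few lines of Python. This is a minimal sketch rather than the article's MATLAB code; the function names are chosen here for illustration. Note that the transform is only defined for 0 < p < 1, so a coverage of exactly 0 or 1 would have to be handled separately.

```python
import math


def logit(p: float) -> float:
    """Map a proportion strictly between 0 and 1 to the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Map a real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# National two-dose coverage quoted in the introduction.
coverage = 0.815
x = logit(coverage)
print(f"logit({coverage}) = {x:.3f}, back-transformed: {inv_logit(x):.3f}")
```

Because the inverse transform always returns a value between 0 and 1, any prediction made on the logit scale maps back to a valid coverage.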

To analyze the risk reduction we use the efficacy value p* = 0.96, with a 95% confidence interval [0.86, 0.99], reported for the Pfizer-BioNTech vaccine against "Severe disease: Delta" after two doses, based on studies from the United Kingdom (IHME, 2021). With the logit transform we obtain x* = 3.18 and the confidence interval [1.82, 4.86], or in other words x* ~ N(3.18, 0.73). If e stands for efficacy, S for sick, V for vaccinated, and V̅ for unvaccinated, then the efficacy is given by

e = 1 − P(S|V) / P(S|V̅)

and, with p = P(V) denoting the vaccination coverage, one can derive from the law of total probability

P(S) = P(S|V) · p + P(S|V̅) · (1 − p) = P(S|V̅) · (1 − e · p).

The risk reduction is therefore given by

RR = P(S) / P(S|V̅) = 1 − e · p    (2)
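
The risk reduction can be checked numerically. With efficacy e = 1 − P(S|V) / P(S|V̅) and coverage p = P(V), the law of total probability gives P(S) / P(S|V̅) = 1 − e · p. The sketch below verifies this identity in Python; the probabilities are hypothetical values chosen for illustration and are not taken from the dataset.

```python
def efficacy(p_sick_vaccinated: float, p_sick_unvaccinated: float) -> float:
    """Efficacy e = 1 - P(S|V) / P(S|not V)."""
    return 1.0 - p_sick_vaccinated / p_sick_unvaccinated


def risk_reduction(e: float, p: float) -> float:
    """Ratio P(S) / P(S|not V) = 1 - e * p for efficacy e and coverage p."""
    return 1.0 - e * p


# Hypothetical numbers, for illustration only.
p_s_unvacc = 0.10   # P(S | not V)
p_s_vacc = 0.004    # P(S | V)
coverage = 0.80     # P(V)

e = efficacy(p_s_vacc, p_s_unvacc)

# Law of total probability: P(S) = P(S|V) P(V) + P(S|not V) (1 - P(V)).
p_s = p_s_vacc * coverage + p_s_unvacc * (1.0 - coverage)
direct_ratio = p_s / p_s_unvacc

print(f"e = {e:.2f}")
print(f"direct ratio = {direct_ratio:.4f}, 1 - e*p = {risk_reduction(e, coverage):.4f}")
```

Both routes give the same ratio, so the risk reduction depends on the unknown baseline risk P(S|V̅) only through the efficacy.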

Analysis

Model

The model for multiple linear regression, in matrix form, is given by

Y = Xβ + E

where, in this particular case, Y (n x 1) is the data series of the vaccination coverage, X (n x (p+1)) holds the data series for each explanatory variable together with a column of ones for the intercept, β ((p+1) x 1) are the coefficients, and E (n x 1) are the deviations. Here (n x 1) denotes the dimensions of the Y matrix, and so on; n is the number of municipalities and p the number of explanatory variables. In this analysis, we work in particular with logit(Y) = Xβ + E. As shown in figure 1, we have access to 15 explanatory variables, but not all of them necessarily affect vaccination coverage. We plot each explanatory variable against the corresponding vaccination coverage in figure 2 to get an idea of which variables might be of interest.

Figure 2: Plot of vaccination coverage against the different explanatory variables. Left: probability, right: logit.

From figure 2 we see, for example, an apparent correlation between immigrants and vaccination coverage. To select which variables to use in the model, we use stepwise regression, a somewhat controversial method that nonetheless fits our needs. In stepwise regression, explanatory variables whose β-coefficients are not significantly different from zero are discarded. Variables that are good to use are usually highly correlated with Y, for instance, immigrants. Stepwise regression can be done with MATLAB's stepwise(). One can also use the correlation matrix to identify multicollinearity, which can be problematic for linear regression. After stepwise regression, the significant explanatory variables with their corresponding β-coefficients are obtained by solving the system of equations beta = X \ logitY in MATLAB,

not forgetting the intercept β₀ = 3.2053. To further examine whether the obtained variables provide a reasonable model, the residuals logit(Y) − Xβ are analyzed. If the residuals are normally distributed around zero with no visible structure, the model is reasonable. Residuals are plotted both "as they come" and against each explanatory variable in figure 3 below.
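
To make the fitting step concrete, here is a small Python sketch of a least-squares fit and its residuals. It is not the article's code: MATLAB's backslash operator solves the least-squares problem with a more numerically robust factorization, whereas this sketch solves the normal equations (XᵀX)β = XᵀY with plain Gaussian elimination, which is adequate for a small, well-conditioned example. The toy data and coefficients are made up for illustration.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect multicollinearity?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(X, y):
    """Return beta minimizing ||y - X beta||^2 via the normal equations."""
    Xt = transpose(X)
    XtX = [[sum(a * b for a, b in zip(r1, r2)) for r2 in Xt] for r1 in Xt]
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


def predict(row, beta):
    """Linear predictor for one row of the design matrix."""
    return sum(c * v for c, v in zip(beta, row))


# Toy data (hypothetical): intercept column plus two explanatory variables.
raw = [(0.1, 40.0), (0.3, 42.0), (0.2, 45.0), (0.5, 38.0), (0.4, 44.0)]
X = [[1.0, a, b] for a, b in raw]
true_beta = [3.0, -2.0, 0.01]
logit_y = [predict(row, true_beta) for row in X]  # noise-free for the demo

beta = least_squares(X, logit_y)
residuals = [yi - predict(row, beta) for yi, row in zip(logit_y, X)]
print([round(b, 4) for b in beta])
```

With real data the residuals would not vanish; they are the quantities one plots against each explanatory variable and in a normal probability plot.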

Figure 3: Residuals in a coordinate system.

The residuals seem to be randomly distributed around zero for all plots. To further ensure normality of residuals, we plot a so-called normal probability plot in figure 4.

Figure 4: Normal probability plot of the residuals.

We see that the residuals fit a normal distribution quite well.

Prediction for vaccination coverage in Lund

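A point estimate of the logit-transformed vaccination coverage is given by

x̂₀ = x₀β̂    (3)

where x₀ is the row vector of explanatory variables for Lund, with a leading 1 for the intercept. Note that x is the logit-transformed vaccination coverage from equation 1, and by inverting the transform we obtain the predicted vaccination coverage as a probability:

p̂ = e^x̂₀ / (1 + e^x̂₀) = 0.8254

The variance of the prediction and the sample variance are needed to form an appropriate prediction interval:

V = s² (1 + x₀ (XᵀX)⁻¹ x₀ᵀ),   s² = Q₀ / (n − p − 1)    (4)

where Q₀ is given by the sum of the squared residuals. Furthermore, one can argue for the model's validity by checking that s is small enough. A 95% prediction interval on the logit scale is given by:

x̂₀ ± t₀.₀₂₅(n − p − 1) · s · √(1 + x₀ (XᵀX)⁻¹ x₀ᵀ)

With the data for Lund and the inverse transformation, we finally find the prediction interval

[0.7706, 0.8693]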

The interval seems reasonable: 217 of the 289 municipalities have a vaccination coverage within Lund's interval.
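
A prediction interval of this kind is formed on the logit scale and then mapped back to a probability. The Python sketch below illustrates that last step only. It makes two assumptions that are not from the article: the prediction standard deviation pred_sd is a hypothetical value, and a standard normal quantile replaces the t quantile, which is a close approximation when the number of municipalities is large.

```python
import math
from statistics import NormalDist


def inv_logit(x: float) -> float:
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prediction_interval(x_hat: float, pred_sd: float, level: float = 0.95):
    """Symmetric interval on the logit scale, back-transformed to probabilities.

    Uses a normal quantile as a large-sample stand-in for the t quantile.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower, upper = x_hat - z * pred_sd, x_hat + z * pred_sd
    return inv_logit(lower), inv_logit(upper)


# Logit of the article's point prediction for Lund (82.54%).
x_hat = math.log(0.8254 / (1.0 - 0.8254))
pred_sd = 0.17  # hypothetical prediction standard deviation on the logit scale

low, high = prediction_interval(x_hat, pred_sd)
print(f"point: {inv_logit(x_hat):.4f}, interval: [{low:.4f}, {high:.4f}]")
```

The interval is symmetric around the point estimate on the logit scale but not on the probability scale, which is why the lower bound lies further from the point prediction than the upper bound.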

Effect of vaccination

To estimate the risk reduction (equation 2) in Lund, we simulate 10⁵ logit-transformed vaccination coverages from a normal distribution with the expected value from equation 3 and the variance given in equation 4. The logit-transformed efficacy is simulated the same number of times from N(3.18, 0.73). Via the inverse transformation and equation 2, we obtain 10⁵ possible risk reductions for Lund. The mean of the simulated risk reductions is 0.2172; therefore, our point estimate is RR* = 0.2172. If we plot the risk reductions in a histogram, the simulated values might seem normally distributed to the untrained eye. However, a normal probability plot clearly shows that this is not the case.

Figure 5: Simulated risk reductions.

In fact, a normal distribution should not be expected: equation 2 multiplies two random variables, each a nonlinear transform of a normally distributed variable, and such a product is in general not normal. Hence we cannot use the otherwise applicable normal-approximation formula (point estimate ± 1.96 standard deviations) for confidence intervals. Instead, we let the 2.5% and 97.5% quantiles of the simulated values directly be the bounds of the interval. A 95% confidence interval for the risk reduction in Lund is given by

[0.1561, 0.3082]

and is marked blue in figure 5.
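
The simulation and the quantile-based interval can be sketched in Python as follows. This is a rough illustration, not the article's MATLAB code, and it rests on assumptions: the value 0.73 is treated as a standard deviation on the logit scale, the coverage standard deviation of 0.17 is hypothetical, and the logit-scale mean 1.553 is the logit of the predicted coverage 82.54%. The numbers it produces will therefore not match the article's exactly.

```python
import math
import random
import statistics


def inv_logit(x: float) -> float:
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_risk_reduction(x_mean, x_sd, e_mean, e_sd, n=100_000, seed=1):
    """Draw coverage and efficacy on the logit scale and return 1 - e * p for each draw."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        p = inv_logit(rng.gauss(x_mean, x_sd))  # simulated vaccination coverage
        e = inv_logit(rng.gauss(e_mean, e_sd))  # simulated efficacy
        draws.append(1.0 - e * p)
    return draws


def percentile_interval(draws, level=0.95):
    """Take the empirical quantiles directly as interval bounds."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper


draws = simulate_risk_reduction(x_mean=1.553, x_sd=0.17, e_mean=3.18, e_sd=0.73)
point = statistics.fmean(draws)
lower, upper = percentile_interval(draws)
print(f"mean: {point:.4f}, 95% interval: [{lower:.4f}, {upper:.4f}]")
```

Taking the empirical quantiles avoids any assumption about the shape of the distribution, which matters here because the simulated values are skewed.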

So what does the risk reduction tell us? With the point estimate of the risk reduction, the probability of severe disease is

P(S) = RR* · P(S|V̅) = 0.2172 · P(S|V̅)

Before the vaccination, the probability of severe disease for a random individual was P(S|V̅). The probability of severe disease for a random individual at the predicted vaccination coverage is instead 0.2172 ⋅ P(S|V̅). That is, the risk of severe disease is reduced by 78.3%. A common logical fallacy is illustrated by a Twitter post from Republican U.S. Representative Marjorie Taylor Greene:

40% of all covid deaths last week were vaccinated. Stop vaccine mandates & forced masking (Loiaconi, 2021).

This statement does not account for the share of vaccinated people in the community she is referring to. A simple thought experiment: imagine our community consists of 90 vaccinated individuals and 10 unvaccinated. Four vaccinated individuals die of the disease, while six unvaccinated individuals die. Indeed, 40% of the covid deaths were vaccinated in this example, but evidently a death rate of 4/90 is a lot better than 6/10. In fact, it is rather likely that many lives have been saved with a risk reduction in the proposed interval.
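
The arithmetic of the thought experiment, with four deaths among the 90 vaccinated and six among the 10 unvaccinated, can be spelled out in a few lines of Python. The variable names are chosen here for illustration.

```python
vaccinated, unvaccinated = 90, 10
deaths_vaccinated, deaths_unvaccinated = 4, 6

# Share of all deaths that occurred among the vaccinated.
share_of_deaths = deaths_vaccinated / (deaths_vaccinated + deaths_unvaccinated)

# Death rate within each group, which is what actually measures individual risk.
rate_vaccinated = deaths_vaccinated / vaccinated
rate_unvaccinated = deaths_unvaccinated / unvaccinated

relative_risk = rate_vaccinated / rate_unvaccinated

print(f"share of deaths among vaccinated: {share_of_deaths:.0%}")
print(f"death rate, vaccinated:   {rate_vaccinated:.3f}")
print(f"death rate, unvaccinated: {rate_unvaccinated:.3f}")
print(f"relative risk (vaccinated vs. unvaccinated): {relative_risk:.3f}")
```

The share of deaths says how the deaths are split between the groups; the per-group rates say how dangerous the disease is for a member of each group, and only the latter is relevant for judging the vaccine.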

Conclusion

We used data from 289 Swedish municipalities to obtain an estimate of the vaccination coverage in Lund, Sweden. The data consisted of 15 possible explanatory variables for each municipality. However, the vaccination coverage was modeled with multiple linear regression using only eight explanatory variables (unemployment, median age, pensioners, immigrants, L, V, SD, turnout). Validation of the model consisted of analyzing residuals, collinearity, and the estimated standard deviation. The estimated vaccination coverage and the given efficacy were used to simulate the risk reduction. With the help of calculations in MATLAB, we found the predicted vaccination coverage in Lund: 82.54% with a 95% prediction interval [77.06% ; 86.93%]. A point estimate for the risk reduction in Lund was found to be 0.2172 with a 95% confidence interval [0.1561 ; 0.3082].

References

[1]: Blom, Gunnar; Enger, Jan; Englund, Gunnar; Grandell, Jan; Holst, Lars. 2017. Sannolikhetsteori och statistikteori med tillämpningar. 7th edn. Lund: Studentlitteratur AB.

[2]: Folkhälsomyndigheten. 2021. Statistik för vaccination mot covid-19. Solna: Folkhälsomyndigheten. https://www.folkhalsomyndigheten.se/folkhalsorapportering-statistik/statistikdatabaser-och-visualisering/vaccinationsstatistik/statistik-for-vaccination-mot-covid-19/ (2021–11–12).

[3]: Institute for Health Metrics and Evaluation. 2021. COVID-19 vaccine efficacy summary. Seattle: IHME. http://www.healthdata.org/covid/covid-19-vaccine-efficacy-summary (2021–11–25).

[4]: Loiaconi, Stephen. 2021. More vaccinated people are dying of COVID-19. Here’s what that means. NBC Montana. 28 October. https://nbcmontana.com/news/coronavirus/rise-in-breakthrough-deaths-should-not-cast-doubt-on-vaccines-experts-say (2021–12–01).

[5]: MATLAB code: https://github.com/AliBakly/Covid-19/blob/main/Covid19.m

[6]: Data: https://github.com/AliBakly/Covid-19/blob/main/dataEng.csv
