Predicting vaccine uptake with publicly available data.

Datamarinier
Aug 11, 2022

Tim Sanderse did this research as part of his master's in Data Science and Society at Tilburg University.

Picture by CDC

I tried to predict vaccine coverage rates in sub-national areas based on socio-demographic variables using machine learning models.

Vaccination coverage data from several countries show differences between sub-national areas. These differences might be explained by the socio-demographic profile of each area's inhabitants. Understanding how socio-demographic variables relate to vaccine hesitancy could give policy makers useful information.

In this blog post I share how I predicted the vaccine coverage rate of every municipality in Flanders, Belgium, using machine learning models. The best model had an average error of 0.93%.

The machine learning models used are Random Forest, Lasso Regression and K-Nearest Neighbors.

Feature Selection

The procedure starts with feature selection. I collected the relevant socio-demographic variables from the open data portals of Statbel and provincie in cijfers, both governmental organizations in Belgium. The variables were aggregated, for instance by taking the average education level instead of including every level separately. This resulted in the following set of socio-demographic variables.

Features

As socio-demographic variables tend to be highly correlated, the dataset was tested for multicollinearity with a Farrar-Glauber test. The literature recommends removing every variable with a variance inflation factor (VIF) above 10, as this implies high multicollinearity, which can bias your models. A Pearson’s R correlation matrix was computed to identify which variables should be removed to eliminate multicollinearity. After this process, the Boruta algorithm classified the remaining variables as important or unimportant in relation to the outcome variable, vaccine coverage rate. Variables classified as unimportant were then excluded from the dataset.
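The VIF threshold check can be sketched in base R. This is only an illustration on simulated data, not the analysis from the study: the variable names and values are made up, and the VIF is computed directly from the inverse correlation matrix rather than through a Farrar-Glauber test package.

```r
# Simulated stand-in for the socio-demographic table (not the real data).
set.seed(42)
n <- 200
income    <- rnorm(n)
education <- 0.95 * income + rnorm(n, sd = 0.2)  # deliberately collinear with income
density   <- rnorm(n)
X <- data.frame(income, education, density)

# The VIF of each predictor equals the matching diagonal element of the
# inverse correlation matrix, i.e. 1 / (1 - R^2) when that predictor is
# regressed on all the others.
vif <- diag(solve(cor(X)))
print(round(vif, 2))

# Flag predictors above the threshold of 10 mentioned in the text.
high_vif <- names(vif)[vif > 10]
print(high_vif)
```

In this toy example the two collinear variables are flagged while the independent one stays close to a VIF of 1.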

Pearson’s R correlation matrix

The correlation matrix showed that age was highly correlated with the proportion of elders above 65, that the share of inhabitants without a Belgian passport was highly correlated with not being a member of social security, and that income was highly correlated with education. The variable importance from the random forest and the Boruta algorithm was used to decide which variable of each pair to remove. As not being a member of social security, the proportion of elders above 65 and education level had the lowest variable importance, these three variables were removed from the dataset. The Farrar-Glauber test was then performed again and showed no remaining high multicollinearity: every variable scored below the VIF threshold.
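Reading highly correlated pairs off a correlation matrix can also be done programmatically. The sketch below uses simulated data with hypothetical column names, and the cut-off of 0.8 is my own assumption, since the post does not state which value counted as "highly correlated".

```r
# Simulated data: share_65 is built to correlate strongly with age.
set.seed(1)
n <- 300
age      <- rnorm(n, mean = 42, sd = 3)
share_65 <- 0.9 * as.numeric(scale(age)) + rnorm(n, sd = 0.3)
income   <- rnorm(n)
toy <- data.frame(age, share_65, income)

# Pearson correlation matrix; blank out the diagonal and upper triangle
# so every pair of variables is reported only once.
r <- cor(toy, method = "pearson")
r[upper.tri(r, diag = TRUE)] <- NA

# Collect the pairs whose absolute correlation exceeds the assumed cut-off.
idx <- which(abs(r) > 0.8, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(high_pairs)
```

For each reported pair, one variable would then be dropped, for example the one with the lower variable importance.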

Boruta Algorithm

The last feature selection method I used was the Boruta algorithm. Boruta classified the variables CO2 emission, life expectancy, traffic accidents and rest homes elders as unimportant for the outcome variable, vaccine coverage, so they were removed from the dataset. Removing them decreased the error rates of all the models. The machine learning models were also trained and tested on only the most important variables according to Boruta. Keeping only the features that scored above a threshold of 5% importance resulted in higher errors, showing that it was better to keep the entire selection.
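For readers who have not used Boruta before, a minimal sketch with the Boruta package from CRAN could look like this. It assumes the package is installed and runs on simulated data with hypothetical column names; the settings are illustrative and not necessarily those used in the study.

```r
library(Boruta)  # CRAN package, assumed to be installed

# Simulated data: two informative predictors and two pure-noise predictors.
set.seed(123)
n <- 200
toy <- data.frame(
  income  = rnorm(n),
  density = rnorm(n),
  noise_a = rnorm(n),
  noise_b = rnorm(n)
)
toy$coverage <- 92 + 2 * toy$income - 1.5 * toy$density + rnorm(n, sd = 0.5)

# Boruta compares each predictor's importance with that of shuffled
# "shadow" copies and labels it Confirmed, Tentative or Rejected.
boruta_fit <- Boruta(coverage ~ ., data = toy, doTrace = 0, maxRuns = 100)
print(attStats(boruta_fit)[, c("meanImp", "decision")])

# Names of the predictors that Boruta confirmed as important.
confirmed <- getSelectedAttributes(boruta_fit, withTentative = FALSE)
print(confirmed)
```

Variables that end up as Rejected are the candidates for removal, as was done with the four variables above.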

Model Training:

The data was split into a training set containing 70% of the data and a test set containing 30%. The first model was the random forest, for which I tuned mtry, the number of trees and the maximum number of nodes. Cross-validation showed that the best results were obtained with mtry = 12, ntrees = 500 and maxnodes = 22. The second model was the lasso regression. Its regularization parameter was tuned using cross-validation and gave the best results at the default value of 1. The last model was KNN. The parameter K was tuned using a grid search, resulting in K = 24.
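To make the split and the grid search for K concrete, here is a base R sketch on simulated data. The hand-written `knn_predict` helper, the 5 folds, the grid of 1 to 40 and the column names are all assumptions for illustration; the post does not say which KNN implementation or fold count was used.

```r
# Simulated data with the same number of rows as the study (values are made up).
set.seed(123)
n <- 283
toy <- data.frame(income = rnorm(n), density = rnorm(n))
toy$coverage <- 92 + 2 * toy$income - 1.5 * toy$density + rnorm(n)

# 70/30 train/test split.
train_idx <- sample(n, size = floor(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Plain KNN regression: average the outcome of the k nearest training rows
# (Euclidean distance; the toy predictors are already on the same scale).
knn_predict <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(row) {
    d <- sqrt(colSums((t(x_train) - row)^2))
    mean(y_train[order(d)[seq_len(k)]])
  })
}

x_train <- as.matrix(train[, c("income", "density")])
y_train <- train$coverage
x_test  <- as.matrix(test[, c("income", "density")])

# Grid search over K with 5-fold cross-validation on the training set.
folds  <- sample(rep(1:5, length.out = nrow(x_train)))
k_grid <- 1:40
cv_mae <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    held <- folds == f
    pred <- knn_predict(x_train[!held, , drop = FALSE], y_train[!held],
                        x_train[held, , drop = FALSE], k)
    mean(abs(pred - y_train[held]))
  }))
})
best_k <- k_grid[which.min(cv_mae)]

# Evaluate the selected K once on the held-out test set.
test_pred <- knn_predict(x_train, y_train, x_test, best_k)
test_mae  <- mean(abs(test_pred - test$coverage))
print(c(best_k = best_k, test_mae = round(test_mae, 3)))
```

The test set is only touched once, after K has been chosen, so the reported test error is not influenced by the tuning.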

Model Results:

The different models were trained, validated, and tested on the data. Before reading the results, it is essential to understand the scale of the outcome variable, vaccination coverage. Its distribution is displayed in the table below (this data refers to 15 September 2021 and includes 283 municipalities in Flanders after the case deletion).

As the table shows, the scale is small: the standard deviation is 3.84, half of the data lies between 90% and 94%, and the difference between the minimum and maximum is 22 percentage points. Because of this narrow range, error rates can be expected to be low in absolute terms. Fitting a random forest, k-nearest neighbors and a lasso regression leads to the following training and test set outcomes.

Looking at the results of the models on the test set, the conclusion is that socio-demographic variables can predict the COVID-19 vaccination coverage of sub-national areas very well. Although the vaccination coverage scale is small, the random forest predicted the vaccination coverage in Flanders with an average difference of 0.93% from the true value. This shows that socio-demographic data could help locate areas with potentially low vaccination coverage, for future pandemics or for extra COVID-19 vaccination programs.

Comparing the error rates on the training and test sets, all the models perform slightly worse on the test set than on the training set. This could indicate that the models are overfitting. Several methods were applied to counter this. The parameters of the models were tuned, and the best values were selected using cross-validation. Reducing the feature set during feature selection should also decrease overfitting. As a final check, the models were run on several train/test splits. This showed minimal differences in outcomes, suggesting that the models would perform about the same on other sets of unseen data. Given these measures and the accurate results on unseen data, no further action was taken.
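An "average difference" like this can be read as a mean absolute error (MAE), which is my assumption about the metric. Because the coverage scale is narrow, it also helps to compare the MAE with a naive baseline that predicts the mean coverage for every municipality. The numbers below are invented for illustration.

```r
# Mean absolute error: the average absolute gap between truth and prediction.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Invented coverage percentages for four municipalities.
actual    <- c(91.0, 93.5, 88.0, 95.0)
predicted <- c(92.0, 93.0, 89.5, 94.0)

model_mae    <- mae(actual, predicted)
baseline_mae <- mae(actual, rep(mean(actual), length(actual)))

print(c(model = model_mae, baseline = baseline_mae))
```

A model is only informative when its MAE is clearly below that of the baseline.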
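The check on several train/test splits can be sketched as follows. The data is simulated, and a linear model stands in for the actual models to keep the example short, so the numbers say nothing about the study itself.

```r
# Simulated data with the same number of rows as the study (values are made up).
set.seed(2021)
n <- 283
toy <- data.frame(income = rnorm(n), density = rnorm(n))
toy$coverage <- 92 + 2 * toy$income - 1.5 * toy$density + rnorm(n)

# Fit on a random 70% split and return the training and test MAE.
split_mae <- function(data, train_frac = 0.7) {
  idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit <- lm(coverage ~ ., data = data[idx, ])
  test_pred <- predict(fit, newdata = data[-idx, ])
  c(train = mean(abs(residuals(fit))),
    test  = mean(abs(test_pred - data$coverage[-idx])))
}

# Repeat the split 20 times and summarise how much the errors vary.
results <- t(replicate(20, split_mae(toy)))
print(round(colMeans(results), 3))      # average train and test MAE
print(round(apply(results, 2, sd), 3))  # spread across the 20 splits
```

A small spread of the test error across splits suggests that the result does not depend on one lucky split.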

Results Interesting for Policy-Making:

This study provides information relevant to the data science sector, but its results are also useful for policy-making aimed at improving COVID-19 vaccine coverage. The table below shows how an increase in each variable is reflected in the vaccine coverage. It can serve as a potential risk model for sub-national areas in a possible future pandemic.

This table shows that a sub-national area would probably have low vaccine uptake when: the average age is low, incomes are low, the number of people per household is high, education level is low, the unemployment rate is high, criminal activity is high, debts are high, house prices are low, population density is high, the share of waste is high, and the general practitioner and dentist receive few visits. Furthermore, coverage will also be lower when there are many people without a Belgian passport, few rest homes, more females than males, many new inhabitants, few social tenement houses, a high building degree, high electricity use, a low rate of agricultural emissions and a small proportion of people with a handicap or a chronic disease.

Another vital aspect for policymakers is that the relevance of socio-demographic variables varies. The variable importance of our best-performing machine learning model (random forest) is shown in the table below. The higher the importance, the higher the impact of the variable on vaccination coverage.

It is important to keep in mind that we are predicting vaccine uptake on a municipal level, not an individual level. For instance, age seems to have a low impact in this analysis, but may be more important on an individual level.
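Variable importance of a random forest can be extracted as in the sketch below, which assumes the randomForest package from CRAN is installed. The data is simulated with hypothetical column names. The values for `ntree` and `maxnodes` are borrowed from the tuning results mentioned earlier, while `mtry` is set to 2 only because the toy data has three predictors.

```r
library(randomForest)  # CRAN package, assumed to be installed

# Simulated data: income matters most, density a little, noise not at all.
set.seed(123)
n <- 283
toy <- data.frame(income = rnorm(n), density = rnorm(n), noise = rnorm(n))
toy$coverage <- 92 + 2 * toy$income - 0.5 * toy$density + rnorm(n, sd = 0.5)

# importance = TRUE asks the forest to compute permutation importance.
rf <- randomForest(coverage ~ ., data = toy,
                   ntree = 500, mtry = 2, maxnodes = 22,
                   importance = TRUE)

# type = 1 returns %IncMSE: the increase in out-of-bag error when a
# variable's values are permuted.
imp <- importance(rf, type = 1)
print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
```

The ranking describes how much the model relies on each variable, not whether the relationship is causal.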

Concluding Remarks:

With future vaccination drives against COVID-19 or other diseases being planned, predicting vaccination uptake can be an invaluable tool for policymakers and healthcare workers. This blog post has attempted to predict vaccine uptake with publicly available data. The best model was able to predict uptake with an average error of 0.93%.

For those interested in working with this data, I include the dataset and the code to replicate the correlation matrix to get you started.

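library(dplyr)   # provides %>% and select()
library(GGally)  # provides ggcorr()
library(readr)   # provides read_csv2()

set.seed(123)

# read_csv2() reads semicolon-separated files with decimal commas
dataset <- read_csv2("https://storage.googleapis.com/public_dm/data.csv") %>%
  select(-municipalities)  # drop the municipalities column before correlating

# Correlation matrix
ggcorr(dataset, method = c("everything", "pearson"),
       size = 3,
       layout.exp = 3)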
