Predicting Covid-19 Mortality using Logistic Regression in SAS
The analysis of Covid-19 data has a natural appeal that I believe few can truly resist. Besides the topicality of the issue, there is quite a lot of data available. For those of us who have the skills to analyze an ongoing pandemic, it often feels obvious that those skills should be put to use, if only to make some sense of the world.
The Dutch government tries hard to provide open data, although, because of GDPR, what it provides only scratches the surface. Its biggest dataset contains around 2 million observations of people who tested positive: their age, sex, the province where they were tested, and whether they had to be admitted to the hospital and / or died.
What this dataset does not tell us is who was vaccinated, whether an observation is a repeat (many of us have tested multiple times), and whether the test was later confirmed via another route. So, we have to take the data at face value.
From an analysis point of view, the dataset contains ‘events’. An admission to the hospital or a death is an event. However, because the data do not record the time to each event, we cannot model it using Survival Regression techniques. Instead, we have to rely on good old Logistic Regression to model the events.
To model the influence of vaccines, I split the data up into quarters. Since the data started in 2020 and I pulled it from the database in September, I have seven quarters. This is NOT really a substitute for vaccine efficacy, but at the very least the probability tables should shift over time.
Let’s get started.
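As a starting point, here is a minimal sketch of how the quarter split described above could be derived in a DATA step. The dataset names (WORK.COVID_RAW, WORK.COVID), the variable names (test_date, hospital, died, quarter_idx, quarter_lbl), and the three toy rows are all assumptions made for illustration. They are not the layout of the actual Dutch file.

```sas
/* Toy rows standing in for the real export; all names are hypothetical. */
data work.covid_raw;
    input test_date :date9. hospital died;
    format test_date date9.;
    datalines;
15MAR2020 1 0
02NOV2020 0 0
20JUL2021 0 0
;
run;

data work.covid;
    set work.covid_raw;
    /* Quarters elapsed since Q1 2020, so 2020Q1 = 1 and 2021Q3 = 7 */
    quarter_idx = intck('qtr', '01JAN2020'd, test_date) + 1;
    /* Readable label, e.g. 2020Q1 */
    quarter_lbl = put(test_date, yyq6.);
run;

/* Quick check that labels and indices line up */
proc freq data=work.covid;
    tables quarter_lbl * quarter_idx / list;
run;
```

Counting quarter boundaries with INTCK keeps the quarter as a single ordered number, which is convenient to use as a CLASS variable in the models later on.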
Next up is a lot of code to plot the data. I am looking at new cases, hospitalizations, and deaths over time per province. The graphs should really speak for themselves.
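A rough sketch of one such plot, under stated assumptions: the patient-level data are simulated here, and the dataset names (WORK.COVID, WORK.WEEKLY), variable names, province selection, and event rates are invented purely so the code runs on its own. The idea is to aggregate to weekly counts per province and draw one line per province.

```sas
/* Simulated stand-in for the patient-level file; every name is hypothetical. */
data work.covid;
    call streaminit(2021);
    length province $16;
    format test_date date9.;
    do i = 1 to 20000;
        test_date = '01MAR2020'd + floor(rand('uniform') * 550);
        province  = choosec(ceil(rand('uniform') * 3), 'Limburg', 'Utrecht', 'Zeeland');
        hospital  = rand('bernoulli', 0.03);
        died      = rand('bernoulli', 0.01);
        output;
    end;
    drop i;
run;

/* Weekly totals per province: row counts are new cases, sums are event counts */
proc sql;
    create table work.weekly as
    select province,
           intnx('week', test_date, 0, 'b') as week format=date9.,
           count(*)      as new_cases,
           sum(hospital) as hospitalizations,
           sum(died)     as deaths
    from work.covid
    group by province, calculated week
    order by province, week;
quit;

/* One line per province; swap new_cases for hospitalizations or deaths */
proc sgplot data=work.weekly;
    series x=week y=new_cases / group=province;
    xaxis label="Week";
    yaxis label="New positive tests";
run;
```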
Personally, for this kind of data, I really like heatmaps.
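For illustration, a heatmap of this kind could be drawn with the HEATMAPPARM statement of PROC SGPLOT, which expects pre-summarized values. The monthly counts below are simulated, and the names WORK.MONTHLY, month_lbl, province, and deaths are assumptions chosen for the sketch.

```sas
/* Simulated monthly death counts per province; names and numbers are made up. */
data work.monthly;
    call streaminit(42);
    length province $16 month_lbl $7;
    do province = 'Limburg', 'Utrecht', 'Zeeland';
        do m = 0 to 17;
            /* Character label such as 2020-03, one per month from March 2020 */
            month_lbl = put(intnx('month', '01MAR2020'd, m), yymmd7.);
            deaths    = rand('poisson', 20);
            output;
        end;
    end;
    drop m;
run;

/* One colored cell per province and month */
proc sgplot data=work.monthly;
    heatmapparm x=month_lbl y=province colorresponse=deaths /
        colormodel=(white orange red) outline;
    xaxis label="Month";
    yaxis label="Province";
run;
```

Using a character month label keeps every cell the same width, which a numeric date axis would not guarantee with months of unequal length.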
Finally, the analysis part. I kept it straightforward. Considering the data, and the number of events, I feel very safe about the capability of each model to create probability tables and plot probability curves. The risk of sparsity and thus model non-convergence was basically non-existent.
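A minimal sketch of what the two models could look like in PROC LOGISTIC. Everything about the data is an assumption: the records are simulated, and the variable names (hospital, died, age_group, sex, quarter), the age bands, and the effect sizes are invented so the example runs on its own. It shows the general shape of the analysis rather than the exact specification used here.

```sas
/* Simulated stand-in data: names, sizes and effect sizes are all made up. */
data work.covid;
    call streaminit(7);
    length sex $1 age_group $5;
    do i = 1 to 50000;
        quarter   = ceil(rand('uniform') * 7);
        sex       = choosec(ceil(rand('uniform') * 2), 'F', 'M');
        age_idx   = ceil(rand('uniform') * 4);
        age_group = choosec(age_idx, '0-39', '40-59', '60-79', '80+');
        /* Admission risk rises with age and is a bit higher for men */
        eta_h     = -6 + 1.2 * age_idx + 0.3 * (sex = 'M') - 0.1 * quarter;
        hospital  = rand('bernoulli', 1 / (1 + exp(-eta_h)));
        /* Death risk rises with age and with hospital admission */
        eta_d     = -9 + 1.5 * age_idx + 2 * hospital + 0.3 * (sex = 'M');
        died      = rand('bernoulli', 1 / (1 + exp(-eta_d)));
        output;
    end;
    drop i age_idx eta_h eta_d;
run;

ods graphics on;

/* Model 1: hospital admission given age, sex and quarter */
proc logistic data=work.covid;
    class sex age_group quarter / param=ref;
    model hospital(event='1') = age_group sex quarter;
    /* Predicted probability by age group, one line per sex */
    effectplot interaction(x=age_group sliceby=sex);
    /* Fitted probabilities, usable as a probability table */
    output out=work.pred_hosp p=p_hospital;
run;

/* Model 2: death given hospital admission, age, sex and quarter */
proc logistic data=work.covid plots(only)=roc;
    class hospital(ref='0') sex age_group quarter / param=ref;
    model died(event='1') = hospital age_group sex quarter;
    output out=work.pred_died p=p_died;
run;

/* Probability table: one fitted value per covariate pattern */
proc means data=work.pred_died mean maxdec=4;
    class quarter hospital age_group sex;
    var p_died;
run;
```

Because both models contain main effects only, every combination of quarter, age group, sex, and admission status has a single fitted probability, so the PROC MEANS step simply lists those values as a table.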
First, the results showing the predicted probabilities for hospital admission.
Next, the predicted probabilities for death, based on hospital admission, age, sex, and time. The ROC curve almost seems too perfect, but that is just a function of the number of events.
So, in a nutshell, Logistic Regression is for sure not out of the picture when it comes to modelling probabilities.