Predicting Covid-19 Mortality using Logistic Regression in SAS

Dr. Marc Jacobs
Oct 12, 2021

The analysis of Covid-19 data has a natural appeal that I believe few can truly resist. Besides the topicality of the issue, there is quite a lot of data available. For those of us who have the skills to analyze an ongoing pandemic, it often seems obvious that those skills should be put to use, if only to make some sense of the world.

The Dutch government tries hard to provide open data, although what is provided only scratches the surface due to GDPR. Its biggest dataset contains around 2 million observations of people who tested positive: their age, sex, the province where they were tested, and whether they were admitted to the hospital and / or died.

What is unclear in this dataset is who was vaccinated, whether an observation is a repeat (many of us have tested multiple times), and whether the test was later confirmed via another route. So, we have to take the data at face value.

From an analysis point of view, the dataset contains ‘events’: an admission to the hospital or a death is an event. However, because we do not have time-to-event information, we cannot model these events using Survival Regression techniques. Instead, we have to rely on good old Logistic Regression to model whether an event occurred.

To get at the influence of vaccines, I split the data up into quarters. Since the data started in 2020 and I pulled it from the database in September 2021, I have seven quarters. This is NOT really a substitute for vaccine efficacy, but at the very least the probability tables should shift over time.
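A minimal sketch of how such a quarter index could be derived from a test date. It is written in Python rather than SAS purely to keep it self-contained, and it assumes the first quarter is Q1 2020; the function name `quarter_index` is hypothetical and not taken from the original analysis.

```python
from datetime import date


def quarter_index(d: date, start_year: int = 2020) -> int:
    """Number quarters consecutively, with Q1 of start_year counted as 1."""
    if d.year < start_year:
        raise ValueError("date precedes the first quarter of the data")
    quarter_of_year = (d.month - 1) // 3 + 1  # 1..4 within a calendar year
    return (d.year - start_year) * 4 + quarter_of_year


# A test in January 2020 falls in quarter 1, one in September 2021 in quarter 7.
print(quarter_index(date(2020, 1, 15)))
print(quarter_index(date(2021, 9, 30)))
```

Treating the resulting index as a categorical predictor, rather than a linear trend, lets each quarter shift the probabilities independently.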

Let’s get started.

As always, the data has to be imported and wrangled to make it suitable. The FREQ output to the right shows the dataset I used for the majority of the analyses, containing the age groups I want.
The sample size of the data provides enough events to model using Logistic Regression.
The number of observations per quarter. You can see a heavy increase in testing after the third quarter.

Next up is a lot of code to plot the data. I am looking at new cases, hospitalizations, and deaths over time per province. The graphs should really speak for themselves.

Personally, for this kind of data, I really like heatmaps.

The last two plots clearly show that, most of the time, people who test positive are neither admitted to the hospital nor die.

Finally, the analysis part. I kept it straightforward. Considering the data and the number of events, I am confident in the ability of each model to produce probability tables and probability curves. The risk of sparsity, and thus of model non-convergence, was basically non-existent.
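To make the idea of a probability table concrete, here is a rough sketch of how a fitted logistic model turns covariate combinations into predicted probabilities. Everything in it is an assumption for illustration: the age groups, the reference levels, and above all the coefficients are made-up placeholders, not estimates from the Dutch data.

```python
import math
from itertools import product


def inv_logit(eta: float) -> float:
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Placeholder coefficients on the log-odds scale. These are invented for
# illustration only and are NOT results from the analysis in this post.
INTERCEPT = -6.0
COEF_AGE = {"0-49": 0.0, "50-69": 2.0, "70+": 3.5}  # reference: 0-49
COEF_SEX = {"Female": 0.0, "Male": 0.4}              # reference: Female
COEF_QUARTER = {1: 0.0, 2: -0.5, 3: -1.0}            # reference: quarter 1


def predicted_probability(age: str, sex: str, quarter: int) -> float:
    """Sum the effects on the log-odds scale, then back-transform."""
    eta = INTERCEPT + COEF_AGE[age] + COEF_SEX[sex] + COEF_QUARTER[quarter]
    return inv_logit(eta)


def probability_table() -> dict:
    """Predicted probability for every age / sex / quarter combination."""
    return {
        (age, sex, q): predicted_probability(age, sex, q)
        for age, sex, q in product(COEF_AGE, COEF_SEX, COEF_QUARTER)
    }


for key, p in probability_table().items():
    print(key, f"{p:.4f}")
```

Because the effects add up on the log-odds scale, the same coefficient changes the probability by a different amount depending on where the other covariates have already put it.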

First, the results showing the predicted probabilities for hospital admission.

Probability for hospital admission is a function of sex, age, and quarter. But above all, age.
Here you can see that the predicted probabilities decrease over the first three quarters, until only the age / sex effect remains.

Next, the predicted probabilities for death, based on hospital admission, age, sex, and time. The ROC curve almost seems too perfect, but I suspect that is mostly a function of the sheer number of events.
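For readers who want to see what the area under a ROC curve actually measures, the sketch below computes it directly from its definition: the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event. The function name `auc` and the four example scores are hypothetical. The pairwise loop is only suitable for small inputs; with millions of rows a rank-based formula would be the usual choice.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Toy example: three of the four event / non-event pairs are ordered correctly.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```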

Predicted probabilities clearly show a relationship between age, sex, hospital admission, and quarter. Funnily enough, hospital admission gives larger prediction intervals, which might be a result of comorbidity. Comorbidity is not part of this dataset, though, so that is just speculation.
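As background on how an interval around a predicted probability is commonly built, here is a small sketch. It assumes a Wald interval computed on the log-odds scale and then back-transformed, which is a standard approach but not necessarily the one behind the plots in this post. The linear predictors and standard error are made-up numbers.

```python
import math


def inv_logit(eta: float) -> float:
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def wald_interval(eta: float, se: float, z: float = 1.959964):
    """Return (lower, estimate, upper) on the probability scale.

    The interval is symmetric on the log-odds scale and becomes
    asymmetric once it is back-transformed to a probability.
    """
    return inv_logit(eta - z * se), inv_logit(eta), inv_logit(eta + z * se)


# Hypothetical values: the same standard error at two different risk levels.
low_risk = wald_interval(eta=-4.0, se=0.2)
high_risk = wald_interval(eta=-0.5, se=0.2)

for name, (lo, p, hi) in (("low risk", low_risk), ("high risk", high_risk)):
    print(f"{name}: p={p:.4f}, interval=({lo:.4f}, {hi:.4f}), width={hi - lo:.4f}")
```

With an identical standard error on the log-odds scale, the interval is visibly wider on the probability scale when the predicted probability is further from zero, so groups with higher risk can show wider bands for that reason alone.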

So, in a nutshell, Logistic Regression is for sure not out of the picture when it comes to modelling probabilities.


Dr. Marc Jacobs

Scientist. Builder of models, and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.