Public misconceptions about survival analysis and COVID

TLDR

Fedor Galkin
Longevity Algos
6 min read · Jul 26, 2021


  • The overall accuracy of a survival classifier can be a misleading measure. Pay attention to the sensitivity and specificity in the original report;
  • LogReg models lose a lot of information by ignoring the duration of observation;
  • Extending the duration of observation can significantly shift the reported metrics toward lower specificity and higher sensitivity;
  • Time-to-death models are not assumption-free;
  • CPH assumes that a patient’s survival probability declines smoothly, with no sudden drops. In real life, an unsuccessful treatment may sharply reduce the survival chances, and a successful one may halt the decline entirely;
  • While an algorithm may be theoretically sound, any implementation has to be empirically verified. Data scientists should be akin to engineers, who subject their designs to meticulous real-life testing whenever an element has a non-zero chance of harming a human.

Introduction

Earlier this week Deep Longevity released a paper called “Increased pace of aging in COVID-related mortality”. It features a time-to-death COVID risk calculator developed on a cohort of over 5,000 patients.
Media frequently cite the accuracy metrics of such models while ignoring their limitations or failing to provide the necessary background. Some survival analysis concepts are common knowledge among scientists but, at the same time, are not well understood by the average reader.
This blog entry is intended to help readers critically approach articles featuring odds ratios, hazard ratios, and other types of survival statistics.

Sensitivity and specificity

Any mathematical model relies on a set of assumptions and ground truths to make any calculations possible. The world is complex and unpredictable. The very fact that we can somehow use strict algorithms to anticipate the occurrence of random events is a miracle in itself.
Survival analysis treats the event of interest as a random event. In the case of the COVID risk calculator, this event is a patient’s death, and one way to model it is with logistic regression (LogReg).

The basic assumption of LogReg is that there is no time. You observe a patient at a single point, put a mark in your notepad (1 for dead, 0 for alive), and never look at them again.

A LogReg model takes in a vector of parameters and returns the probability of an event happening. Note that the model does not return a solid “Yes” or “No”. It returns a number between 0 and 1. Intuitively, the closer the output is to 1, the more likely the patient is not to survive. But it is ultimately your task to find the threshold above which a float turns into a hard “1”.

[Logistic regression is trained using labeled data: 0 for no event observed, 1 for event observed. Once trained, the model yields a sigmoid function that shows how the probability of an event depends on a variable X (or many variables X, Y, Z…). To predict the labels for new samples, a cutoff probability value has to be chosen. One can try many different cutoffs to find one that balances the sensitivity and specificity of the model.]
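To make this concrete, here is a minimal sketch (made-up data, not the pipeline from our paper) of how a LogReg model’s float output is turned into a hard label using scikit-learn:

```python
# A minimal sketch with made-up data: fit a LogReg model and turn
# its probability output into a hard 0/1 label with a cutoff.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: rows are patients, columns are risk factors;
# y is 1 for a recorded death, 0 for a patient last seen alive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

model = LogisticRegression().fit(X, y)

# predict_proba returns a float in (0, 1), not a hard "Yes"/"No"...
p_death = model.predict_proba(X)[:, 1]

# ...so choosing the cutoff is your job. 0.5 is the default
# convention but rarely the best choice, as discussed below.
labels = (p_death >= 0.5).astype(int)
```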

Any threshold you choose is a tradeoff between sensitivity and specificity. Sensitivity is the ability of a model to recognize lethal cases correctly, while specificity is its ability to recognize survivors correctly. At first glance, this tradeoff seems counterintuitive. Shouldn’t a truly accurate model be good at both these things? Unfortunately, it rarely is. Statistical models are dry algorithms, and their results are in no way obliged to conform to our notions of ethics, beauty, or common sense. It is very easy, in fact, to pick a threshold that increases one of these metrics at the expense of the other. If the threshold is set to exactly 1, the model will declare every patient a survivor: its specificity will be 100%, but its sensitivity will be 0%. Not so good for a model that tries to simulate a life-and-death scenario.

At the same time, making the model 100% sensitive is not good either. In a clinical setting, treating all patients as those in need of urgent help is a sure way to quickly exhaust all the available resources.

Clearly, the optimal cutoff lies inside the (0; 1) interval. One may naively assume that setting the threshold to 0.5 is the best solution, but that is hardly ever the case. Various biases and class imbalances will make the final model you receive imbalanced as well. How do you choose the threshold then?

[It is always possible to sacrifice either specificity or sensitivity to maximize the other. The trick to LogReg is finding the optimal threshold.]

One popular way is to score all possible cutoffs according to the geometric mean (G-mean) of their sensitivity and specificity; in our case, this is equivalent to finding the cutoff at which sensitivity and specificity are closest to each other (such a search is sketched below). Yet even when G-mean is used to find the optimal cutoff, specificity and sensitivity may end up far apart. For example, the LogReg model in our paper showed 57% sensitivity and 88% specificity. This imbalance is partially caused by survivors outnumbering lethal cases. But another important contributor is censoring. LogReg completely ignores the time dimension of an observation: a person who survived a week after admission and a person who survived a month after admission are equal for a LogReg model. Since a living patient can still die but a dead patient cannot come back to life, the number of survivors in such uncontrolled settings is usually exaggerated. If the observation were continued until all infection cases resolved, a LogReg model such as the one we reported would be expected to gain sensitivity and lose specificity.
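Here is how such a G-mean cutoff search might look in code, reusing the hypothetical y (labels) and p_death (predicted probabilities) from the sketch above:

```python
# A minimal sketch of a G-mean cutoff search, reusing the made-up
# y and p_death from the previous snippet.
import numpy as np
from sklearn.metrics import roc_curve

# roc_curve evaluates every meaningful cutoff at once:
# tpr is the sensitivity, and specificity is 1 - fpr.
fpr, tpr, thresholds = roc_curve(y, p_death)
gmeans = np.sqrt(tpr * (1 - fpr))

best = np.argmax(gmeans)
print(f"cutoff={thresholds[best]:.3f}  "
      f"sensitivity={tpr[best]:.2f}  specificity={1 - fpr[best]:.2f}")
```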

Time-to-death survival analysis

Luckily, there are alternatives to LogReg that account for how long a patient has been hospitalized. After all, there is a difference between a patient who died one week after hospitalization and a patient who died after one month; LogReg assigns equal significance to these two observations. The same goes for patients marked as “alive” after one week or one month of observation.

By far, the most popular model that can account for the duration of stay is Cox proportional hazards (CPH). Although it loses less information than LogReg, it also operates on a set of assumptions that limit its real-life use.
First, there can be only one event. Some extensions allow you to define competing events, but they are nowhere near as widespread as the classical CPH.
The second assumption is featured in the name of CPH: the proportionality assumption. All individuals share the same baseline hazard function h(t), which determines how their probability of remaining alive declines over time. To obtain individual hazard functions, h(t) is multiplied by a factor derived from each patient’s risk factors (see the formula below). In other words, all individual hazard functions are scaled versions of the same baseline curve, so no two individuals can have intersecting survival curves.
Third, the baseline h(t) is one particular function. The model may not be applied to other populations whose baseline hazard has a significantly different shape.
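In textbook notation (not the exact variables of our paper), the proportionality assumption can be written as a shared baseline hazard scaled by a time-independent multiplier:

```latex
% CPH hazard of patient i with risk factors x_i:
h_i(t) = h_0(t)\,\exp\!\big(\beta_1 x_{i1} + \dots + \beta_p x_{ip}\big)
% The baseline h_0(t) cancels in any pairwise comparison:
\frac{h_i(t)}{h_j(t)} = \exp\!\big(\beta^\top (x_i - x_j)\big)
```

Because the time-dependent part h_0(t) cancels out, the hazard ratio between any two patients is the same at every moment, which is exactly why their survival curves can never cross under this model.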

When CPH is used, it is preferable to control the setting in a way that ensures these assumptions are satisfied. For example, in clinical trials of a new drug, confounders are kept to a minimum and all participants receive the same treatment. In a real-life clinical setting, these rules are impossible to uphold. COVID patients are very diverse, and, more importantly, their treatment may differ. Thus, the survival curves of some cohorts may intersect, which violates the proportionality assumption.
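In practice, the proportionality assumption can be tested directly. Here is a minimal sketch (again on made-up data, not our paper’s code) using the lifelines library, whose CoxPHFitter can flag covariates that appear to violate the assumption:

```python
# A minimal sketch with made-up data: fit a CPH model and test the
# proportional hazards assumption with the lifelines library.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(30, 90, size=n),      # hypothetical risk factor
    "is_icu": rng.integers(0, 2, size=n),     # hypothetical risk factor
    "days": rng.exponential(30.0, size=n).round() + 1,  # follow-up time
    "died": rng.integers(0, 2, size=n),       # 1 = died, 0 = censored
})

cph = CoxPHFitter()
cph.fit(df, duration_col="days", event_col="died")

# Flags covariates whose effect appears to change over time,
# i.e. potential violations of the proportionality assumption.
cph.check_assumptions(df, p_value_threshold=0.05)
```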

Final word

Our COVID risk calculator was derived from a CPH model trained on a cohort of 3,987 admitted COVID patients, 1,269 of them confirmed lethal cases. The model was verified on a cohort of 1,328 admitted COVID patients, 421 of them lethal cases.

[The online risk calculator we report is based on a CPH model. Unlike LogReg, CPH considers the time at which an event has been observed. The final model can be used to obtain a survival curve: each individual’s probability of surviving to time T. The probability is always equal to 1 at T=0 and approaches 0 at T=inf. The COVID Risk Score is a linear approximation of the CPH model that can be used to compare the risk between any two individuals.]
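Continuing the hypothetical lifelines sketch from above, this is roughly how a fitted CPH model yields both a survival curve and a comparable risk score (the log partial hazard, i.e. the linear part of the model):

```python
# Continuing the hypothetical lifelines sketch from above.
import pandas as pd

new_patients = pd.DataFrame({"age": [45, 79], "is_icu": [0, 1]})

# Per-patient survival curves: P(alive at time t), equal to 1 at t=0.
curves = cph.predict_survival_function(new_patients)

# The linear part of the model, beta^T x, usable as a risk score
# for comparing any two patients against each other.
risk_scores = cph.predict_log_partial_hazard(new_patients)
print(risk_scores)
```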

As far as COVID research projects go, we consider this a sufficiently large data set. However, it is not free from biases. One may notice that the original model features a binary variable called “Is black”. We used this definition since “Black” is the second most common race reported during admission (the most common, by the way, is “Other”). One can only guess why this is the case.

We would also like to emphasize once again that the reported COVID Risk Calculator is unreliable when applied to non-admitted or non-infected people, or to anyone outside of the US healthcare system.

Nonetheless, we are interested in seeing how this calculator performs in other populations and if other researchers can verify our findings. For these purposes, we publicly release the survival model as a free online tool.
