Introduction to Survival Analysis (in Python)

Pablo Ibieta
Published in DataLab Log · 21 min read · Oct 26, 2022

[Cover image: DALL-E output for the word “survival”]

In this post we will go through a quick review of the collection of statistical techniques known as survival analysis. In survival analysis, we are generally interested in the time until an event happens. Thus, the response variable is usually called survival time, failure time or event time. Historically, survival analysis has found most of its applications in biomedical studies and industrial life testing. Some examples of situations where we could be interested in studying the time until an event occurs are:

  • The survival time of a patient after a surgery or treatment,
  • the time until a former prisoner goes back to jail,
  • the time taken for a farm to experience its first case of an exotic disease,
  • … or the time until a person dies.

Censoring

When following subjects in a survival experiment (such as the ones listed above), some subjects will typically have censored survival times; that is, their survival times are not actually observed. This is often the case when the event of interest is death, for instance. In other words, censoring occurs when we are unable to collect complete data for a subject (because they dropped out of the study, for instance).

  • Time-to-censoring C is the duration between the start event and the censoring. Censoring typically occurs when the study in question ends.
  • right-censoring: censoring that occurs before the event, that is, no information could be collected about the event of interest for that subject.
  • left-censoring: occurs when a person’s true survival time is less than or equal to the time when information was observed, e.g., time until COVID exposure.

To visualize this, we can construct an artificial censored dataset as follows:
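As a minimal sketch (the column names, sample size and exponential event times below are assumptions, not taken from the original notebook):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_subjects = 10
study_end = 10.0  # the study ends after 10 time units

# true (possibly unobserved) event times, here exponentially distributed
true_event_time = rng.exponential(scale=8.0, size=n_subjects)

# subjects whose event falls after the study end are right-censored
observed_time = np.minimum(true_event_time, study_end)
event_observed = (true_event_time <= study_end).astype(int)

df_synth = pd.DataFrame({
    "subject": np.arange(1, n_subjects + 1),
    "time": observed_time.round(2),
    "event": event_observed,  # 1 = event observed, 0 = censored
})
print(df_synth)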


So, we have generated a right-censored survival dataset. Let us visualize this synthetic dataset:
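Continuing the sketch above, one way to draw a picture like Figure 1 is:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
for _, row in df_synth.iterrows():
    # one horizontal track per subject, ending at the observed time
    ax.hlines(row["subject"], 0, row["time"], color="tab:blue")
    marker = "o" if row["event"] == 1 else "x"  # o = event, x = censored
    ax.scatter(row["time"], row["subject"], marker=marker, color="tab:red", zorder=3)
ax.set_xlabel("time")
ax.set_ylabel("subject")
ax.set_title("Synthetic right-censored survival data")
plt.show()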

[Figure 1: Synthetic survival dataset visualization]

In the above picture, subjects 2, 5 and 6 actually suffered an event, while subject 7 was lost to follow-up. That is, we do not know whether subject 7 suffered an event or whether their records simply disappeared for some reason (e.g. moving away from the location where the study is being performed). Note that when the event under study is death, censoring also covers unregistered deaths, such as the COVID-19 underreporting issues seen around the world.

Describing time to event

Let us start by defining the basic quantities in survival analysis:

Definition [Survival Time]: Let T ∈ ℝ⁺ be a non-negative random variable (survival time) and t a particular value of T. This quantity is often called failure time or time to event, and it is the central quantity in survival analysis.

When dealing with survival (failure) times, special care has to be taken to have a clear and unambiguous definition of the time origin from which the survival time is measured. For instance, we can consider age as the survival time, in which case the time origin is naturally the subjects’ time of birth. Another example can be found in medical studies, where the time origin might be defined by a specific event such as entering the study or having been diagnosed with some disease. Needless to say, the definition of what constitutes an actual event (or failure) also has to be clear.

Let us go back to Figure 1, where we constructed a very small synthetic survival dataset. There are subjects with no registered event, which could happen either because the individual has not experienced the event or because we lost track of the subject. These kinds of observations are said to be right censored; more precisely:

Definition [Censor variable]: Let d ∈ {0,1} be the response variable such that d=1 if the event has happened and d=0 in the case of censoring; that is, the event has not been observed, either because it has not happened or because we have lost track of the subject during the experiment.

A right-censoring mechanism is said to be independent if the event (failure) rates of subjects at any survival time t>0 are the same as they would have been had the censoring never occurred.

Death density

When considering the time until an event happens (e.g. death), we can construct a histogram of the count of events as a function of time. In simple terms, we could fit a curve to this histogram (e.g. using Kernel Density Estimation) and produce a death probability density function f(t). After normalization, at any given time t the proportion of observed events in the population is given by:

F(t) = P(T ≤ t) = ∫₀ᵗ f(u) du,

the cumulative death distribution function. Essentially, we are assuming the time-to-event T to be a random variable with cumulative distribution function F(t) and death density f(t) = dF(t)/dt.

To illustrate this, we will use a dataset that contains information about countries from 1946 to 2008. Thus, this is a right-censored dataset: countries that still exist at the end of the observation window (2008) have censored durations. Among the information about each country there are two special columns named bornyear and endyear, which hold the years in which a country (with that name) came into existence and ceased to exist. We can use each country's duration as our survival time random variable.
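A sketch of how the duration histogram could be built (the file name countries.csv and the censoring rule for countries still existing in 2008 are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

countries = pd.read_csv("countries.csv")  # assumed file name

# duration (in years) of each country; countries still existing in 2008 are right-censored
countries["duration"] = countries["endyear"] - countries["bornyear"]
countries["event"] = (countries["endyear"] < 2008).astype(int)

countries["duration"].plot.hist(bins=30)
plt.xlabel("duration (years)")
plt.title("Countries survival histogram")
plt.show()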

[Figure: Countries survival histogram]

Survival function

Definition: The survival function S(t) gives the probability that a subject survives longer than some specified time t, in other words:

S(t) = P(T > t) = 1 − F(t).

Given a death density function f(t), the area under the curve to the right of time t is the proportion of subjects in the population who have survived past time t. We call this quantity S(t), and it is given by:

S(t) = ∫ₜ^∞ f(u) du.

The survival function is closely related to the instantaneous hazard rate, which we introduce next.

Instantaneous Hazard function

Definition: The instantaneous rate at which a randomly-selected individual known to be alive at time t will die in the interval [t, t+Δt) is called the conditional failure rate or instantaneous hazard, λ(t), given by:

λ(t) = lim_{Δt→0} P(t ≤ T < t+Δt | T ≥ t) / Δt,

where Δt is an infinitesimally small time interval.

Note that, in contrast to the survival function, the instantaneous hazard rate focuses on the event rather than on surviving up to the event. The above expression can be written as:

λ(t) = lim_{Δt→0} P(t ≤ T < t+Δt) / (Δt · P(T ≥ t)) = f(t) / S(t),

where in the first step we have used the definition of conditional probability, P(A|B) = P(A ∩ B)/P(B), and then the fact that for very small Δt we can write P(t ≤ T < t+Δt) ≈ f(t)·Δt. In the last expression, f(t) is the death density and S(t) is the survival function.

Furthermore, by noting that

f(t) = −dS(t)/dt,

we can write for the hazard function:

λ(t) = −(1/S(t)) dS(t)/dt = −d ln S(t)/dt.
Example: This relation allows us to analyze a very simple scenario: the one where the hazard rate does not change with time t, that is, λ(t)=c, where c is a constant value. This leaves for the survival function:

S(t) = exp(−c·t).

The above is straightforward to illustrate with a few lines of code:
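For instance, a sketch plotting S(t) = exp(−c·t) for a few (arbitrary) values of c:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 200)
for c in (0.2, 0.5, 1.0):
    # a constant hazard lambda(t) = c implies S(t) = exp(-c * t)
    plt.plot(t, np.exp(-c * t), label=f"c = {c}")
plt.xlabel("t")
plt.ylabel("S(t)")
plt.legend()
plt.title("Survival function for a constant hazard rate")
plt.show()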

[Figure 2: Constant hazard rate]

Definition: It is also natural to define the cumulative hazard function as:

Λ(t) = ∫₀ᵗ λ(u) du.

We can rewrite the survival and death density functions in terms of this cumulative hazard function:

S(t) = exp(−Λ(t)),   f(t) = λ(t) exp(−Λ(t)).

Note that the hazard function might be of more intrinsic interest than the p.d.f. to a patient who has survived a certain time period and wants to know something about their prognosis. There are also other reasons for introducing the hazard functions λ(t) and Λ(t):

  • Interpretability: Suppose T denotes time from surgery for breast cancer until recurrence. Then when a patient who had received surgery visits her physician, she would be more interested in conditional probabilities such as “Given that I haven’t had a recurrence yet, what are my chances of having one in the next year” than in unconditional probabilities (as described by the p.d.f.).
  • Analytic Simplifications: When the data are subject to right censoring, hazard function representations often lead to easier analyses. For example, imagine assembling a cohort of N patients who have just turned 50 years of age and then following them for 1 year. Then, if d of them die during the year of follow-up, the ratio d/N estimates the (discrete) hazard function of T=age at death. We will see that Λ(·) has nice analytical properties.
  • Modeling Simplifications: For many biomedical phenomena, T is such that λ(t) varies rather slowly in t. Thus, λ(·) is well-suited for modeling.

The Expected Survival Time or life expectancy is given by:

E[T] = ∫₀^∞ t·f(t) dt.

After integrating by parts, and making use of the fact that

lim_{t→∞} t·S(t) = 0,

we get:

E[T] = ∫₀^∞ S(t) dt.
The Likelihood Function for Censored Data

Suppose we have a cohort of N observations governed by a survival function S(t) with death density f(t) and hazard λ(t). If the i-th subject is observed until time tᵢ and the subject died, its contribution to the likelihood function is:

Lᵢ = f(tᵢ) = λ(tᵢ) S(tᵢ).

If the subject is still alive at time t=tᵢ, all we know is that the survival time T of subject i exceeds tᵢ. The likelihood of this event,

Lᵢ = P(T > tᵢ) = S(tᵢ),

is the contribution to the likelihood of a censored observation.

Let dᵢ ∈ {0,1} be the death indicator variable for the i-th observation, which takes the value 1 if the subject died and 0 otherwise. Then, we can write the likelihood function as:

L = ∏ᵢ f(tᵢ)^dᵢ S(tᵢ)^(1−dᵢ) = ∏ᵢ λ(tᵢ)^dᵢ S(tᵢ).

This leaves for the log-likelihood:

ln L = Σᵢ [ dᵢ ln λ(tᵢ) − Λ(tᵢ) ].

Example: Let us go back to our good old exponential distribution example. The hazard rate function is constant, λ(t)=λ, and for the cumulative hazard we have Λ(t)=λt. The log-likelihood then reads:

ln L = Σᵢ ( dᵢ ln λ − λ tᵢ ).

If

D = Σᵢ dᵢ

is the total number of deaths and

T = Σᵢ tᵢ

is the total observation time, we can write the log-likelihood as:

ln L = D ln λ − λT.

We obtain a score function by differentiating this expression with respect to λ:

u(λ) = d ln L / dλ = D/λ − T.

Setting this score to zero gives the maximum likelihood estimator of the hazard:

λₘₗ = D/T,

the total number of deaths divided by the total observation time, a.k.a. the death rate. The information is obtained from minus the second derivative of the log-likelihood:

I(λ) = −d² ln L / dλ² = D/λ².

Using the maximum likelihood estimate λₘₗ we get an estimator for the variance:

var(λₘₗ) ≈ 1/I(λₘₗ) = λₘₗ²/D,

which can be used to construct confidence intervals for λ.

Approaches to Survival Analysis

Estimating the Survival function

We could be tempted to estimate this quantity as:

Ŝ(t) = nₛ(t)/n,

where nₛ(t) is the number of subjects surviving beyond T=t and n is the total number of subjects. But note that in the presence of censoring this estimator cannot be used, since the numerator is not always known. To see this, consider the data in the example below: the subject indexed as 13 has a survival time T=25 days, but the response variable status has a value of 0, which means the observation was censored on the 25th day. For this subject, we do not have information about the occurrence of the event. The only thing we know is that the event has not occurred during the first 25 days of observation, the point at which we lost track of the subject. This right censoring does not allow us to calculate the real number of surviving subjects, the numerator of the Ŝ(t) estimator above.

To illustrate the survival function estimation, we will use a dataset concerning a Veteran’s Administration Lung Cancer Trial. It consists of a randomized trial of two treatment regimes for lung cancer. The dataset contains the following variables:

  • Treatment: denotes the type of lung cancer treatment; standard and test drug.
  • Celltype: denotes the type of cell involved; squamous, small cell, adeno, large.
  • Karnofsky_score: is the Karnofsky score.
  • Diag: is the time since diagnosis in months.
  • Age: is the age in years.
  • Prior_Therapy: denotes any prior therapy; none or yes.
  • Status: denotes the status of the patient as dead or alive; dead or alive.
  • Survival_in_days: is the survival time in days since the treatment.
import pandas as pd

df = pd.read_csv('veterans.csv')
df.iloc[[11, 5, 32, 13, 23]]
[Output: the selected rows of the veterans dataset]

Consider our naive estimator Ŝ(t) = nₛ(t)/n. From the above table, we can compute Ŝ(t) for times up to t = 25 days. But we cannot compute Ŝ(30), because we do not know whether subject number 4 in the table (indexed 13) is still alive at t = 30; all we know is that the subject was still alive at t = 25.

Kaplan-Meier Estimator

An estimator that can be used for right censored data (as it is our case) is the so called Kaplan-Meier estimator.

Definition: [Kaplan-Meier] Let dᵢ be the number of events at time T=tᵢ, and let nᵢ be the number of subjects at risk just before tᵢ, that is, the number of subjects that have neither experienced the event nor been censored before time tᵢ. The Kaplan-Meier estimator of the survival function is defined as:

Ŝ(t) = ∏_{i: tᵢ ≤ t} (1 − dᵢ/nᵢ).
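As a sketch, the curve for time since treatment can be obtained with scikit-survival's kaplan_meier_estimator, here using the copy of the veterans data shipped with the library (the curve over age is obtained analogously, using the age column as the time axis):

from sksurv.datasets import load_veterans_lung_cancer
from sksurv.nonparametric import kaplan_meier_estimator
import matplotlib.pyplot as plt

# y is a structured array with fields "Status" (bool) and "Survival_in_days"
X, y = load_veterans_lung_cancer()

time, survival_prob = kaplan_meier_estimator(y["Status"], y["Survival_in_days"])
plt.step(time, survival_prob, where="post")
plt.xlabel("time (days)")
plt.ylabel("estimated S(t)")
plt.show()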

[Figure 3: Kaplan-Meier curves for age and time since treatment]

Patients in this study were given two treatments, this information is stored in the trt column of the dataset. We can use this flag to investigate the impact of the two different treatments on the above Survival curves:

df['trt'].value_counts()

1    69
2    68
Name: trt, dtype: int64
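One way to produce the treatment-split curves (a sketch; the time and status column names of the CSV are assumptions):

from sksurv.nonparametric import kaplan_meier_estimator
import matplotlib.pyplot as plt

# 'status' (0/1 event indicator) and 'time' are assumed column names
for trt_value, grp in df.groupby('trt'):
    time, surv = kaplan_meier_estimator(grp['status'].astype(bool), grp['time'])
    plt.step(time, surv, where="post", label=f"treatment = {trt_value}")
plt.xlabel("time (days)")
plt.ylabel("estimated S(t)")
plt.legend()
plt.show()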
[Figure 4: Kaplan-Meier estimates by age and time, split by the treatment variable]

The above Kaplan-Meier estimates of the survival curves allow us to study the effect of covariates on the survival function. This approach of splitting the dataset into smaller groups according to a variable is useful, but it is not feasible when we want to consider more than a couple of variables, because the subgroups become very small very quickly. For these kinds of situations we need models! BAM!

Cox Proportional Hazard Models

This family of models, introduced by Cox (1972), focuses on the hazard function instead of the survival function. The simplest member of this family is the proportional hazards model. For this model, the hazard at time t for an individual with covariates xᵢ is assumed to be:

λ(t | xᵢ) = λ₀(t) exp(βᵀxᵢ),

where λ₀(t) is a baseline hazard function that describes the risk of individuals with xᵢ=0, who serve as a reference. The multiplicative term

exp(βᵀxᵢ)

is the relative risk, a proportionate increase or reduction in risk that depends on the set of covariates xᵢ. Note that this increase or reduction of risk does not change with time. In fact, the proportional hazards assumption is satisfied by two samples x₁ and x₂ when the ratio

λ(t | x₁) / λ(t | x₂) = exp(βᵀ(x₁ − x₂))

is constant with time. Taking the logarithm of the hazard we obtain:

ln λ(t | xᵢ) = ln λ₀(t) + βᵀxᵢ,

which yields a linear model for the logarithm of the hazard. Returning to the expression for the hazard, we can integrate it to obtain the cumulative hazard functions:

Λ(t | xᵢ) = Λ₀(t) exp(βᵀxᵢ),

and exponentiating we obtain the survival functions:

S(t | xᵢ) = S₀(t)^exp(βᵀxᵢ).

Thus, the effect of the relative risk factor on the survival function is to raise it to a power. The likelihood of an event being observed for subject i at time tᵢ can be written as:

Lᵢ(β) = exp(βᵀxᵢ) / Σ_{j ∈ R(tᵢ)} exp(βᵀxⱼ),

where R(tᵢ) is the set of subjects still at risk at time tᵢ. This represents the probability of subject i having an event at time tᵢ among those at risk at time tᵢ. Note that the baseline hazard cancels out in the above expression, so we can write for the (partial) likelihood:

L(β) = ∏_{i: dᵢ=1} exp(βᵀxᵢ) / Σ_{j ∈ R(tᵢ)} exp(βᵀxⱼ).

The above expression has to be corrected for tied events, that is, when the number of deaths dᵢ at time tᵢ is greater than 1. A common choice is Breslow's approximation, which yields:

L(β) = ∏_{i} exp(βᵀsᵢ) / [ Σ_{j ∈ R(tᵢ)} exp(βᵀxⱼ) ]^dᵢ,

where the product runs over distinct event times and sᵢ is the sum of the covariate vectors of the subjects experiencing an event at time tᵢ.

Example: Let us go back to the lung cancer dataset and try to estimate the risk scores and survival curve.
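A minimal sketch with scikit-survival, using the copy of the veterans data shipped with the library (the original notebook may instead reuse the CSV loaded above):

from sksurv.datasets import load_veterans_lung_cancer
from sksurv.preprocessing import OneHotEncoder
from sksurv.linear_model import CoxPHSurvivalAnalysis

# load the data and one-hot encode the categorical covariates
X, y = load_veterans_lung_cancer()
X_num = OneHotEncoder().fit_transform(X)

# fit the Cox proportional hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_num, y)

# predicted risk scores (the linear predictor beta^T x) for each patient
risk_scores = cox.predict(X_num)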

Now, we can call the predict_survival_function method and plot the survival curves:
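Continuing the sketch above (the selection of patients to plot is arbitrary):

import matplotlib.pyplot as plt

# survival curves for the first five patients
for fn in cox.predict_survival_function(X_num.iloc[:5]):
    plt.step(fn.x, fn(fn.x), where="post")
plt.xlabel("time (days)")
plt.ylabel("estimated S(t)")
plt.show()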

[Figure 5: Cox survival model survival curve estimates]

Exponential and Weibull models

Different kinds of proportional hazards models are obtained depending on the choice of baseline hazard function. For example, we could (for some reason) set the baseline hazard function λ₀(t)=λ₀ to be a constant value. Then the hazard of subject i at time t reads:

λᵢ(t) = λ₀ exp(βᵀxᵢ),

which is the exponential model. Another common model is obtained by assuming the death density function follows a Weibull distribution, for which the survival function is:

S(t) = exp(−(λt)ᵖ)

and the hazard function is

λ(t) = p λᵖ t^(p−1),

where λ>0 and p>0 are parameters. If p=1, this model reduces to the exponential model and has constant risk over time. If p>1, the risk increases over time; if p<1, the risk decreases with time. This can be seen by taking the log of the above expression:

ln λ(t) = ln p + p ln λ + (p−1) ln t,

which is increasing in t for p>1 and decreasing for p<1.

Time-Varying Covariates

The nature of the model allows us to consider time-dependent covariates. Let xᵢ(t) denote the value of the covariate vector for subject i at time (or duration) t. The proportional hazards model now reads:

λ(t | xᵢ(t)) = λ₀(t) exp(βᵀxᵢ(t)).

Time-dependent “Effects”

The model can also be generalized so that the coefficients β(t) are a function of time. For example, it is possible that certain social characteristics have a large impact on the hazard for children shortly after birth, but a relatively small impact later in life. So, we would like something like:

λ(t | xᵢ) = λ₀(t) exp(β(t)ᵀxᵢ).

The General Hazard Rate Model

Naturally, we can let everything depend on time to obtain the most general version of a hazard rate model:

λ(t | xᵢ(t)) = λ₀(t) exp(β(t)ᵀxᵢ(t)).

Or… more generally:

λ(t | xᵢ(t)) = fₘₗ(t, xᵢ(t)),

where fₘₗ stands for your favorite machine learning model or any other function approximator.

Discrete Time Models

There are cases where the time variable is actually “discrete”, which basically means that it is not continuous. One example is using age (in whole years) as a predictor of death risk. Having a discrete description is also useful in practice, since it is the discrete description that we use to approximate these functions on our computers.

Let the survival time T be a discrete random variable that takes the values t₁ < t₂ < … < tₙ with density function:

fⱼ = P(T = tⱼ).

The survivor function at time tⱼ is then:

S(tⱼ) = P(T > tⱼ) = Σ_{k > j} fₖ.

Next, we define the hazard at time tⱼ as the conditional probability of dying at that time given that one has survived up to that point, so that:

λⱼ = P(T = tⱼ | T ≥ tⱼ) = fⱼ / S(tⱼ₋₁).

Another result of interest in discrete time is that the probability of surviving up to time tⱼ can be written in terms of the hazard at all prior times t₁, t₂, …, tⱼ₋₁ as:

S(tⱼ₋₁) = P(T ≥ tⱼ) = ∏_{k=1}^{j−1} (1 − λₖ),

which makes sense, since one needs to survive t₁, then one must survive t₂ given that one has survived t₁, and so on, finally surviving tⱼ₋₁.

Measuring the Performance of Survival Models

Usual metrics such as RMSE or correlation might not be suitable for censored variables. The most frequently used evaluation metric of survival models is the concordance index (c-index, c-statistic). It is a rank correlation between predicted risk scores f and observed time points t that is closely related to Kendall’s τ. It is defined as the ratio of correctly ordered (concordant) pairs to comparable pairs.

That is, two samples i and j are said to be comparable if the sample with the lower survival time experienced an event, i.e., if tⱼ > tᵢ and dᵢ=1. A comparable pair is concordant if the risk f estimated by the survival model is higher for the subject with the lower survival time, i.e.,

fᵢ > fⱼ whenever tᵢ < tⱼ;

otherwise the pair is said to be discordant. Essentially, we are talking about the following quantity:

c = P( fᵢ > fⱼ | tᵢ < tⱼ ).

Harrell proposed an estimator of this quantity, known as Harrell's concordance index (c-index or c-statistic), which is implemented in scikit-survival. Its interpretation is analogous to that of the area under the ROC curve: c = 0.5 corresponds to random predictions and c = 1 to a perfect ranking.
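As a sketch, Harrell's c-index for the Cox model fitted above can be computed with concordance_index_censored (the variable names come from the earlier sketches):

from sksurv.metrics import concordance_index_censored

cindex, n_concordant, n_discordant, n_tied_risk, n_tied_time = concordance_index_censored(
    y["Status"], y["Survival_in_days"], cox.predict(X_num)
)
print(f"Harrell's c-index: {cindex:.3f}")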

While Harrell’s concordance index is easy to interpret and compute, it has some shortcomings:

  1. it has been shown that it is too optimistic with an increasing amount of censoring,
  2. it is not a useful measure of performance if a specific time range is of primary interest (e.g. predicting death within 2 years).

The first point is addressed by a weighting procedure known as IPCW (Inverse Probability of Censoring Weights).

The second issue is addressed by extending the well-known ROC curve to possibly censored survival times. Given a time point t, we can estimate how well a predictive model can distinguish subjects who will experience an event by time t (sensitivity) from those who will not (specificity). The resulting AUC for this time-dependent cumulative/dynamic ROC curve is:

AUC(t) = P( fᵢ > fⱼ | tᵢ ≤ t, tⱼ > t ),

where subject i has experienced an event by time t and subject j is still event-free at t.

An estimator for this quantity is implemented in scikit-survival, defined as:

AUĈ(t) = [ Σᵢ Σⱼ 1(tⱼ > t) 1(tᵢ ≤ t) ωᵢ 1(fⱼ ≤ fᵢ) ] / [ (Σᵢ 1(tᵢ ≤ t) ωᵢ) (Σⱼ 1(tⱼ > t)) ],

where the ωᵢ are inverse probability of censoring weights (IPCW), which in particular vanish for censored observations. To estimate the IPCW, access to the survival times of the training data is required in order to estimate the censoring distribution. Note that this requires the survival times in survival_test to lie within the range of the survival times in survival_train. This can be achieved by specifying times accordingly, e.g. by setting times[-1] slightly below the maximum expected follow-up time. The IPCW are computed using the Kaplan-Meier estimator, which restricts their use to situations where the random censoring assumption holds and censoring is independent of the features. Finally, the function also provides a single summary measure that refers to the mean of AUC(t) over the time range (τ₁, τ₂), weighted by the drop of the Kaplan-Meier estimate:

AUC̄(τ₁, τ₂) = 1/(Ŝ(τ₁) − Ŝ(τ₂)) ∫_{τ₁}^{τ₂} AUĈ(t) (−dŜ(t)),

where Ŝ(t) is the Kaplan-Meier estimator of the survival function.

Let us load another built-in dataset from scikit-survival. The dataset has 7874 samples and 9 features:

  • age: age in years
  • sex: F=female, M=male
  • sample.yr: the calendar year in which a blood sample was obtained
  • kappa: serum free light chain, kappa portion
  • lambda: serum free light chain, lambda portion
  • flc.grp: the serum free light chain group for the subject, as used in the original analysis
  • creatinine: serum creatinine
  • mgus: whether the subject had been diagnosed with monoclonal gammopathy (MGUS)
  • chapter: for those who died, a grouping of their primary cause of death by chapter headings of the International Code of Diseases ICD-9

The endpoint is death, which occurred for 2169 subjects (27.5%).

Let us treat some missing values:
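A sketch of the preprocessing (dropping the chapter column and median-imputing creatinine are assumptions, not necessarily the choices made in the original notebook):

import numpy as np
from sksurv.datasets import load_flchain

x, y = load_flchain()  # y has the fields "death" (bool) and "futime" (days)

# drop the cause-of-death chapter (only defined for the deceased) and impute the
# missing creatinine values with the median -- assumed preprocessing choices
x = x.drop(columns=["chapter"])
x["creatinine"] = x["creatinine"].fillna(x["creatinine"].median())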

We need to be a little bit careful when selecting the test data and time points we want to evaluate the ROC at, due to the estimator’s dependence on inverse probability of censoring weighting. First, we are going to check whether the observed time of the test data lies within the observed time range of the training data.
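For instance:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# the observed times of the test data should lie within those of the training data
print(y_train["futime"].min(), y_train["futime"].max())
print(y_test["futime"].min(), y_test["futime"].max())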

When choosing the time points at which to evaluate the ROC, it is important to choose the last time point such that the probability of being censored after it is non-zero. Here we use a more conservative approach and set the upper bound to the 80th percentile of observed time points, because the censoring rate is quite large at 72.5%.
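A sketch of the time grid and of the per-feature time-dependent AUC (the 15-point grid and the set of features are assumptions):

import matplotlib.pyplot as plt
from sksurv.metrics import cumulative_dynamic_auc

# evaluation grid between (roughly) the 5th and the 80th percentile of observed times
times = np.percentile(y["futime"], np.linspace(5, 80, 15))

# time-dependent AUC of individual features used directly as risk scores
for col in ("age", "kappa", "lambda", "creatinine"):
    auc, mean_auc = cumulative_dynamic_auc(y_train, y_test, x_test[col].values, times)
    plt.plot(times, auc, marker="o", label=f"{col} (mean AUC = {mean_auc:.3f})")
plt.xlabel("days from enrollment")
plt.ylabel("time-dependent AUC")
plt.legend()
plt.show()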

[Figure: time-dependent AUC of individual features]

The plot shows the estimated area under the time-dependent ROC at each time point, with the average across all time points shown as a dashed line.

We can see that age is overall the most discriminative feature, followed by κ and λ FLC. The fact that age is the strongest predictor of overall survival in the general population is hardly surprising (we have to die at some point, after all). More differences become evident when considering time: the discriminative power of FLC decreases at later time points, while that of age increases. The observation for age again follows common sense. In contrast, FLC seems to be a good predictor of death in the near future, but not so much if it occurs decades later.

Evaluating model’s predictions

Cox Proportional Hazards model

Fit the Cox proportional hazard model to the train data:
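A sketch, restricted to a few numeric features for simplicity (an assumed selection):

from sksurv.linear_model import CoxPHSurvivalAnalysis

# fit on a simple numeric subset of features -- an assumed choice for this sketch
feature_cols = ["age", "kappa", "lambda", "creatinine"]
cph = CoxPHSurvivalAnalysis().fit(x_train[feature_cols], y_train)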

Using the test data, we want to assess how well the model can distinguish survivors from deceased in weekly intervals, up to 6 months after enrollment.
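Continuing the sketch above:

# weekly evaluation grid, up to roughly six months after enrollment
va_times = np.arange(7, 183, 7)

cph_risk = cph.predict(x_test[feature_cols])
cph_auc, cph_mean_auc = cumulative_dynamic_auc(y_train, y_test, cph_risk, va_times)

plt.plot(va_times, cph_auc, marker="o")
plt.axhline(cph_mean_auc, linestyle="--")
plt.xlabel("days from enrollment")
plt.ylabel("time-dependent AUC")
plt.show()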

[Figure: time-dependent AUC for the Cox model]

The plot shows that the model is doing moderately well on average with an AUC of ≈ 0.72 (dashed line). However, there is a clear difference in performance between the first and second half of the time range. The performance on the test data increases up to 56 days from enrollment, remains high until 98 days and quickly drops thereafter. Thus, we can conclude that the model is most effective in predicting death events in the medium-term.

Random Survival Forests

The downside of the Cox proportional hazards model is that it can only predict a risk score that is independent of time (due to the built-in proportional hazards assumption). Therefore, a single predicted risk score needs to work well for every time point. In contrast, a Random Survival Forest does not have this restriction. So let’s fit such a model to the training data. Details of the inner workings of Random Survival Forests can be found here.

For prediction, we do not call predict, which returns a time-independent risk score, but call predict_cumulative_hazard_function, which returns a risk function over time for each patient. We obtain the time-dependent risk scores by evaluating each cumulative hazard function at the time points we are interested in.
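A sketch, reusing the features and time grid from above (the hyperparameters are arbitrary):

from sksurv.ensemble import RandomSurvivalForest

rsf = RandomSurvivalForest(n_estimators=100, min_samples_leaf=7, random_state=0)
rsf.fit(x_train[feature_cols], y_train)

# evaluate each patient's cumulative hazard function at the chosen time points
chf_funcs = rsf.predict_cumulative_hazard_function(x_test[feature_cols])
rsf_risk = np.vstack([chf(va_times) for chf in chf_funcs])

rsf_auc, rsf_mean_auc = cumulative_dynamic_auc(y_train, y_test, rsf_risk, va_times)

plt.plot(va_times, cph_auc, marker="o", label=f"Cox (mean AUC = {cph_mean_auc:.3f})")
plt.plot(va_times, rsf_auc, marker="s", label=f"RSF (mean AUC = {rsf_mean_auc:.3f})")
plt.xlabel("days from enrollment")
plt.ylabel("time-dependent AUC")
plt.legend()
plt.show()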

[Figure: time-dependent AUC for the Random Survival Forest]

Indeed, the Random Survival Forest performs slightly better on average, mostly due to the better performance in the intervals 25–50 days, and 112–147 days. Above 147 days, it actually is doing worse. This shows that the mean AUC is convenient to assess overall performance, but it can hide interesting characteristics that only become visible when looking at the AUC at individual time points.

Conclusions and further references

IMHO Survival Analysis is about asking the right questions:

  • Does it make sense to tackle this or that problem as a usual classification problem?
  • What are we trying to model?
  • Is time an important variable for prediction?

There is a considerable amount of (open source) resources out there written in Python, such as pySurvival, scikit-survival and lifelines. Also, the people at Loft made clever use of XGBoost and some other techniques to tackle the survival analysis problem, in a package known as XGBoost-survival-embeddings.

The source code used to write this article can be found at: https://github.com/pibieta/blog-posts/blob/main/survival-analysis/intro-survival-analysis.ipynb
