Survival Analysis | An Introduction

A detailed introduction to Survival Analysis.

Sushma Dhamodharan
Analytics Vidhya
9 min read · Mar 1, 2020


In logistic regression, we are interested in how risk factors are associated with the presence or absence of disease. Sometimes, though, we are interested in how a risk factor or treatment affects the time to disease or to some other event. We may also have study dropout, and therefore subjects for whom we do not know whether they developed the disease. In these cases, logistic regression is not appropriate.

Survival analysis is used to analyze data in which the Time until the Event is of interest. The response is often referred to as a Failure Time, Survival Time, or Event time.

What is Survival Analysis?

Survival analysis is a collection of statistical procedures for data analysis for which the outcome variable of interest is time until an event occurs.

In layman’s terms, we analyze how long it takes for the event of interest to occur, using whatever follow-up data we have on each subject up to that point.

Two terms are central to everything that follows:

Time and Event

Time, also called the survival time, can be measured in months or years, or expressed as age; it gives the time that an individual has survived over some follow-up period.

Event usually refers to failure or death. For example, the event may be “developing heart disease,” with the outcome being “time in years until a person develops heart disease.” The event can also be a positive one, such as “time to return to work after a surgical procedure”; in this case the “failure” is a positive event. Often more than one kind of event is possible, and in those cases we treat them as competing events.

Another main advantage of survival analysis is that it takes censored data into account.

Censoring

Censoring occurs when we have some information about an individual’s survival time, but we don’t know the survival time exactly.

Censoring during study of truck maintenance times

Why does censoring occur?

Example: Leukemia patients followed until they go out of remission.

  • The study ends while the patient is still in remission
  • The person goes out of remission only after the study ends
  • The person withdraws from the study

In general, data are said to be censored if:

  • A person does not experience the event before the study ends.
  • A person is lost to follow-up during the study period.
  • A person withdraws from the study because of death (when death is not the event of interest) or for some other reason.

Different types of Censoring

Right Censoring

Right censoring: the true survival time is equal to or greater than the observed survival time. This happens when a person is lost to follow-up, withdraws from the study, or has not yet experienced the event when the study ends. For these data, the complete survival time interval, which we don’t really know, has been cut off (i.e., censored) at the right side of the observed survival time interval.

Left Censoring

Left censoring: the true survival time is less than or equal to the observed survival time. Suppose we are following people until they become HIV+. When a person first tests positive, we know the infection occurred before the test, but we may not know exactly when. In this case the data are left-censored.

Interval Censoring

Interval censoring: the true survival time lies within a known time interval. Suppose a person had two HIV tests, one negative (at time t1) and a later one positive (at time t2). We do not know the exact time of infection, only that it lies in the interval (t1, t2).

Left censoring: t1 = 0, t2 = upper bound

Right censoring: t1 = lower bound, t2 = infinity

Even though censored observations are incomplete, in that we don’t know a person’s survival time exactly, we can still use the information we have on the censored person up to the time we lost track of him or her, rather than simply throwing that information away.
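
The (t1, t2) view of censoring above can be sketched in a few lines of Python. This is only an illustration: the class name `Observation`, its fields, and the example values are hypothetical, and an exactly observed time is represented here as t1 = t2.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    lower: float  # t1: last time the subject was known to be event-free
    upper: float  # t2: first time the event was known to have happened

    def kind(self) -> str:
        """Classify the observation by the interval that contains the true time."""
        if self.lower == self.upper:
            return "exact"
        if math.isinf(self.upper):
            return "right-censored"     # t1 = lower bound, t2 = infinity
        if self.lower == 0:
            return "left-censored"      # t1 = 0, t2 = upper bound
        return "interval-censored"      # true time somewhere in (t1, t2)


# Hypothetical follow-up records (times in arbitrary units)
examples = [
    Observation(5.0, 5.0),        # event seen at t = 5
    Observation(3.0, math.inf),   # still event-free at t = 3, then lost
    Observation(0.0, 4.0),        # already positive at the first test, t = 4
    Observation(2.0, 6.0),        # negative at t = 2, positive at t = 6
]
kinds = [obs.kind() for obs in examples]
print(kinds)
```

Every record keeps both bounds, so a censored subject still contributes what is known about them instead of being dropped.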

Censoring Assumptions

  1. Independent: censoring is independent provided it is random within any subgroup of interest, i.e., within each subgroup, the subjects who are censored should be representative of those who remain at risk.
  2. Random: the failure rate for subjects who are censored is assumed to be equal to the failure rate for subjects who remain in the risk set and are not censored.
  3. Non-informative: the distribution of survival times T provides no information about the distribution of censoring times C, and vice versa.

To judge whether censoring is non-informative, we have to understand the distributions of both time-to-event and time-to-censoring.

Consider a drug study in which some people have side effects after taking the drug. Many of those people will stop taking the drug (that is, withdraw from the study). In such a case, the risk for people with side effects is not the same as for the other people in the study, so the censoring is informative. If the people who withdraw are at higher risk than those who stay, we end up overestimating survival.

Terminologies and Notations

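  • T : random variable for a person’s survival time (T ≥ 0)
  • t : specific value of T (for example, to ask whether a person survives more than 5 years after cancer therapy, we set t = 5 and ask whether T > 5)
  • d : dichotomous (0/1) variable indicating failure or censorship: d = 1 if failure (the event has happened), d = 0 if censored
  • Survivor function S(t) = P(T > t) : probability that a person survives longer than some specified time t.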

Properties of S(t):

Theoretical S(t)
  • Non-increasing: S(t) never rises as t increases.
  • At t = 0 no one has had the event yet, hence S(0) = 1.
  • As t → ∞ (the study period increases without limit), eventually nobody would survive, so S(t) → 0.
Practical S(t)
  • In practice, with actual data, we obtain a step function rather than a smooth curve. And since the study period is limited and not everyone studied gets the event, the estimated survivor function may not go all the way down to 0.

Hazard Function h(t)

The hazard function gives the instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
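
Written out, this definition takes the following standard form. The symbol Δt denotes a small time interval, and the limit is what makes the quantity instantaneous:

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P\left(t \le T < t + \Delta t \mid T \ge t\right)}{\Delta t}
```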

Let’s take the example of a car’s speedometer showing 50 km/hr. This means that if we maintained the same speed for 1 hour we would cover 50 km, but in reality the speed goes up and down. Thus the reading gives the instantaneous potential at the moment you look at the speedometer, i.e., how fast you are going at that moment (instantaneous velocity). Similarly, the hazard function h(t) gives the instantaneous potential at time t for getting an event.

The numerator of h(t) is a conditional probability of the form P(A | B): the probability that a person’s survival time, T, will lie in the time interval between t and t + Δt, given that the survival time is greater than or equal to t. Because of the “given” sign here, the hazard function is sometimes called a conditional failure rate.

Why is h(t) called a rate?

As we see, the expression is a ratio of two quantities: the numerator is the conditional probability and the denominator is Δt, a very small time interval. By this division we obtain a probability per unit time, which is no longer a probability but a rate.

In particular, the scale for this ratio is not 0 to 1, as for a probability, but rather ranges between 0 and infinity, and depends on whether time is measured in days, weeks, months, or years, etc.
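
A small sketch with made-up numbers shows how the value of a rate depends on the time unit. The probability of 1/3 and the half-day interval are assumptions chosen purely for illustration.

```python
def rate(probability, interval_length):
    """Probability per unit time over a short interval."""
    return probability / interval_length


p = 1 / 3                      # hypothetical conditional probability of the event
half_day_in_days = 0.5         # interval length expressed in days
half_day_in_weeks = 0.5 / 7    # the same interval expressed in weeks

rate_per_day = rate(p, half_day_in_days)
rate_per_week = rate(p, half_day_in_weeks)
print(f"{rate_per_day:.2f} per day, {rate_per_week:.2f} per week")
```

Both numbers describe the same risk over the same interval. The per-week figure is larger than 1, which no probability could be.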

In short, the hazard function h(t), also called the conditional failure rate, gives the instantaneous potential for failing at time t per unit time, given survival up to time t. The hazard function:

  • Can go in any direction over time (it may increase, decrease, or stay constant).
  • Is always non-negative.
  • Has no upper bound.

Different types of hazard functions

Note: the y axis of the graphs is h(t)

If S(t) seems the more natural way to describe the survival experience of a study cohort, why do we need h(t)?

It is a measure of instantaneous potential, whereas a survival curve is a cumulative measure over time.

It can be used to identify a specific model form, such as an exponential, Weibull, or log-normal curve, that fits the data.

Survival models are usually written in terms of the hazard function, which makes it the natural vehicle for mathematically modeling the data.

If we know one of the two functions, S(t) or h(t), we can determine the other.
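
The two standard relationships are S(t) = exp(-∫₀ᵗ h(u) du) and h(t) = -S′(t) / S(t). The sketch below checks both numerically for a Weibull model, whose hazard is h(t) = λ·p·t^(p-1) and whose survivor function is S(t) = exp(-λ·t^p). The parameter values and function names are hypothetical, chosen only for illustration.

```python
import math

LAM, P = 0.1, 1.5  # hypothetical Weibull scale and shape parameters


def hazard(t):
    """Weibull hazard h(t) = lambda * p * t^(p - 1)."""
    return LAM * P * t ** (P - 1)


def survival(t):
    """Weibull survivor function S(t) = exp(-lambda * t^p)."""
    return math.exp(-LAM * t ** P)


def survival_from_hazard(t, steps=10_000):
    """Recover S(t) as exp(-integral of h from 0 to t), using the trapezoid rule."""
    width = t / steps
    area = 0.0
    for i in range(steps):
        a, b = i * width, (i + 1) * width
        area += 0.5 * (hazard(a) + hazard(b)) * width
    return math.exp(-area)


def hazard_from_survival(t, eps=1e-5):
    """Recover h(t) as -S'(t) / S(t), using a central difference for S'(t)."""
    slope = (survival(t + eps) - survival(t - eps)) / (2 * eps)
    return -slope / survival(t)


t = 5.0
print(survival(t), survival_from_hazard(t))   # the two values should agree closely
print(hazard(t), hazard_from_survival(t))     # and so should these two
```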

Goals Of Survival Analysis

  1. Estimate and interpret survivor and/or hazard functions.
  2. Compare survivor and/or hazard functions.
  3. Assess the relationship of explanatory variables to survival time.

Data Layout

Data can be represented in two formats:

  1. Basic computer layout: one row per subject, holding the survival time, the failure/censoring indicator, and the explanatory variables.
  2. Counting process (CP) format: used for more complex data, where the event may happen multiple times for the same individual (for example, a person who defaults on loan repayments more than once).
Computer Layout
CP Format
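
As a rough sketch, the two layouts could be held in plain Python structures as follows. The column names (`id`, `t`, `d`, `treated`, `start`, `stop`, `event`), the helper `to_counting_process`, and all values are hypothetical. The sketch assumes the usual coding of 1 for an observed event and 0 for censoring.

```python
# Basic layout: one row per subject.
# t = observed time, d = 1 if the event was observed, 0 if censored (assumed coding).
basic_layout = [
    {"id": 1, "t": 6, "d": 1, "treated": 1},
    {"id": 2, "t": 9, "d": 0, "treated": 1},
]


def to_counting_process(subject_id, event_times, end_of_follow_up):
    """Split one subject's follow-up into (start, stop] rows, one per interval."""
    rows = []
    start = 0
    for stop in sorted(event_times):
        rows.append({"id": subject_id, "start": start, "stop": stop, "event": 1})
        start = stop
    if start < end_of_follow_up:
        # Final stretch of follow-up with no further event: a censored interval.
        rows.append({"id": subject_id, "start": start,
                     "stop": end_of_follow_up, "event": 0})
    return rows


# A hypothetical borrower who defaults at months 4 and 10 and is followed to month 15.
cp_rows = to_counting_process(subject_id=3, event_times=[4, 10], end_of_follow_up=15)
for row in cp_rows:
    print(row)
```

In the counting process format the same subject appears on several rows, each covering one interval at risk, which is what lets repeated events be recorded.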

Descriptive measures of Survival Analysis

Here T1 bar = 359/21 ≈ 17.1 for the treatment group (group 1) and T2 bar = 182/21 ≈ 8.7 for the placebo group (group 2), where T bar is the average of the observed survival times, ignoring censorship.
  • Because several of the treatment group’s times are censored, group 1’s true average is even larger than what we have calculated. Thus it appears from the data (without any formal analysis) that, regarding survival, the treatment is more effective than the placebo.
  • In our example, the average hazard (h bar, the number of failures divided by the total observed time) for the treatment group is smaller than the average hazard for the placebo group. Thus, using average hazard rates, we again see that the treatment group appears to be doing better overall than the placebo group; that is, the treatment group is less prone to fail than the placebo group.

Descriptive measures (T bar and h bar) give an overall comparison; they do not compare the groups over time. For a comparison over time we use survivor functions/curves.
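
The two descriptive measures can be computed with a short sketch. The individual times below are the widely reproduced leukemia remission data, with 21 patients per group and times in weeks. Treating them as the data behind this example is an assumption, made because their totals match the sums 359 and 182 quoted above. The average hazard is taken here to be the number of failures divided by the sum of observed times.

```python
# (time, d) pairs; d = 1 means the event was observed, d = 0 means censored.
# Assumed dataset: leukemia remission times, treatment vs placebo.
treatment = [
    (6, 1), (6, 1), (6, 1), (6, 0), (7, 1), (9, 0), (10, 1),
    (10, 0), (11, 0), (13, 1), (16, 1), (17, 0), (19, 0), (20, 0),
    (22, 1), (23, 1), (25, 0), (32, 0), (32, 0), (34, 0), (35, 0),
]
placebo = [(t, 1) for t in
           [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]]


def mean_time(group):
    """T bar: average observed time, ignoring whether it was censored."""
    return sum(t for t, _ in group) / len(group)


def average_hazard(group):
    """h bar: number of failures divided by the total observed time."""
    return sum(d for _, d in group) / sum(t for t, _ in group)


for name, group in [("treatment", treatment), ("placebo", placebo)]:
    print(name, round(mean_time(group), 1), round(average_hazard(group), 3))
```

Under these assumptions the treatment group has the longer average time and the smaller average hazard, in line with the comparison described above.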

Note: For comparing two groups we have to consider confounding and interaction.

Confounding

Confounding is the distortion of the true relationship between exposure and disease by the influence of one or more other factors. These other factors are known as confounders. Confounding variables are nuisance variables, in that they get in the way of the relationship of interest. It is therefore desirable to remove their effects.

Interaction

The effect of one explanatory variable on the outcome depends on the particular level of another explanatory variable.

Example: The effect of treatment may differ depending on the level of white blood cell count (WBC).

For example, suppose that for persons with high log WBC, survival probabilities for the treatment are consistently higher over time than for the placebo. This circumstance is illustrated by the first graph at the left. In contrast, the second graph, which considers only persons with low log WBC, shows no difference in treatment and placebo effect over time. In such a situation, we would say that there is strong treatment by log WBC interaction, and we would have to qualify the effect of the treatment as depending on the level of log WBC.

When confounding or interaction is present, strategies for handling it include stratifying the data or using a proportional hazards model.

A simple example

Shown below is a life table, calculated using the Kaplan-Meier survival function, along with its survival curve; both will be explained in detail in future posts.
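
As a preview, here is a minimal sketch of the Kaplan-Meier (product-limit) calculation: at each distinct event time, the running survival estimate is multiplied by the fraction of at-risk subjects who did not fail. The function name `kaplan_meier` and the six observations are hypothetical and are not the data behind the life table.

```python
def kaplan_meier(observations):
    """Return [(t, S(t))] at each distinct event time.

    observations: list of (time, d) pairs, d = 1 for an observed event,
    d = 0 for a censored time.
    """
    event_times = sorted({t for t, d in observations if d == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for u, _ in observations if u >= t)
        # Subjects who fail exactly at time t
        events = sum(1 for u, d in observations if u == t and d == 1)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical data: events at 2, 4, 4, 5; censored at 3 and 6
data = [(2, 1), (3, 0), (4, 1), (4, 1), (5, 1), (6, 0)]
curve = kaplan_meier(data)
for t, s in curve:
    print(t, round(s, 3))
```

The estimate changes only at event times, which gives the step shape, and because the last subject is censored the curve stops above 0.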

