Survival Analysis: What Is It?
Part 1: And How Can It Solve My Business Problems?
The Origin Story
Ironic as it may sound, one of the AI modeling methods powering the social media behavior of millennials has been around for some 350 years, over a third of a millennium!
This curve, created in 1669 (without matplotlib :)), tried to answer a few simple questions: How many people would survive to old age? What was a person’s chance of surviving past 20? Past 36?
This is survival analysis! We are still trying to estimate this curve, only now the outcome can be any binary event, not just death. It tells us “when to worry”.
“When to Worry” Modelling
Survival analysis is also known as “time-to-event analysis”. It’s all about when to start worrying. If only I knew when things will die or fail, I would be happier... and could have a better life by planning ahead!
It is also used to predict when a customer will end their relationship and, most importantly, which factors are most correlated with that hazard.
The key concept here is tenure, or lifetime: how long will something last? If I know that, I may be able to calculate how valuable it is (naturally, something that lasts longer is more valuable than something that dies soon, be it a phone or a customer relationship). This way I can reward reliability or loyalty! The factors that make customers stay or leave, and the quantitative effect of those factors on tenure, are all part of survival analysis. It has applications in other areas too: how long will people keep swiping on Tinder? Once a social media user has gone offline, after how many days are they no longer likely to come back? It’s a big tool in customer retention as well as in warranty forecasting problems.
The mouthful definition of survival analysis can be written as “Statistical methods for analyzing longitudinal data on the occurrence of events. Events may include death, injury, onset of illness, recovery from illness (binary variables), failure of a device, or termination of a relationship or of attention, etc.”
The examples of time-to-event problems below may help make this concrete:
- To estimate time-to-event for a group of individuals, such as time until a second heart attack for a group of MI (myocardial infarction) patients.
- To compare time-to-event between two or more groups, such as treated vs. placebo MI patients in a randomized controlled trial.
- To assess the relationship of covariates to time-to-event, such as: do weight, insulin resistance, or cholesterol influence the survival time of MI patients?
But Why Not Regression?
1. Why not compare the mean time-to-event between your groups using a t-test or linear regression?
Because that ignores censoring.
People in a cancer drug survival trial may die in accidents. Such cases need to be handled differently. This special type of missing data requires treatment of its own to avoid bias in the observed data.
2. Why not compare the proportion of events in your groups using risk/odds ratios or logistic regression?
Because that ignores time.
Such a model implies (much like assuming exponential decay) that a customer's tenure has no effect on the rate of failure or leaving. In reality, customers who have been around longer may be much less likely to leave.
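To make the censoring point concrete, here is a minimal sketch in plain Python. The tenures and the 12-month observation window are made-up numbers for illustration only, not data from any study.

```python
# Hypothetical tenures in months; the observation window ends at month 12.
tenures = [2, 3, 5, 8, 12, 12, 12, 12]
# True = the customer actually left; False = still active at month 12 (censored).
left = [True, True, True, True, False, False, False, False]

# Naive approach 1: average only the customers who left.
churned = [t for t, e in zip(tenures, left) if e]
mean_churned_only = sum(churned) / len(churned)

# Naive approach 2: pretend the active customers all left at month 12.
mean_censored_as_events = sum(tenures) / len(tenures)

print(mean_churned_only)        # 4.5
print(mean_censored_as_events)  # 8.25
```

Both averages are biased low: the four active customers will last at least 12 months, so the true mean tenure can only be higher than either number.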
Retention vs Survival Curves
Retention curve: the stop-and-start curve that shows the proportion of customers retained at a particular point in time. It is helpful because it can tell us the half-life, that is, how long before half of the customers leave or fail. But a retention curve doesn't tell time accurately because it is jagged.
So one can use a survival curve, which gives the probability of a customer surviving to time t. It is the accumulation of all the conditional survival probabilities up to that point, multiplied together. This is much better: one can plot percent survival against tenure in months and observe a steady decrease. Mathematically, averaging hazards doesn’t make sense, so we use this instead. It is also useful for describing the relationship with factors of interest in the presence of covariates like age, gender, etc.
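The "multiply the conditional survivals together" idea can be sketched in a few lines of plain Python. The monthly hazard values are hypothetical, and the function names are just illustrative.

```python
def survival_curve(hazards):
    """Turn per-period hazard probabilities into survival probabilities.

    hazards[i] is the probability of stopping during period i + 1,
    given survival up to the start of that period.
    """
    curve = []
    surviving = 1.0
    for h in hazards:
        surviving *= 1.0 - h  # conditional survival for this period
        curve.append(surviving)
    return curve


def half_life(curve):
    """First period (1-based) at which survival drops to 50% or below."""
    for period, s in enumerate(curve, start=1):
        if s <= 0.5:
            return period
    return None  # survival never reaches 50% in the observed range


# Hypothetical monthly hazards, for illustration only.
monthly_hazards = [0.20, 0.15, 0.10, 0.10, 0.10, 0.10]
curve = survival_curve(monthly_hazards)
print([round(s, 3) for s in curve])
print(half_life(curve))  # 5
```

Because each step multiplies by a number between 0 and 1, the curve can only go down, which is why it is smooth where the retention curve is jagged.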
There are three broad approaches to modeling survival:
- Parametric: This is what we learnt in our school days. A clean and nice equation that defines how long something will last, as with the radioactive decay equation. Popular distributions are the exponential, log-normal or Weibull distributions. Some form of maximum likelihood estimation is then used to fit the model.
- Non-Parametric: This makes no assumptions about the underlying distribution of the data. Kaplan-Meier is a widely used method to estimate and graph survival probabilities as a function of time. It can produce univariate statistics for survival data, including the median survival time, and compare the survival experience of two or more groups of subjects. Popular tests such as the log-rank test (a chi-square-based test) can be used to compare Kaplan-Meier curves.
- Semi-Parametric: This combines the best of the two worlds above by making very few assumptions. Specifically, it makes no assumption about the shape of the so-called hazard function (more below). The most popular method here, named after the famous English statistician Sir David Cox, is called “Cox regression”. With Cox regression one can easily identify the relationship between hazard functions and predictors. One can test differences in survival times between groups and adjust for various covariates of interest.
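As an illustration of the non-parametric route, here is a from-scratch sketch of the Kaplan-Meier estimator in plain Python. The data is made up, and in real projects you would more likely reach for a library (for example, the lifelines package offers a KaplanMeierFitter) rather than this toy version.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time each subject was followed.
    observed:  True if the event happened, False if censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    curve = []
    s = 1.0
    for t in event_times:
        # Everyone followed at least until t is still at risk at t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        s *= 1.0 - events / at_risk
        curve.append((t, s))
    return curve


def median_survival(curve):
    """First event time at which estimated survival is 50% or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached within the data


# Hypothetical follow-up times (months) and event flags.
durations = [1, 2, 2, 3, 4, 5, 5, 6]
observed = [True, True, False, True, False, True, True, False]

km = kaplan_meier(durations, observed)
for t, s in km:
    print(t, round(s, 3))
print(median_survival(km))  # 5
```

Note how censored subjects still count in the at-risk population until they drop out; they are not thrown away, and they are not counted as events.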
But which is the most relevant approach in business problems?
No Life is Ever Normal!
In the case of human longevity, the survival time T is unlikely to follow a normal distribution, because the probability of death is not highest in middle age but at the beginning and end of life. This is the key reason why parametric approaches (while very intuitive) often don't work in practice. And this is true for business problems beyond human lifetime analysis. For example, see the curve below, popularly known as the “Bathtub Curve”!
Originally created for human life, it is useful in business problems as well. The curve shows that at the beginning of the period there is a spike in failure rates. For example, many customers may cancel a subscription in the first 30 days to take advantage of a 30-day return guarantee, or simply because they don’t like the service or product. The curve then stays flat until another spike appears as customer contracts or warranty periods approach their end. Many customers choose to take advantage of this moment to replace products or services. So a spike at both ends and a flat stretch in the middle is a common curve. (Keen eyes will notice that the modern human life curve is very different from the 1669 one shown above, thanks to advances in medicine.)
Censoring & Hazards
Censoring addresses the problem of subjects who were initially part of the study but then dropped out.
- First type: Stopped. Customers who stopped for any reason are no longer included in population counts.
- Second type: Still active. If a customer is still active, we don’t know how much longer they will survive, so they are censored for that particular hazard at t=5 (the end of the observation period) but included up to t=4 (the previous time quantum).
There are many cases where censoring is needed. One example is death due to unrelated events (someone in a cancer drug treatment group dies in an accident). Such a failure is not counted as an event; it is censored. Business examples are involuntary termination of customers, or new regulations that make a company replace all its products.
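In practice, censoring usually shows up as a pair of columns: a duration and a flag saying whether the event was actually observed. Below is a minimal sketch of how one might derive that pair from customer start and stop dates. The function name, the dates and the `voluntary` flag are all hypothetical choices for this illustration.

```python
from datetime import date


def to_survival_record(start, stop, study_end, voluntary=True):
    """Return (duration_in_days, event_observed) for one customer.

    stop is None for customers who are still active at study_end.
    Involuntary stops are treated as censored, not as events.
    """
    if stop is None or stop > study_end:
        return (study_end - start).days, False  # still active: censored
    return (stop - start).days, voluntary


study_end = date(2023, 12, 31)  # hypothetical end of the observation period
records = [
    # Left voluntarily: a real event.
    to_survival_record(date(2023, 1, 1), date(2023, 3, 2), study_end),
    # Still active when the study ends: censored.
    to_survival_record(date(2023, 6, 1), None, study_end),
    # Terminated involuntarily: censored for this hazard.
    to_survival_record(date(2023, 2, 1), date(2023, 5, 1), study_end, voluntary=False),
]
print(records)  # [(60, True), (213, False), (89, False)]
```

The censored customers keep their observed duration; we only record that we never saw them leave of their own accord.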
The hazard is the building block of the non-parametric approaches. It is a function that answers one question: if something is alive at time t, what is the probability that it will die before t+1?
Calculating a hazard means taking two numbers: how many stopped at time t, and how many could have stopped at time t (the population at risk). The hazard probability is the ratio of these two!
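Here is a small plain-Python sketch of that ratio. It follows the convention described in the censoring section: a still-active customer with tenure 5 counts in the population at risk up to t=4 but not at t=5. The tenures are invented for the example.

```python
def hazard_probability(t, tenures, stopped):
    """Hazard at time t: customers who stopped at t / population at risk at t."""
    stops_at_t = sum(1 for d, s in zip(tenures, stopped) if s and d == t)
    at_risk = sum(
        1
        for d, s in zip(tenures, stopped)
        # Stopped customers are at risk up to and including their stop time;
        # censored (still active) customers only before their last observed time.
        if (s and d >= t) or (not s and d > t)
    )
    return stops_at_t / at_risk if at_risk else None


# Hypothetical tenures (in periods) and whether each customer stopped.
tenures = [1, 1, 2, 3, 3, 4, 5, 6, 6, 6]
stopped = [True, True, True, True, False, True, True, False, False, False]

hazards = {t: hazard_probability(t, tenures, stopped) for t in range(1, 6)}
for t, h in hazards.items():
    print(t, round(h, 3))
```

The denominator shrinks over time as customers stop or are censored, which is why a handful of late stops can produce a large hazard.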
Some common hazard types are listed below.
- Constant hazards: The hazard of customers leaving stays the same no matter how long they have been around!
- Bathtub hazard: Starts high, flattens, and finally goes up again. Explained by contracts or manufacturing defects.
- M-shaped curves: Peaks at two intervals. The hazard spikes a second time for a while and then declines gently. The gentle decline is a good thing, as it shows reliability and loyalty.
One challenge comes in counting customers who stopped in year 2 but are also counted in the year 1 calculation. That makes the initial hazards too low. We will address this problem below.
Let’s start with an example! Say we want to prove that the risk of cancer is 2 times higher for smokers than for non-smokers. How can we do so? Whether a person is a smoker or not is an initial condition (or risk factor). The genius of Sir David Cox was that he was able to measure this effect for different factors. This is an actual example in which a modeling algorithm was used to prove that smoking kills and shortens human life, and it thus had a huge impact on public policy about tobacco. Measuring the effects of time-zero covariates (initial conditions) on hazards is such a powerful method that if there is ever a hall of fame for algorithms, this one belongs there!
The Cox regression model provides useful and easy-to-interpret information about the relationship of the hazard function to predictors. While a nonlinear relationship between the hazard function and the predictors is assumed, the hazard ratio comparing any two observations is in fact constant over time, as long as the predictor variables do not vary over time. This assumption is called the proportional hazards assumption, and checking whether it is met is an important part of a Cox regression analysis. It is by far the most popular model for survival data analysis and is implemented in a large number of statistical software packages.
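The proportional hazards idea is easy to check numerically. In the sketch below, the hazard for a subject is a baseline hazard multiplied by exp(beta * x), which is the usual way the Cox model is written. The bathtub-shaped baseline and the coefficient (chosen so that exp(beta) = 2, echoing the smoker example) are assumptions made up for this illustration.

```python
import math

# Hypothetical baseline hazard per period, with a bathtub-like shape.
baseline_hazard = [0.08, 0.03, 0.02, 0.02, 0.03, 0.09]
beta = math.log(2.0)  # assumed coefficient, so that exp(beta) = 2


def hazard(t, x):
    """Hazard at period index t for covariate x: h0(t) * exp(beta * x)."""
    return baseline_hazard[t] * math.exp(beta * x)


# Compare a subject with the risk factor (x=1) to one without (x=0).
ratios = [hazard(t, 1) / hazard(t, 0) for t in range(len(baseline_hazard))]
print([round(r, 6) for r in ratios])  # [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

The baseline hazard moves up and down over time, but the ratio between the two groups stays at 2 in every period. That constant ratio is exactly what the proportional hazards assumption asks for.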
Now we can answer the question of what effect the initial conditions have on hazards, using partial likelihood. If one customer stops at time t, the partial likelihood is the likelihood that it was exactly that particular customer who stopped: it divides the hazard of that specific customer by the sum of the hazards of all customers who might have stopped at t.
The key assumption here is that initial conditions have the same constant effect on all hazards, regardless of time. This means the baseline hazards themselves cancel out of the equation, and we are left with a product of partial likelihoods, which gives the likelihood of customers stopping when they did. Maximum likelihood estimation (MLE) is then used to get the final results.
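To show the mechanics, here is a toy sketch of the partial likelihood for a single binary initial condition, maximized with a crude grid search in plain Python. The data is invented, there are no tied stop times, and the grid search stands in for the proper optimizers that real packages use, so treat it as an illustration rather than a production approach.

```python
import math


def log_partial_likelihood(beta, durations, observed, x):
    """Sum over observed stops of log( exp(beta*x_i) / sum_j exp(beta*x_j) ),

    where j runs over everyone still at risk when customer i stopped.
    The baseline hazard has cancelled out, so it never appears here.
    """
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(durations, observed)):
        if not e_i:
            continue  # censored customers contribute only via risk sets
        denom = sum(
            math.exp(beta * x[j])
            for j, t_j in enumerate(durations)
            if t_j >= t_i
        )
        total += beta * x[i] - math.log(denom)
    return total


# Hypothetical data: x = 1 marks customers with some initial condition.
durations = [2, 3, 5, 7, 8, 11, 13, 14]
x = [1, 1, 0, 1, 0, 1, 0, 0]
observed = [True, True, True, True, True, False, True, False]

# Crude maximum likelihood: try beta from -3.00 to 3.00 in steps of 0.01.
grid = [i / 100 for i in range(-300, 301)]
beta_hat = max(
    grid, key=lambda b: log_partial_likelihood(b, durations, observed, x)
)
hazard_ratio = math.exp(beta_hat)
print(beta_hat, round(hazard_ratio, 2))
```

A positive beta_hat means customers with x = 1 tend to stop sooner, and exp(beta_hat) is the estimated hazard ratio between the two groups.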
We can now compare, say, marketing acquisition channels (email vs. direct mail) and see their impact. Or look at what triggered shopping for customers, or at warranty claims for a product from different factories. These are examples of categorical variables as risk factors, but the same approach works with continuous covariates as well.
Limits of Cox
- It is designed around continuous time and needs tweaking for discrete time periods.
- It assumes that the model of how covariates affect hazards has no time component of its own.
However, it still works in practice for working out which covariates matter most for a given effect. It can also be made to work for covariates that are not fixed at time zero, such as when a customer goes for an upgrade.
While it started out as a way to model survival timelines for humans, survival analysis has a wide range of applications: calculating customer lifetime value, hopping behavior, how long someone will stay active on social media sites, and warranty forecasting.
Part 2 — Survival in Action
That is the example we will see in the next part of this blog. The notebook will explain how to code survival models for a warranty forecasting use case. My colleague @Yushu(Jade) Zhou has written a blog with a code walk-through of survival analysis.