Managing Attrition: An Introduction to Survival Analysis

Gaurang Mehra
6 min read · Feb 26, 2024


Survival analysis is a class of models and techniques for time-to-event analysis. The event can be a machine failure, the loss of a customer (churn) or, in medical studies, death after diagnosis. These models are widely used in subscription businesses, where knowing a customer's lifetime (the time until the customer is lost) is critical for determining Customer Lifetime Value (CLV). For the purposes of this article, the event is the loss of a customer.

When working on survival analysis we need to keep 2 key concepts in mind:

  1. Censoring: Survival studies have a defined end point, with customers or observations entering the study at various points. Customers/observations that reach the end of the study without the event taking place are called censored. In the figure below the 3rd and 4th observations are censored because the event, in this case customer loss, has not yet occurred. For observation 4 all we can say is that the duration/lifetime is at least 15 months.

Fig 1.1 showing censoring

  2. Event table: Knowing the censored observations, we can build an event table as shown below. We order all the observations/customers by duration. All 4 observations are present at duration=0, so the population at risk at duration=0 is 4. At duration=3 we observe our first event or loss, and the population at risk is still 4 (population at risk is counted at the beginning of the duration period). At duration=5 we have our 2nd loss, and the population at risk drops to 3 since we already had a loss at duration=3. At duration=10 we have a censored observation: it reduces the population at risk for later durations, but the number of events at duration=10 is still 0.

Fig 1.2 Event table
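
To make the bookkeeping concrete, here is a minimal sketch of building such a table by hand in plain Python. The toy data (losses at months 3 and 5, censored customers at months 10 and 15) is an assumption pieced together from the description above, and build_event_table is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Toy data assumed from the description above:
# two losses (months 3 and 5) and two censored customers (months 10 and 15).
durations = [3, 5, 10, 15]
observed = [1, 1, 0, 0]  # 1 = loss was seen, 0 = censored


def build_event_table(durations, observed):
    """Return one row per distinct duration: (event_at, events, censored, at_risk)."""
    events = Counter(d for d, o in zip(durations, observed) if o == 1)
    censored = Counter(d for d, o in zip(durations, observed) if o == 0)

    at_risk = len(durations)  # everyone is at risk at duration 0
    rows = []
    for t in sorted(set(durations) | {0}):
        # at_risk is counted at the start of the period, before removals at t
        rows.append((t, events[t], censored[t], at_risk))
        at_risk -= events[t] + censored[t]
    return rows


event_table = build_event_table(durations, observed)
for row in event_table:
    print(row)
```

Each row keeps the population at risk as it stood at the start of that duration, which is why the first loss at duration=3 still sees all 4 customers at risk.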

Once we have built this table we can calculate the hazard: the risk of loss happening at duration=t (customer tenure=t), given that the customer has survived until t. This is calculated as

h(t) = (number of events at t) / (population at risk at t)

We can then calculate the momentary survival as

Momentary Survival = (1-h(t))

The full survival function up to time t is the product of the momentary survivals:

S(t) = (1-h(0))*(1-h(1))*(1-h(2))*…*(1-h(t))
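
The three formulas above can be sketched in a few lines of Python. The event table rows are typed in by hand and assume the same toy example (4 customers, losses at durations 3 and 5); survival_curve is a hypothetical helper name used only for illustration.

```python
# Event table rows assumed from the toy example: (event_at, events, at_risk)
event_table = [(0, 0, 4), (3, 1, 4), (5, 1, 3), (10, 0, 2), (15, 0, 1)]


def survival_curve(event_table):
    """Return [(event_at, hazard, survival)] as a running product of (1 - h)."""
    curve = []
    survival = 1.0
    for event_at, events, at_risk in event_table:
        hazard = events / at_risk   # h(t) = events at t / population at risk at t
        survival *= 1.0 - hazard    # S(t) = S(previous) * (1 - h(t))
        curve.append((event_at, hazard, survival))
    return curve


curve = survival_curve(event_table)
for event_at, hazard, survival in curve:
    print(f"t={event_at:>2}  h(t)={hazard:.3f}  S(t)={survival:.3f}")
```

Note that the censored rows (durations 10 and 15) have a hazard of 0, so they leave the survival value unchanged; their only effect is on the population at risk.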

We can then plot the survival function column against the event_at (duration) column to get a survival curve. We just built our first curve.

Fig 1.3 First Survival Curve

This curve clearly has 2 major problems:

  • We do not have enough data to estimate survival for the entire population. Typically the dataset you use to build a survival curve should have at least a few hundred observations that are representative of the underlying population.
  • The proportion of censored observations is too high (50%). Typically the proportion of censored observations should be less than 40% for this kind of survival analysis to be useful.
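
A quick way to check the second point on your own data is to compute the share of censored observations before fitting anything. This is a minimal sketch: the observed flags are the assumed toy values (1 = loss, 0 = censored), and the 40% cut-off is simply the rule of thumb quoted above, not a hard limit.

```python
# Toy event flags assumed from the example: 1 = loss observed, 0 = censored
observed = [1, 1, 0, 0]

n_obs = len(observed)
censored_share = observed.count(0) / n_obs

# 0.40 is the rule-of-thumb threshold mentioned in the text
too_censored = censored_share >= 0.40

print(f"{n_obs} observations, {censored_share:.0%} censored, too censored: {too_censored}")
```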

This class of survival analysis, where we build an event table from the data and calculate the survival function from it, is called Kaplan-Meier analysis. Fortunately we do not need to build the event table manually. The Python lifelines library has a built-in class called KaplanMeierFitter. We only need to pass a durations column and an event-observed column to the fit method of a KaplanMeierFitter object.

This type of model has a few pros and cons.

Pros:

  1. Simple and intuitive to understand
  2. Very commonly used across industries
  3. Leads to simple graphical representation

Cons:

  1. Descriptive, not predictive. Since no model is learned, it cannot predict on new data. At best you can match new data to a broad segment of the existing data.
  2. Not well suited to sensitivity analysis. For example, it cannot answer questions like: keeping all else constant, how will survival be impacted if I change price by 1 cent?

Now let's see how we can apply this to a real-world dataset. The dataset we are using, from DataCamp, covers regimes across countries over time. The duration column is the duration of a particular regime in a particular country.

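import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter

# Move to the folder that holds the data
# (this is the author's Colab/Drive path; adjust it to your own setup)
os.chdir("/content/drive/MyDrive/Colab Notebooks/Datacamp/Survival Analysis (1)")

# Load the regimes dataset
regime = pd.read_csv('dd.csv')

# Preview the first few rows
regime.head()

Fig 1.4 Head of regimes data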

The head of the data frame shows how the data is organized. We have an event column (observed), which records whether the regime has reached its end, and a duration column, which records how long the regime lasted. If observed = 0 and we have a duration, the observation was censored.

# Instantiate a KaplanMeierFitter object
kmf = KaplanMeierFitter()

# Fit it to the data: durations first, then the event-observed flags
kmf.fit(durations=regime['duration'], event_observed=regime['observed'])

# Plot the survival function
kmf.plot_survival_function()
plt.show()

Fig 1.5 Kaplan Meier survival curve for regimes data

Looking at this curve we can see that the median survival time is ~5 years. We do not have to read it off the plot: the median_survival_time_ attribute gives it directly. We can also get the survival function as a table.

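# Median survival time: the duration at which the survival curve reaches 50%
print("The median survival time of a regime is", kmf.median_survival_time_)

# Survival function as a table, indexed by duration
kmf.survival_function_

Fig 1.6 Survival function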
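
For intuition, the median can be read off a survival table as the first duration at which survival drops to 50% or below. The sketch below uses that usual definition; we are assuming lifelines follows the same convention rather than quoting its implementation. The toy_curve values are hypothetical and are not taken from the regimes data.

```python
import math


def median_survival_time(curve):
    """curve: list of (time, survival) pairs sorted by time.

    Returns the first time at which survival is <= 0.5,
    or infinity if the curve never gets that low.
    """
    for time, survival in curve:
        if survival <= 0.5:
            return time
    return math.inf


# Hypothetical survival values, for illustration only
toy_curve = [(0, 1.0), (1, 0.8), (2, 0.65), (4, 0.5), (7, 0.3)]
median = median_survival_time(toy_curve)
print("Median survival time:", median)
```

If more than half of the population is still surviving at the end of the study, the curve never crosses 50% and the median cannot be estimated from the data, which is why the sketch returns infinity in that case.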

We can also compare the survival of 2 sub-groups in the data, in this case the regimes of Western Europe with those of Southern Asia. We simply filter the base dataset to these sub-groups using the pandas loc method. We then fit each sub-group with its own KaplanMeierFitter, like earlier.

# Instantiate 2 KaplanMeierFitter objects, one for W Europe and one for S Asia
kmf_w_europe = KaplanMeierFitter()
kmf_s_asia = KaplanMeierFitter()

# Filter to Western Europe and fit kmf_w_europe
w_europe = regime.un_region_name == 'Western Europe'
kmf_w_europe.fit(regime.loc[w_europe, 'duration'],
                 regime.loc[w_europe, 'observed'])

# Filter to Southern Asia and fit kmf_s_asia
s_asia = regime.un_region_name == 'Southern Asia'
kmf_s_asia.fit(regime.loc[s_asia, 'duration'],
               regime.loc[s_asia, 'observed'])

# Plot both curves on the same axes
fig, ax = plt.subplots()
kmf_w_europe.plot_survival_function(ax=ax, label="Survival curve Western Europe")
kmf_s_asia.plot_survival_function(ax=ax, label="Survival curve Southern Asia")
plt.show()

Fig 1.7 Comparing Western European regimes to Southern Asian regimes

Clearly we can see that Western European regimes last for a shorter duration than Southern Asian regimes. This makes sense, given that democratic regimes have regular elections, with power shifting from one party or leader to another. Southern Asia has had long-running dictatorships/monarchies over time, which shows up as higher regime survival on the survival curve.

As we can see, the Kaplan-Meier curve is not smooth, and as the segment size gets smaller the curve becomes more stepped and discontinuous. This is generally not great for business forecasting.

Another set of models, called parametric models, addresses this issue. In addition, parametric models can easily run sensitivity analysis and estimate the effect of individual factors on survival. More on this in the next article.

GitHub link: https://github.com/gmehra123/data_science_projs/blob/main/Survival_Analysis.ipynb

