Survival Analysis

Jade Zhou
Inside Machine learning
8 min read · Oct 2, 2019

Part 2 — An Example: How to Predict Future Repair Amount?

Introduction

The idea of survival analysis comes from a businessman, John Graunt. He built a life table with three columns (Age, Died, Survived) to analyze mortality statistics in London. Thirty years after Mr. Graunt published his book, Edmund Halley began applying mathematical representations to the life table, and survival analysis gradually matured into a research field.

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as the death of a patient or the failure of a mechanical system. It can answer questions like: after a certain time, what percentage of patients will still be alive? However, we don’t need to limit its usage to classic survival problems. Continuing from the blog by my colleague, Kunal, we will show you how to reshape a repair-amount prediction question, usually treated as a time series forecasting problem, into a survival analysis structure, and how to fit survival models to get predictions. The data sets come from the Kaggle competition ASUS Malfunctional Components Prediction. The goal is to predict the future repair amount for each module-component pair at a monthly level.

We use the lifelines package for fitting survival models.

0. Survival Analysis Basics

Please refer to this page to get familiar with the key concepts of survival analysis. Make sure you understand the following three concepts and two probability functions before reading on:
- Event
Traditional: whether a patient has died.
Current: whether a component needs to be repaired.
- Duration
How long before the event happens.
- Censoring
Subjects for which we did not observe the event during the observation period are called right-censored data.
- Survival Function
The probability that the death event has not occurred yet at time t:
S(t) = Pr(T > t)
- Hazard Function
The probability of the death event occurring at time t, given that it has not occurred up to time t.

Fig1. Hazard Function

1. What data do we have?

Basically, we have three tables:
1) SaleTrain.csv
It contains the date of sale and the number of sales for each module-component pair. We sum the number of sales grouped by module, component and date of sale.
2) RepairTrain.csv
It contains the date of sale, the date of repair and the repair amount for each pair. We sum the number of repairs grouped by module, component, date of sale and date of repair. Then we merge the “SaleTrain” and “RepairTrain” tables into one input table.

Fig 1. Input Table after Preprocessing

3) Output_TargetID_Mapping.csv

Fig 2. Output Table after Preprocessing

As shown in the pictures above, we defined some new columns, including “sale_date” and “repair_date”, that convert actual dates into a month index. I defined the earliest month in the data sets as month 1 and then calculated the difference in months between each sale/repair date and month 1. This step simplifies the later modeling and prediction process.
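That month-index conversion can be sketched as follows (the year/month string format here is an assumption; the real columns should be parsed the same way):

```python
import pandas as pd

# Example sale/repair dates; month 1 is the earliest month in the data
dates = pd.to_datetime(pd.Series(['2005/01', '2005/03', '2007/12']))
start = dates.min()
# Offset in whole months from the earliest month, shifted so it starts at 1
month_index = (dates.dt.year - start.year) * 12 + (dates.dt.month - start.month) + 1
print(month_index.tolist())  # [1, 3, 36]
```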

2. How does our problem fit the survival analysis setting?

It’s natural to treat our example as a time series forecasting problem, so how do we bridge the gap between time series and survival analysis? In every survival analysis problem, there are key concepts, such as right-censoring, waiting to be defined. Following this structure, we start to fill in the blanks.

Fig 3. Fill in Key Concepts

Based on the table above, we can rephrase the problem in a mathematical way. We need to predict the future monthly cumulative hazard for each component, then apply the following formula to get the future repair amount of components for a specific module and a specific sale date:

Fig 5. Math Representation of Our Problem

Finally, sum the amounts by component, module and repair date to map into the output table Kaggle needs.
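In symbols (my notation, not the competition’s), with n(s) the number of sales of a given module-component pair in sale month s, and H the estimated cumulative hazard for that pair, the predicted repair amount in repair month t is roughly:

```latex
\hat{r}_{m,c}(t) \approx \sum_{s \le t} n_{m,c}(s)\,\bigl[\, H_{m,c}(t-s) - H_{m,c}(t-s-1) \,\bigr]
```

The bracketed difference is the monthly hazard increment we compute later with cumulative_hazard_.diff().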

Back to our data: the event is whether the component was sent back for repair, so we add a column called “event” whose values are all 1. The duration, called “months_after_sold”, is how long after the sale the repair of a specific component was requested. As for right-censoring, it means we didn’t observe the event during the experiment period, so we cannot rule out the possibility of the event happening beyond the time frame. A common misunderstanding is that the target of a survival model is whether the event happened; it is actually whether the observation is right-censored. Therefore, it’s really important to find our right-censored data, which requires a trick. For each module-component pair, we already know the total amount of sales and how many repairs were requested during the observed period, so subtracting the two tells us how many components are still “alive” after the observed period. We therefore define the right-censored data to be the components which did not require repair during the observed period, with a value of 0 in the “event” column. Finally, we define the number of events happening in one month as the frequency.

import numpy as np
import pandas as pd

final_repair_date = round((end_date - start_date) / np.timedelta64(1, 'M'))

def add_censor(x):
    # Append one right-censored row per group: sold units that never came back
    new_row = pd.DataFrame(x.iloc[-1, :]).transpose()
    new_row['repair_date'] = final_repair_date
    new_row['months_after_sold'] = new_row['repair_date'] - new_row['sale_date']
    new_row['event'] = 0
    new_row['freq'] = new_row['number_sale'] - sum(x['freq'])
    return pd.concat([x, new_row])

df_input = df_input.groupby(['module_category', 'component_category', 'sale_date']).apply(add_censor).reset_index(drop = True)
Fig 6. Input Table after Adding Right-Censored Data
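As a sanity check on the censoring arithmetic, a toy group (invented numbers) works out like this:

```python
import pandas as pd

# One module-component pair: 100 units sold in month 1,
# with 10 repairs observed in month 3 and 20 in month 5
g = pd.DataFrame({'sale_date': [1, 1],
                  'repair_date': [3, 5],
                  'months_after_sold': [2, 4],
                  'event': [1, 1],
                  'number_sale': [100, 100],
                  'freq': [10, 20]})

# Units that were never repaired become one right-censored row (event = 0)
censored_freq = g['number_sale'].iloc[0] - g['freq'].sum()
print(censored_freq)  # 70
```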

3. What models will we use?

In general, there are two kinds of models for survival analysis. The first type is univariate models, which only take information about time, right-censoring and frequency. The second is regression models, which can also incorporate the influence of exogenous variables. In this blog, I’ll start with a very simple non-parametric model, the Nelson-Aalen fitter, and go through the steps in detail; then I’ll touch on Cox regression more briefly, because you can follow steps similar to the Nelson-Aalen fitter’s.

Modeling and Predicting — Nelson-Aalen Fitter

  • Modeling: Component Level Fitter

Since missing data is everywhere, we run into the problem that there are not enough observations for a specific component-module pair to fit a model. Therefore, we can fit data at the component level first, which means fitting a model for every component, and then use the repair amount predictions for the corresponding component to fill in the missing pieces.

import pandas as pd
import matplotlib.pyplot as plt
from lifelines import NelsonAalenFitter

naf = NelsonAalenFitter(nelson_aalen_smoothing = False)
comp_hazard = pd.DataFrame(columns = ['component_category', 'months_after_sold', 'hazard'])
for component in df_input['component_category'].unique():
    print('Current component %s' % component)
    cur_df = pd.DataFrame(columns = ['component_category', 'months_after_sold', 'hazard'])
    df_input_specific = df_input[df_input['component_category'].str.match(component)]
    naf.fit(durations = df_input_specific['months_after_sold'], event_observed = df_input_specific['event'], weights = df_input_specific['freq'])
    cur_df['months_after_sold'] = naf.cumulative_hazard_.index
    cur_df['hazard'] = naf.cumulative_hazard_.diff().fillna(0).values
    cur_df['component_category'] = component
    comp_hazard = pd.concat([comp_hazard, cur_df])
    naf.plot_hazard(bandwidth = 3)
    plt.show()

From the code above, you can see that fitting a survival model is pretty simple: you just need to specify the duration, event and weight. That’s why we spent so much time in the second part explaining how to find the corresponding pieces.

Fig 7. Estimated Monthly Cumulative Hazard at Component Level

To give you a rough picture of the hazard for a specific component, I set bandwidth = 3 for visualization:

Fig 8. Estimated Hazard for Component P04

In the figure above, the unit of the X axis is months and the Y axis represents the hazard rate. The picture indicates a pattern: the risk of repair gets higher and higher at the beginning, then fluctuates within a small range until 24 months after the sale. After that, the risk decreases sharply and finally stabilizes around 0. I found similar patterns for most components. Some people in the Kaggle community suggested the reason might be a two-year warranty period: after 24 months, repairs cost money, so people stop requesting them.

  • Modeling: Module-Component Level Fitter

Run similar code as above, but at the module-component level.

Fig 9. Estimated Monthly Cumulative Hazard at Module-Component Level
  • Predicting

Once we have the two tables above, we first merge the output table with the hazard table for module-component pairs, and then merge the results from the component-level model with the output table to fill in the missing values in the hazard column.

df_final_output = df_output.merge(mod_comp_hazard[['module_category', 'component_category', 'months_after_sold', 'hazard']], on = ['module_category', 'component_category', 'months_after_sold'], how = 'left')
df_final_output_nona = df_final_output[~df_final_output['hazard'].isnull()]
df_final_output_na = df_final_output[df_final_output['hazard'].isnull()].drop('hazard', axis = 1)
df_final_output_na = df_final_output_na.merge(comp_hazard, on = ['component_category', 'months_after_sold'], how = 'left')
df_final_output_na.isnull().sum()
Fig 10. Number of Missing Values for Each Column

However, there are still lots of missing values, so we need to look into them and find ways of dealing with them. After some exploration, we can identify the patterns behind module-component pairs or components that have missing values:

  • The duration is beyond the range of the fitted model
  • There are not enough observations to fit a model whose time frame is long enough
  • The monthly cumulative hazard is missing for specific months

To finish the predictions simply, we propose preliminary ways to deal with each of the above cases:

  • Replace missing values with 0, because the estimated hazard plots show the hazard stabilizing around 0 in the far future for these components.
  • Fit linear models on the monthly cumulative hazard based on historical data.
  • Directly interpolate the missing hazard values.
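A minimal sketch of fixes 1 and 3 on an invented hazard series (the per-component linear model of fix 2 is omitted for brevity; the month-24 cut-off follows the warranty pattern noted earlier):

```python
import numpy as np
import pandas as pd

# Monthly hazard indexed by months_after_sold; NaNs are the missing pieces
hazard = pd.Series([0.02, np.nan, 0.05, np.nan, np.nan],
                   index=[1, 2, 3, 30, 31])

# Fix 3: interpolate gaps inside the fitted range (here, up to month 24)
hazard.loc[:24] = hazard.loc[:24].interpolate()
# Fix 1: durations beyond the fitted range get hazard 0
hazard.loc[25:] = hazard.loc[25:].fillna(0)
print(hazard.round(3).tolist())  # [0.02, 0.035, 0.05, 0.0, 0.0]
```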

Lastly, we multiply the number of sales by the monthly hazard and sum the result over module, component and repair date to get the final predictions.

df_final_output['pred_repair'] = df_final_output['number_sale'] * df_final_output['hazard']
df_final_output = df_final_output.groupby(['module_category', 'component_category', 'repair_date'])['pred_repair'].sum().reset_index()
df_final_output.head()
Fig 11. Final Predictions

Beginning of Cox Regression

After playing with univariate models, we start to think about incorporating other information, such as how many sales were made at a specific time, into our model. Now it’s time to introduce the Cox proportional hazards model. To put it simply, this model assumes the log-hazard of an individual is a linear function of their static covariates plus a population-level baseline hazard that changes over time. You can refer to this page for a better understanding.
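In formula form (standard Cox notation; the covariates x_i here would be things like the number of sales):

```latex
h(t \mid x) = b_0(t)\,\exp\!\left( \sum_{i} \beta_i x_i \right)
```

where b_0(t) is the population-level baseline hazard and the beta_i are the coefficients the model estimates.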

We can use “LabelEncoder” from the sklearn package to convert categorical variables like “module_category” into integers, and then fit one model first.

from sklearn import preprocessing

le_comp = preprocessing.LabelEncoder()
le_mod = preprocessing.LabelEncoder()
df_cox_input = df_input[['module_category', 'component_category', 'number_sale', 'sale_date', 'months_after_sold', 'event', 'freq']].copy()
le_comp.fit(df_cox_input['component_category'])
df_cox_input['component_category'] = le_comp.transform(df_cox_input['component_category'])
le_mod.fit(df_cox_input['module_category'])
df_cox_input['module_category'] = le_mod.transform(df_cox_input['module_category'])
Fig 12. Input Table for Cox Regression

One thing worth mentioning is that we need to check whether the proportional hazards assumption holds for every Cox regression model. There is a function called check_assumptions in lifelines for that.

Fig 13. Results from check_assumptions

Following the results above, we have to go back and fit component-level models again. Now it’s your turn to finish the whole modeling and prediction flow!

Summary

Congrats! Now you know how to recast a time series problem in a survival analysis structure, along with the corresponding code to fit models and make predictions. To conclude, there are many models for time series forecasting, but the best model type always depends on the case. If you are stuck in time series thinking, changing your perspective and applying survival analysis might help. I hope this series of blogs on survival analysis gives you a new view when dealing with forecasting problems.

Click here to learn more about the IBM Data Science Elite.
