Converting a Probability to Fail Into a Time to Failure Metric
Getting the most out of your Probability to Fail Models
With predictive maintenance problems, there are two common metrics that represent the health of your asset.
The first is a probability to fail. That is, at a given moment in time, what is the probability that your machine will fail? Sometimes this is represented by a health score, typically calculated as (1 − probability to fail) × 100.
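The probability-to-health-score conversion is just a rescaling. A minimal sketch (the `health_score` helper is mine, for illustration, not from any library):

```python
def health_score(p_fail):
    """Convert a probability to fail (0.0-1.0) into a 0-100 health score."""
    return (1.0 - p_fail) * 100.0

health_score(0.25)  # a machine with a 25% chance of failing scores 75.0
```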
The second metric is the time until failure. That is, how many days, weeks, months, hours, minutes or seconds do you have until the asset in question stops working.
There are many different ways to calculate these metrics. Probably the most common way to gauge the probability to fail is a machine learning algorithm such as logistic regression, random forest or gradient boosted trees (note, there are many more).
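As a rough sketch of that first kind of model, here is how a logistic regression could produce a per-record failure probability. The features and failure flag are randomly generated toys, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                                  # toy sensor readings
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)   # toy failure flag
clf = LogisticRegression().fit(X, y)
p_fail = clf.predict_proba(X)[:, 1]                            # probability of failure, one per record
```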
Time to failure models typically rely on some type of survival model. A survival model is a family of techniques based on measuring and predicting expected lifetimes, given certain attributes of an individual or population. For example, will a drug treatment increase or decrease the life of a cancer patient? Or, how much longer will a machine operate if we service it every three months instead of every six months?
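To make "survival model" concrete, here is a bare-bones Kaplan-Meier estimator, the simplest member of that family. This is a from-scratch sketch for illustration only; it is not the technique used later in this article:

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve: S(t) = product over event times of (1 - d_i / n_i)."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(durations >= t)               # still alive just before t
        n_events = np.sum((durations == t) & observed)   # failures exactly at t
        s *= 1.0 - n_events / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# five machines; the third was censored (still running when observation stopped)
times, surv = kaplan_meier([3, 5, 5, 8, 12], [True, True, False, True, True])
# surv -> [0.8, 0.6, 0.3, 0.0]
```

Note how the censored machine still counts as "at risk" before day 5 but never counts as a failure, which is the detail that makes survival techniques different from ordinary regression.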
What if you have a probability to fail and want to convert it into a time to failure? Is this possible?
Of course it is possible. In fact, there are many ways to do it. In this article, I focus on my favorite technique. Not saying it is the best, just my favorite.
You probably won’t see this in a textbook, but my approach has served me well over the years. My technique is pretty easy and guarantees that your time to failure and probability to fail metrics are perfectly in sync.
I should also mention that, although this is an equipment failure problem, this technique has many other applications. For example, the probability to churn and expected customer lifetime. Anytime you need to convert a probability of death to a prediction of lifespan, this should work.
1.0 Getting Set-Up
Install all of the relevant Python Libraries
!pip install --upgrade numpy
!pip install imblearn --upgrade
!pip install plotly --upgrade
!pip install chart-studio --upgrade
Import required libraries
import chart_studio.plotly as py
import plotly.graph_objs as go
import numpy as np
import pandas as pd
import statsmodels.api as sm
import plotly
Import the data from GitHub
#Remove the data if you run this notebook more than once
!rm -f P_HAT_DATA.csv
#import from github
!wget https://raw.githubusercontent.com/shadgriffin/phat_to_ttf/master/P_HAT_DATA.csv

# Convert csv to pandas dataframe
pd_data = pd.read_csv("P_HAT_DATA.csv", sep=",", header=0)
2.0 Data Exploration
pd_data.head()
Our data set is pretty simple. Note that we have panel data. This means we have multiple entities (machines in this case) measured each day for a period of time.
Here is a description of the fields.
ID — ID field that represents a specific machine.
DATE — The date of the observation.
FAILURE_DATE — Day the machine in question failed.
P_FAIL — The probability a machine will fail on a given day, produced by a gradient boosted tree model.
We also need to create another field that indicates the number of days between the day of record and the failure date of the machine. This is the actual time to failure from the historical record.
#convert dates to date format
pd_data['DATE'] = pd.to_datetime(pd_data['DATE'])
pd_data['FAILURE_DATE'] = pd.to_datetime(pd_data['FAILURE_DATE'])

pd_data['C'] = pd_data['FAILURE_DATE'] - pd_data['DATE']
pd_data['TIME_TO_FAILURE'] = pd_data['C'] / np.timedelta64(1, 'D')
pd_data=pd_data[['ID','DATE','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE']]
pd_data.head()
Examine the number of rows and columns. The data has 171,094 rows and 5 columns.
pd_data.shape
There are 421 machines in the data set.
machines = pd.DataFrame(pd_data.groupby(['ID']).agg(['count']))
machines.shape
Check for dups. There are none
df_dupe_check=pd_data.drop_duplicates(subset=['ID','DATE'])
df_dupe_check.shape
Look for null values in the fields — There are none
pd_data.isnull().sum(axis = 0)
3.0 Data Transformations and Feature Engineering
Our data is currently at a daily level. We really don’t need that level of detail for this exercise. In fact, if the data has any non-normal properties (it does, and this is almost always the case), aggregating a bit usually yields better results.
Next, we will convert the daily data to monthly data.
#create fields that represent the month and year
pd_data['MONTH'] = pd.DatetimeIndex(pd_data['DATE']).month
pd_data['YEAR'] = pd.DatetimeIndex(pd_data['DATE']).year

pd_data=pd_data[['ID','MONTH','DATE','YEAR','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE']]
pd_data=pd_data.copy()

# aggregate by machine, month and year
YY=pd_data
tips_summed = pd.DataFrame(YY.groupby(['ID','YEAR','MONTH'])[['P_FAIL', 'TIME_TO_FAILURE']].mean())
tips_summed=tips_summed.sort_values(by=['ID','YEAR','MONTH'], ascending=[True,True,True])

# move ID, YEAR and MONTH back into regular columns
plotter=tips_summed.reset_index()

pd_data=plotter
pd_data.shape
We now have 5,844 records after aggregating.
In case you haven’t guessed already, we are about to build a model that uses the probability to fail to predict the time to failure. Before we do that, however, let’s examine the relationship between the two variables graphically.
A scatter plot with 5,844 records will probably not be that meaningful, so first we will aggregate the data by using P_FAIL to create groupings. This should give us the essence of the relationship between P_FAIL and TIME_TO_FAILURE without a bunch of clutter.
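`pd.qcut` does the heavy lifting for this kind of grouping. On a toy series (hypothetical data), descending labels assign the lowest group numbers to the highest values, mirroring the labeling used below:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 101, dtype=float))    # 100 evenly spread values
buckets = pd.qcut(s, 4, labels=[4, 3, 2, 1])     # quartiles, labeled in descending order
```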
#Create percentiles based on P_FAIL
pd_data['P'] = pd.qcut(pd_data['P_FAIL'], 100, labels=np.arange(100, 0,-1))
pd_data.head()
#aggregate the variables by grouping
YY=pd_data
tips_summed = pd.DataFrame(YY.groupby(['P'])[['P_FAIL', 'TIME_TO_FAILURE']].mean())
tips_summed=tips_summed.sort_values(by=['P'], ascending=[True])
plotter=tips_summed
plotter.head(10)
Create a plot with Plotly
x1 = plotter['P_FAIL']
y1 = plotter['TIME_TO_FAILURE']

trace = go.Scatter(
x = x1,
y = y1,
mode='markers',
name='Time to Failure')

layout = go.Layout(
title='Time to Failure and the Probability to Fail',
xaxis=dict(
title='Probability to Fail',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='Time to Failure',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace]
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='lines')
The relationship between P_FAIL and TIME_TO_FAILURE looks fairly linear. Visually, there appear to be inflection points around 0.13, 0.30, 0.50 and 0.65.
Here are a few transformations that may be useful when we build our model.
pd_data['LN_P_FAIL'] = np.log((pd_data.P_FAIL*100+1))
pd_data['LN_TTF'] = np.log((pd_data.TIME_TO_FAILURE+1))
pd_data['SQ_P_FAIL'] = pd_data['P_FAIL']*pd_data['P_FAIL']
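The +1 inside each log keeps zero values finite, and the matching inverse transform, exp(x) − 1, recovers the original scale. We will rely on that inverse later when converting predictions back to days. A quick round-trip check on toy values:

```python
import numpy as np

ttf = np.array([0.0, 9.0, 99.0])   # toy time-to-failure values in days
ln_ttf = np.log(ttf + 1)           # forward transform, safe at zero days
recovered = np.exp(ln_ttf) - 1     # inverse transform used at scoring time
```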
Also, a few dummy variables based on our chart above.
pd_data['DV_LT_p13'] = np.where(((pd_data.P_FAIL <= 0.13)), 1, 0)
pd_data['DV_GT_p65'] = np.where(((pd_data.P_FAIL >= 0.65)), 1, 0)
pd_data['DV_GT_p30_LT_p50'] = np.where((((pd_data.P_FAIL <= 0.5) & (pd_data.P_FAIL>.3))), 1, 0)
pd_data['DV_GT_p13_LT_p30'] = np.where((((pd_data.P_FAIL <= 0.3) & (pd_data.P_FAIL>.13))), 1, 0)
Create interactive or slope dummies.
pd_data['DV_p13_LN_PFAIL'] = pd_data['DV_LT_p13']*pd_data['LN_P_FAIL']
pd_data['DV_p65_LN_PFAIL'] = pd_data['DV_GT_p65']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p13_LT_p30_LN_PFAIL'] = pd_data['DV_GT_p13_LT_p30']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p30_LT_p50_LN_PFAIL'] = pd_data['DV_GT_p30_LT_p50']*pd_data['LN_P_FAIL']
4.0 Create the Testing and Training Groupings
Because we are dealing with a panel data set (cross-sectional time-series) it is better to not just take a random sample of all records. Doing so would put the records from one machine in both data sets. To avoid this, we’ll need to randomly select IDs and place all of the records for each machine in either the training or testing data set.
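The idea can be verified end-to-end on a toy panel (hypothetical machine IDs); the key invariant is that every ID lands entirely in one group:

```python
import numpy as np
import pandas as pd

# toy panel: 6 machines, 4 daily records each (hypothetical data)
panel = pd.DataFrame({
    "ID": np.repeat(["M1", "M2", "M3", "M4", "M5", "M6"], 4),
    "P_FAIL": np.random.default_rng(33).random(24),
})

# sample half the IDs (not half the rows) into training
train_ids = set(panel["ID"].drop_duplicates().sample(frac=0.5, random_state=33))
panel["MODELING_GROUP"] = np.where(panel["ID"].isin(train_ids), "TRAINING", "TESTING")
```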
#Get a Unique List of All IDs
aa=pd_data
pd_id=aa.drop_duplicates(subset='ID')
pd_id=pd_id[['ID']]
Create a new variable with a random number between 0 and 1
np.random.seed(33)
pd_id['wookie'] = (np.random.randint(0, 10000, pd_id.shape[0]))/10000
pd_id=pd_id[['ID', 'wookie']]
Give each record a 50% chance of being in the testing and a 50% chance of being in the training data set.
pd_id['MODELING_GROUP'] = np.where(((pd_id.wookie <= 0.50)), 'TRAINING', 'TESTING')
This is how many machines fall in each group.
tips_summed = pd_id.groupby(['MODELING_GROUP'])['wookie'].count()
tips_summed
Append the Group of each id to each individual record.
pd_data=pd_data.sort_values(by=['ID'], ascending=[True])
pd_id=pd_id.sort_values(by=['ID'], ascending=[True])

pd_data =pd_data.merge(pd_id, on=['ID'], how='inner')
This is how many records are in each group.
tips_summed = pd_data.groupby(['MODELING_GROUP'])['wookie'].count()
tips_summed
Create a separate data frame for the training data. We will use this data set to build the model.
df_training=pd_data[pd_data['MODELING_GROUP'] == 'TRAINING']
df_training=df_training.drop(columns=['P'])
Create a separate data frame for the testing data set. We will use this to validate our modeling results.
df_test=pd_data[pd_data['MODELING_GROUP'] != 'TRAINING']
df_test=df_test.drop(columns=['P'])
5.0 Build an OLS Regression Model.
Select the variables that are theoretically relevant and statistically significant.
independentx = df_training[['LN_P_FAIL','DV_LT_p13','DV_GT_p65','DV_p13_LN_PFAIL','DV_GT_p13_LT_p30_LN_PFAIL',
'DV_GT_p13_LT_p30','DV_GT_p30_LT_p50','DV_GT_p30_LT_p50_LN_PFAIL']]
independent = sm.add_constant(independentx, prepend=False)
dependent=df_training['LN_TTF']

#build and fit the model
mod = sm.OLS(dependent, independent)
res = mod.fit()

# Examine the summary table
print(res.summary())
Add a prediction of “Time to Failure” in the original data set.
df_ypred = pd.DataFrame(res.predict(independent))
df_ypred.columns = ['P_LN_TTF']
df_ypred['P_TTF']=(np.exp(df_ypred['P_LN_TTF'])-1)

df_training = pd.concat([df_training, df_ypred], axis=1)
plottery=df_training
Compare the predictions to actuals graphically.
x1 = plottery['P_FAIL']
y1 = plottery['P_TTF']
x2 = plottery['P_FAIL']
y2 = plottery['TIME_TO_FAILURE']

trace = go.Scatter(
x = x1,
y = y1,
mode='markers',
name='Predicted TTF')

trace2 = go.Scatter(
x = x2,
y = y2,
mode='markers',
name='Actual TTF')

layout = go.Layout(
title='Time to Failure and the Probability to Fail',
xaxis=dict(
title='Probability to Fail',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='Time to Failure',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace,trace2]
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
6.0 Apply the Model to the Testing Data.
#define the variables
independentx = df_test[['LN_P_FAIL','DV_LT_p13','DV_GT_p65','DV_p13_LN_PFAIL','DV_GT_p13_LT_p30_LN_PFAIL',
'DV_GT_p13_LT_p30','DV_GT_p30_LT_p50','DV_GT_p30_LT_p50_LN_PFAIL']]
independent = sm.add_constant(independentx, prepend=False)

#score the test set
df_ypred = pd.DataFrame(res.predict(independent))
df_ypred.columns = ['P_LN_TTF']
df_ypred['P_TTF']=(np.exp(df_ypred['P_LN_TTF'])-1)

#append y-hat
df_test= pd.concat([df_test, df_ypred], axis=1)
plotterx=df_test
After applying the model to the testing data set, examine the actual and predicted graphically.
x1 = plotterx['P_FAIL']
y1 = plotterx['P_TTF']
x2 = plotterx['P_FAIL']
y2 = plotterx['TIME_TO_FAILURE']

trace = go.Scatter(
x = x1,
y = y1,
mode='markers',
name='Predicted TTF')

trace2 = go.Scatter(
x = x2,
y = y2,
mode='markers',
name='Actual TTF')

layout = go.Layout(
title='Time to Failure and the Probability to Fail',
xaxis=dict(
title='Probability to Fail',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='Time to Failure',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace,trace2]
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
The plot for the testing data looks similar to the training data set, but this doesn’t give us much insight. Let’s look at the R-squared of the testing data and compare it to the training data. Remember that the Pearson correlation between a predicted variable and an actual variable gives you R; squaring R gives you R-squared.
df_corr=df_training[['P_LN_TTF','LN_TTF']]
df_corr.corr(method='pearson')
Squaring R gives us an R-squared of 0.224 for the training data.
Let’s do the same for the Testing Data Set.
df_corr=df_test[['P_LN_TTF','LN_TTF']]
df_corr.corr(method='pearson')
The testing data has an R-squared of 0.208, very similar to the training data.
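The R-to-R-squared step is easy to sanity-check on a toy pair of series (hypothetical values):

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

r = np.corrcoef(predicted, actual)[0, 1]  # Pearson R between predicted and actual
r_squared = r ** 2                        # squaring R gives R-squared
```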
7.0 Apply the Model to the Original Daily Data
# Convert csv to pandas dataframe
pd_data = pd.read_csv("P_HAT_DATA.csv", sep=",", header=0)
pd_data=pd_data.copy()

#convert dates to date format
pd_data['DATE'] = pd.to_datetime(pd_data['DATE'])
pd_data['FAILURE_DATE'] = pd.to_datetime(pd_data['FAILURE_DATE'])

pd_data['C'] = pd_data['FAILURE_DATE'] - pd_data['DATE']
pd_data['TIME_TO_FAILURE'] = pd_data['C'] / np.timedelta64(1, 'D')
pd_data=pd_data[['ID','DATE','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE']]

pd_data['LN_P_FAIL'] = np.log((pd_data.P_FAIL*100+1))

pd_data['DV_LT_p13'] = np.where(((pd_data.P_FAIL <= 0.13)), 1, 0)
pd_data['DV_GT_p65'] = np.where(((pd_data.P_FAIL >= 0.65)), 1, 0)
pd_data['DV_GT_p30_LT_p50'] = np.where((((pd_data.P_FAIL <= 0.5) & (pd_data.P_FAIL>.3))), 1, 0)
pd_data['DV_GT_p13_LT_p30'] = np.where((((pd_data.P_FAIL <= 0.3) & (pd_data.P_FAIL>.13))), 1, 0)
pd_data['DV_p13_LN_PFAIL'] = pd_data['DV_LT_p13']*pd_data['LN_P_FAIL']
pd_data['DV_p65_LN_PFAIL'] = pd_data['DV_GT_p65']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p13_LT_p30_LN_PFAIL'] = pd_data['DV_GT_p13_LT_p30']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p30_LT_p50_LN_PFAIL'] = pd_data['DV_GT_p30_LT_p50']*pd_data['LN_P_FAIL']

independentx = pd_data[['LN_P_FAIL','DV_LT_p13','DV_GT_p65','DV_p13_LN_PFAIL','DV_GT_p13_LT_p30_LN_PFAIL',
'DV_GT_p13_LT_p30','DV_GT_p30_LT_p50','DV_GT_p30_LT_p50_LN_PFAIL']]
independent = sm.add_constant(independentx, prepend=False)

#score the full daily data with the trained model
df_ypred = pd.DataFrame(res.predict(independent))
df_ypred.columns = ['P_LN_TTF']
df_ypred['P_TTF']=(np.exp(df_ypred['P_LN_TTF'])-1)

df_final= pd.concat([pd_data, df_ypred], axis=1)
df_final=df_final[['ID','DATE','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE','P_TTF']]
df_final.head(150)