Converting a Probability to Fail Into a Time to Failure Metric

Getting the most out of your Probability to Fail Models

Shad Griffin
The Startup
9 min read · Aug 4, 2020


With predictive maintenance problems, there are two common metrics that represent the health of your asset.

The first is a probability to fail. That is, at a given moment in time, what is the probability that your machine will fail? Sometimes this is represented by a health score. Typically, the health score is one minus the probability to fail, multiplied by 100; a probability to fail of 0.20, for example, corresponds to a health score of 80.

The second metric is the time until failure. That is, how many days, weeks, months, hours, minutes or seconds do you have until the asset in question stops working.

There are many different ways to calculate these metrics. Probably the most common way to gauge the probability to fail is with a machine learning algorithm like logistic regression, random forest or gradient boosted trees (note that there are many more).

Time to failure models typically rely on some type of survival model. A survival model is a family of techniques based on measuring and predicting expected lifetimes, given certain attributes of an individual or population. For example, will a drug treatment increase or decrease the life of a cancer patient? Or, how much longer will a machine operate if we service it every three months instead of every six months?

What if you have a probability to fail and want to convert it into a time to failure? Is this possible?

Of course it is possible. In fact, there are many ways to do it. In this article, I focus on my favorite technique. Not saying it is the best, just my favorite.

You probably won’t see this in a textbook, but my approach has served me well over the years. My technique is pretty easy and guarantees that your time to failure and probability to fail metrics are perfectly in sync.

I should also mention that, although this is an equipment failure problem, this technique has many other applications. For example, converting a probability to churn into an expected customer lifetime. Anytime you need to convert a probability of death into a prediction of lifespan, this should work.

1.0 Getting Set-Up

Install all of the relevant Python Libraries
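The original install cell isn't shown, so treat the exact package list as an assumption; something along these lines covers the libraries used later in the walkthrough:

```
pip install pandas numpy plotly statsmodels
```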

Import required libraries
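A minimal set of imports for the steps below:

```python
import numpy as np
import pandas as pd
import plotly.express as px
import statsmodels.api as sm
```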

Import the data from GitHub
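A sketch of the load step. The URL is a placeholder for the raw GitHub link to the article's CSV, and the column names follow the field list in the next section:

```python
# Placeholder URL: substitute the raw GitHub link to the actual file
url = "https://raw.githubusercontent.com/<user>/<repo>/master/machine_data.csv"

df = pd.read_csv(url, parse_dates=["DATE", "FAILURE_DATE"])
df.head()
```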

2.0 Data Exploration

Our data set is pretty simple. Note that we have panel data. This means we have multiple entities (machines in this case) measured each day for a period of time.

Here is a description of the fields.

ID — ID field that represents a specific machine.

DATE — The date of the observation.

FAILURE_DATE — Day the machine in question failed.

P_FAIL — The probability a machine will fail on a given day, produced by a gradient boosted tree model.

We also need to create another field that indicates the number of days between the day of record and the failure date of the machine. This is the actual time to failure from the historical record.
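Assuming DATE and FAILURE_DATE are parsed as datetimes, a one-line sketch:

```python
# Days between the observation date and the machine's eventual failure date
df["TIME_TO_FAILURE"] = (df["FAILURE_DATE"] - df["DATE"]).dt.days
```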

Examine the number of rows and columns. The data has 171,094 rows and 5 columns.

There are 421 machines in the data set.

Check for duplicates. There are none.

Look for null values in the fields. There are none.
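A sketch of those checks, assuming the column names above:

```python
# Rows and columns (the article reports 171,094 rows and 5 columns)
print(df.shape)

# Number of distinct machines (the article reports 421)
print(df["ID"].nunique())

# Duplicate ID/DATE combinations
print(df.duplicated(subset=["ID", "DATE"]).sum())

# Null values in each field
print(df.isnull().sum())
```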

3.0 Data Transformations and Feature Engineering

Our data is currently at a daily level. We really don’t need that level of detail for this exercise. In fact, if the data has any non-normal properties (it does, and this is almost always the case), aggregating a bit usually yields better results.

Next, we will convert the daily data to monthly data.
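The exact aggregation isn't shown in the article; one reasonable sketch collapses the panel to one row per machine per month and takes monthly means:

```python
# Collapse the daily panel to one row per machine per month
df["MONTH"] = df["DATE"].dt.to_period("M").dt.to_timestamp()

monthly = (
    df.groupby(["ID", "MONTH"], as_index=False)
      .agg(P_FAIL=("P_FAIL", "mean"),
           TIME_TO_FAILURE=("TIME_TO_FAILURE", "mean"))
)
print(len(monthly))
```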

We now have 5,844 records after aggregating.

In case you haven’t guessed already, we are about to build a model that uses the probability to fail to predict the time to failure. Before we do that, however, let’s examine the relationship between the two variables graphically.

A scatter plot with 5,844 records will probably not be that meaningful, so first we will aggregate the data by using P_FAIL to create groupings. This should give us the essence of the relationship between P_FAIL and TIME_TO_FAILURE without a bunch of clutter.
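The grouping rule isn't spelled out in the article; rounding P_FAIL to two decimals and averaging TIME_TO_FAILURE within each bucket is one simple way to do it:

```python
# Bucket the monthly records by rounded P_FAIL and average the
# time to failure within each bucket
monthly["P_FAIL_BIN"] = monthly["P_FAIL"].round(2)
binned = monthly.groupby("P_FAIL_BIN", as_index=False)["TIME_TO_FAILURE"].mean()
```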

Create a plot with Plotly
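A minimal Plotly sketch of the binned relationship:

```python
fig = px.scatter(binned, x="P_FAIL_BIN", y="TIME_TO_FAILURE",
                 title="P_FAIL vs. TIME_TO_FAILURE (binned)")
fig.show()
```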

The relationship between P_FAIL and TIME_TO_FAILURE looks fairly linear. Visually, there do appear to be inflection points around .13, .30, .50 and .65.

Here are a few transformations that may be useful when we build our model.

Also, a few dummy variables based on our chart above.

Create interactive or slope dummies.
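The specific transformations are not shown, so the choices below are illustrative: a squared term, step dummies at the visual inflection points, and interactive ("slope") dummies, each step dummy multiplied by P_FAIL so the slope can change past each point.

```python
# A simple transformation of P_FAIL
monthly["P_FAIL_SQ"] = monthly["P_FAIL"] ** 2

# Step dummies at the visual inflection points, plus interactive
# ("slope") dummies that let the slope change beyond each point
for cut in [0.13, 0.30, 0.50, 0.65]:
    name = str(int(round(cut * 100)))
    monthly["D_" + name] = (monthly["P_FAIL"] > cut).astype(int)
    monthly["S_" + name] = monthly["D_" + name] * monthly["P_FAIL"]
```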

4.0 Create the Testing and Training Groupings

Because we are dealing with a panel data set (cross-sectional time-series), it is better not to take a simple random sample of all records. Doing so could put records from the same machine in both data sets. To avoid this, we’ll randomly assign IDs and place all of the records for each machine in either the training or the testing data set. The sketch after the steps below shows one way to do it.

Create a new variable with a random number between 0 and 1

Give each machine a 50% chance of being in the testing and a 50% chance of being in the training data set.

This is how many machines fall in each group.

Append the group of each ID to each individual record.

This is how many records are in each group.

Create a separate data frame for the training data. We will use this data set to build the model.

Create a separate data frame for the testing data set. We will use this to validate our modeling results.
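Pulling the steps above together; the random seed, the 50/50 rule and the frame names (train, test) are assumptions for illustration:

```python
# Assign each machine (not each record) to a group at random
rng = np.random.default_rng(42)
ids = pd.DataFrame({"ID": monthly["ID"].unique()})
ids["RANDOM"] = rng.random(len(ids))
ids["GROUP"] = np.where(ids["RANDOM"] < 0.5, "TRAIN", "TEST")

# Machines per group
print(ids["GROUP"].value_counts())

# Append each ID's group to every record
monthly = monthly.merge(ids[["ID", "GROUP"]], on="ID", how="left")

# Records per group
print(monthly["GROUP"].value_counts())

# Separate training and testing frames
train = monthly[monthly["GROUP"] == "TRAIN"].copy()
test = monthly[monthly["GROUP"] == "TEST"].copy()
```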

5.0 Build an OLS Regression Model.

Select the variables that are theoretically relevant and statistically significant.

Add a prediction of “Time to Failure” in the original data set.

Compare the predictions to actuals graphically.
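A statsmodels sketch of the fit, the prediction and the comparison plot. The feature list is illustrative, not the author's final specification; in practice you would keep whichever transformed terms and dummies are theoretically sensible and significant in the summary output:

```python
features = ["P_FAIL", "P_FAIL_SQ", "D_30", "S_50"]  # illustrative choice

# Fit OLS on the training data
X_train = sm.add_constant(train[features])
ols = sm.OLS(train["TIME_TO_FAILURE"], X_train).fit()
print(ols.summary())

# Predicted time to failure for the training records
train["TTF_PRED"] = ols.predict(X_train)

# Compare predictions to actuals
fig = px.scatter(train, x="TTF_PRED", y="TIME_TO_FAILURE",
                 title="Training data: predicted vs. actual time to failure")
fig.show()
```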

6.0 Apply the Model to the Testing Data.

After applying the model to the testing data set, examine the actual and predicted graphically.
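Carrying over the names from the sketch above, scoring and plotting the testing records might look like this:

```python
# Score the testing records with the fitted model
X_test = sm.add_constant(test[features])
test["TTF_PRED"] = ols.predict(X_test)

fig = px.scatter(test, x="TTF_PRED", y="TIME_TO_FAILURE",
                 title="Testing data: predicted vs. actual time to failure")
fig.show()
```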

The plot for the testing data looks similar to the one for the modeling data set, but this doesn’t give us much insight. Let’s look at the R-squared of the testing data and compare it to the training data. Remember that the Pearson correlation between a predicted variable and an actual variable gives you R; squaring R gives you R-squared.
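For the training data, using the assumed column names from the sketches above:

```python
# Pearson correlation between actual and predicted, then squared
r = train["TIME_TO_FAILURE"].corr(train["TTF_PRED"])
print(r ** 2)
```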

Squaring R gives us an R-squared of .224 for the training data.

Let’s do the same for the Testing Data Set.
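The same calculation on the testing frame:

```python
r = test["TIME_TO_FAILURE"].corr(test["TTF_PRED"])
print(r ** 2)
```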

The testing data has an R-squared of .208, very similar to the training data.

7.0 Apply the Model to the Original Daily Data

The final step is to apply the model to the original daily data, so that each day’s probability to fail has a matching time to failure and the two metrics stay in sync.
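A sketch of that scoring step, reusing the fitted model and feature names from the earlier sketches; the daily frame needs the same engineered features before it can be scored:

```python
# Recreate the engineered features on the daily frame
daily = df.copy()
daily["P_FAIL_SQ"] = daily["P_FAIL"] ** 2
for cut in [0.13, 0.30, 0.50, 0.65]:
    name = str(int(round(cut * 100)))
    daily["D_" + name] = (daily["P_FAIL"] > cut).astype(int)
    daily["S_" + name] = daily["D_" + name] * daily["P_FAIL"]

# Score every daily record with the fitted OLS model
X_daily = sm.add_constant(daily[features])
daily["TTF_PRED"] = ols.predict(X_daily)
daily[["ID", "DATE", "P_FAIL", "TTF_PRED"]].head()
```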

