All you want to know about determining remaining useful life (RUL) of industrial equipment

11 min read · Jun 15, 2023

Artificial intelligence and machine learning are applied in heavy industry less often than in other areas of economic activity such as banking, retail, and telecom. However, modern industrial facilities generate and collect large amounts of data, and machine learning methods can ensure that these data are efficiently used to perform various routine tasks, e.g., identifying faults and failures, predicting product quality or determining the remaining useful life of equipment.

While the focus here is on machine learning and data analysis, it should be noted that the problem can also be posed and solved in the classical theory of dependability. Now let us get to the problem at hand.

Technical diagnostics

According to international and Russian national standards, the process of technical diagnostics includes four steps: (1) detection of anomalies in operation or malfunctions, (2) localization of malfunctions, or identification of specific signals that contribute most to the detection of malfunctions, (3) diagnosing, i.e., establishing the root causes of the identified malfunctions, and (4) prediction of the malfunction development, or RUL estimation for the diagnosed equipment. If all these tasks are performed adequately, thanks to the effective implementation of data analysis methods, the equipment operator can switch over to the condition-based maintenance strategy. The flowchart below shows a typical cycle of equipment diagnostics.

The outcomes of the first three steps above may either be used in the RUL estimation or dismissed, depending on the selected approach to technical diagnostics.

Remaining useful life estimation problem

International standards on condition monitoring and diagnostics of machines define RUL as the remaining time before system health falls below a defined failure threshold (or before the system passes into a state in which it needs to be repaired or replaced).

This definition can be illustrated with the figure below.

This means that, at each point in time, one can estimate the time left until a critical state, indicated by the red dotted line. The time can be expressed in days, cycles, runs, casts, or some other units, proceeding from the problem statement and available data.

Solving the RUL estimation problem also helps identify the factors that reduce RUL (i.e., those that move the red dotted line in the example above to the left and shorten the distance to it), so the equipment operator can eliminate their undesirable impact now and in the future. The mathematical model of the RUL estimation problem may also factor in the occurrence of anomalies, e.g., as the time the equipment has operated in an anomalous state (not to be confused with an inoperable state, because equipment with minor anomalies can sometimes work for years).

On a side note, it is also important to understand the terminology, as several synonyms for RUL appear in the literature, all meaning roughly the same thing. These are:

remaining useful life (RUL),
time to failure (TTF),
residual operation time,
residual useful life,
remaining lifetime.

Now then, why is the RUL estimation problem important?

If process personnel, engineers, operators, repair teams, and technical diagnostics teams know the RUL, they can perform better in many ways:

plan M&R activities more efficiently,
improve the equipment maintenance strategy (replace large-scale repairs with smaller ones, reduce the number of activities and manipulations with equipment, etc.),
optimize the operating modes and equipment load,
reduce the number and duration of unscheduled shutdowns or avoid them altogether.

That is why the RUL estimation is a fundamental problem in technical diagnostics and an indispensable prerequisite for the transition to the condition-based maintenance strategy.

Data

When facing the RUL estimation problem, the data at hand determine which approaches and methods are applicable. Hence, it is necessary to categorize the potentially available data first, so that, when presenting the methods, we can immediately refer to the required data categories. These are:

data on equipment operation, i.e., process parameter values, sensor signals for the entire period of operation from the moment of start-up to the moment of failure,

data on operation time to failure (i.e., the duration of operation before the failure occurred),

information on thresholds (permissible values) for individual signals or health indicators that indicate failure when reached.

Approaches to solving the RUL estimation problem

1. Statistical evaluation

In the statistical evaluation approach, a time-to-failure distribution function is built from historical data to estimate the RUL of equipment; refer to the figure below.

*Determining RUL by statistical evaluation*

This is one of the simplest methods, requiring only a set of data on operation time to failure. The survival function (or survival model) can be calculated as 1 - CDF, where CDF is the cumulative distribution function of the time to failure. Adding indirect data on equipment operation can improve the method's performance, e.g., by identifying different modes and building a distribution function (degradation rate) for each mode.
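As a rough illustration, the survival function and a simple RUL estimate can be computed directly from historical times to failure. This is a minimal sketch: the function names and the sample data are hypothetical, and the RUL here is taken as the mean remaining life of the historical units that outlived the current age, which is only one of several possible estimators.

```python
from bisect import bisect_right


def survival(ttf_history, t):
    """Empirical survival function S(t) = 1 - CDF(t):
    the share of historical units that lasted longer than t."""
    ordered = sorted(ttf_history)
    failed_by_t = bisect_right(ordered, t)  # units with time to failure <= t
    return 1.0 - failed_by_t / len(ordered)


def expected_rul(ttf_history, age):
    """Mean remaining life among historical units that survived beyond `age`."""
    survivors = [ttf for ttf in ttf_history if ttf > age]
    if not survivors:
        return 0.0  # no historical unit lasted this long
    return sum(survivors) / len(survivors) - age


# Hypothetical operation times to failure, in days
ttf_days = [100, 120, 150, 180, 200]

s_130 = survival(ttf_days, 130)        # share of units still running after 130 days
rul_130 = expected_rul(ttf_days, 130)  # expected days left for a unit aged 130 days
```

To account for operating modes, the same functions could be applied to a separate list of times to failure for each mode.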

2. Parameter prediction

In the parameter prediction method, the RUL estimate is obtained by predicting parameter values until the threshold is reached. This approach is also called the degradation model approach, and it uses two main laws of degradation:

Linear degradation: the prediction is a straight line whose slope is determined by historical data; it is usually applied if the system does not accumulate damage (degradation).
Exponential degradation: the prediction is an exponential curve; it is usually applied if the system accumulates damage cumulatively.

*Determining RUL by parameter prediction*

In this case, there are two options:

1. to predict sensor signals,

2. to predict a health indicator.

Both options require data on equipment operation (i.e., process parameters, sensor signals), but for the second one, it is the health indicator that is calculated on the basis of available data and then predicted. The health indicator can be a principal component as in PCA, a result of aggregation of various indicators, a discrepancy between the normal operation model and real data, etc. It incorporates as much information as possible, drawing on far more than only one signal.
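One of the health indicator variants mentioned above, the discrepancy between a normal operation model and real data, can be sketched as follows. The "normal operation model" here is deliberately simple (per-signal mean and standard deviation over a period assumed to be healthy), and the indicator is the root mean square of per-signal z-scores. The signal values, function names, and drift pattern are all hypothetical.

```python
import numpy as np


def fit_baseline(healthy):
    """Per-signal mean and standard deviation over a period known to be healthy."""
    return healthy.mean(axis=0), healthy.std(axis=0)


def health_indicator(signals, mean, std):
    """Aggregate all signals into one number per time step:
    RMS of z-scores, close to 1 in normal operation and growing with deviation."""
    z = (signals - mean) / std
    return np.sqrt((z ** 2).mean(axis=1))


rng = np.random.default_rng(0)
loc = np.array([50.0, 3.0, 0.2])     # hypothetical nominal values of 3 signals
scale = np.array([1.0, 0.1, 0.02])   # hypothetical noise levels

healthy = rng.normal(loc=loc, scale=scale, size=(500, 3))
# Degrading period: every signal drifts away from nominal by up to 5 noise levels
drift = np.linspace(0.0, 5.0, 200)[:, None] * scale
degrading = rng.normal(loc=loc, scale=scale, size=(200, 3)) + drift

mean, std = fit_baseline(healthy)
hi_healthy = health_indicator(healthy, mean, std)
hi_degrading = health_indicator(degrading, mean, std)
```

The resulting one-dimensional series can then be predicted in the same way as an individual sensor signal.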

Thresholds are also necessary, although with enough statistics they can be calculated from the operation time to failure and the data on equipment operation. The parameters can be predicted using different methods, some of which are discussed in detail in the ODS lesson here (note that the code is available).
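The two degradation laws can be sketched with an ordinary least-squares fit: directly on the signal for the linear law, and on the logarithm of the signal for the exponential law. The sketch assumes a parameter that grows toward an upper threshold; the function names and data are hypothetical, and real signals would need smoothing and uncertainty estimates.

```python
import math


def fit_line(t, y):
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def rul_linear(t, y, threshold):
    """Time from the last observation until the fitted line reaches the threshold."""
    intercept, slope = fit_line(t, y)
    if slope <= 0:
        return math.inf  # no upward trend, threshold is never reached
    return max((threshold - intercept) / slope - t[-1], 0.0)


def rul_exponential(t, y, threshold):
    """Same idea for y = a * exp(b * t), fitted as a line on log(y).
    Requires positive values and a positive threshold."""
    log_a, b = fit_line(t, [math.log(v) for v in y])
    if b <= 0:
        return math.inf
    return max((math.log(threshold) - log_a) / b - t[-1], 0.0)


days = list(range(10))
wear_linear = [1.0 + 0.5 * d for d in days]         # hypothetical linear wear
wear_exp = [0.5 * math.exp(0.2 * d) for d in days]  # hypothetical exponential wear

rul_lin = rul_linear(days, wear_linear, threshold=10.0)
rul_exp = rul_exponential(days, wear_exp, threshold=10.0)
```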

3. Regression model

In this case, we reduce the problem to the classical regression formulation. To do this, features are extracted from time series (process parameters or health indices) using, for example, the tsfresh library. The feature extraction is shown in the figure below.

As a result, we obtain a sample of features X, and we also need a sample of responses (time to failure) y. Thus, this approach requires data on equipment operation (process parameters, sensor signals) and data on operation time to failure, and the problem can be solved as a classical regression problem on tabular data using any state-of-the-art method (typically an ensemble).
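The reduction to tabular regression can be sketched as follows. To keep the example self-contained, a few hand-written window statistics stand in for tsfresh features and a plain least-squares linear model stands in for an ensemble; both substitutions, as well as the simulated run-to-failure signals, are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)


def simulate_run(length):
    """Hypothetical run-to-failure signal: rises by 0.01 per step
    and reaches about 1.0 at the moment of failure."""
    return np.linspace(1.0 - 0.01 * (length - 1), 1.0, length) + rng.normal(0, 0.01, length)


def extract_features(window):
    """Reduce a 1-D window of the signal to a small feature vector."""
    slope = np.polyfit(np.arange(len(window)), window, 1)[0]
    return np.array([window.mean(), window.std(), slope, window[-1]])


def build_dataset(runs, window_size):
    """One row per window; the target is the number of steps left until failure."""
    X, y = [], []
    for run in runs:
        for end in range(window_size, len(run) + 1):
            X.append(extract_features(run[end - window_size:end]))
            y.append(len(run) - end)
    return np.array(X), np.array(y, dtype=float)


runs = [simulate_run(n) for n in (80, 100, 120, 90)]
X_train, y_train = build_dataset(runs[:3], window_size=10)
X_test, y_test = build_dataset(runs[3:], window_size=10)

# Linear least squares with an intercept column (stand-in for an ensemble model)
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

pred = np.column_stack([X_test, np.ones(len(X_test))]) @ coef
mae = np.abs(pred - y_test).mean()  # mean absolute error, in time steps
```

Note that the split is made by run rather than by row, so that windows from the same run never appear in both the training and the test sample.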

4. Similarity with patterns from previous periods

Another very common approach to RUL estimation, also known as the similarity model, is comparing the current operation or condition with historical data. To do that, we can truncate previous operation periods at the same elapsed time as the current period.

*Determining RUL based on similarity with previous patterns*

There are two main options to implement the similarity model:

Direct comparison of time series using proximity metrics such as Dynamic Time Warping (DTW) or proximity-based clustering / classification methods. Examples with code can be found here; ready-to-use libraries such as tslearn are also available.
Extraction of features from a time series followed by comparison of the resulting feature vectors (using proximity metrics or clustering).

The resulting RUL estimate is the remaining life of the most similar operation period from the history, or the average (or any other aggregation) over a group / cluster of similar operation periods. The similarity model requires data on equipment operation and data on operation time to failure.
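The first option, direct comparison with DTW, can be sketched in plain Python. The DTW routine below is the textbook dynamic-programming version with an absolute-difference cost (a library such as tslearn would normally be used instead), and `similarity_rul`, the historical runs, and the current series are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def similarity_rul(current, history, k=1):
    """Truncate each historical run-to-failure series at the current elapsed time,
    rank the runs by DTW distance, and average the remaining life of the k nearest."""
    elapsed = len(current)
    candidates = []
    for run in history:
        if len(run) < elapsed:
            continue  # this run failed before reaching the current age
        candidates.append((dtw_distance(current, run[:elapsed]), len(run) - elapsed))
    if not candidates:
        raise ValueError("no historical run is long enough to compare with")
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(remaining for _, remaining in nearest) / len(nearest)


# Hypothetical history: three runs with fast, medium, and slow degradation
history = [
    [0.10 * i for i in range(20)],
    [0.05 * i for i in range(40)],
    [0.02 * i for i in range(100)],
]
current = [0.05 * i + 0.001 for i in range(10)]  # resembles the medium run

rul_steps = similarity_rul(current, history, k=1)
```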

Identification of factors affecting wear

As mentioned above, an important task that accompanies RUL estimation is identifying the factors that affect the RUL, primarily those that reduce it by increasing wear. Such factors are, first of all, specific signals that localize a malfunction leading to an abnormal state and equipment wear. This information can be passed on to personnel to indicate, for example, undesirable modes of operation. Then, together with domain experts, the equipment operator can analyze what causes certain signals to deviate from normal values and what equipment degradation is probable given these indications.

Here it is necessary to distinguish two concepts:

1. Features important for the model as a whole: factors that have the greatest influence on the result of RUL estimation (the fundamental features of the model).

2. Contribution of features to a given model output: factors that have had the greatest impact on (i.e., explain) the current RUL estimate.

This means that, in terms of the first concept, the wear-affecting factors are those that reduce the model’s forecast in general, while in terms of the second concept, they are those that have driven down the current forecast. Libraries such as SHAP can produce both kinds of factors for machine learning models.
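The difference between the two concepts is easiest to see on a linear model, where both can be computed by hand. In the sketch below, global importance is taken as the absolute coefficient scaled by the feature's spread, and the local contribution of a feature is its coefficient times the deviation of the current value from the average; for a linear model with independent features this is also how SHAP-style attributions behave. The feature names, data, and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
feature_names = ["steel_temp", "water_temp", "casting_speed"]  # hypothetical features

X = rng.normal(size=(400, 3))
# Hypothetical relationship: RUL drops strongly with the first feature,
# mildly with the second, and does not depend on the third
y = 100.0 - 8.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Fit a linear RUL model with an intercept and keep the feature coefficients
A = np.column_stack([X, np.ones(len(X))])
coef = np.linalg.lstsq(A, y, rcond=None)[0][:3]

# Concept 1: features important for the model as a whole
global_importance = np.abs(coef) * X.std(axis=0)

# Concept 2: contribution of each feature to one particular RUL estimate
x_now = np.array([2.0, -0.5, 1.0])  # hypothetical current operating point
local_contribution = coef * (x_now - X.mean(axis=0))

for name, g, c in zip(feature_names, global_importance, local_contribution):
    print(f"{name}: global importance {g:.2f}, contribution to current RUL {c:+.2f}")
```

A negative local contribution marks a feature that is currently pulling the RUL estimate below the average, i.e., a candidate wear-affecting factor for this particular moment.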

For each of the approaches above, the identification of factors is carried out in its own way:

Statistical evaluation: identification is only possible with additional indirect data, e.g., different slopes of the distribution curve (i.e., degradation rate) for different modes of operation.
Parameter prediction: the signals that passed the thresholds earlier than others should be selected as factors affecting equipment wear.
Regression models: the wear-affecting factors can be identified using the feature importance of machine learning models, SHAP, and other methods for assessing feature importance and explaining model outputs.
Similarity with patterns from previous periods: if machine learning models are built, the factors are identified in the same way as for regression models above; otherwise, the wear-affecting factors are those that manifested themselves before the equipment failure in a previous operating cycle / run that is similar to the current one. Such information may be obtained by the technical diagnostics team.
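For the parameter prediction approach, ranking signals by the moment they first cross their thresholds takes only a few lines. This is a minimal sketch assuming upper thresholds; the signal names, values, and thresholds are hypothetical.

```python
def first_crossing(values, threshold):
    """Index of the first sample at or above the threshold, or None if never reached."""
    for i, v in enumerate(values):
        if v >= threshold:
            return i
    return None


def rank_by_crossing(signals, thresholds):
    """Names of signals that crossed their thresholds, earliest crossing first."""
    crossings = {name: first_crossing(vals, thresholds[name]) for name, vals in signals.items()}
    crossed = {name: idx for name, idx in crossings.items() if idx is not None}
    return sorted(crossed, key=crossed.get)


# Hypothetical predicted signals and their permissible values
signals = {
    "vibration": [1, 2, 3, 6, 7],
    "bearing_temp": [60, 61, 62, 63, 90],
    "current": [10, 10, 11, 11, 11],
}
thresholds = {"vibration": 5, "bearing_temp": 85, "current": 15}

wear_factors = rank_by_crossing(signals, thresholds)
```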

Problem-solving cases

By way of illustration, let us consider three cases of solving the RUL estimation problem.

Case 1 — RUL of continuous-casting machine sleeves

A continuous-casting machine (CCM) is a unit that converts liquid steel into a solid billet with a given cross-section, which is then rolled into various products, such as reinforcement bars.

The most critical and fastest-wearing part of a CCM is the mold sleeve. It is a water-cooled round or section-shaped pipe made of copper. Molten metal in contact with the sleeve walls crystallizes and thus forms the primary hard shell of the billet.

The main problem with the sleeves is that defects appear on their surface and the profile of the sleeve mouth gets distorted. As a result, the thermal conditions are disrupted, which affects the quality of the produced billets: shape irregularities (e.g., unequal diagonals in square ingots, rhombic form), wrong dimensions of the billet sides, and cracks in the billet corners can occur. These defects cause problems at the subsequent rolling stage, so the quality of rolled products decreases and the number of defects grows, which adversely affects the economics of production.

The sleeve dimensions are measured along the entire length at certain time intervals. If the dimensions deviate from the required parameters, the sleeve is rejected.

A shorter RUL of the copper sleeves used in the process may be associated with changes in the parameters of CCM operation (e.g., incoming steel temperature, cooling water temperature, etc.), and hence these features were also included in the model. The model is built to estimate the RUL measured in tonnes or in remaining melts.

Case 2 — RUL of power transformers

A large number of transformers are over 25 years old. This makes the task of early fault detection even more urgent, since maintenance and repair require efficient scheduling to reduce costs. As we now know, solving the RUL estimation problem is the most important aspect of correct maintenance planning, especially given that the age of equipment often exceeds the established limits (do not be alarmed, though: the service life is extended only after a thorough diagnosis).

To learn more about solving the anomaly detection problem for transformers, refer to this paper.

The initial data were the results of CADG (chromatographic analysis of dissolved gases). The concentrations of four gases (H2, CO, C2H4, C2H2) in transformer oil were measured every 12 hours, which gave the data on equipment operation and the data on operation time to failure (duration of runs). The trained model achieved a mean absolute error of 27 days.

Case 3 — RUL of exhausters in metallurgy

An exhauster is a centrifugal blower that sucks air through a batch layer lying on the grate of the sintering machine. Exhausters are critical components of sintering complexes in metallurgical production. An exhauster fault leads to a shutdown of the sintering machine and, as a result, to losses due to underproduction.

The main reason for exhauster faults is rotor wear, which depends on various factors, so the RUL of an exhauster varies considerably. Early prediction of exceeding the maximum permissible parameters makes it possible to replace the rotor during scheduled shutdowns of the sintering machine and eliminate (or significantly reduce) undesired downtime.

The task was to determine the time to shutdown on the horizon of a month and calculate the exact time to shutdown for each sampling point. The following hypotheses were formulated to test the applicability of different approaches to solving the problem:

Prediction of health index. The health index can be built based on normal operation models (semi-supervised approach), and then the health index is predicted until it intersects with a pre-computed setpoint, signaling the occurrence of downtime.
Regression. A regression model can be built with the time to downtime used as the target variable.

For the regression approach, the following machine learning pipeline was suggested:

allocate 60-day intervals before the onset of the malfunction (training stage only),
using a sliding window, cut each 60-day interval into 7-day windows, each corresponding to 1 number, i.e., the RUL before downtime,
using tsfresh, reduce each two-dimensional window (7 days times the number of signals) to a feature vector (1 point times the number of selected features) that keeps the same RUL label,
collect all vectors into a general sample,
state and solve the regression model training problem,
run model inference.
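The windowing and labeling steps of this pipeline can be sketched as follows. The sketch assumes one reading per day and five signals, neither of which is stated in the case description, and uses simple per-signal statistics in place of tsfresh features; all names are hypothetical.

```python
import numpy as np


def window_samples(interval, window_days, samples_per_day=1):
    """Slide a window over an interval that ends at the onset of downtime.
    `interval` has shape (time, signals). Returns the windows and,
    for each window, the RUL in days measured from the window's end."""
    w = window_days * samples_per_day
    windows, rul = [], []
    for end in range(w, len(interval) + 1, samples_per_day):
        windows.append(interval[end - w:end])
        rul.append((len(interval) - end) / samples_per_day)
    return np.stack(windows), np.array(rul)


def to_feature_vector(window):
    """Reduce a (time, signals) window to per-signal mean, std, min, and max."""
    return np.concatenate([
        window.mean(axis=0), window.std(axis=0),
        window.min(axis=0), window.max(axis=0),
    ])


rng = np.random.default_rng(7)
interval = rng.normal(size=(60, 5))  # hypothetical: 60 daily readings of 5 signals

windows, rul_days = window_samples(interval, window_days=7)
X = np.array([to_feature_vector(w) for w in windows])  # one row per 7-day window
y = rul_days                                           # regression target
```

Repeating this for every 60-day interval in the history and stacking the results gives the general sample on which the regression model is trained.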

The initial data were signals from the process control system, M&R data from SAP, and some manual input data. The final root mean squared error was 5 days over the 60-day interval before failure.

On a final note, refer to my overview repository for more cases of machine learning in heavy industry and training datasets.