Detecting anomalies in industrial equipment: an explainable predictive approach to maintenance.

Annagiulia Tiozzo
Published in Eni digiTALKS · 7 min read · Apr 4, 2022

Three steps to understand how data and machine learning algorithms can explain when and why a piece of industrial equipment is going to fail.

Data Driven Predictive Maintenance: what and why

There exist different types of maintenance strategies, such as:

  • Planned Maintenance: regular interventions are scheduled to check the status of the machines and prevent them from reaching faulty conditions
  • Corrective Maintenance: interventions are applied only after the equipment has reached a point of failure
  • Predictive Maintenance: thanks to continuous monitoring, interventions take place before the equipment loses performance.

Here we focus on predictive maintenance, because the wide availability of remote sensor data from the field allows data scientists to develop a data-driven approach to the predictive maintenance of equipment in an industrial plant.

This approach therefore rests on two crucial elements: the data, in particular the remote sensor measurements coming from the field, and the algorithms applied to them.

Predictive maintenance is becoming a high-priority practice because it enables very early detection of malfunctions, which preserves the machines and reduces the downtime associated with possible shutdowns.

Moreover, the data-driven approach continuously monitors the available data and supports the detection of deterioration in equipment condition, suggesting where to take tailored operational actions.

In this article, we describe how to combine a sequence of machine learning algorithms for an explainable approach to predictive maintenance.

We have identified three main steps to answer three main questions:

  1. First step — Anomaly detection: are we observing a deterioration in the equipment behavior?
  2. Second step — Fault Isolation: what is the source of this anomaly we are seeing in the data?
  3. Third step — Advisory model: what is the most probable failure mode associated with this anomalous condition?

Let’s dive deep into each of the three questions.

First step — Anomaly detection: are we observing a deterioration in the machine behavior?

To identify whether there is a deterioration in the machine behavior, it’s possible to start from an anomaly detection step.

For this step, we first need to gather a set of historical sensor measurements from the industrial equipment. This set needs to cover a wide range of sensors so that all parts of the machine are monitored by the algorithm.

After the data gathering, Normal Operating Conditions (NOC) are identified on the machine’s historical training dataset as time intervals in which the machine was working properly.

To detect anomalies, we train a semi-supervised machine learning algorithm to identify whether the current machine condition differs from the Normal Operating Conditions.

The Anomaly Detection algorithm relies on the following common techniques:

Principal Component Analysis (PCA) transforms a set of correlated variables into a smaller set of new, uncorrelated variables that retain the most important information of the original data. The set of possible equipment measurements can number more than a hundred, so PCA is needed to work with a limited set of variables that still correctly describes the machine status. In this way, the dataset is standardized and reduced.
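As an illustrative sketch of this reduction step (on synthetic data, since the article’s actual sensor set is not available; the 95% variance threshold is also an assumption), PCA can be computed with NumPy’s SVD, keeping only enough components to retain most of the variance:

```python
import numpy as np

# Synthetic stand-in for NOC sensor data: 500 samples of 6 sensors
# driven by 2 underlying physical factors plus measurement noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Standardize, then run PCA via SVD of the standardized data matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Keep the fewest components that retain 95% of the variance.
k = int(np.argmax(np.cumsum(explained) >= 0.95)) + 1
scores = Xs @ Vt[:k].T          # reduced representation of the data
print("components kept:", k)
```

Because the synthetic sensors are driven by only two latent factors, very few components are enough to describe the data.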

Subsequently, we apply the Kernel Density Estimation (KDE) technique, which estimates the distribution of the data in order to understand the health status of the equipment. KDE is a non-parametric method for estimating the probability density function of a random variable, and it is applied to the reduced and standardized dataset.
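A minimal one-dimensional sketch of KDE, assuming a Gaussian kernel (the bandwidth value and the synthetic data are illustrative assumptions):

```python
import numpy as np

def kde_log_density(train, query, h):
    """Log-density of a Gaussian kernel density estimate with bandwidth h."""
    # Pairwise squared distances between query and training points.
    d2 = (query[:, None] - train[None, :]) ** 2
    kernels = np.exp(-0.5 * d2 / h**2) / (h * np.sqrt(2 * np.pi))
    # The KDE density is the average of the kernels centered on the data.
    return np.log(kernels.mean(axis=1))

rng = np.random.default_rng(1)
normal_data = rng.normal(size=1000)  # stand-in for one reduced NOC variable

# A point near the bulk of the normal data gets a much higher
# log-density than a point far away from it.
ll = kde_log_density(normal_data, np.array([0.0, 5.0]), h=0.3)
print(ll)
```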

Let’s look more closely at the training and scoring phases.

For the training of the model, only the NOC data is used: this is why we defined the algorithm as semi-supervised. The NOC training dataset is standardized and reduced via PCA, and its distribution is then estimated via KDE. In this way, the algorithm learns the distribution of the normal data. Based on this distribution, warning and alarm thresholds can be defined, identifying a normality, a warning, and an anomaly region, as shown in Figure 1.

Figure 1 - Heatmap of probability density function (distribution) of Normal Operating Conditions. Example with 2 principal components

For the scoring of the model, the current equipment condition is compared with the distribution of the NOC dataset. The current set of measurements from all the sensors is standardized and reduced via PCA, and then scored with the KDE. The output of the KDE scoring is the logarithm of the likelihood of the data: let’s define this log-likelihood as the Health Index (HI). The health index measures how likely the current data is to belong to the distribution of the normal data.

The value of the health index identifies the region in which the current condition falls: whenever it reaches the warning or anomaly region, an alert is raised according to the severity.
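Putting the pieces together, the training and scoring phases can be sketched end to end (the synthetic NOC data, the choice of two principal components, the bandwidth, and the 5%/1% threshold quantiles are all illustrative assumptions, not values from the article):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic NOC training set: 800 samples of 5 correlated sensors.
latent = rng.normal(size=(800, 2))
W = rng.normal(size=(2, 5))
noc = latent @ W + 0.1 * rng.normal(size=(800, 5))

# Training: standardize, reduce with PCA, learn the NOC distribution via KDE.
mu, sd = noc.mean(axis=0), noc.std(axis=0)
Z = (noc - mu) / sd
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T                     # keep 2 principal components
scores = Z @ P                   # NOC data in the reduced space
h = 0.4                          # KDE bandwidth (a tuning parameter)

def health_index(x):
    """Log-likelihood of one sample under the 2-D KDE of the NOC scores."""
    t = ((x - mu) / sd) @ P
    d2 = ((t - scores) ** 2).sum(axis=1)
    dens = np.exp(-0.5 * d2 / h**2).mean() / (2 * np.pi * h**2)
    return np.log(dens + 1e-300)

# Warning and alarm thresholds from the HI distribution of the NOC data.
hi_train = np.array([health_index(x) for x in noc])
warn, alarm = np.quantile(hi_train, [0.05, 0.01])

# Scoring: a condition far outside the NOC cloud falls below the alarm level.
x_bad = mu + 10 * sd * Vt[0]     # pushed 10 sigmas along the first component
print(health_index(x_bad) < alarm)
```

Deriving the thresholds from quantiles of the training HI is one simple choice; in practice they would be tuned with domain experts.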

Figure 2 - Health Index

We now have the answer to the first question: the trend of the health index describes a possible deterioration in the machine behavior.

Second step — Fault Isolation: what is the source of this anomaly we are seeing in the data?

The machine is now running under non-optimal conditions and an alert has been raised.

It is useful to add an explainability module in order to know why the alert was raised.

For this reason, we propose to implement a second step, the Fault Isolation.

In the Fault Isolation step, the algorithm finds the root cause of the anomaly. In other words, Fault Isolation detects the sensors that contribute most to generating the anomaly. This step estimates how different the patterns of the current machine condition are from the patterns of the NOC data.

Figure 3 — Illustrative example of sensor contribution calculation

Figure 3 illustrates the idea behind the sensor contribution computation. First, the anomalous condition is localized with respect to the distribution of the NOC data in the reduced space. Then, the real sensor measurements are reconstructed by mapping the principal components back to the original space. We can therefore fill a vector reporting the contribution of each sensor to the anomaly: let’s refer to this vector as the sensor contribution.
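A toy sketch of this back-mapping (everything here is synthetic and illustrative: sensors 0, 1 and 3 share one physical factor, sensor 2 measures an independent quantity, and sensor 2 is the one made to drift):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic NOC set: sensors 0, 1, 3 driven by one shared factor f,
# sensor 2 driven by an independent factor g.
f = rng.normal(size=600)
g = rng.normal(size=600)
noc = np.column_stack([
    f + 0.1 * rng.normal(size=600),
    -f + 0.1 * rng.normal(size=600),
    g + 0.1 * rng.normal(size=600),
    0.5 * f + 0.1 * rng.normal(size=600),
])

mu, sd = noc.mean(axis=0), noc.std(axis=0)
Z = (noc - mu) / sd
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T                     # loadings: 4 sensors -> 2 components

# Current sample: sensor 2 has drifted far outside its normal range.
x = mu.copy()
x[2] += 8 * sd[2]

z = (x - mu) / sd                # standardized deviation from the NOC mean
t = z @ P                        # localize the anomaly in the reduced space
recon = t @ P.T                  # map the deviation back to sensor space

contrib = np.abs(recon)          # per-sensor contribution to the anomaly
print("most contributing sensor:", int(contrib.argmax()))
```

The drifted sensor dominates the back-mapped deviation, so the contribution vector points at it as the source of the anomaly.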

We now have the answer to the second question: the contribution of each sensor indicates the source of the anomaly.

Third step — Advisory Model: which is the most probable failure mode associated to this anomalous condition?

The machine is now running under unsafe conditions, as indicated by the abnormal values of some measurements. Let’s take a step forward and try to understand the possible consequences of this set of abnormal measurements.

For this reason, we introduce the last step of the algorithm, the Advisory Model.

Let’s start from the sensor contributions. The Advisory Model identifies and ranks the most relevant failure modes for the anomaly. To reach this goal, we define a Diagnostic Matrix (DM): a matrix that maps the impact of the variation of each sensor onto all the possible failure modes that can affect the equipment.

For each machine, the available sensors are listed on the rows of the matrix, with the potential failure modes in the columns, as represented in simplified form in Figure 4. The value of each cell is a ranking score describing how strongly the variation of one sensor relates to a given failure mode, from no correlation to maximum impact.

Figure 4 — Example of simplified Diagnostic Matrix

By combining the sensor contribution vector with the Diagnostic Matrix, we obtain a vector of weights of each failure mode for the specific anomaly. The algorithm then highlights the most likely failure modes as those with the highest weights.
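This combination is essentially a matrix-vector product. Here is a toy sketch with an invented four-sensor, three-failure-mode Diagnostic Matrix (all names, scores, and contribution values are illustrative, not taken from the article):

```python
import numpy as np

# Hypothetical simplified Diagnostic Matrix: rows = sensors,
# columns = failure modes, entries 0..3 = impact of a sensor
# deviation on each failure mode.
sensors = ["vibration", "bearing_temp", "discharge_pressure", "motor_current"]
failure_modes = ["bearing_wear", "impeller_fouling", "seal_leak"]
DM = np.array([
    [3, 1, 0],   # vibration
    [3, 0, 1],   # bearing temperature
    [0, 3, 2],   # discharge pressure
    [1, 2, 0],   # motor current
])

# Sensor contribution vector from the Fault Isolation step (illustrative).
contrib = np.array([0.7, 0.9, 0.1, 0.2])

# Weight of each failure mode: contributions combined through the matrix.
weights = contrib @ DM

# Rank the failure modes from most to least likely.
for i in np.argsort(weights)[::-1]:
    print(failure_modes[i], round(float(weights[i]), 2))
```

With high contributions on vibration and bearing temperature, the bearing-related failure mode receives the highest weight, which is the kind of ranking the Advisory Model surfaces to the operator.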

For the current abnormal situation, the algorithm explains the possible failure modes that can occur if the degradation of the equipment persists. As a consequence, early actions can be taken to restore the normal operating conditions.

Therefore, we have the answer to the third question: the vector of weights of the failure modes indicates the most probable failure mode associated with the anomalous condition.

Take home messages

  • Machine Learning algorithms can significantly improve the maintenance operations of industrial equipment with a predictive perspective.
  • Data-driven algorithms provide great support and insight to domain experts in detecting the deterioration of equipment and understanding where to focus inspections.
  • Combining multiple algorithms can be effective for reaching different levels of understanding of a problem.
  • If we, as data scientists, want these algorithms to be effectively used in real industrial plants, it is important to pair them with an explainability module.

If you want to dig deeper into predictive maintenance algorithms for rotating equipment, please refer to https://doi.org/10.2118/207657-MS [1].

[1] Optimizing Rotating Equipment Maintenance Through Machine Learning Algorithm. Beduschi F., Turconi F., De Gregorio B., Abbruzzese F., Tiozzo A., Amabili M., Prospero A. Paper presented at the Abu Dhabi International Petroleum Exhibition & Conference, Abu Dhabi, UAE, November 2021. doi: https://doi.org/10.2118/207657-MS
