Part 2. The underlying metrics for On-condition and failure finding maintenance tasks
Metrics are measurements that help us manage a system. Metrics should be understandable and business-led. We should also recognise that metrics influence the behaviour of our people and, lastly, that too many metrics may blur our focus. The conclusion to be drawn is that metrics need to be selected very carefully.
This blog is the second in a thread that will discuss low-level metrics to ensure that the applicability, effectiveness and efficiency of maintenance is measured and reported.
The taxonomy of maintenance we discussed in a previous blog identifies on-condition and failure finding as types of task done as preventative maintenance. We have also described on-condition maintenance in some detail here.
There are two broad measures of effectiveness for on-condition maintenance tasks. The first measure is associated with diagnosing that a failure exists and the second is success in prognostics.
In a previous blog we split on-condition tasks into two models: long-term deterioration (LTD) and incipient failure detection (IFD). Diagnostics has a looser association with LTD: if we can measure condition at any time in the deterioration lifecycle, there is no point of diagnosis and we apply prognostics directly. Where LTD condition can only be measured from some point in the deterioration lifecycle onwards, diagnosis is applicable. Diagnosis is always necessary for IFD, to determine that a component is failing.
Taking diagnostics first: diagnostics is the activity that determines whether we are in a state of failure while the consequences are still minimal. Diagnostics has degrees of success or accuracy, ranging from recognising that something unusual is happening to recognising that our asset is definitely failing. We introduce a set of keywords to classify these degrees of diagnostic certainty:
- Detecting a novelty. Novel behaviour that may or may not be symptomatic of failure. Novel behaviour may be indicative of a region of normal behaviour that is not known. This implies we have knowledge of what we regard as normal and abnormal asset behaviour.
- An anomaly, which is recognised to be non-normal behaviour (a deviation from normal) that may be indicative of failure, but whose cause is as yet unknown.
- A symptom, which is a deviation from normal that is a known indicator of failure. In some on-condition systems, a symptom is known as a condition indicator or CI.
- A diagnosis that is made up of one or more symptoms. A diagnosis is indicative, with a high degree of confidence, of a failure mode on a component.
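This ladder of increasing certainty can be sketched as a simple ordered classification. The sketch below is illustrative only; the enum values, names and `classify` helper are hypothetical, not part of any standard:

```python
from enum import IntEnum

class DiagnosticCertainty(IntEnum):
    """Increasing degrees of diagnostic certainty (hypothetical labels)."""
    NOVELTY = 1    # unusual behaviour, may or may not be symptomatic of failure
    ANOMALY = 2    # known non-normal behaviour, cause as yet unknown
    SYMPTOM = 3    # deviation that is a known indicator of failure (a CI)
    DIAGNOSIS = 4  # one or more symptoms pointing to a failure mode

def classify(is_known_deviation: bool,
             linked_to_failure_mode: bool,
             confirmed_failure_mode: bool) -> DiagnosticCertainty:
    """Map the evidence we hold about a detection onto the certainty scale."""
    if confirmed_failure_mode:
        return DiagnosticCertainty.DIAGNOSIS
    if linked_to_failure_mode:
        return DiagnosticCertainty.SYMPTOM
    if is_known_deviation:
        return DiagnosticCertainty.ANOMALY
    return DiagnosticCertainty.NOVELTY
```

Using an ordered enum means a detection can be "promoted" (for example, from anomaly to symptom after troubleshooting) simply by re-classifying it with better evidence.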
The goals of on-condition diagnosis are to:
- Detect the incipient failure, ideally isolated to a failure mode and the component(s) damaged by that failure mode (which may require repair or replacement).
- Detect the incipient failure with sufficient time to allow planning and pre-positioning of resources, so the affected machine can be gracefully withdrawn from service and recovered quickly and efficiently. This should minimise operational disruption.
Taking each goal and breaking it down further:
We need to understand our requirements for isolating the fault condition. For example, what are our levels of line replacement for components? Do we replace a whole electric motor as a line replaceable unit, or would we replace the bearings on the motor in situ? The on-condition fault isolation may only be required at the whole motor level for the first situation or fault isolation is required at the motor bearing level if we had to recover the motor bearings. Isolating the failure mode helps inform us what type of damage is occurring and helps us to determine what spares may be necessary. This ‘level of recovery’ information needs to be recorded in the maintenance master data linked to the associated on-condition task. It is desirable to develop diagnostics so we can trust it enough to avoid having to manually troubleshoot or confirm the diagnosis.
The second goal makes us consider what the required recovery lead times may be for planning and predisposition for recovery. If we are to gain maximum benefit from on-condition maintenance, then the diagnosis ideally needs to detect the onset of failure before that lead time.
The degree of system conformance with these universal goals should form metrics for on-condition tasks.
Not all on-condition systems are optimal. Partial value may still be realised from late detection of an anomaly: shutting down a machine before the end point of failure avoids secondary induced damage. This is better than allowing the machine to run to final failure, so partial value is achieved.
Looking at how we have broken down on-condition to varying degrees of accuracy:
If our system detects a novelty, we need action to confirm that this is either an unrecognised part of normal behaviour, or whether it is possibly an anomaly. Our action would be to recognise a novel area of benign normal operation or classify an anomaly.
If the detection is a known anomaly, then we may need to initiate further action to task a maintainer to troubleshoot the affected machinery to determine what may be going wrong. The resulting information that the troubleshooting may find needs to be captured and recorded in the maintenance master data about that machine, so improvements in the on-condition diagnosis may be made. In this case, an anomaly may be promoted to a symptom or Condition Indicator. The difference between a symptom and an anomaly is that we are sure the symptom is associated with known failure mechanisms or failure modes, where an anomaly is not.
If the detection is a symptom, we may still need a trouble-shooter to manually determine the condition of the affected part, or we may have sufficient confidence in the symptom to start to plan recovery. If we consistently trigger corrective work from single symptoms (effectively treating a single symptom as a diagnosis), we may suffer from a situation known as a ‘false positive’. A false positive is where we think we are detecting a failure, but none exists on the machine. This may result in taking nugatory action and possibly taking a machine offline, which is wasteful. We will discuss what we mean by a ‘false positive’ below when we cover the confusion matrix.
If we have a diagnosis, which may occur after a trouble-shooter has confirmed an anomaly or single symptom is a failure, or we trust a single symptom, or we trust multiple symptoms are a failure, then we should have a lot more confidence that we are indeed suffering from a failure and act accordingly. Symptoms may also present themselves at different times in the failure cycle, which helps us to quantify prognostics and the time to functional failure.
If we generalise, essentially Diagnosis is a type of classification problem. Diagnosis asks “are we in the normal or failing state?” The best way of measuring a classifier is with a confusion matrix. The aim of the system is to correctly classify the onset of failure, avoiding misclassifications.
The semantics of the confusion matrix can take a few minutes to understand, especially because a ‘positive’ here means the classifier has detected a failure. We naturally regard a failure as a negative, but not in the case of a confusion matrix.
Where on-condition involves manual inspections by expert, qualified inspectors, the chances of false positives are generally low. The confusion matrix is probably more useful in predictive maintenance, where diagnostics and prognostics are data-centric and more automated; misclassification is more likely in a predictive maintenance system.
Metrics can measure the occasions the classifier triggers, whether each is a true or false detection, and the occasions where the classifier misses a failure (a false negative). In the matrix above, green cells are desirable results and red cells are not.
It is also advantageous to think in terms of specificity and sensitivity when tuning a diagnostic classifier. Tuning is possible in a predictive maintenance situation (where sensor data is being used): increase sensitivity to reduce the number of false negatives, or increase specificity to reduce the number of false positives. In a system where failure consequences are high, you may tune to be more sensitive and accept more false positives.
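These measures can be sketched minimally as below, assuming each observation period is labelled with the actual machine state (`True` = failing) and the classifier's prediction (`True` = failure flagged). This is an illustrative sketch, not a prescribed implementation:

```python
def confusion_counts(actual, predicted):
    """Count TP/FP/TN/FN, where a 'positive' means the classifier flags failure."""
    tp = sum(a and p for a, p in zip(actual, predicted))          # true detections
    fp = sum((not a) and p for a, p in zip(actual, predicted))    # false alarms
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))    # missed failures
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Share of real failures that were detected (higher = fewer false negatives)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of healthy periods correctly left alone (higher = fewer false positives)."""
    return tn / (tn + fp)
```

Tracking these two ratios over time gives the tuning levers described above: push sensitivity up when failure consequences are high, accepting the extra false positives that follow.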
Beyond diagnostics, we need to address prognostics. Prognostics might not have to follow on from diagnosis if we are dealing with LTD where we can measure condition at any time in the deterioration lifecycle. The goals of prognostics are to:
- Provide an accurate prediction of the remaining useful life of a failing component.
- Provide a forecast of the likely impact and consequences of failure, whether functional or final, at any point during the remaining useful life. This should include the time and effort required for recovery.
Much seminal work on prognostics has been completed by NASA. Many peer-reviewed papers and textbook chapters may be found here.
To fully understand the scope of prognostics we can use a model that breaks down the types of prognostics possible:
The Type 1 models use historical age at the time of failure data to determine the probability of failure over the age of the component. Type 2 models are able to build on type 1 and introduce other variables that may influence the speed of deterioration. Type 3 models are able to utilise digital twins and other data-driven or physical models to infer prognostics. Type 4 models use sensing that can directly measure either the effects of or the failure mechanisms directly and use other models such as particle filtering to determine RUL.
When considering IFD, type 1 models can predict the onset of failure events, but the model is not relevant to the RUL after the inception of failure and diagnosis. Weibull could be used in a separate model for RUL, but IFD RUL Weibull models are not the same as Weibull models covering the full age of an asset and its probability of failure. When looking at the specificity of the model types, Type 1 is more general, applying to a population of the same component for a given failure mode, where Type 4 is specific to an instance of a failing component.
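A Type 1 model of this kind can be sketched with a two-parameter Weibull distribution. The function names and parameters below are illustrative (beta is the shape parameter, eta the characteristic life); real analysis would fit these parameters to the population's age-at-failure data:

```python
import math

def weibull_failure_prob(age, beta, eta):
    """Type 1 sketch: probability a component has failed by 'age',
    under a two-parameter Weibull (beta = shape, eta = characteristic life)."""
    return 1.0 - math.exp(-((age / eta) ** beta))

def conditional_failure_prob(age, horizon, beta, eta):
    """Probability of failing within the next 'horizon' of operation,
    given the component has survived to 'age'."""
    r_now = math.exp(-((age / eta) ** beta))
    r_later = math.exp(-(((age + horizon) / eta) ** beta))
    return 1.0 - r_later / r_now
```

Note this is population-level: it says nothing about an individual failing component after diagnosis, which is where the more specific Type 3 and Type 4 models come in.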
Where diagnosis was a type of classification problem, prognosis is a type of regression problem and different techniques and models are used to measure each problem.
Prognosis is difficult because of high degrees of uncertainty, and because we need to predict future operating and environmental influences that may or may not occur. If we have historically recovered all previous instances of failure events before the point of functional failure, we have little historical failure data that may be used to decrease uncertainty.
The practical measurements for calculating RUL are based on the prognostic horizon: any prediction of RUL should be long enough to allow recovery to be prepared before end-of-life (EoL) or functional failure. The allowable tolerance, alpha, is determined by the time needed to prepare recovery and narrows into a cone as operating time passes, with lambda marking the relative point in time, so the allowable error reduces the closer we get to EoL. Each point estimate produced by an algorithm should carry uncertainty deriving from measurement and other factors, so the prediction provides measures of uncertainty.
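An alpha-bound check of this kind can be sketched as follows. The default `alpha` of 0.2 is an arbitrary illustration, not a recommended tolerance; because the bound is a fraction of the true RUL, the allowable absolute error narrows into a cone as EoL approaches:

```python
def within_alpha_bounds(rul_pred, rul_true, alpha=0.2):
    """Alpha-lambda style check (sketch): is the predicted RUL within
    +/- alpha (expressed as a fraction of true RUL) of the true RUL at
    this point in time? As operating time passes, rul_true shrinks, so
    the same alpha permits less and less absolute error near end-of-life."""
    return abs(rul_pred - rul_true) <= alpha * rul_true
```

Evaluated at several lambda points along the deterioration, the fraction of predictions falling inside the cone becomes a reportable prognostic metric.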
The second type of maintenance task we discuss in this blog is Failure Finding tasks.
Failure Finding tasks
A failure finding task is necessary where components may fail unnoticed by the operators or maintainers conducting their normal duties. Before LED lights, with their built-in redundancy, were introduced into cars, a good example of a hidden failure was a failed rear brake light. The failure went unnoticed by the driver in their normal activity of driving. The brake lights' function is to inform drivers following you that you are slowing down by applying the brakes, so rear-end collisions can be avoided. Such circumstances are called ‘hidden failures’ in the RCM process. Hidden failures are particularly important where the components or parts are safety devices: a hidden failure leaves no, or reduced, protection against the event the component is intended to prevent. The purpose of a failure finding task is to determine whether the component is in the failed state or not. This does not rule out conducting on-condition monitoring if it is effective.
Note: if standard operating procedures are changed, or staff are reduced, then the maintenance regime should be reviewed to ensure no new hidden failures result.
Examples of safety protection devices include pressure-relief valves where a valve failing to the shut position would be hidden.
Most safety protection devices are designed to fail in the ‘safe’ or ‘noticeable’ state, but some cannot be. Alarms and warning systems should fail-safe. Redundancy is often employed to help improve responsiveness for safety protection. Pressure vessels are often fitted with two or more pressure relief valves. Other protection systems using sensors may also be configured using majority voting logic. An example is where four sensors are used to detect rotating machinery overspeed: it takes three of the four sensors to sense overspeed for the protection to trip the machine. If the overspeed sensors individually fail to the trip state (which should be indicated to the machine operator, so it is not hidden) then the protection system will be very reliable and also responsive. The supplies to the sensors and their infrastructure need to be separate and independent of each other to avoid common modes of failure that could affect all four sensors.
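The three-out-of-four voting logic, and why the independence assumption matters, can be sketched as below. `prob_spurious_trip` is a hypothetical helper name; it computes the binomial probability that at least k of n independent sensors sit in the failed-to-trip state at once, which only holds if there are no common-mode failures across their supplies:

```python
from math import comb

def trips(sensor_states, k=3):
    """Trip the machine when at least k sensors report overspeed (k-out-of-n voting)."""
    return sum(sensor_states) >= k

def prob_spurious_trip(p_failed_to_trip, n=4, k=3):
    """Probability that k or more of n sensors are in the failed-to-trip
    state simultaneously. Assumes the sensors fail independently -- the
    whole point of separating their supplies and infrastructure."""
    return sum(comb(n, i) * p_failed_to_trip ** i * (1 - p_failed_to_trip) ** (n - i)
               for i in range(k, n + 1))
```

A common-mode failure (for example, one shared power supply) breaks the independence assumption, and the real probability can be far higher than the binomial figure suggests.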
Failure finding tasks usually involve physically testing the trip functions. Overspeed tests can be conducted on rotating machinery if speed control is possible. Examples may include turbines. Other failure finding tasks may include removing some components and bench testing them. Pressure-relief valves may be checked in this way. Other protection devices may also be tested by isolating the device from its normal system, and applying local pressure, temperature or electrical signals that functionally tests the protection device. For example, an induction motor overcurrent trips may be tested for instantaneous overcurrent (protecting against a short circuit condition) or slow overcurrent trips (protecting against single phasing conditions) by injecting currents through the secondary coils supplying the trip devices, that are fitted around the primary supply cables.
Another example of hidden failures is with standby machinery. There is a concept of hot and cold standby: hot standby is where the standby machinery is always running and is ready to take the load if the primary machine trips; cold standby is where the standby machine is normally shut down and automatically runs up to take the load of a tripped duty machine. The hidden failures for either are: will the standby machinery take the duty load, and, in the case of the cold standby machine, will it fail to run on demand. It is possible to exercise cut-ins and load taking by switching off duty machines and checking the standby runs and takes the load.
Failure finding task periodicities
The principles for failure finding tasks are to understand the probability of failure over time and choose a periodicity that reduces the risk of failure to an acceptable level. Often in a safety situation, Safety Integrity Levels (SIL) are used to specify the maximum required probability of failure, and the task intervals are adjusted accordingly. In some systems, international standards may be adopted that set required periodicities. I have worked in regimes where the American Society of Mechanical Engineers (ASME) codes for pressure relief valves were used. In steam systems, pressure relief valves were functionally tested once per year.
Note: How many maintenance systems explicitly identify safety-critical parts where maintenance periodicities have been set by standards, legislation or a safety case? How would you avoid inadvertently changing such a periodicity and introducing a latent failure?
Many maintenance texts recommend using Mean Time Between Failures (MTBF) to determine failure finding periodicities. I am highly sceptical of this process unless you know that the underlying failure pattern is random. If you do not know the failure pattern or distribution, using MTBF is not suitable, and if the impact or consequences of failure are SHEL-implicated, it could be dangerous.
Some methods of determining periodicity take into account both the probability of failure of the component plus the probability that the event that requires protection may occur whilst the protection is negated.
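One common approximation, valid only on the assumption of random (constant-rate) failures and a task interval much shorter than the protective device's MTBF, relates the interval to a target unavailability; a second step combines that unavailability with the demand rate for the protected event. The function names below are illustrative, not from a standard:

```python
def failure_finding_interval(target_unavailability, mtbf_protective):
    """Classic FFI approximation (random failures only, FFI << MTBF):
    average unavailability ~= FFI / (2 * MTBF), so
    FFI ~= 2 * target_unavailability * MTBF."""
    return 2.0 * target_unavailability * mtbf_protective

def multiple_failure_rate(demand_rate, unavailability):
    """Rate at which the protected event occurs while the protection is
    in the (hidden) failed state -- the 'multiple failure' RCM is
    ultimately trying to keep acceptably rare."""
    return demand_rate * unavailability
```

For example, a target unavailability of 1% on a relief valve with a 50-year MTBF gives a test interval of about one year, which is consistent with the annual functional testing mentioned above; a SIL target would set the unavailability figure.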
Thinking about master data and metrics
- In an FMEA or an equipment register have we classified equipment functional failures as having SHEL, Operational, Economic or low significance?
- In an FMEA or equipment register have we identified equipment as having hidden failure modes, and are there associated hidden failure tasks?
- Have maintenance items that are subject to SHEL or standards criteria been identified, and is the maintenance regime compliant with their stipulations?
- Does the CMMS have a warning flag on scheduled maintenance items required by SHEL or by standards, so that their periodicities cannot be altered and the items cannot be deferred?
It is worth introducing a segue here to discuss the classification of SHEL criticality. What we have so far discussed in this blog has been associated with machinery, how it has been integrated into systems, and its intrinsic SHEL concerns. In other words, the SHEL concerns that the machinery carries from its design, manufacture and integration into its plant or operation. This is not enough; we must also think about functional safety and how safety changes as we use equipment in its operating context and environment.
For example, take a mining truck that transports ore from an open-cast pit. A failure that would normally be operationally implicated, such as one that leaves the truck undriveable, is probably promoted to SHEL if it occurs whilst the truck is in the pit. This may be because recovering the truck exposes maintenance staff to a hazardous environment, carrying higher safety risks. This poses the question:
Has the operational context, environment or any abnormal operational procedure caused operational risks to increase, and have these been mitigated with plans or processes? Has this knowledge been captured in the maintenance and asset management master data?
In the previous blog, we saw the metrics required for timeliness of scheduled maintenance, using a timeliness matrix. The matrix is not applicable to failure finding, but it is important that failure finding tasks associated with safety functions are completed on time and are not deferred.
Timeliness is one important metric. The other is monitoring the failure rate and constantly validating the statistical assumptions behind the accepted reduction in the risk of failure.
The task is intended to discover failures. For Weibull analysis, if we conduct a failure finding task, physically check the function and find a failure, we cannot determine the exact age at failure: we only know the failure occurred at some time between the previous task and the latest one. Analysing such data therefore involves increased uncertainty.
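One way to capture this in the master data is to record each failure finding result as an interval-censored or right-censored observation, rather than inventing an exact failure age. The representation below is a hypothetical sketch of such a record:

```python
def censor_interval(last_check, this_check, found_failed):
    """Record a failure finding result for later Weibull analysis.
    If the component is found failed, all we know is that the failure
    occurred somewhere in (last_check, this_check] -- an interval-censored
    observation. If it is found working, the observation is right-censored
    at the time of this check."""
    if found_failed:
        return ("interval", last_check, this_check)
    return ("right-censored", this_check, None)
```

Fitting a distribution to interval-censored data is well-established (maximum likelihood handles it directly), but the wider the test interval, the wider the resulting confidence bounds, which is another argument for validating the chosen periodicity.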
In this blog, we have covered deriving metrics for the on-condition and failure finding types of maintenance task. We discussed the applicability of diagnostics and prognostics to the two models of on-condition we introduced in a previous blog here. We applied a confusion matrix to diagnostics, which we presented as a type of binary classification problem, and then summarised NASA's work in measuring prognostic regression to predict end of life. We discussed the importance of failure finding tasks, especially when they are associated with a protection, alarm or warning system. Many safety-related failure finding task periodicities may be set by standards or by the organisation's own safety rules. This means changes need to be justified with robust cases.
In the next blog, we will finish this series of three by showing how the metrics discussed in the first two blogs should be automated, and how they can feed a more generalised, higher-level metric that ensures we avoid being overwhelmed with too many metrics.
We have focused on metrics associated with maintenance tasks, but we would also like to hear about other general metrics associated with maintainability, availability and reliability. Please share your stories and experience.