Why are there so many inspections with on-condition maintenance?
Why can it seem so wasteful?
In my career, I have heard several people criticise on-condition maintenance as being wasteful. To paraphrase:
“There are so many inspections and they hardly ever find a failure, especially after conducting an RCM project.”
In this blog, we will explore this issue by discussing some of the principles behind on-condition maintenance and showing why there may be so many inspections. We can also show techniques for their reduction whilst still retaining the integrity of the maintenance to avoid failure consequences.
In order to address this criticism, we need to establish the basic principles behind on-condition maintenance. I believe we need to use two models to describe on-condition maintenance. Most textbooks do not differentiate between these two models, which I think can lead to misunderstandings. The first model is where deterioration of the component’s condition begins either after manufacture (such as rubber hoses curing) or when the component is introduced into its operating environment (such as a ship being launched into seawater). We call this model long-term deterioration.
The second model is where rapid deterioration starts after an initiating failure event. The initiating event can occur at any stage during the service life of a component. It may be caused by a shock or sudden overload that overstresses the component. An example is a bearing failure caused by a sudden overload, such as a car wheel bearing failing after the car is driven over a really deep pothole. We call this model incipient-failure detection. We can use two basic graphs to illustrate the two models, which we will reuse and add more detail to in the deeper explanations.
The major differences between the two models are the length of time over which the condition deteriorates and the point at which the degradation process starts.
The two diagrams above show degradation that accelerates over time. This is for illustrative purposes only, as some degradation may be constant and some noisier because of other changing physical properties. We should expect much noisier degradation trends in the real world. Before we dive deeper into each model, it is worth explaining some of the pre-requisites for applying on-condition maintenance to set the context for the rest of this blog.
On-condition maintenance is only applicable when:
1. There is a time-lapse between either the initiation of a failure or the start of degradation and functional failure.
2. The deterioration is detectable at a time before functional failure (usually called diagnostics) that allows:
a. the functional failure to be avoided
b. the recovery of the failing part to be planned with resources predisposed
c. additionally, if possible, some leeway in timing the withdrawal from service for recovery, to minimise operational impact
3. The detection:
a. is highly certain (avoiding false positives or false negatives)
b. isolates the failure modes
c. isolates the components to those requiring direct corrective maintenance
4. The variance of times to functional failure, for different instances of the same failure mode on the same component, is reasonably tight and dependable.
5. The prediction of remaining useful life (usually called prognostics) is possible.
A subtlety in maintenance semantics is worth pointing out here. Although on-condition maintenance is in the preventative maintenance family, it does not prevent failures. On-condition maintenance enables the avoidance of failure consequences. It still requires us to take corrective action at the later stages of deterioration.
Long-term deterioration
We will first look at long-term deterioration. Where there is long-term deterioration throughout the operational life, we need to consider two further sub-divisions of how and when the deteriorated condition can be detected, measured or inferred. The first subcase is when we can measure condition at any time but may need to consider other constraints. The second subcase is where we can only measure the condition late in the deterioration lifecycle, and there may be many earlier, seemingly nugatory, inspections.
Regardless of when the condition measurement is possible, it needs to be able to tell us the remaining useful life (RUL) before the component we are monitoring is unable to deliver its functions (functional failure).
Sometimes it is not possible to quantify the condition until it is late in the degradation lifecycle. This constrains what can be done. An example of where initial detection is only possible relatively late in the degradation cycle is the detection of fatigue cracking in metallic structures such as vehicle chassis, or aircraft fuselages and critical components like landing gear. If we draw the model diagram, we can introduce the idea of the P-F Interval.
We have introduced two terms that seem to describe similar things: Remaining Useful Life (RUL) and the P-F interval. What is the difference? The P-F interval is the period of time from the point at which an incipient failure can first be detected and diagnosed to the time of functional failure. The RUL is the time left, at any moment after diagnosis, until functional failure. The RUL counts down as time goes by, whereas the P-F interval is fixed.
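The distinction can be made concrete with a small sketch. The function name and the 60-day P-F interval are illustrative assumptions, not values for any real component.

```python
def remaining_useful_life(pf_interval_days, days_since_detectable):
    """RUL at a given moment: the fixed P-F interval minus the time
    already elapsed since the failure first became detectable (point P)."""
    if days_since_detectable < 0:
        raise ValueError("the failure is not yet detectable")
    return max(pf_interval_days - days_since_detectable, 0.0)


PF_INTERVAL_DAYS = 60.0  # assumed, illustrative value

# The P-F interval stays at 60 days; the RUL depends on when we look.
for elapsed in (0, 20, 45):
    rul = remaining_useful_life(PF_INTERVAL_DAYS, elapsed)
    print(f"{elapsed} days after P: RUL = {rul:.0f} days")
```

The later an inspection happens to fall within the P-F interval, the less RUL is left for planning the corrective work.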
The failure mechanism of fatigue relieves stress by micro-cracking within the structure. The micro-cracks join and then propagate until they reach a critical size at which the structure is deemed unsafe. It is only possible to detect the cracks above a certain threshold size, and RUL is then usually calculated using crack-propagation models. Many inspection techniques can also only detect cracks that appear on the surface of structures. These inspection techniques are often referred to as NDE or NDT (non-destructive examination or testing).
Techniques used to detect the cracking include ultrasonics, eddy-current, dye-penetrant and others.
The RCM process then instructs that the inspection periodicity needs to be a fraction of the P-F interval. The result may be a large number of inspections that early on in the degradation lifecycle will have very little chance of detecting cracking. We also need to apply the pre-requisite rule #2 above, in order that the inspection is valid and will add value. This is where some of the criticisms of too many nugatory inspections may be valid.
An alternative to using the P-F interval to determine the frequency of inspections for metallic and most composite structures is to adopt the approach invented in aerospace after a Dan-Air cargo aircraft crash in 1977. This is termed the ‘damage tolerance’ approach, which relies on the frequency of robust inspections being able to detect cracking such that cracks cannot develop to critical lengths between the inspections.
The essential pre-requisites for a damage tolerance approach of inspections for structural cracking are:
- An understanding of crack growth rates in relation to utilisation and cyclic stressors in the operating environment (thermal cycling, vibration etc)
- An understanding of rogue manufacturing and accidental environmental defects
- An understanding of the smallest size of defects that can be reliably detected
- The determination of the largest defect size for safe operation
In any safety regulated industry or application, all of these factors would need to be demonstrated and validated.
This results in the type of inspection schedule illustrated below. This addresses the basic criticism of too many inspections but requires rigour in being able to model and justify the failure mechanisms and how they are influenced by the operating context.
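As a rough sketch of how the pre-requisites above combine into an inspection interval, the example below integrates the Paris crack-growth law from the smallest reliably detectable crack to the largest safe crack, then divides the result by a scatter factor. The Paris law itself, the material constants, the stress range and the scatter factor of 3 are all illustrative assumptions; a real analysis would use validated data and the regulator's required factors.

```python
import math


def cycles_to_grow(a_start, a_end, C, m, delta_sigma, Y=1.0):
    """Load cycles for a crack to grow from a_start to a_end (metres).

    Assumes Paris' law, da/dN = C * (Y * delta_sigma * sqrt(pi * a))**m,
    with constant stress range and geometry factor, integrated in closed
    form (valid for m != 2).
    """
    if m == 2:
        raise ValueError("this closed form requires m != 2")
    k = C * (Y * delta_sigma * math.sqrt(math.pi)) ** m
    p = 1.0 - m / 2.0
    return (a_end ** p - a_start ** p) / (k * p)


# Hypothetical inputs, for illustration only.
A_DETECTABLE = 0.002   # smallest reliably detectable crack, m
A_CRITICAL = 0.025     # largest crack size for safe operation, m
C, M = 1e-11, 3.0      # assumed Paris constants (m/cycle, MPa*sqrt(m) units)
DELTA_SIGMA = 100.0    # assumed cyclic stress range, MPa
SCATTER_FACTOR = 3.0   # assumed safety factor on the growth life

growth_life = cycles_to_grow(A_DETECTABLE, A_CRITICAL, C, M, DELTA_SIGMA)
interval = growth_life / SCATTER_FACTOR
print(f"growth life: {growth_life:,.0f} cycles")
print(f"inspect at least every {interval:,.0f} cycles")
```

Because the interval is tied to crack growth between the detectable and critical sizes, a crack missed at one inspection should still be found at a later one before it becomes unsafe.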
We can also look at other examples where we can measure and quantify the deterioration at any time in the lifecycle, but in order to inspect we may need to interrupt operations or the sheer amount we have to inspect is extreme and this poses practical problems. An operational interruption example may be a gas turbine internal inspection to determine the rate of dissipation of the thermal barrier coating in the hot end. Thermal barrier coatings allow hotter combustion that improves thermal efficiency with temperatures exceeding the melting point of the guide vanes and blades. The inspection requires the engine to be shut down and cooled off enough to use borescopes.
Another example, this time of a huge amount of infrastructure to inspect, is pipelines. Pipes are often used to convey liquids or gases over considerable distances; Trans Canada is an example. A pipe leak may be catastrophic and cause considerable damage to the environment. Among the pipe failure modes, internal and external corrosion reduces the pipe wall thickness over the economic life of the pipe. We will find it prohibitively expensive to continuously monitor all of the pipes. Pigging scrapers enable lengths of pipe to be measured whilst still maintaining flow. Drones can also be used to inspect for external issues, and are used extensively in other distribution infrastructures such as electricity transmission and distribution. With advances in technology, advanced instrumentation and data systems are becoming more effective for surveillance and inspections, but their use and scheduling have to be carefully planned to ensure the inherent rules of on-condition maintenance are adhered to.
Another consideration for inspections should be accessibility and the consequences of failure. An example may be inspecting pipes and the reactor pressure vessel inside a nuclear power station. Reactor pressure vessels are subject to embrittlement because neutron bombardment from the nuclear fission process hardens the steel. Cracking becomes more likely the more embrittled the pressure vessel becomes. Because these inspections take place in a radioactive environment, the ALARP (As Low As Reasonably Practicable) principle applies to reduce the radiation dose that workers may receive. Robotics are therefore commonly used for inspections.
How can we further seek to reduce the extent of what we inspect?
In such systems, simulation and damage accumulation modelling that infers the condition are frequently used. These simulation techniques add a corroborative back-up to the inspection results and can be used in determining the risk of failure. Where models are used in conjunction with safety-critical components, the amount of validation and verification, testing and proof used is considerable, and may also be cited in licences to operate the plant. As software tools emerge, simulations become easier to use and their adoption may become more widespread.
We can also exercise engineering knowledge and use a risk-based approach to determine which parts of our infrastructure are most likely to fail first. In a vehicle chassis, we know where the most stressed parts of the structure are, and we can focus inspections on these parts. Welding used in structures also increases failure risks in the heat-affected zones. In pipelines, we may have bends to allow thermal expansion or contraction, or low points where moisture may collect. These areas may be more prone to failure than straight runs of pipe and are selectively inspected because they should degrade more rapidly. The inspection system then needs to ensure that inspections are repeated at the same location.
For systems where we can quantify condition at any time, there is no P-F interval. How can we schedule the inspections?
If we can quantify the variation of times to functional failure, from introduction into service, using techniques such as Weibull, we can determine worst-case deterioration in the left-hand tail of the Weibull distribution. If we can further back this up with modelling for damage accumulation, we should be able to determine a scheduling regime roughly based on the half-life of the remaining deterioration life. This is illustrated below:
The decision on when to switch from half-life to regularly spaced inspections can be based on a risk approach that identifies the most credible shock event that could accelerate the degradation of condition to functional failure. The periodicities may be based on the minimum time the fault takes to propagate after this shock. Remedial action should be taken once there is a risk of immediate functional failure from ever smaller shocks.
The half-life approach to scheduling, for long-term deterioration where the condition can be quantified at any time in the deterioration lifecycle, ensures that the number of inspections is minimised, and partially addresses the criticisms levelled at on-condition maintenance. The RCM process does not explicitly describe this method of scheduling.
Now we turn to the second model, incipient failure detection.
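A minimal sketch of such a schedule follows. It takes a worst-case life from the left-hand tail of a Weibull distribution, places each inspection halfway through the life that remains, and switches to a fixed interval once the half-life step would be shorter than that interval. The Weibull parameters, the choice of the 1% point as "worst case" and the minimum interval are hypothetical values chosen only for illustration.

```python
import math


def weibull_percentile_life(eta, beta, p):
    """Age by which a fraction p of the population has failed
    (two-parameter Weibull, scale eta, shape beta)."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)


def half_life_schedule(worst_case_life, min_interval):
    """Inspection times: halve the remaining life each time, but never
    step by less than min_interval (the regularly spaced phase)."""
    times, t = [], 0.0
    while True:
        step = max((worst_case_life - t) / 2.0, min_interval)
        t += step
        if t >= worst_case_life:
            break
        times.append(t)
    return times


# Hypothetical inputs: eta and beta in years; the 1% point is taken as worst case.
worst_case = weibull_percentile_life(eta=20.0, beta=3.0, p=0.01)
# min_interval stands in for the minimum fault propagation time after a shock.
for t in half_life_schedule(worst_case, min_interval=0.25):
    print(f"inspect at {t:.2f} years")
```

With a worst-case life of 16 units and a minimum interval of 1, the sketch schedules inspections at 8, 12, 14 and 15, which is far fewer than inspecting every unit of time from the start.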
Incipient failure detection
Incipient failure detection is where components have a distinct failure initiation event that may happen at any time in their service or operating lives. After the initiation event, the condition deteriorates over a relatively shorter time than in long-term deterioration until functional failure, or worse, is reached. The physical time lag between initiation and functional failure aligns with the first pre-requisite for on-condition applicability quoted above.
In order to be effective, the other pre-requisites need to be met. Very important among them is being able to detect that the component is failing early enough before functional failure to prepare for taking it out of service, minimise operational disruption and apply the corrective maintenance.
We can illustrate the principles using a rolling element bearing failure as an example. The probability of bearing failure is minimal if the bearing is specified with a load margin over the maximum loads, including the most credible overload situations for its intended use; it is manufactured with a high degree of quality assurance; lubrication and protection from environmental stressors (e.g. corrosion) are maintained appropriately; and operations stay within specified limits. However, lapses in any of these may result in an initiation of failure.
Bearing failures may be detected by observing a number of symptoms, using a number of techniques, throughout the degradation cycle. The symptoms may become observable earlier or later. This is illustrated below.
It is worth pointing out that fixed sensors can measure parameters such as vibration and/or temperature. By using software to automate the processes, we can avoid periodic inspections or sampling, because the monitoring is continuous in a predictive maintenance system. Depending on how data is acquired from the sensors, the diagnosis and prognosis can be rapidly alerted to the system user. This helps to gain more RUL in which to plan and prepare for corrective maintenance.
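One very simple form of automated prognosis is to fit a trend to recent sensor readings and extrapolate it to an alarm threshold. The sketch below assumes a straight-line trend, which is a simplification because real degradation often accelerates and is noisy. The function name, readings and threshold are hypothetical.

```python
def estimate_rul(times, readings, failure_threshold):
    """Extrapolate a least-squares straight line through the readings to the
    failure threshold; return the time remaining after the last reading,
    or None if the readings show no upward trend."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(readings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, readings))
    if sxx == 0 or sxy <= 0:
        return None  # not enough spread in time, or no upward trend
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    time_at_threshold = (failure_threshold - intercept) / slope
    return max(time_at_threshold - times[-1], 0.0)


# Hypothetical vibration readings (mm/s) logged once a day.
days = [0, 1, 2, 3, 4]
vibration = [2.0, 2.3, 2.9, 3.2, 3.8]
rul = estimate_rul(days, vibration, failure_threshold=7.0)
print(f"estimated RUL: {rul:.1f} days")
```

A straight-line fit will tend to overestimate the RUL when degradation is accelerating, so a production system would normally use a model matched to the failure mechanism.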
Oil-debris analysis often relies on taking samples from magnetic chip detectors or filters, and vibration may be captured using handheld portable equipment. If a fixed temperature sensor is not fitted, it may be possible to measure temperature using a thermal camera or portable touch temperature sensor. In all cases, the sampling equates to an inspection and, according to the RCM process, needs to be done at intervals of no more than half of the P-F interval.
If we think about the failure rate of bearings, we may only experience one or two failure events a year across an inventory of hundreds of machines with bearings. To ensure we capture all of the failure events, we must sample each bearing at half of the P-F interval. If we assume the P-F interval is two months, then for fifty machines, each with two bearings, we would have to sample twelve hundred times a year to detect two or three failures. This is why on-condition maintenance might, at face value, seem so wasteful.
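The arithmetic behind that figure can be checked in a few lines. The function and parameter names are illustrative; the numbers are those used in the example above.

```python
def inspections_per_year(n_machines, bearings_per_machine,
                         pf_interval_months, fraction_of_pf=0.5):
    """Samples needed per year when every bearing is sampled at a fixed
    fraction of the P-F interval."""
    interval_months = pf_interval_months * fraction_of_pf
    samples_per_bearing = 12 / interval_months
    return n_machines * bearings_per_machine * samples_per_bearing


total = inspections_per_year(n_machines=50, bearings_per_machine=2,
                             pf_interval_months=2)
print(f"{total:.0f} samples per year")
for failures in (2, 3):
    print(f"{failures} failures found: {total / failures:.0f} samples per failure")
```

Two months halved gives monthly sampling, so each of the one hundred bearings is sampled twelve times a year.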
A common mistake made is to extend the periodicity of inspections, without reference to the P-F intervals. This implies only a fraction of the true failure events may be detected, and some of those may not provide enough RUL to be beneficial. This may result in on-condition maintenance becoming a cost, with little or no benefit.
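A rough way to see the effect of stretching the interval is sketched below. It assumes evenly spaced inspections, failure initiation that is equally likely at any moment, and detection that is certain once the failure is detectable. Under those assumptions, the chance that an inspection falls inside the P-F window is the ratio of the P-F interval to the inspection interval, capped at one.

```python
def detection_probability(inspection_interval, pf_interval):
    """Chance that at least one inspection falls inside the P-F window."""
    return min(1.0, pf_interval / inspection_interval)


def worst_case_rul(inspection_interval, pf_interval):
    """Least RUL left at detection when detection is guaranteed."""
    return max(pf_interval - inspection_interval, 0.0)


PF_MONTHS = 2.0  # illustrative P-F interval
for interval in (1.0, 2.0, 3.0, 6.0):
    p = detection_probability(interval, PF_MONTHS)
    rul = worst_case_rul(interval, PF_MONTHS)
    print(f"every {interval:.0f} months: {p:.0%} detected, "
          f"worst-case RUL {rul:.0f} months")
```

Inspecting at half of the P-F interval guarantees detection with at least half of the P-F interval left as RUL; once the interval exceeds the P-F interval, some failures will pass through undetected.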
The vibration and oil debris sampling regime might be too expensive, but the maintenance supervisor should also consider the benefit of having an experienced maintenance person or suitably trained, experienced operator carry out walk-downs of all the plant on a shift or daily basis. The five human senses coupled with an experienced human brain are very powerful on-condition sensing devices. Although the RUL may be quite small when detecting a bearing failure by touching the bearing housing, it may be enough to avoid functional failure consequences. Walk-downs are not usually captured or formalised in a CMMS, but they are a vitally important part of the overall maintenance regime and are recognised as best practice among most experienced maintenance engineers.
Cost justification for on-condition maintenance
Unlike the long-term degradation, there are no smart techniques to reduce inspections or sampling tasks in Incipient Failure Detection, except to adopt predictive maintenance. In order to justify the cost of frequently sampling or inspecting we need to conduct a cost-benefit analysis which should be done as part of the RCM process. Recording such an analysis and showing the benefit may be used to defend the inspection regime from those who may mistakenly think it is wasteful.
We need to capture the following data for the analysis:
- The P-F interval and possible variation of the P-F
- The failure rate for the particular failure modes being addressed
- The cost of the inspections, samples, and their analysis
- The costs of incurring unexpected functional failure and unplanned corrective action
- The costs of avoiding functional failure consequences, but including planned corrective action
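The data items above could be combined along the following lines. This is a minimal sketch, not a complete business case: the function name and every cost figure are hypothetical, and the detection probability is an added assumption standing in for the effect of the P-F interval and its variation.

```python
def annual_costs(inspections_per_year, cost_per_inspection,
                 failures_per_year, cost_unplanned, cost_planned,
                 detection_probability=1.0):
    """Return (run-to-failure cost, on-condition cost) per year."""
    run_to_failure = failures_per_year * cost_unplanned
    caught = failures_per_year * detection_probability
    missed = failures_per_year - caught
    on_condition = (inspections_per_year * cost_per_inspection
                    + caught * cost_planned      # planned corrective action
                    + missed * cost_unplanned)   # failures that slip through
    return run_to_failure, on_condition


# Hypothetical figures, for illustration only.
rtf, oc = annual_costs(inspections_per_year=1200, cost_per_inspection=40,
                       failures_per_year=2, cost_unplanned=100_000,
                       cost_planned=10_000)
print(f"run to failure: {rtf:,.0f} per year")
print(f"on-condition:   {oc:,.0f} per year")
print(f"net benefit:    {rtf - oc:,.0f} per year")
```

With these made-up numbers, twelve hundred inspections that find only two failures still cost far less than the two unplanned failures they avoid, which is the point the analysis needs to demonstrate with real data.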
A spreadsheet could easily be used to determine the cost benefits. If a Weibull distribution and the current ages of all of the bearings are available, then the failure rate can be predicted over the next few years in a simple discrete-event Monte Carlo simulation using Python and some specialised libraries. This simulation should only take a few lines of Python script.
Where new equipment is involved, or there is no existing data on failure rates or the P-F interval, the RCM process uses the knowledge of the most experienced maintainers and operators to glean these data, as these people know their machinery best. Manufacturers’ or industry data may also be used to provide initial values. As data is subsequently gathered, the business case may be re-evaluated.
It would also be prudent to think about the quality of the on-condition diagnosis. Sometimes potential failures may be missed, and at other times a failure may be diagnosed where none actually exists (commonly known as a ‘false positive’). A false positive may result in nugatory corrective work.
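A simplified sketch of such a simulation is shown below. It uses only the Python standard library rather than a specialised reliability package, and it assumes a two-parameter Weibull distribution, replacement with a new bearing immediately after each failure, and ages measured in years. The fleet ages and Weibull parameters are hypothetical.

```python
import math
import random


def sample_failure_age(current_age, eta, beta, rng):
    """Draw the age at failure for a bearing that has already survived
    to current_age (conditional two-parameter Weibull)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return eta * ((current_age / eta) ** beta - math.log(u)) ** (1.0 / beta)


def simulate_failures(ages, eta, beta, horizon_years, n_runs=2000, seed=1):
    """Average number of failures in each future year across the fleet."""
    rng = random.Random(seed)
    totals = [0] * horizon_years
    for _ in range(n_runs):
        for age in ages:
            clock, current_age = 0.0, age
            while True:
                clock += sample_failure_age(current_age, eta, beta, rng) - current_age
                if clock >= horizon_years:
                    break
                totals[int(clock)] += 1
                current_age = 0.0  # failed bearing replaced with a new one
    return [total / n_runs for total in totals]


# Hypothetical fleet: current bearing ages in years, assumed eta and beta.
fleet_ages = [0.5, 2.0, 3.5, 5.0, 7.5]
for year, rate in enumerate(simulate_failures(fleet_ages, eta=12.0, beta=2.5,
                                              horizon_years=3), start=1):
    print(f"year {year}: {rate:.2f} expected failures")
```

The expected failures per year from a run like this would feed directly into the failure-rate input of the cost-benefit analysis.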
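Because real failures are rare compared with the number of inspections, even a small false-positive rate can produce more false alarms than true ones. The sketch below illustrates this with hypothetical rates. It treats each real failure as one inspection opportunity, which is a simplification.

```python
def alarm_breakdown(inspections, true_failures, sensitivity, false_positive_rate):
    """Expected true alarms, missed failures and false alarms per year."""
    true_alarms = true_failures * sensitivity
    missed = true_failures - true_alarms
    healthy_inspections = inspections - true_failures
    false_alarms = healthy_inspections * false_positive_rate
    return true_alarms, missed, false_alarms


# Hypothetical diagnostic quality: 90% of failures caught, 1% false positives.
true_alarms, missed, false_alarms = alarm_breakdown(
    inspections=1200, true_failures=2, sensitivity=0.9, false_positive_rate=0.01)
print(f"true alarms:  {true_alarms:.1f}")
print(f"missed:       {missed:.1f}")
print(f"false alarms: {false_alarms:.1f}")
```

The cost of the nugatory work from false alarms, and of the failures that are missed, belongs in the cost-benefit analysis alongside the cost of the inspections themselves.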
If the business case holds, and there is a benefit, then we should accept that the high number of inspections per failure detected is a natural, normal and acceptable consequence of conducting maintenance.
It may also be possible to use this business case to support investment in a predictive maintenance system with new sensors fitted, where the initial investment can be set against the vastly reduced costs of the inspections themselves.
In summary, we have broken down on-condition maintenance into two basic models and described why the ratio of inspections to detected failures is high. We have explained some techniques by which inspections can be reduced and outlined how on-condition maintenance may be cost-justified. The cost justification should be integral to the RCM process, to demonstrate that its requirement for ‘practicality and cost-effectiveness’ is met. If RCM is done correctly, then the many inspections for a few failures are justified.
Please let us know if you have tried RCM and found the process to be too onerous. Have you seen the periodicities of inspections changed for cost reasons, with no technical linkage to the driving principles of the P-F curve?
In the next blog, we will start a series looking at maintenance metrics and waste. We received feedback asking for this after we introduced the taxonomy of maintenance task types. One of the reasons it is necessary is that effectiveness and efficiency differ for each task type, so we need to measure different things. In the first post of the series, we will look at metrics for Scheduled Renewal or Replacement, and Scheduled Restoration maintenance.