Part 1. The underlying metrics for scheduled replacement or restoration type tasks.
In previous blogs, we defined a breakdown of maintenance task types and later on discussed how Weibull analysis helped us identify some of the conditions for applying the types of task. We will build on those foundations.
This blog starts to look at metrics for each maintenance task type in terms of effectiveness and timeliness. Each maintenance task type requires its own metrics to measure its applicability, effectiveness and efficiency. Where a metric is needed, we mark it with METRIC: followed by a short explanation.
The taxonomy of maintenance task types we described in a previous blog is this:
We are going to look at scheduled restoration and replacement type tasks in this thread (replacement is sometimes termed scheduled discard).
First, we are going to look at the pre-requisites for applying scheduled replacement tasks.
- The failure pattern needs to be age-related, with a Weibull shape > 1.3 and preferably > 2.6. With an age-related failure characteristic, there is a period of useful life where the probability of failure is low, before the probability of failure increases substantially with increasing age.
- Where “1.3 < shape < 2.6”, it is a ‘weak’ age characteristic and there may be an option to check the practicality and apply an on-condition task.
- Is the task practical and cost effective?
We discussed the Weibull methods used in Reliability engineering in a previous blog.
This immediately identifies the first possible metric.
METRIC: Are we doing the correct type of task? Is there a correlation between the Weibull shape and the type of maintenance we apply, for all maintenance tasks?
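One way to start on this metric, sketched here in Python, is to check every task currently run as a scheduled replacement against the shape thresholds listed above. The record layout, function names and example values are hypothetical; only the 1.3 and 2.6 thresholds come from the text.

```python
def age_characteristic(shape: float) -> str:
    """Band a fitted Weibull shape using the thresholds given in the text."""
    if shape > 2.6:
        return "strong"
    if shape > 1.3:
        return "weak"
    return "none"


def share_supported_by_shape(shapes_by_failure_mode: dict[str, float]) -> float:
    """Fraction of scheduled replacement tasks whose failure mode has shape > 1.3."""
    if not shapes_by_failure_mode:
        raise ValueError("no scheduled replacement tasks supplied")
    supported = sum(
        1
        for shape in shapes_by_failure_mode.values()
        if age_characteristic(shape) != "none"
    )
    return supported / len(shapes_by_failure_mode)


# Hypothetical failure modes currently covered by scheduled replacement tasks,
# each with the Weibull shape fitted from its failure history.
scheduled_replacements = {
    "pump seal wear": 3.1,
    "drive belt fatigue": 1.8,
    "controller card fault": 0.9,
    "bearing wear": 2.7,
}

share = share_supported_by_shape(scheduled_replacements)
print(f"{share:.0%} of scheduled replacements have an age-related failure pattern")
```

Tasks that fall in the "none" band are candidates for review, and those in the "weak" band are the ones where an on-condition task may be worth checking.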
The next important consideration is for each candidate failure mode whether the consequences of failure are classified as:
- Safety, Health, Environmental or Legislative (SHEL)
- Operational
- Economic
- Minor significance
We will mostly focus on operational or economic consequences, as this is where maintenance organisations have the most leverage, while recognising that safety is paramount.
If the failure consequences fall into the SHEL bracket, the aim of the associated scheduled tasks will be eliminating or reducing unplanned failure events to an acceptable minimum probability. This implies that the cost optimisation method discussed below may not be applicable.
With some safety-critical parts, the optimal time for changing out components may be determined by using a fraction of their validated useful life. Examples include the rotating components in gas turbines or landing gear in aircraft. This is where the probability of failure can be demonstrated to be lower than a set acceptable level. Safety Integrity Level (SIL) policies/standards that apply in many industries are used to determine what the acceptable limits are. Further information about these processes can be found here.
If the failure consequences are classed as operational or economic, we can look at two methods for determining periodicity involving Weibull analysis.
Note: As an aside, using MTBF to define periodicity without knowing the underlying distribution of failure events can lead to costly mistakes. Treating MTBF as a replacement interval implicitly assumes a constant failure rate, which is not the case for an age-related failure mode.
The first, simpler method is to use the Weibull Cumulative Distribution Function graph where the shape is greater than 1.3 and we are looking to apply scheduled replacement. We can choose an acceptable probability of failure, read off the corresponding age from the graph, and use this as the periodicity. It is common to talk about the “B-xx” age using Weibull, where B-20 is the age by which 20% of the population of components is expected to have failed. Without a cost optimisation capability, an organisation may choose the B-10 or B-20 age as the periodicity for scheduled replacements, depending on the severity of the operational or economic failure consequences. The figure below illustrates the use:
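Reading a B-xx age off the graph is equivalent to inverting the Weibull cumulative distribution function. The sketch below does this for a two-parameter Weibull and, to illustrate the MTBF note above, also shows what fraction of the population would be expected to have failed by the mean life. The shape and scale values are hypothetical, chosen only for illustration.

```python
import math


def weibull_cdf(age: float, shape: float, scale: float) -> float:
    """Probability that a component has failed by the given age."""
    return 1.0 - math.exp(-((age / scale) ** shape))


def b_life(fraction_failed: float, shape: float, scale: float) -> float:
    """Age by which the given fraction of the population is expected to fail."""
    if not 0.0 < fraction_failed < 1.0:
        raise ValueError("fraction_failed must be between 0 and 1")
    return scale * (-math.log(1.0 - fraction_failed)) ** (1.0 / shape)


# Hypothetical age-related failure mode: shape 3.0, scale 10,000 operating hours.
shape, scale = 3.0, 10_000.0

b10 = b_life(0.10, shape, scale)
b20 = b_life(0.20, shape, scale)

# Mean life of a two-parameter Weibull: scale * Gamma(1 + 1/shape).
mean_life = scale * math.gamma(1.0 + 1.0 / shape)
failed_by_mean_life = weibull_cdf(mean_life, shape, scale)

print(f"B-10 age: {b10:,.0f} h, B-20 age: {b20:,.0f} h")
print(f"Mean life: {mean_life:,.0f} h, fraction failed by then: {failed_by_mean_life:.0%}")
```

With these example parameters, around half of the population would be expected to fail before reaching the mean life, which is why the mean on its own is a poor choice of replacement periodicity for an age-related failure mode.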
The second method is preferable as it optimises maintenance through-life costs using simple simulation. This approach samples the Weibull distribution for random failure ages and compares each one against different scheduled replacement periodicities. If the sampled age exceeds the periodicity, the simulation generates a scheduled replacement event; if not, it generates a failure event. The simulation aggregates and summarises a number of runs, calculating the cost difference between preventive and corrective maintenance over a population of the same components. The costs to take into account are:
- Material and logistics costs
- Labour costs
- Any lost production times associated with unplanned recovery.
- Any costs associated with lower product quality caused by an unplanned failure.
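A minimal sketch of this kind of simulation is shown below, using only the Python standard library. It is one possible reading of the approach described above, not a definitive implementation: the Weibull parameters and the two cost figures are hypothetical, and each cost is assumed to be a single total that already bundles the items listed above. The measure compared across periodicities is the cost per operating hour.

```python
import random


def cost_per_hour(period, shape, scale, planned_cost, failure_cost,
                  runs=20_000, seed=1):
    """Simulated cost per operating hour for one replacement periodicity."""
    # The same seed is reused for every periodicity so that all candidates are
    # compared against the same sampled failure ages.
    rng = random.Random(seed)
    total_cost = 0.0
    total_hours = 0.0
    for _ in range(runs):
        # random.weibullvariate takes the scale first, then the shape.
        failure_age = rng.weibullvariate(scale, shape)
        if failure_age > period:
            # Component survived to the periodicity: scheduled replacement.
            total_cost += planned_cost
            total_hours += period
        else:
            # Component failed first: unplanned corrective event.
            total_cost += failure_cost
            total_hours += failure_age
    return total_cost / total_hours


# Hypothetical inputs, for illustration only.
SHAPE, SCALE = 3.0, 10_000.0   # Weibull shape and scale (operating hours)
PLANNED_COST = 1_000.0         # material, logistics and labour, planned
FAILURE_COST = 5_000.0         # as above plus lost production and quality costs

periods = range(1_000, 20_001, 1_000)
costs = {
    period: cost_per_hour(period, SHAPE, SCALE, PLANNED_COST, FAILURE_COST)
    for period in periods
}
best_period = min(costs, key=costs.get)

print(f"Lowest simulated cost per hour at {best_period:,} h")
```

The longest periodicity in the sweep is effectively run-to-failure, so comparing its cost per hour with the minimum gives a rough indication of what the scheduled task is worth.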
The simulation needs to account for the timings for recovery from an unplanned failure. If the recovery is urgent because the failure has severe operational consequences, then lost production time or reduced product quality is likely. If the failure is economic, then it may be possible to wait until the next planned outage to conduct the corrective maintenance. In this case, operational consequences are minimal. The type of corrective maintenance should be termed “Immediate (unplanned) Corrective” or “Deferred (planned) Corrective”. We can see these two types of corrective maintenance in the taxonomy diagram above. An example of a failure with economic impact only may be the recovery of standby or redundant machinery.
In these optimisation calculations, it is usual practice to not include lost production times associated with planned replacements or deferred corrective, because:
- the planned outages have already been financially accounted for in the larger operational and maintenance plans for the whole site
- the planned outage combines many tasks besides the replacement or deferred corrective tasks we are interested in. Lumping all of the outage time costs on one item distorts the calculation.
Note: if planned outage time is over- or under-estimated for the work that needs to be done, then corrections need to be applied to the planned outage time, and budget forecasting may be skewed.
These considerations result in the need for the following metrics:
METRIC: Have we classified the failure mode consequences adequately between SHEL, Operational, Economic and low significance classes, and determined periodicity appropriately? These classifications are usually identified in Failure Modes & Effects Analysis (FMEA).
METRIC: Are planned outage times and their periodicity optimal, and are the outage times fully utilised?
METRIC: What is the percentage of immediate versus deferred recoveries?
METRIC: What is the variance of planned and corrective (immediate and deferred) tasks, and are the right figures included in maintenance planning estimations? (Some systems predefine maintenance in templates or task lists.)
Labour costs may be higher for unplanned failures because troubleshooting to isolate the fault condition and repairing secondary damage caused by the primary failure are both likely.
If the results from the simulation were plotted, you would see a cost optimisation chart similar to the following:
We run the simulation through a series of PM periodicities (ages). Where these are short, the cost of scheduled removals is high because more of them are done. The cost of unplanned removals may also be relatively high at this point because over-maintaining a component is very likely to cause premature failures. Other texts showing optimisation of scheduled maintenance costs often overlook the effects of premature failure. As the periodicity increases, the scheduled removals grow less frequent and their costs fall, while the cost of unplanned failure bottoms out and then increases. The optimal periodicity is where the sum of the planned and unplanned costs is at a minimum, indicated by dotted lines in the diagram above.
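As a cross-check on a simulated curve, the classic age-replacement cost model gives the long-run cost per unit of operating time in closed form. This is the standard textbook model, not necessarily the calculation behind the chart above, and it does not include premature failures caused by over-maintenance, so its unplanned cost term simply rises with the periodicity. The symbols are notation introduced for this sketch: C_p is the cost of a planned replacement, C_f the cost of an unplanned failure, T the periodicity, and β and η the Weibull shape and scale.

```latex
C(T) = \frac{C_p\,R(T) + C_f\,\bigl(1 - R(T)\bigr)}{\int_0^T R(t)\,dt},
\qquad
R(t) = \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]
```

The numerator is the expected cost of one replacement cycle and the denominator is the expected length of that cycle. The optimal periodicity is the T that minimises C(T); a minimum at a finite T can be expected when β > 1 and C_f > C_p.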
Getting a mathematically optimal periodicity is not the end of the story; we now need to consider ‘Maintenance Packaging’. We need to round the periodicity up or down to an underlying maintenance cadence, so that this task, when scheduled in a plan, will coincide with other related tasks. This allows planned outages and packages of work to be sensibly scheduled. We can illustrate this using a simple example of an engine and turbocharger.
An engine may be swapped for new at a designated half-life of a mobile asset. When the engine is changed, so is the turbocharger. The turbocharger is also changed out by itself, more frequently than the engine. The engine periodicity should be an integer multiple of the turbocharger periodicity. If the engine change-out periodicity is 20,000 operating hours, then the turbocharger periodicity needs to be 10,000 or 5,000 hours so that wasted work is minimised. If our simulation studies indicate that the turbocharger’s mathematical optimum is 11,506 hours, and the asset’s most frequent common periodicity across every maintenance task for the whole asset is 1,000 hours, then the options to consider are to extend the engine and the turbocharger to 22,000 and 11,000 hours respectively or, alternatively, to keep them both at their previous periodicities.
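The packaging check in this example can be sketched in a few lines. The function names are hypothetical, and the choice to list the cadence multiples on either side of the optimum, leaving the decision to the planner, is an assumption of this sketch.

```python
def cadence_candidates(optimal_hours: float, cadence_hours: int) -> tuple[int, int]:
    """Cadence multiples just below (or at) and just above the optimum."""
    lower = int(optimal_hours // cadence_hours) * cadence_hours
    lower = max(cadence_hours, lower)  # never shorter than one cadence
    return lower, lower + cadence_hours


def is_packaged(parent_hours: int, child_hours: int) -> bool:
    """True when the parent periodicity is an integer multiple of the child's."""
    return parent_hours % child_hours == 0


# Figures from the engine and turbocharger example in the text.
turbo_lower, turbo_upper = cadence_candidates(11_506, 1_000)

options = [
    (20_000, 10_000),               # keep the previous periodicities
    (2 * turbo_lower, turbo_lower), # extend both, keeping the 2:1 ratio
    (20_000, turbo_lower),          # extend the turbocharger only
]
for engine_hours, turbo_hours in options:
    status = "packaged" if is_packaged(engine_hours, turbo_hours) else "misaligned"
    print(f"engine {engine_hours:,} h / turbocharger {turbo_hours:,} h: {status}")
```

The third option shows the problem being avoided: an 11,000 hour turbocharger task does not line up with a 20,000 hour engine change, so one of the turbocharger replacements would be largely wasted.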
METRIC: Are all task periodicities packaged so they are in line with planned outage cadences and the periodicities of associated components? The underlying set of periodicities should align to minimise through-life costs.
Where the consequences of failure are insignificant then we should classify this finding against the part in our FMEA, to show that it has been considered. This may seem obvious, but it does provide an explicit indication that we should not expend any further resources or time analysing these parts unless there is a major change made either to the asset or its use.
Let us assume that the simulation we have described has been completed and that the optimal periodicity for planned removals or restoration has been calculated and applied before we proceed to the next stage. We should now look at the timeliness of the scheduled tasks.
Timeliness of scheduled tasks
We can look at two dimensions:
- Whether the task was completed early, on time or late.
- What was the state of the component at the replacement event, had it failed or was it still operational?
This is best viewed as a matrix:
Taking each quadrant in the matrix as numbered in the figure above:
1. Possible premature failure. A small number of early failures is to be expected, given that time to failure is a distribution whose tails never reach absolute zero. However, if the number of events starts to creep up, we should use the Crow-AMSAA technique to determine whether we are observing a growing quality problem. Then, if the severity of the failure warrants it, conduct root-cause analysis (RCA) to determine a preventable set of causes. If we look up the probability of failure at the periodicity age on a Weibull cumulative distribution graph, we can expect that small percentage of components to fail before replacement. If the observed percentage of failures grows above this figure, coupled with a change in the reliability trend seen in the Crow-AMSAA plot, then it is likely we are experiencing a new development of premature failure. The cost of premature failures is the loss of economic life, an increased rate of corrective maintenance and possible lost production.
2. Avoidable unplanned failure. This is where the component has failed after its age has exceeded its periodicity time, and the planned replacement is too late. The cost is the difference between an unplanned and planned removal event. This is a process violation. Sometimes it is necessary to defer planned maintenance, but the costs and risks should be balanced when making these decisions.
3. Wasted Economic Life. This is where a working component has been changed before the due periodicity. The cost is a proportion of the discarded economic life of the changed component, and if this practice of early change-outs is continued it increases the through-life costs with too much work being done. A known cause of premature failure is over-maintenance, and so there may be a relationship here that may be found during RCA.
4. Gambling. This is where the planned replacement is late, on an operational component. This might be thought of as a win, as we have squeezed some more economic life from this particular component. However, if we trust our optimisation, then taken over a fleet of components throughout their lives it is more likely that we will lose out and increase through-life costs. Betting against the house inevitably loses.
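The Crow-AMSAA check mentioned under quadrant 1 can be sketched as follows. The model describes cumulative failures as N(t) = λt^β, where t is cumulative operating time; a growth exponent β above 1 means failures are arriving at an increasing rate. Note that this β is not the Weibull shape of the component. The fit below is a simple least-squares line through the log-log plot (maximum likelihood estimation is a common alternative), and the event times are hypothetical.

```python
import math


def crow_amsaa_fit(failure_times):
    """Least-squares fit of log N = log(lam) + beta * log(t).

    failure_times are cumulative operating times at each failure event.
    Returns (lam, beta).
    """
    times = sorted(failure_times)
    if len(times) < 2 or times[0] <= 0 or times[0] == times[-1]:
        raise ValueError("need at least two distinct positive failure times")
    xs = [math.log(t) for t in times]
    ys = [math.log(count) for count in range(1, len(times) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    lam = math.exp(mean_y - beta * mean_x)
    return lam, beta


# Hypothetical cumulative fleet hours at each premature failure event.
# The gaps between events shrink, so failures are becoming more frequent.
premature_failure_hours = [1000, 1800, 2400, 2900, 3300, 3650, 3950, 4200]

growth_lam, growth_beta = crow_amsaa_fit(premature_failure_hours)
trend = "deteriorating" if growth_beta > 1.0 else "stable or improving"
print(f"Crow-AMSAA beta = {growth_beta:.2f} ({trend})")
```

In practice the fit would be rerun as new events arrive, and a change in the slope is the signal to look for a new cause of premature failure.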
METRIC: Scheduled Replacement timeliness shall be measured and converted to costs deviating from optimal cost by applying the matrix rules.
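A minimal sketch of the matrix rules is shown below. It assigns each replacement event to a quadrant from the component's age at removal, its periodicity and whether it had failed. The "on time" tolerance band, the rule that a failed component is never counted as on time, and the function name are assumptions made for this sketch.

```python
def classify_replacement(age_at_removal: float, periodicity: float,
                         failed: bool, tolerance: float = 0.05) -> str:
    """Assign a replacement event to a quadrant of the timeliness matrix.

    tolerance is the fractional deviation from the periodicity that still
    counts as on time for an operational component (an assumed 5% here).
    """
    deviation = (age_at_removal - periodicity) / periodicity
    if failed:
        # A failure is classified purely by which side of the due age it fell.
        if deviation < 0:
            return "1: possible premature failure"
        return "2: avoidable unplanned failure"
    if abs(deviation) <= tolerance:
        return "on time"
    if deviation < 0:
        return "3: wasted economic life"
    return "4: gambling"


# Hypothetical events for a task with a 10,000 hour periodicity:
# (age at removal in hours, whether the component had failed).
events = [(6_200, True), (11_400, True), (7_500, False),
          (10_200, False), (12_800, False)]

for age, failed in events:
    print(f"{age:>6,} h, failed={failed}: "
          f"{classify_replacement(age, 10_000, failed)}")
```

Once each event has a quadrant, the cost rules described for that quadrant can be applied, for example the difference between an unplanned and a planned removal for quadrant 2.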
Weighting the cost function
The calculation of the cost of process variation from the matrix can also be refined. With a simple proportional rule, a task executed within 5% of its timeliness target is penalised at the same rate per unit of deviation as one executed 50% away from it. This method of apportioning cost is not optimal; there should be a graduated scale of cost versus deviation. A sigmoid function provides a means of weighting the cost of variance so that the penalty is small close to the optimum and approaches its maximum as the variance grows. This is illustrated below.
The point where x equals zero on the x-axis is equivalent to executing the maintenance at its optimised periodicity. The sigmoid function can be applied to weight the cost of variation; an explanation can be found here.
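One possible form of this weighting is sketched below. It applies a logistic (sigmoid) curve to the absolute fractional deviation from the optimised periodicity, so that early and late execution are treated the same. The symmetric treatment, the midpoint and the steepness are all assumptions of this sketch and would need tuning against real cost data.

```python
import math


def sigmoid_weight(deviation: float, midpoint: float = 0.25,
                   steepness: float = 20.0) -> float:
    """Weight between 0 and 1 for a fractional deviation from the optimum.

    deviation is (actual age - periodicity) / periodicity, so 0.0 means the
    task was executed exactly at its optimised periodicity. midpoint is the
    deviation at which half of the full penalty applies.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (abs(deviation) - midpoint)))


def weighted_penalty(full_penalty: float, deviation: float) -> float:
    """Cost attributed to a timeliness deviation after sigmoid weighting."""
    return full_penalty * sigmoid_weight(deviation)


# Hypothetical full penalty of 1,000 cost units for a badly mistimed task.
for deviation in (0.0, 0.05, 0.25, 0.50):
    print(f"deviation {deviation:4.0%}: penalty "
          f"{weighted_penalty(1_000.0, deviation):7.1f}")
```

A small deviation attracts only a small share of the full penalty, the share rises steeply around the midpoint, and it levels off near the full penalty for large deviations.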
Managing too many metrics
We have outlined and suggested a number of low-level metrics for measuring the scheduled replacement or restoration tasks. One risk is that as we cover the whole taxonomy of maintenance tasks, we will generate a significant number of metrics that become difficult to manage. There are two mitigations to this:
- These metrics need to be derived automatically from the maintenance system, without manual effort.
- The metrics need to be implemented in a hierarchical manner with higher levels summarising lower-level metrics we have defined here.
How we summarise and produce a hierarchy of metrics will be covered in the final blog of this thread.
In this blog, we have started to make the case for classifying our maintenance tasks in a CMMS according to the taxonomy in the diagram above. This taxonomy was discussed in greater detail in an earlier blog. The reasons why we need to break down tasks are:
- The periodicities for each type of task are calculated differently
- The metrics for the low-level applicability, effectiveness and efficiency are different for each type of task
- The applicability for tasks also depends on the patterns of failure, as discussed in the previous blog on Weibull analysis.
This blog is part of a thread proposing a number of low-level metrics that should be applied to measuring the applicability, effectiveness and efficiency of scheduled replacement or restoration type maintenance tasks. Please feed back any other metrics you use for different types of maintenance task.
In the next blog, we will look at metrics used for on-condition maintenance, including predictive maintenance.