Part 3, Tying the task metrics to a higher level
In the previous two blogs, we defined low-level metrics for maintenance tasks. One of the critical success factors in deploying metrics is keeping a manageable number of organisational goals, to avoid becoming unfocused or overwhelmed. The metrics we have defined therefore need to be part of a system that feeds higher-level metrics or performance indicators. This blog discusses those higher-level metrics and shows how the task metrics can be linked to them.
Several high-level maintenance metrics are commonly used in industry:
- Safety compliance: the number of accidents, incidents or near misses, and the number of working days lost to injury
- The percentage difference between planned maintenance and unplanned corrective maintenance
- Maintenance compliance: The timeliness of planned maintenance, which may include some measure of deferred maintenance
- A measure of reliability and maintainability: often MTBF and MTTR, though these measurements have disadvantages; we will question the use of mean values below
- Measures of OEE (Operational Equipment Effectiveness): OEE is the product of availability, equipment performance and product quality. Product quality is easy to define in a manufacturing situation, but even in mining, mixing different grades of ore or crushing to a required specification amounts to a quality requirement
- Measures of maintenance resource utilisation and conformance to budgets
What is missing from this list are measures for:
- Whether the right maintenance is being done (as opposed to whether the maintenance is done right)
- Team metrics: for example, does the team have the right mix of skills and experience, and how is the team being developed?
- How complete is the master data for maintenance? This has a direct impact on whether the right maintenance is being done.
We can now show how we should approach calculating these higher-level metrics.
Safety or SHEL Compliance
SHEL stands for Safety, Health, Environmental & Legislative compliance. These are all areas where an organisation cannot afford to make mistakes, because the ultimate consequences can be existential.
Safety is mandatory and non-negotiable and should be considered from two broad perspectives:
- Operational safety, covering day-to-day working safety. Common metrics include recording and trending the number of reported accidents, incidents or near misses, lost time due to accidents, and injury incidence rates
- Functional safety, where there needs to be an understanding of which assumptions, and possibly which maintenance tasks, are linked to the intrinsic safety of an asset or machine. For example, with rotating machinery overspeed checks must be done; for pressure vessels relief safety systems must be periodically tested; and fire detection and suppression systems may have statutory tests. The maintenance master data must contain all of this information, with the associated equipment tagged to prevent changes such as altering test periodicities, which may be mandated by law, standards or organisational requirements. Maintenance managers must also be aware that planned maintenance on safety-implicated equipment must not be deferred and needs to be completed on time
Besides health and safety, environmental and legislative compliance should also be included. Pollution and emissions are subject to growing statutory requirements that must be complied with, and emissions reporting is mandatory in some countries. Maintenance has a huge part to play in meeting these requirements.
Safety culture is based on a willingness to treat mistakes as learning opportunities, where any person can be confident in reporting their own mistakes without being denigrated or instantly punished. Obviously, gross negligence or deliberate acts must have disciplinary consequences, but the ethos needs to be learning from mistakes rather than expending energy on covering them up.
Safety training and awareness for staff should be measured and tracked, as should attendance at safety briefs where previous accidents, incidents and near misses are analysed and discussed so lessons can be drawn.
How complete is the master data?
We hinted in a previous blog that most CMMS systems are not designed to contain much of the data we require to manage maintenance in a principled and scientific manner. The CMMS typically holds:
- Equipment registers to identify equipment and their physical position
- Maintenance tasks and their scheduling data
- Plans for maintenance
- Estimates for levels of effort, material and times taken for doing maintenance
- A history of when maintenance was done, how long it took, and what resources were necessary
- Budgetary and cost data.
We also need other data, including:
- Machinery criticality. Several factors may be taken into account, but the major factors for maintenance are the likelihood or frequency of failure, the impact or consequences of failure, and the detectability of failures as they occur. These three basic factors are multiplied in an FMEA Risk Priority Number (RPN) to determine the criticality of possible failures of components
- A justification for the maintenance tasks we do. The data recorded in FMEAs and RCM, along with age analysis, provides this justification
- Root cause analysis data, which also forms the basis of defect elimination
- Documents and drawing data
- Data used in and derived from On-condition maintenance or Predictive maintenance
- Failure recording, including how the failure presents (sudden, intermittent or gradual) and what symptoms, if any, are available
- Reliability, availability and maintainability data, derived using several methods including Weibull analysis and FMEA
- Effects and consequences of failure (FMEA data). Each functional failure should be classified on a scale covering SHEL (Safety, Health, Environmental & Legislative), operational, economic or low significance
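As a rough sketch of how the FMEA RPN combines likelihood, consequence and detectability, consider the following. The 1-10 scales, threshold check and example failure modes are illustrative assumptions, not figures from any real asset:

```python
# Illustrative FMEA Risk Priority Number (RPN) calculation.
# RPN = severity x occurrence x detection, each scored on an assumed 1-10 scale.

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the Risk Priority Number for one failure mode."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("scores must be in the range 1-10")
    return severity * occurrence * detection

# Hypothetical failure modes: (name, severity, occurrence, detection)
failure_modes = [
    ("bearing seizure", 8, 3, 4),
    ("seal leak", 4, 6, 2),
]

for name, s, o, d in failure_modes:
    print(name, rpn(s, o, d))
```

Ranking failure modes by RPN then gives a simple priority order for maintenance attention.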
In a previous blog, we covered the possible use of graph type data management solutions that suit the nature of data necessary for maintenance, because the data is rich in relationships.
A mapping is needed between the maintenance tasks in existence or planned and the data necessary to justify each task; the completeness of that data could then be scored on a simple scale.
Is the right maintenance being done?
We have previously shown that the shape parameter in the Weibull analysis is a great indicator to help us decide which maintenance tasks are applicable. The table is reproduced here.
Besides the Weibull shape determinant, further prerequisites are necessary before applying each maintenance task type:
- On-condition:
  - Is there a time delay between the inception of failure and functional failure?
  - Is detection of the failure possible, leaving sufficient time (the P-F interval) to avoid the failure consequences, and ideally to plan and predispose resources for recovery?
  - Is the variance in the P-F interval small enough to be practical?
  - Is the task practical and cost-effective? Cost-effectiveness is a balance between the costs of failure and the costs of inspection, sampling or processing data. The frequency of inspection is a fraction of the P-F interval, and the downside costs of failure are determined by the frequency of failure.
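That cost balance can be sketched roughly as follows, assuming inspections scheduled at half the P-F interval and an assumed annual failure frequency. All the figures are illustrative, not recommendations:

```python
# Sketch: annual inspection cost versus annual expected failure cost for an
# on-condition task. All input figures are illustrative assumptions.

def annual_inspection_cost(pf_interval_days: float,
                           cost_per_inspection: float) -> float:
    """Cost per year of inspecting at half the P-F interval."""
    inspections_per_year = 365.0 / (pf_interval_days / 2.0)
    return inspections_per_year * cost_per_inspection

def annual_failure_cost(failures_per_year: float,
                        cost_per_failure: float) -> float:
    """Expected yearly cost of the failures the inspections could avoid."""
    return failures_per_year * cost_per_failure

insp = annual_inspection_cost(pf_interval_days=60, cost_per_inspection=200)
fail = annual_failure_cost(failures_per_year=0.5, cost_per_failure=40_000)
print(f"inspection {insp:.0f}/yr vs avoided failure {fail:.0f}/yr")
```

If the inspection cost approaches or exceeds the avoided failure cost, the on-condition task fails the cost-effectiveness test.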
- Scheduled replacement or restoration:
  - Does the failure have a wear-out failure pattern?
  - Is the task cost-effective? The costs of avoided failure are a combination of the frequency and impact of failure. The cost of conducting maintenance is the planned change cost plus, if the task is a scheduled replacement, the discarded remaining life of the part. A formula for calculating this may be: (1 − the part’s age at change / the Weibull scale age) × the material cost of the part
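Interpreting that formula as the fraction of the Weibull characteristic life given up by changing the part early, a sketch might look like this (the ages and cost are illustrative assumptions):

```python
# Sketch: cost of the remaining life discarded by a scheduled replacement.
# Interpretation (an assumption): a part changed at age t, with Weibull
# characteristic life eta, gives up a fraction (1 - t/eta) of its life.

def discarded_life_cost(age_at_change: float, weibull_scale: float,
                        material_cost: float) -> float:
    fraction_discarded = max(0.0, 1.0 - age_at_change / weibull_scale)
    return fraction_discarded * material_cost

# Part changed at 6,000 h against a characteristic life of 10,000 h:
print(f"{discarded_life_cost(6_000, 10_000, 500):.0f}")
```

The later the change relative to the characteristic life, the smaller the discarded-life penalty, which is the trade-off the periodicity calculation balances against failure risk.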
The metric is to determine the percentage of active tasks that have the appropriate technical justifications and evidence present, along with the calculations that determine the currently used periodicity.
Is the maintenance being done right? (maintenance compliance)
In previous blogs in this thread, we have discussed the costs of timeliness of the scheduled maintenance. For scheduled replacement or restoration, we used the concept of the timeliness waste matrix.
For on-condition maintenance, we used a confusion matrix for the accuracy of diagnostics and looked at a cone of accuracy over the prognostic remaining useful life. Timeliness of the on-condition maintenance is also important: if inspections or samples are done late (periodicities greater than half the P-F interval), then detection may come too late to plan recovery, or may be missed entirely. This incurs the costs of an avoidable unplanned failure, with the associated disruption.
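As a minimal sketch of the diagnostic confusion-matrix idea, the usual summary measures can be computed from four counts. The counts here are invented for illustration, not from any real dataset:

```python
# Sketch: diagnostic accuracy metrics from a confusion matrix of
# on-condition inspection outcomes. Counts are illustrative assumptions.

tp, fp, fn, tn = 40, 5, 10, 945  # true/false positives and negatives

precision = tp / (tp + fp)              # of raised alarms, fraction correct
recall = tp / (tp + fn)                 # of real failures, fraction detected
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.3f}")
```

Note that with rare failures, raw accuracy looks flattering even when recall is poor, which is why the individual cells of the matrix matter.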
All of these metrics can be arithmetically combined and rolled up into a higher-level maintenance compliance measure.
Reliability and maintainability
Many maintenance systems trend MTBF and MTTR to measure reliability and maintainability.
The Mean Time Between Failures (MTBF) is the average of the times to failure of a repairable component or machine. By repairable we mean that a machine’s broken part can be replaced with a new one, and we assume the newly fitted component has the same intrinsic reliability as the previous one. The Mean Time To Repair (MTTR) is the average of all the times taken to repair or replace the non-functional part.
These measures have the advantage of being relatively simple to calculate and universally well known. However, they have disadvantages and limitations that all maintenance and reliability professionals should be aware of.
Problems with using mean or average measures
Both the MTBF and MTTR metrics use the mean, or average, of the time data they collect, but we need to ask ourselves: is the mean always the most appropriate measure of the ‘central tendency’ or ‘expectation’ of a set of numeric or continuous data? For example, why do government authorities worldwide use the median, instead of the mean, when presenting wages data?
The answer is that the mean is sensitive to outliers and is not robust where the data set is skewed (not uniformly distributed, or not symmetrical about the mean). The median is less sensitive to outliers and more robust in these circumstances.

If we trend an MTBF, we need a window over several previous readings to calculate the average. If the underlying reliability changes and the most recent times to failure differ from the previous set of data, the trend lags before settling down to the new, truer reading. An exponentially weighted moving average could compensate for this, but it complicates the calculation and does not entirely solve the lag problem.

I prefer to use the Crow-AMSAA reliability growth chart, which plots the log of the cumulative age against the log of the cumulative event count. An increasing slope for the latest events compared with the historical events shows worsening reliability (as shown in the figure below); a shallower slope indicates reliability is improving. I believe at least three consecutive similar events or data points need to be recorded before we can have adequate certainty that a new trend is established. This is the quickest way to observe a new trend, compared with the lag involved in a moving-window average calculation.
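Both points can be sketched in a few lines. The repair times and failure ages below are invented for illustration: one outlier drags the mean while the median barely moves, and a least-squares slope on the log-log cumulative plot gives a rough Crow-AMSAA shape estimate:

```python
import math

# Sketch 1: the mean is dragged by a single outlier; the median is not.
repair_hours = [2.0, 2.5, 3.0, 2.2, 48.0]  # one outlier repair (assumed data)
mean = sum(repair_hours) / len(repair_hours)
median = sorted(repair_hours)[len(repair_hours) // 2]
print(f"mean={mean:.1f} h, median={median:.1f} h")

# Sketch 2: Crow-AMSAA assumes cumulative failures N(t) ~ lambda * t**beta,
# so the slope of log N versus log t estimates beta (beta > 1 when the
# failure intensity is increasing, i.e. reliability is worsening).
cumulative_ages = [100, 250, 330, 400, 450, 490]  # age at each failure (assumed)
xs = [math.log(t) for t in cumulative_ages]
ys = [math.log(n + 1) for n in range(len(cumulative_ages))]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
print(f"estimated Crow-AMSAA slope beta={beta:.2f}")
```

With the shortening failure intervals above, the fitted slope comes out above 1, consistent with worsening reliability.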
Some people believe that the MTBF is the age a component will last; this is not the case.
Given an MTBF without any other information, we would need to assume a constant failure rate for the times to failure. That does not align with the normal (or Gaussian) distribution; it aligns with the exponential distribution, where the mean time to failure is the point at which about 63% of a population of parts will likely have failed.
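The 63% figure follows directly from the exponential CDF, as this quick check shows (the MTBF value itself is an arbitrary illustration):

```python
import math

# For an exponential distribution the CDF is F(t) = 1 - exp(-t / MTBF),
# so at t = MTBF a fraction 1 - 1/e (about 63%) of parts have failed.
mtbf = 5_000.0  # illustrative MTBF, hours
fraction_failed = 1.0 - math.exp(-mtbf / mtbf)
print(f"{fraction_failed:.1%} of parts likely failed by the MTBF")
```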
Because a raw MTBF tells us nothing about the variance of the times to failure, the Maintenance Guru prefers Weibull analysis or Crow-AMSAA. We have discussed the use of Weibull in a previous blog here. A more meaningful measure of reliability for the working maintenance or reliability engineer is the B-20 likelihood of failure, derived from the Weibull Cumulative Distribution Function illustrated in the figure below. The B-20 is the time to failure by which 20% of the population of parts are likely to have failed. The B-20 is also useful because, if the Weibull shape characteristic is small (< 1), the reading accentuates how much worse premature failure is than wear-out failure (Weibull shape > 2.5). This measure helps the reliability engineer focus on eliminating the causes of premature failure.
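Inverting the Weibull CDF F(t) = 1 − exp(−(t/η)^β) gives any B-life in closed form. The characteristic life and shape values below are illustrative assumptions, chosen to contrast premature failure with wear-out:

```python
import math

# B-life from Weibull parameters: solve F(t) = p for t, giving
# t = eta * (-ln(1 - p)) ** (1 / beta).

def b_life(p: float, eta: float, beta: float) -> float:
    """Age by which a fraction p of the population has likely failed."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Same characteristic life, different shapes: premature failure (beta < 1)
# pulls the B-20 sharply earlier than wear-out (beta > 2.5).
for beta in (0.8, 3.0):
    print(f"beta={beta}: B-20 = {b_life(0.20, 10_000, beta):,.0f} h")
```

With the same characteristic life, the premature-failure case reaches its B-20 several times earlier than the wear-out case, which is exactly the accentuation described above.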
Einstein is reputed to have said that everything should be made as simple as possible, but no simpler. The Maintenance Guru believes MTBF is an oversimplification for reliability engineering.
This blog recommends that MTBF and MTTR may be retained where moving to more meaningful metrics would be too hard to sell to those who expect them, but that specialist maintenance and reliability people use Weibull analysis and the B-20. As explained in previous blogs, the MTBF should never be used to calculate the periodicities of scheduled tasks. Using the median for MTTR reduces the impact of outliers and provides a better measure for estimating future maintenance work durations for planning and budgeting.
Operational Equipment Effectiveness (OEE)
OEE is a metric made up of the product of availability, performance and quality. Availability is a measure of the time an asset can be profitably utilised: the scheduled operating time, minus planned outage time, compared against the time the asset is unavailable for productive work for other reasons. Other reasons may include:
- Avoidable delays in the operational cycle
- Unavailability due to unplanned failure (Failures that have an operational effect)
Not all assets can be run profitably 24/7: consider the profitability of a whole fleet of buses running at 4.00 am, when only a small minority of the fleet will still be running. The baseline availability therefore needs to be carefully defined. It may be that the total scheduled hours of profitable utilisation should be compared against the actual number of profitable hours achieved.
Availability is also impacted by inefficiencies in unplanned failure recovery. The availability of resources and logistics delay times may need to be broken down and recorded so that analysis and efficiency improvements may be conducted later.
Performance can be defined by the capacity to do useful work. Capacity may be reduced by the inefficiency of machinery or ageing effects that reduce performance. Predictive maintenance is often applied to measuring machinery performance and efficiency. Fuel or energy costs per unit of output is an effective way of measuring machinery efficiency. Efficiency can also be restored by scheduled restoration tasks.
Quality is the easiest to think about for a production line. Any product that is out of specification and rejected (possibly by a customer) is a low-quality issue that means the work completed and value-added in production is wasted. If a customer rejects a product because of low quality, it may involve loss of reputation and avoidable costs due to compensation.
In some situations, the quality of the product may be difficult to define. In mining, the quality of the ore may be specified, and mixing higher-grade ore with lower grades may be necessary to meet customer specifications. In freight transport, it may equate to the condition of whatever is being carried when it arrives at its destination, after exposure to the environment in transit.
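Putting the three factors together, the OEE product can be sketched as follows. All the percentages are illustrative assumptions:

```python
# Sketch: OEE as the product of availability, performance and quality.
# Input figures are illustrative assumptions, not targets.

def oee(availability: float, performance: float, quality: float) -> float:
    """Operational Equipment Effectiveness as a fraction 0..1."""
    return availability * performance * quality

a = 0.90   # 90% of scheduled profitable time actually available
p = 0.95   # running at 95% of rated capacity
q = 0.99   # 99% of output within specification
print(f"OEE = {oee(a, p, q):.1%}")
```

Note how three individually respectable figures multiply down to a noticeably lower OEE, which is why the product is a more demanding measure than any single factor.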
Maintenance resource utilisation, planning and budgets
Maintenance team utilisation may be split between reactive corrective work and planned preventative work. The aim should be to increase the proportion of planned preventative work, which should decrease the overall maintenance spend.
Over time we are looking to achieve a high utilisation rate with the maintenance team, with as little variance as possible whilst achieving the required safety.
Overtime is a means of dealing with unexpected demands, or maintenance may be subcontracted, especially for major machinery outages such as a mid-life update. Over-reliance on overtime indicates that the workload may be too high, or that work-planning estimates are too low. Overtime should be recorded and trended.
In a mature maintenance management system, we should be able to exploit the statistics derived from Weibull analysis, which provide the probability of failure over the age of components. This is input into RAM (Reliability, Availability, Maintainability) simulation, where a discrete-event Monte Carlo simulation forecasts future failure and maintenance events. If the simulated events are extracted, costs can be assigned, giving a Life-Cycle Cost (LCC) capability. Using simulations is far more accurate than assuming the fixed failure rates used in a spreadsheet. We may discuss the use of RAM and LCC simulations in a future blog.
Another output of RAM modelling is a forecast of maintenance resource demand over time. If the simulation forecast indicates peaks and troughs, planners can proactively bring forward, stagger or defer work to flatten the demand. Actual demand can be trended against forecast demand, with deviations triggering analysis and investigation.
The use of simulations does not provide a crystal ball that accurately predicts, with high certainty, that a single component will fail on a certain date.
As the statistician George Box observed: all models are wrong, but some are useful.
The RAM and LCC simulations should be used to determine likely demand for whole populations of components or assets, split into time buckets (for example, monthly) of the likely number of events in each period. The more granular the population of components and the smaller the time slices, the less accurate the forecast will be. A simple way to find a good trade-off is to simulate the previous year and compare the predictions against the actual events, although past performance is no guarantee of future accuracy.
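A minimal Monte Carlo sketch of such a bucketed forecast, assuming a hypothetical fleet of identical components with Weibull-distributed first lives (no renewals; the population size and Weibull parameters are invented for illustration):

```python
import random

# Sketch: Monte Carlo forecast of monthly failure counts for a population of
# identical components with Weibull lifetimes. Parameters are illustrative
# assumptions, not calibrated data; only first failures are modelled.

def monthly_failure_forecast(population: int, eta_months: float, beta: float,
                             horizon_months: int, runs: int = 2_000,
                             seed: int = 42) -> list[float]:
    """Mean number of failures per monthly bucket over the horizon."""
    rng = random.Random(seed)
    buckets = [0] * horizon_months
    for _ in range(runs):
        for _ in range(population):
            life = rng.weibullvariate(eta_months, beta)  # scale, shape
            if life < horizon_months:
                buckets[int(life)] += 1
    return [count / runs for count in buckets]

forecast = monthly_failure_forecast(population=50, eta_months=36.0,
                                    beta=2.5, horizon_months=12)
print([round(f, 2) for f in forecast])
```

With a wear-out shape (beta > 1), the expected monthly counts rise over the horizon, which is the kind of peak the planners would act to flatten.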
Using the simulations has emergent benefits: it enables sensitivity analysis to determine priorities for improvement, supports what-if scenarios, and the process of running the simulation capability increases knowledge of how the operations and maintenance systems work and interact.
The planning and LCC simulations can be used to help set maintenance budgets and then track spend against the plan and the forecast. Deviations could trigger investigative action to find the underlying cause. The organisation will likely set a tolerance within which the maintenance manager can work.
Spare Parts and inventory management
Spares management and logistics is a separate discipline from maintenance management, but there is a close dependency between the two. Common metrics for spares and inventory management include:
- Trending any stockouts and the resultant delays they cause
- Inventory record accuracy
- Turnover times, to minimise sunk capital invested in static inventory
- Managing lead times for supply, especially when this is associated with planning horizons and the P-F intervals in on-condition maintenance.
In this blog we have presented several commonly used maintenance metrics, highlighted problems and possible improvements, suggested some extra metrics, and shown how to link the lower-level maintenance-task metrics to higher levels. We have again shown the importance of master data, and as we have added further detail to the data, it should reinforce the desirability of the graph-based data management systems we covered in a previous blog.
I would like to hear about your experiences with metrics, anything novel and valuable you have experienced and used. What other metrics do you use in your organisation?
In the next blog, we will cover a useful chart type called the Jack-knife diagram, which plots an asset’s components on a scatter chart based on FMEA RPN or risk criteria. This chart may be used to help prioritise improvement work and complements the metrics we have covered here.