[Part 3] How AI is Changing IoT-Based Predictive Maintenance: Inference, Evaluation, and Optimization

Contributors: Anurag Bhatia, Saurabh

This section is the third—and final—part of our predictive maintenance series. We will quickly go through inference, evaluation, and optimization possibilities.

Outlining Inference

An inference pipeline needs to support both historical (i.e., batch) and live (i.e., online) inference.

Batch inference on historical data (sometimes referred to as back-testing) is needed to decide whether model performance is up to the mark, i.e., good enough to be deployed in production. This typically involves running an evaluation harness to compare results and apply optimizations. The other aspect is production inference: generating predictions (whether and when to alarm) on live data, which is what enables real-time alerting.
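To make this split concrete, here is a minimal Python sketch of an inference entry point that serves both modes. The DummyAnomalyModel stub, its score() interface, the sensor_value column, and the ALARM_THRESHOLD value are illustrative assumptions, not the actual production implementation.

```python
import pandas as pd

ALARM_THRESHOLD = 0.8  # assumed anomaly-score threshold, tuned later during optimization

class DummyAnomalyModel:
    """Stand-in for the trained anomaly detector (assumed interface)."""
    def score(self, df: pd.DataFrame) -> pd.Series:
        # Placeholder: treat the deviation from a nominal value as the "reconstruction error"
        return (df["sensor_value"] - 0.5).abs()

def batch_inference(model, history: pd.DataFrame) -> pd.DataFrame:
    """Back-testing mode: score an entire historical dataset in one go."""
    scores = model.score(history)
    return history.assign(score=scores, alarm=scores > ALARM_THRESHOLD)

def online_inference(model, reading: dict) -> bool:
    """Live mode: score a single incoming sensor reading and decide whether to alert."""
    score = model.score(pd.DataFrame([reading])).iloc[0]
    return bool(score > ALARM_THRESHOLD)

# Example usage with made-up sensor data
model = DummyAnomalyModel()
history = pd.DataFrame({
    "timestamp": pd.date_range("2021-03-01", periods=3, freq="h"),
    "sensor_value": [0.48, 0.52, 1.90],
})
print(batch_inference(model, history))
print(online_inference(model, {"timestamp": "2021-03-01 03:00", "sensor_value": 1.7}))
```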

Building an Evaluation Strategy

This section helps formulate the required metrics and the steps to put the evaluation techniques in place.

Standard metrics

Challenge: Since this is not a typical classification or regression problem, there are no widely agreed-upon, off-the-shelf, industry-wide performance metrics.

Solution: We need to frame our predictive maintenance problem as a classification problem so that we can use the usual model performance metrics.

Steps involved

Step 1: Decide the appropriate length of the lead_time (i.e., the time gap between the timestamp of a model alarm and that of the downtime—if any—after the alarm) depending on the use case involved.

[Figure: A predicted observation]

Implication: If lead_time is too small (e.g., 10 minutes) it is likely not useful since there will not be enough time for maintenance engineers to go out in the field to check and do any necessary repair work.

Step 2: Categorize each alarm (i.e., anomaly) into whether or not it was useful.

True Positive: An alarm that was followed by an actual downtime within the lead_time. These can also be called “correct alarms.”

False Positive: An alarm that was not followed by an actual downtime within the lead_time. These can also be called “false alarms.”

Step 3: Categorize each outage (i.e., downtime) into whether it was caught beforehand by the model prediction (alarm) or not (missed).

True Negative: A case when neither a downtime happened nor was there a corresponding model alarm to begin with.

False Negative: A situation where the downtime did happen, but the model failed to catch it (i.e., no prior alarm went out).

[Figure: Confusion matrix]

Having followed these steps, we can now pick and choose one or more of the following model performance metrics relevant for classification:

  • Confusion matrix
  • Precision
  • Recall
  • F1-score
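
Putting Steps 1 to 3 together, here is a minimal sketch (assuming the alarms and downtimes are available as lists of timestamps) of how the lead_time-based labelling and the resulting precision, recall, and F1-score could be computed. The function name and example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def evaluate_alarms(alarm_times, downtime_times, lead_time=timedelta(days=5)):
    """Label each alarm as correct/false and each downtime as caught/missed,
    then derive precision, recall, and F1-score from those counts."""
    # Step 2: an alarm is a true positive if some downtime follows it within lead_time
    tp = sum(any(a <= d <= a + lead_time for d in downtime_times) for a in alarm_times)
    fp = len(alarm_times) - tp  # false alarms

    # Step 3: a downtime is "caught" if at least one alarm preceded it within lead_time
    caught = sum(any(a <= d <= a + lead_time for a in alarm_times) for d in downtime_times)
    fn = len(downtime_times) - caught  # missed downtimes

    # True negatives are not counted explicitly; they only become well defined
    # once time is discretised into fixed evaluation windows
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = caught / len(downtime_times) if downtime_times else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Example usage with made-up timestamps
alarms = [datetime(2021, 3, 1, 8), datetime(2021, 3, 10, 14)]
downtimes = [datetime(2021, 3, 4, 6)]
print(evaluate_alarms(alarms, downtimes))
```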

Custom metrics

Depending on the use-case and input data, we might need to explore custom metrics. Here is a relevant example from a less technical and more business perspective:

Cost-Benefit Analysis: Without getting into the technicalities of machine learning and its related jargon, the business audience will be more interested in answering the following questions:

a) What exactly is the value add here that comes from the solution being implemented? (e.g., What percentage of the total downtimes in a month was the model able to predict beforehand? Accordingly, how many did it miss?)

Implication: If the model is not catching a large enough proportion of the downtimes, then it is not adding value to begin with.

b) At what cost does this value add come?

To be more specific, how many times does the maintenance engineer need to go out in a day (for example) to respond to every alarm?

Implications:

Even if the model is catching most (or all) of the downtimes but alerting several times a day, the cost may be prohibitive. Also, there is a real danger that, over time, end users will lose faith in the model predictions and stop taking the alarms seriously. Hence, a custom metric like “number of alarms per day (or hour)” becomes relevant in such situations.

Let’s assume the lead_time is 5 days; in that case, the number of alarms per day should be < 0.2 (i.e., 1/5) to be any better than just a random guess.

Reason: Each alarm gets a time slot of five days within which to check whether there was a downtime for that specific machine, and is accordingly called correct or false. This means that one alarm at a regular interval of every five days will indeed catch all downtimes, but that is just a random baseline, and the machine learning model has to be judged in that context.
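
In code, this baseline check could look like the following short sketch; the window length and alarm count are made-up numbers for illustration.

```python
from datetime import timedelta

lead_time = timedelta(days=5)
observation_days = 30   # length of the evaluation window (assumed)
n_alarms = 5            # number of alarms the model raised in that window (assumed)

alarms_per_day = n_alarms / observation_days
# One alarm every lead_time days would catch every downtime "for free"
random_baseline = 1 / lead_time.days

print(f"alarms per day  : {alarms_per_day:.2f}")
print(f"random baseline : {random_baseline:.2f}")
print("better than the random baseline:", alarms_per_day < random_baseline)
```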

Optimization Techniques

The custom metric of “number of alarms per day” is necessary, but not necessarily sufficient in all situations. Even if the overall average number of alarms is low (which is good), the frequency of alarms may be disproportionately high on some days while non-existent on most others:

[Figure: Series of the number of alarms per day (not optimized)]
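
One quick way to spot this pitfall is to look at the full daily series rather than only the average. A small pandas sketch (with hypothetical alarm timestamps) might be:

```python
import pandas as pd

# Hypothetical alarm timestamps, one row per alarm
alarms_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-03-01 02:00", "2021-03-01 05:00", "2021-03-01 09:00",
        "2021-03-12 11:00",
    ])
})

# Count alarms per calendar day (days between the first and last alarm
# with no alarms show up as zero)
daily_counts = alarms_df.set_index("timestamp").resample("D").size()

print("mean alarms/day:", round(daily_counts.mean(), 2))
print("max alarms/day :", int(daily_counts.max()))
# A low mean combined with a high max (or very few active days) signals bursty alarming
```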

Another pitfall to look out for is that the threshold set by the model might be too low and hence alarming a little too often:

[Figure: Observation score and threshold plot]

In such cases, it is important to check the underlying distribution of reconstruction errors in our anomaly detection model. The threshold for alarming is often set by a formula like “mean + 3 × standard deviation”, which assumes that the underlying distribution is normal (Gaussian) in shape, but this assumption may not hold true in some cases. Here is a good example:

[Figure: Error distribution]

In other words, it looks a bit like a Tweedie distribution but is certainly not Gaussian. In such cases, we can either choose another way to arrive at the threshold, or first force the underlying error distribution to become closer to a Gaussian one and then apply the same formula to set the threshold as mentioned above. Here is an example of this threshold optimization approach, achieved by changing the error distribution (i.e., selectively choosing some error values while ignoring others):

[Figure: Distribution transformation]

Since the new distribution is much more (though still not completely) in sync with the assumption used while calculating the threshold, the threshold value and the alarm frequency are now likely to be more meaningful and better optimized.
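
As a toy illustration of both steps (the naive “mean + 3 × standard deviation” threshold and the one recomputed after selectively ignoring the near-zero spike in the errors), here is a minimal sketch. The synthetic error distribution and the cut-off value are assumptions for demonstration only, not the exact transformation used above.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical reconstruction errors: a large spike of near-zero values plus a
# smaller bulk of genuinely non-zero errors (a Tweedie-like, non-Gaussian shape)
near_zero = rng.exponential(scale=0.001, size=9_000)
bulk = rng.normal(loc=0.05, scale=0.02, size=1_000).clip(min=0)
errors = np.concatenate([near_zero, bulk])

# Naive threshold, applying the Gaussian assumption to the whole distribution
naive_threshold = errors.mean() + 3 * errors.std()

# Selectively ignore the near-zero spike (the 0.01 cut-off is an assumption) so the
# remaining errors are closer to Gaussian, then reapply the same formula
kept = errors[errors > 0.01]
optimized_threshold = kept.mean() + 3 * kept.std()

print(f"naive threshold     : {naive_threshold:.4f}")
print(f"optimized threshold : {optimized_threshold:.4f}")
print("alarms with naive threshold     :", int((errors > naive_threshold).sum()))
print("alarms with optimized threshold :", int((errors > optimized_threshold).sum()))
```

With a distribution like this, the naive threshold falls inside the bulk of the non-zero errors (hence frequent alarms), while the recomputed threshold sits further out, which is the behaviour shown in the plots below.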

[Figure: Observation score and threshold plot (after optimization)]
[Figure: Optimized alerts]

Note: Alarm frequency has come down significantly as a result of threshold optimization. In this specific case, there was no adverse impact on the downtimes predicted, while false alarms were reduced significantly.

Conclusion

With the optimization strategy for the IoT-based predictive maintenance model, I am concluding this discussion. We are now at the stage where we have an end-to-end understanding of building a predictive maintenance capability, from data ingestion to failure alerts.
