[Part 2] How AI is Changing the IoT Based Predictive Maintenance: Datasets and ML Models

Contributors: Anurag Bhatia, Saurabh

This is the second part of our predictive maintenance series. In this article, we take a quick look at the kinds of datasets one might have to deal with and the ML models that can be built on them.

A Variety of Datasets

In this section, we break down the datasets by the type of analysis performed on each entity.

Sensor Analysis: The following factors need to be taken into consideration when analyzing sensor behavior:

Total number of sensors

Implication:

  • Can we build sensor-specific models?

Mapping of sensors to specific locations (i.e., issues/groups)

Implications:

  • If the sensor count is too high, think about focusing on group-based modeling.
  • If the group count is also high, think about group-biasing (i.e., prioritize the groups that are more prone to outages).

Are the sensors digital or analog in nature?

Implications:

  • Transitions and latency are important for digital sensors while trend and time-series analysis can be more relevant for analog sensors.
  • For digital sensors, every transition from 0 to 1 and vice-versa can be used for feature extraction.
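
As a rough illustration of the last point, here is a minimal pandas sketch (with hypothetical column names) that counts 0/1 transitions per time window so they can be used as features:

```python
import pandas as pd

# Hypothetical digital sensor readings: 0/1 state sampled every minute
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01 09:00", periods=8, freq="min"),
    "sensor_state": [0, 1, 1, 0, 1, 0, 0, 1],
}).set_index("timestamp")

# A transition is any change from 0 to 1 or from 1 to 0
df["transition"] = df["sensor_state"].diff().abs().fillna(0)

# Aggregate transition counts per 5-minute window as a candidate feature
transitions_per_window = df["transition"].resample("5min").sum()
print(transitions_per_window)
```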

Within each location, which sensors are expected to be in sync with each other?

Implication:

  • Monitoring the lag (if any) between in-sync sensors can be used for building latency models (e.g., an unusual lag at some timestamp could be a red flag for anomalous behavior in that part of the machine).
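
As a hedged sketch, the lag between two supposedly in-sync sensors can be estimated with a simple cross-correlation; the signals and variable names below are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic signals: sensor_b lags sensor_a by 3 samples
rng = np.random.default_rng(0)
sensor_a = rng.normal(size=500)
sensor_b = np.roll(sensor_a, 3) + 0.05 * rng.normal(size=500)

# Cross-correlate over a range of candidate shifts and pick the best-aligned one
lags = np.arange(-10, 11)
corrs = [np.corrcoef(sensor_a, np.roll(sensor_b, -lag))[0, 1] for lag in lags]
estimated_lag = lags[int(np.argmax(corrs))]
print(f"Estimated lag: {estimated_lag} samples")  # expected to be ~3 here
```

A lag that suddenly deviates from its usual value could then be treated as a red flag.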

Do the sensors keep emitting and sending data even when the machine is not being operated (e.g., during the night when the factory is closed)?

Implications:

  • Is the data during non-operational hours relevant for any modeling exercise?
  • What if the working hours start at 9 AM and we have included a 3-hour rolling mean step as part of our feature engineering?
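
For instance, a naive 3-hour rolling mean would pull in stale overnight readings right after the 9 AM start. Here is a minimal pandas sketch (assuming hypothetical column names and 9 AM–6 PM working hours) that restricts the rolling window to operational hours:

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings sampled every 15 minutes across a full day
idx = pd.date_range("2023-01-02 00:00", "2023-01-02 23:45", freq="15min")
df = pd.DataFrame(
    {"sensor_value": np.random.default_rng(1).normal(size=len(idx))}, index=idx
)

# Keep only operational hours (assumed 9 AM to 6 PM) before feature engineering,
# so the rolling window never spans the overnight shutdown period
operational = df.between_time("09:00", "18:00")

# 3-hour rolling mean; min_periods avoids NaNs right after 9 AM, though those
# early values are based on fewer observations
df_feat = operational.rolling("3h", min_periods=1).mean()
print(df_feat.head())
```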

Downtime Analysis: As an initial exercise to understand the data in the context of the business problem, here are some questions worth answering:

What is the long-term trend of downtimes (i.e., outages)? Is the number of downtimes going up/down or consistently staying within a range? Is it too volatile?

Is there any seasonality to the data (e.g., monthly, quarterly, etc.)?

Snapshot: example of seasonality

Implication:

Some months could easily be more vulnerable to machine faults than others. Depending on the industry, there are usually specific periods in a calendar year when machine utilization is at full capacity, while machines are under-utilized at other times (e.g., the summer season for rides in amusement parks, or the harvest season in agriculture). So, the key question from a modeling perspective is: can we apply time-series methods and algorithms (e.g., ARIMA) to build models?
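
As a hedged illustration of that modeling question, a seasonal decomposition or a seasonal ARIMA fit on monthly downtime counts can make the seasonal component explicit; the series and model orders below are synthetic placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly downtime counts over 4 years with a yearly seasonal bump
months = pd.date_range("2019-01-01", periods=48, freq="MS")
rng = np.random.default_rng(42)
downtimes = 10 + 5 * np.sin(2 * np.pi * months.month / 12) + rng.poisson(2, size=48)
series = pd.Series(downtimes, index=months)

# Separate trend, seasonal, and residual components (period=12 for yearly seasonality)
decomposition = seasonal_decompose(series, model="additive", period=12)
print(decomposition.seasonal.head(12))

# A seasonal ARIMA fit; the (p, d, q)(P, D, Q, s) orders here are placeholders
model = SARIMAX(series, order=(1, 0, 0), seasonal_order=(1, 0, 0, 12)).fit(disp=False)
print(model.forecast(steps=6))  # downtime forecast for the next 6 months
```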

Can we ascertain what caused each downtime to begin with? Was it preventable? Or was it due to outside factors beyond our control (e.g., weather)?

Implication: It is important to focus only on those downtimes that can be attributed to maintenance-related issues and thus, could have been detected beforehand.

Are the downtimes spread all over the place? Or are they concentrated in some pockets (i.e., groups) that seem more vulnerable than others (e.g., more frequent downtimes)?

Machine-wise example of hours lost in repair:

Snapshot: sample data description

Implication: Some machines seem more vulnerable to breakdowns than others. Can we apply the 80:20 rule? Should we apply group bias?
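
A quick way to test the 80:20 hypothesis is a Pareto-style cumulative share of repair hours per machine; the numbers below are made up for illustration:

```python
import pandas as pd

# Hypothetical repair hours lost per machine
hours_lost = pd.Series(
    {"M01": 120, "M02": 95, "M03": 40, "M04": 18, "M05": 9, "M06": 6, "M07": 4, "M08": 3}
)

# Cumulative share of total downtime, sorted from worst to best machine
pareto = hours_lost.sort_values(ascending=False).cumsum() / hours_lost.sum()
print(pareto)

# Machines accounting for (roughly) the first 80% of lost hours are modeling priorities
priority_machines = pareto[pareto <= 0.8].index.tolist()
print("Priority machines:", priority_machines)
```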

What is the extent of disruption for different types of downtimes (e.g., work on the machine resumes in < 10 minutes for some problems, while it takes > 3 hours to complete the repair for another set of issues)?

Implication: Assuming the business goal is to reduce downtime hours (thereby preventing revenue loss), it’s better to focus on those problems that are either recurring or take a lot more time to repair and resume operations.

Snapshot: distribution of failures by downtime duration (x-axis), i.e., up-time lost due to faults

Implication: Not all types of faults cost the same amount of up-time. Some faults are often repaired within minutes, while a few others are more likely to take hours. The focus needs to be prioritized accordingly.

Telemetry Analysis: IoT devices often send out telemetry messages that might be stored in another database or accessible through another API. Here are some factors to keep in mind while analyzing telemetry data:

Frequency of emitted messages (e.g., the usual number of messages emitted every hour while the machine is in operation)

Rate of telemetry messages emitted:

count of telemetry messages

Implication: The rate of messages emitted usually stays confined within a range, but it might display spikes on certain occasions. In such cases, it is important to check whether there is a correlation between the time period of those spikes and machine outages (i.e., Can the unusually high number of telemetry messages be considered a signal while building predictive maintenance models?).
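
One hedged way to surface such spikes is to resample the message timestamps into hourly counts and flag the hours that sit far above the typical rate; the column names below are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical telemetry message timestamps over a 3-day period
rng = np.random.default_rng(7)
timestamps = pd.to_datetime("2023-03-01") + pd.to_timedelta(
    rng.integers(0, 72 * 3600, size=5000), unit="s"
)
messages = pd.DataFrame({"ts": timestamps})

# Messages per hour
hourly_rate = messages.set_index("ts").resample("1h").size()

# Flag hours where the rate exceeds, say, mean + 3 standard deviations
threshold = hourly_rate.mean() + 3 * hourly_rate.std()
spikes = hourly_rate[hourly_rate > threshold]
print(spikes)  # candidate periods to cross-check against recorded outages
```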

Carefully analyze the schema of the telemetry message. For example, is there a color-code column in the schema that encodes different problems at the part of the machine where the sensor is located?

What is the nature of the telemetry data?

Implication: Can NLP or text-similarity measures be used for the analysis of the text in telemetry data?
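
If the telemetry payload carries free text (e.g., fault descriptions), a simple TF-IDF plus cosine-similarity pass can group similar messages together; here is a minimal scikit-learn sketch with made-up messages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical free-text fields extracted from telemetry messages
messages = [
    "pressure drop detected in valve 3",
    "sudden pressure loss near valve assembly",
    "temperature spike on bearing housing",
    "bearing temperature above safe limit",
]

# Vectorize the text and compute pairwise cosine similarity
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(messages)
similarity = cosine_similarity(tfidf)

# Messages with high similarity likely describe the same class of problem
print(similarity.round(2))
```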

The Kind of Models We Can Build

Statistical Model (Percentile)

This is a simple statistical model based on the distribution of the aggregated sensor transitions. Two major steps produce the training artifacts that are later used for inference:

  • Create the model from the original aggregated sensor dataset and identify the type of distribution (e.g., normal, chi-square, binomial, Poisson, or Tweedie).
  • Transform the identified distribution toward a normal distribution if it is not normal already. In the example below, the distribution is found to be bimodal and requires a modification to derive the upper-bound threshold value.
Bi-modal distribution on single sensor behavior

The mean of the distribution above gives the middle value, which divides the bimodal distribution into two unimodal distributions. In the figure below, the unimodal distribution on the right is picked and treated as a normally distributed dataset for anomaly detection. The threshold is calculated while keeping precision and recall in balance; its value typically lies two or three standard deviations to the right of the mean of that unimodal distribution.

Sliced unimodal distribution on single sensor behavior
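
Here is a minimal sketch of the thresholding step, assuming the aggregated transition counts for one sensor form a bimodal distribution that is split at its mean, with the right-hand mode treated as approximately normal:

```python
import numpy as np

# Synthetic bimodal aggregate of sensor transitions (two operating regimes)
rng = np.random.default_rng(0)
transitions = np.concatenate([rng.normal(20, 3, 1000), rng.normal(60, 5, 1000)])

# Split the bimodal distribution at its overall mean and keep the right-hand mode
split_point = transitions.mean()
right_mode = transitions[transitions > split_point]

# Upper-bound threshold: two to three standard deviations above the right mode's mean
# (the exact multiplier is tuned to balance precision and recall)
threshold = right_mode.mean() + 3 * right_mode.std()
print(f"Anomaly threshold: {threshold:.1f}")

# At inference time, aggregated counts above the threshold are flagged as anomalous
new_observation = 82.0
print("Anomalous" if new_observation > threshold else "Normal")
```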

Classification Model (XGBoost)

According to the XGBoost documentation, it is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (e.g., Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

In the case of a predictive maintenance XGBoost classifier, strategic groupings of the sensor outputs are used to predict failures. It is not easy to identify the relevant features that correlate the independent variables (sensor data) with the dependent variable (failures).

The strategic groupings are created to expand the feature space and open up new opportunities for feature engineering. Such grouping requires intensive exploratory data analysis at the initial stage, which can later be automated based on results.

Each group is used to train a single model for a single area of the problem. For example, pressure sensors grouped together handle problems related to pressure maintenance; similarly, heat-sensor groups are created to handle heat-maintenance problems. These models raise an alert if they observe anomalous behavior in a particular sensor or across the entire group.

Sensor behavior with predicted and actual failures

The figure above shows a representative aggregated dataset with alarms marked as grey bars and actual failures marked as red bars. If the grey bars start appearing a few days before actual failures, they are considered true positive alarms. Grey bars that are not followed by red bars are considered false positive alarms.
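
Below is a hedged sketch of such a classifier for one sensor group; the synthetic features, labels, and hyperparameters are illustrative placeholders rather than the actual dataset or model configuration:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic grouped features, e.g., aggregated transitions and lags for a pressure group
rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([
    rng.normal(50, 10, n),    # hourly transition count
    rng.normal(0.5, 0.2, n),  # mean lag vs. a paired sensor (seconds)
    rng.normal(30, 5, n),     # rolling-mean sensor reading
])
# Failures are rare; label roughly 5% of windows as leading to a failure
y = (rng.random(n) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# scale_pos_weight compensates for the class imbalance typical of failure data
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=(y_train == 0).sum() / max((y_train == 1).sum(), 1),
    eval_metric="logloss",
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```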

Anomaly Detection Model (FB Prophet)

Prophet is an open-source library for handling time series. According to its documentation, it is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In predictive maintenance case studies, the model can be built on two features for each sensor: the date timestamp (ds) and the sensor output as aggregated transitions (Prophet's y column). We fit the model by instantiating a new Prophet object; any settings for the forecasting procedure are passed into the constructor (e.g., daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True). Then we call its fit method and pass in the historical data frame. Fitting should take 1–5 minutes.
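
Here is a minimal sketch of fitting Prophet to one sensor's aggregated transitions and flagging observations that fall outside the model's uncertainty interval (the ds/y column names come from Prophet's required schema; the data is synthetic):

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Synthetic daily aggregated transition counts for a single sensor
dates = pd.date_range("2022-01-01", periods=365, freq="D")
rng = np.random.default_rng(5)
counts = 100 + 10 * np.sin(2 * np.pi * dates.dayofweek / 7) + rng.normal(0, 3, 365)
df = pd.DataFrame({"ds": dates, "y": counts})

# Fit with the seasonalities enabled explicitly, as described above
model = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True)
model.fit(df)

# Predict over the historical period and flag observations outside the uncertainty band
forecast = model.predict(df[["ds"]])
merged = df.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
merged["anomaly"] = (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
print(merged[merged["anomaly"]].head())
```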

A single-sensor example using the Prophet model

Anomaly Detection Model (Autoencoder)

An encoder is a feedforward, fully connected neural network that compresses the input into a latent-space representation with a reduced dimension. This compressed representation is a distorted version of the original input.

A decoder is also a feedforward network, with a structure similar to the encoder's. It is responsible for reconstructing the input back to its original dimensions from the output generated by the encoder.

Encoder/Decoder architecture
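
Below is a compact, hedged sketch of an encoder/decoder pair in Keras for reconstructing sensor-feature vectors; windows with a reconstruction error far above what was seen during training can be flagged as anomalies (the layer sizes are placeholders):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic "normal" sensor feature vectors (e.g., 20 aggregated features per window)
rng = np.random.default_rng(9)
X_train = rng.normal(size=(5000, 20)).astype("float32")

# Encoder: compress the 20-dimensional input into a 4-dimensional latent representation
inputs = keras.Input(shape=(20,))
encoded = layers.Dense(12, activation="relu")(inputs)
encoded = layers.Dense(4, activation="relu")(encoded)

# Decoder: reconstruct the original 20 dimensions from the latent representation
decoded = layers.Dense(12, activation="relu")(encoded)
decoded = layers.Dense(20, activation="linear")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=10, batch_size=64, verbose=0)

# Reconstruction error on new windows; unusually large errors suggest anomalies
X_new = rng.normal(size=(10, 20)).astype("float32")
errors = np.mean((autoencoder.predict(X_new, verbose=0) - X_new) ** 2, axis=1)
print(errors)
```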

Summary

In summary, most of the datasets consist of sensor transitions (e.g., on/off states), sensor telemetry, and maintenance logs, and the most successful way of building an alerting solution is based on anomaly detection.

What’s Left?

The final part will conclude the build-out of the predictive maintenance capability:

[Part 3] How AI is Changing the IoT Based Predictive Maintenance: Inference, Evaluation, and Optimization
