IoT Learning Algorithms and Predictive Maintenance — Part I: A Thought Experiment

Author: Dr. Taşkın Deniz, Data Scientist/Consultant @ Record Evolution GmbH

Record Evolution
IoT & Data Science
May 28, 2018

--

Summary

The article tackles smart data processing of the Internet of Things (IoT) in a predictive maintenance context and relates this to recent developments in semi-supervised learning. While written with an eye towards a non-expert audience, the article references recent scientific publications. We leave it to the curious and technically oriented reader to expand their knowledge on the ideas we have sketched out (see References). We aim to be informative and open minds to stimulating discussions on IoT data analytics.

We cover the topic of IoT Learning Algorithms and Predictive Maintenance in a series of three articles. In PART I, we present a simple case study and discuss some learning algorithms related to it. In PART II, we focus on IoT data analytics and applications to IoT design for predictive maintenance. In PART III, we review recent literature on semi-supervised learning and compare the foundations of different methods. We introduce real-life cases in which few-shot learning may provide an efficient technique for smart IoT analytics and data streaming.

We don’t claim to be exhaustive — IoT is a vast and flourishing topic. Feel free to comment and start a discussion with us.

Case Study: A Temperature Anomaly

An anomaly is characterized by a deviation from a common rule, a model, a scope of behavior, or a set of conventional items. In statistics, anomaly detection means the “identification of items, events or observations which do not conform to an expected pattern or other items in a data set” [29]. It is important to detect anomalies because they usually mean trouble, such as fraud, a rare disease, or a machine breakdown. In particular, predictive maintenance and industrial IoT address the detection of anomalies that may result in high costs due to production downtime or even lead to potentially fatal events in industrial production processes. The idea is that, with the help of various types of sensors attached to the machine, one can predict a potential breakdown and eventually provide a solution to the issue. In this scenario, the optimal design of IoT — meaning the distribution of storage and computational resources according to the priority and prominence of a measured state — is crucial. This requires a closer look into the properties of the data we deal with in the specific IoT context.

An IoT device, for instance, can observe a process and report these observations via wireless communication. In IoT device design, devices should respond optimally to the tasks they are assigned, without disregarding flexibility [30]. The task can be simple — for example, it may involve temperature measurements from a specific part of a machine. The measurement and data streaming processes require a minimal number of parameters to operate, e.g. sampling and streaming frequencies. Even so, they can produce huge volumes of streamed data. To prevent data accumulation, we need to look at data-streaming priorities (Part II of this series offers some insight on this issue). Such processes require a (preferably automated and adaptive) decision on the priority of data, namely, whether data will be streamed and/or stored temporarily or even permanently. There are two basic questions at this point:

  • What is the measure of data importance?
  • How should this measure be implemented?

To illustrate this issue, we start with a case study in a predictive maintenance context and set the complexity of the real problem aside. In our thought experiment (Gedankenexperiment), we have only a few sensors measuring certain parts of a machine and communicating with a central device or a server. Nevertheless, we are aware that the complexity of a realistic setting brings along more intricate issues. For instance, there can be thousands of wireless devices collaborating in data processing [28]. This is out of the scope of this article, and we will address such issues separately in the future. For the time being, we return to our immensely reduced setting.

Imagine a temperature sensor measuring in degrees Celsius (°C). To economize on streaming, let’s assume that we have implemented an on-off switch: say the sensor reports the temperature only when T < 5 °C or T > 40 °C. This can be managed by two simple ‘if clauses’ in Python. Although we can get away with such an ad hoc filter in our scenario, one could implement an algorithm (e.g. classification) to discover optimal boundaries systematically. All in all, the boundaries represent a normal domain. Outside this domain live the outliers. What does an IoT device eventually do with the measurements? It keeps recent measurements in memory and then reports a selected part of them to a server. This selection depends on the filters implemented via a learning process.
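In Python, the ad hoc on-off switch amounts to little more than the two ‘if clauses’ just mentioned. A minimal sketch, assuming the illustrative 5–40 °C normal domain (in a real deployment the bounds would be learned, not hard-coded):

```python
# Boundaries of the illustrative normal domain, in °C.
LOW_C, HIGH_C = 5.0, 40.0

def should_stream(temperature_c):
    """Report a reading only when it falls outside the normal domain."""
    if temperature_c < LOW_C:
        return True
    if temperature_c > HIGH_C:
        return True
    return False

readings = [21.3, 4.2, 36.8, 41.5]
reported = [t for t in readings if should_stream(t)]  # [4.2, 41.5]
```

Only the two out-of-domain values are streamed; everything inside the normal domain stays in local memory.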

Figure 1. Anomalies can be detected or missed by the central device, depending on the ad hoc filters implemented. There are two basic kinds of relationships between anomalies: temporal and spatial. Temporal means how two anomalies relate to each other in time. Spatial means how similar two anomalies are, hence, whether they belong to the same cluster.

Before we proceed with the specifics of the case study, we want to give a few definitions:

  • state vs. phenomenon: we use state for (a part of) our indirect measurements with sensors (e.g. via Raspberry Pi). Phenomenon is used for an event reported by an external observer or a machine’s native registry (e.g. a PLC).
  • fatal state: a measured state that strongly correlates with interrupted production, i.e. production halted for a considerably long time.
  • healthy state: measured state along with an uninterrupted production that follows a known non-anomalous behavior.
  • unhealthy state: anomalies that occur while the production process continues uninterrupted but indicate that a fatal state may follow.
  • ‘need-to-be-mended’ state: this is an unhealthy state that leads to localization of the issue and a call for maintenance.
  • d-anomaly: an anomaly detected by the streaming filters and reported to the central device.
  • u-anomaly: an anomaly undetected by the streaming filters and hence not reported to the central device.
  • local: in our context, this refers to an item or process in a factory or a production bench environment.
  • central: refers to computational resources or a model residing outside the production context, possibly combining several examples of similar processes. Indicates also a larger capacity, e.g. CPU, memory.

First and foremost, we should monitor and model the healthy state (a long period in a normal production process in Figure 1) of an industrial machine. As a result, when the state turns unhealthy, we can detect and localize the problem. In this case, we train a machine learning model with the whole mass of measured and stored data from a normal production process. The model runs on a central device and is trained to guarantee accuracy and robustness. This means that the trained model can detect an unhealthy state (anomaly detection). Thus, in the training phase, the streaming economy is not applicable. (Yet in the test phase, our filters can save time, storage resources, and energy.) Now, assume that we have an independent mechanism that reports unhealthy or fatal phenomena right on the spot (e.g. an external observer such as an engineer, or a machine registry). These reports can be used as state labels. By definition, predictive maintenance attempts to forecast a ‘need-to-be-mended’ state before production is interrupted (the fatal state).

Second, we would like to characterize a whole family of anomalous cases and learn them. What do we mean by characterization? We imagine an industrial process with controlled temperature changes. Let’s say the industrial machine has a cooling control system that is activated to stabilize the temperature. This simply helps it persist in a healthy state. Control systems often follow a mechanistic control model (a feedback controller such as a PID [32]) and handle a whole range of cases with the same model. Still, the resulting transient waveforms can vary immensely from state to state. In this example, we may characterize these waveforms.

In general, there are basic properties we would like to quantify:

  • Priority: How urgent the issue is for the maintenance process. We can quantify this with correlation to a fatal state and mean expected time of occurrence.
  • Importance: How important the issue is for the production process. We can quantify this externally, e.g. by the cost it creates for the production process.
  • Frequency: How often one encounters an event that disturbs the production process. The quantification of this is self-explanatory.

Say a fatal state is detected by an external observer following the observation of an anomaly characterized by relatively fast, large fluctuations in temperature (the red anomaly in Figure 1). We call it a d-anomaly, indicating that it has been detected by our filters. Data characterizing the d-anomaly is sent to the central model. In a few examples preceding the anomaly, we may observe a slow increase of fluctuation amplitude in a frequency band (e.g. w = 0.1 Hz). We call this a u-anomaly, meaning an undetected anomaly. These anomaly types are treated separately for two reasons. First, a u-anomaly is by definition out of the scope of our trained central model because of the streaming filters. Second, the event does not necessarily cause any harm by itself, but it may be a precursor of an incoming event such as the fatal state or another anomaly.

In general, the temperature may have a stochastic nature in a dynamic environment (an ongoing production cycle), so it generally fluctuates. These fluctuations in a healthy state can be analyzed and characterized in a training phase. In the detection or test phase, one can search for anomalies. Here we imagine anomalies reflected as certain transient waveforms in the temperature trace (as shown in Figure 2). We can detect a given waveform, such as slow oscillations or a ramp, by wavelet analysis [31]. We can quantify two basic relationships between waveforms of different kinds: temporal and spatial. By temporal we mean how two anomalies are correlated in time, and by spatial we mean which waveform cluster they belong to. By quantifying the temporal and spatial aspects of anomalies and by tagging them, we simply help the maintenance planning. Several d-anomalies can belong to different clusters in wavelet space. This may also be the case for u-anomalies.
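As a rough stand-in for a full wavelet analysis, the growth of spectral power in a characteristic band can already be estimated with a naive discrete Fourier transform. The traces, the sampling rate, and the 0.1 Hz band below are all made up for illustration:

```python
import cmath
import math

def band_power(signal, fs, f_lo, f_hi):
    """Naive DFT estimate of the power in the band [f_lo, f_hi] Hz."""
    n = len(signal)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            coef = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                       for t, x in enumerate(signal))
            power += abs(coef) ** 2 / n
    return power

fs = 1.0  # hypothetical sampling rate: one sample per second
ts = range(200)
# Healthy trace: small fluctuations away from the band of interest.
healthy = [0.1 * math.sin(2 * math.pi * 0.25 * t) for t in ts]
# u-anomaly: a slow oscillation near 0.1 Hz rides on top of the healthy trace.
unhealthy = [h + 0.5 * math.sin(2 * math.pi * 0.1 * t) for h, t in zip(healthy, ts)]
```

Comparing `band_power(unhealthy, fs, 0.08, 0.12)` with the same quantity for the healthy trace shows a large power increase around 0.1 Hz, which is exactly the kind of signature a streaming filter could watch for.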

On the other hand, anomalies may change teams while the game is on. The categories may vary as a result of incoming data and a learning process. A u-anomaly occurs when the temperature is within the nominally safe interval. This means that the central model is blind to possible anomaly correlations unless the local device learns a new condition to report the previous u-anomaly. In other words, u and d will be changing (u → d) as the filters evolve. Such an evolution involves the evaluation of a measured state. In anomalous cases, it may be necessary to report the temperature retroactively once the fatal state is detected. Here we can update the streaming filters by adding another streaming condition. The purpose of the updated filter (on-off switch) is to learn the new type of anomaly (the u-anomaly in Figure 1) — which may be correlated with another anomaly (the d-anomaly in Figure 1) or a fatal state.

Figure 2. Waveforms associated with anomalies can vary. Different colors indicate various combinations of priority, importance, and frequency. We talk about the detection of these waves rather heuristically. In practice, the optimal wavelet transform is not known in advance for each case. This can be part of the learning process: an optimal selection of wavelets to characterize one sub-class of waveforms.

Third, we want to establish an interaction between local and central models. This means that we would like to employ the local devices, beyond the passive reporter role, as smart devices that can evaluate the priority of the measured data they report. This procedure can be implemented in such a way that the central model helps the IoT device learn an updated streaming filter that detects the relationship between measured anomalies. This filter may simply report a certain increase in spectral power in a characteristic band (e.g. w = 0.1 Hz) or a new event in the wavelet space, detected with the help of a wavelet transform and, implicitly, an anomaly detection algorithm [31]. This process can be followed recursively (symbolically represented in Figure 3) to set new boundaries.
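One way to picture this local/central interaction is a filter that is simply a growing list of report conditions, to which the central model can push newly learned rules. A minimal sketch; the class, the window-based conditions, and the thresholds are all hypothetical:

```python
# Each condition inspects a short window of recent measurements; the device
# streams the window if any condition fires.
class StreamingFilter:
    def __init__(self):
        # Initial ad hoc condition: latest reading outside the 5-40 °C domain.
        self.conditions = [lambda w: w[-1] < 5.0 or w[-1] > 40.0]

    def should_report(self, window):
        return any(cond(window) for cond in self.conditions)

    def add_condition(self, cond):
        """Install a rule learned centrally, e.g. a band-power threshold."""
        self.conditions.append(cond)

f = StreamingFilter()
window = [20.0, 21.0, 22.0]
before = f.should_report(window)  # False: within the normal domain

# The central model has correlated a fast local rise with a later fatal
# state and pushes down a new (illustrative) condition:
f.add_condition(lambda w: max(w) - min(w) > 1.5)
after = f.should_report(window)   # True: the former u-anomaly is now reported
```

The same window that was silently discarded before the update is reported afterwards, which is the u → d transition described above.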

All in all, this is an immensely reduced picture concerning a single aspect of the IoT challenge. The goal is to reach an optimal distribution of tasks across the levels of an IoT hierarchy (a simplified picture is given in Figure 4). Some tasks need local, urgent, and important hard real-time computing, e.g. fatal anomaly detection. Other tasks, such as waveform classification, can be soft real-time computing. A more general model can learn all anomalies from different contexts on a much slower time scale to provide a benchmark.

We haven’t mentioned the details of the term learning yet. In a general IoT scenario, who reports to whom and who learns what at which accuracy level is an important question. Even in this simple scenario, we can implement learning to detect optimal parameters (old filter, new filter, how local and global models interact, etc.). Besides, the parameters can evolve as the production process continues. This means, for instance, the new filter is constantly modified and hopefully optimized to obtain more accurate results.

In a realistic scenario, there will be several other devices and several other types of anomalies [8]. Besides, a given machine can operate in several different production modes. Additionally, we may detect certain anomalies specific to a physical observable (pressure or temperature). Moreover, we may even have the same machine operating in the same production mode but in a different production environment. Heterogeneity in this sense is an issue we don’t address at this point; it will remain an aspect of a general IoT ecosystem. Although structural heterogeneity is ignored here, there may be several different anomalous states resulting from the production process which occur with different frequencies (some of which may not be frequent enough to be classified).

This is our central claim: we need dedicated machine learning models (e.g. few-shot learning) to detect anomalies, and, if possible, learn them recursively. We should advance our techniques to implement this locally (for the price of inaccuracy) in a faster time-scale while global models (more accurate and robust) evolve in a slower time-scale and provide a benchmark. (Note that IoT aspects and machine learning techniques regarding this topic will be covered in PART II and PART III of this article series).

Figure 3. Recursively improving the anomaly detection model

Imagine now a network of such measurements, locally implemented anomaly detection algorithms, and eventually the training of a model on a computing device via integration of all the events into this central model (e.g. a deep neural network running on a resourceful central machine or in the Cloud). The model will be updated and regularized to evaluate known anomalies. In case there is another anomaly of a similar sort, we can apply a similar procedure: events are evaluated on a time scale close to the inverse measurement frequency, while learning happens on a time scale a few orders of magnitude larger. We seek a hierarchy of time scales that can be reflected in the task distribution.

In summary, this problem has a few stages which are time-scale separable:

1. Anomaly Detection: e.g. a fan is broken and the temperature rises; this is detected by analyzing measurements from temperature sensors. The detection of this anomaly can be performed by local resources.

2. Local Learning: simply updating the streaming filter via learning in an IoT device, for instance by reporting anomalies, including u-anomalies, in the previously ignored domain.

3. Global Learning: Stereotyping the anomalies from similar observations via learning in the central model. This is where we will need current machine learning models such as few-shot learning.

4. The updated central model will be available for local devices to inherit. Transfer learning techniques can be implemented at this point.

Figure 4. Figure adapted from review article [13]. The figure shows a classical view of model and data streams. A task is called hard real-time if all its deadlines must be respected; otherwise, a critical failure occurs in the system. In a soft real-time task, nothing catastrophic happens if a deadline is missed. Hard and soft real-time analyses can be improved via methods of few-shot learning.

A Note on Anomaly Detection Algorithms

“In data mining, an anomaly is defined as events or observations which don’t fit the scope of defined normal behavior or statistical model. In an uncorrelated noisy signal modeled by a Gaussian process, one can determine anomalous amplitudes via a ‘normality’ margin, meaning that, for instance, what is beyond 5-sigma can be an outlier/anomaly. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.” [29]

In machine learning, there are several types of anomaly detection algorithms:

  1. Supervised Anomaly Detection describes a setup where the data consists of fully labeled training and test datasets. The main difficulty is that anomalous classes are often rare and hence the classes are strongly imbalanced. Moreover, this setup is practically not very relevant, as anomalies are not known in advance or may occur spontaneously as novelties during the test phase. [24]
  2. Semi-supervised Anomaly Detection also uses training and test datasets, but the training data consists only of normal data without any anomalies. The basic idea is that a model of the normal class is learned and anomalies can be detected afterward by their deviation from that model. [24]
  3. Unsupervised Anomaly Detection is the most flexible setup which does not require any labels. The idea is that an unsupervised anomaly detection algorithm scores the data solely based on the intrinsic properties of the data set. Typically, distances or densities are used to give an estimation of what is normal and what is an outlier. [24]
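A toy version of the second setup, under the simplifying assumption that the healthy data is roughly Gaussian: fit a mean and standard deviation on anomaly-free training data, then flag test points that deviate by more than an (illustrative) three-sigma margin.

```python
import statistics

def fit_normal_model(train):
    """Learn the normal class from anomaly-free training data."""
    return statistics.mean(train), statistics.stdev(train)

def is_anomaly(value, model, n_sigma=3.0):
    """Flag a test point that deviates too far from the learned normal model."""
    mu, sigma = model
    return abs(value - mu) > n_sigma * sigma

# Training data contains healthy-state temperatures only (hypothetical values).
train = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 19.7, 20.1]
model = fit_normal_model(train)
# A reading near the healthy regime passes; a hot one is flagged.
```

Here `is_anomaly(20.4, model)` is False while `is_anomaly(35.0, model)` is True. Real IoT signals are rarely this well-behaved, which is why the article argues for richer models, but the train-on-normal, flag-deviations structure is the same.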

This topic and the closely related topic “imbalanced data” will be covered in a separate article that is coming soon.

Why Few-shot Learning Algorithms?

We believe that semi-supervised anomaly detection is the way to go because of the structural properties of real IoT data sets (the imbalance between the known and the unknown). Unsupervised learning is always a solution when labeled data isn’t available. However, one should exploit transfer learning [12, 13, 20], because the healthy state is rather stereotypical and is modeled by an abundance of data, meaning that we will have a plethora of labeled states. This can simply help detect less common states (which may or may not pose a threat) and characterize them (what do they look like?). Later on, such new states can be learned, and hence the boundaries of our universe of known anomalies can be extended. We believe few-shot learning is the method that captures the specifics of the problem at hand (a more technical review of few-shot learning will follow in Part III).

One can see the situation clearly in the picture below. By characterizing the known domain (dogs, common or rare) we can detect unknown behavior (fox or raccoon) and see if it poses a danger to our production process (fox eats chickens).

Figure 5. Underrepresented categories lead to the few-shot learning problem [19]. There are common breeds of dogs and an intelligent machine would have no difficulty distinguishing them. The machine can be trained to learn examples of rare breeds with few-shot learning methods. On the other hand, if you breed rare birds in your garden as a hobby and if a fox enters your garden now and then you would want to know, wouldn’t you? The consequences of ignoring an anomaly can be enormous.

Why do we need few-shot learning in a predictive maintenance context?

  • In predictive maintenance, the healthy state provides enough examples to train a model. A robust model of the normal state is crucial for anomaly detection. States can be associated with observed phenomena by an engineer or machine registry.
  • For successful training, a good balance between the parameter set and the dataset size is typically necessary. This is often not the case when only a few examples of a class are provided. We need smart regularization techniques or comparable tricks to deal with this imbalance [1–7].
  • Anomalies are often novel and rare states. Nevertheless, some anomalies may provide enough examples to characterize them. The logical order is as follows: a new example => an imbalanced training set => few-shot learning.

Few-shot learning is a powerful semi-supervised learning method that can be adapted to IoT analytics. The main idea is that we develop a central network model that copies itself to given devices and identifies new categories. The central model’s classes will be updated with the few examples detected by the edge device and will be reused in ongoing or newly installed IoT devices.

In summary:

  1. Central Model runs in the cloud or a central resource: Long training history, mid-term fine-tuning time scale.
  2. Copy of the Central Model runs in the local device: The replica (possibly a thinned version) of the Central Model will be implemented in local devices.
  3. Local computations will be performed in an IoT device so that the device will send the evaluated new information to the central resources.
  4. All new info will be integrated into the model via few-shot learning.
  5. The cycle is complete when models are updated.
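The cycle above can be made concrete with a toy nearest-prototype classifier in the spirit of prototypical networks [6]: each class is summarized by the mean of its examples, and a new anomaly class can be installed from just a few shots. The one-dimensional “features”, labels, and values are made up for illustration:

```python
class PrototypeClassifier:
    """Classify by distance to per-class prototypes (class means)."""

    def __init__(self):
        self.prototypes = {}

    def add_class(self, label, examples):
        # A new class needs only a few examples to form a prototype.
        self.prototypes[label] = sum(examples) / len(examples)

    def classify(self, x):
        # Nearest prototype wins.
        return min(self.prototypes, key=lambda lbl: abs(x - self.prototypes[lbl]))

clf = PrototypeClassifier()
clf.add_class("healthy", [20.1, 19.9, 20.0, 20.2])  # abundant healthy data
clf.add_class("fan-failure", [34.5, 36.0])          # few-shot anomaly class
label_a = clf.classify(20.3)  # "healthy"
label_b = clf.classify(35.2)  # "fan-failure"
```

Real prototypical networks compute prototypes in a learned embedding space rather than on raw scalars, but the few-shot mechanic, averaging a handful of examples into a new class representative, is the same.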

Conclusion

In general, inheritance or transfer of knowledge is important for a smart IoT to work efficiently. Unfiltered data reported continuously would not only consume storage resources but also generate network traffic and create noise in the model. Big volumes of IoT data collected randomly from different locations, disregarding all space and time correlations, don’t necessarily lead to knowledge. We need to consider local IoT data analytics and, in particular, few-shot learning, because a decision (e.g. replace the fan) may at times have to be made with very few examples [1–7, 9–12].

To reach a good level of automated predictive maintenance, we need to exploit the structural properties of IoT while patching together the peculiarities of a given context. It is crucial to select the best possible algorithm to run on a smart IoT device or a fog computing device for a given smart data processing setting.

The topic of this article, applying state-of-the-art machine learning models to the IoT setting for smart data analytics and streaming, has been the concern of many recent articles [8, 13, 19]. A recent example shows that a trained central model can be inherited, in a less accurate form, to implement urgent local classification tasks [25]. There are advances in this direction: meta-learning has been implemented on three levels to handle few-shot learning; the framework combines a Concept Generator, a Meta-learner, and a Concept Discriminator to enhance the performance of deep learning [26]. Ultimately, what we need is to detect and learn ‘stereotypical’ anomalies. However, state-of-the-art machine learning is not yet practically applicable in such anomalous cases. In a recent article, the authors comment: “Despite recent advances, memory-augmented deep neural networks are still limited when it comes to life-long and one-shot learning, especially in remembering rare events.” [25]

Many important questions regarding IoT data analytics and corresponding machine learning algorithms are left unanswered in this article. What are the properties of different data sources? What is the capacity of different resources in the IoT hierarchy? How can the data be stored and in what format? What are the recent developments in machine learning with respect to the IoT problem? How do different algorithms compare? What are open problems? Stay tuned to find out more about these topics (coming soon).

Note: This article is part of a three-piece article series. See Part II and Part III here:

IoT Learning and Predictive Maintenance — Part II

IoT Learning and Predictive Maintenance — Part III

References

1. Article: Generative Adversarial Residual Pairwise Networks for One-shot Learning

2. Article: Efficient K-Shot Learning with Regularized Deep Networks

3. Article: Optimization as a Model for Few-shot Learning

4. Article: Low-shot Visual Recognition by Shrinking and Hallucinating Features

5. Article: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

6. Article: Prototypical Networks for Few-shot Learning

7. Article: Meta-SGD: Learning to Learn Quickly for Few-Shot Learning

8. Article: Deep Learning for IoT Big Data and Streaming Analytics: A Survey

9. Article: Meta-learning a Dynamical Language Model

10. Article: Low-shot Learning with Large-scale Diffusion

11. Article: Meta-learning for Semi-supervised Few-shot Classification

12. Article: Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications

13. Article: Machine Learning for the Internet of Things Data Analysis: A Survey

14. Article: Matching Networks for One-shot Learning

15. Article: One-shot Learning with a Hierarchical Nonparametric Bayesian Model

16. Article: Metric Learning with Adaptive Density Discrimination

17. Article: Siamese Neural Networks for One-shot Image Recognition

18. Article: A Survey on Metric Learning for Feature Vectors and Structured Data

19. Blog by Thomas Wolf: Meta-learning

20. Blog by Tassilo Klein: Deep Few-shot Learning

21. Blog by Abhinav Khushraj: IoT-based Predictive Maintenance

22. Blog: The Value That IoT Brings

23. Book: Pattern Recognition and Machine Learning

24. Article: A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data

25. Article: Imitation Networks: Few-shot Learning of Neural Networks from Scratch

26. Article: Learning to Remember Rare Events

27. Article: Make SVM Great Again With Siamese Kernel For Few-shot Learning (authors undisclosed)

28. Presentation: WAMP (Web Application Messaging Protocol)

29. Wiki: Anomalies in Statistics, the Definition Given by Wikipedia

30. Article: The Internet of Things: New Interoperability, Management, and Security Challenges

31. Books: Foundations of Time-Frequency Analysis; An Introduction to Wavelet Analysis

32. Wiki: Intro to PID Controllers
