Predictive Maintenance (Alarm Prediction) Methodologies in Telecom Domain — 1

Halil Ertan
Jan 13, 2020



Predictive maintenance implementations in the telecom network domain are much more challenging than in other domains, since many sub-systems talk to each other in the background and the cause of a failure can be related to any of them. Moreover, you will have data sources of different kinds and formats, which makes preprocessing the data a tough process. To be honest, the most challenging part of the whole process is turning the different raw data sources into a Kaggle-like dataset.

In my previous writing, I briefly addressed the key points. In this writing, I focus on the possible methodologies for predictive maintenance studies in the telecom domain. The datasets you are supposed to use shape the methodology of your study. Other parameters that help determine the methodology are the following. What are the requirements: do you just predict an alarm, or are you supposed to give a list of related alarms that have an effect on the target alarm? What is the definition of an alarm: are the device name and cause code enough, or is additional information like the shelf/slot/port of the device required? Another important factor is the prediction range. The frequency with which the alarm you intend to predict occurs is also a point you should take into consideration.

I will condense all of the above-mentioned issues into two general use cases (scenarios). The scope of this writing is the first scenario; I will cover the second scenario in a different writing.

When I mention an alarm type, I refer to alarms with a unique cause code. For instance, predicting ‘DEVICE IS OFFLINE’ alarms means predicting alarms that occur on devices with the cause code ‘DEVICE IS OFFLINE’. If you need to predict alarms with more than one cause code, I suggest developing a separate model for each cause code, since alarms with different cause codes mostly have different behavioral patterns. Additionally, throughout this writing an alarm means an alarm on a specific device with a specific cause code; an alarm on a different device with the same cause code is a different alarm.

There may be many alarms occurring on the same device with the same cause code within the same time unit of the study. For instance, if you receive the data hourly and predict for the next hour, your time unit is one hour. Any alarms that occur on the same device with the same cause code in the same hour can be assessed as the same and counted as one alarm. You can keep the number of duplicate alarms in an extra column and benefit from it in feature engineering, since one-time alarms and repetitive alarms can mean different things. A compact version of the raw data (after dropping the duplicated alarms) may be more beneficial in terms of complexity and performance; a minimal sketch of this compaction step is shown below.
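As an illustration only, here is a minimal sketch of such a compaction step with pandas. The column names (device, cause_code, timestamp) and the hourly time unit are assumptions made for the example, not a fixed schema.

```python
import pandas as pd

# Assumed raw alarm data with one row per alarm occurrence.
alarms = pd.DataFrame({
    "device": ["dev_1", "dev_1", "dev_2", "dev_1"],
    "cause_code": ["Cause_Code_1", "Cause_Code_1", "Cause_Code_4", "Cause_Code_1"],
    "timestamp": pd.to_datetime([
        "2020-01-13 10:05", "2020-01-13 10:40",
        "2020-01-13 10:15", "2020-01-13 11:20",
    ]),
})

# Truncate timestamps to the chosen time unit (one hour here).
alarms["time_slot"] = alarms["timestamp"].dt.floor("H")

# Collapse duplicates within a time slot and keep their count as an extra column.
compact = (
    alarms.groupby(["device", "cause_code", "time_slot"])
    .size()
    .reset_index(name="total_alarm_number")
)
print(compact)
```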

In the first scenario, you are given several different kinds of data sources and your target is to predict the alarms before they occur. Additionally, you are required to give the cause code, the device name, and an additional branch like the port/slot information of the device in the prediction report. Most probably you will be given a dataset of past alarms and several performance metrics of devices, including the metric values of each port/slot of the devices. These two datasets are of totally different types: alarm data is event-based by nature, whereas performance data can be used directly in a machine learning algorithm without any feature extraction beyond a few basic statistical operations. For this reason, you need to extract features from the alarm data in order to benefit from it in a supervised approach. To make this clear, I depict the alarm and performance data below.

Alarm data format — Table 1

Table 1 is a simplified version of the alarm data. There may be hundreds or thousands of alarms occurring in a prediction time range, and typical alarm data has many more columns on which feature engineering is performed. Some alarm types, like Cause_Code_1 and Cause_Code_4, occur on a slot of the device, while others occur directly on the device. Let's assume you try to predict alarms with the alarm type Cause_Code_1, which occur at t1 and t3. These predictions are expected to be made at t0 and t2 respectively. Moreover, notice that the two alarms at t3 are the same, so you can treat them as only one alarm in the compact version of the data and keep an extra column named total_alarm_number with the value 2.

Performance data format — Table 2

As you can see, the performance data of devices is very straightforward: it includes performance metrics of devices broken down by slot/port.

We come across the fundamental question: what is the model we design going to predict? As the first option, you can define an alarm as consisting of the device name, cause code, and slot/port information. I assume you are developing a separate model for each cause code, so you actually need to predict the device and the slot/port of the device. However, there is a serious drawback to this option: a device may have hundreds of slots/ports, so treating each of these components as a unique alarm means hundreds of thousands of different unique alarms in the system. Furthermore, most of them occur too rarely. A specific alarm can occur on a device every week, but the same kind of alarm may occur on a certain port/slot of that device only once a year. The dataset is already imbalanced, and this approach would decrease the rate of the minority class even further. As the second option, you can design a model that predicts only the device name without making any interpretation about the slot/port. In this manner, the complexity of the entire process is mitigated and the imbalance rate of the classes becomes more reasonable.

What about slot/port prediction if you follow the second option? After predicting the device name, another model can be designed that analyses the outliers in the slot/port data of devices. The performance metrics of each slot/port of the devices may be used as features in this model. You can take advantage of the alarm data for the anomaly detection model as well, since it includes information about the slot/port of devices. Any anomaly detection method like a Gaussian Mixture Model (GMM) or One-Class SVM may be used for outlier detection. All in all, you can think of this as a multi-layer approach: first predict the device on which an alarm is likely to occur, then detect the outlier slot/port of that device. A sketch of this second layer follows.
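Below is a minimal sketch of the second layer using scikit-learn's One-Class SVM on per-slot/port performance metrics. The metric column names and the nu value are assumptions chosen for the example, not recommendations.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Assumed performance metrics per slot/port of the predicted device.
slot_metrics = pd.DataFrame({
    "slot_port": ["0/1", "0/2", "0/3", "0/4"],
    "metric_1": [0.52, 0.49, 0.51, 0.95],
    "metric_2": [10.1, 9.8, 10.3, 25.7],
})

# Scale the metrics so the SVM kernel treats them on comparable ranges.
features = StandardScaler().fit_transform(slot_metrics[["metric_1", "metric_2"]])

# Fit a One-Class SVM; -1 marks slots/ports that look anomalous.
detector = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale")
slot_metrics["is_outlier"] = detector.fit_predict(features)

print(slot_metrics[slot_metrics["is_outlier"] == -1])
```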

In the above-mentioned approach, which consists of two separate models, the first is a predictive model that implements supervised learning, and the second is an anomaly detection model, which is an example of unsupervised learning. For the predictive model, which makes predictions only about devices, any machine learning algorithm like Random Forest, Logistic Regression, LSTM, or Convolutional Neural Networks can be used. Since both alarm data and performance data are time series data, this prediction can be thought of as a highly imbalanced time series classification. For this reason, any features carrying information from previous time slots will increase the accuracy of the model. Additionally, a methodology like upsampling, downsampling, or assigning class weights should be used to handle the class imbalance issue. I will not go into machine learning algorithms and data preprocessing steps in detail; they are out of the scope of this writing.
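To give an idea of how class weights can address the imbalance, here is a minimal sketch with a Random Forest in scikit-learn. The feature matrix X and the binary label y (alarm / no alarm in the next time slot) are synthetic placeholders, not real telecom data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder feature matrix and highly imbalanced labels (about 5% positives).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10))
y = (rng.random(2000) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# class_weight="balanced" re-weights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```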

One of the most crucial parts, which affects the prediction results directly, is feature extraction, especially in the predictive model design. The main step is feature extraction from the alarm data, which is necessary if you intend to implement a supervised learning method using event-based alarm data. Before extracting features from the alarm data, a list of all related alarms for each alarm/fault we aim to predict can be formed. To do this, we collect all alarms from the previous time slots in which the target alarm/fault occurred, and then set a threshold on alarm frequencies so that only the more uncommon alarms are taken into consideration, creating a related alarm list for each alarm/fault. The number of related alarms occurring in a time slot can then be used as a feature.

Let's assume you are creating the above-mentioned feature for alarm A (consisting of a device name and cause code). First, go to the previous time slots in which alarm A occurred in history and collect all alarms in those time slots into a pool. Say the frequency of alarm A is 5 percent, while most of the alarms in the pool have a high frequency and actually behave as noise. As the next step, you can filter out all the alarms with a frequency higher than 15–20%. Then you can select the top 10 or 20 alarms after ordering the remaining alarms in the pool by frequency. This list forms the related alarms of alarm A, and the number of related alarms observed can be a good indicator for the prediction of alarm A. The same approach should be extended to all other alarms for which predictions will be made (a sketch follows below).

Another important step is the detection of important alarm types. Some alarm types (alarms with a specific cause code) tend to occur more frequently than their normal pattern before the alarm we intend to predict occurs. The number of these alarm types can also be used as a feature. Additionally, the number of all alarms occurring on the device, the number of all alarms with the same cause code, and the number of all alarms in the same region may be good indicators. Device names sometimes consist of sub-meaningful components, so parsing the device name can be an alternative as well. Lastly, the number of repeating alarms can be an important sign that should be taken into consideration. All of these features can be extracted from the alarm data.
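As a rough illustration of the related-alarm list construction, here is a minimal pandas sketch. It assumes the compact alarm table from the earlier example; the 20% frequency cut-off and the top-10 selection are example values, not fixed thresholds.

```python
import pandas as pd

def related_alarms(compact, target_device, target_cause,
                   max_freq=0.20, top_n=10):
    """Build a related-alarm list for one target (device, cause code) pair."""
    total_slots = compact["time_slot"].nunique()

    # Frequency of every alarm = share of time slots in which it appears.
    freq = (
        compact.groupby(["device", "cause_code"])["time_slot"]
        .nunique()
        .div(total_slots)
    )

    # Time slots in which the target alarm occurred in history.
    target_mask = (compact["device"] == target_device) & (
        compact["cause_code"] == target_cause
    )
    target_slots = compact.loc[target_mask, "time_slot"].unique()

    # Pool of all other alarms seen in those time slots.
    pool = compact[compact["time_slot"].isin(target_slots) & ~target_mask]
    pool_keys = list(
        pool[["device", "cause_code"]]
        .drop_duplicates()
        .itertuples(index=False, name=None)
    )

    # Drop very frequent (noisy) alarms, then keep the most frequent of the rest.
    pool_freq = freq.loc[pool_keys]
    filtered = pool_freq[pool_freq <= max_freq]
    return filtered.sort_values(ascending=False).head(top_n)
```

The count of these related alarms observed in the current time slot can then be attached to the training row of alarm A as a feature.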

On the other hand, you don't need to perform complex feature extraction operations on the performance data. You can directly use statistical values of the performance metrics like max, min, average, and standard deviation within the prediction time slot. As a final step, you can define a window size and take the sum or average of feature values in the respective window. These can be used as additional features if you are using an algorithm like Logistic Regression or Random Forest, which do not directly address time series data. Since the data is a time series, including time-related features in the supervised learning methodology is generally beneficial.
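Here is a minimal sketch of such windowed statistics with pandas; the metric column name and the window size of 3 time slots are assumptions made for the example.

```python
import pandas as pd

# Assumed hourly performance metrics for one device's slot/port.
perf = pd.DataFrame({
    "time_slot": pd.date_range("2020-01-13", periods=6, freq="H"),
    "metric_1": [0.50, 0.52, 0.47, 0.90, 0.93, 0.49],
})

# Rolling statistics over the previous 3 time slots as extra features.
window = perf["metric_1"].rolling(window=3, min_periods=1)
perf["metric_1_mean"] = window.mean()
perf["metric_1_max"] = window.max()
perf["metric_1_min"] = window.min()
perf["metric_1_std"] = window.std()

print(perf)
```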

After all these stages, you will have a training set like the one depicted in the table below for the first predictive model.

Trainset format — Figure 1

Another question arises at this point: are you going to design a separate model for each alarm (for each device)? Only a few device names appear in the simplified table above, but in fact there may be thousands of different devices in the system, which would mean thousands of separate models. This does not seem very reasonable or sustainable. Instead, a single model can be designed for all alarms with the same cause code, regardless of device. You need to make an assumption in this approach: it is a cause-code-based approach that assumes all alarms with the same cause code (even on different devices) occur for similar reasons, and that devices do not behave differently in their internal processes. This approach also gives a chance for alarms that occurred very rarely, or never occurred before, to be predicted.
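A minimal sketch of this cause-code-based setup is given below, assuming a training table like the one in Figure 1 with a cause_code column and a binary alarm_next_slot label; both column names are placeholders for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_per_cause_code(trainset, feature_cols):
    """Train one model per cause code, pooling rows from all devices."""
    models = {}
    for cause_code, group in trainset.groupby("cause_code"):
        model = RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=42
        )
        model.fit(group[feature_cols], group["alarm_next_slot"])
        models[cause_code] = model
    return models
```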

I will mention the other approach in my next writing.

