Predictive Maintenance (Alarm Prediction) Key Points in Telecom Domain

Halil Ertan
7 min read · Dec 17, 2019


You can come across predictive maintenance implementations in many sectors. They mainly aim to predict faults on devices in advance and ensure the continuity of a system that generally consists of many devices. In this post, I will talk about predictive maintenance implementations in the telecom domain. A telecom company consists of many components, and almost every component is well suited to machine learning use cases: churn prediction, revenue forecasting, social network analysis, customer segmentation, fraud detection, pricing recommendation, campaign targeting, customer lifetime value, and sentiment analysis, to name a few. The list could go on, but I will focus on predictive maintenance, and the keywords predictive maintenance and telecom together take us to network infrastructure. Even when you narrow the scope to the network side, there are many different use cases such as network utilization, capacity planning, traffic volume forecasting, call route optimization, capacity management, and performance management. I am not an expert in the network world, so I will focus on one significant issue: alarm (fault) prediction. It is one of the most crucial needs of big telecom companies in their networks.

Most telecom companies use similar monitoring and storage tools, most probably from the same vendors; in other words, alarms are stored in a very similar format regardless of the technology that raised them. Alarms in telecom infrastructure represent several kinds of conditions: a probable problem, a problem that has already occurred, or just a warning. Some of them are even irrelevant or false alarms. Thousands of alarms occur daily even in a small telecom infrastructure, and this number can grow drastically in larger-scale telecom companies. So why is it so important to predict alarms in advance?


Predicting alarms in advance (for instance, 24 to 48 hours before they occur) has two main benefits for companies. First, the cost of running operation teams decreases in the short term. Operation teams can respond to failures before they occur, and replacement parts can be organized according to the alarm predictions. For example, an employee in an operation team starts every day with a list of faults that have already occurred. Imagine that the same employee has a second list of faults that are likely to occur in the next 48 hours. When he goes on shift to fix an outdoor device fault, he can simply check whether any other faults have a high probability of occurring near the already-failed device within the next 48 hours. That would be very beneficial.

Second, reducing the number of alarms by preventing them before they occur, thanks to solid predictions, directly increases service quality and, consequently, customer satisfaction. Although most faults in the system do not affect customers directly, some interruptions and failures can have a huge impact on the end user.

Points to Consider While Making a Prediction About Failures

Precision or Recall?

Actually, the answer to this question is up to the operation team that will ultimately use the solution. As you know, there is a tradeoff between precision and recall. If the operation team wants to catch as many alarms as possible before they occur, then you focus on increasing the recall of your model. On the other hand, if they only want to know about the alarms that are very likely to occur and do not want to waste time on false-positive predictions, you should concentrate on increasing the precision of your model. This choice becomes an important factor especially in the model tuning phase, as the sketch below illustrates.
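The snippet below is a minimal sketch of that tradeoff on synthetic data (the dataset, classifier, and threshold values are all stand-ins, not part of any real alarm pipeline): the same trained model can be pushed toward recall or toward precision simply by moving its decision threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy stand-in for an alarm dataset: label 1 = alarm occurs in the window.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# A low threshold favors recall (catch more alarms, more false positives);
# a high threshold favors precision (fewer, but more reliable, predictions).
for threshold in (0.2, 0.5, 0.8):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_val, y_pred, zero_division=0):.2f}, "
          f"recall={recall_score(y_val, y_pred):.2f}")
```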

Dependency between Technologies in the Infrastructure

If you are working on a huge network that consists of several different technologies, you should take the dependencies between those technologies into consideration. Some technologies are built on top of a base technology. For instance, suppose technology B is based on technology A and you aim to predict an alarm in technology B. Since any alarm in technology A can be an indicator for alarms in technology B, you can leverage the alarms of technology A together with the alarms of technology B. On the other hand, you do not need the alarms of technology B while predicting alarms in technology A; see the sketch below.
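One simple way to encode such a dependency is to count recent alarms of the base technology per site and feed that count into the model for the dependent technology. The following rough sketch uses toy data and hypothetical column names (site_id, technology, created_at), which are assumptions rather than any standard alarm schema.

```python
import pandas as pd

# Toy alarm records; real exports would be read from a database or file.
alarms = pd.DataFrame({
    "site_id":    ["S1", "S1", "S2", "S2", "S1"],
    "technology": ["A",  "A",  "A",  "B",  "B"],
    "created_at": pd.to_datetime([
        "2019-12-01 08:00", "2019-12-01 21:00", "2019-12-01 09:30",
        "2019-12-02 03:00", "2019-12-02 07:00"]),
})

# For each site, count technology-A alarms in the 24 hours before a
# reference time; this count becomes a feature for the technology-B model.
reference_time = pd.Timestamp("2019-12-02 00:00")
window_start = reference_time - pd.Timedelta(hours=24)

tech_a_recent = (
    alarms[(alarms["technology"] == "A")
           & alarms["created_at"].between(window_start, reference_time)]
    .groupby("site_id").size()
    .rename("tech_a_alarms_last_24h")
)
print(tech_a_recent)
```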

All Alarms are NOT Important

Most probably you will not be given feedback about which kinds of alarms are irrelevant to your study; however, you can clean those alarms from your dataset if you are informed. Alarms lacking basic information such as an id, creation timestamp, device name, or cause code can be treated as noise and removed from the dataset. I additionally suggest ignoring alarms and records that were created manually. While extracting features from alarms, ignoring overly frequent alarms can be a good approach as well: if an alarm occurs every 2 to 3 days and you want to predict a rarer alarm (which is most probably the case), then using that frequent alarm in the study may be pointless; focus on rarer alarms, which are more likely to carry a predictive signal. Moreover, alarms raised during scheduled maintenance can be treated as noise and put out of the scope of the study, as in the simplified cleaning pass below.
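Here is a simplified cleaning pass in pandas. The file name, column names, and boolean flags (created_manually, scheduled_maintenance) are assumptions about what an alarm export might look like, and the frequency cutoff is just an illustrative value.

```python
import pandas as pd

# Assumed alarm export with these columns; adjust names to your own data.
alarms = pd.read_csv("alarms.csv", parse_dates=["created_at"])

# Drop alarms missing basic fields.
required = ["alarm_id", "created_at", "device_name", "cause_code"]
alarms = alarms.dropna(subset=required)

# Exclude manually created and scheduled-maintenance alarms,
# assuming such boolean flags exist in the export.
alarms = alarms[~alarms["created_manually"] & ~alarms["scheduled_maintenance"]]

# Drop cause codes that fire more often than roughly once every two days,
# since they carry little signal for predicting rarer alarms.
span_days = (alarms["created_at"].max() - alarms["created_at"].min()).days or 1
counts = alarms["cause_code"].value_counts()
frequent = counts[counts / span_days > 0.5].index
alarms = alarms[~alarms["cause_code"].isin(frequent)]
```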

Set Prediction Time Range Properly

I attach great importance to this point, because it directly affects both your prediction accuracy and your code. First, you cannot predict all alarms 24 hours before they occur. By their nature, some alarms show signs only 5 minutes, or say 6 hours, before they occur. For this reason, it is better to define prediction time ranges specific to alarm types and to set the data collection frequency accordingly. Second, you will probably work with a highly imbalanced dataset and apply techniques that handle imbalance, such as upsampling, downsampling, or adjusting class weights. If the solution you design is based on supervised learning, you can also decrease the imbalance ratio by setting the prediction time range properly; the point is to increase the number of minority-class labels. If you make predictions for the next week with a supervised approach, you will have far more minority-class samples in your training set than if you predict only the next 24 hours, as the labeling sketch below shows.
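The toy example below sketches how the labeling window changes the class balance. A snapshot of a device is labeled positive if the target alarm occurs within the next window; the device names, snapshot times, and alarm timestamps are made up for illustration.

```python
import pandas as pd

# Daily snapshots of one device over a week (hypothetical data).
samples = pd.DataFrame({
    "device_name": ["D1"] * 7,
    "snapshot_time": pd.date_range("2019-12-01", periods=7, freq="D"),
})
target_alarms = [pd.Timestamp("2019-12-07 12:00")]

# A longer prediction window turns more snapshots into positives,
# which softens the class imbalance for a supervised model.
for window in (pd.Timedelta(hours=24), pd.Timedelta(days=7)):
    label = samples["snapshot_time"].apply(
        lambda t: int(any(t < a <= t + window for a in target_alarms)))
    print(f"window={window}: {label.sum()} positives out of {len(label)}")

# Any remaining imbalance can be handled with e.g. class_weight="balanced"
# in scikit-learn estimators, or with up/downsampling of the training set.
```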

Probable difficulties you might encounter

What does it mean to predict alarms? Device Name, Cause Code, Port…

Prediction use cases are basically binary classification problems. However, alarm prediction is not so easy in telecom networks. While predicting an alarm, you are expected to specify the device name, the cause code, and most of the time an additional level such as the port, card, or slot. So if you define an alarm only by device name and cause code, you will need an extra step that infers which part of the device will have a problem. For this reason, you need to design a hierarchical structure that may require machine learning and data mining methods together. Furthermore, you may be asked to provide a list of related alarms that support each prediction. In such a case, black-box methods, which most machine learning methods are, will not serve the purpose, and you may need to combine your main solution with other methods such as market basket analysis or association rule mining, as sketched below. I will cover the methodologies that can be used for predicting alarms in another post.
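As one possible building block for that "related alarms" list, the sketch below mines association rules between alarm types that co-occur on the same device within a time window. It uses the mlxtend library (assumed to be installed) and toy cause codes; treating each device-window as a "transaction" is an illustrative modeling choice, not the only way to frame it.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction: the set of cause codes seen on one device in one window.
transactions = [
    ["LINK_DOWN", "HIGH_TEMPERATURE"],
    ["LINK_DOWN", "HIGH_TEMPERATURE", "POWER_FAILURE"],
    ["POWER_FAILURE"],
    ["LINK_DOWN", "HIGH_TEMPERATURE"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```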

Selection of Different Kinds of Data Sets

Collecting many and various datasets is not always the right approach; selecting the right datasets is what matters. At this point, if you do not have much domain knowledge, it is very helpful to have domain experts direct you to the correct datasets. Another issue is whether you are actually able to obtain and collect the data, even if you know exactly what you need. A further difficulty worth mentioning is that each additional dataset brings its own format and new analyses with it, and therefore an additional burden.

All alarms are NOT predictable

Separating predictable from non-predictable alarms is actually another crucial topic in itself. The truth is that you may not be able to predict some types of alarms in advance at all, given the supplied datasets. To make this clear, let me give an example. Suppose you aim to predict alarms with the cause code 'Device is Offline' in advance, and most of these alarms occur because a digger cut a fiber cable during road maintenance work. If you are not supplied with the road maintenance schedule from the municipality, you have no chance of predicting them; and even with that dataset, predicting the carelessness of the digger operator is hardly a deterministic scenario. Similar cases include the theft of device parts, a lightning strike, and so on. On the other hand, some alarms with the same cause code are caused by performance issues, so they can be predicted using performance metrics. All in all, detecting such alarms and removing them from the scope of the study will definitely increase its success in terms of both recall and precision. Obviously, this requires deep analysis carried out together with the operation teams.

Unfamiliarity with the network domain

Last but not least, domain knowledge is definitely one of the key ingredients of predictive maintenance studies in telecom operators' networks. I deeply believe in the importance of domain knowledge in data science projects, especially in a very specialized field like the telecom network domain. Otherwise, you may waste time trying to use completely irrelevant features in your model, like including weather information in a churn analysis. And you will not encounter many colleagues who master both the telecom network side and the machine learning side and can act as a bridge in data science projects in this domain.
