Fault Management In A Big Network

Hilkat
4 min read · May 10, 2020


Like other areas of technology, fault management at telecommunications operators has changed dramatically over the years.

In the early 2000s, networks were growing fast, large numbers of subscribers were joining, and employees were mainly busy handling this expansion. Alarm monitoring was not a focus.

Neither networks nor fault management systems were as complicated as they are today.

Not many operators had centralized operation centers. They generally had separate OMCs (Operation and Maintenance Centers). As an engineer or technician you were supposed to do data entry, change configurations, install new nodes, write scripts and, if you had time, look at alarms on nodes and replace faulty cards. So fault management was not a priority.

In a typical OMC, you would see walls covered with alarm panels with little bulbs (yes, bulbs!). They were directly connected to nodes (MSC, BSC, HLR, etc.). There were also technicians who looked at alarms on the nodes if there was a major problem. There were only consoles or unattractive UIs to reach the nodes. Sometimes a guy would come to the data center and start working without informing anybody. This could happen even in the daytime, and you would lose power to all your nodes at the OMC. You could only tell something was going wrong by looking at the red bulbs on the alarm panels. This was typical fault management for a telecommunications operator in those days.

Today, thanks to improved umbrella fault management systems and centralized NOCs (Network Operation Centers), it is easier to detect faults, but there are still lots of things to consider.

Today, networks are much bigger and more complex than they were 20 years ago. Mobile devices have changed, 2G has evolved to 5G, network vulnerabilities have become more important, regulations force operators to take strict measures, and so on.

The number of mobile (cellular) subscriptions worldwide has surpassed the world's population, but the workforce of telecom operators has not grown that much. Now it is time to work more efficiently and more precisely. This can only be achieved with new tools and approaches.

If you look at the network of a telecom operator today, you will see many different network elements, EMSs (Element Management Systems) and NMSs (Network Management Systems). Each EMS/NMS has its own GUI and fault management screens. If you wanted to follow alarms on those systems separately, you would need many more employees.

In order to collect alarms from dozens of EMS/NMSs and present them to the NOC, you need an umbrella fault management system. EMS/NMSs offer several NBI (Northbound Interface) options for this.

Huawei iManager U2000 Northbound Interfaces

The picture above shows the NBIs of the Huawei iManager U2000 EMS. Here it can be seen that, for FM (fault management), it can integrate with an NMS or umbrella management system via CORBA, ASCII, SNMP or FILE NBIs. Some vendors develop their EMS/NMSs with advanced NBI features. One of the most important of these is synchronisation. If there is planned work on an EMS, the upper-level NMS or umbrella management system may lose its connection to the EMS for a long time. After the connection is re-established, the alarm lists on both sides need to be synchronised. To do this, the EMS has to keep an active alarm list, and this list should be available every time the upper management system requests it.
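To make the synchronisation step more concrete, here is a minimal sketch in Python of how an umbrella system might reconcile its own view with the active alarm list returned by an EMS after a reconnect. The `Alarm` fields, the `resynchronise` function and the sample alarm IDs are all hypothetical; a real NBI defines its own alarm format and its own way of requesting the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; real NBIs carry many more fields."""
    alarm_id: str
    node: str
    severity: str


def resynchronise(umbrella_alarms, ems_active_alarms):
    """Compare the umbrella's view with the EMS active alarm list.

    Returns (missed, stale):
      missed - alarms active on the EMS that the umbrella never received
      stale  - alarms the umbrella still shows but the EMS no longer has
    """
    umbrella_by_id = {a.alarm_id: a for a in umbrella_alarms}
    ems_by_id = {a.alarm_id: a for a in ems_active_alarms}

    missed = [a for aid, a in ems_by_id.items() if aid not in umbrella_by_id]
    stale = [a for aid, a in umbrella_by_id.items() if aid not in ems_by_id]
    return missed, stale


# View held by the umbrella system before the connection was lost.
umbrella_view = [
    Alarm("A1", "MSC-1", "major"),
    Alarm("A2", "BSC-7", "minor"),
]
# Active alarm list returned by the EMS after the connection is restored.
ems_view = [
    Alarm("A2", "BSC-7", "minor"),
    Alarm("A3", "HLR-2", "critical"),
]

missed, stale = resynchronise(umbrella_view, ems_view)
print("raise on umbrella:", [a.alarm_id for a in missed])
print("clear on umbrella:", [a.alarm_id for a in stale])
```

In this example the umbrella would raise A3, which occurred during the outage, and clear A1, which was cleared on the EMS while nobody was listening.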

A telecom operator has dozens of those EMSs from several vendors. Every day, millions of alarms occur on those systems. Not all vendors put the same focus on the fault management part of their EMSs. An EMS may have been developed for small operators, with basic GUIs. It works fine there, but if its NBI capabilities are limited, problems appear once it is put into a large telecom operator's network. The EMS sends alarms to the upper level, but sometimes alarms are dropped due to network problems or planned activities on EMS/NMSs. With limited operational personnel, even one missing alarm can be a big and frustrating issue.

In order to handle such problems, telecom companies need strict policies for integrating EMSs with their umbrella management systems. They need to force EMS vendors to develop NBIs according to those policies. Big vendors' EMSs generally have advanced NBIs such as CORBA and ASCII. They are robust, reliable and powerful. On the other hand, they are complicated and expensive, because not many vendors use them.

SNMP is the most commonly used protocol for an NBI. It is simple and widely used in telecommunications, but it has weaknesses too. Most vendors have an EMS with an SNMP NBI. Since SNMP mainly uses UDP to send alarms to upper systems, and a trap is never acknowledged, the EMS does not know whether its alarms were received or not. To overcome this problem, SNMP InformRequest should be used instead of SNMP trap. With InformRequest the receiver acknowledges each notification, so the EMS can detect a dropped packet and resend the alarm. But this only protects alarms against very short connection problems. For longer connection outages, the EMS should keep an active alarm list and send this list to upper systems when requested. By having resynchronisation capability and sending alarms with InformRequest, an SNMP integration can be as reliable as an old-school CORBA one.
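The difference between the two notification types can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: it does not use a real SNMP library or real UDP sockets, `LossyLink`, `send_trap` and `send_inform` are hypothetical names, and a packet that arrives is treated as acknowledged (in a real network the acknowledgement itself can also be lost).

```python
class LossyLink:
    """Simulated UDP path that drops the packets whose send index is listed."""

    def __init__(self, drop_indexes):
        self.drop_indexes = set(drop_indexes)
        self.sent = 0
        self.received = []  # what the upper system actually got

    def send(self, alarm_id):
        """Return True if the packet arrived (and would be acknowledged)."""
        index = self.sent
        self.sent += 1
        if index in self.drop_indexes:
            return False
        if alarm_id not in self.received:
            self.received.append(alarm_id)
        return True


def send_trap(link, alarm_id):
    """Trap style: fire and forget, the sender never learns about a drop."""
    link.send(alarm_id)


def send_inform(link, alarm_id, max_retries=3):
    """Inform style: resend until acknowledged or retries are exhausted."""
    for _ in range(1 + max_retries):
        if link.send(alarm_id):
            return True
    return False


alarms = ["A1", "A2", "A3"]

# Short glitch: only the second packet on the link is lost.
trap_link = LossyLink(drop_indexes=[1])
for alarm in alarms:
    send_trap(trap_link, alarm)

inform_link = LossyLink(drop_indexes=[1])
for alarm in alarms:
    send_inform(inform_link, alarm)

# Long outage: every packet is lost, so retries alone cannot help.
outage_link = LossyLink(drop_indexes=range(10))
delivered_during_outage = send_inform(outage_link, "A4")

print("trap received:  ", trap_link.received)
print("inform received:", inform_link.received)
print("delivered during outage:", delivered_during_outage)
```

With traps, A2 is silently lost. With informs, A2 is resent and arrives. During the long outage even the inform gives up after its retries, which is exactly the case where the active alarm list and resynchronisation are needed.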
