Network Service Level Agreement (SLA) management within a Telco
Important metrics to consider
One of the driving contractual issues within a telecommunications (Telco) service provider environment is the underpinning contract that specifies the Service Level Agreements (SLA). These are important because there are often claw-backs which the client can use to penalize the Telco, or the client can use the SLA as a gauge to decide whether to move services elsewhere.
There are a large number of metrics that exist within Information Technology (IT) which I have written about here. In a Telco the important ones are mostly Mean Time to Respond, Mean Time to Repair and Availability. The norm in a Tier 1 Telco is 20 minutes to respond, 4 hours to repair and 99% availability.
The first metric measures how long it takes to contact the client once a link outage has been detected. The second measures how long it takes to resolve the outage. These are important to view within the lifecycle of an incident which I wrote about here. Finally, there is Availability, which is expressed as the percentage uptime of the link.
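As a rough illustration, the first two metrics can be computed from incident timestamps. This is a minimal sketch only: the field names (detected, contacted, resolved) are hypothetical, and I am assuming both clocks start at the moment the outage is detected, which a real contract may define differently.

```python
from datetime import datetime, timedelta

def mean_duration(durations):
    """Average a non-empty list of timedelta values."""
    return sum(durations, timedelta(0)) / len(durations)

def response_and_repair_means(incidents):
    """Return (mean time to respond, mean time to repair).

    Assumption: both are measured from the time the outage was detected.
    """
    respond = [i["contacted"] - i["detected"] for i in incidents]
    repair = [i["resolved"] - i["detected"] for i in incidents]
    return mean_duration(respond), mean_duration(repair)

# Hypothetical incident records for one link.
incidents = [
    {"detected": datetime(2024, 3, 1, 8, 0),
     "contacted": datetime(2024, 3, 1, 8, 10),
     "resolved": datetime(2024, 3, 1, 10, 0)},
    {"detected": datetime(2024, 3, 15, 14, 0),
     "contacted": datetime(2024, 3, 15, 14, 30),
     "resolved": datetime(2024, 3, 15, 18, 0)},
]

mean_respond, mean_repair = response_and_repair_means(incidents)

# Compare against the Tier 1 norms quoted above (20 minutes, 4 hours).
respond_ok = mean_respond <= timedelta(minutes=20)
repair_ok = mean_repair <= timedelta(hours=4)
print(mean_respond, mean_repair, respond_ok, repair_ok)
```

Whether the SLA is judged on the mean or on every individual incident is a contractual choice; the sketch only shows the mean.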
In calculating the Availability metric, consideration needs to be given to the Telco maintenance windows that are either scheduled or declared within a suitable notice period. These maintenance windows, during which a link could potentially be down, are then excluded. Related to maintenance windows is any downtime directly ascribed to the customer, such as a delay in providing timely access to the customer premises equipment for repair.
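The exclusion logic can be sketched as follows. This is an illustration under stated assumptions, not a contractual formula: I assume the exclusion windows do not overlap one another, that outages do not overlap one another, and that excluded time is removed from both the downtime and the measured period. The function names are my own.

```python
from datetime import datetime, timedelta

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (zero if they do not meet)."""
    return max(min(a_end, b_end) - max(a_start, b_start), timedelta(0))

def availability(period_start, period_end, outages, exclusions):
    """Percentage availability with excluded windows removed.

    outages and exclusions are lists of (start, end) datetime pairs.
    Assumption: exclusions do not overlap each other, nor do outages.
    """
    excluded = sum(
        (overlap(period_start, period_end, s, e) for s, e in exclusions),
        timedelta(0),
    )
    downtime = timedelta(0)
    for o_start, o_end in outages:
        # Clip the outage to the reporting period.
        o_start, o_end = max(o_start, period_start), min(o_end, period_end)
        down = max(o_end - o_start, timedelta(0))
        # Subtract any part of the outage that falls in an excluded window.
        for x_start, x_end in exclusions:
            down -= overlap(o_start, o_end, x_start, x_end)
        downtime += down
    measured = period_end - period_start - excluded
    return 100.0 * (measured - downtime) / measured

# A 30-day period with one 4-hour outage, half of it inside a maintenance window.
result = availability(
    datetime(2024, 1, 1), datetime(2024, 1, 31),
    outages=[(datetime(2024, 1, 10, 0), datetime(2024, 1, 10, 4))],
    exclusions=[(datetime(2024, 1, 10, 2), datetime(2024, 1, 10, 6))],
)
print(round(result, 3))
```

Downtime ascribed to the customer can be passed in as further exclusion windows in the same way.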
Additionally, a Telco will exclude a power outage at the customer premise. This is calculated by determining whether there has been an outage of power during the link downtime by polling the Customer Premise Equipment (CPE) and checking whether the system uptime has reset. If the system uptime has reset then the link outage is flagged as power related and the time the link has been down is excluded from the SLA. Obviously this method cannot provide a notification or poll of a power incident while the event is in progress. In that case, an Internet of Things (IoT) device with a power sensor such as the Powalert needs to be used. The system uptime is typically determined using an SNMP poll of the system MIB on the networking equipment at the customer premise.
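The uptime check can be sketched like this. I assume the uptime value has already been polled (sysUpTime in the SNMP system group is reported in hundredths of a second) and that times are epoch seconds. The function names and the tolerance parameter are hypothetical, and a reboot for any other reason would look identical to a power failure.

```python
TICKS_PER_SECOND = 100  # sysUpTime is reported in hundredths of a second

def boot_time(polled_at, sysuptime_ticks):
    """Estimate when the CPE last booted, in epoch seconds."""
    return polled_at - sysuptime_ticks / TICKS_PER_SECOND

def is_power_related(outage_start, polled_at, sysuptime_ticks, tolerance=60.0):
    """Flag the outage as power related if the CPE booted after it began.

    tolerance (seconds) is an assumed allowance for clock skew and
    detection delay; it is not a standard value.
    """
    return boot_time(polled_at, sysuptime_ticks) >= outage_start - tolerance

# Outage began at t=1000; the CPE is polled at t=5000 once the link is back.
rebooted = is_power_related(1000, 5000, 300_000)      # up 3000 s, booted at t=2000
stayed_up = is_power_related(1000, 5000, 1_000_000)   # up 10000 s, booted long before
print(rebooted, stayed_up)
```

The sketch does not handle the uptime counter wrapping around, which a production check would need to consider.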
Many network tools, of which a comprehensive list is provided here, use ICMP pings to the customer premise to determine availability. This method works suitably but has two major flaws. The first is that the pings can be discarded when the link is saturated with client traffic, without the link being down. Secondly, the method requires a centralized poller to poll the customer service in band, which requires connectivity access to the service. As a result of these two flaws, a centralized poller does not scale and is not reliable. Ironic for a system that is meant to determine reliability.
A better mechanism is to utilize Carrier Ethernet attributes and parameters. Carrier Ethernet utilizes CFM frames to determine service availability. This does not require access to the service by a poller, but is managed and reported on by the actual Carrier Ethernet networking equipment itself. This method is significantly more reliable than the centralized ICMP poller method, as well as being more accurate and able to scale. Carrier Ethernet also has OAM functionality which contributes significantly to better link diagnosis, resulting in better Mean Times to Repair.
The mechanism used by Carrier Ethernet can also be implemented in a proprietary fashion. As an example, in radio pairs it would be possible to send over-the-air heartbeats and measure the drops. These can be exposed as counters via SNMP and then used to determine availability. Many radios have SNMP OIDs that report the RSS levels of associated pairs. This value can be used as an SLA measure, as a link failure would result in no acceptable RSS value.
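Turning heartbeat counters into an availability figure could look something like the sketch below. The counters (heartbeats sent and heartbeats lost) are hypothetical and not taken from any particular radio's MIB; I assume they behave like 32-bit SNMP counters that wrap back to zero.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed 32-bit counters that wrap to zero

def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter readings, allowing for one wrap."""
    return (current - previous) % modulus

def heartbeat_availability(sent_prev, sent_now, lost_prev, lost_now):
    """Percentage of heartbeats delivered between two polls.

    Returns None when no heartbeats were sent in the interval.
    """
    sent = counter_delta(sent_prev, sent_now)
    lost = counter_delta(lost_prev, lost_now)
    if sent == 0:
        return None
    return 100.0 * (sent - lost) / sent

# 1000 heartbeats sent between polls, 10 of them dropped.
pct = heartbeat_availability(1000, 2000, 5, 15)
print(pct)
```

Because the counters live on the radio itself, the result does not depend on a central poller reaching the service during the outage, only on reading the counters afterwards.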
Another example of this type of mechanism is Cisco's IPSLA functionality. IPSLA metrics are also available as SNMP counters.
Even in SD-WAN deployments, one of the primary software functions is for a link to have a heartbeat that can be used as described above. SLA management functionality should be a built-in prerequisite for SD-WAN. If I were responsible for the service delivery of a hypothetical bank with 9 regions and 7000 sites, I would want a simple dashboard that presents the month-to-date availability per region.
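The roll-up behind such a dashboard could be as simple as the following sketch. I assume each site reports its month-to-date downtime in seconds with exclusions already removed, and that every site counts equally within its region. The region names and figures are made up.

```python
from collections import defaultdict

def region_availability(sites, elapsed_seconds):
    """Month-to-date availability per region, as a percentage.

    sites is an iterable of (region, downtime_seconds) pairs, one per site.
    elapsed_seconds is the time elapsed in the month so far.
    """
    totals = defaultdict(lambda: [0, 0.0])  # region -> [site count, downtime]
    for region, downtime in sites:
        totals[region][0] += 1
        totals[region][1] += downtime
    return {
        region: 100.0 * (1 - down / (count * elapsed_seconds))
        for region, (count, down) in totals.items()
    }

# One day into the month: one North site was down for 864 seconds.
sites = [("North", 0), ("North", 864), ("South", 0)]
dashboard = region_availability(sites, elapsed_seconds=86_400)
for region, pct in sorted(dashboard.items()):
    print(f"{region}: {pct:.2f}%")
```

Weighting sites by importance or bandwidth would be an equally valid design; the equal weighting here is only the simplest choice.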
Please inbox me if you have any questions or comment below.