Let there be (outage) information!

The Tele2 Technology Blog
6 min read · Sep 20, 2019

by Vicky Gandhi — Architect, Product Owner & Developer

I highly recommend reading our chief architect Rasmus Aveskogh's blog post on Customer Experience before this one.

Background

For any telco, being able to provide customers with high-quality services is essential to the Customer Experience (and, in turn, to lowering the churn rate). This is of course a fundamental truth for businesses in general, but telcos have an especially high standard to live up to: our customers are paying for services that are expected to work 24 hours a day, 7 days a week, 365 days a year.

A number of years ago, our company set itself a clear goal: to achieve Sweden's highest customer satisfaction ranking. Customer Experience (CX), as perceived by our subscribers, was therefore determined to be our top priority.

When we launched this initiative, our objective was to develop a platform that would help Com Hem connect the dots between network performance, service data and CX. That required integrating real-time network performance with dynamic geospatial and topological network views, service information from CRM sources, and a robust CX impact analysis layer. The result is IQAROS: an AI- and ML-based operational excellence platform that keeps CX at the heart of all its processes, built from scratch by and for service providers.

Circumstances, some of which are foreseeable (planned network enhancements, changes, etc.) and others which aren't (power outages, force majeure, third parties, etc.), may leave customers with non-operational or unstable services. For those unfortunate enough to be impacted, being able to detect the problem and provide them with ACCURATE outage information as early as possible is vital. Ideally, no service should ever be affected without its provider being aware of it; customers need to know we're on top of issues as soon as they've arisen.

As Com Hem grew beyond our original cable network market and expanded into additional markets, our ability to calculate customer and service impact steadily declined, largely because (ambitious) expansion projects overshadowed the technical debt. In the FTTx/LAN and Open Network markets, being able to calculate which services, and therefore which customers, were impacted became practically non-existent.

A change needed to happen in order to:

  1. Remove 15–20 years of technical debt
  2. Handle complex queries to ascertain impacted services (and therefore customers)
  3. Loosely couple BSS terms such as Service Provider, and be agnostic about the underlying infrastructure providing the services
  4. Provide Customer-Centric outage information both reactively and proactively, through multiple channels, each with their own set of configurations and rules
  5. Enable Data-Driven operations throughout the organization based on the incoming outage information requests

The change

In the fall of 2017, our chief architect presented the IQAROS-based Impact as a Service engine at our company’s Hackathon (internally named HackaCom). Suddenly bullet points 2 & 3 above were fulfilled, allowing us to build a new system on top of the engine that could fulfill bullet points 1, 4 & 5.

Impact as a Service

A team was formed to build this system, and a substantial part of the design went into ensuring that it would keep scaling no matter how much the company expanded (or, in our case, was acquired; hello Tele2!).

As anyone who’s ever done a huge system migration can tell you, it’s definitely not done overnight. To ensure Quality of Service, we built our very own “Man-In-The-Middle” proxy solution so that we had every relevant metric on performance, functionality, stability, response times, etc. We were practically shimming reality away, so we code-named it Shim.

All producers and consumers were told that all they needed to do was point towards a different domain for the service; Shim would guarantee that all former API contracts remained intact and fulfilled. From then on, we had full control over which back-end system did the responding: the monolithic or the modular one.

The black arrow represents the originating call, while the red and green arrows represent the responses from the back-end systems. Both responses were saved, along with performance metrics, and one response was sent back to the consumer. In our early days, as the picture shows, it was the old monolith’s response that was sent back.
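To make the fan-out idea concrete, here is a minimal sketch of a Shim-style proxy in Python, assuming an HTTP API; the host names, endpoint and metrics handling are hypothetical, so treat it as an illustration of the pattern rather than the actual implementation.

```python
# shim.py: a minimal sketch of a "man-in-the-middle" migration proxy.
# Host names, the endpoint and the metrics handling are all made up.
import time

import requests
from flask import Flask, Response, request

app = Flask(__name__)

LEGACY_URL = "http://legacy-impact.internal"   # old monolith (hypothetical host)
MODULAR_URL = "http://iqaros-iaas.internal"    # new IQAROS-based IaaS (hypothetical host)
MASTER = "legacy"                              # whose answer the consumer gets


def call_backend(base_url: str, path: str, params: dict) -> dict:
    """Forward the original call and time the round trip."""
    started = time.monotonic()
    resp = requests.get(f"{base_url}{path}", params=params, timeout=5)
    return {
        "status": resp.status_code,
        "body": resp.text,
        "elapsed_ms": (time.monotonic() - started) * 1000,
    }


def store_for_comparison(params: dict, results: dict) -> None:
    """Persist both answers and their timings; a print keeps the sketch self-contained."""
    print(params, {name: round(r["elapsed_ms"], 1) for name, r in results.items()})


@app.route("/impact")
def impact():
    params = request.args.to_dict()
    # Fan out: every incoming call is forwarded to BOTH back ends.
    results = {
        "legacy": call_backend(LEGACY_URL, "/impact", params),
        "modular": call_backend(MODULAR_URL, "/impact", params),
    }
    store_for_comparison(params, results)
    master = results[MASTER]                   # only the master's answer goes back
    return Response(master["body"], status=master["status"])


if __name__ == "__main__":
    app.run(port=8080)
```

Because both answers and their timings are stored for every call, comparing the two back ends over time becomes an ordinary data problem rather than a one-off test.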

After a solid two months in which we tweaked, fixed bugs and ensured full stability, it was time for a decision.

The switch

Shim had a very handy piece of built-in functionality: by simply changing a value in its configuration, you could switch which back-end system took the master role, effectively making it the one answering all calls.
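In terms of the sketch above, the master role could be nothing more than a configuration value read at runtime, and flipping that value is the entire switch. The file name and format below are invented for illustration:

```python
# Sketch of the master switch: the fan-out keeps happening regardless,
# only the configured value decides whose answer reaches the consumer.
import json


def load_master(path: str = "shim.conf.json") -> str:
    """Read e.g. {"master": "modular"} and default to the legacy system."""
    with open(path) as fh:
        return json.load(fh).get("master", "legacy")

# Changing "legacy" to "modular" in shim.conf.json is the whole switch;
# no producer or consumer has to change anything on their side.
```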

On the 30th of May 2018, we made the switch from our legacy impact system to the IQAROS-based IaaS. The results, I believe, speak for themselves in the response times we were achieving.

Performance metrics PRE vs POST switch

Each request received a Customer-Centric response based on the correlation between outages and the customer’s services, meaning we would only tell the customer which of their own services were affected. And all of this happened in real time for each request, whereas previously everything had been pre-generated.
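Conceptually, that correlation step can be sketched as follows; the data structures are invented for illustration and are not the real IQAROS model.

```python
# Sketch: filter active outages down to the ones that actually touch the
# requesting customer's services. Data structures are made up for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    service_id: str
    kind: str            # e.g. "broadband", "tv", "telephony"
    network_node: str    # topological point the service hangs off


@dataclass(frozen=True)
class Outage:
    outage_id: str
    affected_nodes: frozenset   # topological points known to be impacted
    planned: bool


def customer_centric_impact(services: list[Service], outages: list[Outage]) -> list[dict]:
    """Return only the customer's own affected services, per outage."""
    impact = []
    for outage in outages:
        hit = [s for s in services if s.network_node in outage.affected_nodes]
        if hit:
            impact.append({
                "outage_id": outage.outage_id,
                "planned": outage.planned,
                "affected_services": [s.kind for s in hit],
            })
    return impact
```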

All of this completed bullet points 1 & 4, but what about 5?

The future

Now that our capability to detect affected services, and therefore customers, was at an all-time high, we realized that A LOT of the incoming requests concerned outages we were not yet aware of. This became abundantly clear once all processed requests were being put on Kafka, so that we could analyze them as a stream but also go back “in time” whenever necessary.

Since the requests for which we couldn’t produce any outage information were a clear indication of some kind of bad customer experience, we decided to enrich every request with as much relevant data as possible, such as topological data, which service might be affected (whenever applicable) and whether any outage information had been available at the time of the request, and to feed it all into Elastic.
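A pared-down version of that enrichment loop might look roughly like this, assuming the widely used kafka-python and a recent elasticsearch Python client; the topic name, index name and lookup helpers are hypothetical stand-ins.

```python
# Sketch of the enrichment pipeline: consume outage-information requests from
# Kafka, attach topology/outage context, and index the result into Elasticsearch.
import json

from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

consumer = KafkaConsumer(
    "outage-info-requests",                      # hypothetical topic name
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda raw: json.loads(raw),
)
es = Elasticsearch("http://elastic:9200")


def lookup_topology(customer_id: str) -> str:
    """Stand-in for a real topology lookup."""
    return "node-unknown"


def guess_affected_service(request: dict):
    """Stand-in for the 'which service may be affected' heuristic."""
    return request.get("service_hint")


def enrich(request: dict) -> dict:
    """Attach topology and outage context to a raw request."""
    return {
        **request,
        "network_node": lookup_topology(request["customer_id"]),
        "suspected_service": guess_affected_service(request),
        "outage_info_available": bool(request.get("outages")),
    }


for message in consumer:
    es.index(index="outage-requests", document=enrich(message.value))
```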

No matter how much data you gather, it’s useless if you don’t have an informative way to visualize it for your end users. So, by sitting down with them and listening to their pain points and blind spots, we came back with an IQAROS dashboard tailored to their needs.

Dashboard visualizing which points in our network may be having issues, the number of customers beneath each point and how many of them have requested outage information in the past 60 minutes

By knowing the customer share beneath each topological point analyzed, reading the data from Elastic and adding some thresholds, we were able to visualize not only a first indication of where in the network there might be an error, but also how effective outage information can be.
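The threshold logic itself is simple enough to sketch: given how many customers sit beneath a topological point and how many of them have requested outage information in the last 60 minutes, anything above a configurable share gets flagged. The threshold values below are made up.

```python
# Sketch: flag topological points where an unusually large share of customers
# have requested outage information in the last 60 minutes. Thresholds invented.
WARN_SHARE = 0.02    # 2 % of customers asking is suspicious
ALERT_SHARE = 0.05   # 5 % very likely means an undetected outage


def classify_point(customers_beneath: int, requests_last_hour: int) -> str:
    """Classify one topological point based on the share of asking customers."""
    if customers_beneath == 0:
        return "no-data"
    share = requests_last_hour / customers_beneath
    if share >= ALERT_SHARE:
        return "alert"
    if share >= WARN_SHARE:
        return "warning"
    return "ok"


# e.g. 40 requests from 500 customers beneath a node -> "alert" (8 %)
print(classify_point(customers_beneath=500, requests_last_hour=40))
```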

The dashboard has become one of the tools most heavily used by our Network Operations Center (NOC), who monitor the network 24/7, giving them the ability to react faster, both with root cause analysis and with initiating the incident process.

This completed bullet point 5, and with that, all of our bullet points were finally done!

Finishing words

We are still at the beginning of our journey; the data we capture from our network, as well as from other systems, gives us the capability to automate many things that are done manually today. And why shouldn’t we? Customers want their services to always work, after all. The R&D, and automating the right things, may take a bit longer, but my final words for this post are simple: dare to challenge the status quo, the end results might just surprise everyone!

P.S. Caroline & Hanna, I know I’ve been dodging finalizing this post for a (shamefully) long time but hopefully this is more than satisfactory, it certainly is for me. Thanks for keeping me on my toes :)
