How We Built a Streaming Analytics Platform for a Telecom Operator in Less than 6 Months

Bartosz Maciulewicz
Published in Getindata Blog · Nov 22, 2018 · 5 min read


Let me warn you first: this will be a long read, but if you love Apache technologies as much as we do, you’ll enjoy this post :).

A real-life case

Some time ago we were contacted by Kcell, one of the largest telecommunication companies in Kazakhstan, asking us to help them build a new real-time platform for complex event processing (open-source technologies turned out to be the best choice). Since telecom operators gather more data related to internet services than ever before, the tool had to be flexible, scalable and reliable.

The task presented to us by Kcell eventually led to a six-month joint project that delivered a modern stream analytics platform on top of cutting-edge technologies such as Apache Flink, Kafka, NiFi and many others. To come up with a suitable architecture, we created a list of key requirements, both in terms of the features needed and the performance standards expected…

No data can be wasted

First of all, it is worth mentioning the scale of the data collected. Kcell has more than 10 million subscribers, and we receive about 160 thousand events per second (eps) on average (and over 300k eps at peak, which is a lot!) from just a subset of the most important data sources. This translates to around 23 TB of data per month… for now… as this number will only grow as future use cases are added to the platform.
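
As a quick back-of-envelope check on those figures (the per-event size below is our own estimate derived from them, not a measured value):

```java
// Rough arithmetic behind the 23 TB/month figure quoted above.
public class BackOfEnvelope {
    public static void main(String[] args) {
        long avgEventsPerSecond = 160_000L;
        long secondsPerMonth = 60L * 60 * 24 * 30;           // ~30-day month
        long eventsPerMonth = avgEventsPerSecond * secondsPerMonth;

        double reportedTerabytes = 23.0;
        double impliedBytesPerEvent = reportedTerabytes * 1e12 / eventsPerMonth;

        System.out.printf("Events per month: %,d%n", eventsPerMonth);                     // ~415 billion
        System.out.printf("Implied size per event: ~%.0f bytes%n", impliedBytesPerEvent); // ~55 bytes
    }
}
```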

Our solution

At a high level, the platform is built on Apache Flink, Apache Kafka and Apache NiFi, among other components, with the design decisions driven by the requirements above.

Let’s talk about the challenges

Challenge 1: Low latency and high performance

Handling more than 160k messages per second on average, and over 300k messages per second at peak, requires careful design from the very beginning.

Of course, it is impossible to foresee every bottleneck, and we did not want to count our chickens before they hatched. Nevertheless, we started with a few assumptions that in the long run allowed us to meet our performance requirements. The most important one is to capture the data and consolidate it in one place: whenever we need data held in some external system, we stream it into our platform. This allows us to achieve two things:

  • all lookups are local, ideally served from memory, so the resulting latency is as low as possible;
  • we do not put any additional stress on those external systems; we do not own them, so we cannot ensure they could handle the load we might generate. A sketch of this local-enrichment pattern follows below.
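
As a rough illustration of this pattern (the UsageEvent, SubscriberInfo and EnrichedEvent types below are hypothetical placeholders, not our production model), a Flink CoProcessFunction can keep the replicated reference data in keyed state and enrich events with a purely local lookup:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Enriches subscriber events with reference data that is streamed into the
// platform instead of being queried from the external system it lives in.
public class LocalEnrichment extends CoProcessFunction<UsageEvent, SubscriberInfo, EnrichedEvent> {

    private transient ValueState<SubscriberInfo> subscriberState;

    @Override
    public void open(Configuration parameters) {
        subscriberState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("subscriber-info", SubscriberInfo.class));
    }

    // Main event stream: enrich from local keyed state, no remote call involved.
    @Override
    public void processElement1(UsageEvent event, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        SubscriberInfo info = subscriberState.value();
        out.collect(new EnrichedEvent(event, info)); // info may still be null before reference data arrives
    }

    // Replicated reference-data stream: simply refresh the local copy.
    @Override
    public void processElement2(SubscriberInfo info, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        subscriberState.update(info);
    }
}
```

Both inputs are keyed by MSISDN before being connected, so the state above is scoped to a single subscriber and every lookup stays in local state.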

Challenge 2: Control Stream

Our event streams are keyed by the user’s MSISDN (telephone number). Fortunately, we always evaluate business rules against the events of a single user, so we can redistribute events by that key and keep most of the computational state scoped to it. However, it is a different story with the control stream: rule definitions should be applied to each and every key. Therefore we broadcast that stream to all operator instances and store the definitions in Operator State. Thanks to that mix of Keyed and Operator State, we are able to apply new or adjusted rules dynamically, with no downtime.
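
A minimal sketch of this combination, using Flink’s broadcast state pattern (the UsageEvent, Rule and Alert types and their methods are hypothetical placeholders):

```java
import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Evaluates dynamically updated business rules against a stream of events
// keyed by MSISDN. Rules arrive on a broadcast control stream and are stored
// in broadcast (operator) state on every parallel instance.
public class RuleEvaluator extends KeyedBroadcastProcessFunction<String, UsageEvent, Rule, Alert> {

    private static final MapStateDescriptor<String, Rule> RULES_DESCRIPTOR =
            new MapStateDescriptor<>("rules", String.class, Rule.class);

    // Called for each event of a single subscriber.
    @Override
    public void processElement(UsageEvent event, ReadOnlyContext ctx, Collector<Alert> out) throws Exception {
        for (Map.Entry<String, Rule> entry : ctx.getBroadcastState(RULES_DESCRIPTOR).immutableEntries()) {
            Rule rule = entry.getValue();
            if (rule.matches(event)) {
                out.collect(new Alert(ctx.getCurrentKey(), rule.name(), event));
            }
        }
    }

    // Called for each new or changed rule definition; every instance stores it.
    @Override
    public void processBroadcastElement(Rule rule, Context ctx, Collector<Alert> out) throws Exception {
        ctx.getBroadcastState(RULES_DESCRIPTOR).put(rule.name(), rule);
    }
}
```

The control stream is broadcast with .broadcast(RULES_DESCRIPTOR) and connected to the MSISDN-keyed event stream, so every parallel instance sees every rule while events stay partitioned by subscriber.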

Challenge 3: Changing data

The approach we chose is to store all our data in the Avro format and publish the schemas in a schema registry. This way we can keep data with different schema versions within the same Kafka topic. Moreover, it allows us to migrate Flink’s state. All of that we could achieve with relatively little effort: writing a few serializers that can look up schemas from the Hortonworks Schema Registry.
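
To give an idea of what such a serializer involves, here is a simplified sketch of the read side. The wire format (a schema id prepended to the Avro payload) and the SchemaRegistryLookup helper are illustrative assumptions, not the actual Hortonworks client API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

// Decodes Avro records whose writer schema is resolved from a schema registry,
// so different schema versions can coexist on the same Kafka topic.
public class RegistryAvroDeserializationSchema implements DeserializationSchema<GenericRecord> {

    @Override
    public GenericRecord deserialize(byte[] message) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        long schemaId = buffer.getLong(); // assumed wire format: schema id + Avro payload

        // Hypothetical helper that fetches (and caches) the writer schema by id.
        Schema writerSchema = SchemaRegistryLookup.schemaFor(schemaId);

        byte[] payload = new byte[buffer.remaining()];
        buffer.get(payload);

        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
    }

    @Override
    public boolean isEndOfStream(GenericRecord nextElement) {
        return false;
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeInformation.of(GenericRecord.class);
    }
}
```

A reader schema could also be passed to the GenericDatumReader to migrate older records to the newest version on the fly.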

Challenge 4: Slowly changing streams and Watermarks

The need for event-time processing has been recognized for quite some time now. Unfortunately, it is not enough just to extract a timestamp from each event; we also need a way of tracking the progress of event time. The mechanism that allows us to track this progress is called a Watermark.

The Watermark is a great mechanism, but it is important to understand it well, because there are edge cases that can distort its core idea. One of those was the problem of joining a slowly changing stream of control messages with a stream of events. If no control data arrives for some time, the Watermark of that stream does not advance, and the whole processing is held back. The solution is to exclude the slow stream’s Watermark from consideration, which we can do by having it generate a Watermark equal to Long.MAX_VALUE (a downstream operator’s Watermark is the minimum of its inputs’ Watermarks, so the event stream alone then drives event time). Of course, this does not come without a cost: we cannot guarantee a deterministic global ordering of the two streams, but since the rules are applied dynamically anyway, we do not care if they take effect a few events earlier or later.
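
A minimal sketch of this trick, using the AssignerWithPeriodicWatermarks API available at the time of writing; it is attached to the control stream only, so that stream can never hold back event-time progress:

```java
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Always reports the maximum possible Watermark. Since a downstream operator's
// Watermark is the minimum of its inputs' Watermarks, the connected event
// stream alone then drives event time.
public class MaxWatermarkAssigner<T> implements AssignerWithPeriodicWatermarks<T> {

    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(Long.MAX_VALUE);
    }

    @Override
    public long extractTimestamp(T element, long previousElementTimestamp) {
        return previousElementTimestamp; // keep whatever timestamp the element already carries
    }
}
```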

Challenge 5: Monitoring

Unfortunately, not every aspect can be covered with tests. Therefore, in order to ensure the platform works properly, we needed another tool: a monitoring system.

We collect metrics into InfluxDB, a database dedicated to time series, and visualize them with Grafana, which also serves as our alerting tool. Similarly, we collect logs from all the platform components for debugging purposes; for this we introduced the well-known Elasticsearch + Logstash + Kibana stack.

We believe it is worth building the monitoring platform from the very beginning, since it gives a lot of insight into how the core features behave.

Summing up

While this post became very lengthy, I still feel we have just scratched the surface of all the interesting techniques and knowledge we applied in this project. We completed this project phase, which was about building a high-quality, performant and reliable stream processing platform from scratch… in less than 6 months. In fact, the first production use cases were deployed in just 4 months, bringing tangible results for the client.

What are we working on now? We are implementing use cases such as next-generation BI and reporting, a Customer 360 view, CAPEX and OPEX optimization, predictive maintenance, revenue assurance, data monetization and many more.

Bartosz Maciulewicz

Amateur chef, bouldering enthusiast, stand-up addict and Head of BD @getindata